kill aws-sdk-cpp with fire #272

Open
opened 2024-05-06 05:09:39 +00:00 by jade · 11 comments
Owner

lix can't just remove it and use the libcurl builtin s3 support because handling aws credential files is hard

challenge accepted. i would even go so far as saying we could just take the aws sdk code for credential files, vendor it, then toss the entire rest in the bin, if that's all we need.

see also #246

> lix can't just remove it and use the libcurl builtin s3 support because handling aws credential files is hard - @artemist challenge accepted. i would even go so far as saying we could just take the aws sdk code for credential files, vendor it, then toss the entire rest in the bin, if that's all we need. see also https://git.lix.systems/lix-project/lix/issues/246
Member

please... cross-compiling aws-cpp-sdk from linux to freebsd is a huge pain

please... cross-compiling aws-cpp-sdk from linux to freebsd is a huge pain
Owner

Related: using the curl support to add s3 capability https://gerrit.lix.systems/c/lix/+/926

Related: using the curl support to add s3 capability https://gerrit.lix.systems/c/lix/+/926
Author
Owner

https://github.com/rusoto/rusoto/tree/master/rusoto/credential maybe of interest? could we replace this (actually very complex) creds code with statically linking more rust into our process?

https://github.com/rusoto/rusoto/tree/master/rusoto/credential maybe of interest? could we replace this (actually very complex) creds code with statically linking more rust into our process?
Member

We discussed about this topic yesterday with @puck and one thing that came up is that entirely replacing aws-sdk-cpp with a custom implementation is likely out of the question. In particular, we know some users rely on features like "fetching instance credentials from IMDS for EC2 machines", and... yeah, you can reimplement that, but it starts getting hairy. Long term Lix could make the decision to drop those features, but this is something that should be carefully assessed.

(Rusoto implements that too, but only if you also depend on stuff like Hyper, and uh do we really want another HTTP client statically linked? Doubtful that it would do much good to the compile times / dependency chains at this point.)

IMO what we should however do is:

  1. Support "simple" and explicit credentials loading natively (e.g. s3://mycache?creds-file=/run/secrets/creds.txt). This has other benefits, like not relying on ambient environment (IMO a pretty bad misfeature in most cases...).
  2. Making the aws-sdk-cpp dependency lighter by making it an ambient creds loader only and using curl instead of S3Client (#246) for s3:// URLs.

The combination of (1) and (2) allows for Lix builds without aws-sdk-cpp that can still do basic S3 usage, especially in non-AWS environments.

I'm writing a short design to describe all that and make sure we agree on the details and unless that conflicts with anyone else's plans I'll probably work on it in the following days/weeks.

We discussed about this topic yesterday with @puck and one thing that came up is that entirely replacing aws-sdk-cpp with a custom implementation is likely out of the question. In particular, we know some users rely on features like "fetching instance credentials from IMDS for EC2 machines", and... yeah, you *can* reimplement that, but it starts getting hairy. Long term Lix could make the decision to drop those features, but this is something that should be carefully assessed. (Rusoto implements that too, but only if you also depend on stuff like Hyper, and uh do we really want another HTTP client statically linked? Doubtful that it would do much good to the compile times / dependency chains at this point.) IMO what we should however do is: 1. Support "simple" and explicit credentials loading natively (e.g. `s3://mycache?creds-file=/run/secrets/creds.txt`). This has other benefits, like not relying on ambient environment (IMO a pretty bad misfeature in most cases...). 2. Making the aws-sdk-cpp dependency lighter by making it an ambient creds loader only and using curl instead of S3Client (#246) for s3:// URLs. The combination of (1) and (2) allows for Lix builds without aws-sdk-cpp that can still do basic S3 usage, especially in non-AWS environments. I'm writing a short design to describe all that and make sure we agree on the details and unless that conflicts with anyone else's plans I'll probably work on it in the following days/weeks.
Member

AWS authentication is really complex and feature-rich. There's SSO, MFA, IMDS, IAM... aws-vault does a good job describing those. Maintaining that ourselves is impossible, vendoring is problematic long-term, and including aws-sdk-cpp is overkill.

I'd like to see a detailed proposal. But roughly speaking, I think we should support three auth schemas:

  1. No auth
  2. Basic auth
  3. External auth (read creds from file? launch external process for auth?)

The one that puzzles me is number 3. With AWS S3, it's quite easy: you can just export AWS env vars and run your S3 queries. Not sure if it works for other S3-compatible solutions.

AWS authentication is really complex and feature-rich. There's SSO, MFA, IMDS, IAM... [aws-vault](https://github.com/99designs/aws-vault/blob/master/USAGE.md) does a good job describing those. Maintaining that ourselves is impossible, vendoring is problematic long-term, and including aws-sdk-cpp is overkill. I'd like to see a detailed proposal. But roughly speaking, I think we should support three auth schemas: 1. No auth 2. Basic auth 3. External auth (read creds from file? launch external process for auth?) The one that puzzles me is number 3. With AWS S3, it's quite easy: you can just export AWS env vars and run your S3 queries. Not sure if it works for other S3-compatible solutions.
Member

Here is a concrete proposal how we can delete basically everything except for a dependency on aws-c-auth for the credential chain:

https://github.com/NixOS/nix/issues/13084#issue-3018269771

The s3-binary-cache-store (especially substitution) is extremely buggy. Meanwhile our http substituter is not buggy and way more battle-tested

Proposed solution

Use http-binary-cache-store to talk to S3 directly

libcurl has aws-sigv4 authentication built in these days: https://curl.se/libcurl/c/CURLOPT_AWS_SIGV4.html

So we can simple use the existing FileTransfer implementation to push to and pull from S3. As S3 is simply REST semantics that map to what http-binary-cache-store already does

The only thing that we need to keep is the AWS credential chain to actually fetch the credentials to pass to curl but for that we only need to depend on https://github.com/awslabs/aws-crt-cpp or even smaller https://github.com/awslabs/aws-c-auth

This also solves the problem of people complaining that we link against aws-cpp-sdk as aws-c-auth is a way lighter dependency

aws-c-auth suffers from the same problem as https://github.com/NixOS/nix/issues/5947 but now we only need one library to enable BYO_CRYPTO instead of a whole bunch of them. So it makes things significantly easier.

Something like this in filetransfer should work. We already special case s3:// URLs in FileTransfer so we can use that to do the following instead of calling out to the S3 SDK:


// making a request to   s3.${region}.amazonaws.com/${bucket}/${key}

// TODO: get awsAccessKeyId and friends from aws-c-auth 

curl_easy_setopt(req,  CURLOPT_HTTPAUTH, CURLAUTH_AWS_SIGV4);
curl_easy_setopt(req,  CURLOPT_USERNAME, awsAccessKeyId);
curl_easy_setopt(req,  CURLOPT_PASSWORD, awsSecretKey);

if (awsSessionToken) {
  struct curl_slist *list = NULL;
  curl_slist_append(list, "x-amz-security-token", awsSessionToken)
  curl_easy_setopt(req, CURLOPT_HTTPHEADER, list);
}

Now all the HTTP PUT/GET/POST/GET operations should work as expected.

Alternative solutions

Fix all the weird bugs with the current S3 substituter

Additional context

Checklist


Add 👍 to issues you find important.

Here is a concrete proposal how we can delete basically everything except for a dependency on `aws-c-auth` for the credential chain: https://github.com/NixOS/nix/issues/13084#issue-3018269771 ## Is your feature request related to a problem? The s3-binary-cache-store (especially substitution) is extremely buggy. Meanwhile our http substituter is not buggy and way more battle-tested * https://github.com/NixOS/nix/issues/12671 * https://github.com/NixOS/nix/issues/11748 * https://github.com/NixOS/nix/issues/12403 ## Proposed solution Use `http-binary-cache-store` to talk to S3 directly `libcurl` has aws-sigv4 authentication built in these days: https://curl.se/libcurl/c/CURLOPT_AWS_SIGV4.html So we can simple use the existing FileTransfer implementation to push to and pull from S3. As S3 is simply REST semantics that map to what `http-binary-cache-store` already does The only thing that we need to keep is the AWS credential chain to actually fetch the credentials to pass to curl but for that we only need to depend on https://github.com/awslabs/aws-crt-cpp or even smaller https://github.com/awslabs/aws-c-auth This also solves the problem of people complaining that we link against `aws-cpp-sdk` as `aws-c-auth` is a way lighter dependency `aws-c-auth` suffers from the same problem as https://github.com/NixOS/nix/issues/5947 but now we only need one library to enable `BYO_CRYPTO` instead of a whole bunch of them. So it makes things significantly easier. Something like this in filetransfer should work. We already special case `s3://` URLs in `FileTransfer` so we can use that to do the following instead of calling out to the S3 SDK: ``` // making a request to s3.${region}.amazonaws.com/${bucket}/${key} // TODO: get awsAccessKeyId and friends from aws-c-auth curl_easy_setopt(req, CURLOPT_HTTPAUTH, CURLAUTH_AWS_SIGV4); curl_easy_setopt(req, CURLOPT_USERNAME, awsAccessKeyId); curl_easy_setopt(req, CURLOPT_PASSWORD, awsSecretKey); if (awsSessionToken) { struct curl_slist *list = NULL; curl_slist_append(list, "x-amz-security-token", awsSessionToken) curl_easy_setopt(req, CURLOPT_HTTPHEADER, list); } ``` Now all the HTTP PUT/GET/POST/GET operations should work as expected. ## Alternative solutions Fix all the weird bugs with the current S3 substituter ## Additional context <!-- Add any other context or screenshots about the feature request here. --> ## Checklist <!-- make sure this issue is not redundant or obsolete --> - [ ] checked [latest Nix manual] \([source]) - [ ] checked [open feature issues and pull requests] for possible duplicates [latest Nix manual]: https://nixos.org/manual/nix/unstable/ [source]: https://github.com/NixOS/nix/tree/master/doc/manual/source [open feature issues and pull requests]: https://github.com/NixOS/nix/labels/feature --- Add :+1: to [issues you find important](https://github.com/NixOS/nix/issues?q=is%3Aissue+is%3Aopen+sort%3Areactions-%2B1-desc).

@delroth wrote in #272 (comment):

In particular, we know some users rely on features like "fetching instance credentials from IMDS for EC2 machines", and... yeah, you can reimplement that, but it starts getting hairy.

'Interestingly', running lix & lix-hydra 2.93 against garage, I seem to see ~10 second delays as aws-sdk tries to contact 169.254.169.254 (apparently for IMDS), unless I set:
systemd.services.hydra-server.serviceConfig.Environment = [ "AWS_EC2_METADATA_DISABLED=true" ]
This feels non-obvious, and another reason to try to kill as much of aws-sdk as possible?

@delroth wrote in https://git.lix.systems/lix-project/lix/issues/272#issuecomment-3712: > In particular, we know some users rely on features like "fetching instance credentials from IMDS for EC2 machines", and... yeah, you _can_ reimplement that, but it starts getting hairy. 'Interestingly', running lix & lix-hydra 2.93 against garage, I seem to see ~10 second delays as aws-sdk tries to contact 169.254.169.254 (apparently for IMDS), unless I set: `systemd.services.hydra-server.serviceConfig.Environment = [ "AWS_EC2_METADATA_DISABLED=true" ]` This feels non-obvious, and another reason to try to kill as much of aws-sdk as possible?
Member

That's the part of the SDK that I don't want to kill. Did you configure your Lix to have credentials for garage correctly? If there are no env vars with AWS credentials then AWS will try contacting the IMDS for credentials. But only if it couldn't find credentials in the first place.

Note that both the nix daemon needs access to the env vars

That's the part of the SDK that I don't want to kill. Did you configure your Lix to have credentials for garage correctly? If there are no env vars with AWS credentials then AWS will try contacting the IMDS for credentials. But only if it couldn't find credentials in the first place. Note that both the nix daemon needs access to the env vars
Member

That's the part of the SDK that I don't want to kill. As I rely on IMDS at work :) Did you configure your Lix to have credentials for garage correctly? If there are no env vars with AWS credentials then AWS will try contacting the IMDS for credentials. But only if it couldn't find credentials in the first place.

That's the part of the SDK that I don't want to kill. As I rely on IMDS at work :) Did you configure your Lix to have credentials for garage correctly? If there are no env vars with AWS credentials then AWS will try contacting the IMDS for credentials. But only if it couldn't find credentials in the first place.

Yeah I saw that, but I figured if that was the only bit of the SDK left standing it might be easier to selectively disable it. Pretty sure the creds must be set correctly, as garage doesn't allow anonymous access? I don't remember seeing this problem ~12 months ago, I wonder if the SDK behaviour has changed in a relatively recent version.

Edit: yeah, creds are set in /var/lib/hydra/queue-runner/.aws/credentials, /root/.aws/credentials (and /var/lib/hydra/.aws/credentials, but I don't think that's needed.) Hmm, now wondering if the search order is different if creds are set with an env var? Edit2: no :(

Yeah I saw that, but I figured if that was the only bit of the SDK left standing it might be easier to selectively disable it. Pretty sure the creds must be set correctly, as garage doesn't allow anonymous access? I don't remember seeing this problem ~12 months ago, I wonder if the SDK behaviour has changed in a relatively recent version. **Edit:** yeah, creds are set in `/var/lib/hydra/queue-runner/.aws/credentials`, `/root/.aws/credentials` (and `/var/lib/hydra/.aws/credentials`, but I don't think that's needed.) Hmm, now wondering if the search order is different if creds are set with an env var? **Edit2:** no :(
Member
https://github.com/NixOS/nix/pull/13752
Sign in to join this conversation.
No milestone
No project
No assignees
8 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: lix-project/lix#272
No description provided.