# Network-level authentication flow when accessing resources behind Earthdata Login
This document tries to capture the exact set of authentication procedures
needed to access data behind Earthdata Login, documented at the network level.
This means explicitly stating, for example, that HTTP Basic Auth is used,
or that a particular S3-based flow needs to be performed.
## Resources to draw on
https://docs.google.com/document/d/18GyoMZj0I2HKAXwqyeziO0ISbOwHxo1TN4eAlR4mH3U/view
https://github.com/nasa/cumulus/blob/master/packages/api/endpoints/s3credentials-readme/instructions/index.md
https://github.com/asfadmin/thin-egress-app/blob/master/docs/s3access.md
## When data is on AWS
### When user is on AWS us-west-2
#### What protocol is allowed?
Both `https://` and `s3://` are available when you're working from AWS `us-west-2`.
* HTTPS:// -- Typically, when you make an HTTP request, you'll get redirected to URS for authentication, and then you'll be redirected (eventually) to a signed S3 HTTP link.
* This can sometimes be challenging, depending on the tooling used. Generally, the EDL authentication header gets forwarded along with each redirect. However, AWS will *reject* your connection if both the EDL auth headers and the S3 access key are provided (this is *correct* behavior). Some tools (e.g., curl) will include both, and some (e.g., wget) will not.
* This is particularly problematic; I think this is the issue we kept running into, and it fails in very unclear ways as well. Documenting the *right* thing to do here is very important. Right now, it feels like you can't 'just' use HTTPS because of this issue.
* Getting this behavior 'fixed' somehow would be really really helpful!
* Figuring out what extra headers are being sent is very helpful. Currently you just get a bare 403.
* I think a useful case here is 'curl fails from inside us-west-2, but not from outside'. And write it up, and go from there with details.
* TODO: Write up a step by step of 'curl fails from inside us-west-2, but succeeds elsewhere. This is how you make curl succeed from inside us-west-2'
* S3:// -- To use `s3://` you'll need AWS access keys. Typically you can request temporary S3 access keys from a DAAC endpoint via HTTP. Once you have the S3 access keys, you can use AWS-aware tools as normal. However, those keys expire after 1 hour, so you'll need to re-request keys regularly.
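The request-and-expire cycle above can be sketched as follows. This is a rough illustration, not a definitive implementation: the endpoint URL shown is hypothetical (each DAAC has its own; see the Cumulus `s3credentials` instructions linked above), the `expiration` field name and timestamp format are assumptions, and real use needs a properly authenticated EDL session (e.g. via `~/.netrc`).

```python
import json
import urllib.request
from datetime import datetime, timezone

def fetch_s3_credentials(endpoint_url):
    """Fetch temporary S3 keys from a DAAC 's3credentials' endpoint.

    Sketch only: assumes redirects/cookies from the EDL login flow are
    handled by the opener, and that EDL Basic Auth is already configured.
    endpoint_url is DAAC-specific, e.g. "https://<daac-host>/s3credentials".
    """
    opener = urllib.request.build_opener(
        # Keep the EDL session cookie across the redirect chain.
        urllib.request.HTTPCookieProcessor()
    )
    with opener.open(endpoint_url) as resp:
        # Expected (assumed) fields: accessKeyId, secretAccessKey,
        # sessionToken, expiration.
        return json.load(resp)

def credentials_expired(creds, now=None):
    """True once the temporary keys have passed their 'expiration' timestamp.

    Since the keys are only valid for ~1 hour, callers should re-request
    them whenever this returns True.
    """
    now = now or datetime.now(timezone.utc)
    # Assumed timestamp format, e.g. "2021-01-01 01:00:00+00:00".
    expiry = datetime.strptime(creds["expiration"], "%Y-%m-%d %H:%M:%S%z")
    return now >= expiry
```

The returned `accessKeyId` / `secretAccessKey` / `sessionToken` can then be handed to boto3 or the AWS CLI like any other temporary AWS credentials.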
Regardless of the protocol being used, this does **not** count against NASA's egress cap.
### When user is *not* on AWS us-west-2
#### What protocol is allowed?
**Only** HTTPS is allowed, not `s3://`.
## When data is not on AWS
#### What protocol is allowed?
**Only** HTTPS is allowed.
#### What is the authentication protocol?
Automated *OAuth2* is used!
https://urs.earthdata.nasa.gov/documentation/faq#How%20do%20I%20encode%20a%20username%20and%20password%20for%20HTTP%20basic%20authentication? Kinda hilarious that the examples for Earthdata basic access are in… Perl
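For reference, the encoding that FAQ describes is the standard Basic Auth scheme, which looks like this in Python (same result as the Perl examples):

```python
import base64

def basic_auth_header(username, password):
    """Return the value for an HTTP 'Authorization' header using Basic Auth:
    the literal 'Basic ' followed by base64("username:password")."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"

# basic_auth_header("user", "pass") -> "Basic dXNlcjpwYXNz"
```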
Other example from GES DISC: https://disc.gsfc.nasa.gov/data-access
### Is the EDL session valid across multiple DAACs? That is, if we originally authenticate to PODAAC, can we then use the same cookie to access data from NSIDC?
The docs suggest we use standard HTTP Basic Authentication.
## How does the OAuth2 protocol work?
(Link to Luis' notebook here)
1. Is the OAuth client *per DAAC*? So the `client_id` is per DAAC, and the `client_secret` lives with the DAAC?
2. If you are hitting NSIDC, it authenticates with Earthdata and gets an OAuth token. Then something at NSIDC redirects you to the appropriate URL.
3. But when the data is on the cloud, there is an *additional cookie* from Amazon.
Redirect loop:
Original URL -> Earthdata OAuth -> Original URL -> CloudFront -> Final URL
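To make the curl-vs-wget difference concrete, here is a minimal sketch of the decision a well-behaved client has to make at each hop of that redirect loop. This is a simplification (real clients like curl with `--location-trusted` apply different policies), but it captures why sending the EDL `Authorization` header to the signed S3 URL produces the 403:

```python
from urllib.parse import urlsplit

def forward_auth_header(current_url, redirect_url):
    """Should the EDL 'Authorization' header be re-sent after this redirect?

    Simplified rule: only keep the header while staying on the same host.
    Once the chain crosses to a different host (e.g. from
    urs.earthdata.nasa.gov to a presigned S3/CloudFront URL), the header
    must be dropped -- S3 rejects requests carrying *both* an Authorization
    header and presigned-URL credentials.
    """
    return urlsplit(current_url).hostname == urlsplit(redirect_url).hostname
```

A client that forwards the header unconditionally across all five hops above is the one that ends up with the opaque 403.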
Not everything is behind [TEA](https://github.com/asfadmin/thin-egress-app)! (TEA is a rewrite of the Fat Egress app; it primarily handles egress.) It is also possible that we can't do this in 'one go', as there are probably many different things going on.
The goal is to find a way to get
**Clear action item:**
Go through this redirect loop and identify exactly where curl breaks. (Using curl as the example is good because it's very widespread.) Find where curl breaks but wget doesn't, and figure out the fix for that. (This is a strategic approach too, since it means "no new package".)
This helps the team that maintains TEA in the long run, and means less piecemeal work in the future. "So that end-users won't have to hear the term us-west-2".
Additionally:
- Luis: Let's get an inventory of the most-used datasets per DAAC to showcase how widespread the problem is (it's not a corner case). We need an inventory: "This uses EDL and this uses XYZ". Brianna's doc is helpful:
- Brianna: there's a TOP 50 list; this is the order that we move things to the cloud
- Sean: "We'd love to have a DAAC agnostic credential endpoint"
- Joe: so a package could help
**Next Steps**
(Ignoring EULA part for now; tackle that separately)
## Summary of next steps (Yuvi)
We went through and tried to document the URLs involved and why people have trouble accessing data from us-west-2 and get an unhelpful 403 error (not to mention that as a user you first need to know what us-west-2 means). This isn't a problem when you're on-prem, only on the cloud. This has a technical fix, not a social/political fix.
We have some understanding of why the above might be.
We're coming from the idea that TEA is the problem; we'll tackle this upstream (in TEA). It's not the end-all solution because there are performance problems, but it will fix the problem of access, and we can punt the issue of what happens with large-scale datasets and performance until later, since that is a narrower use case (accessing data is the first hurdle to tackle).
We'll frame this in terms of curl, the most widespread HTTPS client.
- Works with data not hosted on aws
- Doesn't work when X
- Does work when Y
^ This is a bug that should be fixed. Frame it as a small fix that this team can do, one that contributes to a larger-scale fix. Yuvi helps identify a minimum technical fix.
Yuvi - writes a few GH issues
Joe: Some DAACs use TEA, others use Cumulus. Joe will try to inventory who uses what for which dataset. Brianna, Luis and I can do this internally: what is being used to serve data?
## Next steps for fsspec
If we conclude the fix is to TEA for HTTP, maybe there isn't a need for fsspec.
Joe: https://github.com/nasa/cumulus/blob/master/packages/api/endpoints/s3credentials-readme/instructions/index.md
vs
https://github.com/asfadmin/thin-egress-app/blob/master/docs/s3access.md
Alexey: This is the opposite of the problem we've been talking about: https://gist.github.com/ashiklom/6d3cf6e12ea2582221e9e7446bc94f6a > the "failing" link works for Brianna
If we can specifically document [ ], this would be a win. This is where we got blocked when we tried it on Pangeo Forge.
Issues:
1. document: why doesn't my XYZ work?
2. documenting the 403 error
3. curl issue - Yuvi
4. inventory - Joe + Brianna