Network-level authentication flow when accessing resources behind Earthdata Login

This document tries to record the exact set of authentication procedures needed to access data behind Earthdata Login (EDL), described at the network level. This means explicitly stating, for example, that HTTP Basic Auth is used, or that a particular S3-based flow needs to be performed.

Resources to draw on

https://docs.google.com/document/d/18GyoMZj0I2HKAXwqyeziO0ISbOwHxo1TN4eAlR4mH3U/view
https://github.com/nasa/cumulus/blob/master/packages/api/endpoints/s3credentials-readme/instructions/index.md
https://github.com/asfadmin/thin-egress-app/blob/master/docs/s3access.md

When data is on AWS

When user is on AWS us-west-2

What protocol is allowed?

Both HTTPS:// and S3:// are available when you're working from AWS us-west-2.

  • HTTPS:// Typically, when you make an HTTP request, you'll get redirected to URS for authentication, and then you'll be redirected (eventually) to a signed S3 HTTP link.

    • This can sometimes be challenging, depending on the tooling used. Generally, the EDL authentication header gets forwarded along with each redirect. However, AWS will reject your request if both the EDL auth header and the S3 access key are provided (this is correct behavior). Some tools (e.g., curl) will include both, and some will not (e.g., wget).
      • This is particularly problematic, as I think this is the issue we kept running into. It also fails in very unclear ways. Documenting the right thing to do here is very important. Right now, it feels like you can't 'just' use HTTPS because of this issue.
      • Getting this behavior 'fixed' somehow would be really, really helpful!
      • Figuring out what extra headers are being sent would be very helpful. Currently the only signal is a 403.
      • I think a useful test case here is 'curl fails from inside us-west-2, but not from outside'. Write that up, and go from there with details.
      • TODO: Write up a step-by-step of 'curl fails from inside us-west-2, but succeeds elsewhere. This is how you make curl succeed from inside us-west-2'
  • S3:// To use S3 you'll need AWS access keys. Typically you can request S3 access keys from a DAAC endpoint via HTTP. Once you have the keys, you can use AWS-aware tools as normal. However, those keys expire after 1 hour, so you'll need to request new keys regularly.
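Since the S3 keys expire after about an hour, a client has to refetch them periodically. Below is a minimal sketch of a cache that does this. It assumes the credentials endpoint returns JSON with `accessKeyId`, `secretAccessKey`, `sessionToken` and an ISO 8601 `expiration` field; those field names are an assumption here and should be checked against the DAAC's own documentation. The `fetch` callable is hypothetical and stands in for the authenticated HTTP request.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


class S3CredentialCache:
    """Caches temporary S3 credentials and refetches them shortly before expiry."""

    def __init__(self, fetch: Callable[[], dict],
                 margin: timedelta = timedelta(minutes=5)):
        # `fetch` is a hypothetical callable: it performs the authenticated
        # request to the DAAC credentials endpoint and returns the parsed JSON.
        self._fetch = fetch
        self._margin = margin
        self._creds: Optional[dict] = None
        self._expires_at: Optional[datetime] = None

    def get(self, now: Optional[datetime] = None) -> dict:
        """Return cached credentials, refetching if they expire within `margin`."""
        now = now or datetime.now(timezone.utc)
        if self._creds is None or now >= self._expires_at - self._margin:
            creds = self._fetch()
            # Assumed field name and timestamp format (ISO 8601).
            expires_at = datetime.fromisoformat(creds["expiration"])
            if expires_at.tzinfo is None:
                # Assumption: a timestamp without an offset is in UTC.
                expires_at = expires_at.replace(tzinfo=timezone.utc)
            self._creds = creds
            self._expires_at = expires_at
        return self._creds
```

The returned dictionary can then be handed to whichever AWS-aware tool is in use. Passing `fetch` in as a parameter keeps the expiry logic separate from the (DAAC-specific) request that obtains the keys.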

Access from within us-west-2 does not count against NASA's egress cap, regardless of the protocol used.

When user is not on AWS us-west-2

What protocol is allowed?

Only HTTPS is allowed, not s3://.

When data is not on AWS

What protocol is allowed?

Only HTTPS is allowed.
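The three cases above can be summarised as a small lookup. This is only a restatement of the rules in this note, not an API offered by any DAAC; the function name and parameters are made up for illustration.

```python
def allowed_protocols(data_on_aws: bool, user_in_us_west_2: bool) -> set:
    """Return the access protocols described in this note for one scenario."""
    # S3 access is only available when the data is on AWS and the user is
    # working from inside us-west-2. Everything else falls back to HTTPS.
    if data_on_aws and user_in_us_west_2:
        return {"https", "s3"}
    return {"https"}
```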

What is the authentication protocol?

Automated OAuth2 is used!

See https://urs.earthdata.nasa.gov/documentation/faq#How do I encode a username and password for HTTP basic authentication? Kinda hilarious that the examples for Earthdata basic access are in… Perl.

Another example, from GES DISC: https://disc.gsfc.nasa.gov/data-access

This suggests we use standard HTTP Basic Authentication.
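For reference, standard HTTP Basic Authentication encodes `username:password` as Base64 and sends it in the `Authorization` header. A minimal Python sketch of the encoding step follows; the function name is made up for illustration, and nothing here is specific to Earthdata Login.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build the Authorization header for HTTP Basic Authentication."""
    # Basic auth is the Base64 encoding of "username:password".
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

Base64 is an encoding, not encryption, so this header should only ever be sent over HTTPS.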

How does the OAuth2 protocol work?

(Link to Luis' notebook here)

  1. Is the OAuth client per DAAC? If so, the client_id is per DAAC, and the client_secret is held by the DAAC?
  2. If you are hitting NSIDC, it authenticates with Earthdata and gets an OAuth token. Then something at NSIDC redirects you to the appropriate URL.
  3. But when the data is in the cloud, there is an additional cookie from Amazon.

Redirect loop:

Original URL -> Earthdata OAuth -> Original URL -> CloudFront -> Final URL
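One way to reason about this chain is to decide, hop by hop, whether the `Authorization` header should be forwarded. The sketch below shows one conservative policy: drop the header when the redirect goes to a different host, or when the target URL already carries an S3 signature in its query string. This is an illustration of the idea only; it is not confirmed to be what curl or wget do, nor the agreed fix.

```python
from urllib.parse import parse_qs, urlsplit

# Query parameters that mark a presigned S3 URL
# (Signature Version 4, plus the older Version 2 form).
PRESIGNED_MARKERS = {"X-Amz-Signature", "X-Amz-Credential", "Signature"}


def is_presigned(url: str) -> bool:
    """Return True if the URL's query string carries an S3 signature."""
    query = parse_qs(urlsplit(url).query)
    return any(marker in query for marker in PRESIGNED_MARKERS)


def headers_for_next_hop(current_url: str, next_url: str, headers: dict) -> dict:
    """Return the headers to send to next_url after a redirect from current_url."""
    same_host = urlsplit(current_url).hostname == urlsplit(next_url).hostname
    if same_host and not is_presigned(next_url):
        return dict(headers)
    # Different host or presigned target: do not forward the EDL credentials.
    return {k: v for k, v in headers.items() if k.lower() != "authorization"}
```

With this policy, the final hop to a signed S3 URL never carries both the EDL header and the S3 signature, which is the combination AWS rejects.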

Not everything is behind TEA (Thin Egress App)! TEA is a rewrite of the Fat Egress app, and it primarily handles egress here. It is also possible that we don't know how to do this in 'one go', as there are probably many different things going on.

The goal is to find a way to get

Clear action item:

Go through this redirect loop and identify exactly the part where curl breaks. (Using curl as an example is good because it is very widespread.) Find where curl breaks but wget doesn't, and figure out the fix for that. (This is a strategic approach too, since it means "not a new package".)
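A small helper that follows the redirects one hop at a time and records the status of each hop would make it easier to see where a client breaks. The sketch below is one way to do that in Python. The `fetch` callable is hypothetical: it should make a single request without following redirects and return the status code and the `Location` header (or `None`).

```python
from typing import Callable, List, Optional, Tuple
from urllib.parse import urljoin

Hop = Tuple[str, int]  # (requested URL, HTTP status code)
REDIRECT_CODES = (301, 302, 303, 307, 308)


def trace_redirects(url: str,
                    fetch: Callable[[str], Tuple[int, Optional[str]]],
                    max_hops: int = 10) -> List[Hop]:
    """Follow redirects manually and record every hop."""
    hops: List[Hop] = []
    for _ in range(max_hops):
        status, location = fetch(url)
        hops.append((url, status))
        if status not in REDIRECT_CODES or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    return hops


def first_failure(hops: List[Hop]) -> Optional[Hop]:
    """Return the first hop with a 4xx or 5xx status, or None if all succeeded."""
    for url, status in hops:
        if status >= 400:
            return (url, status)
    return None
```

Running the same trace from inside and outside us-west-2 and comparing the two hop lists should show which step produces the 403.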

This helps the team that maintains TEA in the long run, and makes fixes less piecemeal in the future. "So that end-users won't have to hear the term US-West2".

Additionally:

  • Luis: Let's get an inventory of the most-used datasets per DAAC to show how widespread the problem is (not a corner case). We need an inventory: "This uses EDL and this uses XYZ". Brianna's doc is helpful:
    • Brianna: there's a TOP 50 list; this is the order that we move things to the cloud
  • Sean: "We'd love to have a DAAC agnostic credential endpoint"
    • Joe: so a package could help

Next Steps (Ignoring EULA part for now; tackle that separately)

Summary of next steps (Yuvi)

We went through and tried to document the URLs involved, and why people have trouble accessing data from us-west-2 and get an unhelpful 403 error (not to mention that as a user you first need to know what us-west-2 means). This isn't a problem when you're on-prem, only in the cloud. This has a technical fix, not a social-political fix.

We have some understanding of why this might be. We're working from the idea that TEA is the problem, so we'll tackle this upstream (in TEA). It's not the end-all solution because there are performance problems, but it will fix the problem of access. We can punt the issue of what happens with large-scale datasets and performance until later, since that is a narrower use case (accessing data is the first hurdle to tackle).

We'll frame this in terms of curl, the most widespread HTTPS client.

  • Works with data not hosted on aws
  • Doesn't work when X
  • Does work when Y

^ This is a bug that should be fixed. Frame this as a small fix that this team can do and that contributes to a larger-scale fix. Yuvi helps identify a minimum technical fix.

Yuvi - writes a few GH issues

Joe: Some DAACs use TEA, others use Cumulus. Joe will try to inventory who uses what for which dataset, i.e. what is being used to serve data. Brianna, Luis and I can do this internally.

Next steps for fsspec

If we conclude that we should fix TEA for HTTP, maybe there isn't a need for fsspec.

Joe: https://github.com/nasa/cumulus/blob/master/packages/api/endpoints/s3credentials-readme/instructions/index.md vs https://github.com/asfadmin/thin-egress-app/blob/master/docs/s3access.md

Alexey: This is the opposite of the problem we've been talking about: https://gist.github.com/ashiklom/6d3cf6e12ea2582221e9e7446bc94f6a > the "failing" link works for Brianna

If we can specifically document [ ] this would be a win. This is where we got blocked, we tried it on Pangeo Forge

Issues:

  1. document: why don't my XYZ work?
  2. documenting the 403 error
  3. curl issue - Yuvi
  4. inventory - Joe + Brianna