
Supervisor Upstream Token Refresh

Problem Statement

The Pinniped Supervisor is an OIDC Issuer. It supports the OIDC offline_access scope with the refresh_token grant type as a way to refresh a user's tokens without requiring user interaction. Upon logging in to Pinniped, users are granted a refresh token that is valid for 9 hours.

Currently, Pinniped users' initial authentication happens against the external identity provider, but the Supervisor does not verify user information with the external IDP when refreshing tokens. It only checks whether it recognizes the token and whether the token is still within its 9-hour lifetime.

When the refresh flow occurs, the supervisor should verify that the user still exists in the upstream identity provider and check for any changes in group membership.

Use Cases

Password Change

  1. User logs in to their Kubernetes clusters via Pinniped in the morning. They receive a token with their current username and groups based on information from their external identity provider.
  2. The user changes their password halfway through the day.
  3. When the user runs a kubectl command that triggers the refresh flow, the Pinniped Supervisor checks with the external IDP, finds that the stored credentials or session are no longer valid, and prompts the user to log in again.

Updated Groups

  1. User logs in to their Kubernetes clusters via Pinniped in the morning. They receive a token with their current username and groups based on information from their external identity provider.
  2. An admin in the IDP adds or removes the user from groups in the external identity provider.
  3. When the user runs a kubectl command that triggers the refresh flow, the Pinniped supervisor checks with the external IDP and sees that the user still exists and (if applicable) has a valid session, but their groups have changed. The user is issued a new token with the updated groups.

How do we validate that a user is still logged in/still has valid credentials?

The method of checking the upstream IDP will differ depending on the IDP type. Currently, we support three upstream identity provider types: OIDCIdentityProvider, LDAPIdentityProvider, and ActiveDirectoryIdentityProvider.

OIDCIdentityProvider

During the initial token request flow to the upstream IDP, we should request the offline_access scope so that we get an upstream refresh token. Then we can use that refresh token to get new access, ID, and refresh tokens when the downstream refresh happens.
We should also be able to use the new upstream access token to fetch group information from the UserInfo endpoint.
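
As a rough sketch (not the actual Supervisor code), the upstream refresh plus UserInfo lookup could look something like this in Go using golang.org/x/oauth2 and github.com/coreos/go-oidc; the function name and error handling here are illustrative assumptions:

import (
    "context"
    "fmt"

    "github.com/coreos/go-oidc/v3/oidc"
    "golang.org/x/oauth2"
)

// refreshUpstream is a hypothetical helper: exchange the stored upstream refresh
// token for fresh tokens, then re-fetch the user's claims (e.g. groups) from UserInfo.
func refreshUpstream(ctx context.Context, cfg *oauth2.Config, provider *oidc.Provider, storedRefreshToken string) (*oauth2.Token, *oidc.UserInfo, error) {
    // TokenSource performs the refresh_token grant against the upstream token endpoint.
    ts := cfg.TokenSource(ctx, &oauth2.Token{RefreshToken: storedRefreshToken})

    tok, err := ts.Token()
    if err != nil {
        return nil, nil, fmt.Errorf("upstream refresh failed: %w", err)
    }

    // Use the fresh access token to ask the UserInfo endpoint for current claims.
    userInfo, err := provider.UserInfo(ctx, oauth2.StaticTokenSource(tok))
    if err != nil {
        return nil, nil, fmt.Errorf("userinfo request failed: %w", err)
    }
    return tok, userInfo, nil
}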

We need to start sending prompt=consent (needed by Google and possibly other IDPs) along with access_type=offline to reliably get a refresh token from the upstream OIDC IDP (or at least, to clearly signal that we want one). In the absence of a refresh token, we would just use the access token to gate the session lifetime (which may be really short).
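
For illustration, with golang.org/x/oauth2 those parameters could be added to the authorization URL roughly like this (upstreamOAuth2Config and state are placeholder names, and the scope list is a simplified assumption):

// Hypothetical sketch: ask the upstream IDP for a refresh token on every login.
authURL := upstreamOAuth2Config.AuthCodeURL(
    state,
    oauth2.AccessTypeOffline,                    // adds access_type=offline (Google-style)
    oauth2.SetAuthURLParam("prompt", "consent"), // adds prompt=consent
)
// upstreamOAuth2Config.Scopes would also include "openid" and "offline_access".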

We had some good conversations regarding the prompt parameter:

https://github.com/vmware-tanzu/pinniped/pull/850
https://kubernetes.slack.com/archives/C01BW364RJA/p1632242482109100

The TL;DR is that we need to make it clear that the Pinniped Supervisor does not support the user or OAuth client specifying a prompt value, as we need control over it in the backend to get refresh tokens issued.

Our current approach of only sending access_type=offline results in us getting a refresh token only on the very first login for a particular user instead of on every login. Setting both prompt and access_type gives us a refresh token on each login, as expected. I confirmed that with https://accounts.google.com set as an OIDC IDP, we can have multiple refresh tokens in flight and use each independently to get access tokens.

Another aspect to consider: if we ask for refresh tokens from the IDP, we should ideally "log out" those sessions (via the revocation_endpoint, see https://datatracker.ietf.org/doc/html/rfc7009) once the Pinniped session expires (refresh tokens from Google are valid for 6 months). This should happen even if the user never does a refresh flow with Pinniped. This would imply that we need to hold the refresh tokens in plaintext on the server side.
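
A minimal sketch of such a revocation call, assuming the IDP advertises a revocation_endpoint and accepts basic client authentication (the URL, client credentials, and function name are placeholders):

import (
    "context"
    "fmt"
    "net/http"
    "net/url"
    "strings"
)

// revokeUpstreamRefreshToken is a hypothetical helper that revokes an upstream
// refresh token per RFC 7009 once the Pinniped session has expired.
func revokeUpstreamRefreshToken(ctx context.Context, revocationURL, clientID, clientSecret, refreshToken string) error {
    form := url.Values{
        "token":           {refreshToken},
        "token_type_hint": {"refresh_token"},
    }
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, revocationURL, strings.NewReader(form.Encode()))
    if err != nil {
        return err
    }
    req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
    req.SetBasicAuth(clientID, clientSecret) // client authentication, if the IDP requires it

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    // Per RFC 7009, the server returns 200 even if the token was already invalid.
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("revocation failed with status %d", resp.StatusCode)
    }
    return nil
}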

Specific IDP considerations:

  • Google requires an access_type param to be set to offline to get a refresh token, rather than the more common offline_access scope. Then you can request new tokens using grant_type refresh_token (making sure we request the openid scope so we get a new ID token).

Note that there are limits on the number of refresh tokens that will be issued; one limit per client/user combination, and another per user across all clients. You should save refresh tokens in long-term storage and continue to use them as long as they remain valid. If your application requests too many refresh tokens, it may run into these limits, in which case older refresh tokens will stop working.

Seems like the limit in question is 50 refresh tokens per Google account per client ID, according to Google's documentation. So we should be fine.
Google does not appear to invalidate the existing refresh token and grant a new one when you use it. Here is a sample response (note that there isn't a refresh token in it):

{
  "access_token": "1/fFAGRNJru1FTz70BzhT3Zg",
  "expires_in": 3920,
  "scope": "https://www.googleapis.com/auth/drive.metadata.readonly",
  "token_type": "Bearer"
}
  • Okta uses the default OIDC behavior of requesting the offline_access scope to get a refresh token, which you can refresh by sending a token request with grant_type refresh_token. You have to request the openid scope to get a new ID token.
    Valid responses will always return a refresh token; it depends on Okta configuration whether it's a new one or the old one.
    Okta has some weird, complicated behavior related to the prompt param and consent dialogs.

  • GitLab uses a pretty standard OIDC flow. The old refresh token is always revoked and a new one issued during the refresh flow.

Does requesting the access_type=offline param break integrations with other IDPs?
Does requesting the offline_access scope break integration with Google? Hopefully not: according to section 2.4 of the OIDC spec, "Scope values used that are not understood by an implementation SHOULD be ignored", so even if it's not used it shouldn't hurt.

LDAPIdentityProvider

LDAP doesn't have a built-in refresh flow. We could find out if the groups changed with a query, but for the use case where the user changed their password, the only way to find out that the bind credentials they gave us don't work anymore is by binding, which requires passing the username and password back to the external LDAP identity provider.
It's hard to find out whether an account has been disabled using other means, because it's common advice to just change the user's password to a different value without changing any other attributes. Some LDAP providers have attributes such as pwdAccountLockedTime or nsAccountLock, but there's no standard.
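
As a sketch of that re-validation using github.com/go-ldap/ldap/v3 (the DN resolution and the accompanying group search are omitted, and the function name is hypothetical):

import (
    "fmt"

    "github.com/go-ldap/ldap/v3"
)

// revalidateLDAPUser re-checks the stored credentials by binding as the user.
// If the password was changed or the account locked, the bind fails and the
// Pinniped session should end.
func revalidateLDAPUser(ldapURL, userDN, password string) error {
    conn, err := ldap.DialURL(ldapURL) // e.g. "ldaps://ldap.example.com:636"
    if err != nil {
        return fmt.Errorf("could not connect to LDAP: %w", err)
    }
    defer conn.Close()

    if err := conn.Bind(userDN, password); err != nil {
        return fmt.Errorf("bind failed, credentials no longer valid: %w", err)
    }
    return nil
}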

ActiveDirectoryIdentityProvider

Since Active Directory communicates using LDAP, the same considerations that apply to LDAP apply to Active Directory. We could perform additional checks because, unlike generic LDAP, the schema of Active Directory is pretty well known. For example, we could check whether pwdChangedTime is later than the time the user logged in, or whether the userAccountControl attribute indicates that the account is disabled.
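
A rough sketch of such a check with go-ldap, using the attribute names mentioned above (the bitwise filter uses the well-known LDAP_MATCHING_RULE_BIT_AND OID; the time comparison is left to the caller, and the function name is hypothetical):

import (
    "fmt"

    "github.com/go-ldap/ldap/v3"
)

// checkADAccount confirms the account still exists and is not disabled, and
// returns pwdChangedTime so the caller can compare it to the recorded login time.
func checkADAccount(conn *ldap.Conn, userDN string) (*ldap.Entry, error) {
    req := ldap.NewSearchRequest(
        userDN,
        ldap.ScopeBaseObject, ldap.NeverDerefAliases, 1, 0, false,
        // Match only if the "disabled" bit (2) of userAccountControl is NOT set.
        "(!(userAccountControl:1.2.840.113556.1.4.803:=2))",
        []string{"pwdChangedTime"},
        nil,
    )
    res, err := conn.Search(req)
    if err != nil {
        return nil, err
    }
    if len(res.Entries) == 0 {
        return nil, fmt.Errorf("account is disabled or no longer exists")
    }
    return res.Entries[0], nil
}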

How should we store the credentials we need?

Each of the IDPs requires storing some extra information about the upstream IDP in order to validate user information upon refresh: a refresh token for OIDC, and a username and password for LDAP. Refresh tokens are not as sensitive as LDAP credentials.

We already store downstream tokens as Kubernetes Secrets on the Supervisor. But storing LDAP credentials would have a bigger potential downside than storing just Pinniped tokens, because of privilege escalation: if a malicious user had access to read Secrets on the Supervisor, they would be able to get the upstream credentials of all of the currently logged-in users. We could encrypt the secrets, but where would we store the encryption key? If it were stored as a Secret, it may as well not be encrypted.
Storing OIDC refresh tokens on the Supervisor cluster may be necessary so that we can revoke them after some time.

We could keep the credentials on the user's own machine rather than on the Supervisor, so each client only stores their own credentials. This requires that we pass the information back to the client, get them to store it, and pass it back to us upon refresh, all in an OIDC-compliant way. It would be nice if it were fairly standard so that other clients could use it in the future. This would be done by making the downstream refresh token be based on the upstream refresh token or bind information. We could encrypt the credentials (using some ephemeral key), pass them back to the user as a refresh token, and decrypt them to check against the upstream when they are passed back. We would have to store the encryption key on the Supervisor, probably as a Secret so that each Supervisor pod can access it. Alternatively, we could do the opposite, where we pass the key back to the user and keep the encrypted credentials.
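
A minimal sketch of the "encrypt the credentials and hand them back as the opaque refresh token" idea, assuming AES-GCM with a server-side key; the real token format, key rotation, and storage would need more design, and the function names are illustrative:

import (
    "crypto/aes"
    "crypto/cipher"
    "crypto/rand"
    "encoding/base64"
    "fmt"
)

// sealCredentials encrypts upstream credentials into an opaque token for the client.
// The key stays on the Supervisor (e.g. in a Secret); only ciphertext leaves the server.
func sealCredentials(key, plaintextCreds []byte) (string, error) {
    block, err := aes.NewCipher(key) // key must be 16, 24, or 32 bytes
    if err != nil {
        return "", err
    }
    gcm, err := cipher.NewGCM(block)
    if err != nil {
        return "", err
    }
    nonce := make([]byte, gcm.NonceSize())
    if _, err := rand.Read(nonce); err != nil {
        return "", err
    }
    sealed := gcm.Seal(nonce, nonce, plaintextCreds, nil) // nonce is prepended to the ciphertext
    return base64.RawURLEncoding.EncodeToString(sealed), nil
}

// openCredentials reverses sealCredentials when the client presents the token on refresh.
func openCredentials(key []byte, token string) ([]byte, error) {
    sealed, err := base64.RawURLEncoding.DecodeString(token)
    if err != nil {
        return nil, err
    }
    block, err := aes.NewCipher(key)
    if err != nil {
        return nil, err
    }
    gcm, err := cipher.NewGCM(block)
    if err != nil {
        return nil, err
    }
    if len(sealed) < gcm.NonceSize() {
        return nil, fmt.Errorf("token too short")
    }
    nonce, ciphertext := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
    return gcm.Open(nil, nonce, ciphertext, nil)
}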

How long will downstream sessions be?

OIDC

OIDC has a built-in concept of a session length. If the upstream refresh token works, you should be able to refresh your supervisor tokens.
However, it's possible that the upstream session length isn't desirable. For example, if the upstream IDP session is valid for a week, but the Pinniped admin wants sessions to only last for a day, that should be configurable. In that case, refresh should work only if a) upstream refresh works and b) the user-configured session length has not elapsed. We could add a field OIDCIdentityProvider.spec.refresh.sessionLength to allow this to be configured.
What if you want longer tokens than your upstream IDP allows? We won't be able to validate the upstream IDP session length, so we can't necessarily disallow this, but we should make it clear in the docs that this is not intended behavior.
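
In other words, a downstream refresh succeeds only while the configured session length (if any) has not elapsed and the upstream refresh still works. A minimal sketch of the session-length half, assuming the standard time package and treating zero as "defer to upstream" (the function name is hypothetical):

// withinConfiguredSessionLength would be checked alongside the upstream refresh:
// both must pass for the downstream refresh to succeed.
func withinConfiguredSessionLength(loginTime time.Time, sessionLength time.Duration) bool {
    if sessionLength == 0 {
        return true // no Supervisor-side cap configured; defer entirely to the upstream IDP
    }
    return time.Since(loginTime) <= sessionLength
}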

LDAP/Active Directory

LDAP has no built-in concept of a session length. This would have to be set in Pinniped, whether that's a user configuration option or a preset default.
The default could remain 9 hours, since that's what we have right now.
We could add a field LDAPIdentityProvider.spec.refresh.sessionLength to be configured.

What about garbage collection?

If a user logs in using OIDC and then doesn't run any kubectl commands, they won't trigger the downstream refresh flow, so we won't trigger the upstream refresh flow and learn whether their session is still valid. If we only tie session length to upstream refresh, some tokens could just lie around forever. We should implement an idle timeout to delete these secrets. This isn't security critical (once the user tries to use their credentials, we will check upstream and find out whether it's allowed), so the length could be long, like a day. It could also be user configurable.
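
A sketch of what that idle-timeout sweep might look like; StoredSession and the revoke/delete callbacks are hypothetical stand-ins for the Supervisor's session storage and for the revocation call discussed earlier:

import (
    "context"
    "log"
    "time"
)

// StoredSession is a hypothetical view of one stored session's bookkeeping data.
type StoredSession struct {
    ID                   string
    LastUsed             time.Time
    UpstreamRefreshToken string
}

// sweepIdleSessions revokes the upstream refresh token and deletes the session
// Secrets for any session that has been idle longer than idleTimeout.
func sweepIdleSessions(ctx context.Context, sessions []StoredSession, idleTimeout time.Duration,
    revoke func(ctx context.Context, refreshToken string) error,
    deleteSecrets func(ctx context.Context, sessionID string) error) {
    for _, s := range sessions {
        if time.Since(s.LastUsed) <= idleTimeout {
            continue // still within the idle window; keep the session
        }
        // Best effort: ask the upstream IDP to forget its refresh token first.
        if err := revoke(ctx, s.UpstreamRefreshToken); err != nil {
            log.Printf("revocation failed for session %s: %v", s.ID, err)
        }
        if err := deleteSecrets(ctx, s.ID); err != nil {
            log.Printf("secret cleanup failed for session %s: %v", s.ID, err)
        }
    }
}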

Can users opt out?

Current Pinniped users may expect that they can have a 9 hour session, regardless of their upstream IDP settings. It's possible that users would want to continue using Pinniped this way.
It's also possible that even with the precautions we take to encrypt them, users will be skeeved out by us storing user LDAP credentials.
Opting out would require an additional field (spec.refresh.disableUpstreamRefresh) on the identity provider resource, whether we frame it as opt-in or opt-out. It's probably better to make upstream refresh the default and require an explicit opt-out, to keep the safer behavior by default. That does change behavior for upgrading users, but that seems like a good thing.

How would this work if we had multiple IDPs?

We would likely need to encode information about which IDP the refresh token was issued for. However, if we don't do that now, we should still be fine in the future; the worst that could happen is that the refresh tokens don't work and we prompt the user to log in again after upgrade. We should cross that bridge when we come to it.

Strawdog Proposal

OIDC

API changes to OIDCIdentityProvider

  • Add field spec.refresh.sessionLength, to allow users to set a shorter session length than their upstream. Default is to defer to upstream.
  • Add field spec.refresh.idleTimeout, to allow users to choose how long to keep tokens around as secrets in the supervisor before garbage collecting them. Default is 24 hours.
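
As a sketch, the new fields might be expressed in the Go API types roughly like this (the struct name and exact shape are assumptions, not the actual Pinniped API; metav1 is k8s.io/apimachinery/pkg/apis/meta/v1):

// RefreshSpec is a hypothetical addition to OIDCIdentityProvider.spec.
type RefreshSpec struct {
    // SessionLength caps how long a downstream session may keep refreshing,
    // even if the upstream IDP would allow longer. Empty means defer to the upstream.
    // +optional
    SessionLength *metav1.Duration `json:"sessionLength,omitempty"`

    // IdleTimeout controls how long unused session Secrets are kept on the
    // Supervisor before garbage collection. Defaults to 24 hours.
    // +optional
    IdleTimeout *metav1.Duration `json:"idleTimeout,omitempty"`
}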

Implementing the flow

  • Upon upstream login, always include prompt=consent and access_type=offline, and include offline_access in the requested scopes.
  • Downstream login should return an error if prompt=none, prompt=login, or prompt=select_account is requested, rather than ignoring them.
  • Upon login, when we receive a refresh token back from the upstream, we should store it on the Supervisor as a Secret, and pass back an unrelated refresh token that we create to be the downstream refresh token (saving it as a Secret, just as we do now).
  • When a user refreshes their downstream session:
    • get the upstream refresh token out of storage
    • request a new access and id token from the upstream
    • make a new request to the UserInfo endpoint
    • If these requests succeed, pass a new ID token and refresh token back to the user.
    • If we got a new refresh token back (as from GitLab), replace the old one in storage.
    • If we get an error response, delete the stored Secrets and return an error downstream.
  • If a user is idle for more than the idleTimeout, make a revocation request to the upstream to delete the refresh token, then garbage collect their associated secrets.

LDAP

API changes to LDAPIdentityProvider

  • Add field spec.refresh.sessionLength to allow users to set session length. Default is 9 hours.
  • Add field spec.refresh.disableUpstreamRefresh to allow users to opt out of upstream refresh.

Implementation

  • Upon upstream login, generate a key and encrypt the LDAP username and password with it to produce an opaque refresh token. Save the key as a Secret on the Supervisor. Pass the opaque token back to the user.
  • When a user refreshes their downstream session, first check whether we're past login time + sessionLength; if not, get the key out of storage and use it to decrypt the refresh token, then attempt a bind.
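
Putting the pieces together for LDAP, a hypothetical refresh handler step could look like this, reusing the openCredentials, withinConfiguredSessionLength, and revalidateLDAPUser sketches from earlier (the user's DN is assumed to be stored alongside the session):

// refreshLDAPSession is a hypothetical composition of the earlier sketches.
func refreshLDAPSession(key []byte, opaqueToken, ldapURL, userDN string, loginTime time.Time, sessionLength time.Duration) error {
    if !withinConfiguredSessionLength(loginTime, sessionLength) {
        return fmt.Errorf("session length exceeded, please log in again")
    }
    // For simplicity this sketch assumes the opaque token encrypts just the password;
    // the proposal above also includes the username.
    password, err := openCredentials(key, opaqueToken)
    if err != nil {
        return fmt.Errorf("could not decrypt refresh token: %w", err)
    }
    // If the bind fails, the upstream password has changed and the session ends.
    return revalidateLDAPUser(ldapURL, userDN, string(password))
}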

API changes to ActiveDirectoryIdentityProvider

  • Add field spec.refresh.sessionLength to allow users to set session length. Default is 9 hours.

Implementation

  • Upon upstream login, generate an unrelated token and pass it back to the user. Store the login time in Supervisor Secrets.
  • Upon refresh, query Active Directory to verify that a) the userAccountControl attribute does not indicate a disabled account (the value 2 represents a disabled account) and b) pwdChangedTime is not later than the login time (which would mean the password has changed since we last logged in).