
Fixing "500 Certificate in an error state. Fix any errors, and then click Retry." Blocking Renewal In cert-manager (vcert#269)

The problem: when using cert-manager's built-in Venafi issuer, or when running vcert enroll with TPP, people get "stuck" with an error like:

500 Certificate \VED\Policy\Test\foo.com has encountered an error while processing,
Status: This certificate cannot be processed while it is in an error state. Fix any
errors, and then click Retry., Stage: 700.

This message occurs when a past enrollment for that certificate has failed or is still in progress. The current workaround is to call POST /reset with Restart=False, and then re-run vcert enroll (or renew the certificate in cert-manager).
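
The workaround can be sketched as follows. This is a minimal illustration, not VCert's implementation: the helper name `build_reset_request` is hypothetical, the host and the `/vedsdk/certificates/reset` path are taken from the log lines shown under Solution 2, and the `CertificateDN` and `Restart` field names as well as the bearer-token header are assumptions about the TPP Web SDK request format that should be checked against your TPP version. The function only builds the request and does not send it.

```python
import json
import urllib.request


def build_reset_request(base_url: str, token: str, cert_dn: str) -> urllib.request.Request:
    """Build (but do not send) a POST /reset request with Restart=False."""
    # Field names are assumed; verify them against the TPP Web SDK docs.
    body = json.dumps({"CertificateDN": cert_dn, "Restart": False}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/vedsdk/certificates/reset",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,  # assumed auth scheme
        },
    )


req = build_reset_request(
    "https://tpp.example.net", "example-token", r"\VED\Policy\Test\foo.com"
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Sending the request (for example with `urllib.request.urlopen(req)`) and then re-running vcert enroll completes the workaround.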

Resolution Progress

cert-manager

| cert-manager | Fixed? | How was it solved? |
|---|---|---|
| 1.9.* | | |
| 1.10.* | | |
| 1.11.0 | | Solution 3 built into VCert 4.23.0 but fails 50% of the time (https://github.com/Venafi/vcert/issues/273) |
| 1.11.1 to 1.11.4 | | Solution 2 using VCert fork |
| 1.11.5 | | Solution 2 ad-hoc using ResetCertificate in VCert 5.0.0. Doesn't work, see #6397 |
| 1.12.0 to 1.12.2 | | Solution 2 using VCert fork |
| 1.12.3 to 1.12.5 | | Solution 2 ad-hoc using ResetCertificate in VCert 5.0.0. Doesn't work, see #6397 |
| 1.12.6 and up | | Solution 2 ad-hoc using ResetCertificate in VCert 5.0.0 |
| 1.13.0 to 1.13.1 | | Solution 2 ad-hoc using ResetCertificate in VCert 5.0.0. Doesn't work, see #6397 |
| 1.13.2 and up | | Solution 2 ad-hoc using ResetCertificate in VCert 5.0.0 |
| 1.14.0 and up | | Solution 2 ad-hoc using ResetCertificate in VCert 5.0.0 |

cert-manager 1.11.0 (Jan 2023) fixed this issue with Solution 3 by using VCert 4.23.0.

cert-manager 1.12.0 (May 2023) changed to Solution 2 by using a patched version of VCert. The decision was taken after the fix had been reverted in VCert 4.24.0. The team knew that this patch would slow down the issuance of certificates by 9% in case of heavy load on TPP, but deemed the risk of not being able to renew too high.

cert-manager 1.11.5, 1.12.3, and 1.13.0 switched to calling ResetCertificate directly (still Solution 2), which became possible with VCert 5.0.0. These versions turned out to be buggy: they don't call /reset when they should, as detailed in #6397.

VCert

VCert 4.23.0 (Dec 2022) fixed the problem using Solution 3.

VCert 4.24.0 (Feb 2023) removed the fix after a discussion between the cert-manager and VCert teams. Solution 3 changed VCert's behavior in an unexpected way, so it was reverted. See the meeting notes below for details.

VCert 5.0.0 (Aug 2023) added the ResetCertificate Go function. It was decided that the problem of "500 Fix any errors, and then click Retry" would have to be fixed by the users of VCert. The VCert CLI wasn't fixed with this change.

Solutions Considered

Solution 1: reset if retrieve returns 500, then request

Pseudo-algorithm:

resp = tpp.retrieve()
if resp is 500:
    tpp.reset()

tpp.request()

A prototype of this solution was implemented in commit 0de76500.

In terms of number of calls, this solution adds a call to POST /retrieve on every enrollment, and also adds a call to POST /reset when a past enrollment exists.
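
The call count can be checked with a small simulation. This is a sketch only: `FakeTPP` and `enroll_solution_1` are hypothetical names, and the fake models just two of the states listed in the table below (no past enrollment, and a failed past enrollment); it does not model pending enrollments or any real TPP behavior.

```python
class FakeTPP:
    """In-memory stand-in for TPP that records every call (hypothetical)."""

    def __init__(self, past_enrollment_failed: bool):
        self.failed = past_enrollment_failed
        self.calls = []

    def retrieve(self) -> int:
        self.calls.append("POST /retrieve")
        return 500 if self.failed else 200

    def reset(self) -> int:
        self.calls.append("POST /reset")
        if not self.failed:
            return 400  # "No reset is required for the certificate."
        self.failed = False
        return 200

    def request(self) -> int:
        self.calls.append("POST /request")
        return 200


def enroll_solution_1(tpp: FakeTPP) -> list:
    """Solution 1: retrieve first, reset only on 500, then request."""
    if tpp.retrieve() == 500:
        tpp.reset()
    tpp.request()
    return tpp.calls


happy = enroll_solution_1(FakeTPP(past_enrollment_failed=False))
stuck = enroll_solution_1(FakeTPP(past_enrollment_failed=True))
print(happy)  # ['POST /retrieve', 'POST /request']
print(stuck)  # ['POST /retrieve', 'POST /reset', 'POST /request']
```

Even when nothing is stuck, the enrollment costs two calls instead of one.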

In terms of implementation complexity, this solution is complex due to the many possible results of POST /retrieve. The table below describes each case that I was able to trigger. It reads as "given this state and given this result of POST /retrieve, here is the result of POST /reset that we can expect".

| State | Result of POST /retrieve | Result of POST /reset |
|---|---|---|
| No existing cert and past enrollment was reset* | 202 {"Stage": 0, "Status": "Not yet available"} | 400 {"Error": "Reset is not completed. No reset is required for the certificate."} |
| Existing cert and no past enrollment | 200 {"CertificateData":"..."} | 400 {"Error": "Reset is not completed. No reset is required for the certificate."} |
| Existing cert and past enrollment is pending** | 202 {"Stage": 500, "Status": "Post CSR"} | 200 {"ProcessingResetCompleted": true} |
| Existing cert and past enrollment is failed** | 500 {"Stage": 500, "Status": "Post CSR failed with error: Cannot connect to the certificate authority (CA)."} | 200 {"ProcessingResetCompleted": true} |
| Existing cert and past enrollment has timed out due to WorkToDoTimeout | 500 {"Stage": 500, "Status": "???"} (unsure) | 200 {"ProcessingResetCompleted": true} |
| Past enrollment is failed and another cert was requested | 500 {"Stage":500, "Status":"WebSDK CertRequest Module Requested Certificate"} or {"Stage":500,"Status":"This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry."} | 200 {"ProcessingResetCompleted": true} |
| ?? | 500 {"Stage":-1,"Status":"Not yet available"} | |
| ?? | 202 {"Stage":-1,"Status":"WebSDK CertRequest Module Requested Certificate"} | |

*The "empty cert, empty enrollment" is a special case where a certificate gets created, but the enrollment doesn't get through and is reset.

**Enrollments can't get stuck at stage 0 (except for the special case "no existing cert and past enrollment was reset"). For example, when a user-provided CSR is submitted with an email SAN but the policy folder says that email SANs are forbidden, POST /request will fail immediately.

🔥 This solution adds one call to the happy path. That is not acceptable: TPP customers are known to already struggle with lengthy enrollments when many "enrollments" (e.g., vcert enroll calls) run concurrently.

Solution 2: reset, then request

Pseudo-algorithm:

tpp.reset(restart=false)

tpp.request()

A prototype of this solution was implemented in commit aad355f3.

In terms of the performance impact, this solution adds one call to POST /reset every time vcert enroll is called.

In terms of complexity, the implementation is very simple since we blindly call POST /reset.

🔥 This solution relies on the assumption that calling POST /reset is always cheap and doesn't trigger any expensive background task on TPP's side. This solution also adds one call to the happy path. That is not acceptable: TPP customers are known to already struggle with lengthy enrollments when many "enrollments" (e.g., vcert enroll calls) run concurrently.

🔥 Also, if no certificate issuance was in progress, POST /reset returns 400, which may confuse users. Here is what it would look like in the cert-manager logs:

vCert: Got 400 Bad request. Check the error in the response for details. status for POST https://tpp.example.net/vedsdk/certificates/reset
vCert: Got 200 OK status for POST https://tpp.example.net/vedsdk/certificates/request
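
One way to keep this 400 from looking like a failure is to classify POST /reset responses before logging them. The following is a minimal sketch, assuming the response bodies listed in the Solution 1 table; the function name and the returned labels are hypothetical, and this is not what VCert or cert-manager actually does.

```python
NO_RESET_REQUIRED = "No reset is required for the certificate."


def reset_outcome(http_status: int, body: dict) -> str:
    """Classify a POST /reset response so that callers can log it sensibly."""
    if http_status == 200 and body.get("ProcessingResetCompleted") is True:
        return "reset"
    if http_status == 400 and NO_RESET_REQUIRED in body.get("Error", ""):
        # Expected on the happy path: there was nothing to reset.
        return "nothing-to-reset"
    return "error"


print(reset_outcome(200, {"ProcessingResetCompleted": True}))
print(reset_outcome(
    400,
    {"Error": "Reset is not completed. No reset is required for the certificate."},
))
print(reset_outcome(500, {}))
```

Matching on the error text is brittle in the same way as the Status matching discussed under Solution 3: if TPP rewords the message, the 400 would be reported as an error again.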

Solution 3: request, then reset if retrieve returns "Click Retry" or "WebSDK CertRequest"

Solution 3 was hinted at by Ryan in https://github.com/Venafi/vcert/issues/239#issuecomment-1242109588.

Pseudo-algorithm for requesting the certificate (credit goes to Dmitry Philimonov, see internal message):

resp = tpp.request()
if resp is not 200:
    return error

while true:
    resp = tpp.retrieve()
    if first retrieve call and resp is 500 and resp.Status is one of
    ["WebSDK CertRequest Module Requested Certificate", "This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry."]:
        if tpp.reset(restart=true) is 200:
            continue
        else:
            return error
    else if resp is 202:
        continue
    else if resp is 200:
        return resp.cert
    else:
        return error

Performance-wise, this solution avoids the need for additional calls in the happy path.

Complexity-wise, this solution is as complex as Solution 2, and less complex than Solution 1, in which we would need to understand every possible value returned by POST /retrieve.

🔥 This solution relies on a very specific behavior of POST /retrieve: the endpoint MUST return either of the two following Status messages if and only if the past enrollment has failed:

This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry.

or

WebSDK CertRequest Module Requested Certificate

If TPP changes these messages, even slightly, or stops returning them, vcert will stop resetting, but it won't break.

Update 1 Dec 2022: We have seen a case where TPP 22.04 would return the message "WebSDK CertRequest Module Requested Certificate" even though no prior enrollment existed. That happened when one of the Windows services was down (we think it was the Logging service). We were not able to reproduce this issue, and assume it is a case of "misbehaving TPP" and won't happen in production TPP instances.

Could we rely solely on the "Click retry" message? Here are the results of attempting to retrieve after requesting when there is an already-failing enrollment:

Case 1 occurs 60% of the time:

{"Stage":500,"Status":"This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry."}
{"Stage":500,"Status":"This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry."}
{"Stage":500,"Status":"This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry."}
{"Stage":500,"Status":"This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry."}
{"Stage":500,"Status":"This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry."}

Case 2 occurs 30% of the time:

{"Stage":500,"Status":"WebSDK CertRequest Module Requested Certificate"}
{"Stage":500,"Status":"This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry."}
{"Stage":500,"Status":"This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry."}
{"Stage":500,"Status":"This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry."}
{"Stage":500,"Status":"This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry."}

Case 3 occurs 10% of the time:

{"Stage":500,"Status":"WebSDK CertRequest Module Requested Certificate"}
{"Stage":500,"Status":"WebSDK CertRequest Module Requested Certificate"}
{"Stage":500,"Status":"WebSDK CertRequest Module Requested Certificate"}
{"Stage":500,"Status":"WebSDK CertRequest Module Requested Certificate"}
{"Stage":500,"Status":"WebSDK CertRequest Module Requested Certificate"}

We cannot rely solely on "Click Retry": we would miss 10% of the cases that we are trying to fix (cf. case 3).

Can we skip the "WebSDK CertRequest" messages until we see the "Click Retry" message? This is not ideal: 10% of the time (case 3), we would wait until the timeout expires and never reset.
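
The reasoning over the three cases above can be replayed with a short sketch. It only models the Status strings from the observed sequences (the HTTP status code is left out), and the function and variable names are hypothetical.

```python
WEBSDK = "WebSDK CertRequest Module Requested Certificate"
CLICK_RETRY = (
    "This certificate cannot be processed while it is in an error state. "
    "Fix any errors, and then click Retry."
)

# Status of five successive retrieve calls, as observed in each case above.
CASES = {
    "case 1": [CLICK_RETRY] * 5,
    "case 2": [WEBSDK] + [CLICK_RETRY] * 4,
    "case 3": [WEBSDK] * 5,
}


def first_trigger(sequence, statuses):
    """Index of the first retrieve response that would trigger a reset, or None."""
    for i, status in enumerate(sequence):
        if status in statuses:
            return i
    return None


both = {name: first_trigger(seq, (WEBSDK, CLICK_RETRY)) for name, seq in CASES.items()}
click_retry_only = {name: first_trigger(seq, (CLICK_RETRY,)) for name, seq in CASES.items()}
print(both)              # {'case 1': 0, 'case 2': 0, 'case 3': 0}
print(click_retry_only)  # {'case 1': 0, 'case 2': 1, 'case 3': None}
```

Matching both messages triggers the reset on the first retrieve call in every case. Matching "Click Retry" alone delays the reset in case 2 and never resets in case 3.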

Solution 3.1:

Solution 4 (doesn't work): request, then reset if request returned an error

resp = tpp.request()
if resp is not 200:
    tpp.reset(restart=true)

Solution 4 was found to not work. The problem is that POST /request always succeeds as long as the given CSR or certificate parameters are valid. Calling POST /request doesn't allow you to know whether POST /reset needs to be called or not.

Update: Resetting in RetrieveCertificate was a bad idea

Meeting notes for Feb 7 2023. Attending: Atanas, Dima, Mael, Tim.

  • RetrieveCertificate should not do an operation on the certificate object in TPP, it should only "query" and not "command". Users of VCert will find it unexpected that RetrieveCertificate does a reset call.
  • The suggestion is to revert https://github.com/Venafi/vcert/pull/269, and then create a public function Reset that will do reset(restart=false) followed by renew(worktodotimeout=30s) as well as a way for users of VCert to know if they need to reset after receiving an error on RetrieveCertificate or RequestCertificate(WorkToDoTimeout).
  • We also want to improve VCert so that it watches instead of polling. Watching is possible by using WorkToDoTimeout=30s when calling RetrieveCertificate and RequestCertificate.
  • The performance team has reported that the average duration for the issuance of a certificate in TPP is 7 seconds. Using a watch of 30 seconds seems appropriate.
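
To illustrate the difference in load between polling and watching, here is a rough sketch. It assumes that a call made with WorkToDoTimeout blocks until the certificate is issued or the timeout elapses. The 2-second polling interval and both function names are hypothetical and only serve the arithmetic.

```python
import math


def polling_calls(issuance_seconds: float, poll_interval_seconds: float) -> int:
    """Retrieve calls needed when polling once per interval until issuance is done."""
    return max(1, math.ceil(issuance_seconds / poll_interval_seconds))


def watching_calls(issuance_seconds: float, work_to_do_timeout_seconds: float = 30) -> int:
    """Retrieve calls needed when each call blocks for up to the timeout."""
    return max(1, math.ceil(issuance_seconds / work_to_do_timeout_seconds))


# Average issuance duration reported by the performance team: 7 seconds.
print(polling_calls(7, 2))  # 4
print(watching_calls(7))    # 1
```

With a 7-second average issuance, a single 30-second watch usually suffices, whereas polling issues several calls per certificate.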