The problem: when using cert-manager's built-in Venafi issuer or when running vcert enroll with TPP, people get "stuck" with an error like:
500 Certificate \VED\Policy\Test\foo.com has encountered an error while processing,
Status: This certificate cannot be processed while it is in an error state. Fix any
errors, and then click Retry., Stage: 700.
This message occurs when a past enrollment for that certificate has failed or is still in progress. The current workaround is to call POST /reset with Restart=False, and then re-run vcert enroll (or renew the certificate in cert-manager).
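As a rough illustration of the workaround, the sketch below builds (but does not send) the reset request. Only Restart=False comes from the text above; the `/vedsdk/certificates/reset` path, the `CertificateDN` field, the bearer-token header, the host name, and the helper name `build_reset_request` are assumptions made for illustration and should be checked against the TPP Web SDK documentation.

```python
import json
import urllib.request


def build_reset_request(base_url: str, token: str, certificate_dn: str) -> urllib.request.Request:
    """Build (without sending) a POST /reset request with Restart=False.

    Assumptions: the endpoint path, the "CertificateDN" field name and the
    bearer-token header are illustrative, not taken from the text above.
    """
    body = json.dumps({"CertificateDN": certificate_dn, "Restart": False}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/vedsdk/certificates/reset",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,
        },
    )


# Hypothetical host and token; the DN is the one from the error message above.
req = build_reset_request("https://tpp.example.com", "<access-token>", r"\VED\Policy\Test\foo.com")
```

Sending the request (for example with `urllib.request.urlopen(req)`) is left out on purpose, so the sketch can be inspected without a TPP instance.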
cert-manager | Fixed? | How was it solved? |
---|---|---|
1.9.* | ❌ | |
1.10.* | ❌ | |
1.11.0 | ❌ | Solution 3 built into VCert 4.23.0 but fails 50% of the time (https://github.com/Venafi/vcert/issues/273) |
1.11.1 — 1.11.4 | ✅ | Solution 2 using VCert fork |
1.11.5 | ❌ | ResetCertificate in VCert 5.0.0 |
1.12.0 — 1.12.2 | ✅ | Solution 2 using VCert fork |
1.12.3 — 1.12.5 | ❌ | ResetCertificate in VCert 5.0.0 |
1.12.6 and up | ✅ | Solution 2 ad-hoc using ResetCertificate in VCert 5.0.0 |
1.13.0 — 1.13.1 | ❌ | ResetCertificate in VCert 5.0.0 |
1.13.2 and up | ✅ | Solution 2 ad-hoc using ResetCertificate in VCert 5.0.0 |
1.14.0 and up | ✅ | Solution 2 ad-hoc using ResetCertificate in VCert 5.0.0 |
cert-manager 1.11.0 (Jan 2023) fixed this issue with Solution 3 by using VCert v4.23.0.
cert-manager 1.12.0 (May 2023) changed to Solution 2 by using a patched version of VCert. The decision was taken after the fix had been reverted in VCert 4.24.0. The team knew that this patch would slow down the issuance of certificates by 9% in case of heavy load on TPP, but deemed the risk of not being able to renew too high.
cert-manager 1.11.5, 1.12.3, and 1.13.0 changed to calling ResetCertificate manually with Solution 2 thanks to VCert 5.0.0. But these versions were found to be buggy: they don't call /reset when they should, as detailed in #6397.
VCert 4.23.0 (Dec 2022) fixed the problem using Solution 3.
VCert 4.24.0 (Feb 2023) removed the fix after a discussion between the cert-manager and VCert teams. Solution 3 was changing VCert's behavior in a weird way, so it was reverted. You can read the meeting notes below to learn more.
VCert 5.0.0 (Aug 2023) added the ResetCertificate Go function. It was decided that the problem of "500 Fix any errors, and then click Retry" would have to be fixed by the users of VCert. The VCert CLI wasn't fixed with this change.
Pseudo-algorithm:
A prototype of this solution was implemented in commit 0de76500.
In terms of number of calls, this solution adds a new call to POST /retrieve, and also adds a call to POST /reset when a past enrollment exists.
In terms of implementation complexity, this solution is complex due to the many possible results of POST /retrieve. In the table below, I describe each possible case that I was able to trigger. It is read as "given this state and given this result of POST /retrieve, here is the result of POST /reset that we can expect".
State | Result of POST /retrieve | Result of POST /reset |
---|---|---|
No existing cert and past enrollment was reset* | 202 {"Stage": 0, "Status": "Not yet available"} | 400 {"Error": "Reset is not completed. No reset is required for the certificate."} |
Existing cert and no past enrollment | 200 {"CertificateData":"..."} | 400 {"Error": "Reset is not completed. No reset is required for the certificate."} |
Existing cert and past enrollment is pending** | 202 {"Stage": 500, "Status": "Post CSR"} | 200 {"ProcessingResetCompleted": true} |
Existing cert and past enrollment is failed** | 500 {"Stage": 500, "Status": "Post CSR failed with error: Cannot connect to the certificate authority (CA)."} | 200 {"ProcessingResetCompleted": true} |
Existing cert and past enrollment has timed out due to WorkToDoTimeout | 500 {"Stage": 500, "Status": "???"} (unsure) | 200 {"ProcessingResetCompleted": true} |
Past enrollment is failed and another cert was requested | 500 {"Stage":500, "Status":"WebSDK CertRequest Module Requested Certificate"} or {"Stage":500,"Status":"This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry."} | 200 {"ProcessingResetCompleted": true} |
?? | 500 {"Stage":-1,"Status":"Not yet available"} | |
?? | 202 {"Stage":-1,"Status":"WebSDK CertRequest Module Requested Certificate"} | |
*The "empty cert, empty enrollment" is a special case where a certificate gets created, but the enrollment doesn't get through and is reset.
**Enrollments can't get stuck at stage 0 (except for the special case "no existing cert and past enrollment was reset"). For example, when a user-provided CSR is submitted with an email SAN but the policy folder says that email SANs are forbidden, POST /request will fail immediately.
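To make the table above easier to reason about, here is a minimal sketch that encodes it as a lookup: given the HTTP status and Stage returned by POST /retrieve, it says whether POST /reset is expected to succeed. The function name and the three-valued return are illustrative choices, and the rows marked "??" are deliberately left as unknown rather than guessed.

```python
from typing import Optional


def reset_expected_to_succeed(retrieve_http_status: int, stage: Optional[int]) -> Optional[bool]:
    """Map a POST /retrieve result to the expected outcome of POST /reset.

    Returns True when the table says /reset answers 200, False when it
    answers 400 "No reset is required", and None when the table does not
    say (the "??" rows, or any combination that was never observed).
    """
    if retrieve_http_status == 200:
        # Existing cert and no past enrollment.
        return False
    if stage == -1:
        # The two "??" rows: the result of /reset was not recorded.
        return None
    if retrieve_http_status == 202 and stage == 0:
        # No existing cert and past enrollment was reset.
        return False
    if retrieve_http_status in (202, 500) and stage == 500:
        # Pending, failed, or timed-out past enrollment.
        return True
    # Anything else was not observed, so make no claim about it.
    return None
```

Returning None for unobserved combinations keeps the sketch honest about how many cases the table leaves open, which is the main source of complexity for this solution.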
🔥 This solution adds one additional call to the happy path, and that is not acceptable because TPP customers already struggle with lengthy enrollments when many "enrollments" (e.g., vcert enroll calls) happen concurrently.
Pseudo-algorithm:
A prototype of this solution was implemented in commit aad355f3.
In terms of the performance impact, this solution adds one call to POST /reset every time vcert enroll is called.
In terms of complexity, the implementation is very simple since we blindly call POST /reset.
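The "blind reset" flow described above could be sketched as follows. This is not the code from the prototype commit: the two callables stand in for hypothetical HTTP helpers that perform POST /reset and POST /request and return the status code with the decoded JSON body. Treating the 400 "No reset is required" answer as benign is an assumption based on the table of /reset results earlier in this document.

```python
from typing import Callable, Dict, Tuple

# (HTTP status code, decoded JSON body), as a hypothetical HTTP helper might return it.
Response = Tuple[int, Dict[str, object]]

NO_RESET_REQUIRED = "No reset is required"


def enroll_with_blind_reset(
    reset: Callable[[], Response],
    request: Callable[[], Response],
) -> Response:
    """Always call POST /reset first, then POST /request."""
    status, body = reset()
    if status == 400 and NO_RESET_REQUIRED in str(body.get("Error", "")):
        # Nothing was in progress: expected on the happy path, not a failure.
        pass
    elif status != 200:
        raise RuntimeError(f"unexpected reset response: {status} {body}")
    return request()
```

The extra round trip happens on every enrollment, including the common case where nothing needs to be reset.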
🔥 This solution relies on the assumption that calling POST /reset is always cheap and doesn't trigger any expensive background task on TPP's side. This solution also adds one additional call to the happy path, and that is not acceptable because TPP customers already struggle with lengthy enrollments when many "enrollments" (e.g., vcert enroll calls) happen concurrently.
🔥 Also, if no certificate issuance was in progress, POST /reset returns 400, which may confuse users. Here is what it would look like in the cert-manager logs:
Solution 3 was hinted by Ryan in https://github.com/Venafi/vcert/issues/239#issuecomment-1242109588.
Pseudo-algorithm for requesting the certificate (credit goes to Dmitry Philimonov, see internal message):
Performance-wise, this solution avoids the need for additional calls in the happy path.
Complexity-wise, this solution is as complex as Solution 2, and less complex than Solution 1, in which we would need to understand every possible value returned by POST /retrieve.
🔥 This solution relies on a very specific behavior of POST /retrieve: the endpoint MUST return either of the two following Status messages if and only if the past enrollment is failed:
This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry.
or
WebSDK CertRequest Module Requested Certificate
If TPP slightly changes these messages, or doesn't show them at all, then vcert will stop resetting, but it won't break.
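The message check this solution depends on can be sketched as a simple predicate. The two strings are the Status messages quoted above; the function and constant names are illustrative, and substring matching is an assumption made so that the check also works on a full error line such as the one shown at the top of this document.

```python
# Status messages that, per the text above, indicate a failed past enrollment.
RESET_TRIGGER_STATUSES = (
    "This certificate cannot be processed while it is in an error state. "
    "Fix any errors, and then click Retry.",
    "WebSDK CertRequest Module Requested Certificate",
)


def should_reset(retrieve_status_message: str) -> bool:
    """Return True if the message from POST /retrieve calls for a reset.

    If TPP rewords or drops these messages, this returns False: no reset
    is attempted, but nothing else breaks.
    """
    return any(trigger in retrieve_status_message for trigger in RESET_TRIGGER_STATUSES)
```

Because the predicate only ever adds a reset when a known message is seen, an unrecognized message degrades to the behavior without the fix instead of causing a new failure.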
Update 1 Dec 2022: We have seen a case where TPP 22.04 would return the message "WebSDK CertRequest Module Requested Certificate" even though no prior enrollment existed. That happened when one of the Windows services was down (we think it was the Logging service). We were not able to reproduce this issue, and assume it is a case of "misbehaving TPP" and won't happen in production TPP instances.
Could we rely solely on the "Click retry" message? Here are the results of attempting to retrieve after requesting when there is an already-failing enrollment:
Case 1 occurs 60% of the time:
Case 2 occurs 30% of the time:
Case 3 occurs 10% of the time:
We cannot rely solely on "Click Retry": we would miss 10% of the cases that we are trying to fix (cf. case 3).
Can we skip the "WebSDK CertRequest" messages until we see the "Click Retry" message? This is not ideal: 10% of the time (case 3), we would end up waiting until the end of the timeout, and not reset.
Solution 4 was found not to work. The problem is that POST /request always succeeds as long as the given CSR or certificate parameters are valid. Calling POST /request doesn't allow you to know whether POST /reset needs to be called or not.
Meeting notes for Feb 7 2023. Attending: Atanas, Dima, Mael, Tim.
Reset that will do reset(restart=false) followed by renew(worktodotimeout=30s) as well as a way for users of VCert to know if they need to reset after receiving an error on RetrieveCertificate or RequestCertificate(WorkToDoTimeout).