# Fixing "500 Certificate in an error state. Fix any errors, and then click Retry." Blocking Renewal In cert-manager ([vcert#269](https://github.com/Venafi/vcert/pull/269))

**The problem:** When using cert-manager's Venafi built-in issuer or when running `vcert enroll` with TPP, people get "stuck" with an error like:

> 500 Certificate \VED\Policy\Test\foo.com has encountered an error while processing, Status: This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry., Stage: 700.

This message occurs when a past enrollment for that certificate has failed or is still in progress.

The current workaround is to call `POST /reset` with `Restart=False`, and then re-run `vcert enroll` (or renew the certificate in cert-manager).

## Resolution Progress

### cert-manager

| cert-manager | Fixed? | How was it solved? |
|--------------|--------|-----------------------------------------------------------|
| 1.9.* | ❌ | |
| 1.10.* | ❌ | |
| 1.11.0 | ✅ | Solution 3 built into VCert 4.23.0 |
| 1.11.1 | ✅ | Solution 3 built into VCert 4.23.0 |
| 1.11.2 | ✅ | Solution 3 built into VCert 4.23.0 |
| 1.11.3 | ✅ | Solution 3 built into VCert 4.23.0 |
| 1.11.4 | ✅ | Solution 3 built into VCert 4.23.0 |
| 1.11.5 | ❌ | ~~Solution 2 ad-hoc using `ResetCertificate` in VCert 5.0.0~~ Doesn't work, see [#6397][] |
| 1.12.0 | ✅ | Solution 2 built into fork of VCert |
| 1.12.1 | ✅ | Solution 2 built into fork of VCert |
| 1.12.2 | ✅ | Solution 2 built into fork of VCert |
| 1.12.3 | ❌ | ~~Solution 2 ad-hoc using `ResetCertificate` in VCert 5.0.0~~ Doesn't work, see [#6397][] |
| 1.13.0 | ❌ | ~~Solution 2 ad-hoc using `ResetCertificate` in VCert 5.0.0~~ Doesn't work, see [#6397][] |
| 1.13.1 | ❌ | ~~Solution 2 ad-hoc using `ResetCertificate` in VCert 5.0.0~~ Doesn't work, see [#6397][] |

[#6397]: https://github.com/cert-manager/cert-manager/issues/6397

cert-manager 1.11.0 (Jan 2023) fixed this issue with Solution 3, through VCert v4.23.0.

cert-manager 1.12.0 (May 2023) changed to Solution 2 by using a patched version of VCert. The decision was taken after the fix had been reverted in VCert 4.24.0. The team knew that this patch would slow down the issuance of certificates by 9% under heavy load on TPP, but deemed the risk of not being able to renew too high.

cert-manager 1.11.5, 1.12.3, and 1.13.0 changed to calling `ResetCertificate` manually (Solution 2), which VCert 5.0.0 made possible. These versions turned out to be buggy: they don't call `/reset` when they should, as detailed in [#6397][].

### VCert

VCert 4.23.0 (Dec 2022) fixed the problem using Solution 3.

VCert 4.24.0 (Feb 2023) removed the fix after a discussion between the cert-manager and VCert teams. Solution 3 changed VCert's behavior in an unexpected way, so it was reverted. You can read the [meeting notes below](#Update-Resetting-in-RetrieveCertificate-was-a-bad-idea) to know more.

VCert 5.0.0 (Aug 2023) added the `ResetCertificate` Go function. It was decided that the problem of "500 Fix any errors, and then click Retry" would have to be fixed by the users of VCert. The VCert CLI wasn't fixed with this change.

## Solutions Considered

### Solution 1: reset if retrieve returns 500, then request

Pseudo-algorithm:

```python
resp = tpp.retrieve()
if resp is 500:
    tpp.reset()
tpp.request()
```

A prototype of this solution was implemented in commit [0de76500](https://github.com/Venafi/vcert/pull/269/commits/0de765000234f1c4a0fc919df20b2e18db5cd436).

In terms of number of calls, this solution adds a new call to `POST /retrieve`, and also adds a call to `POST /reset` when a past enrollment exists.

In terms of implementation complexity, this solution is complex due to the many possible results of `POST /retrieve`. In the table below, I describe each case that I was able to trigger. It is read as "given this state and given this result of `POST /retrieve`, here is the result of `POST /reset` that we can expect".

| State | Result of `POST /retrieve` | Result of `POST /reset` |
| --- | --- | --- |
| No existing cert and past enrollment was reset\* | 202 `{"Stage": 0, "Status": "Not yet available"}` | 400 `{"Error": "Reset is not completed. No reset is required for the certificate."}` |
| Existing cert and no past enrollment | 200 `{"CertificateData":"..."}` | 400 `{"Error": "Reset is not completed. No reset is required for the certificate."}` |
| Existing cert and past enrollment is pending\*\* | 202 `{"Stage": 500, "Status": "Post CSR"}` | 200 `{"ProcessingResetCompleted": true}` |
| Existing cert and past enrollment is failed\*\* | 500 `{"Stage": 500, "Status": "Post CSR failed with error: Cannot connect to the certificate authority (CA)."}` | 200 `{"ProcessingResetCompleted": true}` |
| Existing cert and past enrollment has timed out due to `WorkToDoTimeout` | 500 `{"Stage": 500, "Status": "???"}` (unsure) | 200 `{"ProcessingResetCompleted": true}` |
| Past enrollment is failed and another cert was requested | 500 `{"Stage":500, "Status":"WebSDK CertRequest Module Requested Certificate"}` or `{"Stage":500,"Status":"This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry."}` | 200 `{"ProcessingResetCompleted": true}` |
| ?? | 500 `{"Stage":-1,"Status":"Not yet available"}` | |
| ?? | 202 `{"Stage":-1,"Status":"WebSDK CertRequest Module Requested Certificate"}` | |

\*The "empty cert, empty enrollment" is a special case where a certificate gets created, but the enrollment doesn't get through and is reset.

\*\*Enrollments can't get stuck at stage 0 (except for the special case "no existing cert and past enrollment was reset"). For example, when a user-provided CSR is submitted with an email SAN but the policy folder says that email SANs are forbidden, `POST /request` will fail immediately.
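
The rows of the table with a known outcome can be condensed into a small lookup. The following is a minimal sketch in Python, not VCert code: `expected_reset_code` and `OBSERVED` are hypothetical names, and the lookup only encodes what was observed above. The two "??" rows have no recorded `POST /reset` result, so they map to `None`.

```python
from typing import Optional

# (HTTP code of POST /retrieve, Stage in its body) -> HTTP code of POST /reset,
# transcribed from the table above. Stage is None for the 200 case, where the
# body carries CertificateData instead of a Stage.
OBSERVED = {
    (202, 0): 400,     # no existing cert, past enrollment was reset
    (200, None): 400,  # existing cert, no past enrollment
    (202, 500): 200,   # past enrollment is pending
    (500, 500): 200,   # past enrollment failed (or, unsure, timed out)
}


def expected_reset_code(retrieve_code: int, stage: Optional[int]) -> Optional[int]:
    """Return the POST /reset code observed for this retrieve result.

    Returns None when the combination was never observed, for example the
    Stage -1 rows of the table.
    """
    return OBSERVED.get((retrieve_code, stage))
```

Even this reduced form needs both the HTTP code and the `Stage` to tell a pending enrollment (202, stage 500) from the "nothing to reset" case (202, stage 0), which illustrates why Solution 1 was considered complex.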

🔥 This solution adds one additional call to the happy path, and that is not acceptable because TPP customers already struggle with lengthy enrollments when many "enrollments" (e.g., `vcert enroll` calls) happen concurrently.

### Solution 2: reset, then request

Pseudo-algorithm:

```python
tpp.reset(restart=false)
tpp.request()
```

A prototype of this solution was implemented in commit [aad355f3](https://github.com/Venafi/vcert/pull/269/commits/aad355f3d712fd870b5b29ead57441d2e949da60).

In terms of performance impact, this solution adds one call to `POST /reset` every time `vcert enroll` is called.

In terms of complexity, the implementation is very simple since we blindly call `POST /reset`.

🔥 This solution relies on the assumption that calling `POST /reset` is always cheap and doesn't trigger any expensive background task on TPP's side. This solution also adds one additional call to the happy path, and that is not acceptable because TPP customers already struggle with lengthy enrollments when many "enrollments" (e.g., `vcert enroll` calls) happen concurrently.

### Solution 3: request, then reset if retrieve returns "Click Retry" or "WebSDK CertRequest"

Solution 3 was hinted at by Ryan in https://github.com/Venafi/vcert/issues/239#issuecomment-1242109588.

Pseudo-algorithm for requesting the certificate (credit goes to Dmitry Philimonov, see [internal message](https://venafi.slack.com/archives/CELJXR94H/p1669078060661549)):

```python
resp = tpp.request()
if resp is not 200:
    return error

while true:
    resp = tpp.retrieve()
    if first retrieve call and resp is 500 and resp.Status is one of [
        "WebSDK CertRequest Module Requested Certificate",
        "This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry.",
    ]:
        if tpp.reset(restart=true) is 200:
            continue
        else:
            return error
    else if resp is 202:
        continue
    else if resp is 200:
        return resp.cert
```

Performance-wise, this solution avoids the need for additional calls in the happy path.

Complexity-wise, this solution is as complex as Solution 2, and less complex than Solution 1, in which we would need to understand every possible value returned by `POST /retrieve`.

🔥 This solution relies on a very specific behavior of `POST /retrieve`: the endpoint MUST return either of the two following `Status` messages if and only if the past enrollment has failed:

> This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry.

or

> WebSDK CertRequest Module Requested Certificate

If TPP changes these messages even slightly, or stops returning them, `vcert` will stop resetting, but it won't break.

**Update 1 Dec 2022:** We have seen a case where TPP 22.04 would return the message "WebSDK CertRequest Module Requested Certificate" even though no prior enrollment existed. That happened when one of the Windows services was down (we think it was the Logging service). We were not able to reproduce this issue, and assume it is a case of "misbehaving TPP" that won't happen in production TPP instances.

**Could we rely solely on the "Click Retry" message?**

Here are the results of attempting to retrieve after requesting when there is an already-failing enrollment:

Case 1 occurs 60% of the time:

```
{"Stage":500,"Status":"This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry."}
{"Stage":500,"Status":"This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry."}
{"Stage":500,"Status":"This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry."}
{"Stage":500,"Status":"This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry."}
{"Stage":500,"Status":"This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry."}
```

Case 2 occurs 30% of the time:

```
{"Stage":500,"Status":"WebSDK CertRequest Module Requested Certificate"}
{"Stage":500,"Status":"This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry."}
{"Stage":500,"Status":"This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry."}
{"Stage":500,"Status":"This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry."}
{"Stage":500,"Status":"This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry."}
```

Case 3 occurs 10% of the time:

```
{"Stage":500,"Status":"WebSDK CertRequest Module Requested Certificate"}
{"Stage":500,"Status":"WebSDK CertRequest Module Requested Certificate"}
{"Stage":500,"Status":"WebSDK CertRequest Module Requested Certificate"}
{"Stage":500,"Status":"WebSDK CertRequest Module Requested Certificate"}
{"Stage":500,"Status":"WebSDK CertRequest Module Requested Certificate"}
```

We cannot rely solely on "Click Retry": we would miss 10% of the cases that we are trying to fix (cf. case 3).

**Can we skip the "WebSDK CertRequest" messages until we see the "Click Retry" message?**

This is not ideal: 10% of the time (case 3), we would end up waiting until the end of the timeout, and not reset.

### Solution 3.1:

### Solution 4 (doesn't work): request, then reset if request returned an error

```python
resp = tpp.request()
if resp is not 200:
    tpp.reset(restart=true)
```

Solution 4 was found not to work. The problem is that `POST /request` always succeeds as long as the given CSR or certificate parameters are valid.
Calling `POST /request` therefore doesn't tell you whether `POST /reset` needs to be called.

<!--
**Self-review:**

- I am unsure whether adding this "reset before requesting" can have side effects (e.g., hit the rate limit of any external service if the policy relies on an adaptable workflow that calls that service, or the problem of "resetting" an enrollment that includes a manual approval).
- Checking for "500" isn't enough I think, that's why I also check that the body corresponds to the expected JSON blob.
- I went with a "mock" HTTP server due to the difficulty of getting the 202 and 500 HTTP status codes "on demand" from TPP.
  - **Live test:** add a "live test" to `connector_test.go`, but (1) `connector_test.go` seems to be made only of smoke tests as opposed to fine-grained tests, and (2) I don't know how to consistently force TPP to return a 500 without RDP'ing into the VM and putting a PowerShell script (it can't be a policy check such as the domain, as this would be a "Stage 0" error, which is the only stage number that gets reset upon requesting a new certificate).
  - **Fake server:** use a fake server to test it, but there is currently no fake server (note that @wallrj started to write a fake server in #262). The fake server, as opposed to the mock server, contains some logic.
  - **Mock responses:** use mock HTTP responses with a mock HTTP server.
-->

# Update: Resetting in RetrieveCertificate was a bad idea

Meeting notes for Feb 7 2023.

Attending: Atanas, Dima, Mael, Tim.

- `RetrieveCertificate` should not perform an operation on the certificate object in TPP: it should only "query" and not "command". Users of VCert will find it unexpected that `RetrieveCertificate` makes a reset call.
- The suggestion is to revert https://github.com/Venafi/vcert/pull/269, and then create a public function `Reset` that does `reset(restart=false)` followed by `renew(worktodotimeout=30s)`. Users of VCert also need a way to know whether they have to reset after receiving an error from `RetrieveCertificate` or `RequestCertificate(WorkToDoTimeout)`.
- We also want to improve VCert so that it watches instead of polling. Watching is possible by using `WorkToDoTimeout=30s` when calling `RetrieveCertificate` and `RequestCertificate`.
- The performance team has reported that the average duration for the issuance of a certificate in TPP is 7 seconds. Using a watch of 30 seconds seems appropriate.
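
The notes ask for a way for users of VCert to know whether a reset is needed after an error. One possible shape is sketched below in Python, reusing the two `Status` messages that Solution 3 keyed on. This is an illustration only: `needs_reset` and `RESET_STATUSES` are hypothetical names, not part of VCert, and the 1 Dec 2022 update in Solution 3 shows that these messages are not a perfectly reliable signal.

```python
# The two Status messages that Solution 3 treated as "a past enrollment failed".
RESET_STATUSES = (
    "WebSDK CertRequest Module Requested Certificate",
    "This certificate cannot be processed while it is in an error state. "
    "Fix any errors, and then click Retry.",
)


def needs_reset(http_code: int, status: str) -> bool:
    """Hypothetical predicate for callers of a retrieve operation.

    True only when the response is an HTTP 500 whose Status is one of the two
    known messages; any other response is left for the caller to handle.
    """
    return http_code == 500 and status in RESET_STATUSES
```

A caller would check `needs_reset` on the error returned by its retrieve call and decide explicitly whether to reset, which keeps the retrieve operation itself free of side effects, as the first bullet requires.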