# Couchbase Transaction Stages
These are written in a pseudo-code form to make them easier to translate into an implementation.
## Building Blocks
These small chunks of logic are reused throughout the algorithm, so DRYing them here.
### DoneCheck
Pre-ExtThreadSafety: If the transaction has been committed or rolled back already, any other operations on this AttemptContext should be rejected with an `Error(ec=FAIL_OTHER, rollback=false)` and a cause of a suitable platform-specific exception to indicate bad state or invalid operation.
ExtQuery: The check is done by checking an `isDone` flag held on the `AttemptContext`.
Else: The check is done by checking the attempt’s state. If it’s NOT_WRITTEN or PENDING, it’s not done. All other states are regarded as Done.
ExtThreadSafety replaces all logic. TODO: specify that logic...
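The non-query check can be sketched as follows (names are illustrative; the state enum values are taken from the attempt states referenced throughout this spec):

```java
class DoneCheckSketch {
    enum AttemptState { NOT_WRITTEN, PENDING, ABORTED, COMMITTED, COMPLETED, ROLLED_BACK }

    // Only NOT_WRITTEN and PENDING allow further operations on the
    // AttemptContext; every other state is regarded as done.
    static boolean isDone(AttemptState state) {
        return state != AttemptState.NOT_WRITTEN && state != AttemptState.PENDING;
    }
}
```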
### SelectATR
If this is the first mutation in a transaction attempt, then select an ATR for the attempt.
If ExtCustomMetadataCollection is supported and the TransactionConfig (or `TransactionOptions` as of ExtSDKIntegration) specifies a custom metadata collection, then use that.
Else, use the default collection of the bucket of the mutated document.
The ATR should be on the same vbucket as the document being mutated.
This is done by first finding the vbucket for the document’s id. The Java implementation for this looks like:
```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public static int vbucketForKey(String id, int numPartitions) {
    CRC32 crc32 = new CRC32();
    crc32.update(id.getBytes(StandardCharsets.UTF_8));
    // Take 15 bits of the CRC32 checksum, skipping the low 16 bits.
    long rv = (crc32.getValue() >> 16) & 0x7fff;
    // numPartitions is assumed to be a power of two (e.g. 1,024).
    return (int) rv & (numPartitions - 1);
}
```
This will return a 0-based vbucket which can be used as an index into the ATR List, detailed below.
ExtCustomMetadataCollection: The vbucket for a key ignores the collection id encoding, so the above vbucket logic should continue to work with non-default collections.
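Putting the pieces together, ATR selection might look like the following sketch. `atrForDocument` is a hypothetical helper, not the actual implementation; the loaded ATR list is assumed to be in vbucket order, one entry per vbucket:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

class AtrSelector {
    // Same CRC32-based vbucket mapping as shown above.
    static int vbucketForKey(String id, int numPartitions) {
        CRC32 crc32 = new CRC32();
        crc32.update(id.getBytes(StandardCharsets.UTF_8));
        long rv = (crc32.getValue() >> 16) & 0x7fff;
        return (int) rv & (numPartitions - 1);
    }

    // Hypothetical helper: selects the ATR key for a document by using the
    // document's vbucket as an index into the loaded ATR list.
    static String atrForDocument(String docId, String[] atrIds) {
        return atrIds[vbucketForKey(docId, atrIds.length)];
    }
}
```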
### ATR List
By default there are 1,024 ATRs, one per vbucket. The Java transactions library (1.0.0+) includes an experimental setting “numATRs” to allow the application to change this. The maximum possible is 20 * 1,024 = 20,480. This is supported by a resource file of ATR ids in the format:
```
// Block 1 of 20
_txn:atr-0-#14
_txn:atr-1-#10b6
_txn:atr-2-#cc8
...
_txn:atr-1023-#10c2
// Block 2 of 20
_txn:atr-0-#5e
_txn:atr-1-#113e
...
```
Each non-comment line is the key of an ATR.
The “-1023-” part of the id identifies the vbucket: document “_txn:atr-1023-#10c2” will exist on vbucket 1023 (the 1024th vbucket, as vbuckets are 0-based).
The resource file is [here](https://github.com/couchbase/couchbase-transactions-java/blob/master/src/main/resources/ATRids.txt) and should be copied into the implementation project. It may either be distributed with the library and loaded at runtime, or converted into implementation code and compiled into the library - implementor's choice.
The implementation does not need to implement the “numATRs” config parameter. If it does not, it is free to only load and use the first block of 1,024 ATRs if desired. This is the default block that will be used by the Java implementation if “numATRs” has not been changed.
Future note: Future releases will likely change the ATR handling, likely putting them in a systems collection when that feature is available, and possibly increasing the default number. The experimental “numATRs” parameter may be formalised, but if so, it should be as a cluster UI option (as all implementations should use the same value).
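A sketch of loading ATR ids from this resource file format, keeping only the first block if the implementation does not support “numATRs” (the parsing rules — skip blank lines and `//` comments — follow from the format shown above):

```java
import java.util.ArrayList;
import java.util.List;

class AtrIdLoader {
    // Parses the ATRids.txt format: lines starting with "//" are comments,
    // every other non-blank line is the key of one ATR.
    static List<String> parseAtrIds(List<String> lines, int limit) {
        List<String> ids = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("//")) continue;
            ids.add(trimmed);
            if (ids.size() == limit) break; // e.g. only the first block of 1,024
        }
        return ids;
    }
}
```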
### InitATRIfNeeded
This takes a parameter `mayNeedToWriteATR: Boolean`, which supports some optimistic concurrency allowing us to avoid a lock if the ATR has already been written.
ExtThreadSafety: it is of course vital that a maximum of one ATR entry is created per attempt.
The algo:
* If !mayNeedToWriteATR, just return as the ATR is already written.
* ExtThreadSafety: [Lock](#Lock).
* Check if the ATR still needs writing, by checking `state` == `NOT_STARTED`. If not, [Unlock](#Unlock) (if ExtThreadSafety) and return.
* [SelectATR](#SelectATR) and [SetATRPending](https://hackmd.io/wE1KU1RKQtONGR8tXbpA1Q#SetATRPending). If that succeeds the `state` is now `PENDING`.
* ExtThreadSafety: [Unlock](#Unlock).
### CheckWriteWriteConflict
This logic checks and handles the case where a document X, previously read inside this transaction (A), is involved in another transaction (B). It takes:
* A `TransactionGetResult gr`.
* A forward compatibility interaction point `ip`.
* ExtThreadSafety: a nullable/optional `StagedMutation existingMutation`, which indicates whether we already have an existing mutation for this document in `stagedMutations`. This resolves some situations where the same document is being written concurrently. This is seen as an application bug, as it will probably fail, and at best it's a race.
The algorithm:
* If `gr` has no transaction [Metadata](#Metadata), it's fine to proceed.
* Else if the transactionId staged in `gr` equals the transactionId of this transaction:
* Can get here after a retry on a FAIL_AMBIGUOUS on a staged replace. Or if there's a concurrent modification of the same document in the same attempt.
* If ExtThreadSafety:
* If the attemptId staged in `gr` equals the attemptId of this attempt:
* If `existingMutation` is present:
* If the CAS of `existingMutation` != the CAS of `gr`: The same document is being concurrently modified by the same attempt (this operation read the document before the concurrent op modified it). Raise `Error(ec = FAIL_CAS_MISMATCH, cause = ConcurrentOperationsDetectedOnSameDocument)`
* Else: The same document is being concurrently modified by the same attempt (we have KV evidence that another op is happening, but it's not in `stagedMutations` yet). Raise `Error(ec = FAIL_CAS_MISMATCH, cause = ConcurrentOperationsDetectedOnSameDocument)`
* If we reach here (ExtThreadSafety or not), it's fine to overwrite. E.g. return success from this method and allow the document's staged data to be overwritten.
* Else there's a write-write conflict.
* There's multiple considerations here:
* [TXNJ-246](https://issues.couchbase.com/browse/TXNJ-246): don't want to retry the transaction too often.
* But, do want to periodically retry to unlock any locked documents, which will resolve some livelock scenarios.
* In the most common case, B is expected to complete fairly quickly, and just needs to be given a little time.
* [TXNJ-86](https://issues.couchbase.com/browse/TXNJ-86): don't want to wait for cleanup of a transaction when we could proceed immediately.
* The algo:
* Do a [ForwardCompatibilityCheck](#ForwardCompatibilityCheck) with any `fc` forward compatibility field in the document's metadata, and interaction point `ip`.
* If the transaction has expired, enter [ExpiryOvertimeMode](#ExpiryOvertimeMode) and raise `Error(ec=FAIL_EXPIRY, raise=TRANSACTION_EXPIRED)`.
* Call hook `beforeCheckATREntryForBlockingDoc`, passing this AttemptContext and the ATR's key.
* Do a lookupIn call to fetch the ATR entry for B, including fetching vattr `$vbucket.HLC`. Use a set [RetryStrategy](#RetryStrategy).
* If the ATR entry does not exist, or the ATR is no longer present, then cleanup has occurred. It is ok to proceed.
* Note this logic is essential to handle the case of a crashed PENDING transaction, which can only remove the ATR entry and has to leave the documents.
* There is a race here where B could have been cleaned up in between us fetching the document and checking the ATR, and the document could have changed - been removed, rolled back, or committed. But this is fine: it will be caught with a FAIL_CAS_MISMATCH as soon as we try to modify the document, and trigger a retry.
* Note the ATR could have been deleted accidentally by the user, in which case a future transaction will likely recreate it. So perhaps the transaction _wasn't_ cleaned up, and could have been in any state when it was deleted. But we have no way of knowing if that occurred, so we need to proceed. This is a case that can only be remedied by having the ATRs siloed off in a system collection. See discussions on [CBD-3705](https://issues.couchbase.com/browse/CBD-3705).
* Else do a [ForwardCompatibilityCheck](#ForwardCompatibilityCheck) with any `fc` forward compatibility field in the ATR entry and interaction point "WW_R".
* If not BF-CBD-3791: Else if the ATR entry has expired then it is ok to proceed. See [CheckHLC](https://hackmd.io/Eaf20XhtRhi8aGEn_xIH8A?both#CheckHLC) for the logic.
* Else if the ATR entry's state is COMPLETED or ROLLED_BACK then it is also ok to proceed, no need to wait for cleanup.
* (It's important not to proceed on COMMITTED or ABORTED states, as that could race with B.)
* Else, retry this algo from the point "The algo:", above, after exponential backoff (50 to 500 millis, which results in ~5 ATR entry fetches per second)
* (ExtUnknownATRStates: this branch is followed if the ATR entry's state is UNKNOWN. Based on the assumption that the future transaction will ultimately get to a state we do know how to handle if we keep polling. The future protocol can use the forward compatibility map if this is incorrect.)
* If a) we've been retrying for more than a hardcoded 1 second, or b) any error `err` occurs during the lookup: raise `Error(ec=FAIL_WRITE_WRITE_CONFLICT, cause=err, retry=true)`
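The backoff and 1-second cap above can be sketched as follows. The exact growth curve is an assumption; the spec only requires exponential backoff within the 50 to 500 millisecond range:

```java
class WriteWriteConflictBackoff {
    static final long MAX_TOTAL_MILLIS = 1000; // hardcoded 1-second cap

    // Exponential backoff between ATR entry fetches: starts at 50ms and
    // doubles per retry, capped at 500ms.
    static long backoffMillis(int retryCount) {
        long delay = 50L << Math.min(retryCount, 4);
        return Math.min(delay, 500L);
    }

    // True once we have been retrying longer than the cap, at which point
    // Error(ec=FAIL_WRITE_WRITE_CONFLICT, retry=true) is raised.
    static boolean exceededCap(long startMillis, long nowMillis) {
        return (nowMillis - startMillis) > MAX_TOTAL_MILLIS;
    }
}
```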
### Transcoder
The user has provided some application data `content`. The type of this is platform-dependent, with the goal being to let them supply anything. Something like `Object`/`Any`/`dynamic`/`T` - whatever makes most sense, and most aligned with other operations in the SDK.
#### Application data
If ExtBinarySupport is implemented:
* If the user has supplied a `Transcoder` as part of the relevant options block, `content` will now be passed through its `encode` method. This is detailed in [Couchbase Transactions API](/cHIcXWZSQOi23qpJJAa8OQ).
* That `Transcoder` will output the content encoded as a byte array, plus an int32 userFlags, as normal. These should be passed along to the next stage of transactions logic.
If ExtBinarySupport is not implemented, or if no `Transcoder` is provided:
The global JsonSerializer from the ClusterEnvironment is used to convert the application's data into a byte array. Pass that to the next stage of transactions logic.
ExtBinarySupport: also pass int32 userFlags with common format flags set indicating it is a JSON document.
#### Non-application data
Whenever not working with the application's data, e.g. with the ATRs or client records, use a default JsonSerializer. This avoids issues if the user has specified a JsonSerializer with unusual behaviour. While this sounds a niche concern, it has been a real-world issue for the Java implementation, where a serializer that added unexpected fields due to a misconfigured Jackson library module caused multiple hard-to-debug problems with the metadata documents.
`T content/value` is used throughout this doc, as in SDK3, to indicate a platform-specific way of passing JSON content.
Internal discussion on transcoding is [here](https://couchbase.slack.com/archives/C014WB8U2MQ/p1628682117040100).
ExtSerializer doesn't change the above behaviour, it simply puts it behind a flag so it can be tested in appropriate versions. ExtBinarySupport also does not change the behaviour of this "Non-application data" section.
### OpRetryDelay
Just to avoid any CPU-locking tight loops from bugs: on retrying an operation, wait a small amount of time. 3 milliseconds seems reasonable.
### OpRetryBackoff
To avoid stressing the server unduly, apply some exponential backoff on retrying the operation. Want to strike a balance between the transaction still completing quickly, and not overwhelming the server. From 1 to 100 milliseconds.
### ExpiryOvertimeMode
See [the expiration section in the main design document](https://hackmd.io/foGjnSSIQmqfks2lXwNp8w#Expiration).
### Durability
Durability is set in the global `TransactionConfig` / `TransactionsConfig`, and may be overridden on a per-transaction basis with `PerTransactionConfig` / `TransactionOptions`.
If a mutation does not specify a durability, assume that it uses the durability from these two.
### Timeouts
A KV timeout is specified in `TransactionConfig`.
If a mutation does not specify a timeout explicitly, assume that it uses the timeout from this.
ExtSDKIntegration: this was removed from the config. The SDK's timeouts should be used. During the 'Rage' work we will be reworking how KV timeouts operate.
### RetryStrategy
Transactions must always use a `BestEffortRetryStrategy`, the protocol is written assuming that is the case. An alternative strategy like FastFail could have any number of undesired knock-on effects - in particular we won't have automatic retrying of ATR contention, which could cause many more transactional attempts being created.
So the transactions library must not rely on what has been configured on the underlying SDK, but instead, create a `BestEffortRetryStrategy` and use it for all SDK operations.
This will be made explicit on each operation throughout these specifications, but if it has been missed anyway, assume the `BestEffortRetryStrategy` should be used. If it is ever not required for a particular operation, that will be made very explicit.
### Upsert-and-remove
The algorithm sometimes does a Sub-Document mutateIn that upserts a field to null, then removes that field.
This idiom means that the operation succeeds regardless of that field existing. It is used to avoid adding additional branching and variables to handle FAIL_PATH_NOT_EXISTS.
This idiom is largely phased out now.
### SaveErrorWrapper
Added to protocol 2.0 under [TXNJ-249](https://issues.couchbase.com/browse/TXNJ-249).
If a pre-commit operation goes wrong and raises an `ErrorWrapper`, then save that to an internal thread-safe list `errors` member of the `AttemptContext`.
ExtThreadSafety replaces this, as it removes `errors`.
### CheckErrors
Before performing any operation, including commit, check if the `errors` member is non-empty. If so, raise an `Error(ec=FAIL_OTHER, cause=PreviousOperationFailed)`.
### ForwardCompatibilityCheck
Refer to the [FC section of the main doc first](https://hackmd.io/foGjnSSIQmqfks2lXwNp8w?view#Forwards-Compatibility).
A sample forward compatibility map could look like this:
```
"fc": {
  "G": [{"e":"SD", "b":"r", "ra":500}],
  "CL_E": [{"p":"3.0", "b":"f"}]
}
```
Given:
* An interaction point `ip`, e.g. "CL_E".
* The protocol version `protocol` supported by this implementation, e.g. "2.0".
* The set of extension codes `supported` supported by this implementation, e.g. {"TI","DC"}.
* A JSON array `fc` from a document or ATR.
The forward compatibility check algo is:
* If `fc` is missing or an empty array, continue.
* Else for each check `c` inside `fc`, where the `c`'s key matches `ip`:
* If `c.p` exists, this is a protocol check.
* If `protocol` >= `c.p`, continue.
* Otherwise, execute the failure behaviour (below).
* Else if `c.e` exists, this is an extension check.
* If `supported` contains `c.e`, continue.
* Otherwise, execute the failure behaviour (below).
* If the check failed, execute the failure behaviour:
* If `c.b` == `r`:
* If `c.ra` is present, first wait for that long in milliseconds.
* Raise an `Error(FAIL_OTHER, cause=ForwardCompatibilityFailure, retry = true)`.
* Else:
* Raise `Error(FAIL_OTHER, cause=ForwardCompatibilityFailure)`.
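The check can be sketched as follows. The `Check` class is a hypothetical in-memory representation of one parsed entry of the `fc` map, and protocol versions are assumed to compare numerically as major.minor:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

class ForwardCompat {
    enum Behaviour { CONTINUE, RETRY, FAIL }

    // One parsed check entry: either a protocol check ("p") or an extension
    // check ("e"), plus the failure behaviour ("b": "r" retry, "f" fail).
    static class Check {
        final String p, e, b;
        Check(String p, String e, String b) { this.p = p; this.e = e; this.b = b; }
    }

    // The algorithm above: run every check registered for interaction point
    // `ip`; on the first failing check, apply its failure behaviour.
    static Behaviour check(String ip, String protocol, Set<String> supported,
                           Map<String, List<Check>> fc) {
        if (fc == null) return Behaviour.CONTINUE;
        List<Check> checks = fc.get(ip);
        if (checks == null) return Behaviour.CONTINUE;
        for (Check c : checks) {
            boolean ok;
            if (c.p != null) ok = compareVersions(protocol, c.p) >= 0;
            else if (c.e != null) ok = supported.contains(c.e);
            else ok = true;
            if (!ok) return "r".equals(c.b) ? Behaviour.RETRY : Behaviour.FAIL;
        }
        return Behaviour.CONTINUE;
    }

    static int compareVersions(String a, String b) {
        String[] as = a.split("\\."), bs = b.split("\\.");
        for (int i = 0; i < Math.max(as.length, bs.length); i++) {
            int ai = i < as.length ? Integer.parseInt(as[i]) : 0;
            int bi = i < bs.length ? Integer.parseInt(bs[i]) : 0;
            if (ai != bi) return Integer.compare(ai, bi);
        }
        return 0;
    }
}
```

Using the sample map above, a protocol-2.0 client without "SD" support retries at "G" (after waiting `ra` millis) and fast-fails at "CL_E".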
### CheckHLC
When checking elapsed times, for example to check if a transaction has expired, this is done based on a `${Mutation.CAS}` field previously written to an ATR entry, being compared with the ATR's current CAS. More accurately, the ATR's CAS _as if it had been written at that point_. This is achieved via functionality added in MB-35388 in Couchbase 6.6, to add a `$vbucket.HLC` vattr (protocol 2.0 has a dependency on Couchbase 6.6).
The algo is, assuming we are reading doc X which has field `y` written previously as a `${Mutation.CAS}`:
* As part of the lookupIn of X, fetch `$vbucket.HLC`. Use a set [RetryStrategy](#RetryStrategy).
* This returns a JSON object. Read the "now" field of this as a string. This is the time since Unix epoch, as seconds, e.g. "1582117992". Convert it to an integer in nanoseconds, e.g. 1582117992000000000, call it `z`. This is X's current 'timestamp' - e.g. what "${Mutation.CAS}" would evaluate to if we were to write the document right now (we used to do such a dummy write in protocol 1, this $vbucket.HLC was added explicitly to avoid that).
* `z` - `y` gives us the elapsed time.
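A minimal sketch of the conversion and subtraction, assuming `y` (the previously written `${Mutation.CAS}` value) has already been parsed into nanoseconds:

```java
class HlcElapsed {
    // Converts the "$vbucket.HLC" "now" field (seconds since Unix epoch, as
    // a string, e.g. "1582117992") into nanoseconds, then computes elapsed
    // time against a previously-stored ${Mutation.CAS} value in nanoseconds.
    static long elapsedNanos(String hlcNowSeconds, long casNanos) {
        long z = Long.parseLong(hlcNowSeconds) * 1_000_000_000L;
        return z - casNanos;
    }
}
```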
### Time Remaining
Calculate the time remaining in this transaction as:
* T1 = Time the overall transaction (not this attempt) started, using the application/client clock, in millis.
* T2 = Current time, using the application/client clock, in millis.
* T_Elapsed = T2 - T1
* TC = Configuration expiration time for overall transaction, in millis. (15000 by default)
* T_Remaining = TC - T_Elapsed
* "exp" = Math.max(Math.min(T_Remaining, TC), 0)
* This bounds the value between 0 and TC. It should always be here anyway, but this protects us against the application/client clock changing mid-transaction.
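The calculation above as a sketch:

```java
class TimeRemaining {
    // Bounded expiry calculation: remaining time in the transaction, clamped
    // to [0, configExpiryMillis] to protect against the application/client
    // clock changing mid-transaction.
    static long expiryMillis(long txnStartMillis, long nowMillis, long configExpiryMillis) {
        long elapsed = nowMillis - txnStartMillis;
        long remaining = configExpiryMillis - elapsed;
        return Math.max(Math.min(remaining, configExpiryMillis), 0);
    }
}
```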
### Unlock
Unlock the AttemptContext mutex.
### Lock
Lock the AttemptContext mutex, using the remaining time in the transaction as a timeout.
If timeout occurs, raise `Error(ec = FAIL_EXPIRY, raise = TRANSACTION_EXPIRED, rollback = false, cause= AttemptExpired)`. As usual this sets state bits.
### LockAndAddKVOp
Perform [Lock](#Lock).
Add one operation to the `kvOps` wait group.
### UnlockOnError
An error has occurred. Unlock the AttemptContext mutex.
See details of the mutex [here](https://hackmd.io/foGjnSSIQmqfks2lXwNp8w?view#AttemptContext), particularly regarding the double-unlock behaviour.
### KvOpInit
ExtThreadSafety slightly adjusts the order of the basic checks for all KV operations so that a) they are amenable to refactoring this shared logic and b) queryMode can be checked under lock.
This takes a passed `stageName: String`
The algo:
* [LockAndAddKVOp](#LockAndAddKVOp).
* Perform [DoneCheck](#DoneCheck).
* If transaction has expired (using `stageName` for the FIT expiry insertion), set [ExpiryOvertimeMode](#ExpiryOvertimeMode) then raise an `Error(FAIL_EXPIRY, raise=TRANSACTION_EXPIRED)`.
### WaitForKVAndLock
This is waiting for all current KV operations to finish (including those initiated in `queryMode`), and then locking so that further KV operations will block waiting for the lock.
The algo:
* Wait the `kvOps` wait-group. Use a timeout equal to the remaining time in the transaction. If timeout occurs, raise `Error(ec = FAIL_EXPIRY, raise = TRANSACTION_EXPIRED, rollback = false, cause= AttemptExpired)`. As usual this sets state bits.
* [Lock](#Lock).
* Check `kvOps` again to see if any new KV ops arrived in between those last two operations. If so:
* [Unlock](#Unlock).
* Recursively repeat [WaitForKVAndLock](#WaitForKVAndLock).
### SupportsReplaceBodyWithXattr
Added in ExtReplaceBodyWithXattr.
Takes a `bucketName: String`.
Wait for the bucket config for `bucketName` to be loaded. Use the remaining time of the transaction as the timeout for this. It's undefined what error should be raised by the SDK if this times out. This spec expects it to raise an error that will be classified as FAIL_OTHER. In the methods that call this one, that will cause the transaction to fast-fail. If your SDK does not raise such an error, then convert to one that does (e.g. TimeoutException).
Then return true iff the bucket capabilities contains "subdoc.ReviveDocument" (added in 7.1).
(Do NOT use "subdoc.ReplaceBodyWithXattr", added in 7.0. This version of the feature contained a significant bug.)
### Sub-document Binary Flag
This section added with ExtBinarySupport.
Server 7.6.2 adds the ability to read and write a sub-document spec as binary.
This is done using the same spec-level field as used to indicate the spec is an xattr. The SDK needs to OR this field with 0x20 to specify binary. It does this on both the mutateIn and lookupIn paths.
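As a sketch of the flag combination: the binary bit 0x20 is from the text above, while the xattr bit value 0x04 is an assumption based on the KV protocol's sub-document path flags:

```java
class SubdocFlags {
    static final byte SUBDOC_FLAG_XATTR = 0x04;  // existing xattr spec-level flag (assumed value)
    static final byte SUBDOC_FLAG_BINARY = 0x20; // binary value flag, usable on 7.6.2+

    // A binary staged-content spec sets both the xattr and binary bits.
    static byte specFlags(boolean xattr, boolean binary) {
        byte flags = 0;
        if (xattr) flags |= SUBDOC_FLAG_XATTR;
        if (binary) flags |= SUBDOC_FLAG_BINARY;
        return flags;
    }
}
```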
#### mutateIn Compatibility
If the negotiated HELLO flags do not contain SubdocBinaryXattr (0x21), and the spec specifies that this binary flag should be sent, then instead the transaction will raise `Error(cause = FeatureNotAvailable())`, with the FeatureNotAvailable's description indicating that binary documents are only supported from 7.6.2 on.
#### lookupIn Compatibility
LookupIn is trickier. We do not know that the user wants to read a binary document, as this is not how the `Transcoder` interface works. And we have no way of knowing pre-fetch if the document has JSON or binary data staged.
So, we need to read both `txn.op.bin` (with the binary spec-level flag set) and `txn.op.stgd` (without the flag).
However - the binary spec-level flag, when sent to a pre-7.6.2 server, will lead to the operation (not individual spec) failing with INVALID_REQUEST.
This is difficult to deal with, as in many SDKs there is no real option to conditionally include the `txn.op.bin` spec based on the SubdocBinaryXattr HELLO flag, since the request is often being constructed at a level far removed from the connection-level HELLO flags. (And no cluster or bucket cap is available for this feature, as it is being released in a patch release).
The only reasonable solution is to silently discard the binary spec-level flag on the lookupIn path, when encoding the server request, if the SubdocBinaryXattr HELLO flag is not present.
For transactions, this causes no issues. The `txn.op.bin` field will not exist anyway, as this server does not support creating that binary field. And the transactions logic of course already needs to support `txn.op.bin` not being present, as that is the general case.
For standard KV lookupIn, it would cause issues. But binary support is not being exposed in that API, at least at the time of writing, and certainly not in this spec. If it ever is exposed, then internally the SDK will need to use different logic for that path, and not silently discard the binary flag on it.
### Removing Binary Field
This section added with ExtBinarySupport.
We now use `txn.op.stgd` for JSON data, and `txn.op.bin` for binary data. It's one of the axioms of the extension that only one exists in the transaction metadata at any time.
So when writing either field, it's important we remove the other first. This handles cases where the same document is being written twice in the same transaction.
We upsert the field before removing it, as generally it will not be there, and this initial upsert step allows the sub-doc operation to succeed regardless. It would be possible to only do the upsert when required, but at a cost of additional branches and complexity.
### Staging Binary Data
Assumptions: ExtBinarySupport is supported, and this section is being referenced from logic that is performing a Sub-Document mutateIn, and needs to add specs to handle writing staged content (which could be JSON or binary).
Note we upsert & remove `txn.op.stgd` or `txn.op.bin` here. See [Removing Binary Field](#Removing-Binary-Field) for why.
We have in scope:
* `content`: a transcoded or serialized array of bytes.
* `stagedUserFlags`: int32.
All sub-doc specs below need to set the xattr spec-level flag.
The algo:
* If the common flags of `stagedUserFlags` indicates it's binary:
* Add upsert spec "txn.op.stgd" = `null` (4 bytes)
* Add remove spec "txn.op.stgd"
* Add upsert spec "txn.op.bin" = the transcoded bytes.
* The SDK must set the [Sub-Document binary flag](#Sub-document-Binary-Flag) for this spec.
* Else if the common flags of `stagedUserFlags` indicates it's JSON:
* Add upsert spec "txn.op.bin" = `null` (4 bytes)
* Add remove spec "txn.op.bin"
* Add upsert spec "txn.op.stgd" = the transcoded bytes.
* Else:
* Raise `Error(FAIL_OTHER, EncodingFailureException())`.
The `userFlags` of the document also need to be set. Here we are explicitly _not_ using the passed `stagedUserFlags` - these correspond to the user's post-transaction content, and must only be set when we commit the document. We must use the existing user flags of the document.
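The branch on `stagedUserFlags` above relies on the SDK common format flags, which live in the top byte of the 32-bit flags field. A sketch of the check — the bit layout and the format values (2 = JSON, 3 = binary) are assumptions based on the SDK common-flags convention, not defined by this spec:

```java
class CommonFlags {
    private static final int FORMAT_JSON = 2;
    private static final int FORMAT_BINARY = 3;

    static boolean isJson(int userFlags)   { return format(userFlags) == FORMAT_JSON; }
    static boolean isBinary(int userFlags) { return format(userFlags) == FORMAT_BINARY; }

    // Common format lives in the upper byte; mask off any compression bits.
    private static int format(int userFlags) { return (userFlags >>> 24) & 0x0F; }
}
```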
### Forward Compatibility Map for ExtBinarySupport
ExtBinarySupport writes binary fields that older clients have no way of interpreting or interacting with.
So, those older clients are blocked, using the [forward compatibility map](https://hackmd.io/bJzZMt3ASfe8b7309lzK1A?view#Forwards-Compatibility).
This is only written if the common flags of the staged user flags indicate that the content is binary. So regular JSON documents incur no extra metadata.
This is written into the per-document transactional metadata:
```
"fc": {
"CL_E": [{"b": "f", "e": "BS"}],
"G": [{"b": "f", "e": "BS"}],
"WW_I": [{"b": "f", "e": "BS"}],
"WW_IG": [{"b": "f", "e": "BS"}]
}
```
This blocks older clients (those that do not support the ExtBinarySupport "BS" extension) from interacting with the document at most stages. They will fast-fail ("f").
It is possible to make them retry instead, but even if this blocking transaction has committed by then, the document is likely to be binary and so the blocked transaction will continue to have no way of making progress. So, we make them fast-fail instead.
Note we don't block removes and replaces directly ("WW_RM" and "WW_RP"), as they will be blocked at the "G" point. A future SDK could conceivably support get-less removes and replaces - but will also support ExtBinarySupport.
## Replaces
The algo:
* ExtThreadSafety:
* [KVOpInit](#KVOpInit), passing stage name "replace".
* Generate `operationId`, a UUID4 that's a unique identifier for this operation.
* ExtQuery: If in `queryMode`, instead of doing any of the following, perform the logic in the [query spec](https://hackmd.io/eMPHhhd9SJqjO3s9dT9p7A) instead.
* Else (!ExtThreadSafety):
* ExtQuery: If in `queryMode`, instead of doing any of the following, perform the logic in the [query spec](https://hackmd.io/eMPHhhd9SJqjO3s9dT9p7A) instead.
* Perform [DoneCheck](#DoneCheck).
* Perform [CheckErrors](#CheckErrors).
* `mayNeedToWriteATR` = `state` == `AttemptStates.NOT_STARTED`
* Check if this document already exists in `stagedMutations` and store it in `existingDoc: StagedMutation` if so.
* ExtThreadSafety:
* Call debug hook `beforeUnlockReplace`, passing this AttemptContext and the document's key.
* [Unlock](#Unlock).
* ExtAllKVCombinations: If `existingDoc`, and it is a REMOVE, raise an `Error(ec=FAIL_DOC_NOT_FOUND,cause=DocumentNotFoundException)`.
* If !ExtThreadSafety: If transaction has expired (use stage name "replace"), set [ExpiryOvertimeMode](#ExpiryOvertimeMode) then raise an `Error(FAIL_EXPIRY, raise=TRANSACTION_EXPIRED)`.
* [CheckWriteWriteConflict](#CheckWriteWriteConflict) with `ip`="WW_RP", passing `existingDoc` if ExtThreadSafety.
* [InitATRIfNeeded](#InitATRIfNeeded), passing `mayNeedToWriteATR`.
* If ExtAllKVCombinations and `existingDoc` and it is an INSERT:
* Create a staged insert, following the logic under 'Creating Staged Inserts (Protocol 2.0 version)' -> Algorithm, passing `operationId` if ExtThreadSafety, and the CAS from `existingDoc` (which will make it replace the existing tombstone).
* Future improvement: this will hit the 'doc already exists' logic, then proceed to overwrite it. It's possible to do this more efficiently in one operation.
* Else:
* Create staged replace (see below), passing `operationId` if ExtThreadSafety.
* ExtThreadSafety:
* If any error propagates after locking, [UnlockOnError](#UnlockOnError).
* Whatever happens (success or error), the final step before returning from ctx.replace() is [RemoveKVOp](#RemoveKVOp).
### Creating Staged Replace
This method takes these parameters:
* `TransactionGetResult doc`
* `T content`. See [Transcoding](#Transcoder) for details of `T`.
* Protocol 2.0 version: `boolean accessDeleted`. It will be true iff `doc` was a tombstone (as determined by `doc.links().isDeleted()`).
* ExtThreadSafety: `String operationId`
The algorithm:
* Call debug hook `beforeStagedReplace`, passing this AttemptContext and the document's key.
* There is no expiry check here - there is no need, as replace operations are not retried.
* Transcode `content` to bytes. See [Transcoding](#Transcoder).
* ExtBinarySupport: transcoding will now also output `userFlags` int32.
* Perform the mutation:
* This is done with a Sub-Document mutateIn operation, using the CAS of `doc`, with default StoreSemantics (e.g. REPLACE). Durability and timeouts are set as in [Durability](#Durability) and [Timeouts](#Timeouts). Use a set [RetryStrategy](#RetryStrategy).
* Protocol 2.0 version: Set `accessDeleted()` flag if `accessDeleted` is true.
* If ExtAllKVCombinations: Write these fields. All are written as xattrs and are upsert specs.
* "txn", a JSON blob with these contents, and written with createPath():
* "txn.id.txn" = the transactionId.
* "txn.id.atmpt" = the attemptId.
* ExtThreadSafety: "txn.id.op" = the `operationId`.
* "txn.atr.id" = the ATR's key.
* "txn.atr.bkt" = the ATR's bucket's name.
* "txn.atr.scp" = the ATR's scope's name
* "txn.atr.coll" = the ATR's collection's name
* "txn.op.type" = "replace"
* If `dm` is `doc`'s documentMetadata(), and it is present:
* If `dm.cas()` is present -> "txn.restore.CAS" = `dm.cas()`
* If `dm.revid()` is present -> "txn.restore.revid" = `dm.revid()`
* If `dm.exptime()` is present -> "txn.restore.exptime" = `dm.exptime()`
* ExtBinarySupport: add the [Forward Compatibility Map for ExtBinarySupport](#Forward-Compatibility-Map-for-ExtBinarySupport).
* If ExtBinarySupport supported:
* See [Staging Binary Data](#Staging-Binary-Data)
* Else (no ExtBinarySupport support):
* "txn.op.stgd" = the serialized bytes.
* "txn.op.crc32" = MutateInMacro.VALUE_CRC_32C
* Else (not ExtAllKVCombinations): Write these fields. All are written as xattrs and are upsert specs.
* ExtTransactionId version: "txn.id.txn" = the transactionId. Set createPath() on this.
* "txn.id.atmpt" = the attemptId. Set createPath() on this.
* "txn.atr.id" = the ATR's key. Set createPath() on this.
* "txn.atr.bkt" = the ATR's bucket's name.
* Protocol 2
* "txn.atr.scp" = the ATR's scope's name
* "txn.atr.coll" = the ATR's collection's name
* Protocol 1
* "txn.atr.coll" = the ATR's scope's name, followed by ".", followed by the ATR's collection's name. E.g. "scope_name.collection_name".
* "txn.op.type" = "replace"
* If `dm` is `doc`'s documentMetadata(), and it is present:
* If `dm.cas()` is present -> "txn.restore.CAS" = `dm.cas()`
* If `dm.revid()` is present -> "txn.restore.revid" = `dm.revid()`
* If `dm.exptime()` is present -> "txn.restore.exptime" = `dm.exptime()`
* If ExtBinarySupport supported:
* See [Staging Binary Data](#Staging-Binary-Data)
* Else (no ExtBinarySupport support):
* "txn.op.stgd" = the serialized bytes.
* "txn.op.crc32" = MutateInMacro.VALUE_CRC_32C
* On success call debug hook `afterStagedReplaceComplete`, passing this AttemptContext and the document's key.
* On success:
* ExtThreadSafety: [Lock](#Lock).
* Add replaced doc to `stagedMutations` list, with its new CAS, `StagedMutationType`=`REPLACE`, and `operationId` if ExtThreadSafety.
* ExtReplaceBodyWithXattr: before this, call and wait for [SupportsReplaceBodyWithXattr](#SupportsReplaceBodyWithXattr). Based on the result, decide whether to store the staged contents in the created `StagedMutation`.
* ExtAllKVCombinations:
* This must replace any existing operation in `stagedMutations` on the same document.
* Else:
* If doc is already in `stagedMutations` as a REPLACE, then overwrite it.
* If doc is already in `stagedMutations` as an INSERT, then remove that, and add this op as a new INSERT.
* ExtThreadSafety: [Unlock](#Unlock).
* Return created doc as a TransactionGetResult
* On error err (from any of the preceding items in this section), classify as error class ec then:
* (Note there is no [ExpiryOvertimeMode](#ExpiryOvertimeMode) handling here, as we can't get here in that mode.)
* `FAIL_EXPIRY` -> set [ExpiryOvertimeMode](#ExpiryOvertimeMode) and raise `Error(ec, AttemptExpired(err), raise=TRANSACTION_EXPIRED)`
* Else `FAIL_DOC_NOT_FOUND || FAIL_CAS_MISMATCH` -> Doc was modified in-between get and replace. `Error(ec, err, retry=true)` Note we retry on FAIL_DOC_NOT_FOUND because user may be using getOptional.
* Else `FAIL_TRANSIENT || FAIL_AMBIGUOUS` -> `Error(ec, err, retry=true)`
* Else `FAIL_HARD` -> `Error(ec, err, rollback=false)`
* Else -> `Error(ec, err)`
* On any item above that raises an `Error`, before doing so, [SaveErrorWrapper](#SaveErrorWrapper).
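The error-classification branches above can be condensed into a small decision table. The following is an illustrative sketch only: the `ErrorClass` values mirror the spec's names, but `TxnError` and `classify` are hypothetical helpers, not SDK API.

```java
// Hypothetical sketch of the staged-replace error classification.
class StagedReplaceErrors {
    enum ErrorClass { FAIL_EXPIRY, FAIL_DOC_NOT_FOUND, FAIL_CAS_MISMATCH,
                      FAIL_TRANSIENT, FAIL_AMBIGUOUS, FAIL_HARD, FAIL_OTHER }

    // The (retry, rollback) flags attached to the raised Error.
    static final class TxnError {
        final ErrorClass ec;
        final boolean retry;
        final boolean rollback;
        TxnError(ErrorClass ec, boolean retry, boolean rollback) {
            this.ec = ec; this.retry = retry; this.rollback = rollback;
        }
    }

    static TxnError classify(ErrorClass ec) {
        switch (ec) {
            case FAIL_EXPIRY:
                // Also sets ExpiryOvertimeMode, raise=TRANSACTION_EXPIRED.
                return new TxnError(ec, false, true);
            case FAIL_DOC_NOT_FOUND:   // doc modified between get and replace
            case FAIL_CAS_MISMATCH:
            case FAIL_TRANSIENT:
            case FAIL_AMBIGUOUS:
                return new TxnError(ec, true, true);   // retry=true
            case FAIL_HARD:
                return new TxnError(ec, false, false); // rollback=false
            default:
                return new TxnError(ec, false, true);  // Error(ec, err)
        }
    }
}
```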
## Removes
(These are very similar to [Replaces](#Replaces).)
* ExtThreadSafety:
* [KVOpInit](#KVOpInit), passing stage name "remove".
* Generate `operationId`, a UUID4 that's a unique identifier for this operation.
* ExtQuery: If in `queryMode`, instead of doing any of the following, perform the logic in the [query spec](https://hackmd.io/eMPHhhd9SJqjO3s9dT9p7A) instead.
* Else (!ExtThreadSafety):
* ExtQuery: If in `queryMode`, instead of doing any of the following, perform the logic in the [query spec](https://hackmd.io/eMPHhhd9SJqjO3s9dT9p7A) instead.
* Perform [DoneCheck](#DoneCheck)
* Perform [CheckErrors](#CheckErrors).
* If transaction has expired (use stage name "remove"), set [ExpiryOvertimeMode](#ExpiryOvertimeMode) then raise an `Error(FAIL_EXPIRY, raise=TRANSACTION_EXPIRED)`.
* `mayNeedToWriteATR` = `state` == `AttemptStates.NOT_STARTED`
* Check if this document already exists in `stagedMutations` and store in `existingDoc` if so.
* ExtThreadSafety:
* Call debug hook `beforeUnlockRemove`, passing this AttemptContext and the document's key.
* [Unlock](#Unlock).
* If `existingDoc`:
* ExtAllKVCombinations:
* If it is a REMOVE, raise a `Error(ec=FAIL_DOC_NOT_FOUND,cause=DocumentNotFoundException)`.
* Else if it is an INSERT:
* Follow the [Remove staged insert](#Remove-staged-insert) logic.
* If this raises an `err: TransactionOperationFailed`, add `err` to `errors`.
* Do not execute the rest of this block - e.g. do not do the WWC check, do not create a staged remove. (If ExtThreadSafety then do perform the unlock and KV ops steps).
* Else (ExtAllKVCombinations not supported):
* If it is an INSERT, raise Error(FAIL_OTHER, cause=IllegalStateException [or platform-specific equivalent]).
* [CheckWriteWriteConflict](#CheckWriteWriteConflict) with `ip`="WW_RM", passing `existingDoc` if ExtThreadSafety.
* [InitATRIfNeeded](#InitATRIfNeeded), passing `mayNeedToWriteATR`.
* Create staged remove (see below), passing `operationId` if ExtThreadSafety.
* ExtThreadSafety:
* If any error propagates after locking, [UnlockOnError](#UnlockOnError).
* Whatever happens (success or error), the final step before returning from ctx.remove() is [RemoveKVOp](#RemoveKVOp).
### Creating Staged Remove
This method takes these parameters:
* `TransactionGetResult doc`
* Protocol 2.0 version: `boolean accessDeleted`. It will be true iff `doc` was a tombstone (as determined by `doc.links().isDeleted()`).
* ExtThreadSafety: `String operationId`
The algo:
* Call debug hook `beforeStagedRemove`, passing this AttemptContext and the document's key.
* There is no expiry check here - there is no need, as remove operations are not retried.
* Perform the mutation:
* This is done with a Sub-Document mutateIn operation, using the CAS of `doc`, with default StoreSemantics (e.g. REPLACE). Durability and timeouts are set as in [Durability](#Durability) and [Timeouts](#Timeouts). Use a set [RetryStrategy](#RetryStrategy).
* Protocol 2.0 version: Set `accessDeleted()` flag if `accessDeleted` is true.
* If ExtAllKVCombinations: Write these fields. All are written as xattrs and are upsert specs.
* "txn", a JSON blob written with createPath() and containing these:
* "txn.id.txn" = the transactionId.
* "txn.id.atmpt" = the attemptId.
* ExtThreadSafety: "txn.id.op" = the `operationId`.
* "txn.atr.id" = the ATR's key.
* "txn.atr.bkt" = the ATR's bucket's name.
* "txn.atr.scp" = the ATR's scope's name
* "txn.atr.coll" = the ATR's collection's name
* "txn.op.type" = "remove"
* If `dm` is `doc`'s documentMetadata(), and it is present:
* If `dm.cas()` is present -> "txn.restore.CAS" = `dm.cas()`
* If `dm.revid()` is present -> "txn.restore.revid" = `dm.revid()`
* If `dm.exptime()` is present -> "txn.restore.exptime" = `dm.exptime()`
* "txn.op.crc32" = MutateInMacro.VALUE_CRC_32C
* Else (not ExtAllKVCombinations): Write these fields. All are written as xattrs and are upsert specs.
* ExtTransactionId version: "txn.id.txn" = the transactionId. Set createPath() on this.
* "txn.id.atmpt" = the attemptId. Set createPath() on this.
* "txn.atr.id" = the ATR's key. Set createPath() on this.
* "txn.atr.bkt" = the ATR's bucket's name.
* Protocol 2
* "txn.atr.scp" = the ATR's scope's name
* "txn.atr.coll" = the ATR's collection's name
* Protocol 1
* "txn.atr.coll" = the ATR's scope's name, followed by ".", followed by the ATR's collection's name. E.g. "scope_name.collection_name".
* "txn.op.stgd" = "\<\<REMOVE\>\>"
* "txn.op.type" = "remove"
* If `dm` is `doc`'s documentMetadata(), and it is present:
* If `dm.cas()` is present -> "txn.restore.CAS" = `dm.cas()`
* If `dm.revid()` is present -> "txn.restore.revid" = `dm.revid()`
* If `dm.exptime()` is present -> "txn.restore.exptime" = `dm.exptime()`
* "txn.op.crc32" = MutateInMacro.VALUE_CRC_32C
* On success call debug hook `afterStagedRemoveComplete`, passing this AttemptContext and the document's key.
* On error err (from any of the preceding items in this section), classify as error class ec then:
* (Note there is no [ExpiryOvertimeMode](#ExpiryOvertimeMode) handling here, as we can't get here in that mode.)
* `FAIL_EXPIRY` -> set [ExpiryOvertimeMode](#ExpiryOvertimeMode) and raise `Error(ec, AttemptExpired(err), raise=TRANSACTION_EXPIRED)`
* Else `FAIL_DOC_NOT_FOUND || FAIL_CAS_MISMATCH` -> Doc was modified in-between get and remove. `Error(ec, err, retry=true)`. Note we retry on FAIL_DOC_NOT_FOUND because user may be using getOptional.
* Else `FAIL_TRANSIENT || FAIL_AMBIGUOUS` -> `Error(ec, err, retry=true)`
* Else `FAIL_HARD` -> `Error(ec, err, rollback=false)`
* Else -> `Error(ec, err)`
* On any item above that raises an `Error`, before doing so, [SaveErrorWrapper](#SaveErrorWrapper).
* On success
* ExtThreadSafety: [Lock](#Lock).
* Add removed doc to `stagedMutations` list, with its new CAS, `StagedMutationType`=`REMOVE`, and `operationId` if ExtThreadSafety. This must replace any existing operation in `stagedMutations` on the same document.
* Protocol 1: If doc is already in `stagedMutations` list as an INSERT, throw `IllegalStateException` or equivalent. For reasons (TXNJ-35) this is challenging to support. (In protocol 2 this is moved to an earlier step).
* ExtThreadSafety: [Unlock](#Unlock).
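As a rough sketch, the non-ExtAllKVCombinations, protocol-2 fields staged by a remove could be assembled as below. The path names mirror the spec; `StagedRemoveFields.build` and its parameters are hypothetical, and the crc32 value here stands in for the server-side `MutateInMacro.VALUE_CRC_32C` macro, which a real implementation would pass via the SDK's macro support rather than as a literal string.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative map of xattr path -> value for a staged remove (protocol 2,
// non-ExtAllKVCombinations branch). Not a real mutateIn spec builder.
class StagedRemoveFields {
    static Map<String, String> build(String txnId, String attemptId,
                                     String atrId, String atrBucket,
                                     String atrScope, String atrColl) {
        Map<String, String> specs = new LinkedHashMap<>();
        specs.put("txn.id.txn", txnId);        // createPath()
        specs.put("txn.id.atmpt", attemptId);  // createPath()
        specs.put("txn.atr.id", atrId);        // createPath()
        specs.put("txn.atr.bkt", atrBucket);
        specs.put("txn.atr.scp", atrScope);    // protocol 2
        specs.put("txn.atr.coll", atrColl);    // protocol 2
        specs.put("txn.op.stgd", "<<REMOVE>>");
        specs.put("txn.op.type", "remove");
        // Placeholder for the MutateInMacro.VALUE_CRC_32C macro expansion.
        specs.put("txn.op.crc32", "${Mutation.value_crc32c}");
        return specs;
    }
}
```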
### Remove staged insert
This is called when a document that was previously inserted in this transaction is now being removed. It takes a `doc: TransactionGetResult`.
**Initial**
* If transaction has expired in stage "removeStagedInsert", raise `Error(ec=FAIL_EXPIRY, exception=TRANSACTION_EXPIRED, rollback=false, cause=AttemptExpired)`. (There is deliberately no ExpiryOvertimeMode, as this is being phased out of new code.)
**The algorithm**
* Call debug hook `beforeRemoveStagedInsert`, passing doc.id().
* Perform a mutateIn on doc.id(), passing doc.cas(), with a single MutateInSpec.remove removing the `txn` xattr. Durability and timeouts are set as in [Durability](#Durability) and [Timeouts](#Timeouts). Use a set [RetryStrategy](#RetryStrategy). Set `accessDeleted` flag.
* Call debug hook `afterRemoveStagedInsert`, passing doc.id().
* If everything was successful:
* Set `doc`'s CAS to be what was returned from the mutateIn.
* Remove `doc` from `stagedMutations`.
**Error handling**
On error `err` of any of the above under 'The algorithm', classify as ErrorClass `ec`:
* If `ec` == `TRANSACTION_OPERATION_FAILED`: Reraise `err`.
* Else if `ec` == `FAIL_HARD`: Raise `Error(ec, rollback=false, cause=err)`.
* Else: Raise `Error(ec, retry=true, cause=err)`.
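The three error branches above map onto a simple classification, sketched here with hypothetical names (`Raised` and its flags are illustrative, not SDK types):

```java
// Hypothetical sketch of the remove-staged-insert error classification.
class RemoveStagedInsertErrors {
    enum ErrorClass { TRANSACTION_OPERATION_FAILED, FAIL_HARD, FAIL_TRANSIENT, FAIL_OTHER }

    static final class Raised {
        final boolean reraise;   // propagate the original TransactionOperationFailed
        final boolean retry;
        final boolean rollback;
        Raised(boolean reraise, boolean retry, boolean rollback) {
            this.reraise = reraise; this.retry = retry; this.rollback = rollback;
        }
    }

    static Raised classify(ErrorClass ec) {
        if (ec == ErrorClass.TRANSACTION_OPERATION_FAILED) {
            return new Raised(true, false, true);   // reraise err as-is
        }
        if (ec == ErrorClass.FAIL_HARD) {
            return new Raised(false, false, false); // Error(ec, rollback=false)
        }
        return new Raised(false, true, true);       // Error(ec, retry=true)
    }
}
```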
## Inserts
* ExtThreadSafety:
* [KVOpInit](#KVOpInit), passing stage name "insert".
* Generate `operationId`, a UUID4 that's a unique identifier for this operation.
* ExtQuery: If in `queryMode`, instead of doing any of the following, perform the logic in the [query spec](https://hackmd.io/eMPHhhd9SJqjO3s9dT9p7A) instead.
* Else (!ExtThreadSafety):
* ExtQuery: If in `queryMode`, instead of doing any of the following, perform the logic in the [query spec](https://hackmd.io/eMPHhhd9SJqjO3s9dT9p7A) instead.
* Perform [DoneCheck](#DoneCheck)
* Perform [CheckErrors](#CheckErrors).
* `mayNeedToWriteATR` = `state` == `AttemptStates.NOT_STARTED`
* Check if this document already exists in `stagedMutations` and store in `existingDoc` if so. Also if so:
* ExtAllKVCombinations:
* If it is an INSERT or REPLACE, raise a `DocumentExistsException`.
* (Handling of REMOVEs is elsewhere.)
* Else (ExtAllKVCombinations not supported):
* Raise `Error(FAIL_OTHER, cause=IllegalStateException [or platform-specific equivalent])`.
* If !ExtThreadSafety: if the transaction has expired (use stage name "insert"), set [ExpiryOvertimeMode](#ExpiryOvertimeMode) then raise an `Error(FAIL_EXPIRY, raise=TRANSACTION_EXPIRED)`.
* If this is the first mutation (`state` == `AttemptStates.NOT_STARTED`), [InitATRIfNeeded](#InitATRIfNeeded), passing true.
* ExtThreadSafety:
* Call debug hook `beforeUnlockInsert`, passing this AttemptContext and the document's key.
* [Unlock](#Unlock).
* ExtAllKVCombinations:
* If `existingDoc` and it is a REMOVE `sm`:
* Perform this as a replace rather than an insert, by calling the [Creating Staged Replace](#Creating-Staged-Replace) logic instead of the "Create staged insert" logic, passing `operationId` if ExtThreadSafety.
* This logic takes `doc` and `content` params. Pass the doc from `sm` to ensure the correct CAS is being used, and the `content` passed into this insert.
* Create staged insert (see below), passing `operationId` if ExtThreadSafety.
* ExtThreadSafety:
* If any error propagates after locking, [UnlockOnError](#UnlockOnError).
* Whatever happens (success or error), the final step before returning from ctx.insert() is [RemoveKVOp](#RemoveKVOp).
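Under ExtAllKVCombinations, the handling of an existing staged mutation during insert reduces to a small decision, sketched below with hypothetical names (`StagedType`, `Action`, and `forExisting` are illustrative, not SDK API):

```java
// Hypothetical decision for an insert that finds an existing entry in
// stagedMutations (ExtAllKVCombinations branch).
class InsertExistingDocCheck {
    enum StagedType { INSERT, REPLACE, REMOVE }
    enum Action { RAISE_DOC_EXISTS, STAGE_AS_REPLACE, STAGE_AS_INSERT }

    static Action forExisting(StagedType existing) {
        if (existing == null) {
            return Action.STAGE_AS_INSERT;      // no prior op on this doc
        }
        switch (existing) {
            case INSERT:
            case REPLACE:
                return Action.RAISE_DOC_EXISTS; // DocumentExistsException
            case REMOVE:
            default:
                // Reuse the Creating Staged Replace logic, passing the doc
                // from the staged REMOVE so the correct CAS is used.
                return Action.STAGE_AS_REPLACE;
        }
    }
}
```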
### Creating Staged Inserts
This is the protocol 2.x version. The protocol 1.x version has been [moved](https://hackmd.io/g3PWQrHHRiew0nGF6iFgiA).
This is the version to be used by 2.0 implementations. Here, staged inserts are created as "shadow documents" (tombstones), that are invisible to any actor that is not explicitly looking for them.
As Couchbase Server 6.6 is now a hard requirement for 2.0, there is no need to support falling back to creating non-tombstone versions of inserts.
#### Shadow Documents vs Tombstones
Technically, a tombstone is created, containing user xattrs that contain the transactional metadata.
The name "shadow documents" is preferred for this, as it better encapsulates that this is a 'real' document that is simply hidden (or "shadowed").
#### Dependencies
It requires Couchbase Server 6.6 or above.
There is a level of co-operation required with the underlying SDK as well. The SDK will negotiate this feature with the cluster and memcached. If the feature was not negotiated, and the transactions library requests a shadow document be created, the SDK should raise a FeatureNotFoundException. This is required because the cluster must be fully upgraded to 6.6 before any shadow documents are created, due to an issue with pre-6.6 downstream DCP consumers silently dropping any user xattrs from tombstones.
The required SDK work is specified separately in [SDK-RFC 65](https://docs.google.com/document/d/1QccFEvHWEL2-ldS_aTfjphYJGB4YVmMKrMB-UyL2KFI/edit#).
Hence, there are two dependencies for this feature:
* Couchbase Server 6.6 or above. (As 6.6 is a hard requirement for 2.0, there is no fallback to the older method of staging inserts in empty documents.)
* An SDK that a) supports the CreateAsDeleted flag and b) raises FeatureNotFound if that feature was not present on all nodes of the cluster.
#### Algorithm
This method takes these parameters:
* `Collection collection`
* `String id`
* `T content`. See [Transcoding](#Transcoder) for details of `T`.
* `Optional[Long] cas`. If set, the 'insert' is actually done as a replace of an existing tombstone, with CAS. Default is empty.
* ExtThreadSafety: `String operationId`
Algorithm:
* Call debug hook `beforeStagedInsert`, passing this AttemptContext and the document's key.
* If not in ExpiryOvertimeMode: if transaction has expired (stage name is "createdStageInsert"), raise an AttemptExpired, which will be handled by the error block in this section. We require an expiry check here (unlike with replaces and removes), because this operation can be retried in some cases.
* Transcode `content` to bytes. See [Transcoding](#Transcoder).
* Perform the mutation:
* This is done with a Sub-Document mutateIn operation. Durability and timeouts are set as in [Durability](#Durability) and [Timeouts](#Timeouts). Use a set [RetryStrategy](#RetryStrategy).
* If `cas` is empty, StoreSemantics=INSERT (e.g. the ADD flag).
* Else, StoreSemantics=REPLACE, and provide the CAS of `doc`.
* accessDeleted=true (to make memcached access any existing tombstone)
* createAsDeleted=true (to make memcached create the document as a tombstone)
* If ExtAllKVCombinations: Write these fields. All are written as xattrs and are upsert specs.
* "txn", a JSON blob with these contents, and written with createPath():
* "txn.id.txn" = the transactionId.
* "txn.id.atmpt" = the attemptId.
* ExtThreadSafety: "txn.id.op" = the `operationId`.
* "txn.atr.id" = the ATR's key.
* "txn.atr.bkt" = the ATR's bucket's name.
* "txn.atr.scp" = the ATR's scope's name
* "txn.atr.coll" = the ATR's collection's name
* "txn.op.type" = "insert"
* ExtBinarySupport: add the [Forward Compatibility Map for ExtBinarySupport](#Forward-Compatibility-Map-for-ExtBinarySupport).
* The content spec(s):
* If ExtBinarySupport is implemented:
* See [Staging Binary Data](#Staging-Binary-Data)
* Else (ExtBinarySupport not implemented):
* "txn.op.stgd" = the transcoded bytes.
* "txn.op.crc32" = MutateInMacro.VALUE_CRC_32C
* Else (!ExtAllKVCombinations): Write these fields. All are written as xattrs and are upsert specs.
* "txn", a JSON blob with these contents, and written with createPath():
* ExtTransactionId version: "txn.id.txn" = the transactionId. Set createPath() on this.
* "txn.id.atmpt" = the attemptId. Set createPath() on this.
* "txn.atr.id" = the ATR's key. Set createPath() on this.
* The content spec(s):
* If ExtBinarySupport is implemented:
* See [Staging Binary Data](#Staging-Binary-Data)
* Else (ExtBinarySupport not implemented):
* "txn.op.stgd" = the transcoded bytes.
* "txn.atr.bkt" = the ATR's bucket's name.
* Protocol 2
* "txn.atr.scp" = the ATR's scope's name
* "txn.atr.coll" = the ATR's collection's name
* Protocol 1
* "txn.atr.coll" = the ATR's scope's name, followed by ".", followed by the ATR's collection's name. E.g. "scope_name.collection_name".
* "txn.op.type" = "insert" (we're NOT adding a new "insert_shadow" for 2.0 because a) backup tools use this field and b) we can tell it's a tombstone when we get the document).
* "txn.op.crc32" = MutateInMacro.VALUE_CRC_32C
* Note: Unlike removes and replaces, a “restore” section is not required here as there is no pre-transaction version of the document for backup to restore.
* On success call debug hook `afterStagedInsertComplete`, passing this AttemptContext and the document's key.
* On success:
* ExtThreadSafety: [Lock](#Lock).
* Add inserted doc to `stagedMutations` list, with its new CAS, and `operationId` if ExtThreadSafety. Store it as type INSERT. This must replace any existing operation in `stagedMutations` on the same document.
* ExtReplaceBodyWithXattr: before this, call and wait for [SupportsReplaceBodyWithXattr](#SupportsReplaceBodyWithXattr). Store or not the staged contents in the created `StagedMutation` based on this.
* ExtThreadSafety: [Unlock](#Unlock).
* Return created doc as a TransactionGetResult
* On error `err` (from any of the preceding items in this section), classify as error class ec then:
* If `err` is FeatureNotFoundException, then this cluster does not support creating shadow documents. (Unfortunately we cannot perform this check at the `Transactions.create` point, as we may not have a cluster config available then). Raise `Error(ec=FAIL_OTHER, cause=err)` to terminate the transaction.
* If in [ExpiryOvertimeMode](#ExpiryOvertimeMode) -> `Error(FAIL_EXPIRY, AttemptExpired(err), rollback=false, raise=TRANSACTION_EXPIRED)`
* Else `FAIL_EXPIRY` -> set [ExpiryOvertimeMode](#ExpiryOvertimeMode) and raise `Error(ec, AttemptExpired(err), raise=TRANSACTION_EXPIRED)`
* `FAIL_AMBIGUOUS` -> Ambiguously inserted marker documents can cause complexities during rollback and retry, so aim to resolve the ambiguity now by retrying this operation from the top of this section, after [OpRetryDelay](#OpRetryDelay). If this op had succeeded then this will cause FAIL_DOC_EXISTS.
* `FAIL_TRANSIENT` -> `Error(ec, err, retry=true)`
* `FAIL_HARD` -> `Error(ec, err, rollback=false)`
* `FAIL_CAS_MISMATCH` -> We're trying to overwrite an existing tombstone, and it's changed. Could be an external actor (such as another transaction), or could be that a `FAIL_AMBIGUOUS` happened and actually succeeded. Either way, do the `FAIL_DOC_ALREADY_EXISTS` logic below again.
* `FAIL_DOC_ALREADY_EXISTS` -> The handling for this is complex. See 'Doc already exists when staging insert' below.
* Else -> `Error(ec, err)`
* On any item above that raises an `Error`, before doing so, [SaveErrorWrapper](#SaveErrorWrapper).
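The FAIL_AMBIGUOUS branch above retries the staged insert from the top so the ambiguity resolves as either success or FAIL_DOC_ALREADY_EXISTS. A minimal sketch of that loop, with hypothetical names and a bounded attempt count standing in for the real expiry-based cutoff (a real implementation also applies OpRetryDelay between attempts):

```java
import java.util.function.Supplier;

// Illustrative ambiguity-resolution retry loop; not SDK API.
class AmbiguityResolver {
    static class AmbiguousException extends RuntimeException {}

    static <T> T withAmbiguityRetries(Supplier<T> stagedInsert, int maxAttempts) {
        RuntimeException last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return stagedInsert.get();      // retry from the top of the algorithm
            } catch (AmbiguousException e) {
                last = e;                       // OpRetryDelay would be applied here
            }
        }
        // In the real algorithm, expiry (not an attempt count) ends the retries.
        throw last != null ? last : new IllegalArgumentException("maxAttempts must be > 0");
    }
}
```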
#### Doc already exists when staging insert
There are multiple reasons for this:
* Most likely, the document simply existed before, as a regular doc. In which case we want to fast fail.
* Update for ExtInsertExisting: actually we want to tell the user this constraint violation happened and give them the option to fail (by propagating the error) or continue the transaction.
* Another transaction B could have staged that document (either as a tombstone or not) then crashed in PENDING state. The cleanup process will not be able to resolve the document, it will only be able to remove B's ATR entry. We can't get blocked here, we want to overwrite the document if B's ATR entry is removed. However, if the document is not a staged insert, it's an existing doc and hence is still blocking our insert (CBD-3787).
* This same transaction attempt could have performed the same insert, perhaps concurrently. Or even a concurrent insert that was then followed up by one or more removes and replaces.
* We got FAIL_AMBIGUOUS on the previous attempt (which could have been staging as a tombstone or not), which actually succeeded, so our retry of the insert hits this.
* We got FAIL_AMBIGUOUS but it actually failed and another transaction has managed to either stage or commit the same document (as a tombstone or not). Somewhat edge-casey, but means we can't rely on some sort of `isAmbiguityResolution` flag.
* (In both FAIL_AMBIGUOUS cases, we want to proceed as successful only if the document is staged with this transaction's metadata. Else we want to fast fail.)
The logic to resolve all these cases is to get the document, and check its metadata and tombstone status to see whether to proceed. The exact algo:
* Call hook `beforeGetDocInExistsDuringStagedInsert`, passing this AttemptContext and the document's key.
* Get the doc as a lookupIn operation with the accessDeleted flag set (which will fetch any existing tombstone), and fetching all transaction metadata. Use a set [RetryStrategy](#RetryStrategy).
* On error `err` of either of those operations:
* FAIL_DOC_NOT_FOUND -> The doc did exist, and now doesn't. (This is extremely unlikely given that we're fetching tombstones.) `Error(ec, retry=true)`
* ~~FAIL_PATH_NOT_FOUND -> This is what will happen if the doc has been deleted, or if it's simply a normal tombstone. We will get the tombstone and it will be missing the "txn" xattr metadata.~~ `Error(ec, retry=true)` This logic is redundant - it can never trigger - and this line will be removed soon.
* FAIL_TRANSIENT -> Let's try all this again. `Error(ec, retry=true)`
* Else -> Bailout. `Error(ec, cause=err)`
* On success
* Do a [ForwardCompatibilityCheck](#ForwardCompatibilityCheck) with any `fc` forward compatibility field in the doc's metadata, and interaction point "WW_IG".
* If the doc is a tombstone and not in any transaction -> It's ok to go ahead and overwrite. Perform this algorithm from the top with `cas`=the cas from the get.
* Else if the doc is not in a transaction ->
* If ExtInsertExisting: raise `DocumentExistsException` directly (no `TransactionOperationFailed`). Do not set any internal state (e.g. no `errors` list for ExtQuery, no state bits for ExtThreadSafety). The raised DocumentExistsException should be catchable/ignorable.
* Else: Raise `Error(FAIL_DOC_ALREADY_EXISTS, cause=DocumentExistsException)`. This ultimately results in fast-failing the transaction.
* Else: (The doc is a tombstone and in a transaction)
* ExtThreadSafety:
* If the doc's staged attemptId matches this attemptId:
* If the doc's staged operationId (which must be present) matches this operationId, we must have successfully resolved a FAIL_AMBIGUOUS that really succeeded. We are done, proceed as success:
* [Lock](#Lock).
* Add inserted doc to `stagedMutations` list, with the fetched doc's CAS. Store it as type INSERT.
* [Unlock](#Unlock).
* Return fetched doc as a TransactionGetResult.
* Else we are racing with a concurrent attempt to insert the same document in the same attempt. Raise `Error(FAIL_CAS_MISMATCH, cause=ConcurrentOperationsDetectedOnSameDocument, retry=false)`.
* BF-CBD-3787:
* If the document is not a staged insert, as determined by `op.type`, we cannot overwrite.
* If ExtInsertExisting: raise `DocumentExistsException` directly (no `TransactionOperationFailed`). Do not set any internal state (e.g. no `errors` list for ExtQuery, no state bits for ExtThreadSafety). The raised DocumentExistsException should be catchable/ignorable.
* Else: Raise `Error(FAIL_DOC_ALREADY_EXISTS, cause=DocumentExistsException)`.
* Call the [CheckWriteWriteConflict](#CheckWriteWriteConflict) logic with `ip`="WW_I", which conveniently does everything we need to handle the above cases. Pass an existingMutation of null/empty (we already checked for concurrent races using the operationId).
* If this logic succeeds, we are ok to overwrite the doc:
* BF-CBD-3787: If the document is a staged insert but also is not a tombstone (e.g. it is from protocol 1.0), it must be deleted first:
* Call hook `beforeOverwritingStagedInsertRemoval`, passing this AttemptContext and the document's key.
* Perform a regular KV remove of the document, using the CAS from the get. Durability and timeout are set as in [Durability](#Durability) and [Timeouts](#Timeouts).
* On error `err` of either of those operations, classify as `ec`, then:
* FAIL_DOC_NOT_FOUND | FAIL_CAS_MISMATCH | FAIL_TRANSIENT -> Let's try all this again. `Error(ec, cause=err, retry=true)`
* Else -> Bailout. `Error(ec, cause=err)`
* Retry the staged insert, by performing this algorithm from the top, passing the CAS from the KV remove.
* Else (either BF-CBD-3787 is not supported, or it is but the doc does not need to be deleted):
* Proceed to overwrite the doc as a replace operation, by performing this algorithm from the top with `cas`=the cas from the get.
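The cases above can be summarised as a decision table keyed on the fetched doc's tombstone status, whether it is in a transaction, and (for ExtThreadSafety) whether it carries this attempt's and operation's ids. A condensed, hypothetical sketch, omitting the ExtInsertExisting and BF-CBD-3787 branches:

```java
// Illustrative decision table for 'Doc already exists when staging insert'.
// Names are hypothetical; real implementations also consult forward
// compatibility and CheckWriteWriteConflict.
class DocExistsResolution {
    enum Action { OVERWRITE_WITH_CAS, RAISE_DOC_EXISTS,
                  RESOLVE_AMBIGUITY_AS_SUCCESS, CHECK_WRITE_WRITE }

    static Action resolve(boolean isTombstone, boolean inTransaction,
                          boolean sameAttemptAndOperation) {
        if (!inTransaction) {
            // Plain tombstone: safe to overwrite. Regular doc: fast fail.
            return isTombstone ? Action.OVERWRITE_WITH_CAS : Action.RAISE_DOC_EXISTS;
        }
        if (sameAttemptAndOperation) {
            // ExtThreadSafety: a FAIL_AMBIGUOUS that really succeeded.
            return Action.RESOLVE_AMBIGUITY_AS_SUCCESS;
        }
        // Another transaction holds the doc: CheckWriteWriteConflict with ip="WW_I".
        return Action.CHECK_WRITE_WRITE;
    }
}
```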
## GetOptional
* ExtThreadSafety:
* [KVOpInit](#KVOpInit), passing stage name "get".
* ExtQuery: If in `queryMode`, instead of doing any of the following, perform the logic in the [query spec](https://hackmd.io/eMPHhhd9SJqjO3s9dT9p7A) instead.
* Else (!ExtThreadSafety):
* ExtQuery: If in `queryMode`, instead of doing any of the following, perform the logic in the [query spec](https://hackmd.io/eMPHhhd9SJqjO3s9dT9p7A) instead.
* Perform [DoneCheck](#DoneCheck)
* Perform [CheckErrors](#CheckErrors).
* If transaction has expired (stage name is "get"), set [ExpiryOvertimeMode](#ExpiryOvertimeMode) then raise an `Error(FAIL_EXPIRY, raise=TRANSACTION_EXPIRED)`.
* Check `stagedMutations`.
* If the doc already exists in there as a REPLACE or INSERT:
* ExtReplaceBodyWithXattr: _and_ the StagedMutation has content. (E.g. if we are not on a ReplaceBodyWithXattr-compatible cluster, the staged content will not have been stored, and we fall back to reading our RYOW from the cluster.)
* Return its post-transaction content in a `TransactionGetResult`. ExtBinarySupport: plus the `userFlags`.
* Protocol 2.0 amendment: and `TransactionGetResult::links().isDeleted()` reflecting whether it is a tombstone or not.
* Else if the doc already exists in there as a remove, return empty.
* ExtThreadSafety:
* Call debug hook `beforeUnlockGet`, passing this AttemptContext and the document's key.
* [Unlock](#Unlock).
* Perform the `Get a Document With MAV Logic` section below.
* ExtThreadSafety:
* If any error propagates after locking, [UnlockOnError](#UnlockOnError).
* Whatever happens (success or error), the final step before returning from ctx.getOptional() is [RemoveKVOp](#RemoveKVOp).
### Get a Document With MAV Logic
This should be implemented as a routine taking a String parameter `resolvingMissingATREntry`, which defaults to null/empty. Its use will be detailed below.
* Call hook `beforeDocGet`, passing this AttemptContext and the document's key.
* Do a Sub-Document lookup, getting all transactional metadata, the "$document" virtual xattr, and the document's body. Timeout is set as in [Timeouts](#Timeouts).
* ExtBinarySupport: add a spec for getting field `txn.op.bin`, with the [Sub-document Binary Flag](#Sub-document-Binary-Flag) and xattr flag set.
* Added for Protocol 2.0: set `accessDeleted`=true.
* On error `err` of any of the above, classify as ErrorClass ec then:
* FAIL_DOC_NOT_FOUND -> return empty
* Else FAIL_HARD -> Error(ec, err, rollback=false)
* Else FAIL_TRANSIENT -> Error(ec, err, retry=true)
* Else -> raise `Error(ec, cause=err)`
* Else if success, check the transactional metadata to see if the doc is already involved in a transaction.
* If not ->
* Added for Protocol 2.0: If doc is a tombstone, return empty.
* Else return the doc.
* Else if ExtReplaceBodyWithXattr and the attemptId in the transactional metadata == our attempt id:
* This is a RYOW, and we can optimise here by not looking up the document's ATR.
* If the document is being removed, then return empty.
* Else return the post-transaction version.
* Else if `resolvingMissingATREntry` == the attemptId in the transactional metadata:
* This is our second attempt getting the document, and it's in the same state as before (see `ActiveTransactionRecordEntryNotFound` discussion below). The blocking transaction must have been cleaned up in PENDING state, leaving this document with metadata. We are fine to return the body of the document to the app. Except if it is a staged insert (e.g. it's a tombstone - protocol 2 - or has a null body - protocol 1), in which case return empty.
* Else -> we need to resolve the state of that transaction. Here is where we do the "Monotonic Atomic View" (MAV) logic: look at the single point of truth for that transaction (its ATR entry), and use that to determine whether the pre-transaction or post-transaction version of the document should be returned. The pre-transaction content is in the body, the post-transaction content in the xattrs. The algo:
* Do a Sub-Document lookup of the transaction's ATR entry.
* On error err, classify as ec then:
* (Historical note: On errors getting the ATR entry, an older version of this spec returned the document but with a status of AMBIGUOUS. However, Couchbase is philosophically CP rather than AP, so the behaviour was changed to the below.)
* FAIL_DOC_NOT_FOUND -> Raise an ActiveTransactionRecordNotFound.
* FAIL_PATH_NOT_FOUND -> Raise an ActiveTransactionRecordEntryNotFound.
* Else -> raise `Error(ec, cause=err)`.
* Else (success):
* If the ATR entry's attemptId is the same as this attemptId (RYOW):
* Future optimisation note: there's actually no need for an ATR read in this case, we could have just used the document's metadata for this check. (This optimisation now added above in ExtReplaceBodyWithXattr).
* If the document is being removed, then return empty.
* Else return the post-transaction version.
* Else do a [ForwardCompatibilityCheck](#ForwardCompatibilityCheck) with any `fc` forward compatibility field in the ATR's metadata, and interaction point "G_A".
* If that passes:
* If the ATR entry's status is COMMITTED or COMPLETED
* If the doc is a staged removal, return empty, e.g. no doc.
* Else, return the post-transaction version.
* Else:
* (ExtUnknownATRStates: this branch is followed if the ATR state is unknown, as it seems preferable to assume the unknown state is pre-COMMIT and allow this transaction to proceed, rather than blocking it. The forwards compatibility map can be used by a future protocol if that turns out not to be the case.)
* If the doc is a staged insert (e.g. it's a tombstone - protocol 2 - or has a null body - protocol 1), then return empty, e.g. no doc.
* Else, return the pre-transaction version.
* On anything above erroring with error `err`, classify `err` as `ec`, then:
* ActiveTransactionRecordNotFound ->
* If BF-CBD-3705:
* Perform the same logic as the ActiveTransactionRecordEntryNotFound case below.
* Note: This isn't perfectly efficient, as we're redoing the doc read knowing full well it's going to be in the same state. But, this is a real edge case, and it's more desirable to have an already tested path than perfect efficiency. It also isn't perfectly correct - it's going to end up assuming that T1 was a PENDING transaction, when really, we don't know. But the only true solution there is to have the ATRs protected from the user, and it is better in the present to align the ActiveTransactionRecordNotFound and ActiveTransactionRecordEntryNotFound behaviours, since the ATRs will usually be recreated and they should end up being equivalent. See discussion on [CBD-3705](https://issues.couchbase.com/browse/CBD-3705) for more.
* Else: Raise an `Error(ec, cause=err)` to fail the transaction.
* ActiveTransactionRecordEntryNotFound ->
* Discussion moved to 'ActiveTransactionRecordEntryNotFound discussion' below.
* Recursively call this section from the top, passing resolvingMissingATREntry set to the attemptId of the blocking transaction.
* FAIL_HARD -> Raise `Error(ec, cause=err, rollback=false)`
* FAIL_TRANSIENT -> Raise `Error(ec, cause=err, retry=true)`
* Else Error(ec, err)
* On any item above that raises an `Error`, before doing so, [SaveErrorWrapper](#SaveErrorWrapper).
* If everything above succeeds:
* Call hook `afterGetComplete` before returning the result.
* If that succeeds, do a [ForwardCompatibilityCheck](#ForwardCompatibilityCheck) with any `fc` forward compatibility field in the document's metadata, and interaction point "G".
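The MAV version selection above boils down to: a COMMITTED or COMPLETED ATR entry yields the post-transaction version (or empty for a staged removal), while any pre-commit or unknown state yields the pre-transaction version (or empty for a staged insert). A minimal sketch, with hypothetical names (the RYOW and forward-compatibility branches are handled before this point):

```java
// Illustrative MAV decision: which version of a doc in another transaction
// should a reader see, given that transaction's ATR entry state?
class MavDecision {
    enum AtrState { PENDING, ABORTED, COMMITTED, COMPLETED, UNKNOWN }
    enum StagedOp { INSERT, REPLACE, REMOVE }
    enum Result { POST_TXN_VERSION, PRE_TXN_VERSION, EMPTY }

    static Result choose(AtrState state, StagedOp op) {
        boolean committed = state == AtrState.COMMITTED || state == AtrState.COMPLETED;
        if (committed) {
            // Staged removal: the doc is logically gone.
            return op == StagedOp.REMOVE ? Result.EMPTY : Result.POST_TXN_VERSION;
        }
        // Pre-commit, or unknown (ExtUnknownATRStates treats unknown as pre-commit).
        // Staged insert: there is no pre-transaction version to return.
        return op == StagedOp.INSERT ? Result.EMPTY : Result.PRE_TXN_VERSION;
    }
}
```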
#### ActiveTransactionRecordEntryNotFound discussion
* Tricky case. The ATR exists but the ATR entry does not. Most likely explanation is that transaction was cleaned up, though can't rule out something like the ATRs being recreated. Pertinent point is, we don't know whether the transaction was committed or not. But we can resolve the ambiguity. Either the transaction was in PENDING state or not.
* Another case that can lead here: If we had a FAIL_AMBIGUOUS on a replace, we retry. Rollback will remove the ATR entry (in EXT_REMOVE_COMPLETED) but cannot rollback the doc (it's not in stagedMutations).
* If it was in PENDING state, then cleanup will have removed its ATR entry, but had to leave the documents with staged data. If we reread the doc, it will still have staged data.
* Else, if it was in any other state, then cleanup will have both removed its ATR entry and been able to unstage the doc, because we know it will only remove the ATR entry _after_ successfully unstaging all documents. E.g. we must have hit an edge case where we read the doc just before or during cleanup, and then read the ATR entry post-cleanup. So if we reread the doc, the staged content will either have gone, or maybe another transaction will have staged content on it.
* Update: I'm not sure this logic holds in advent of the ATRs being recreated. E.g. T1 could have committed and then crashed, and then the ATR been deleted and recreated. T2 now tries to read it, and with the logic above, will assume incorrectly (because the document hasn't changed), that T1 was a PENDING transaction. But this seems an impossible situation to resolve with the current setup - the ATRs must be protected from the user in e.g. a system collection.
## Get
* Get simply calls GetOptional, and if that returns empty, raises `Error(FAIL_DOC_NOT_FOUND)`
## Unstaging Inserts and Replaces
This is the protocol 2.x version. The protocol 1.x version has been [moved](https://hackmd.io/g3PWQrHHRiew0nGF6iFgiA).
Unstaging an insert or replace means writing the post-transaction content into the document's body and removing its transactional metadata. This is done as a single atomic operation.
* Execute section `FetchIfNeededBeforeUnstage`.
* If that succeeds, returning `bytes` and `cas`, pass these to the next section, `UnstageInsertOrReplace`.
### FetchIfNeededBeforeUnstage
This method takes these parameters:
* `StagedMutation sm` - saved in-memory from the staged mutation. It contains the document's content if in `ExtTimeOptUnstaging` mode.
and returns these on success:
* `int cas` - the document's CAS
* `byte[] bytes` - the document's content
The algorithm:
* If in `ExtTimeOptUnstaging` mode, or if in ExtReplaceBodyWithXattr mode and `sm` has no content (indicating this cluster supports ReplaceBodyWithXattr):
* (Note this is the path most implementations will follow)
* Return `bytes` and `cas` from `sm`
* Else if `ExtMemoryOptUnstaging` mode:
* The idea is to refetch the document to get the staged data, and then commit it.
* The algo:
* Call debug hook `beforeDocUnstageRefetch`, passing this AttemptContext and the document's key.
* If transaction has expired, and not in [ExpiryOvertimeMode](#ExpiryOvertimeMode), set ExpiryOvertimeMode.
* Do a Sub-Document lookup, getting all transactional metadata - but not the document's body or $document. Timeout is set as in [Timeouts](#Timeouts). Use a set [RetryStrategy](#RetryStrategy).
* If `sm` is an INSERT, set `accessDeleted`=true.
* On error `err` of any of the above, classify as ErrorClass `ec` then:
* If in ExpiryOvertimeMode -> `Error(FAIL_EXPIRY, AttemptExpired(err), rollback=false, raise=TRANSACTION_FAILED_POST_COMMIT)`
* `FAIL_DOC_NOT_FOUND` -> Doc was removed in-between stage and unstaging.
* Something has broken the co-operative model. There's nothing we can do to proceed for this document. (This marks one difference between ExtMemoryOptUnstaging and ExtTimeOptUnstaging modes - the latter will proceed to insert the document here).
* Publish an `IllegalDocumentState` event to the application.
* Return `Error(ec, err, rollback=false, raise=TRANSACTION_FAILED_POST_COMMIT)`
* Else -> Raise `Error(ec, cause=err, rollback=false, raise=TRANSACTION_FAILED_POST_COMMIT)`. This will result in success being returned to the application, but `result.unstagingCompleted()` will be false.
* On success:
* If the document has transactional metadata `tm`, and it is still part of the same transaction as this one:
* If the `tm`'s attemptId does not match this attempt, something unrecoverable has happened. `Error(ec, cause=IllegalDocumentState, rollback=false, raise=TRANSACTION_FAILED_POST_COMMIT)`
* If the document's CAS does not match the CAS from `sm`:
* Publish an `IllegalDocumentState` event to the application, as the co-operative model has been broken.
* If `txn`, `txn.op` or `txn.op.stgd` fields are missing from the transactional metadata, we cannot continue. `Error(ec, cause=IllegalDocumentState, rollback=false, raise=TRANSACTION_FAILED_POST_COMMIT)`
* Return `bytes` from the content stored in xattr `txn.op.stgd` and `cas` from the fetched document.
* Else:
* The document has somehow become involved in a different transaction. This should not be possible. From [CheckWriteWriteConflict](#CheckWriteWriteConflict), the only way T1 can overwrite a locked document is if T2's ATR entry has expired, is COMPLETED or ROLLED_BACK, or has been removed. So we should only reach here if there's a bug, or if something extremely odd has happened and T1 has been cleaned up while it's still committing.
* So, assume that has happened, and bailout, leaving the rest of the transaction for cleanup (which appears to have already happened).
* Raise `Error(FAIL_OTHER, rollback=false, raise=TRANSACTION_FAILED_POST_COMMIT)`
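The three-way mode dispatch at the top of this section can be sketched as a single decision function. `UnstageFetchDecision` and `StagedMutation` here are illustrative stand-ins for the real implementation's types:

```java
// Sketch only: does FetchIfNeededBeforeUnstage need to reread the document?
public class UnstageFetchDecision {

    // Minimal model of what is saved in memory at staging time.
    record StagedMutation(String id, byte[] content, long cas) {}

    static boolean needsRefetch(boolean timeOptMode, boolean replaceBodyWithXattr,
                                StagedMutation sm) {
        // ExtTimeOptUnstaging, or ExtReplaceBodyWithXattr with no in-memory
        // content, can unstage straight from what was saved at staging time.
        if (timeOptMode || (replaceBodyWithXattr && sm.content() == null)) {
            return false;
        }
        // ExtMemoryOptUnstaging must reread the document to recover the
        // staged data from the txn.op.stgd xattr.
        return true;
    }
}
```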
### UnstageInsertOrReplace
This method takes these parameters:
* `StagedMutation sm` - saved in-memory from the staged mutation.
* `long cas` - the first write of the document is done as a replace, with the CAS from staging it (or from fetching it, in ExtMemoryOptUnstaging mode). This lets us detect if another actor has modified the document, i.e. broken the co-operative model, and raise an event alerting the application. However, the transaction must commit, so the operation will be retried as an upsert (i.e. with a CAS of 0). This is also used for FAIL_DOC_NOT_FOUND handling.
* `boolean insertMode` - if we get FAIL_DOC_NOT_FOUND, we retry the operation as an insert. (It's not allowed with Sub-Document to do an upsert that also removes a field, unfortunately.) Defaults to true if it's a shadow-document-staged-insert, else false.
* `boolean ambiguityResolutionMode` - used for FAIL_AMBIGUOUS handling. Defaults to false.
* `byte[] bytes` - the document content.
* `ExtBinarySupport`: `int32 stagedUserFlags`, from `stagedMutations`.
The algorithm:
* If transaction has expired (stage name is "commitDoc"), and not in [ExpiryOvertimeMode](#ExpiryOvertimeMode), set ExpiryOvertimeMode.
* Call debug hook `beforeDocCommitted`, passing this AttemptContext and the document's key.
* Perform the mutation:
* If in `insertMode`:
* ExtReplaceBodyWithXattr and the contents of `sm` are null (e.g. the server also supports ReplaceBodyWithXattr):
* Perform a Sub-Document mutateIn. Durability and timeouts are set as in [Durability](#Durability) and [Timeouts](#Timeouts). Use a set [RetryStrategy](#RetryStrategy). Set `reviveDocument` and `accessDeleted` flags to true.
* These specs are performed:
* REPLACE_BODY_WITH_XATTR, as specified on [SDK-RFC 69](https://docs.google.com/document/d/1pafGxfhhg4Nmw_huvjJoGuUBWA39kCCXpkMoIb1MtW0/edit). This is an xattr field.
* ExtBinaryMetadata: if document is binary use field `txn.op.bin` as the source, and set the [Sub-document Binary Flag](#Sub-document-Binary-Flag). If not binary, use field `txn.op.stgd`.
* Remove "txn". This is an xattr field.
* Use CAS of `cas`.
* ExtBinaryMetadata: set the document's user flags to `stagedUserFlags`.
* Else:
* Perform a regular KV insert, using content `bytes`. Durability and timeout are set as in [Durability](#Durability) and [Timeouts](#Timeouts).
* ExtBinaryMetadata: set the document's user flags to `stagedUserFlags`.
* Else:
* ExtReplaceBodyWithXattr and the contents of `sm` are null:
* Perform a Sub-Document mutateIn. Durability and timeouts are set as in [Durability](#Durability) and [Timeouts](#Timeouts). Use a set [RetryStrategy](#RetryStrategy).
* These specs are performed:
* REPLACE_BODY_WITH_XATTR, as specified on [SDK-RFC 69](https://docs.google.com/document/d/1pafGxfhhg4Nmw_huvjJoGuUBWA39kCCXpkMoIb1MtW0/edit). This is an xattr field.
* ExtBinaryMetadata: if document is binary use field `txn.op.bin` as the source, and set the [Sub-document Binary Flag](#Sub-document-Binary-Flag). If not binary, use field `txn.op.stgd`.
* Remove "txn". This is an xattr field.
* Use CAS of `cas`.
* ExtBinaryMetadata: set the document's user flags to `stagedUserFlags`.
* Else:
* Perform a Sub-Document mutateIn. Durability and timeouts are set as in [Durability](#Durability) and [Timeouts](#Timeouts). Use a set [RetryStrategy](#RetryStrategy).
* These specs are performed:
* Upsert "txn" as null. This is an xattr field.
* Remove "txn". This is an xattr field. See [Upsert-and-remove](#Upsert-and-remove) for an explanation of the idiom here.
* Replace "" (the document body) with `bytes`.
* ExtBinaryMetadata: do not set [Sub-document Binary Flag](#Sub-document-Binary-Flag), that is not used for full-doc operations.
* Use CAS of `cas`.
* ExtBinaryMetadata: set the document's user flags to `stagedUserFlags`.
* Call debug hook `afterDocCommittedBeforeSavingCAS`, passing this AttemptContext and the document's key.
* On error err (from any of the preceding items in this section), classify as error class ec then:
* If in ExpiryOvertimeMode -> `Error(FAIL_EXPIRY, AttemptExpired(err), rollback=false, raise=TRANSACTION_FAILED_POST_COMMIT)`
* Else `FAIL_AMBIGUOUS` -> Retry this section from the top, after [OpRetryDelay](#OpRetryDelay), with `ambiguityResolutionMode`=true and the current `cas` and `insertMode`.
* Else `FAIL_CAS_MISMATCH` -> Doc has changed in-between stage and unstaging.
* If `ambiguityResolutionMode`=true, then our previous attempt likely succeeded. Unfortunately, we cannot continue as our returned mutationTokens wouldn't include this mutation. Raise `Error(ec, err, rollback=false, raise=TRANSACTION_FAILED_POST_COMMIT)`.
* Else, another actor has changed the document, breaking the co-operative model (it should not be possible for another transaction to do this). Publish an `IllegalDocumentState` event to the application (the details of event publishing are platform-specific: on Java, it uses the Java SDK's event bus). Run this section again from the top, with `cas`=0 and the current `ambiguityResolutionMode` and `insertMode`.
* Else `FAIL_DOC_NOT_FOUND` -> Doc was removed in-between stage and unstaging.
* Something has broken the co-operative model. The transaction must commit, so we will insert the document.
* Publish an `IllegalDocumentState` event to the application.
* Run this section again from the top, after [OpRetryDelay](#OpRetryDelay), with `insertMode`=true, and the current `ambiguityResolutionMode` and `cas`. This will insert the document.
* Else `FAIL_DOC_ALREADY_EXISTS` ->
* If in ambiguityResolutionMode:
* Probably we inserted a doc over a shadow document, and it raised `FAIL_AMBIGUOUS` but was successful. Resolving this is perhaps impossible - we don't want to retry the operation, as the document is now committed and another transaction (or KV op) may have started on it. We could reread the doc and check it contains the expected content - but again it may have been involved in another transaction. So, raise `Error(ec, cause=err, rollback=false, raise=TRANSACTION_FAILED_POST_COMMIT)`.
* Else:
* Seems the co-operative model has been broken. The transaction must commit, so we will replace the document.
* Publish an `IllegalDocumentState` event to the application.
* Run this section again from the top, after [OpRetryDelay](#OpRetryDelay), with `insertMode`=false, the current `ambiguityResolutionMode`, and `cas`=0.
* Else -> Raise `Error(ec, cause=err, rollback=false, raise=TRANSACTION_FAILED_POST_COMMIT)`. This will result in success being returned to the application, but `result.unstagingCompleted()` will be false.
* On success
* Save the mutation token to return in the `TransactionResult`.
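The retry decisions in the error block above can be sketched as a pure function from error class and current parameters to either a retry (with adjusted parameters) or a bailout. This is an illustrative sketch with made-up names; it omits the ExpiryOvertimeMode check, which the spec applies first:

```java
// Sketch only: the UnstageInsertOrReplace error branches as a decision function.
public class UnstageErrorHandling {

    enum Ec { FAIL_AMBIGUOUS, FAIL_CAS_MISMATCH, FAIL_DOC_NOT_FOUND,
              FAIL_DOC_ALREADY_EXISTS, OTHER }

    // Parameters to rerun the section with; null means raise
    // Error(..., raise=TRANSACTION_FAILED_POST_COMMIT) and bail out.
    record Retry(boolean insertMode, boolean ambiguityResolutionMode, long cas) {}

    static Retry next(Ec ec, boolean insertMode, boolean ambiguityResolutionMode,
                      long cas) {
        switch (ec) {
            case FAIL_AMBIGUOUS:
                // Retry with ambiguity resolution on; keep cas and insertMode.
                return new Retry(insertMode, true, cas);
            case FAIL_CAS_MISMATCH:
                // In ambiguity resolution mode the previous attempt likely
                // succeeded, but mutationTokens would be missing: bail out.
                if (ambiguityResolutionMode) return null;
                // Co-operative model broken: force the write with CAS 0.
                return new Retry(insertMode, ambiguityResolutionMode, 0);
            case FAIL_DOC_NOT_FOUND:
                // Doc vanished between stage and unstage: insert it instead.
                return new Retry(true, ambiguityResolutionMode, cas);
            case FAIL_DOC_ALREADY_EXISTS:
                if (ambiguityResolutionMode) return null;
                // Replace the document instead, with CAS 0.
                return new Retry(false, ambiguityResolutionMode, 0);
            default:
                return null;
        }
    }
}
```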
## Unstaging Removes
Unstaging a remove is to simply delete the document.
This method takes a boolean `ambiguityResolutionMode` parameter, defaulting to false.
* Call debug hook `beforeDocRemoved`, passing this AttemptContext and the document's key.
* If transaction has expired (stage name is "removeDoc"), and not in [ExpiryOvertimeMode](#ExpiryOvertimeMode), set ExpiryOvertimeMode.
* Perform the mutation:
* This is simply a KV remove. Durability and timeout are set as in [Durability](#Durability) and [Timeouts](#Timeouts). CAS is 0.
* Side-note: We *could* do it with CAS so we could raise an event to the application that the transactional model has been broken (e.g. there has been a change to the document by another actor, in between stage and unstage), but a) it would make FAIL_AMBIGUOUS handling a little trickier and b) it seems unnecessary. The document has been successfully removed, so the fact that a mutation has been lost does not seem relevant information to raise.
* Call debug hook `afterDocRemovedPreRetry`, passing this AttemptContext and the document's key.
* On error err (from any of the preceding items in this section), classify as error class ec then:
* If in ExpiryOvertimeMode -> `Error(FAIL_EXPIRY, AttemptExpired(err), rollback=false, raise=TRANSACTION_FAILED_POST_COMMIT)`
* Else `FAIL_AMBIGUOUS` -> Retry this section from the top, after [OpRetryDelay](#OpRetryDelay), with `ambiguityResolutionMode`=true.
* Else `FAIL_DOC_NOT_FOUND` -> Doc was removed in-between stage and unstaging.
* Either we are in `ambiguityResolutionMode`, and our second attempt found that the first attempt was successful. Or, another actor has removed the document. Either way, we cannot continue as the returned mutationTokens won't contain this removal. Raise `Error(ec, err, rollback=false, raise=TRANSACTION_FAILED_POST_COMMIT)`.
* Note: The ambiguityResolutionMode logic is redundant currently. But we may have better handling for this in the future, e.g. a mode where the application can specify that mutationTokens isn't important for this transaction.
* Else -> Raise `Error(ec, cause=err, rollback=false, raise=TRANSACTION_FAILED_POST_COMMIT)`. This will result in success being returned to the application, but `result.unstagingCompleted()` will be false. See "Unstaging Inserts and Replaces" for the logic behind this.
* On success
* Save the mutation token to return in the `TransactionResult`.
* Call debug hook `afterDocRemovedPostRetry`.
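The error branches above are simpler than for inserts and replaces: only `FAIL_AMBIGUOUS` is retried. A minimal sketch with illustrative names:

```java
// Sketch only: the Unstaging Removes error branches.
public class UnstageRemoveErrors {

    enum Ec { FAIL_AMBIGUOUS, FAIL_DOC_NOT_FOUND, OTHER }
    enum Action { RETRY_WITH_AMBIGUITY_RESOLUTION, RAISE_POST_COMMIT_FAILURE }

    static Action onError(Ec ec) {
        if (ec == Ec.FAIL_AMBIGUOUS) {
            // Retry the whole section after OpRetryDelay, with
            // ambiguityResolutionMode=true.
            return Action.RETRY_WITH_AMBIGUITY_RESOLUTION;
        }
        // FAIL_DOC_NOT_FOUND and all other classes raise
        // Error(..., raise=TRANSACTION_FAILED_POST_COMMIT).
        return Action.RAISE_POST_COMMIT_FAILURE;
    }
}
```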
## SetATRPending
## SetATRCommit
## SetATRRolledBack
## SetATRAborted
## SetATRComplete
This content has been [moved](https://hackmd.io/wE1KU1RKQtONGR8tXbpA1Q) as this document hit size limits.
## Commit
This is the result of either an explicit `ctx.commit()`, or an implicit commit (e.g. reaching the end of the lambda).
The transaction is not regarded as actually committed until the ATR entry has unambiguously been changed to COMMITTED. So errors before that point will still be treated as any other error inside the lambda.
Once the COMMITTED point has been reached it is a point of no return for the transaction.
ExtThreadSafety: commit is done completely under the lock, after waiting for all KV operations to complete.
The algo:
* If ExtThreadSafety:
* [WaitForKVAndLock](#WaitForKVAndLock).
* If CommitNotAllowed BehaviourFlag is set, raise `Error(ec=FAIL_OTHER, cause=CommitNotPermitted)`. Do not update BehaviourFlags or FinalErrorToRaise internal state from this error, as it is purely informational and should not adjust transaction behaviour. (Don't send rollback=false as it affects too many FIT tests to be worth it - it is informational anyway.)
* Perform [DoneCheck](#DoneCheck).
* [SetStateBits](https://hackmd.io/foGjnSSIQmqfks2lXwNp8w?view#SetStateBits) passing newBehaviourFlags = CommitNotAllowed | RollbackNotAllowed.
* If !ExtThreadSafety:
* If the `errors` member is non-empty:
* If we get here then either:
* The application incorrectly caught and did not propagate an error. Errors should cause the lambda to immediately fail.
* Or, the application is doing parallel operations and one of them failed.
* Either way, we will not proceed to commit, and instead need to figure out what to do next.
* We have potentially multiple `ErrorWrapper`s, and will construct a new single `ErrorWrapper` `err` from them using this algo:
* err.retryTransaction is true iff it's true for ALL `errors`.
* err.rollback is true iff it's true for ALL `errors`.
* err.cause = `PreviousOperationFailed`, with that exception taking and providing access to the causes of all the `errors`
* Then, raise `err`
* Perform [DoneCheck](#DoneCheck)
* ExtQuery: If in `queryMode`, skip all of the following and instead perform the logic in the [query spec](https://hackmd.io/eMPHhhd9SJqjO3s9dT9p7A).
* If transaction has expired (stage name is "commit"), set [ExpiryOvertimeMode](#ExpiryOvertimeMode) then raise an `Error(FAIL_EXPIRY, raise=TRANSACTION_EXPIRED)`.
* If ExtQuery: set isDone to true.
* If no ATR has been selected then no mutation has been performed. Return success. This will leave `state` as `NOTHING_WRITTEN`.
* Perform the `SetATRCommit` step.
* If that succeeded, unstage each document (i.e. loop through `stagedMutations`):
* If it's of type `REMOVE` -> Perform step 'Unstaging Removes'
* Else -> Perform step 'Unstaging Inserts and Replaces'
* If that succeeded, perform the `SetATRComplete` step.
* ExtThreadSafety: Whatever happens (success or error), the final step is [Unlock](#Unlock).
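The error-combining step above (building one `ErrorWrapper` from several, when commit is reached with an `errors` member that is non-empty) can be sketched as follows. The types are illustrative, and `PreviousOperationFailed` is modeled as a plain exception holding all causes:

```java
import java.util.List;

// Sketch only: merging multiple ErrorWrappers into one before failing commit.
public class ErrorMerge {

    record ErrorWrapper(boolean retryTransaction, boolean rollback, Exception cause) {}

    // Stand-in for the real PreviousOperationFailed, providing access to the
    // causes of all the merged errors.
    static class PreviousOperationFailed extends Exception {
        final List<Exception> causes;
        PreviousOperationFailed(List<Exception> causes) {
            super("PreviousOperationFailed");
            this.causes = causes;
        }
    }

    static ErrorWrapper merge(List<ErrorWrapper> errors) {
        // retryTransaction and rollback are true iff true for ALL errors.
        boolean retry = errors.stream().allMatch(ErrorWrapper::retryTransaction);
        boolean rollback = errors.stream().allMatch(ErrorWrapper::rollback);
        List<Exception> causes = errors.stream().map(ErrorWrapper::cause).toList();
        return new ErrorWrapper(retry, rollback, new PreviousOperationFailed(causes));
    }
}
```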
### Errors During Commit
Once we have reached the COMMIT point, all transaction-aware actors will see the post-transaction values of all documents - irrespective of whether they are individually unstaged.
So for many application cases (e.g. those not doing N1QL consistentWith), errors after the COMMIT point are not an error at all. And it seems unwise to repeatedly retry operations if the cluster is currently overwhelmed. So instead, success is returned. This makes transactions more resilient in the face of e.g. rebalances. The commit will still be completed by the async cleanup process later.
The returned `TransactionResult.unstagingComplete()` will be false to indicate that, while the transaction successfully reached the COMMIT point, it was not able to unstage (commit) all documents.
Note that it may seem tempting to continue committing the rest of the documents, and _then_ return with `result.unstagingComplete()` as false. But recall that this is going to duplicate some work that the cleanup process will need to do anyway, as cleanup will read and possibly commit _all_ documents if transaction only reached COMMITTED. It is more efficient to bailout.
Future note: On the flip-side, issues that happen 'in process' are much easier to debug than errors in cleanup. In practice, there is a sliding scale of where it makes sense to retry a failure, perhaps for a limited number of times, and where it makes sense to bailout. There are parts of the current algorithm where some errors are retried and some are not, without much of a consistent approach. And the application may wish to have some control over this too. We will likely continue to iterate in this area.
## rollbackInternal
This method takes a boolean parameter `isAppRollback`, indicating whether it was initiated by a `ctx.rollback()`.
As discussed on [Transactions Error Handling Enhancement](/FG0x_PDdTI-yJfhGc9x8Bw), the default behaviour is for any form of rollback (app-rollback or auto-rollback) to retry most forms of error until expiry. For one thing, the transaction may be aiming to retry after an auto-rollback.
ExtThreadSafety: like commit, rollback is done entirely under lock, after waiting for any KV operations to complete.
ExtThreadSafety: a general rule is that the user does not care about errors that occur during auto-rollback (they care about what _caused_ the rollback). So, any `TransactionOperationFailed` raised during auto-rollback does _not_ set internal state bits.
The algo:
* If ExtThreadSafety:
* [WaitForKVAndLock](#WaitForKVAndLock).
* If RollbackNotAllowed BehaviourFlag is set, raise `Error(ec=FAIL_OTHER, cause=RollbackNotPermitted)`. Do not update BehaviourFlags or FinalErrorToRaise internal state from this error, as it is purely informational and should not adjust transaction behaviour. (Don't send rollback=false as it affects too many FIT tests to be worth it - it is informational anyway.)
* Perform [DoneCheck](#DoneCheck).
* [SetStateBits](https://hackmd.io/foGjnSSIQmqfks2lXwNp8w?view#SetStateBits) passing newBehaviourFlags = CommitNotAllowed | RollbackNotAllowed. (Including on auto-rollback).
* If `state` == NOT_STARTED and !queryMode, skip rollback. There is nothing to do (except in queryMode, where we don't know what state the transaction is in on the query side, so ROLLBACK for safety).
* ExtQuery: If in `queryMode`, skip all of the following and instead perform the logic in the [query spec](https://hackmd.io/eMPHhhd9SJqjO3s9dT9p7A).
* If not in ExpiryOvertimeMode: If transaction has expired, set ExpiryOvertimeMode.
* If ExtQuery and !ExtThreadSafety:
* If isDone is true, and `isAppRollback`, raise `Error(ec = FAIL_OTHER, rollback=false)` with a cause indicating in the most platform-applicable way (e.g. IllegalStateException on Java) that operations cannot be performed on a transaction after it has committed or rolled back.
* Set isDone to true.
* If the `state` of the context is NOT_STARTED, then return as success (there's nothing to rollback).
* If not ExtQuery: If the application has already called `ctx.commit()` or `ctx.rollback()`, raise `Error(ec=FAIL_OTHER, rollback=false)`.
* Then call [setATRAborted](#SetATRAborted).
* If successful, then, for each document in `stagedMutations` (in the order they were added, purely because it makes testing easier):
* If it's an insert, do the [Rollback Staged Insert](#Rollback-Staged-Insert) logic.
* Else, do the [Rollback Staged Replace or Remove](#Rollback-Staged-Replace-or-Remove) logic.
* Then call [SetATRRolledBack](#SetATRRolledBack).
* ExtThreadSafety: Whatever happens (success or error), the final step is [Unlock](#Unlock).
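The per-document rollback dispatch above can be sketched as a planning function over `stagedMutations`, preserving insertion order. Names are illustrative:

```java
import java.util.List;

// Sketch only: which rollback step applies to each staged mutation.
public class RollbackDispatch {

    enum Type { INSERT, REPLACE, REMOVE }
    enum Step { ROLLBACK_STAGED_INSERT, ROLLBACK_REPLACE_OR_REMOVE }

    // Iterates stagedMutations in the order they were added, as the spec
    // requires: staged inserts are removed, staged replaces/removes just
    // have their transactional metadata stripped.
    static List<Step> plan(List<Type> stagedMutations) {
        return stagedMutations.stream()
                .map(t -> t == Type.INSERT
                        ? Step.ROLLBACK_STAGED_INSERT
                        : Step.ROLLBACK_REPLACE_OR_REMOVE)
                .toList();
    }
}
```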
## Rollback Staged Insert
There's a staged insert (an empty document with transactional metadata in xattrs) that needs to be removed.
This method's parameter list includes:
* `isAppRollback` - indicating whether it was initiated by a `ctx.rollback()`.
* ExtThreadSafety: if false, then any `TransactionOperationFailed` raised by this section does _not_ set internal state bits. (The user does not care about problems that happen during auto-rollback.)
The algo:
* If transaction has expired, and is not in ExpiryOvertimeMode, raise an AttemptExpired, which will be handled by the error block in this section.
* Call debug hook `beforeRollbackDeleteInserted`, passing this AttemptContext and the document's key.
* Perform the mutation:
* Protocol 2.0 version:
* Perform a Sub-Document mutateIn. Durability and timeouts are set as in [Durability](#Durability) and [Timeouts](#Timeouts). Use a set [RetryStrategy](#RetryStrategy).
* Use cas of the staged `doc`. (This logic added in [CBD-3495](https://issues.couchbase.com/browse/CBD-3495))
* `accessDeleted` flag set
* These specs are performed:
* Upsert "txn" as null. This is an xattr field.
* Remove "txn". This is an xattr field. See [Upsert-and-remove](#Upsert-and-remove) for an explanation of the idiom here.
* Else (pre-2.0 version):
* Perform a regular KV remove. Durability and timeout are set as in [Durability](#Durability) and [Timeouts](#Timeouts).
* On success
* Call hook `afterRollbackDeleteInserted`, passing this AttemptContext and the document's key.
* Do not save the mutation token. There is no need, as a rolled-back attempt is the same as pre-transaction.
* On error err (from any of the preceding items in this section), classify as error class ec then:
* If ExpiryOvertimeMode -> time to bailout. Raise `Error(FAIL_EXPIRY, AttemptExpired(err), raise=TRANSACTION_EXPIRED, rollback=false)`.
* Else `FAIL_EXPIRY` -> Set ExpiryOvertimeMode and retry operation, after waiting [OpRetryBackoff](#OpRetryBackoff).
* Else `FAIL_DOC_NOT_FOUND` -> Possibly we retried the op on `FAIL_AMBIGUOUS` and that op had succeeded. Perhaps something odd has happened and async cleanup has rolled back this doc while we've been trying to. Either way, our work on this document is done. Continue as success.
* Protocol 2.0 version:
* Else `FAIL_PATH_NOT_FOUND` -> same logic as `FAIL_DOC_NOT_FOUND` for same reason.
* Else `FAIL_CAS_MISMATCH` -> Either the co-operative model has been broken, or we're retrying on a previous FAIL_AMBIGUOUS that actually succeeded. We could resolve the ambiguity here, but it's somewhat expensive (would require reading the doc), and it's only rollback cleanup. So instead bailout and leave it for the cleanup process: `Error(ec, err, rollback=false)`
* Else `FAIL_HARD` -> `Error(ec, err, rollback=false)`
* Else -> Default current logic is that rollback will continue in the event of failures until expiry. Retry operation, after waiting [OpRetryBackoff](#OpRetryBackoff).
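The classification above can be sketched as a decision table: most errors retry until expiry, a few bail out and leave the rest for cleanup. An illustrative sketch; the real implementation interleaves these branches with hooks and backoff:

```java
// Sketch only: error classification for rolling back a staged insert.
public class RollbackInsertErrors {

    enum Ec { FAIL_EXPIRY, FAIL_DOC_NOT_FOUND, FAIL_PATH_NOT_FOUND,
              FAIL_CAS_MISMATCH, FAIL_HARD, OTHER }
    enum Action { BAIL_OUT_EXPIRED, SET_OVERTIME_AND_RETRY, TREAT_AS_SUCCESS,
                  RAISE_NO_ROLLBACK, RETRY_AFTER_BACKOFF }

    static Action onError(Ec ec, boolean expiryOvertimeMode) {
        // Already over time: raise Error(FAIL_EXPIRY, ..., rollback=false).
        if (expiryOvertimeMode) return Action.BAIL_OUT_EXPIRED;
        switch (ec) {
            case FAIL_EXPIRY:         return Action.SET_OVERTIME_AND_RETRY;
            // The staged insert is already gone (e.g. a retried
            // FAIL_AMBIGUOUS that had succeeded, or async cleanup got there
            // first): our work on this document is done.
            case FAIL_DOC_NOT_FOUND:
            case FAIL_PATH_NOT_FOUND: return Action.TREAT_AS_SUCCESS;
            // Bail out and leave the ambiguity for the cleanup process.
            case FAIL_CAS_MISMATCH:
            case FAIL_HARD:           return Action.RAISE_NO_ROLLBACK;
            // Default: rollback keeps retrying until expiry.
            default:                  return Action.RETRY_AFTER_BACKOFF;
        }
    }
}
```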
## Rollback Staged Replace or Remove
There's a staged replace or remove `doc` that needs to be rolled back. This simply involves removing the transactional metadata.
This method's parameter list includes:
* `isAppRollback` - indicating whether it was initiated by a `ctx.rollback()`.
* ExtThreadSafety: if false, then any `TransactionOperationFailed` raised by this section does _not_ set internal state bits. (The user does not care about problems that happen during auto-rollback.)
The algo:
* If transaction has expired, and is not in ExpiryOvertimeMode, raise an AttemptExpired, which will be handled by the error block in this section.
* Call debug hook `beforeDocRolledBack`, passing this AttemptContext and the document's key.
* Perform the mutation:
* This is done as a Sub-Document mutateIn. Durability and timeouts are set as in [Durability](#Durability) and [Timeouts](#Timeouts). Use a set [RetryStrategy](#RetryStrategy).
* Protocol 1.0: CAS is zero.
* Protocol 2.0:
* Use cas of the staged `doc`. (This logic added in [CBD-3495](https://issues.couchbase.com/browse/CBD-3495))
* These specs are performed:
* Upsert "txn" as null. This is an xattr field.
* Remove "txn". This is an xattr field. See [Upsert-and-remove](#Upsert-and-remove) for an explanation of the idiom here.
* On success
* Call hook `afterRollbackReplaceOrRemove`, passing this AttemptContext and the document's key.
* Do not save the mutation token. There is no need, as a rolled-back attempt is the same as pre-transaction.
* On error err (from any of the preceding items in this section), classify as error class ec then:
* If ExpiryOvertimeMode -> time to bailout. Raise `Error(FAIL_EXPIRY, AttemptExpired(err), raise=TRANSACTION_EXPIRED, rollback=false)`.
* Else `FAIL_EXPIRY` -> Set ExpiryOvertimeMode and retry operation, after waiting [OpRetryBackoff](#OpRetryBackoff).
* Else `FAIL_PATH_NOT_FOUND` -> The transactional metadata already doesn't exist. Possibly we retried the op on `FAIL_AMBIGUOUS` and that op had succeeded. Perhaps something odd has happened and async cleanup has rolled back this doc while we've been trying to. Either way, our work on this document is done. Continue as success.
* Else `FAIL_DOC_NOT_FOUND` -> Should not happen, likely means the co-operative model has been broken. But as it's rollback and no mutations are going to be lost, do not raise an event. `Error(ec, err, rollback=false)`
* Protocol 2.0: Else `FAIL_CAS_MISMATCH` -> Either the co-operative model has been broken, or we're retrying on a previous FAIL_AMBIGUOUS that actually succeeded. We could resolve the ambiguity here, but it's somewhat expensive (would require reading the doc), and it's only rollback cleanup. So instead bailout and leave it for the cleanup process: `Error(ec, err, rollback=false)`
* Else `FAIL_HARD` -> `Error(ec, err, rollback=false)`
* Else -> Default current logic is that rollback will continue in the event of failures until expiry. Retry operation, after waiting [OpRetryBackoff](#OpRetryBackoff).