# Debugging Techniques
### Build in simple mode
If you develop a new feature and see a lot of simulation failures, you can often make your life easier by initially setting the variable `TestConfig::simpleConfig` to `true`.
This will instruct the simulator to only create single-replicated clusters with exactly one TLog, one Resolver, etc.
This will make simulation significantly weaker. However, it will also make debugging much easier since you don't need to worry about the more advanced features (like complex replication policies).
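A hedged sketch of what this can look like; the exact location and the variable name (`testConfig`) are assumptions about your checkout, e.g. somewhere in `SimulatedCluster.actor.cpp` before the simulated cluster is set up:
```
// Local debugging hack only: force simple configurations for every run.
// `testConfig` stands for the TestConfig instance used during cluster setup
// (an assumption), matching the TestConfig::simpleConfig flag mentioned above.
testConfig.simpleConfig = true; // one TLog, one Resolver, single replication
```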
### Look for Severity 40 Events
For example, `cmr | s40`. This assumes you have installed the bash functions from [here](https://github.com/jzhou77/foundationdb/wiki/My-src-.bashrc.local-file).
### Add More Trace Events
Adding trace events is better than `std::cout`, because trace events automatically carry additional information such as the simulation time and the process's role. The trace log is also easier for repeated search/grep operations.
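For example, a minimal sketch of an ad-hoc trace event (the event name and detail fields are placeholders):
```
// Ad-hoc debugging event; it shows up in the trace files with the simulated
// time, machine, and roles already attached, which makes grepping easy.
TraceEvent("MyFeatureDebug")
    .detail("Step", "afterCommit")
    .detail("Version", commitVersion) // assumes a local variable of interest
    .detail("Key", key);              // assumes a local key of interest
```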
### Keep in mind that time is deterministic
One very basic but very powerful property of the simulator is that time is simulated as well, and it is deterministic. This allows us to do a lot of interesting things:
* We can change the behavior after a certain point in time. For example in order to verify that a bug was fixed we could keep the old behavior right until the bug would happen and then switch to the new behavior.
* We can add code to make a test crash after some time has passed (this is useful for timeout tests where we don't want to wait for the test to finish).
* We can add additional tracing just during a critical time period (a sketch follows this list).
* ...
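For example, a minimal sketch of the last two ideas, gating extra behavior on the deterministic clock (the thresholds and event name are placeholders):
```
// Only emit extra tracing during a critical, deterministic time window...
if (now() > 350.0 && now() < 360.0) {
	TraceEvent("CriticalWindowDebug").detail("Now", now());
}
// ...or force the run to end once nothing interesting can happen anymore.
if (g_network->isSimulated() && now() > 600.0) {
	ASSERT(false);
}
```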
### Debug Timeout Failures
Timeout failures are often very tricky to debug. In general, what you will see is something like this:
```
<Event Severity="40" ErrorKind="Unset" Time="93.190464" DateTime="2021-12-15T22:06:54Z" Type="TestFailure" Machine="3.4.3.4:1" ID="840f29ad6f9058ab" Error="timed_out" ErrorDescription="Operation timed out" ErrorCode="1004" Reason="Error starting workload" Workload="DataLossRecovery" ThreadID="12328344666965720336" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x307110c 0x30702c0 0x30706a1 0x19e0eae 0x19e0f69 0xa07129 0xa7ec29 0xa07129 0x1ea7062 0xa07129 0x1eaa752 0x1eaac84 0xa07129 0xe5fac3 0xec0930 0x2f0d9a8 0x2f2cd30 0x2f0daad 0x2f2ceaf 0xec0930 0x300f08b 0x2f2d062 0x99c4ff 0x7f0ceb470555" LogGroup="default" Roles="TS" />
```
The problem with these is that it's very hard to know where to start, but there are some pointers that can help. Generally there are two types of timeout failures:
1. The main test workload never finishes, usually because it can't make progress for some reason (for example, a population workload can't write to the database).
2. Some auxiliary workload is timing out.
(2) is much more common than (1). The following sections give some pointers on where to start looking.
#### Verify that the last Recovery was successful
When grepping for `MasterRecoveryState`, the last log line should have the state `fully_recovered`. If it doesn't, the cluster got into a state where it couldn't self-heal anymore (availability loss), even though all failure injection has been disabled.
#### Check QuietDatabase
Grep for `QuietDatabase`. If you see a large number of trace lines of this type towards the end of the test, it could mean that the database didn't quiet down.
This check verifies multiple things, and any of them could have failed. For example:
* The storage queue of a storage server doesn't go to 0 (this could mean that a storage is unable to write to `IKeyValueStore` -- but it can also have other reasons).
* Data movement doesn't make progress. One reason for this could be data loss.
* Storage recruitment doesn't quiet down. This could be caused by a storage server repeatedly dying.
* ...
#### Check for ConsistencyCheck
ConsistencyCheck is another workload that can retry forever. This can also be verified by grepping for `ConsistencyCheck`. ConsistencyCheck tests many things (like DD, TSS, etc.).
### Make Timeout Tests Fail Quicker
One very annoying thing about tests that time out is that it often takes a very long time until the simulation actually times out, so adding traces and then rerunning the test takes a long time. Estimating how long will be enough and then pressing CTRL-C requires the user to sit there and do this manually.
Instead, the following trick will do the job:
1. Run the test once to completion.
2. Open the traces and look for a time after which the rest of the traces are no longer useful. Usually this will be somewhere in QuietDatabase, and the rest of the test will just repeat the same output until the end.
3. Change the simulator so it dies after that time.
(3) is relatively simple to do. Assume you want to kill the simulation after 603 seconds. Open `sim2.actor.cpp`, go to `runLoop`, and change the code to the following:
```
ACTOR static Future<Void> runLoop(Sim2* self) {
	state ISimulator::ProcessInfo* callingMachine = self->currentProcess;
	while (!self->isStopped) {
		wait(self->net2->yield(TaskPriority::DefaultYield));
		self->mutex.enter();
		if (self->tasks.size() == 0) {
			self->mutex.leave();
			ASSERT(false);
		}
		// if (!randLog/* && now() >= 32.0*/)
		//	randLog = fopen("randLog.txt", "wt");
		Task t = std::move(self->tasks.top()); // Unfortunately still a copy under gcc where .top() returns const&
		self->currentTaskID = t.taskID;
		self->tasks.pop();
		self->mutex.leave();
		self->execTask(t);
		self->yielded = false;
		if (self->now() > 603) { // added for debugging: abort once the interesting window has passed
			crashAndDie();
		}
	}
	self->currentProcess = callingMachine;
	self->net2->stop();
	for (auto& fn : self->stopCallbacks) {
		fn();
	}
	return Void();
}
```
This change does not affect determinism.
### Verify a Bug Fix
Generally it is very hard to verify whether a bug fix is correct or whether you simply changed determinism so that the test no longer fails. Oftentimes the bug is very obvious as soon as you find it, but sometimes it's good to verify that you found and fixed the right thing.
There's no way to verify a bug fix with 100% confidence. However, there is a useful trick that will give you some confidence. It uses (and you should see a pattern by now) the fact that time is deterministic. Let's say this is what your bug fix looks like:
```
ACTOR Future<Void> someActor() {
... // some code
myBugFix();
... // some more code
}
```
You know, by looking at the traces, that the bug happens at `Time = 351.87`. So now you can change the code above as follows:
```
ACTOR Future<Void> someActor() {
... // some code
if (now() > 351.8) { // right before things go bad
myBugFix();
}
... // some more code
}
```
So now up until shortly before the bug happens, the simulator will behave exactly as it did before. But right before something bad happens, your fix will be applied.
For timeout tests this is sometimes even easier. The test won't fail, but it will stop making progress. Sometimes you can now guard your fix behind a very large number of seconds and verify that the test can finish as soon as it reaches that time.
### `debugTransaction`
This is useful to track how a transaction is processed from the client (i.e., NativeAPI) to the server processes (e.g., proxies, TLogs, and storage servers).
By calling `tr.debugTransaction(UID debugID)`, both reads and commits of the transaction will be logged. Just search for the `debugID` in the logs, and note that this ID is attached to other IDs during processing.
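A minimal sketch, assuming a `Database cx` is in scope:
```
// Tag a transaction so that its reads and commits are traced end-to-end.
Transaction tr(cx);
UID debugID = deterministicRandom()->randomUniqueID();
tr.debugTransaction(debugID);
TraceEvent("MyDebugTransaction").detail("DebugID", debugID); // so you know what to grep for
```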
Note: adding this can change the random seed, so you may not be able to reproduce the problem.
### `DEBUG_MUTATION`
This is useful for tracking mutations of a specific version or a specific key (range). For example, in the [data corruption example](#sample-1-backup-failure-data-corruption-in-vv) below, this technique is used to confirm a key (and a backup mutation) is modified (committed).
To enable this, change `MUTATION_TRACKING_ENABLED` to `1` in `MutationTracking.h`, and modify `MutationTracking.cpp` with the version or key (range) you care about. Then look for `MutationTracking` in the trace events.
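The filter added to `MutationTracking.cpp` is typically just a key or version comparison. A hypothetical sketch of the idea (the function name and structure are illustrative, not the actual code in your checkout):
```
// Illustrative only: track a single key, including clears that cover it.
static const KeyRef debugKey = "0000000000lfn"_sr;

static bool shouldTrackMutation(MutationRef const& m) {
	if (m.type == MutationRef::ClearRange) {
		return m.param1 <= debugKey && debugKey < m.param2;
	}
	return m.param1 == debugKey;
}
```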
### `debugRandom`
When debugging, we can use `debugRandom()` just like `deterministicRandom()` to generate random numbers; it does not change the random sequence of the existing `deterministicRandom()` calls.
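A minimal sketch (the sampling probability and event name are placeholders):
```
// Sample extra randomness for debug-only logging; unlike deterministicRandom(),
// this does not perturb the simulation's random sequence.
if (debugRandom()->random01() < 0.01) {
	TraceEvent("DebugSampledEvent").detail("Now", now());
}
```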
### `fdbdecode` for Parsing Backup Mutation Logs
Sometimes we need to know which mutations are in a backup; in that case, `fdbdecode` comes in handy.
The following is an example of using this tool. Note that the mutation log starts at version `MinLogBeginVersion` and ends at `MaxLogEndVersion`; if the target mutation is within this range, the tool should find it.
```
./cbuild_output/bin/fdbdecode -r file:///root/simfdb/backups/backup-1970-01-01-00-01-48.793374/
Params: ContainerURL: file:///root/simfdb/backups/backup-1970-01-01-00-01-48.793374/, FileFilter: , list_only: false
URL: file:///root/simfdb/backups/backup-1970-01-01-00-01-48.793374/
Restorable: true
Partitioned logs: false
Snapshot: startVersion=299866656 (maxLogEnd -0.00 days) endVersion=301988136 (maxLogEnd -0.00 days) totalBytes=10647884 restorable=true expiredPct=0.00
SnapshotBytes: 10647884
MinLogBeginVersion: 298765213 (maxLogEnd -0.00 days)
ContiguousLogEndVersion: 338765213 (maxLogEnd -0.00 days)
MaxLogEndVersion: 338765213 (maxLogEnd -0.00 days)
MinRestorableVersion: 301988136 (maxLogEnd -0.00 days)
MaxRestorableVersion: 338765212 (maxLogEnd -0.00 days)
Relevant files are: 2 log files
beginVersion:298765213 endVersion:318765213 blockSize:1048576 filename:logs/0000/0000/log,298765213,318765213,e0f3e48993e59f5d28b677c352640d30,1048576 fileSize:24159023 tagId: (None)
beginVersion:318765213 endVersion:338765213 blockSize:1048576 filename:logs/0000/0000/log,318765213,338765213,c0b471ad226e5ff4f95a85cd04aa823e,1048576 fileSize:2441829 tagId: (None)
304663277 code: ClearRange param1: 0000000000j param2: 0000000000nrhvkfdmjstonitcacfetcxklqxzjdxmwlqkvcoyyruvbzrvjnshatxzbu
... (elided)
```
Note that each mutation is printed with its commit version, followed by the actual mutation data.
### Recovery Stuck at `all_logs_recruited`
1. Check the unpopped tags from `TLogPoppedTag` and find the corresponding `ServerTag` events.
2. Check the storage server IDs and IP addresses using `StorageMetrics`.
3. Check all `Assassination` trace events and try to find out whether the SSes are killed permanently. See `KillType`.
4. Check whether there are repeated `WaitForAllDataRemoved` events for the above SSes.
Example: https://snowflakecomputing.atlassian.net/browse/FDBCORE-34362
### Out-Of-Memory (OOM)
See this [wiki page](https://github.com/apple/foundationdb/wiki/Debug-Out-Of-Memory-(OOM)-Errors-in-Simulation).
### Unseed Mismatch
See this [doc](https://github.com/apple/foundationdb/blob/main/contrib/debug_determinism/README.md) for debugging nondeterminism issues.
### Leaked `Reference<T>`
If a run has a reference-counted object (using `Reference<T>`) that is unexpectedly not being destroyed, then we have a debugging tool that can help identify which references to that object still exist. This tool first needs to be enabled for the type that you want to track by inheriting from `DebugReferenceCounted` rather than `ReferenceCounted`:
```
class TenantBalancer : public DebugReferenceCounted<TenantBalancer> {
...
};
```
Then you would also need to turn it on for each instance you want to track. This is done in one of two ways:
```
UID dbgId = ...;
Reference<TenantBalancer> myRef = makeReference<TenantBalancer>(...);
myRef.debugTrackRef(dbgId);
// Alternatively
Reference<TenantBalancer> myRef = makeDebugReference<TenantBalancer>(dbgId, ...);
```
A reference counted instance that has tracking enabled will log a trace event any time a new `Reference` is created or destroyed, and the event will include the backtrace of the action taken. It is also possible to dump the set of open references at any time by calling `myRef->debugDumpRefs()`.
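A minimal usage sketch, placed right before the point where you expect the last reference to be released:
```
// Dump all still-open tracked references to this object, then drop ours.
myRef->debugDumpRefs();
myRef.clear(); // whatever the dump still shows afterwards is the leak candidate
```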
Below is a sequence of events for a leaked object. Right before the final `Reference` to the object was expected to be destroyed, `debugDumpRefs` was called on it:
```
----
Time | Type | RefId | RefNum | RefCount
----
54.184247 | ReferenceTrackerAddRefConstruct | 91cd3546b8480117 | 0 | 1
54.184247 | ReferenceTrackerAddRefCopy | 91cd3546b8480117 | 1 | 2
54.184247 | ReferenceTrackerAddRefCopy | 91cd3546b8480117 | 2 | 3
54.666904 | ReferenceTrackerDelRefDestroy | 91cd3546b8480117 | 2 | 2
54.666904 | ReferenceTrackerAddRefCopy | 91cd3546b8480117 | 3 | 3
54.684761 | ReferenceTrackerDelRefDestroy | 91cd3546b8480117 | 3 | 2
54.684761 | ReferenceTrackerAddRefCopy | 91cd3546b8480117 | 4 | 3
54.684761 | ReferenceTrackerAddRefCopy | 91cd3546b8480117 | 5 | 4
54.684761 | ReferenceTrackerAddRefCopy | 91cd3546b8480117 | 6 | 5
54.684761 | ReferenceTrackerAddRefCopy | 91cd3546b8480117 | 7 | 6
54.684761 | ReferenceTrackerAddRefCopy | 91cd3546b8480117 | 8 | 7
54.684761 | ReferenceTrackerDelRefDestroy | 91cd3546b8480117 | 7 | 6
54.684761 | ReferenceTrackerDelRefDestroy | 91cd3546b8480117 | 6 | 5
54.721178 | ReferenceTrackerAddRefCopy | 91cd3546b8480117 | 9 | 6
54.735973 | ReferenceTrackerDelRefDestroy | 91cd3546b8480117 | 4 | 5
54.735973 | ReferenceTrackerDelRefDestroy | 91cd3546b8480117 | 5 | 4
54.735973 | ReferenceTrackerDelRefDestroy | 91cd3546b8480117 | 9 | 3
54.735973 | ReferenceTrackerDelRefDestroy | 91cd3546b8480117 | 1 | 2
54.735973 | ReferenceTrackerDump | 91cd3546b8480117 | 0 |
54.735973 | ReferenceTrackerDump | 91cd3546b8480117 | 8 |
54.735973 | ReferenceTrackerDelRefDestroy | 91cd3546b8480117 | 0 | 1
```
Here we can see a sequence of various references to the object being created and destroyed (backtraces have been removed for clarity). The final dump shows that `RefNum` 8 is still open, and looking specifically at its events we see:
```
$ grep ReferenceTracker trace.* | grep 91cd3546b8480117 | grep "\"RefNum\": \"8\""
{ "Severity": "10", "Time": "54.684761", "DateTime": "2024-06-03T22:50:18Z", "Type": "ReferenceTrackerAddRefCopy", "Machine": "[abcd::2:0:1:2]:1", "ID": "0", "RefId": "91cd3546b8480117", "RefNum": "8", "RefCount": "7", "Backtrace": "addr2line -e fdbserver.debug -p -C -f -i 0x716e0e3 0x7092abc 0x41a83f4 0x41a8191 0x41a8ae1 0x41a7d56 0x40e5726 0x4170e98 0x417038f 0x416fd06 0x416db3d 0x414a618 0x41493ce 0x4150338 0x414b1b0 0x299f3c8 0x299e689 0x6514058 0x6513f53 0x651291d 0x6511e1a 0x6514812 0x27d5ea8 0x27d4778 0x27d468d 0x29443a8 0x29445b9 0x653cd98 0x653d369 0x652da98 0x65302bd 0x653089c 0x62ca614 0x62c6da2 0x62c3a58 0x62c03f7 0x3307198 0x330671c 0x2711ad8 0x2711772 0x6d0b7b1 0x6d0b036 0x27044e8 0x6e289cb 0x6e28490 0x45...", "ThreadID": "7508392161942362406", "LogGroup": "default", "Roles": "BM,CP,CS,DD,EP,MG,MS,RK,RV,SS,TB,TL" }
```
Expanding this backtrace gives us the location where the `Reference` is being held:
```
...
Reference at /mnt/ephemeral/abeamon/frostdb/flow/memory/include/flow/memory/FastRef.h:368
(inlined by) LocalMovementInfo::LocalMovementInfo(flow::memory::Reference<TenantBalancer, true>, flow::memory::Reference<metacluster::MetaclusterOperationContext<DatabaseContext>, false>, metacluster::metadata::MovementRecord, long) at /mnt/ephemeral/abeamon/frostdb/fdbserver/TenantBalancer.actor.cpp:358
...
```
### Gotchas
* Objects returned by `BinaryReader` share the `Arena` created inside the `BinaryReader` object, which means the lifetime of the returned objects can't exceed that of the `BinaryReader` object. Be careful about memory corruption issues if the `BinaryReader` object goes out of scope (a sketch of the pitfall follows this list).
* ASAN/TSAN/UBSAN tests may succeed with one compiler but fail with another.
* Set [_GLIBCXX_DEBUG](https://gcc.gnu.org/onlinedocs/libstdc++/manual/debug_mode_using.html) to debug STL-related issues.
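A minimal sketch of the `BinaryReader` pitfall mentioned above (the payload and helper are illustrative):
```
// BUG: the returned StringRef can reference memory owned by the BinaryReader
// (its Arena / backing buffer), which goes away when `reader` is destroyed.
StringRef decodeName(StringRef serialized) {
	BinaryReader reader(serialized, IncludeVersion());
	StringRef name;
	reader >> name;
	return name; // dangling once this function returns
}
// Safer: copy the result into an owning type (e.g. Standalone<StringRef>) or
// keep the BinaryReader alive for as long as the deserialized objects are used.
```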
## Sample Debug Sessions
### Sample 1: Backup failure (data corruption) in VV
* Branch: `version-vector-prototype`
* Commit: `44fe89333`
* Seed: `-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/ApiCorrectnessAtomicRestore.toml -s 600630827 -b off`
The test complains about data corruption:
```
DB Range:
7: 0000000000lfp 713
8: 0000000000lfvnnyyhhbpwdgoackhmfmytftucedqhgaccfeczajfuvpkaqihjuwuieifpnwyitsstypdoiixxoilcxyzyrzjsofkelnnyjhavlnoxjewbmdwwlhvq 980
Memory Range:
7: 0000000000lfn 164
8: 0000000000lfp 713
```
So apparently key `0000000000lfn` is missing from the DB.
Enabled mutation tracking for this key and found 5 clears at versions 303674707, 304036118, 304488242, 304663277, and 307827805,
and 1 set at version 314985785.
All of these are within the backup files' version range, but `fdbdecode` didn't find the last set mutation.
Then I confirmed the proxy added backup mutations with keys `\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xc6M9\x00\x00\x00\x00` to `\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xc6M9\x00\x00\x00\x04`.
I also confirmed these backup mutations are persisted at storage servers with customized mutation tracking of the above keys.
The backup mutations are at:
```
125.538085 BackupM ID=0000000000000000 Key=\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xc6M9\x00\x00\x00\x00
125.538085 BackupM ID=0000000000000000 Key=\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xc6M9\x00\x00\x00\x01
125.538085 BackupM ID=0000000000000000 Key=\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xc6M9\x00\x00\x00\x02
125.538085 BackupM ID=0000000000000000 Key=\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xc6M9\x00\x00\x00\x03
125.538085 BackupM ID=0000000000000000 Key=\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xc6M9\x00\x00\x00\x04
```
These mutations appear in `TLogCommitMessages` that have tags `0:0 0:9`.
Then we see `MutationTracking ID=a2b4b46f9d9a406c Label=BackupM At=fetchKeysFinalCommitInject Version=315120917 Tag=0:9`.
The SS applied the mutation:
```
Time="125.542238" DateTime="2021-10-22T03:50:50Z" Type="MutationTracking" Machine="2.0.1.3:1" ID="a4b37511d365cb0e" Label="BackupM" At="applyMutation" Version="314985785" Mutation="code: SetValue param1: \xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xc6M9\x00\x00\x00\x00 param2: \x01\x00\x01b\xb0
\x00..." ShardBegin="\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&]\x00\x00\x00\x00\x12J\xf8H\x00\x00\x00\x0f" ShardEnd="\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x8d\x00\x00\x00\x00\x12b|v\x00\x00\x00\x16"
ID=a4b37511d365cb0e Label=BackupM At=applyMutation Version=314985785
```
**Backup process**
The backup work starts in `BackupLogRangeTaskFunc::_execute`, which divides the range into smaller ranges; each of them calls `readCommitted()`.
Read range:
```
\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xb7\x42\x80
\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xc6\x84\xc0
```
Versions 314000000 to 315000000.
This key is missing from the range read above:
`\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xc6\x4d\x39\x00\x00\x00\x00`
The last key read is smaller than the missing key:
```
Time="129.662813" DateTime="2021-10-22T03:50:52Z" Type="TransactionDebug" ID="0000000000000001" Location="storageserver.getKeyValues.AfterReadRange"
...
129.664422 RCD ID=0000000000000000 Key=\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xbfp\x8e\x00\x00\x00\x1b Idx=5
Time="129.664422" DateTime="2021-10-22T03:50:52Z" Type="TransactionDebug" ID="0000000000000001" Location="NativeAPI.getRange.After"
```
The SS reads at version `314536078`, which is less than `314985785` (the version at which the mutation was applied) and less than `318803044` (the request version):
```
ReqSS="[{0:0,314536078} maxversion: 314536078] calculates: CommitVersion="314536078"
Time="129.483760" DateTime="2021-10-22T05:34:56Z" Type="GetRangeInfo" Machine="2.0.1.3:1" ID="a4b37511d365cb0e" Begin="\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xb7B\x80" End="\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xba\xd6" Version="314536078" ReqVersion="318803044" More="1" SizeOf="9"
```
The problem is apparent: the range read is not returning the correct data. After adding some trace events,
the problem was identified in `getKeyValuesQ()` at the SS, specifically these two lines:
```
Version commitVersion = getLatestCommitVersion(req.ssLatestCommitVersions, data->tag);
state Version version = wait(waitForVersion(data, commitVersion, req.version, span.context));
```
In the test, `req.ssLatestCommitVersions = [{0:0,314536078} maxversion: 314536078]`, so the computed `commitVersion` is `314536078`, and `req.version = 318803044`.
So the SS reads at `314536078 < 314985785 (the applied mutation version at the SS) < 318803044 (read version)`. As a result, the mutation is not sent back to the backup agent, causing mutations to be missing from the backup files.
The reason for the wrong VV is that the proxy doesn't count backup mutations. The fix is in [PR 5830](https://github.com/apple/foundationdb/pull/5830).
### Sample 2: VV data corruption with `CausalConsistencyError`
* commit: [b4ed47053](https://github.com/apple/foundationdb/commit/b4ed47053bc2f55ccd14c3afb2265324f6857724)
* patch: `init( PROXY_USE_RESOLVER_PRIVATE_MUTATIONS, true );`
* Seed: `-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/SwizzledRollbackSideband.toml -s 243572583 -b on`
* Compiler: `Clang`
Test failed with:
`Type="CausalConsistencyError" Machine="3.4.3.8:1" ID="10cb551589353523" MessageKey="Sideband/Message/c9037706b28b3e04" RemoteCommitVersion="841905696" LocalReadVersion="946662688"`
Enabled mutation tracking on key `Sideband/Message/c9037706b28b3e04` and saw the last write to this key at version 841905696:
```
315.681248 MutationTracking ID=2c1851ccbb311b41 Label=SomeKey At=applyMutation Version=841905696 Mutation=code: SetValue param1: Sideband/Message/c9037706b28b3e04 param2: deadbeef ShardBegin= ShardEnd=Sideband/Message/e
315.681248 MutationTracking ID=0f20d996da4c284b Label=SomeKey At=ServerPeekCursor Version=841905696 Mutation=code: SetValue param1: Sideband/Message/c9037706b28b3e04 param2: deadbeef MessageTags=-2:3 0:1
```
However, the storage server saw a VV with version 841885664:
```
325.140962 MutationTracking ID=2c1851ccbb311b41 Label=SomeKey At=ShardGetValue Version=841885664 Mutation=code: DebugKey param1: Sideband/Message/c9037706b28b3e04 param2: <null>
325.140962 MutationTracking ID=2c1851ccbb311b41 Label=SomeKey At=ShardGetPath Version=841885664 Mutation=code: DebugKey param1: Sideband/Message/c9037706b28b3e04 param2: 2
325.140962 MutationTracking ID=2c1851ccbb311b41 ReqVersion=946662688 VV=[{0:1,841885664} maxversion: 841885664] CommitVersion=841885664
```
Thus, the VV is not up to date. Further investigation showed there is a recovery between these events:
```
309.759683 2.0.1.1:1 ID=3c39387664f7a73b 12 all_logs_recruited ThreadID=8063736322934559140 Roles=CP,DD,GP,LR,MS,RK,SS,TL
319.510407 2.0.1.1:1 ID=da60ac7999cc737f 0 reading_coordinated_state ThreadID=8063736322934559140 Roles=BK,CP,DD,GP,LR,MS,RK,SS,TL
319.517734 2.0.1.1:1 ID=da60ac7999cc737f 1 locking_coordinated_state TLogs=1 ActiveGenerations=2 MyRecoveryCount=16 ForceRecovery=0 ThreadID=8063736322934559140 Roles=BK,CP,DD,GP,LR,MS,RK,SS,TL
320.166834 2.0.1.1:1 ID=da60ac7999cc737f 3 reading_transaction_system_state ThreadID=8063736322934559140 Roles=BK,CP,DD,LR,MS,RK,SS,TL
320.171453 2.0.1.1:1 ID=da60ac7999cc737f 7 recruiting_transaction_servers Conf={"backup_worker_enabled":1,"commit_proxies":5,"grv_proxies":1,"log_routers":6,"log_spill":2,"log_version":6,"logs":6,"perpetual_storage_wiggle":0,"perpetual_storage_wiggle_locality":"data_hall:0","proxies":6,"redundancy_mode":"single","regions":[{"datacenters":[{"id":"0","priority":2},{"id":"2","priority":1,"satellite":1},{"id":"4","priority":1,"satellite":1}]},{"datacenters":[{"id":"1","priority":1},{"id":"3","priority":1,"satellite":1},{"id":"5","priority":1,"satellite":1}]}],"remote_redu... RequiredCommitProxies=1 RequiredGrvProxies=1 RequiredResolvers=1 ThreadID=8063736322934559140 Roles=BK,CP,DD,LR,MS,RK,SS,TL
320.176199 2.0.1.1:1 ID=da60ac7999cc737f 8 initializing_transaction_servers CommitProxies=2 GrvProxies=1 TLogs=2 Resolvers=1 SatelliteTLogs=0 OldLogRouters=1 StorageServers=0 BackupWorkers=3 PrimaryDcIds=0 RemoteDcIds=1 ThreadID=8063736322934559140 Roles=DD,MS,RK,SS,TL
320.198125 2.0.1.1:1 ID=da60ac7999cc737f 9 recovery_transaction PrimaryLocality=-1 DcId=0 ThreadID=8063736322934559140 Roles=CP,DD,GP,LR,MS,RK,SS,TL
320.225664 2.0.1.1:1 ID=da60ac7999cc737f 10 writing_coordinated_state TLogList=0: 43fd2181d0ada4f366e1e610ea0e2194, 036201ec527e05078cf14be39319f12b ThreadID=8063736322934559140 Roles=CP,DD,GP,LR,MS,RK,SS,TL
324.128473 2.0.1.1:1 ID=da60ac7999cc737f 11 accepting_commits StoreType=memory-radixtree-beta RecoveryDuration=4.61807 ThreadID=8063736322934559140 Roles=CP,DD,GP,LR,MS,RK,SS,TL
324.128473 2.0.1.1:1 ID=da60ac7999cc737f 12 all_logs_recruited ThreadID=8063736322934559140 Roles=CP,DD,GP,LR,MS,RK,SS,TL
333.554110 2.0.1.1:1 ID=da60ac7999cc737f 14 fully_recovered FullyRecoveredAtVersion=955256309 ThreadID=8063736322934559140 Roles=BK,CP,DD,GP,LR,MS,RK,SS,TL
```
Added more instrumentation, and what happened seems to be: a client has a VV cache of `[{0:1,841885664} maxversion: 841885664]`; the client issues a transaction that commits at `841905696`. There is a recovery. The client got `Delta=[ maxversion: -1]` back. The client then issues a read at `946662688`, which uses the cached version `841885664 < 841905696`, thus reading corrupted data. The client-side code didn't know there was a recovery, so it kept using the cache from the older master, which caused this problem. The fix is to [reset the client-side cache once a new master is used](https://github.com/apple/foundationdb/pull/5942/commits/6540bef9ee5090f24c418da739719f188f02bf60).
### Sample 3: VV failure `DeliveredToNotAssigned` assertion failure at SS
* commit: [a8ba9bd02](https://github.com/apple/foundationdb/commit/a8ba9bd021ba8a05c35221c7b9b5581eb52dce72)
* patch: `init( PROXY_USE_RESOLVER_PRIVATE_MUTATIONS, true );`
* seed: `-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/ApiCorrectness.toml -s 809470518 -b on`
* Failure: `DeliveredToNotAssigned ID=0000000000000000 Version=184396025 Mutation=code: SetValue param1: \xff/moveKeysLock/Write param2: \x15\x1d\xc0%\x06\xb2\xa3)\x96[{*\x8b\xd5\x17\x1c`
* DB configuration is single: `new backup_worker_enabled:=1 log_routers:=4 log_spill:=2 log_version:=6 perpetual_storage_wiggle:=0 single remote_triple memory-2 storage_migration_type=disabled tss_count:=1 usable_regions:=2 tss_storage_engine:=1 regions=[{"datacenters":[{"id":"0","priority":2},{"id":"2","priority":1,"satellite":1,"satellite_logs":5}],"satellite_redundancy_mode":"one_satellite_double"},{"datacenters":[{"id":"1","priority":1},{"id":"3","priority":1,"satellite":1}],"satellite_redundancy_mode":"one_satellite_double"}]`
The failed SS has ID=eccf72642c631fe9 and Tag="0:0".
The failure `DeliveredToNotAssigned` means the SS received a mutation that it is not responsible for, so we need to look at the shard ranges on the SS and see how they are moved. Shard movement happens through MoveKeys operations.
Since the failed key is `\xff/moveKeysLock/Write`, naturally we'll enable mutation tracking for this key.
By going through the logs with more instrumentation on the source and destination of MoveKeys operations, we can find that there was a shard movement earlier:
`Time="117.703077" DateTime="2021-11-10T18:34:37Z" Type="RelocateShard_FinishMoveKeysWaitDurable" Machine="2.0.1.0:1" ID="9da64b1d90b87835" EndPair="3aac5b0622bfae4e" ReadyServers="2" ReadyTSS="1" SRCs="36486523b7b50a46,b0ecc74478ea7330,c201743d86431db5,eccf72642c631fe9" DESTs="36486523b7b50a46,c201743d86431db5"`
This is the final transaction, i.e., `FinishMoveKeys`. At the Resolver, `0:0` is no longer responsible for range `[0000000002zu, \xff\xff)`,
and the range is moved to `0:1`.
```
117.720628 SendingPrivatized_Span ID=8fe8fb3abf0743cd Span=0000000000000000
117.720628 SendingPrivateMutation ID=8fe8fb3abf0743cd Original=code: SetValue param1: \xff/serverKeys/\xe9\x1fc,dr\xcf\xec?3\\8\xab\x82\xe6\x13/0000000002zu param2: Privatized=code: SetValue param1: \xff\xff/serverKeys/\xe9\x1fc,dr\xcf\xec?3\\8\xab\x82\xe6\x13/0000000002zu param2: Server=eccf72642c631fe9 TagKey=\xff/serverTag/\xe9\x1fc,dr\xcf\xec?3\\8\xab\x82\xe6\x13 Tag=0:0
117.720628 SendingPrivatized_Span ID=8fe8fb3abf0743cd Span=0000000000000000
117.720628 SendingPrivateMutation ID=8fe8fb3abf0743cd Original=code: SetValue param1: \xff/serverKeys/\xe9\x1fc,dr\xcf\xec?3\\8\xab\x82\xe6\x13/\xff\xff param2: Privatized=code: SetValue param1: \xff\xff/serverKeys/\xe9\x1fc,dr\xcf\xec?3\\8\xab\x82\xe6\x13/\xff\xff param2: Server=eccf72642c631fe9 TagKey=\xff/serverTag/\xe9\x1fc,dr\xcf\xec?3\\8\xab\x82\xe6\x13 Tag=0:0
```
More instrumentation confirmed the transaction committed: `117.921585 MKIRDone ID=c5bcfd554ae74b5a ReadVersion=181974393 CommitVersion=182968702 Range=Begin:0000000002zuEnd:\\xff\\xff` and SS applied this mutation:
`119.844811 SSPrivateMutation ID=eccf72642c631fe9 Mutation=code: SetValue param1: \xff\xff/serverKeys/\xe9\x1fc,dr\xcf\xec?3\\8\xab\x82\xe6\x13/0000000002zu param2: Version=182968702`.
The shard has been moved to DESTs `0:1,1:1`, and the next MoveKeys lock write is sent to SS `0:1`, so the CP tags the mutation correctly:
`"117.722621" DateTime="2021-11-10T18:34:37Z" Type="MutationTracking" Machine="2.0.1.2:1" ID="84ae49626c5ed171" Label="MoveKey" At="ProxyCommit" Version="182968702" Mutation="code: SetValue param1: \xff/moveKeysLock/Write param2: \\\xf0\x05t\xd23\x0c\xbb\xb3\xfe'\xd3\x97\xad\x15g" To="0:1,1:1"`
Another MoveKeys operation starts for range `[0000000001z, 0000000002zu)`:
```
119.089702 RelocateShardHasDestination ID=22e7af88890c7882 PairId=3e16733cbbb3e507 KeyBegin=0000000001z KeyEnd=0000000002zu SourceServers=b0ecc74478ea7330,eccf72642c631fe9 DestinationTeam=36486523b7b50a46,c201743d86431db5 ExtraIds=[no items]
119.135619 MKIRStartMK ID=3e16733cbbb3e507 ReadVersion=182968702 VV=[{0:0,182454221}{1:1,182968702}{1:0,182454221}{0:1,182968702}{1:2,182368775} maxversion: 182968702] Range=Begin:0000000001zEnd:0000000002zu EndKey=0000000002zu
```
Looking closely, we found that the VV has `{0:0,182454221}`, which is the wrong version (`182454221 != 182968702`),
because the previous move wrote to `0:0` at `182968702`.
Now we have a breakthrough! The problem is the version vector: we instrumented the master at version `182968702`
and confirmed that tag `0:0` is not among the written tags sent from the commit proxy. So this is the problem.
By examining the code, we noticed that private mutations are generated on the Resolver, but their tags are not sent back
to the commit proxies, so the version vector is not updated with their tags. The fix is in [PR 5960](https://github.com/apple/foundationdb/pull/5960).