Flow Best Practices

# Flow Best Practices This document covers some of the best practices when programming in flow. It is *not* meant to serve as a flow tutorial or documentation. The reader is expected to have a basic understanding of flow. ## Flow Header Files :::success In a `.actor.h` file an actor can be declared like this: ```cpp ACTOR Future<Void> someActor(SomeStruct* self, std::string someCopiedType); ``` ::: :::danger Don't forward declare actors in a plain `.h`-file like this: ```cpp! Future<Void> someActor(SomeStruct* const& self, std::string const& someCopiedType); ``` ::: Don't be afraid writing flow code (including forward declaration) into header files. These have to have the suffix `.actor.h`. The main confusion people have when writing these, is how the include guard should look like. Writing an include-guard is a bit tricky, but you can use a simple template: ```cpp= #if defined(NO_INTELLISENSE) && !defined(FILE_NAME_ACTOR_G_H) #define FILE_NAME_ACTOR_G_H #include "fdbmodule/FileName.actor.g.h" #elif !defined(FILE_NAME_ACTOR_H) #define FILE_NAME_ACTOR_H // all your includes should come here #include "flow/actorcompiler.h" // this has to be the last include // .. all your code should be here #include "flow/unactorcompiler.h" // IMPORTANT!! #endif ``` You can use this template pretty easily and it's not necessary to really understand how this works (though that usually helps). When we compile outside of an IDE we pass `-DNO_INTELLISENSE` to the compiler. So in the IDE (when using `clangd` or compiling with `-DOPEN_FOR_IDE`) the first `#if`-block will be ignored. In that case we just can think of this as the normal `#ifndef`-`#define`-`#endif` pattern that is usually used in C/C++. If `-DNO_INTELLISENSE` is passed, the code is a bit more interesting. For that case, the code-flow is a bit more complicated: 1. When the file is preprocessed, it will go into the first `#if`-block where it first defines `FILE_NAME_ACTOR_G_H` and then includes the *generated* header file (the `.actor.g.h`-file). 2. In that file, the exact some code block will be evaluated again, but this time `FILE_NAME_ACTOR_G_H` will be set, so the first `#if` condition will be `false`. 3. The `#elif` block will be true. So now `FILE_NAME_ACTOR_H` will be defined and all generated code is included. 4. If the same `actor.h` file is included a second time, the `#if` and the `#elif` conditions will both evaluate to `false` and therefore no additional code is included (so this acts as a proper include guard). ### `flow/actorcompiler.h` and `flow/unactorcompiler.h` :::success For each `actor.cpp` file include `flow/actorcompiler.h` as your last include. For each `actor.h` file include `flow/actorcompiler.h` as your last include and include `flow/unactorcompiler.h` at the end of the file ::: These two files contain macros to make flow-expressions look like real C++ to a C++ compiler. So they are not strictly needed, but are used by people to make IDEs work. For any actor-file, `flow/actorcompiler.h` has to be included last. Especially system headers and/or thrid-party headers will not work properly otherwise in IDE mode. The biggest issue usually is that `flow/actorcompiler.h` defines a macro with the name `state` that evaluates to nothing (so if any included file uses `state` as an identifier, the compiler will fail if we use IDE mode). In header-files `flow/unactorcompiler.h` has to be included at the end of the header. This is to solve the same problem that adding `actorcompiler.h` as the last include solves. `unactorcompiler.h` simply undefined all macros so that headers that are included later will work properly. ## Be mindful of C++ Scoping Rules :::success Unless you have a **very** good reason not to, always declare all `state` variables at the top of the function. Use `store` when the result of a `wait` needs to be assigned to a state variable. ::: :::danger Don't redeclare variables in the same scope. ::: Flow breaks C++ scoping rules. Mostly `state` variables behave weird because of the wait they work. The actor compiler translates an `ACTOR` to a class with a number of methods. This means that a `state`-variable is always declared in the scope of the actor and NOT within any inner scope no matter where in the code the state variable is defined. This can lead to hard to follow code. ### Examples :::danger This is legal, but bad, flow code, don't do this: ```cpp ACTOR Future<Void> foo() { state bool bar = false; loop { if (bar) { doSomething(result); // This will be confusing } state Standalone<RangeResult> result = wait(getSomeRange()); if (...) { bar = true; } } } ``` ::: :::success Instead, write code like this: ```cpp ACTOR Future<Void> foo() { state bool bar = false; state Standalone<RangeResult> result; loop { if (bar) { doSomething(result); // this makes sense now } // store assigns the result of a future to its // first argument wait(store(result, getSomeRange())); if (...) { bar = true; } } } ``` `store` creates an actor, which is not free. In most cases this won't matter. However, if you change code in a hot path, you might want to avoid this. In that case, this is going to avoid this extra actor (at the cost of being slightly uglier): ``` RangeResult _result = wait(getSomeRange()); result = std::move(_result); ``` ::: :::danger Don't do this: ```cpp bool res = wait(someTest()); ASSERT(res); bool res = wait(someOtherTest()); ASSERT(res); ``` ::: :::success Use scoping: ```cpp { bool res = wait(someTest()); ASSERT(res); } { bool res = wait(someOtherTest()); ASSERT(res); } ``` A state variable (uses more memory on the heap, but in most cases that is a non-issue): ```cpp state bool testResult; wait(store(testResult, someTest())); ASSERT(testResult); wait(store(testResult, someOtherTest())); ASSERT(testResult); ``` Or distinct variable names: ```cpp= bool res1 = wait(someTest()); ASSERT(res1); bool res2 = wait(someOtherTest()); ASSERT(res2); ``` ::: ## Use descriptive variable names for `state` variables :::success State variables should have names that are descriptive and likely to be unique. ::: Due to the way the flow compiler works, `state` variables will be in scope for the entire `ACTOR` in which they are declared. In general variable names should get longer and more descriptive as their scope gets larger, and based on that principle `state` variables should have correspondingly descriptive names. Additionally, accidentally shadowing a `state` variable can mask subtle bugs where a non-`state` variable goes out of scope (e.g. due to a `wait` statement) but the code continues to work due to the presence of the `state` variable. This can be especially tricky in Flow due to the differences in scoping rules within an `ACTOR`. :::danger Don't do this: ```cpp state int i; for (i = 0; i < items.size(); ++i) { wait(doSomething(items[i])); } ... for (int i = 0; i < otherItems.size(); ++i) { // This will not use the 'i' that is expected wait(doSomething(otherItems[i])); } ``` ::: :::success Use a descriptive name for the state variable: ```cpp state int itemsIndex; for (itemsIndex = 0; i < items.size(); ++itemsIndex) { wait(doSomething(items[itemsIndex])); } ``` ::: ### Special cases There are a few special cases in the FoundationDB codebase where we use short `state` variable names. In particular, these include `tr` for transactions and `db` for databases. It is acceptable to use these short names similarly in new code, but consider using more descriptive names if an `ACTOR` is running multiple transactions or working with multiple databases. ## Use `resetReply` when reusing a request object In other words, a `ReplyPromise` needs to be reset; otherwise, old response already received will be replayed in old FDB code and the request is *not* sent out for the second time. Note how `resetReply(request)` is invoked in the example below in the `catch` block to avoid the problem described. ```cpp Future<REPLY_TYPE(Req)> retryBrokenPromise(RequestStream<Req, P> to, Req request) { // Like to.getReply(request), except that a broken_promise exception results in retrying request immediately. // Suitable for use with well known endpoints, which are likely to return to existence after the other process // restarts. Not normally useful for ordinary endpoints, which conventionally are permanently destroyed after // replying with broken_promise. loop { try { REPLY_TYPE(Req) reply = wait(to.getReply(request)); return reply; } catch (Error& e) { if (e.code() != error_code_broken_promise) throw; resetReply(request); wait(delayJittered(FLOW_KNOBS->PREVENT_FAST_SPIN_DELAY)); CODE_PROBE(true, "retryBrokenPromise"); } } } ``` Fortunately, this problem will trigger an assertion failure in `getReply()` now, which was added to prevent it: ```cpp template <class X> Future<REPLY_TYPE(X)> getReply(const X& value) const { ASSERT(!getReplyPromise(value).getFuture().isReady()); ... ``` ## Arena Memory Management The memory management uses `Arena` abstraction a lot, which owns some memory blocks, typically for `MutationRef`, `VectorRef` these `*Ref` types. For example, a `StringRef` object is usually a pointer and a length to one of the blocks `Arena` owns. Thus, it's very important that the `Arena` is live while accessing these `*Ref` objects. `Arena` is internal reference counted objects. Thus, if you declare `Arena b = a;`, then both `a` and `b` share the same memory blocks that `a` owns before. Even if `a` goes out of scope, these memory blocks are still valid. Alternatively, you can have `b.dependsOn(a)`, so that `b` also holds references to the memory blocks `a` owns before. A caveat is that memory allocated by `a` afterward may *NOT* be reference counted by `b`. This is because the new memory block is only referenced by `a`, not `b`. If you are curious about the implementation, here is the explanation: ``` // For an arena A, one would expect using B.dependsOn(A) would extent A's lifecycle to B's lifecycle. // However this is not quite true. Internally arenas are implemented as linked lists, or // // A.impl = ArenaBlock -> ArenaBlock -> ... // // where each ArenaBlock is a piece of allocated memory managed by reference counting. Applying // B.dependsOn(A) would cause B refers to the first ArenaBlock in A, or // // B----| // V // A.impl = ArenaBlock -> ArenaBlock -> ... // // so when A destructs, there is still at least one extra reference to the first ArenaBlock, // preventing the release of the chained blocks. // However, when additional memory is requested to A, A will create additional ArenaBlocks (see // Arena.cpp:ArenaBlock::create), and the new ArenaBlock might be inserted *prior* to the // existing ArenaBlock, i.e. // B----| // V // A.impl = ArenaBlock -> ArenaBlock -> ArenaBlock ... // ^- new created // and when A is destructed, the unreferrenced (hereby the first one) will be destructed. // This causes B depends on only part of A. The only way to ensure all blocks in A have // the same life cycle to B is to let B.dependsOn(A) get called when A is stable, or no // extra ArenaBlocks will be created. ``` ## Serialization with FlatBuffers FDB implements its own [FlatBuffers](https://google.github.io/flatbuffers/), originally invented by Google. FlatBuffers allow accessing serialized data without parsing/unpacking. ``` struct GetKeyValuesReply : public LoadBalancedReply { constexpr static FileIdentifier file_identifier = 1783066; Arena arena; VectorRef<KeyValueRef, VecSerStrategy::String> data; Version version; // useful when latestVersion was requested bool more; bool cached = false; GetKeyValuesReply() : version(invalidVersion), more(false), cached(false) {} template <class Ar> void serialize(Ar& ar) { serializer(ar, LoadBalancedReply::penalty, LoadBalancedReply::error, data, version, more, cached, arena); } }; ``` The above is an example of using FlatBuffers. The function `serialize()`, surprisingly, does both serialization and deserialization! This is because: ``` template <class Archive, class Item, class... Items> typename Archive::WRITER& serializer(Archive& ar, const Item& item, const Items&... items) { save(ar, item); if constexpr (sizeof...(Items) > 0) { serializer(ar, items...); } return ar; } template <class Archive, class Item, class... Items> typename Archive::READER& serializer(Archive& ar, Item& item, Items&... items) { load(ar, item); if constexpr (sizeof...(Items) > 0) { serializer(ar, items...); } return ar; } ``` The "Reader" (e.g., `BinaryReader`, `ObjectReader`, `ArenaReader`) calls `load(ar, item)` to deserialize all items, while the "Writer" (e.g., `BinaryWriter`, `ObjectWriter`) calls `save(ar, item)` to serialize all data items. Sometimes, we want the serialization and deserialization to behave differently according to the value of data items. For instance, a single key clear `MutationRef` can be compressed by sending only the key once, not twice (`key` and `keyAfter(key)`). To achieve this, the `serialize()` function below checks if we are serializing `ar.isSerializing` or deserializing `ar.isDeserializing`: ``` struct MutationRef { uint8_t type; StringRef param1, param2; template <class Ar> void serialize(Ar& ar) { if (ar.isSerializing && type == ClearRange && equalsKeyAfter(param1, param2)) { StringRef empty; serializer(ar, type, param2, empty); } else { serializer(ar, type, param1, param2); } if (ar.isDeserializing && type == ClearRange && param2 == StringRef() && param1 != StringRef()) { ASSERT(param1[param1.size() - 1] == '\x00'); param2 = param1; param1 = param2.substr(0, param2.size() - 1); } } ``` ### Append new fields in `serializer()` FlatBuffer has a requirement that new fields should be added to the end of the list. During the 7.1.4 release process, we encountered an interesting bug: suddenly the Python or Java clients can no longer issues API calls to the 7.1.2 cluster. On the other hand, the same client that uses the 7.1.4 `libfdb_c.so` can work perfectly fine with the 7.1.4 cluster. After several hours, a group of engineers noticed there were incompatible changes: ``` struct OpenDatabaseCoordRequest { ... void serialize(Ar& ar) { - serializer(ar, issues, supportedVersions, traceLogGroup, knownClientInfoID, clusterKey, coordinators, reply); + serializer(ar, + issues, + supportedVersions, + traceLogGroup, + knownClientInfoID, + clusterKey, + hostnames, + coordinators, + reply); } ``` Note `hostnames` is inserted into the middle of the list of items. This change breaks compatibility between the client and the server, because the 7.1.2 server would interpret `hostnames` as `coordinators`----flat buffer use the relative position of items for serialization. The [fix](https://github.com/apple/foundationdb/pull/7094) is to move `hostnames` to the end of the list. A side note is that within a major version, such as 6.3 or 7.1, FDB binaries should be compatible. More specifically using FDB's term, they should be protocol compatible. In this way, a client uses 6.3.9 C library can connect to 6.3.24 cluster without any issues. And the server cluster can upgrade to new minor release versions, without the upgrade requirement for the client C library. ## BinaryReader has an `Arena` A surprising fact is that `BinaryReader` has an internal `Arena` object, which is used in `BinaryReader::arenaRead()`: ``` const uint8_t* arenaRead(int bytes) { // Reads and returns the next bytes. // The returned pointer has the lifetime of this.arena() // Could be implemented zero-copy if [begin,end) was in this.arena() already; for now is a copy if (!bytes) return nullptr; uint8_t* dat = new (arena()) uint8_t[bytes]; serializeBytes(dat, bytes); return dat; } ``` Note `arenaRead()` allocates memory `new (arena()) uint8_t[bytes]`, and then copies bytes into the newly allocated memory. This is important to know, because for `StringRef`, its serialization (defined in [Arena.h](https://github.com/apple/foundationdb/blob/main/Arena.h) ) is: ``` template <class Archive> inline void load(Archive& ar, StringRef& value) { uint32_t length; ar >> length; value = StringRef(ar.arenaRead(length), length); } ``` :::danger So if a `BinaryReader` is used to deserialize a `StringRef`, then the returned value's life cycle is the same as the `BinaryReader` object. As a result, if the `BinaryReader` object goes out of the scope, the `MutationRef` returned by the `BinaryReader` object will points to invalid memory: ``` StringRef serialized = some_valid_string_ref; MutationRef m; { BinaryReader reader(serialized, IncludeVersion()); reader >> m; } // m.param1 now points to invalid memory ``` ::: ## `AsyncVar::set()` may not "set" the value The `AsyncVar::set()` compares the value with its internal state and only set if they are different: ``` void set(V const& v) { if (v != value) setUnconditional(v); } ``` This usually works. However, if the `!=` operator for the value `v` is overloaded, there could be unintended consequences. For example, the `ClientDBInfo` has the overloaded operator as: ``` bool operator!=(ClientDBInfo const& r) const { return id != r.id; } ``` And in `extractClientInfo()`, we had code like: ``` ClientDBInfo ni = db->get().client; shrinkProxyList(ni, lastCommitProxyUIDs, lastCommitProxies, lastGrvProxyUIDs, lastGrvProxies); info->set(ni); ``` This looks correct, but is wrong! The reason is that `shrinkProxyList()` modifies the `ni` object, but does not change the `id` field. As a result, even if `ni` has changed, `info->set(ni)` has no effect! This bug has caused infinite retrying GRVs at the client side. Specifically, the version vector feature introduces a change that the client side compares the returned proxy ID with its known set of GRV proxies and will retry GRV if the returned proxy ID is not in the set. Due to the above bug, GRV returned by a proxy is not within the client set, because the change was not applied to the "info". The fix is in [PR #6877](https://github.com/apple/foundationdb/pull/6877). ## `Promise::send()` synchronously invokes callbacks ## Uncancellable actors Some actors cannot be cancelled, e.g., an actor that issues a system call. A serious TLog corruption bug falls into this category: https://github.com/apple/foundationdb/pull/8527. The problem is that original `read(RawDiskQueue_TwoFiles* self, int file, int pageOffset, int nPages)` actor can be cancelled, however, the `read(mutateString(result), bytesRequested, pageOffset * sizeof(Page))` system call cannot be cancelled. As a result, the memory of result is freed and can be reused by other code after the actor cancellation, Then the system call will overwrite the memory with data from disk, corruption that memory location. This bug has been around for many years! And another audit of the code base found many other instances https://github.com/apple/foundationdb/pull/8526.