In CouchDB, `mem3_sync` is the mechanism that keeps shards of the same range synchronised across multiple cluster nodes. It does so by implementing the CouchDB replication protocol over Erlang distributed messages: it reads the by-seq index of the source shard and copies all live content into the target shard.

During cluster management operations like increasing the number of nodes, it can become necessary to move an existing shard from one node (A) to another (B) where no such shard exists on node B yet, because B has only just joined the cluster and no database shard map points to it.

The easiest way to move a shard to node B is to edit the shard map of the respective database and change the entries for a particular shard range or ranges to point to node B instead of node A. `mem3` will recognise the change in the shard map, notice the missing shard on node B, and initiate `mem3_sync` to fill up the shard until it matches the copies on the other cluster nodes.

Unfortunately, implementing CouchDB replication means that data transfer does not happen at network speed: following the protocol correctly requires multiple request/response cycles for each batch of operations. When replicating into an empty shard, however, none of these extra roundtrips are strictly necessary. They are needed later, when a shard might be slightly behind the other nodes and has to be caught up correctly, but for the initial sync `mem3_sync` does more work than strictly needed.

A common recommendation for speeding up this process, especially for large databases, is to first `scp` or `rsync` the `.couch` shard file to the new host and then let `mem3_sync` top it up. It would be a lot nicer if CouchDB chose a faster binary copy on its own when it knows it is replicating into an empty shard file.

## Proposal

We propose the following algorithm to speed up the process (rough Erlang sketches of the individual steps follow at the end of this document):

- when `mem3_sync` is invoked: check if the target shard exists on node B.
- if yes: use the regular `mem3_sync` functionality; nothing changes.
- if no:
  - create a [hard link](https://en.wikipedia.org/wiki/Hard_link) of the source shard:
    - name it `dbname.timestamp.couch.hardlink.file-length`, e.g. `db.1234567890.couch.hardlink.7865432`
    - ensure that compaction cannot occur between recording the file length and creating the hard link (`ln`, or `file:make_link/2` from Erlang); this should therefore happen inside the `couch_file` `gen_server` loop.
    - if the hard-link operation fails, fall back to the normal `mem3_sync` functionality and log a warning (most likely the file system does not support hard links).
  - create a target shard to copy data into, named after the shard file plus an `.initial` suffix.
    - if the suffixed file already exists on node B, compute a SHA-256 checksum of it and of the same-length prefix of the hard link (never past the file length recorded in the hard-link name). If the checksums match, continue copying bytes from `length(target)` up to `length(source)` as recorded in the hard-link name; do *not* read until `EOF`.
    - if not: start copying bytes from position 0 in the source file on node A until the length specified in the hard-link name is reached.
  - when done:
    - remove the hard link on node A.
    - optional: compute another SHA-256 of the final file and compare it against the source.
    - continue with the regular `mem3_sync` functionality to top up the shard file.
    - then rename the file to remove the `.initial` suffix.

TODO:

- decide when/how to clean up stale `.hardlink` and `.initial` files.
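
## Sketches

To make the steps above more concrete, here are rough Erlang sketches. All module and function names (`mem3_bcopy_snapshot`, `mem3_bcopy_copy`, `mem3_bcopy_finish`) are made up for illustration and only use standard OTP `file`, `filelib`, `crypto`, and `logger` calls; a real implementation would live inside `mem3` and `couch_file` and is not spelled out here.

First, the snapshot step on node A: record the current file length, then create a hard link with that length embedded in its name. In CouchDB proper this would have to run inside the `couch_file` `gen_server` loop so that compaction cannot swap the file out between the two operations.

```erlang
%% Hypothetical module: snapshot step on node A.
-module(mem3_bcopy_snapshot).
-export([make_snapshot/1]).

%% Record the shard file length and hard-link it under a name that embeds
%% that length, e.g. db.1234567890.couch.hardlink.7865432. Returns the link
%% path and length on success, or {error, Reason} so the caller can fall
%% back to the regular mem3_sync behaviour.
make_snapshot(ShardPath) ->
    Length = filelib:file_size(ShardPath),
    HardLink = ShardPath ++ ".hardlink." ++ integer_to_list(Length),
    case file:make_link(ShardPath, HardLink) of
        ok ->
            {ok, HardLink, Length};
        {error, Reason} ->
            %% Most likely the file system does not support hard links:
            %% warn and let the caller fall back to plain mem3_sync.
            logger:warning("hard link ~s failed: ~p", [HardLink, Reason]),
            {error, Reason}
    end.
```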
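
Next, the copy into the `.initial` file on node B. For brevity the sketch reads the source hard link as if it were a local file; in the real cluster those reads would happen on node A and be shipped over distributed Erlang. The resume logic checksums the common prefix and only continues where a previous attempt left off, and it never reads past the length recorded in the hard-link name.

```erlang
%% Hypothetical module: copy step into the ".initial" file on node B.
-module(mem3_bcopy_copy).
-export([copy_initial/3]).

-define(BUFFER, 8 * 1024 * 1024).  %% copy and hash in 8 MiB chunks

%% SourcePath: the hard link created on node A,
%% Length:     the file length recorded in the hard-link name,
%% TargetPath: the shard path on node B plus the ".initial" suffix.
copy_initial(SourcePath, TargetPath, Length) ->
    Resume = resume_offset(SourcePath, TargetPath, Length),
    {ok, Src} = file:open(SourcePath, [read, raw, binary]),
    %% read+write so an existing partial file is not truncated
    {ok, Dst} = file:open(TargetPath, [read, write, raw, binary]),
    ok = copy_range(Src, Dst, Resume, Length),
    ok = file:close(Src),
    ok = file:close(Dst).

%% If a partial ".initial" file exists and its prefix matches the source,
%% resume after it; otherwise start over from byte 0.
resume_offset(SourcePath, TargetPath, Length) ->
    case filelib:is_regular(TargetPath) of
        false ->
            0;
        true ->
            Existing = filelib:file_size(TargetPath),
            case Existing =< Length andalso
                 checksum(TargetPath, Existing) =:= checksum(SourcePath, Existing) of
                true ->
                    Existing;
                false ->
                    ok = file:delete(TargetPath),  %% mismatch: start over
                    0
            end
    end.

%% Copy bytes [From, Length) from Src to Dst; never read until EOF.
copy_range(_Src, _Dst, From, Length) when From >= Length ->
    ok;
copy_range(Src, Dst, From, Length) ->
    {ok, Data} = file:pread(Src, From, min(?BUFFER, Length - From)),
    ok = file:pwrite(Dst, From, Data),
    copy_range(Src, Dst, From + byte_size(Data), Length).

%% SHA-256 over the first Len bytes of a file.
checksum(Path, Len) ->
    {ok, Fd} = file:open(Path, [read, raw, binary]),
    Digest = checksum_loop(Fd, 0, Len, crypto:hash_init(sha256)),
    ok = file:close(Fd),
    Digest.

checksum_loop(_Fd, Pos, Len, Ctx) when Pos >= Len ->
    crypto:hash_final(Ctx);
checksum_loop(Fd, Pos, Len, Ctx) ->
    {ok, Data} = file:pread(Fd, Pos, min(?BUFFER, Len - Pos)),
    checksum_loop(Fd, Pos + byte_size(Data), Len, crypto:hash_update(Ctx, Data)).
```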
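
Finally, the finish-up, again with illustrative names only: once the byte copy is complete, the hard link on node A is removed, the regular `mem3_sync` top-up runs against the `.initial` file, and only then is the file renamed into place.

```erlang
%% Hypothetical finish-up helpers. The regular mem3_sync top-up would run
%% between remove_snapshot/1 on node A and promote/1 on node B.
-module(mem3_bcopy_finish).
-export([remove_snapshot/1, promote/1]).

%% Node A: drop the extra directory entry; the shard file itself stays.
remove_snapshot(HardLinkPath) ->
    file:delete(HardLinkPath).

%% Node B: strip the ".initial" suffix once the copy and top-up are done.
promote(TargetPath) ->
    Final = filename:rootname(TargetPath, ".initial"),
    file:rename(TargetPath, Final).
```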