In CouchDB, mem3_sync is the mechanism that keeps shard copies of the same range synchronised across multiple cluster nodes. It does so by implementing the CouchDB replication protocol over Erlang distributed messages. In slightly more detail, this means it reads the by-seq index of the source shard and copies all live content into the target shard.

During cluster-management operations, such as increasing the number of nodes, it can become necessary to move an existing shard from one node (A) to another (B) that does not yet hold a copy of that shard, because B has only just joined the cluster and no database shard map points to it yet.

The easiest way to move a shard to node B is to edit the shard map of the database in question and change the entries for a particular shard range (or ranges) to point to node B instead of node A. mem3 recognises the change in the shard map, notices the missing shard on node B, and initiates mem3_sync to fill the new shard until it matches the copies on the other cluster nodes.
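
For illustration only: in recent CouchDB releases the shard map is a document in the node-local _dbs database (reachable through the /_node/_local/_dbs/{dbname} endpoint), and moving the only range of a database from node A to node B amounts to an edit roughly along these lines. The database name, node names, range, revision and suffix below are all made up, and the exact shape can differ between versions:

    {
      "_id": "mydb",
      "_rev": "2-e13fb7e79af3b3107ed62925058bfa3a",
      "shard_suffix": [46, 49, 53, 51, 48, 50, 56, 51, 53, 48, 56],
      "changelog": [
        ["add", "00000000-ffffffff", "couchdb@node-a"],
        ["add", "00000000-ffffffff", "couchdb@node-b"],
        ["remove", "00000000-ffffffff", "couchdb@node-a"]
      ],
      "by_node": {
        "couchdb@node-b": ["00000000-ffffffff"]
      },
      "by_range": {
        "00000000-ffffffff": ["couchdb@node-b"]
      }
    }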

Unfortunately, implementing the CouchDB replication protocol means that data transfer does not happen at network speed, because following the protocol correctly requires multiple request/response cycles for each batch of operations.

However, when replicating into an empty shard, none of these extra round trips are strictly necessary. They are needed later, when a shard might be slightly behind the other nodes and has to be caught up correctly, but for the initial sync mem3_sync does more work than strictly needed.

A common recommendation for speeding up this process, especially for large databases, is to first scp or rsync the .couch shard file to the new host and then let mem3_sync top it up.

It would be a lot nicer if CouchDB chose a faster binary copy on its own when it knows it is replicating into an empty shard file.

Proposal

We propose the following algorithm to speed up the process:

  • when mem3_sync is invoked: check if the target shard exists on node B.
  • if yes: use the regular mem3_sync functionality; nothing changes.
  • if no:
    • create a hard link of the source shard file (see the first sketch after this list):
      • name it dbname.timestamp.couch.hardlink.file-length, e.g. db.1234567890.couch.hardlink.7865432
      • ensure that compaction cannot occur between recording the file length and creating the hard link; this should therefore happen inside the couch_file gen_server loop.
      • if the hard-link operation fails, fall back to the normal mem3_sync functionality and log a warning that the file system does not support hard links
  • create a target shard file to copy data into, named after the shard file plus a .initial suffix (see the second sketch after this list).
    • if the suffixed file already exists on node B, compute a sha256 checksum of it and of the same-length prefix of the hard link. If the checksums match, resume copying bytes from length(target) up to length(source) as recorded in the hard link name; do not read until EOF.
    • if not: start copying bytes from position 0 in the source file on node A up to the length recorded in the hard link name
  • when done:
    • remove the hard link on node A
    • optional: compute another sha256 of the final file and compare it against the source
    • continue with the regular mem3_sync functionality to top up the shard file
    • then rename the file to remove the .initial suffix (see the third sketch after this list)
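
The following is a minimal sketch of the source-side hard-link step, not the actual mem3 implementation; the module and function names are hypothetical, and in a real implementation the length lookup and link creation would have to run inside the couch_file gen_server loop so that compaction cannot swap the file out in between:

    %% Hypothetical helper for the source node (A). Records the current file
    %% length and creates "<ShardPath>.hardlink.<Length>" as a hard link to
    %% the live shard file, or signals a fallback to plain mem3_sync.
    -module(shard_hardlink).
    -export([create/1, remove/1]).
    -include_lib("kernel/include/file.hrl").

    create(ShardPath) ->
        {ok, #file_info{size = Length}} = file:read_file_info(ShardPath),
        LinkPath = ShardPath ++ ".hardlink." ++ integer_to_list(Length),
        case file:make_link(ShardPath, LinkPath) of
            ok ->
                {ok, LinkPath, Length};
            {error, Reason} ->
                %% e.g. the file system does not support hard links
                couch_log:warning("hard link failed: ~p, falling back to mem3_sync", [Reason]),
                fallback
        end.

    remove(LinkPath) ->
        file:delete(LinkPath).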
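
And a sketch of the resume check on the target node (B), again with hypothetical names; it only decides where copying should restart and computes the prefix checksum that would be compared against the same prefix of the hard link on node A (the actual byte transfer over Erlang distribution is not shown):

    %% Hypothetical helper for the target node (B). Given the ".initial"
    %% file and the source length recorded in the hard link name, decide
    %% whether to start from byte 0 or resume after the verified prefix.
    -module(shard_initial_copy).
    -export([resume_offset/2, sha256_prefix/2]).
    -include_lib("kernel/include/file.hrl").

    resume_offset(InitialPath, SourceLength) ->
        case file:read_file_info(InitialPath) of
            {error, enoent} ->
                {start, 0};                     %% nothing copied yet
            {ok, #file_info{size = Have}} when Have > SourceLength ->
                {start, 0};                     %% stale or foreign file: start over
            {ok, #file_info{size = Have}} ->
                {ok, Checksum} = sha256_prefix(InitialPath, Have),
                %% caller resumes at Have if Checksum matches node A's prefix,
                %% otherwise it should start over from byte 0
                {resume, Have, Checksum}
        end.

    %% Checksum the first Length bytes in 1 MiB chunks so large shard
    %% files are never read into memory at once.
    sha256_prefix(Path, Length) ->
        {ok, Fd} = file:open(Path, [read, raw, binary]),
        try
            hash_chunks(Fd, 0, Length, crypto:hash_init(sha256))
        after
            file:close(Fd)
        end.

    hash_chunks(_Fd, Pos, Length, Ctx) when Pos >= Length ->
        {ok, crypto:hash_final(Ctx)};
    hash_chunks(Fd, Pos, Length, Ctx) ->
        {ok, Bin} = file:pread(Fd, Pos, min(1 bsl 20, Length - Pos)),
        hash_chunks(Fd, Pos + byte_size(Bin), Length, crypto:hash_update(Ctx, Bin)).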
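
Finally, a sketch of the finishing steps, with the same caveat that all names are made up; the important part is that the rename only happens after the regular top-up, so an interrupted copy never appears under the real shard file name:

    %% Hypothetical wrap-up. The hard link lives on node A, so it is removed
    %% over Erlang distribution; the rename is the last step on node B.
    finish_initial_copy(NodeA, LinkPath, TargetPath) ->
        InitialPath = TargetPath ++ ".initial",
        ok = rpc:call(NodeA, file, delete, [LinkPath]),
        %% optional: re-checksum InitialPath against the source here, then let
        %% the regular mem3_sync top-up run against InitialPath (not shown)
        ok = file:rename(InitialPath, TargetPath).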

TODO:

  • decide when/how to clean up stale .hardlink and .initial files.