What is the CouchDB replication protocol? Is it like Git?


Git Problem Overview


Is there technical documentation describing how replication between two Couches works?

What is the basic overview of CouchDB replication? What are some noteworthy characteristics about it?

Git Solutions


Solution 1 - Git

Unfortunately, there is no detailed documentation describing the replication protocol. There is only the reference implementation built into CouchDB, and Filipe Manana's rewrite of it, which will probably become the new implementation in the future.

However, this is the general idea:

Key points

If you know Git, then you know how Couch replication works. Replicating is very similar to pushing or pulling with distributed source managers like Git.

CouchDB replication does not have its own protocol. A replicator simply connects to two DBs as a client, then reads from one and writes to the other. Push replication is reading the local data and updating the remote DB; pull replication is vice versa.

  • Fun fact 1: The replicator is actually an independent Erlang application, in its own process. It connects to both couches, then reads records from one and writes them to the other.
  • Fun fact 2: CouchDB has no way of knowing who is a normal client and who is a replicator (let alone whether the replication is push or pull). It all looks like client connections. Some of them read records. Some of them write records.
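
To make the "replicator is just a client" idea concrete, here is a minimal in-memory sketch. `ToyDB` and `replicate` are hypothetical names invented for illustration, not CouchDB APIs, and the sketch deliberately ignores revisions and HTTP. It only shows that push and pull are the same loop with source and target swapped.

```python
class ToyDB:
    """Hypothetical stand-in for a database: maps doc ID -> document."""

    def __init__(self, name):
        self.name = name
        self.docs = {}

    def read_all(self):
        # Something any ordinary client could do: read every record.
        return list(self.docs.items())

    def read(self, doc_id):
        return self.docs.get(doc_id)

    def write(self, doc_id, doc):
        # Something any ordinary client could do: write a record.
        self.docs[doc_id] = dict(doc)


def replicate(source, target):
    """Act as a plain client of both databases: read from one, write to the other."""
    copied = 0
    for doc_id, doc in source.read_all():
        if target.read(doc_id) != doc:
            target.write(doc_id, doc)
            copied += 1
    return copied


local = ToyDB("local")
remote = ToyDB("remote")
local.write("doc-a", {"name": "Jason", "awesome": True})
remote.write("doc-b", {"note": "only on the remote"})

pushed = replicate(source=local, target=remote)  # push: read local, write remote
pulled = replicate(source=remote, target=local)  # pull: read remote, write local
print(pushed, pulled, local.docs == remote.docs)
```

Neither `ToyDB` instance can tell whether the caller is a replicator or a normal client, which is the point of fun fact 2.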

Everything flows from the data model

The replication algorithm is trivial, uninteresting. A trained monkey could design it. It's simple because the cleverness is the data model, which has these useful characteristics:

  1. Every record in CouchDB is completely independent of all others. That sucks if you want to do a JOIN or a transaction, but it's awesome if you want to write a replicator. Just figure out how to replicate one record, and then repeat that for each record.
  2. Like Git, records have a linked-list revision history. A record's revision ID is the checksum of its own data. Subsequent revision IDs are checksums of: the new data, plus the revision ID of the previous.
  3. In addition to application data ({"name": "Jason", "awesome": true}), every record stores the evolutionary timeline of all previous revision IDs leading up to itself.
  • Exercise: Take a moment of quiet reflection. Consider any two different records, A and B. If A's revision ID appears in B's timeline, then B definitely evolved from A. Now consider Git's fast-forward merges. Do you hear that? That is the sound of your mind being blown.
  4. Git isn't really a linear list. It has forks, when one parent has multiple children. CouchDB has that too.
  • Exercise: Compare two different records, A and B. A's revision ID does not appear in B's timeline; however, one revision ID, C, is in both A's and B's timeline. Thus A didn't evolve from B. B didn't evolve from A. But rather, A and B have a common ancestor C. In Git, that is a "fork." In CouchDB, it's a "conflict."

    • In Git, if both children go on to develop their timelines independently, that's cool. Forks totally support that.
    • In CouchDB, if both children go on to develop their timelines independently, that's cool too. Conflicts totally support that.
    • Fun fact 3: CouchDB "conflicts" do not correspond to Git "conflicts." A Couch conflict is a divergent revision history, what Git calls a "fork." For this reason the CouchDB community pronounces "conflict" with a silent n: "co-flicked."
  5. Git also has merges, when one child has multiple parents. CouchDB sort of has that too.
  • In the data model, there is no merge. The client simply marks one timeline as deleted and continues to work with the only extant timeline.
  • In the application, it feels like a merge. Typically, the client merges the data from each timeline in an application-specific way. Then it writes the new data to the timeline. In Git, this is like copying and pasting the changes from branch A into branch B, then committing to branch B and deleting branch A. The data was merged, but there was no git merge.
  • These behaviors are different because, in Git, the timeline itself is important; but in CouchDB, the data is important and the timeline is incidental—it's just there to support replication. That is one reason why CouchDB's built-in revisioning is inappropriate for storing revision data like a wiki page.
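
The two exercises above can be sketched in a few lines of Python. This is a toy model under stated assumptions: `next_rev`, `update`, and `compare` are hypothetical helpers, and the SHA-1-over-JSON checksum is chosen only for illustration. Real CouchDB revision IDs are formatted and computed differently in detail.

```python
import hashlib
import json


def next_rev(data, parent_rev=""):
    """Toy revision ID: checksum of the new data plus the previous revision ID."""
    payload = json.dumps(data, sort_keys=True) + parent_rev
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def update(record, data):
    """Return a new record whose timeline extends the old record's timeline."""
    history = record["history"] if record else []
    parent = history[-1] if history else ""
    return {"data": data, "history": history + [next_rev(data, parent)]}


def compare(first, second):
    """Classify how two versions of the same record relate to each other."""
    rev_first, rev_second = first["history"][-1], second["history"][-1]
    if rev_first == rev_second:
        return "same"
    if rev_first in second["history"]:
        return "second descends from first"  # like a Git fast-forward
    if rev_second in first["history"]:
        return "first descends from second"
    if set(first["history"]) & set(second["history"]):
        return "conflict"                    # common ancestor, divergent timelines
    return "unrelated"


c = update(None, {"name": "Jason"})
a = update(c, {"name": "Jason", "awesome": True})   # one child of C
b = update(c, {"name": "Jason", "awesome": False})  # another child of C

print(compare(c, a))
print(compare(a, b))
```

Because every record carries its own timeline, the comparison needs nothing but the two records themselves, which is what keeps the replicator simple.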

Final notes

At least one sentence in this writeup (possibly this one) is complete BS.

Solution 2 - Git

Thanks Jason for the excellent overview! Jens Alfke, who is working on TouchDB and its replication for Couchbase, has (unofficially) described the CouchDB replication algorithm itself if you're interested in the technical details of how a "standard" CouchDB replicator protocol tends to work.

To summarize the steps he's outlined:

  1. Figure out how far any previous replication got
  2. Get the source database _changes since that point
  3. Use _revs_diff on a batch of changes to see which are needed on the target
  4. Copy any missing revision metadata and the current document data plus attachments from source to target, posting to _bulk_docs both as an optimization and so that the docs are stored differently than the usual higher-level MVCC handling does on PUT.

I've glossed over many details here, and would recommend reading through the original explanation as well.
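
The four steps above can be sketched against a toy in-memory database. Everything here is an assumption made for illustration: `ToyCouch` is a hypothetical class, its methods are merely named after the endpoints in the list, checkpoints are kept on the target only, and attachments are left out. It is not how CouchDB or TouchDB implement replication.

```python
class ToyCouch:
    """Hypothetical in-memory stand-in for one database."""

    def __init__(self):
        self.revs = {}         # (doc_id, rev) -> document body
        self.log = []          # ordered (doc_id, rev) pairs; position + 1 is the sequence number
        self.checkpoints = {}  # replication ID -> last sequence number replicated

    def put(self, doc_id, rev, body):
        if (doc_id, rev) not in self.revs:
            self.revs[(doc_id, rev)] = body
            self.log.append((doc_id, rev))

    def changes(self, since):
        # Step 2: everything recorded after sequence number `since`.
        return [(seq, doc_id, rev)
                for seq, (doc_id, rev) in enumerate(self.log, start=1)
                if seq > since]

    def revs_diff(self, candidates):
        # Step 3: which of these (doc_id, rev) pairs are missing here?
        return [key for key in candidates if key not in self.revs]

    def bulk_docs(self, docs):
        # Step 4: store revisions as given, keeping the source's revision IDs.
        for doc_id, rev, body in docs:
            self.put(doc_id, rev, body)


def replicate(source, target, rep_id, batch_size=2):
    since = target.checkpoints.get(rep_id, 0)                  # Step 1
    changes = source.changes(since)                            # Step 2
    copied = 0
    for start in range(0, len(changes), batch_size):
        batch = changes[start:start + batch_size]
        missing = target.revs_diff([(d, r) for _, d, r in batch])            # Step 3
        target.bulk_docs([(d, r, source.revs[(d, r)]) for d, r in missing])  # Step 4
        copied += len(missing)
        target.checkpoints[rep_id] = batch[-1][0]  # remember how far we got
    return copied


source, target = ToyCouch(), ToyCouch()
source.put("a", "1-x", {"n": 1})
source.put("b", "1-y", {"n": 2})
source.put("a", "2-z", {"n": 3})
target.put("b", "1-y", {"n": 2})  # the target already has this revision

first = replicate(source, target, "demo")   # copies only the revisions the target lacks
second = replicate(source, target, "demo")  # resumes from the checkpoint, finds nothing new
print(first, second)
```

The checkpoint is what makes a second run cheap: it asks for changes only after the last sequence number it recorded.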

Solution 3 - Git

The documentation for CouchDB v2.0.0 covers the replication algorithm much more extensively. They have diagrams, example intermediate responses, and example errors. They use the "MUST", "SHALL", etc. language of IETF RFCs.

The specifics for 2.0.0 (still unreleased as of January 2016) are a bit different from 1.x, but the basics are still as @natevw described.

Solution 4 - Git

At Apache CouchDB Conf 2013, Benjamin Young introduced replication.io in his Replication, FTW! talk. It's an ongoing effort to define, and eventually mint, the spec for HTTP-based master-master replication.

Solution 5 - Git

It is also documented here, well, sort of: http://www.dataprotocols.org/en/latest/couchdb_replication.html

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type      Original Author   Original Content on Stackoverflow
Question          JasonSmith        View Question on Stackoverflow
Solution 1 - Git  JasonSmith        View Answer on Stackoverflow
Solution 2 - Git  natevw            View Answer on Stackoverflow
Solution 3 - Git  EthanP            View Answer on Stackoverflow
Solution 4 - Git  sghill            View Answer on Stackoverflow
Solution 5 - Git  dongshengcn       View Answer on Stackoverflow