Non-Blocking Two Phase Commit
Non-Blocking Two Phase Commit
Regular 2PC is blocking because we need to wait for all nodes to agree that an operation is commit-able. There are massive performance implications to read-only transactions (could fix with snapshot reads) and lock contention. We can fix this by relying on Paxos.
We make the coordinator as well as each participant a Paxos group with its own shard, and then we can just wait until all groups agree to an operation.
2PC on Paxos
- Client requests multi-key operation at coordinator
- Coordinator logs request
- Coordinator sends prepare
- Participants decide to commit/abort, and log result
- Coordinator sends a commit/abort
- Participants record results
Multi-Key Transactions in a KV Store
Assuming reader-writer locking scheme, and that the application code runs on the client:
- RPCs to storage layer to start/end transactions, and read/write values
- KV store acquires/releases locks and does read/write operations
- Keep KV store simple so application logic is easy to change
- Server can abandon transaction and release locks if needed (e.g. if the client fails)
The basic approach is to:
- Read and write objects during execution
- Contact the appropriate paxos group leader and acquire any locks
- Client decides to commit and notifies coordinator
- Coordinator contacts all shards and sends
prepare
message - Prepare log entry is replicated throughout group
- Group votes to either send
ok
orabort
- If all contacted shards vote
ok
, coordinator sendscommit
message - Each shard replicates
commit
entry in log - Leader released locks
Caution: Deadlocks
Deadlocks are very easy to trigger when performing operations across shards. A general solution is to always kill things that need to wait. For instance, with a checking/savings account at a bank:
- Two clients execute the same exact transaction concurrently
- Both get read locks on accounts, so neither can acquire write locks and the transaction stalls
- Two clients, one moves from savings to checking, and the other from checking to savings
- Each gets read lock, neither can get write lock, so transaction stalls
Deadlocks can be an issue in sharded systems when moving shards across groups.
Google's Bigtable in Retrospect
Jeff Dean of Google said that not supporting distributed transactions was the biggest mistake in the the design of Bigtable. Incremental updates make it a very important feature, so users really wanted them and often tried to implement it themselves on top of Bigtable
Spanner, Google's multi-datacenter KV store uses 2PC over Paxos, and is one of the backbones of Google's ad service.