Distributed Cache Coherence

When linearizability is a concern, any duplication of mutable data across multiple nodes must be kept consistent. This is the problem of cache coherence: ensuring that all nodes in a distributed system have the same view of the data.

Distributed Caching with Leases

A lease is a time-limited right to do something. In the context of distributed caching, our lease gives us a right to cache some data.

If a node holding the lease fails, we just wait for the lease to expire. Leases can be renewed by the holder, so long as the node is still up.

Cache Reads

  1. Cache obtains a lease containing the data
  2. No one can modify the data until the lease either expires or is revoked. Thus, a server/service needs to track who has which data and for how long
  3. Once the lease expires, the value can then change. The items is no longer cached by anyone, so it can only be copied at the server. All subsequent caches can refetch the new data.

This approach is both linearizable and fault tolerant, since the lease will eventually expire if the node holding it fails, allowing another node to take over. However, this approach is not very scalable, since the server needs to maintain state for every cached item.

Clients are also able to cache values, and you can do this by forwarding the lease along with the data to the client.

Cache Updates

Leases allow the server to reclaim a single copy, regardless of whether caches are up or not. A naive approach would be to wait for all copies to timeout any time you want to update.

An optimized version would be to use a callback, preventing the need for timeouts if no error occurs. When the server receives an update for a cached value, it forwards an invalidation/revoke to all nodes with a copy of the data, and waits for a response from all (or for the lease to timeout), after which it can proceed with the update.

The key insight however is that in order for your system to be linearizable, you must have only one copy of the data while updating.

Lease Timeouts

If we use the same timeout value for all leases, then we need to track less state at our server, and this also reduces the total time needed to reclaim all leases. On the other hand, if we use different timeouts, then caches will all ask for a new lease at staggered times, preventing an overwhelming number of requests at once.

Weaknesses of Linearizable Caches

Caching Widely Shared Data

It's often okay to use snapshot read consistency, allowing for reads to return stale data. Much of the web follows this model.

Usually this would look like having many read only caches and a single copy that is writable, for which updates are propagated from to the caches.

Examples

Sun Network File System (NFS)

Domain Name System (DNS)

Caching Terminology

Write Back Cache Coherence

However, with write back caching, durability becomes an issue since a failure might lose writes.