Look Into

Remote Procedure Call (RPC)

A request from a client to execute a function on a server/different machine. - To the client, looks like a local function call. - To the server, looks like an implementation of a function call.

Google handles $10^{10}$ RPCs per second.

Local Procedure Call

The compiler defines the protocol for the call above.

Remote Procedure Call

Client/server implementation is usually auto-generated from procedure spec, e.g. Google's Protocol Buffers/Protobuf.

RPC vs. Local Procedure Call

Binding

Service Discovery Service

Interface Description Language (IDL, e.g. Protobuf)

Serialization is important!

Failures

Some of the network issues can be mitigated by TCP, but sockets can fail and messages aren't always transmitted over TCP anyways.

Fault Model

Naive RPC

Client timeout and retry

If a request or reply message is dropped, the client will wait forever for the response. This can be fixed with client timer/retransmission, where the client sends the request again if it doesn't get a response in a certain amount of time. This leads to duplication and reordering of messages at the server.

We can handle this with a unique request ID. Include a message ID in each request/reply. When the client retransmits, it uses the same message ID. The server can then ignore duplicate requests.

RPC Semantics

At least once

Client should do a finite number of retries, eventually giving up and returning an error to the caller.

This only works if the server is idempotent, meaning it has the same effect if it's executed multiple times. All read-only operations are idempotent, but not all write operations are. For example, icrementing a counter is not idempotent, but setting the counter to a value is.

Does TCP handle this? Not really despite being reliable. Most RPCs are sent over TCP, and it guarantees in-order delivery with retransmission and duplicate detection. However, it doesn't guarantee exactly-once semantics. If the server crashes after processing the request but before sending the response, the client will retransmit the request, and the server will execute it again.

End to end principle: Functionality should be implemented where it can be completely handled, rather than partially handled at each layer. This decreases the chance of partially completed work due to unrelated failures.

Examples: | Example | Explanation | |---|---| | DNS lookup | Queries are read-only, so it's idempotent | | MapReduce | The Map phase is idempotent since it is a pure function | | NFS | If the client maintains offset, reading/writing a block is idempotent |

Importantly, in situations with multiple clients, operations like Put(k, v) are not idempotent, since the value of k can change between the time the client reads the value and the time it writes the value.

Two Generals Problem

Just a thought experiment to emphasize the difficulty of message passing in a distributed system. Two generals are trying to coordinate an attack on a city. They are separated by a valley, and can only communicate by messenger. The messenger can be captured by the city, and the generals don't know if the message was delivered. The generals need to agree on a time to attack, but they can't be sure the message was delivered. They can only attack if they both agree on the time.

The problem boils down to the fact that at any point in time, if we sent a message, we don't know if it was delivered. Regardless of how many round trips you make to confirm, the last message sent could always have been dropped. This is a fundamental and central problem in distributed systems.