On the Usage of Atomic Clocks in Spanner
The Spanner (Corbett et al., 2013) system has been well studied since it was published at OSDI '12, and many geo-distributed databases (e.g., CockroachDB) have followed in its footsteps. Everyone knows that Spanner uses TrueTime/atomic clocks to achieve strict serializability (linearizability), but the key questions I want to answer here are: what role does TrueTime play in Spanner's protocol? How does it contribute to Spanner's guarantees? And how should we design a system without such atomic clocks?
Spanner in a Nutshell
Spanner is a geo-distributed database built by Google. Its data model is stronger than key-value stores but weaker than fully relational databases. Spanner adopts a layered design, with concurrency control (2PL) on top of replication (Paxos); this design is easy to maintain, but it may cause extra RTTs. Spanner supports three types of transactions: read-write (RW) transactions, consistent read-only (RO) transactions, and stale snapshot transactions (the naming may differ slightly from the original paper).
The key weapon of Spanner is its TrueTime API, backed by atomic clocks and GPS. TrueTime provides a (at least in practice) guaranteed bound on the offset among clocks. Compared to traditional, unsynchronized clocks, this lets a node in Spanner infer that a remote node's clock has passed \(T\) without any communication, which gives Spanner a unique strength in efficiently ensuring consistency.
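To make the API concrete, here is a minimal sketch in Go of what a TrueTime-style interval clock can look like. The `maxOffset` constant and the function names are assumptions for illustration; real TrueTime derives its bound from GPS receivers and atomic clocks and exposes `TT.now()`, `TT.after()`, and `TT.before()`.

```go
// A sketch of a TrueTime-style interval clock, assuming a hypothetical
// maxOffset bound on how far any clock can drift from true time.
package main

import (
	"fmt"
	"time"
)

const maxOffset = 7 * time.Millisecond // assumed uncertainty bound, not Spanner's real value

// Interval is the [Earliest, Latest] window returned by Now();
// true time is assumed to lie inside it.
type Interval struct {
	Earliest, Latest time.Time
}

// Now returns the local clock reading widened by the uncertainty bound.
func Now() Interval {
	t := time.Now()
	return Interval{Earliest: t.Add(-maxOffset), Latest: t.Add(maxOffset)}
}

// After reports whether timestamp ts has definitely passed in true time.
func After(ts time.Time) bool {
	return Now().Earliest.After(ts)
}

// Before reports whether timestamp ts has definitely not yet arrived in true time.
func Before(ts time.Time) bool {
	return Now().Latest.Before(ts)
}

func main() {
	iv := Now()
	fmt.Println("uncertainty window:", iv.Latest.Sub(iv.Earliest))
	fmt.Println("has the Unix epoch passed?", After(time.Unix(0, 0)))
}
```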
Spanner’s Consistency/Isolation
For read-write (RW) transactions, Spanner uses (strict) 2PL for concurrency control and leader-based Paxos for replication. A RW transaction is driven by its client: the client first acquires all read locks, reads all values, and buffers all writes; then it acquires all write locks and commits via 2PC. This is standard strict 2PL, and it already ensures strict serializability (even with replication, since all read/write requests go through the leaders) without TrueTime. A key invariant when assigning timestamps is that the timestamp order must match the equivalent serial order of 2PL.
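As a concrete illustration of this flow, below is a minimal sketch in Go. The `Leader` and `Coordinator` interfaces and all names are hypothetical, not Spanner's actual API; the point is only to show where locks, buffered writes, and 2PC sit in the client-driven protocol.

```go
// A sketch of the client-driven RW transaction flow described above,
// against hypothetical Leader/Coordinator interfaces.
package rwtxn

// Leader abstracts a shard's Paxos group leader (hypothetical interface):
// locks are held on the leader, and reads/writes go through it.
type Leader interface {
	AcquireReadLock(key string) error
	Read(key string) (string, error)
	AcquireWriteLock(key string) error
}

// Coordinator abstracts the 2PC coordinator (hypothetical interface).
type Coordinator interface {
	Commit(writes map[string]string) (commitTs int64, err error)
}

// RunRWTxn: read phase under read locks with writes buffered at the
// client, then write locks and a 2PC commit. The commit timestamp must
// respect the serial order imposed by strict 2PL.
func RunRWTxn(
	leaderOf func(key string) Leader,
	co Coordinator,
	readKeys []string,
	compute func(reads map[string]string) map[string]string,
) (int64, error) {
	reads := make(map[string]string)
	for _, k := range readKeys {
		l := leaderOf(k)
		if err := l.AcquireReadLock(k); err != nil {
			return 0, err
		}
		v, err := l.Read(k)
		if err != nil {
			return 0, err
		}
		reads[k] = v
	}

	writes := compute(reads) // all writes stay buffered at the client

	for k := range writes {
		if err := leaderOf(k).AcquireWriteLock(k); err != nil {
			return 0, err
		}
	}
	return co.Commit(writes) // 2PC across the involved Paxos leaders
}
```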
For read-only (RO) transactions, Spanner uses TrueTime to ensure their consistency. There are two main requirements for a RO transaction \(R\) at timestamp \(ts\). First, Spanner must ensure that \(R\) sees all writes with timestamps smaller than \(ts\). This completeness implies atomicity, because all updates from the same transaction share the same commit timestamp, so \(R\) sees either all of them or none. Completeness also preserves serializability: RW transactions are serialized according to their timestamps, so completeness ensures that there is no gap in the equivalent serial schedule of RW transactions. Together with the second requirement below, it also ensures the strictness of serializability.
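A minimal sketch of this completeness check follows, assuming a per-shard safe timestamp below which no new write can ever be assigned (Spanner's t_safe plays this role). The types and names are illustrative, not Spanner's.

```go
// A sketch of "completeness": a read at ts blocks until the shard's
// safe timestamp passes ts, so every write below ts is already visible.
package completeness

import "sync"

type version struct {
	ts    int64
	value string
}

type Shard struct {
	mu       sync.Mutex
	safeUp   *sync.Cond
	safeTs   int64                // no future write can receive a timestamp <= safeTs
	versions map[string][]version // multi-versioned store, ordered by ts
}

func NewShard() *Shard {
	s := &Shard{versions: make(map[string][]version)}
	s.safeUp = sync.NewCond(&s.mu)
	return s
}

// AdvanceSafeTs is called once the shard knows that no in-flight RW
// transaction can still commit at or below ts.
func (s *Shard) AdvanceSafeTs(ts int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if ts > s.safeTs {
		s.safeTs = ts
		s.safeUp.Broadcast()
	}
}

// Apply installs a committed write; its timestamp must be above safeTs.
func (s *Shard) Apply(key, val string, ts int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.versions[key] = append(s.versions[key], version{ts: ts, value: val})
}

// ReadAt returns the latest version of key at or below ts, waiting
// until the snapshot at ts is known to be complete.
func (s *Shard) ReadAt(key string, ts int64) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for s.safeTs < ts {
		s.safeUp.Wait()
	}
	var val string
	var ok bool
	for _, v := range s.versions[key] {
		if v.ts <= ts {
			val, ok = v.value, true
		}
	}
	return val, ok
}
```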
Second, to ensure “strictness” in strict serializability, Spanner must ensure that a RO transaction’s timestamp is larger than that of every RW transaction that has already completed. This is where TrueTime comes to help. Spanner’s commit-wait rule forces a RW transaction to hold its commit until its timestamp has passed in true time, and every node’s clock stays within \(\frac{1}{2} offset\) of true time. So consider a RO transaction \(R\) starting when its local clock reads \(t\): true time is at most \(t +\frac{1}{2} offset\), and hence every completed RW transaction’s timestamp is at most \(t +\frac{1}{2} offset\). Therefore, assigning \(R\) a timestamp of \(t +\frac{1}{2} offset\) is enough for consistency.
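Here is a minimal sketch of how these two pieces fit together, reusing the interval-clock idea from the earlier sketch. The `maxOffset` constant, the busy-wait loop, and the simplification that the commit timestamp is just the interval's upper bound are all assumptions for illustration.

```go
// A sketch of commit-wait plus RO timestamp selection under a bounded
// clock offset; simplified, not Spanner's full timestamp rules.
package strictness

import "time"

const maxOffset = 7 * time.Millisecond // assumed uncertainty bound

type interval struct{ earliest, latest time.Time }

func now() interval {
	t := time.Now()
	return interval{earliest: t.Add(-maxOffset), latest: t.Add(maxOffset)}
}

// CommitTimestamp picks a RW commit timestamp: the current upper bound,
// so it is no smaller than anything already committed (simplified; the
// real rule also respects locks and Paxos constraints).
func CommitTimestamp() time.Time { return now().latest }

// CommitWait blocks until the chosen timestamp has definitely passed in
// true time; only then may the RW transaction release its locks and
// report success to the client.
func CommitWait(commitTs time.Time) {
	for !now().earliest.After(commitTs) {
		time.Sleep(time.Millisecond)
	}
}

// ReadTimestamp picks a RO transaction's timestamp. Because of
// commit-wait, every RW transaction completed before this call has a
// timestamp below true time, and true time is below now().latest.
func ReadTimestamp() time.Time { return now().latest }
```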
Without TrueTime, these two requirements become expensive to meet. For the first requirement, two key aspects are (1) ensuring that nodes never travel back in history and assign transactions past timestamps, and (2) having a mechanism that lets a node determine that it has seen all in-flight transactions with smaller timestamps. Achieving these two aspects efficiently in a geo-distributed deployment is challenging. Ocean Vista (Fan & Golab, 2019) is an impressive work showing how they can be met with commodity hardware while still ensuring high performance. My previous work DAST (Chen et al., 2021) also addresses them, but in a more specific near-client computing deployment scenario: data is assigned home regions based on geo-locality, which gives DAST the potential for higher performance.
For the second requirement, without TrueTime, a system must keep track of the timestamps of finished transactions. This cannot be achieved efficiently in general geo-distributed deployments, and a RO transaction normally has to wait or communicate with a delay on the order of a global RTT. This is in line with the SNOW (Lu et al., 2016) impossibility result, and it explains why Google invested in building such strong hardware to support RO transactions. There are two difficulties here. First, nodes’ clocks are not synchronized, so reading at a specific timestamp is bounded by the global straggler (which can still assign timestamps to relevant read-write transactions). DAST sidesteps this difficulty because, for each key \(K\), only a restricted set of local nodes can assign timestamps to RW transactions accessing \(K\). Second, most systems do not explicitly record when each transaction finishes, so to ensure strict serializability, the system has to consider a conservative superset of finished transactions (e.g., all transactions that have been assigned timestamps). This superset forces the RO transaction to also wait for unfinished transactions.
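To make the second difficulty concrete, here is a hypothetical sketch of what “waiting for a conservative superset” looks like on one shard; every name here is invented for illustration and does not correspond to any particular system.

```go
// A sketch of the non-TrueTime alternative: each shard tracks every
// transaction that has been assigned a timestamp but has not finished,
// and a RO transaction at ts must wait until nothing at or below ts
// remains in that set.
package nottruetime

import "sync"

type Shard struct {
	mu       sync.Mutex
	done     *sync.Cond
	inFlight map[int64]int // timestamp -> count of unfinished txns at that timestamp
}

func NewShard() *Shard {
	s := &Shard{inFlight: make(map[int64]int)}
	s.done = sync.NewCond(&s.mu)
	return s
}

// Assign records that a RW transaction received timestamp ts but has
// not finished yet: it joins the conservative superset.
func (s *Shard) Assign(ts int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.inFlight[ts]++
}

// Finish removes a completed RW transaction from the superset.
func (s *Shard) Finish(ts int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.inFlight[ts]--
	if s.inFlight[ts] <= 0 {
		delete(s.inFlight, ts)
	}
	s.done.Broadcast()
}

// WaitForReadAt blocks a RO transaction at ts until no possibly-
// unfinished transaction with timestamp <= ts remains on this shard.
// Note the straggler problem: one slow node that can still assign a
// timestamp <= ts keeps every reader waiting.
func (s *Shard) WaitForReadAt(ts int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for s.hasInFlightAtOrBelow(ts) {
		s.done.Wait()
	}
}

func (s *Shard) hasInFlightAtOrBelow(ts int64) bool {
	for t := range s.inFlight {
		if t <= ts {
			return true
		}
	}
	return false
}
```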
Overall, there seem to be two opportunities for improving RO transactions in strictly serializable databases: (1) strong data ownership based on locality (e.g., SLOG), and (2) leveraging strongly synchronized clocks (e.g., recent clock-synchronization works at NSDI (Geng et al., 2018; Najafi & Wei, 2022)).
References:
- Najafi & Wei. Graham: Synchronizing Clocks by Leveraging Local Clock Properties. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), 2022.
- Chen et al. Achieving low tail-latency and high scalability for serializable transactions in edge computing. In Proceedings of the Sixteenth European Conference on Computer Systems, 2021.
- Fan & Golab. Ocean Vista: gossip-based visibility control for speedy geo-distributed transactions. Proceedings of the VLDB Endowment, 2019.
- Geng et al. Exploiting a natural network effect for scalable, fine-grained clock synchronization. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), 2018.
- Lu et al. The SNOW Theorem and Latency-Optimal Read-Only Transactions. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016.
- Corbett et al. Spanner: Google’s globally distributed database. ACM Transactions on Computer Systems (TOCS), 2013.