Gossip Glomers challenges in sync Rust
Gossip Glomers is a series of distributed systems challenges prepared by Fly.io and Kyle Kingsbury.
They are based on Maelstrom, a workbench for testing distributed systems built on top of the Jepsen verification framework, which has been used to test systems such as Kafka, TigerBeetle, and Elasticsearch.
I attempted to solve them in Rust.
You can check my approach here.
Maelstrom Node in sync Rust
Our task is to build a node that communicates over stdin/stdout with other nodes and with Maelstrom’s provided services in order to implement a concrete workload.
Maelstrom takes care of scheduling many instances of our node, often in specific topologies, and later injects faults like network partitions to check how our node implementation behaves under not-so-ideal circumstances.
By default, we are given a Go library that implements the glue code needed to bootstrap a Maelstrom-compatible node. I, however, wanted to treat this as an opportunity to try building something in Rust. There already exist multiple Maelstrom glue-code libraries for Rust, but the protocol is not complicated, so I wanted to give it a try myself. As an additional constraint, and to reduce the number of dependencies, I decided to do everything in sync Rust.
That means no async keyword, no tokio (!).
The approach I settled on was a bounded thread pool with callback-driven IO.
In the end, it became something of a double-edged sword; it made the code too complicated at times, especially because most of my scheduled jobs and callbacks required Send + Sync + 'static bounds, so we ended up with a lot of Arcs and clones.
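To illustrate where those bounds creep in, here is a minimal sketch (not the actual code from the repository) of a callback-based node that hands work to other threads; the Message and Node types here are simplified stand-ins:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

// Simplified message; the real one carries the full Maelstrom JSON envelope.
struct Message {
    src: String,
    body: String,
}

// Anything handed to a worker thread has to satisfy these bounds,
// which is where the Arc/clone boilerplate comes from.
type Handler = Arc<dyn Fn(Message) + Send + Sync + 'static>;

struct Node {
    handlers: Mutex<HashMap<String, Handler>>,
}

impl Node {
    fn new() -> Arc<Self> {
        Arc::new(Node { handlers: Mutex::new(HashMap::new()) })
    }

    // Register a callback for a given message type.
    fn on(&self, msg_type: &str, handler: impl Fn(Message) + Send + Sync + 'static) {
        self.handlers
            .lock()
            .unwrap()
            .insert(msg_type.to_string(), Arc::new(handler));
    }

    // Dispatch a message to its handler on another thread.
    fn dispatch(&self, msg_type: &str, msg: Message) {
        let handler = self.handlers.lock().unwrap().get(msg_type).cloned();
        if let Some(handler) = handler {
            // In the real node this goes to a bounded thread pool
            // rather than a fresh thread per message.
            thread::spawn(move || handler(msg));
        }
    }
}

fn main() {
    let node = Node::new();
    node.on("echo", |msg| println!("would reply to {} with {}", msg.src, msg.body));
    node.dispatch("echo", Message { src: "c1".into(), body: "hello".into() });
    thread::sleep(std::time::Duration::from_millis(50));
}
```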
Using a continuation-like abstraction such as Future, with task parking/unparking surfaced through async/await, would have helped tremendously to untangle some of the code.
That continuations help with readability is well known; in my experience, it is even more true in Rust.
In addition, while not all of the Send + Sync + 'static trio can be escaped even with Rust’s most popular async runtime - Tokio - there are solutions that ease the pain.
Challenges
Let’s go through the actual challenges.
Echo server
This task was relatively simple - make a node that replies with the given echo string message. Here I spent most of the time going through the Maelstrom protocol code and setting up the boilerplate.
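For reference, a stand-alone sketch of the echo loop, assuming serde_json for parsing the Maelstrom envelope from stdin and writing replies to stdout; the real node is structured around the callback layer rather than a single loop:

```rust
use std::io::{self, BufRead, Write};
use serde_json::{json, Value};

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let stdout = io::stdout();
    let mut out = stdout.lock();
    let mut msg_id = 0u64;

    for line in stdin.lock().lines() {
        let msg: Value = serde_json::from_str(&line?).expect("valid Maelstrom JSON");
        let body = &msg["body"];
        msg_id += 1;

        // Reply to "init" and "echo"; a real node would dispatch on body.type.
        let reply_body = match body["type"].as_str() {
            Some("init") => json!({
                "type": "init_ok",
                "in_reply_to": body["msg_id"].clone(),
                "msg_id": msg_id,
            }),
            Some("echo") => json!({
                "type": "echo_ok",
                "in_reply_to": body["msg_id"].clone(),
                "msg_id": msg_id,
                "echo": body["echo"].clone(),
            }),
            _ => continue,
        };

        // Swap src/dest and send the reply back over stdout.
        let reply = json!({ "src": msg["dest"].clone(), "dest": msg["src"].clone(), "body": reply_body });
        writeln!(out, "{}", reply)?;
        out.flush()?;
    }
    Ok(())
}
```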
Unique ID Generation
In this task, we need to implement a totally-available unique ID generation service. We don’t get any constraints regarding the contents of the generated ID, or whether IDs need to be ordered, so I just went with the UUID approach. Otherwise, we could encode information about the node and the request into the ID, or go in the direction of a Lamport clock.
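The core of the UUID approach fits in a few lines; a sketch assuming the uuid crate (with the v4 feature) and a reply body shaped like the other replies:

```rust
use serde_json::{json, Value};
use uuid::Uuid;

// Build the body of a `generate_ok` reply for an incoming `generate` request.
// UUIDv4 gives collision-resistant IDs without any cross-node coordination.
fn generate_ok(request_body: &Value) -> Value {
    json!({
        "type": "generate_ok",
        "in_reply_to": request_body["msg_id"].clone(),
        "id": Uuid::new_v4().to_string(),
    })
}

fn main() {
    let req = json!({ "type": "generate", "msg_id": 1 });
    println!("{}", generate_ok(&req));
}
```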
Broadcast
This is the first more involved challenge. Our task is to build a multi-node, fault-tolerant P2P system for efficiently broadcasting messages (just ints) across all the nodes. It also required me to re-architect some of my Maelstrom glue code to allow callbacks.
The simplest solution here would be to forward every received message to every known node on every request. In a real scenario, however, this would produce a lot of network traffic, and messages could be huge.
My implementation does the following (a sketch follows the list):
- On every broadcast message received, send an ack and schedule the message for sending on a background processing thread.
- In the background processing thread, we send messages in small batches to every known node in our topology.
- We also keep track of messages for which we got acks from other nodes; if we don’t get an ack, we retry sending them.
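A simplified sketch of that background loop; the send_batch function and the pending-ack bookkeeping are placeholders for what the real node does through its RPC layer:

```rust
use std::collections::{HashMap, HashSet};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// Messages seen but not yet confirmed delivered to each neighbour.
type Pending = Arc<Mutex<HashMap<String, HashSet<u64>>>>;

// Placeholder for the real RPC call: send a batch to a neighbour and report
// whether an ack came back before the timeout.
fn send_batch(_neighbour: &str, _batch: &[u64]) -> bool {
    true
}

fn spawn_gossip_loop(pending: Pending) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        // Copy the current pending state so we don't hold the lock while "sending".
        let snapshot: Vec<(String, Vec<u64>)> = {
            let pending = pending.lock().unwrap();
            pending
                .iter()
                .map(|(n, msgs)| (n.clone(), msgs.iter().copied().collect()))
                .collect()
        };

        for (neighbour, batch) in snapshot {
            if batch.is_empty() {
                continue;
            }
            // Acked messages are dropped from the pending set;
            // anything unacked stays and is retried on the next tick.
            if send_batch(&neighbour, &batch) {
                let mut pending = pending.lock().unwrap();
                if let Some(msgs) = pending.get_mut(&neighbour) {
                    for m in &batch {
                        msgs.remove(m);
                    }
                }
            }
        }

        thread::sleep(Duration::from_millis(100));
    })
}

fn main() {
    let pending: Pending = Arc::new(Mutex::new(HashMap::new()));
    pending.lock().unwrap().entry("n2".to_string()).or_default().insert(42);
    let _gossip = spawn_gossip_loop(Arc::clone(&pending));
    // Give the loop a moment to run before the example exits.
    thread::sleep(Duration::from_millis(300));
}
```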
It’s also important to consider the topology of the system. To balance latency against availability and the messages-per-operation metric, I decided to use the tree4 topology for part d of the challenge.
Related work in this area worth reading about: Epidemic Broadcast Trees and the Plumtree algorithm.
Grow-Only Counter
This time our challenge is to implement a stateless grow-only counter.
This is the first challenge in which we are required to implement our solution on top of Maelstrom’s sequentially-consistent KV store implementation, which provides read, write, and cas operations.
The sequentially-consistent bit is important; it means that there is a total order of operations which respects each process’s program order, but there is no real-time guarantee.
In practice, it means that:
- If process A writes X=1 and process B writes X=2, every process observes the two writes in the same order - either [A then B] or [B then A] - never a mix.
- And once process A has observed some operation from process B, it can never later observe a state from before that operation.
My implementation is based on the G-Counter CRDT, with the Maelstrom KV used as the storage and coordination point.
Each node writes its part of the state under a separate key, reads are served from memory, and merging happens in a background thread that reads the other nodes’ state directly from the KV.
Using separate keys and allowing ourselves to be eventually consistent sidesteps the lack of real-time guarantees in the sequentially-consistent KV store.
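A sketch of the G-Counter shape; the SeqKv trait and the counter_<node> key naming are illustrative stand-ins for the RPC calls the real node makes to Maelstrom's seq-kv service:

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Hypothetical handle to the seq-kv service; the real one issues read/write RPCs.
trait SeqKv {
    fn read(&self, key: &str) -> Option<u64>;
    fn write(&self, key: &str, value: u64);
}

struct GCounter {
    node_id: String,
    node_ids: Vec<String>,
    // Last known per-node counts, including our own.
    counts: HashMap<String, u64>,
}

impl GCounter {
    // `add` only touches our own key, so concurrent adds on different nodes never conflict.
    fn add(&mut self, kv: &impl SeqKv, delta: u64) {
        let own = self.counts.entry(self.node_id.clone()).or_insert(0);
        *own += delta;
        kv.write(&format!("counter_{}", self.node_id), *own);
    }

    // Reads are served from memory: the sum of every node's last known count.
    fn read(&self) -> u64 {
        self.counts.values().sum()
    }

    // Called periodically from a background thread: pull the other nodes'
    // keys and merge by taking the max (counts only ever grow).
    fn merge(&mut self, kv: &impl SeqKv) {
        for node in &self.node_ids {
            if let Some(remote) = kv.read(&format!("counter_{}", node)) {
                let local = self.counts.entry(node.clone()).or_insert(0);
                *local = (*local).max(remote);
            }
        }
    }
}

// In-memory stand-in for seq-kv, just to exercise the sketch.
struct MemKv(RefCell<HashMap<String, u64>>);

impl SeqKv for MemKv {
    fn read(&self, key: &str) -> Option<u64> {
        self.0.borrow().get(key).copied()
    }
    fn write(&self, key: &str, value: u64) {
        self.0.borrow_mut().insert(key.to_string(), value);
    }
}

fn main() {
    let kv = MemKv(RefCell::new(HashMap::new()));
    let mut counter = GCounter {
        node_id: "n1".into(),
        node_ids: vec!["n1".into(), "n2".into()],
        counts: HashMap::new(),
    };
    counter.add(&kv, 5);
    kv.write("counter_n2", 3); // pretend n2 wrote its own state
    counter.merge(&kv);
    assert_eq!(counter.read(), 8);
}
```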
Kafka-Style Log
Here our challenge is to build a simplified version of a Kafkaesque system - a distributed log where clients can append messages, poll them, and have their consumption offsets tracked by our service.
Although it wasn’t a requirement, I decided to use the suggested approach with Maelstrom KV store again.
This time, however, I used the linearizable KV service, which adds a real-time guarantee.
Otherwise, I would need to implement a storage layer with replication myself, which sounds fun but adds a lot of work.
We work with only one implicit topic and a single consumer group/client whose offsets we track, but we need to make sure that appends & polls for a given key are sequential.
My approach for this challenge was to embrace the Single Writer Principle, where only one node does the writes for a single key/partition.
This is done by implementing a ring-like structure where keys are consistently hashed to points mapped to particular nodes, so that a minimal number of keys needs to be remapped when partitions are added or removed - Cassandra style.
For us, the ring state is static, but you can imagine it being part of a distributed state exchanged using Raft or Paxos. During a write, we hash the key to decide which node is the partition leader for that key, and then (see the sketch after the list):
- If the current node is the leader - write to KV directly.
- If another node is the leader - delegate the write to that node.
Reads can happen from any node.
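A sketch of the routing decision with a static ring; the hashing scheme and the number of points per node are arbitrary choices for illustration:

```rust
use std::collections::BTreeMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// A static hash ring: points on the ring map to node ids.
struct Ring {
    points: BTreeMap<u64, String>,
}

impl Ring {
    fn new(node_ids: &[String], points_per_node: u32) -> Self {
        let mut points = BTreeMap::new();
        for node in node_ids {
            for i in 0..points_per_node {
                points.insert(hash(&format!("{node}-{i}")), node.clone());
            }
        }
        Ring { points }
    }

    // Walk clockwise from the key's hash to the first node point.
    fn owner(&self, key: &str) -> &str {
        let h = hash(key);
        self.points
            .range(h..)
            .next()
            .or_else(|| self.points.iter().next())
            .map(|(_, node)| node.as_str())
            .expect("ring is not empty")
    }
}

fn hash(value: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let nodes = vec!["n1".to_string(), "n2".to_string(), "n3".to_string()];
    let ring = Ring::new(&nodes, 16);
    let me = "n1";

    let key = "k7";
    if ring.owner(key) == me {
        println!("I am the leader for {key}: write to the KV directly");
    } else {
        println!("forward the write for {key} to {}", ring.owner(key));
    }
}
```

Since every write for a given key funnels through a single node, appends for that key are naturally serialized without any extra locking.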
Storing the whole partition state under one key is kind of naive, but it works in the scope of this challenge. To improve it further, we could distribute it across multiple segments and add a background cleaning task that removes records beyond our TTL policy. Or better yet - implement it as an actual log. 🙈
Totally-Available Transactions KV store
In this last challenge, our task is to implement a totally available transactional workload that accepts messages with a list of transactions and has to process them according to the read uncommitted and read committed isolation levels.
Read uncommitted requires us to prevent dirty writes; read committed additionally requires preventing dirty reads.
Since our transaction consistency guarantees are actually pretty lax, I considered going the CRDT route again, with RAMP transactions. For serializable, lock-free transactions, we could go CockroachDB style.
In the end, I decided to save some time by cheating a bit and going for a far simpler approach - a distributed global lock built on top of Maelstrom’s linearizable KV service and its cas (compare-and-swap) operation.
For a more efficient approach, we could make our lock more granular - instead of locking the whole DB, we could lock just the keys touched by the transaction.
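A sketch of the lock itself; the LinKv trait and its cas signature are my stand-ins for the RPC calls to Maelstrom's lin-kv service, with a small in-memory implementation so the example runs:

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::thread;
use std::time::Duration;

// Hypothetical handle to the lin-kv service; cas succeeds only if the stored
// value still equals `from` (or the key is missing and we ask to create it).
trait LinKv {
    fn cas(&self, key: &str, from: &str, to: &str, create_if_missing: bool) -> Result<(), ()>;
    fn write(&self, key: &str, value: &str);
}

const LOCK_KEY: &str = "global-lock";
const UNLOCKED: &str = "unlocked";

// Spin until we swap "unlocked" -> our node id; a linearizable cas guarantees
// at most one node wins each round.
fn acquire(kv: &impl LinKv, node_id: &str) {
    while kv.cas(LOCK_KEY, UNLOCKED, node_id, true).is_err() {
        thread::sleep(Duration::from_millis(10));
    }
}

// Release by writing "unlocked" back; only ever called by the lock holder.
fn release(kv: &impl LinKv) {
    kv.write(LOCK_KEY, UNLOCKED);
}

// Run a whole transaction while holding the global lock.
fn with_global_lock<T>(kv: &impl LinKv, node_id: &str, txn: impl FnOnce() -> T) -> T {
    acquire(kv, node_id);
    let result = txn();
    release(kv);
    result
}

// In-memory stand-in for lin-kv, just so the sketch runs.
struct MemKv(Mutex<HashMap<String, String>>);

impl LinKv for MemKv {
    fn cas(&self, key: &str, from: &str, to: &str, create_if_missing: bool) -> Result<(), ()> {
        let mut map = self.0.lock().unwrap();
        let matches = match map.get(key) {
            Some(current) => current.as_str() == from,
            None => create_if_missing,
        };
        if matches {
            map.insert(key.to_string(), to.to_string());
            Ok(())
        } else {
            Err(())
        }
    }

    fn write(&self, key: &str, value: &str) {
        self.0.lock().unwrap().insert(key.to_string(), value.to_string());
    }
}

fn main() {
    let kv = MemKv(Mutex::new(HashMap::new()));
    let sum = with_global_lock(&kv, "n1", || 1 + 2);
    println!("ran a transaction under the global lock, result = {sum}");
}
```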
That was fun, but perhaps next time I shouldn’t try to be too creative and go with async Rust like a normal person.