This is an automated email from the ASF dual-hosted git repository.

maciej pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/iggy-website.git


The following commit(s) were added to refs/heads/main by this push:
     new f2b3eaea Fix grammar and punctuation in blog post
f2b3eaea is described below

commit f2b3eaead3f48fc0b79b33bdb82fd3ce0d361af9
Author: Maciej Modzelewski <[email protected]>
AuthorDate: Fri Feb 27 09:32:32 2026 +0100

    Fix grammar and punctuation in blog post
---
 content/blog/thread-per-core-io_uring.mdx | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/content/blog/thread-per-core-io_uring.mdx 
b/content/blog/thread-per-core-io_uring.mdx
index 0ae93279..b9ed6581 100644
--- a/content/blog/thread-per-core-io_uring.mdx
+++ b/content/blog/thread-per-core-io_uring.mdx
@@ -15,17 +15,17 @@ Apache Iggy utilized `tokio` as its async runtime, which 
uses a multi-threaded w
 
 When `tokio` starts, it spins up `N` worker threads (typically one per core) 
that continuously execute and reschedule `Futures`. The scheduler decides on 
which worker a particular `Future` gets to run, which can lead to task 
migrations between workers, cache invalidations, and less predictable execution 
paths. While Rust `Send` and `Sync` bounds prevent data-race undefined 
behavior, they do not prevent higher-level concurrency bugs such as 
[deadlocks](https://github.com/apache/iggy/pull/1567).
 
-But even these challenges weren't what finally tipped us over the edge. The 
way `tokio` handles block device I/O was the real dealbreaker. Tokio, following 
the poll-based Rust `Futures` model, uses (depending on the platform) a 
notification-based mechanism to perform I/O on file descriptors. The runtime 
subscribes for a readiness notification for a particular descriptor and 
`awaits` the readiness in order to submit the I/O operation. While this works 
decently well for network sockets, it [...]
+But even these challenges weren't what finally tipped us over the edge. The way `tokio` handles block-device I/O was the real dealbreaker. Tokio, following the poll-based Rust `Futures` model, uses (depending on the platform) a notification-based mechanism to perform I/O on file descriptors. The runtime subscribes to a readiness notification for a particular descriptor and `awaits` the readiness in order to submit the I/O operation. While this works decently well for network sockets, it [...]
 
## Thread-per-core shared-nothing architecture
The thread-per-core shared-nothing architecture is what we landed on when it comes to improving the scalability of Apache Iggy. It has been proven successful by high-performance systems such as [ScyllaDB](https://github.com/scylladb/scylladb) and [Redpanda](https://github.com/redpanda-data/redpanda); both of those projects utilize the [Seastar](https://github.com/scylladb/seastar) framework to achieve their performance goals.
 
-In short, the core philosophy behind this approach is to pin a single thread 
to each CPU core, partition your resources based on an heuristic (commonly 
hashing), eliminate shared state, thereby [reduce lock contention and improve 
cache 
locality](https://www.scylladb.com/2024/10/21/why-scylladbs-shard-per-core-architecture-matters/)
 and finally, use message passing for communication between those threads, also 
known as `shards` in `Seastar` terminology. Sounds like a good plan, but as wit 
[...]
+In short, the core philosophy behind this approach is to pin a single thread to each CPU core, partition your resources based on a heuristic (commonly hashing), eliminate shared state (thereby [reducing lock contention and improving cache locality](https://www.scylladb.com/2024/10/21/why-scylladbs-shard-per-core-architecture-matters/)) and finally, use message passing for communication between those threads, also known as `shards` in `Seastar` terminology. Sounds like a good plan, but as with [...]
 
 ![Diagram of thread per core shared nothing 
architecture](/thread-per-core-io_uring/sharding.png)
 From a bird's-eye view, this architecture solves the primary issues of our 
previous approach: we move from **work stealing** to **work steering**. That's 
a big W, but we were still left with block-device I/O. Using a thread pool for 
file operations would ultimately negate the performance gains from core 
pinning, so we needed a truly asynchronous I/O interface, and that is how we 
discovered `io_uring`.
 
-There is plethora of materials regarding `io_uring` as it's the hot thing, but 
very briefly the interface is straight forward, `io_uring` rather than being a 
notification system (readiness based), it's completion based, you submit the 
operation and the kernel drives it to completion. The core mechanism revolves 
around two lock-free ring buffers shared between user space and the kernel: the 
**Submission Queue (SQ)**, where your application enqueues I/O requests, and 
the **Completion Queue [...]
+There is a plethora of material regarding `io_uring`, as it's the hot thing, but very briefly: the interface is straightforward. Rather than being a notification system (readiness-based), `io_uring` is completion-based; you submit the operation and the kernel drives it to completion. The core mechanism revolves around two lock-free ring buffers shared between user space and the kernel: the **Submission Queue (SQ)**, where your application enqueues I/O requests, and the **Completion Queue  [...]
 
 ## Pick your poison
With all the design pieces in place, it was time to visit the marketplace of **async runtimes**. We evaluated three candidates:
@@ -42,10 +42,10 @@ Next on the list 
[glommio](https://github.com/DataDog/glommio) - this one is par
 
 Finally, [compio](https://github.com/compio-rs/compio) - this is what we ended 
up using. It's very similar to `monoio` in terms of architecture, but it stands 
out for its broad `io_uring` feature coverage, active maintenance (our patches 
got [merged within hours](https://github.com/compio-rs/compio/pull/440)), and 
its codebase structure. Unlike `monoio`, the `compio` codebase is structured in 
a way where the `driver` is disaggregated from the `executor`, meaning that one 
can build their  [...]
 
-Notably, `compio` boxes the I/O request that is submitted to the SQ, which 
means that every I/O request incurs a heap allocation, something that `monoio` 
avoids. In our case it's not that big of a deal, as those allocations are very 
small and `mimalloc` is quite good at maintaining a pool for small, predictable 
allocations. We did raise the question in their `Telegram` channel about 
whether it would be feasible to use a `Slab` allocator the approach that 
`monoio` takes, but the authors d [...]
+Notably, `compio` boxes the I/O request that is submitted to the SQ, which means that every I/O request incurs a heap allocation, something that `monoio` avoids. In our case, it's not that big of a deal, as those allocations are very small and `mimalloc` is quite good at maintaining a pool for small, predictable allocations. We did raise the question in their `Telegram` channel about whether it would be feasible to use a `Slab` allocator, the approach that `monoio` takes, but the authors  [...]
 
 ## Devil's speech
-Remember how we mentioned that **the devil is in the details** ? Let's give 
him mic now.
+Remember how we mentioned that **the devil is in the details**? Let's give him the mic now.
 
At first glance, since in the thread-per-core shared-nothing model all state is local to each shard and anything that requires a **global** view must be replicated across shards via message passing, it looks like a perfect candidate for **interior mutability**: replace your `Mutexes` with `RefCells` and run with the quick win. If you thought that, I've got bad news; you'd be greeted straight from the ninth circle of Dante's Inferno with:
 > thread 'shard-8' (496633) panicked at 
 > core/server/src/streaming/topics/helpers.rs:298:21:
@@ -55,7 +55,7 @@ Turns out that holding a `RefCell` borrow across an `.await` 
point can cause run
 
The Rust `wg-async` (async working group) seems to be aware of that footgun and describes it in [this story](https://rust-lang.github.io/wg-async/vision/submitted_stories/status_quo/barbara_wants_to_use_ghostcell.html). It *feels* like it should be possible to express statically-checked borrowing for `Futures` using primitives such as `GhostCell`; they even share a [proof-of-concept runtime](https://crates.io/crates/stakker) that does exactly that, but achieving an ergonomic API indistin [...]
 
-We didn't give up (yet) on interior mutability, rather, we reasoned about the 
underlying problem and attempted to solve it with better a API.
+We didn't give up (yet) on interior mutability, rather, we reasoned about the 
underlying problem and attempted to solve it with a better API.
 
The issue is that during `.await` points, the executor can potentially yield the execution context to another `Future`, and that other `Future` may attempt to borrow the same `RefCell`, causing a panic at runtime since the borrow from the first `Future` is still active. We ran into this often because our data structures followed an OOP-style **compile-time hierarchy that matches the domain model**, which looked akin to this.
 
@@ -94,14 +94,14 @@ The `save` procedure can be split into two parts
  - The mutation of the in-memory state
  - The I/O operation using `storage`
 
- This way our `RefCell` can be much more granular, we use it only for the 
in-memory representation of `Stream`, while the storage is stored out of 
bounds, but for that we needed a bigger gun, let us introduce **ECS** (Entity 
Component System).
+ This way our `RefCell` can be much more granular: we use it only for the in-memory representation of `Stream`, while the storage is kept out of band. But for that, we needed a bigger gun; let us introduce **ECS** (Entity Component System).
 
One might be familiar with `ECS` from game engines, not from message streaming platforms. Personally, I think the general idea behind ECS, **SOA** (struct of arrays), is fairly underrated.
What we did was split the `Entities` (Streams, Topics, Partitions, etc.) into their components, where each component is stored in its own dedicated collection.
 
 ![Entity Component System in-memory 
representation](/thread-per-core-io_uring/ecs.png)
 
-In this case our components are `State` and `Storage`. This allows us to write:
+In this case, our components are `State` and `Storage`. This allows us to 
write:
 
 ```rs
 struct Streams {
@@ -130,9 +130,9 @@ Well, this approach crumbles just as miserably as the 
*naive* attempt...
 
 The thread-per-core shared-nothing architecture requires broadcasting events 
whenever state changes on one shard. For example, if `shard-0` receives a 
`CreateStream` request, once it finishes processing, it broadcasts a 
`CreatedStream` event through a channel to all other shards. On the receiving 
end, each shard has a background task that polls this channel for incoming 
events. The crux of the issue lies in the word **background**.
 
-![Sequence diagram demonstrating non deterministic event 
handling](/thread-per-core-io_uring/diagram_szpont.png)
+![Sequence diagram demonstrating non-deterministic event 
handling](/thread-per-core-io_uring/diagram_szpont.png)
 
-In our `Streams` example, it might not look like a big deal, but in reality 
our other `Entities` were much more complicated, without even introducing other 
background workers that weren't necessary as part of the thread-per-core shared 
nothing architecture. An solution to this problem could be using `async` lock, 
but those can be [footguns aswell](https://rfd.shared.oxide.computer/rfd/0400).
+In our `Streams` example, it might not look like a big deal, but in reality our other `Entities` were much more complicated, even without introducing other background workers that weren't necessary as part of the thread-per-core shared-nothing architecture. A solution to this problem could be using an `async` lock, but those can be [footguns as well](https://rfd.shared.oxide.computer/rfd/0400).
 
To our surprise, the issue persisted even in scenarios where we enforced a single-writer principle (we dedicated one shard to become the serialization point for all requests), which was the final nail in the coffin that led us to conclude the experiment as a failure. Maintaining non-shared but consistent state is much more difficult than *just use message passing, bro*.
 
@@ -142,7 +142,7 @@ To our surprise, the issue persisted even in scenarios 
where we enforced a singl
 
After a long fight with `interior mutability`, we gave up on trying to make fetch happen. Instead, we doubled down on the artifact from the previous iteration (the single-writer principle). We divided our `resources` into two groups: shared, strongly consistent resources and sharded, eventually consistent ones. An example of a sharded resource is `Partition`, while `Streams` and `Topics` remain shared and strongly consistent; this split later came to be known as the Control Plane/Data Plane split.
 
-For shared resources, we decided to use 
[`left-right`](https://github.com/jonhoo/left-right), a concurrent data 
structure designed for a single writer and multiple readers. It works by 
maintaining two pointers to the underlying data: one for readers and one for 
the writer. During a writer commit, those pointers are swapped atomically 
(greatly simplifying). The single writer is the first shard - `shard0`, while 
remaining shards have an `read` handle to the data. In case if an shard other  
[...]
+For shared resources, we decided to use [`left-right`](https://github.com/jonhoo/left-right), a concurrent data structure designed for a single writer and multiple readers. It works by maintaining two pointers to the underlying data: one for readers and one for the writer. During a writer commit, those pointers are swapped atomically (greatly simplifying). The single writer is the first shard, `shard0`, while the remaining shards have a `read` handle to the data. In case a shard other t [...]
 
As for our partitions, we maintain one shared table (DashMap) called `shards_table` that functions as a barrier to fence requests that would try to access a `Partition` that is in the process of creation/deletion. The requests are still routed to the appropriate shard that contains the `Partition`, but by consulting the `shards_table` (during the routing and after the routing), we make sure that the eventual consistency does not come back to bite us.
 
@@ -154,7 +154,7 @@ This design turned out to be a can of worms, or a 
bottomless pit, if you prefer.
 
We can exploit the fact that our `Partition` uses a **segmented log**; thus the partition can be sharded even harder based on the segment range and knowledge of which segments are sealed.
 
-Getting the performance benefits out of `io_uring` itself is a challenge on 
it's own (it's not enough to just swap `tokio` with an `io_uring` based 
runtime), in order to fully take advantage of the benefits from the `io_uring` 
design one have to heavily batch syscalls, as this is the main advantage of 
such interface (less context switches, from userspace to kernel space), Rust 
`Futures` can be composed together pretty well to facilitate that, but you have 
to be careful!
+Getting the performance benefits out of `io_uring` itself is a challenge on its own (it's not enough to just swap `tokio` with an `io_uring`-based runtime). In order to fully take advantage of the `io_uring` design, one has to heavily batch syscalls, as this is the main advantage of such an interface (fewer context switches from userspace to kernel space). Rust `Futures` can be composed together pretty well to facilitate that, but you have to be careful!
 
The following code snippet submits two I/O operations in one "batch", but `io_uring` does not guarantee that submission order = completion order!
 
@@ -176,7 +176,7 @@ To submit a batch while preserving operation order, one 
must use the io_uring ch
 
 ## The state of Rust async runtimes ecosystem
 
-The problem is twofold: at the time of writing this blog post, there is no 
Rust equivalent of the `Seastar` framework. That is unfortunate, because 
`glommio` attempted to be one, but things changed: Glauber moved on to work on 
[Turso](https://github.com/tursodatabase/turso), and the Datadog team does not 
seem to be actively maintaining the runtime while building [a real-time 
time-series storage engine in Rust for performance at 
scale](https://www.datadoghq.com/blog/engineering/rust-times [...]
+The problem is twofold: at the time of writing this blog post, there is no 
Rust equivalent of the `Seastar` framework. That is unfortunate because 
`glommio` attempted to be one, but things changed: Glauber moved on to work on 
[Turso](https://github.com/tursodatabase/turso), and the Datadog team does not 
seem to be actively maintaining the runtime while building [a real-time 
time-series storage engine in Rust for performance at 
scale](https://www.datadoghq.com/blog/engineering/rust-timese [...]
 
Secundo problemo is that these runtimes imitate the `std` library APIs, which are `POSIX`-compliant, while many of `io_uring`'s most powerful features are not, leaving those capabilities out of reach for us mere mortals. Request chaining is only the tip of the iceberg; there is plenty more, for example `oneshot` APIs for listen/recv, `registered buffers`, and so on. Ultimately, `File`, `TcpListener`, and `TcpStream` are not the right abstractions. From the point of view of `POSIX` complia [...]
 
@@ -296,6 +296,6 @@ Flush the data to disk on every batch write.
 | 3,361 MB/s | 1.98 ms | 2.26 ms | 2.57 ms | 3.88 ms |
 
 ## Closing words
-Finally, even though we went into significant detail in this blog post, we 
have only scratched the surface of what is possible, and several subsections 
could easily be blog posts on their own. If you are interested in learning more 
about thread-per-core shared-nothing design, check out the `Seastar` framework, 
it is the SOTA in this space. For now we shift our attention to the [on-going 
work on clustering](https://github.com/apache/iggy/releases/tag/server-0.7.0), 
using [Viewstamped Repl [...]
+Finally, even though we went into significant detail in this blog post, we have only scratched the surface of what is possible, and several subsections could easily be blog posts on their own. If you are interested in learning more about thread-per-core shared-nothing design, check out the `Seastar` framework; it is the SOTA in this space. For now, we shift our attention to the [ongoing work on clustering](https://github.com/apache/iggy/releases/tag/server-0.7.0), using [Viewstamped Repl [...]
 
Stay tuned: a deep-dive blog post on that is coming, and we’re just getting started 🚀
