This is an automated email from the ASF dual-hosted git repository.

gkoszyk pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/iggy-website.git


The following commit(s) were added to refs/heads/main by this push:
     new d8b6ebae io_uring blogpost
d8b6ebae is described below

commit d8b6ebaee59625d282f03f80f6ecb2e31f915098
Author: numinex <[email protected]>
AuthorDate: Thu Feb 26 10:19:27 2026 +0100

    io_uring blogpost
---
 content/blog/do.mmd                                |  21 +++
 content/blog/thread-per-core-io_uring.mdx          | 198 +++++++++++++++++++++
 public/thread-per-core-io_uring/diagram_szpont.png | Bin 0 -> 271604 bytes
 public/thread-per-core-io_uring/ecs.png            | Bin 0 -> 154902 bytes
 public/thread-per-core-io_uring/sharding.png       | Bin 0 -> 535868 bytes
 source.config.ts                                   |   1 +
 src/app/(site)/blogs/[...slug]/page.tsx            |   3 +-
 src/app/(site)/blogs/page.tsx                      |   6 +-
 8 files changed, 226 insertions(+), 3 deletions(-)

diff --git a/content/blog/do.mmd b/content/blog/do.mmd
new file mode 100644
index 00000000..7bd6151b
--- /dev/null
+++ b/content/blog/do.mmd
@@ -0,0 +1,21 @@
+sequenceDiagram
+    participant S0 as shard-0
+    participant S1 as shard-1
+
+    Note over S1: T0: shard-1 receives local CreateStream request
+    S1->>S1: Handle CreateStream (local request)
+
+    Note over S1: T1: local handler hits .await and yields
+    S1-->>S1: Yield to executor
+
+    Note over S0,S1: T2: shard-0 broadcasts CreatedStream event
+    S0->>S1: CreatedStream event
+
+    Note over S1: T3: background event task on shard-1 starts
+    S1->>S1: Handle CreatedStream (background task)
+
+    Note over S1: T4: background handler hits .await and yields
+    S1-->>S1: Yield to executor
+
+    Note over S1: T5: executor can resume either task first
+    S1->>S1: Resume order is non-deterministic
\ No newline at end of file
diff --git a/content/blog/thread-per-core-io_uring.mdx 
b/content/blog/thread-per-core-io_uring.mdx
new file mode 100644
index 00000000..3b4fc572
--- /dev/null
+++ b/content/blog/thread-per-core-io_uring.mdx
@@ -0,0 +1,198 @@
+---
+title: Our migration journey to thread-per-core architecture powered by 
io_uring
+author: grzegorz
+tags: ["engineering", "performance", "io_uring", "thread-per-core", "rust"]
+date: 2026-02-25
+draft: true
+---
+
+## Introduction
+
+At Apache Iggy, performance is one of our core principles. We take pride in being blazingly fast and in pushing our systems to the absolute limits of the underlying hardware. Eventually, though, we exhausted all the options available within our previous architecture, so a new approach was needed. If you're an active Rust Reddit user, you may have already seen [this discussion](https://www.reddit.com/r/rust/comments/1pn6010/compio_instead_of_tokio_what_are_the_implications/). It predates this blog po [...]
+
+## Rationale
+To explain the "whys" of that decision in detail, a quick primer on the status 
quo is needed.
+Apache Iggy utilized `tokio` as its async runtime, which uses a multi-threaded 
work-stealing executor. While this works great for a lot of applications (work 
stealing takes care of load balancing), fundamentally it runs into the same 
problem as many "high-level" libraries: a lack of control. 
+
+When `tokio` starts, it spins up `N` worker threads (typically one per core) that continuously execute and reschedule `Futures`. The scheduler decides on which worker a particular `Future` gets to run, which can potentially lead to a lot of context switches, cache invalidations, and unpredictable execution paths. While Rust's `Send` and `Sync` bounds eliminate many of the data races associated with multi-threaded programming, there is still a bunch of footguns left open, such as [deadlocks] [...]
+
+But even these challenges weren't what finally tipped us over the edge. The way `tokio` handles block-device I/O was the real dealbreaker. Tokio, following the poll-based Rust `Futures` model, uses (depending on the platform) a notification-based mechanism to perform I/O on file descriptors. The runtime subscribes to a readiness notification for a particular descriptor and `awaits` the readiness in order to submit the I/O operation. While this works decently well for network sockets, it [...]
+
+## Thread-per-core shared-nothing architecture
+The thread-per-core shared-nothing architecture is what we landed on when it 
comes to improving the scalability of Apache Iggy. It has been proven to be 
successful by high-performance systems such as 
[ScyllaDB](https://github.com/scylladb/scylladb) and 
[Redpanda](https://github.com/redpanda-data/redpanda), both of those projects 
utilize the [Seastar](https://github.com/scylladb/seastar) framework to achieve 
their performance goals.
+
+In short, the core philosophy behind this approach is to pin a single thread to each CPU core, partition your resources based on a heuristic (commonly hashing), eliminate shared state, thereby [reducing lock contention and improving cache locality](https://www.scylladb.com/2024/10/21/why-scylladbs-shard-per-core-architecture-matters/), and finally use message passing for communication between those threads, also known as `shards` in `Seastar` terminology. Sounds like a good plan, but as wit [...]
+
+![Diagram of thread per core shared nothing 
architecture](/thread-per-core-io_uring/sharding.png)
+From a bird's-eye view, this architecture solves the primary issues of our 
previous approach: we move from **work stealing** to **work steering**. That's 
a big W, but we were still left with block-device I/O. Using a thread pool for 
file operations would ultimately negate the performance gains from core 
pinning, so we needed a truly asynchronous I/O interface, and that is how we 
discovered `io_uring`.
+
+There is a plethora of material about `io_uring`, as it's the hot thing right now, so very briefly: the interface is straightforward. Rather than being a notification (readiness-based) system, `io_uring` is completion-based: you submit an operation and the kernel drives it to completion. The core mechanism revolves around two lock-free ring buffers shared between user space and the kernel: the **Submission Queue (SQ)**, where your application enqueues I/O requests, and the **Completion Queue [...]
+
+## Pick your poison
+With all the design pieces in place, it was time to visit the marketplace of 
**async runtimes**. We evaluated 3 candidates:
+
+- `monoio`
+- `glommio`
+- `compio`
+
+All of them support `io_uring` as the driver, some exclusively, others as one of several available ones.
+
+In FIFO order: [monoio](https://github.com/bytedance/monoio) was our choice for the initial [proof-of-concept](https://github.com/apache/iggy/tree/io_uring_monoio_runtime). It worked pretty well, but as we explored the monstrous API surface of `io_uring`, we realized that it's pretty far behind when it comes to feature parity and doesn't appear to be very actively maintained. Don't get us wrong, the runtime still receives patches, especially after [incidents like this](https://ww [...]
+
+Next on the list, [glommio](https://github.com/DataDog/glommio). This one is particularly interesting, as it was initially developed by `Glauber Costa`, who previously worked at `ScyllaDB`, the creators of the `Seastar` framework. `glommio` significantly differs from the other two runtimes on our list: it's still a thread-per-core runtime, but it uses a proportional-share scheduler, creates 3 `io_uring` instances per thread (a main ring, a latency ring, and a polling ring), and ships with [...]
+
+Finally, [compio](https://github.com/compio-rs/compio) - this is what we ended 
up using. It's very similar to `monoio` in terms of architecture, but it stands 
out for its broad `io_uring` feature coverage, active maintenance (our patches 
got [merged within hours](https://github.com/compio-rs/compio/pull/440)), and 
its codebase structure. Unlike `monoio`, the `compio` codebase is structured in 
a way where the `driver` is disaggregated from the `executor`, meaning that one 
can build their  [...]
+
+Notably, `compio` boxes the I/O request that is submitted to the SQ, which means that every I/O request incurs a heap allocation, something that `monoio` avoids. In our case it's not that big of a deal, as those allocations are very small and `mimalloc` is quite good at maintaining a pool for small, predictable allocations. We did raise the question in their `Telegram` channel about whether it would be feasible to use a `Slab` allocator, the approach that `monoio` takes, but the authors d [...]
+
+## Devil's speech
+Remember how we mentioned that **the devil is in the details**? Let's give him the mic now.
+
+At first glance, since in the thread-per-core shared-nothing model all state is local to each shard and anything that requires a **global** view must be replicated across shards via message passing, it looks like a perfect candidate for **interior mutability**: replace your `Mutex`es with `RefCell`s and run with the quick win. If you thought that, I've got bad news. You'd be greeted straight from the ninth circle of Dante's Inferno with:
+> thread 'shard-8' (496633) panicked at core/server/src/streaming/topics/helpers.rs:298:21:
+RefCell already borrowed
+
+Turns out that `RefCell` isn't safe to hold across an `.await` point; there is even a clippy lint for that - `clippy::await_holding_refcell_ref`.
+
+The Rust `wg-async` (async working group) seems to be aware of that footgun 
and describes it in [this 
story](https://rust-lang.github.io/wg-async/vision/submitted_stories/status_quo/barbara_wants_to_use_ghostcell.html).
 It *feels* like it should be possible to express statically-checked borrowing 
for `Futures` using primitives such as `GhostCell`, they even share a 
[proof-of-concept runtime](https://crates.io/crates/stakker) that does exactly 
that, but achieving an ergonomic API indistin [...]
+
+We didn't give up (yet) on interior mutability; rather, we reasoned about the underlying problem and attempted to solve it with a better API.
+
+The issue is that at `.await` points, the executor can potentially yield the execution context to another `Future`, and that other `Future` may attempt to borrow the same `RefCell`, causing a panic at runtime since the borrow from the first `Future` is still active. We ran into this often because our data structures followed an OOP-style **compile-time hierarchy that matches the domain model**, which looked akin to this:
+
+```rs
+struct Stream {
+    id: usize,
+    name: String,
+    storage: Storage,
+}
+
+impl Stream {
+    async fn save(&mut self) {
+        // Do something with the `name` field ...
+        self.storage.save().await; // <-- suspension point while borrowed
+    }
+}
+
+// .....
+
+struct Server {
+    streams: RefCell<Vec<Stream>>,
+}
+
+impl Server {
+    async fn save_stream(&self, id: usize) {
+        // Holding the `RefMut` guard...
+        let mut streams = self.streams.borrow_mut();
+        let stream = streams.iter_mut().find(|s| s.id == id).unwrap();
+        // ...across an await point. Oopsie.
+        stream.save().await;
+    }
+}
+```
+
+The `save` procedure can be split into two parts:
+ - The mutation of the in-memory state
+ - The I/O operation using `storage`
+
+This way our `RefCell` can be much more granular: we use it only for the in-memory representation of `Stream`, while the storage is kept out of band. But for that we needed a bigger gun, so let us introduce **ECS** (Entity Component System).
+
+One might be familiar with `ECS` from game engines rather than from message-streaming platforms; personally, I think the general idea behind ECS, **SoA** (struct of arrays), is fairly underrated in general.
+What we did is split the `Entities` (Streams, Topics, Partitions, etc.) into their components, where each component is stored in its own dedicated collection.
+
+![Entity Component System in-memory 
representation](/thread-per-core-io_uring/ecs.png)
+
+In this case our components are `State` and `Storage`. This allows us to write:
+
+```rs
+struct Streams {
+    states: Vec<RefCell<State>>,
+    storages: Vec<Storage>,
+}
+
+impl Streams {
+    async fn save_stream(&self, id: usize) {
+        self.with_component_by_id_mut(id, |mut state| {
+            // Update the in-memory representation.
+        });
+
+        // AsyncFn closures just got stabilized while we were working on this,
+        // what a coincidence :)
+        self.with_component_by_id_async(id, async |storage| {
+            storage.save().await;
+        }).await;
+    }
+}
+```
+
+We accompany the `Streams` ECS with component closures that statically disallow `async` code inside a mutable borrow, and voilà.
+
+Well, this approach crumbles just as miserably as the *naive* attempt...
+
+The thread-per-core shared-nothing architecture requires broadcasting events 
whenever state changes on one shard. For example, if `shard-0` receives a 
`CreateStream` request, once it finishes processing, it broadcasts a 
`CreatedStream` event through a channel to all other shards. On the receiving 
end, each shard has a background task that polls this channel for incoming 
events. The crux of the issue lies in the word **background**.
+
+![Sequence diagram demonstrating non deterministic event 
handling](/thread-per-core-io_uring/diagram_szpont.png)
+
+In our `Streams` example, it might not look like a big deal, but in reality our other `Entities` were much more complicated, and that's without even introducing other background workers that aren't strictly part of the thread-per-core shared-nothing architecture. A solution to this problem could be an `async` lock, but those can be [footguns as well](https://rfd.shared.oxide.computer/rfd/0400).
+
+To our surprise, the issue persisted even in scenarios where we enforced a single-writer principle (we dedicated one shard to become the serialization point for all requests), which was the final nail in the coffin that led us to conclude the experiment as a failure. Maintaining non-shared but consistent state is much more difficult than *just use message passing, bro*.
+
+## Thread-per-core shared-*something* architecture™
+
+> ### All roads lead to a _Mutex_, but a much more sophisticated one.
+
+After a long fight with `interior mutability`, we gave up on trying to make fetch happen. Instead, we doubled down on the artifact from the previous iteration (the single-writer principle). We divided our `resources` into two groups: shared, strongly consistent resources and sharded, eventually consistent ones. An example of a sharded resource is `Partition`, while `Streams` and `Topics` remain shared and strongly consistent. This split was later coined the Control Plane/Data Plane split.
+
+For shared resources, we decided to use [`left-right`](https://github.com/jonhoo/left-right), a concurrent data structure designed for a single writer and multiple readers. It works by maintaining two pointers to the underlying data: one for readers and one for the writer. During a writer commit, those pointers are swapped atomically (greatly simplifying). The single writer is the first shard, `shard0`, while the remaining shards have a `read` handle to the data. If a shard other [...]
+
+As for our partitions, we maintain one shared table (a `DashMap`) called `shards_table` that functions as a barrier, fencing off requests that would try to access a `Partition` that is in the process of creation or deletion. The requests are still routed to the appropriate shard that contains the `Partition`, but by consulting the `shards_table` (both during and after the routing), we make sure that the eventual consistency does not come back to bite us.
+
+## More caveats
+
+This design turned out to be a can of worms, or a bottomless pit, if you prefer. There are plenty more questions to answer, for example, load balancing. In the `tokio` case, this was fairly simple because it was handled by the work-stealing executor. In our case, if access patterns are unpredictable and some shards become hotspots, we have to deal with that ourselves, a true double-edged sword. A theoretical optimization that we may employ in the future is to shard certain partitions acr [...]
+
+> #### One can imagine other ways to architect a shared-nothing system that may mitigate these forms of imbalance (such as caching hot keys on additional partitions).
+
+We can exploit the fact that our `Partition` uses a **segmented log**, thus the partition can be sharded even further based on the segment range and knowledge of which segments are sealed.
+
+Getting the performance benefits out of `io_uring` itself is a challenge of its own (it's not enough to just swap `tokio` for an `io_uring`-based runtime). In order to fully take advantage of the `io_uring` design, one has to heavily batch syscalls, as this is the main advantage of such an interface (fewer context switches from userspace to kernel space). Rust `Futures` can be composed together pretty well to facilitate that, but you have to be careful!
+
+The following code snippet submits two I/O operations in one "batch", but `io_uring` does not guarantee that the submission order equals the completion order!
+
+This "chain" can potentially execute out of order, and if your server were to crash halfway through, your block-device state would be broken.
+```rs
+let file = compio::fs::File::open("foo.bar").await.unwrap();
+
+let content = b"some".to_vec();
+let content1 = b"bytes".to_vec();
+
+// Now batch those two writes together. `join!` polls both futures
+// concurrently, so both SQEs can be submitted in one go.
+let write1 = file.write_all(content);
+let write2 = file.write_all(content1);
+
+join!(write1, write2);
+```
+
+To submit a batch while preserving operation order, one must use the io_uring 
chaining flag `IOSQE_IO_LINK` on the submitted SQEs, which brings us to the 
next point.
+
+## The state of Rust async runtimes ecosystem
+
+The problem is twofold: at the time of writing this blog post, there is no 
Rust equivalent of the `Seastar` framework. That is unfortunate, because 
`glommio` attempted to be one, but things changed: Glauber moved on to work on 
`Turso`, and the Datadog team does not seem to be actively maintaining the 
runtime while building [a real-time time-series storage engine in Rust for 
performance at 
scale](https://www.datadoghq.com/blog/engineering/rust-timeseries-engine/). 
They mention `sharding`  [...]
+
+Secundo problemo is that these runtimes imitate the `std` library APIs, which are `POSIX`-compliant, while many of `io_uring`'s most powerful features are not, leaving those capabilities out of reach for us mere mortals. Request chaining is only the tip of the iceberg; there is plenty more, for example `oneshot` APIs for listen/recv, `registered buffers`, and so on. Ultimately, `File`, `TcpListener`, and `TcpStream` are not the right abstractions. From the point of view of `POSIX` complia [...]
+
+It seems like we are not the only ones who are aware of this problem: [Microsoft recently announced a thread-per-core async runtime](https://www.reddit.com/r/rust/comments/1p1flpx/kimojio_a_threadpercore_linux_io_uring_async/) that uses `Operation` as the unit of abstraction, which is a much better idea.
+
+It's worth noting that one of the key reasons we ended up going with `compio` is that they want to move with the times and expose more and more `io_uring` APIs. Their codebase is structured so that the driver is decoupled from the executor; I would push the pluggability even further. A very hot topic in distributed systems these days is `DST` (Deterministic Simulation Testing): the idea is to replace all non-deterministic sources in your system (network, block devices, time, etc.) with d [...]
+
+## Benchmarking
+
+// TODO: run benchmarks on AWS
+
+## Closing words
+Finally, even though we went into significant detail in this blog post, we have only scratched the surface of what is possible, and several subsections could easily be blog posts of their own. If you are interested in learning more about thread-per-core shared-nothing design, check out the `Seastar` framework; it is the state of the art in this space. For now, we shift our attention to the [on-going work on clustering](https://github.com/apache/iggy/releases/tag/server-0.7.0), using [Viewstamped Repl [...]
+
+Stay tuned: a deep-dive blog post on that is coming, and we're just getting started 🚀
+
+
+
diff --git a/public/thread-per-core-io_uring/diagram_szpont.png 
b/public/thread-per-core-io_uring/diagram_szpont.png
new file mode 100644
index 00000000..a6727e1a
Binary files /dev/null and b/public/thread-per-core-io_uring/diagram_szpont.png 
differ
diff --git a/public/thread-per-core-io_uring/ecs.png 
b/public/thread-per-core-io_uring/ecs.png
new file mode 100644
index 00000000..cd5e7c97
Binary files /dev/null and b/public/thread-per-core-io_uring/ecs.png differ
diff --git a/public/thread-per-core-io_uring/sharding.png 
b/public/thread-per-core-io_uring/sharding.png
new file mode 100644
index 00000000..98db8066
Binary files /dev/null and b/public/thread-per-core-io_uring/sharding.png differ
diff --git a/source.config.ts b/source.config.ts
index e69238b3..48035697 100644
--- a/source.config.ts
+++ b/source.config.ts
@@ -21,6 +21,7 @@ export const blogPosts = defineCollections({
       .or(z.date())
       .transform((value) => new Date(value)),
     tags: z.array(z.string()).optional().default([]),
+    draft: z.boolean().optional().default(false),
   }),
 });
 
diff --git a/src/app/(site)/blogs/[...slug]/page.tsx 
b/src/app/(site)/blogs/[...slug]/page.tsx
index 7163256f..d3d13e43 100644
--- a/src/app/(site)/blogs/[...slug]/page.tsx
+++ b/src/app/(site)/blogs/[...slug]/page.tsx
@@ -18,6 +18,7 @@ function findPostByDateSlug(slugParts: string[]) {
   const [year, month, day, postSlug] = slugParts;
   return (
     blogPosts.find((p) => {
+      if (p.draft) return false;
       const date = new Date(p.date);
       return (
         date.getFullYear() === Number(year) &&
@@ -79,7 +80,7 @@ export default async function BlogPost(props: {
 }
 
 export function generateStaticParams() {
-  return blogPosts.map((page) => {
+  return blogPosts.filter((page) => !page.draft).map((page) => {
     const date = new Date(page.date);
     const year = String(date.getFullYear());
     const month = String(date.getMonth() + 1).padStart(2, "0");
diff --git a/src/app/(site)/blogs/page.tsx b/src/app/(site)/blogs/page.tsx
index 3b542067..0e86ecab 100644
--- a/src/app/(site)/blogs/page.tsx
+++ b/src/app/(site)/blogs/page.tsx
@@ -17,9 +17,11 @@ function getSlug(filePath: string): string {
 }
 
 export default function BlogIndex() {
-  const posts = [...blogPosts].sort(
+  const posts = [...blogPosts]
+    .filter((post) => !post.draft)
+    .sort(
     (a, b) => new Date(b.date).getTime() - new Date(a.date).getTime(),
-  );
+    );
 
   return (
     <main className="min-h-screen px-6 py-20 md:px-12">
