One of the biggest changes flash is making in the storage world is to the basic 
trade-offs in storage management software architecture. In the HDD world, CPU 
time per IOP was relatively inconsequential, i.e., it had little effect on 
overall performance, which was limited by the physics of the hard drive. Flash 
is now inverting that situation. When you look at the performance levels 
delivered by the latest generation of NVMe SSDs, you quickly see that the 
storage itself is generally no longer the bottleneck (speaking about bandwidth, 
not latency, of course); rather, it's the system sitting in front of the 
storage, and generally the CPU cost of an IOP in particular.
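
To put rough numbers on that (purely illustrative, not measurements): a single 
modern NVMe SSD can deliver on the order of 400-500K random 4K IOPS. If the 
software stack spends, say, 10K CPU cycles per IOP, then keeping just one such 
device busy costs roughly

    500,000 IOPS x 10,000 cycles/IOP = 5x10^9 cycles/sec
                                    ~= two fully-busy 2.5 GHz cores

before the drive itself is anywhere near its limit. Cut the per-IOP CPU cost 
and the same cores drive correspondingly more IOPS.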

When SanDisk first started working with Ceph (Dumpling), the design of librados 
and the OSD led to a situation where the CPU cost of an IOP was dominated by 
context switches and network socket handling. Over time, much of that has been 
addressed: the socket handling code has been re-written (more than once!), and 
some of the internal queueing in the OSD (and the associated context switches) 
has been eliminated. As the CPU costs have dropped, performance on flash has 
improved accordingly.

Because we didn't want to completely re-write the OSD (time-to-market and 
stability drove that decision), we didn't move it from the current "thread per 
IOP" model into a truly asynchronous "thread per CPU core" model that 
essentially eliminates context switches in the IO path. But a fully optimized 
OSD would go down that path (at least part-way). I believe it's been proposed 
in the past. Perhaps a hybrid "fast-path" style could get most of the benefits 
while preserving much of the legacy code.
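
For concreteness, here is a minimal sketch of the "thread per CPU core" idea 
(my own illustration, not OSD code; all names are made up): one worker thread 
per core, pinned to that core, each with its own request queue, and every 
request runs to completion on the core it was queued to, so there is no 
hand-off or context switch in the IO path. Linux-specific, C++:

    // Sketch only: one pinned worker per core, run-to-completion per shard.
    #include <pthread.h>
    #include <sched.h>
    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <thread>
    #include <vector>

    struct Shard {
        std::mutex m;                          // a real design would use a lock-free SPSC ring
        std::deque<std::function<void()>> q;   // pending "IOPs" for this core
        std::atomic<bool> stop{false};
    };

    static void pin_to_core(unsigned core) {   // dedicate the calling thread to one core
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main() {
        unsigned ncores = std::thread::hardware_concurrency();
        if (ncores == 0) ncores = 2;
        std::vector<Shard> shards(ncores);
        std::vector<std::thread> workers;

        for (unsigned c = 0; c < ncores; ++c) {
            workers.emplace_back([&shards, c] {
                pin_to_core(c);
                Shard &s = shards[c];
                while (!s.stop.load(std::memory_order_relaxed)) {
                    std::function<void()> fn;
                    {
                        std::lock_guard<std::mutex> l(s.m);
                        if (s.q.empty()) continue;   // busy-poll; real code would back off
                        fn = std::move(s.q.front());
                        s.q.pop_front();
                    }
                    fn();   // the whole IOP runs to completion here, on this core
                }
            });
        }

        // Submit a few fake IOPs; each is hashed to one shard and never migrates.
        for (int i = 0; i < 8; ++i) {
            Shard &s = shards[i % ncores];
            std::lock_guard<std::mutex> l(s.m);
            s.q.emplace_back([i] { std::printf("iop %d handled\n", i); });
        }

        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        for (auto &s : shards) s.stop = true;
        for (auto &t : workers) t.join();
        return 0;
    }

A real implementation would replace the mutex-protected queue with a lock-free 
ring and add back-off instead of busy-polling, but the structural point stands: 
an IOP never leaves its core.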

I believe this trend toward thread-per-core software development will also tend 
to support the "do it in user-space" trend, because most of the kernel and 
file-system interface is architected around the blocking "thread-per-IOP" model 
and is unlikely to change in the future.
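
As one concrete example of what the user-space direction can look like (a 
sketch under my own assumptions, not a statement about what newstore will do): 
Linux native AIO lets a single thread submit reads against a raw block device 
with O_DIRECT and reap the completions later, instead of parking a blocking 
thread on every outstanding IO. Device path and sizes below are placeholders, 
error handling is abbreviated, build with -laio:

    // Sketch only: asynchronous read from a raw block device via libaio.
    #include <libaio.h>      // Linux native AIO (link with -laio)
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        const char *dev = "/dev/nvme0n1";            // placeholder device
        int fd = open(dev, O_RDONLY | O_DIRECT);
        if (fd < 0) { std::perror("open"); return 1; }

        io_context_t ctx = 0;
        if (io_setup(128, &ctx) != 0) { std::fprintf(stderr, "io_setup failed\n"); return 1; }

        // O_DIRECT requires sector-aligned buffers, lengths, and offsets.
        void *buf = nullptr;
        if (posix_memalign(&buf, 4096, 4096) != 0) return 1;

        struct iocb cb;
        struct iocb *cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, 4096, 0);        // 4 KiB read at offset 0
        if (io_submit(ctx, 1, cbs) != 1) { std::fprintf(stderr, "io_submit failed\n"); return 1; }

        // The submitting thread is free to do other work here; nothing blocks per IO.
        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, nullptr);       // reap one completion
        std::printf("read completed, res=%ld\n", (long)ev.res);

        io_destroy(ctx);
        std::free(buf);
        close(fd);
        return 0;
    }

The point isn't this particular API; it's that the asynchronous submit/reap 
style composes naturally with a thread-per-core design, whereas the classic 
blocking file-system call path does not.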


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
[email protected]

-----Original Message-----
From: Martin Millnert [mailto:[email protected]]
Sent: Thursday, October 22, 2015 6:20 AM
To: Mark Nelson <[email protected]>
Cc: Ric Wheeler <[email protected]>; Allen Samuels 
<[email protected]>; Sage Weil <[email protected]>; 
[email protected]
Subject: Re: newstore direction

Adding 2c

On Wed, 2015-10-21 at 14:37 -0500, Mark Nelson wrote:
> My thought is that there is some inflection point where the userland
> kvstore/block approach is going to be less work, for everyone I think,
> than trying to quickly discover, understand, fix, and push upstream
> patches that sometimes only really benefit us.  I don't know if we've
> truly hit that point, but it's tough for me to find flaws with
> Sage's argument.

Regarding the userland / kernel land aspect of the topic, there are further 
aspects AFAIK not yet addressed in the thread:
In the networking world, there has been development of memory-mapped userland 
networking (multiple approaches exist), which for packet management has the 
benefit - for very, very specific networking applications - of avoiding e.g. 
per-packet context switches and of streamlining processor cache management. 
People have gone as far as removing CPU cores from the CPU scheduler to 
completely dedicate them to the networking task at hand (cache optimizations). 
There are various latency/throughput (batching) optimizations applicable, but 
at the end of the day it's about keeping the CPU bus busy with "revenue" bus 
traffic.

Granted, storage IO operations may be heavy enough in cycle counts that context 
switches never appear as a problem in themselves, certainly for slower SSDs and 
HDDs. However, when going for truly high-performance IO, *every* hurdle in the 
data path counts toward the total latency.
(And really, the characteristics of high-performance random IO approach those 
of per-packet handling in networking.)  Now, I'm not really suggesting 
memory-mapping a storage device to user space, not at all, but having better 
control over the data path for a very specific use case reduces dependency on 
code that has to work as well as possible for the general case, and allows for 
very purpose-built code that addresses a narrow set of requirements. 
("Ceph storage cluster backend" isn't a typical FS use case.) It also decouples 
the dependency on users, i.e.
waiting for the next distro release before being able to take up the benefits 
of improvements to the storage code.

A random Google search came up with related data on where "doing something way 
different" /can/ have significant benefits:
http://phunq.net/pipermail/tux3/2015-April/002147.html

I (FWIW) certainly agree there is merit to the idea.
The scientific approach here could perhaps be to simply enumerate all the 
corner cases of a "generic FS" that actually cause the experienced issues, and 
assess the probability of them being solved (and if so, when).
That *could* improve the chances of approaching consensus, which wouldn't hurt, 
I suppose?

BR,
Martin


