On Wed, 28 Jun 2023 at 22:44, Ilya Dryomov <[email protected]> wrote:
>> ** TL;DR
>>
>> In testing, the write latency performance of a PWL-cache-backed RBD
>> disk was 2 orders of magnitude worse than that of the disk holding
>> the PWL cache.
>>
>> ** Summary
>>
>> I was hoping that the PWL cache might be a good solution to the
>> problem of the write latency requirements of etcd when running a
>> Kubernetes control plane on Ceph. Etcd is extremely write-latency
>> sensitive and becomes unstable if write latency is too high. The etcd
>> workload can be characterised by very small (~4k) writes with a queue
>> depth of 1. Throughput, even on a busy system, is normally very low.
>> As etcd is distributed and can safely handle the loss of un-flushed
>> data from a single node, a local SSD PWL cache for etcd looked like
>> an ideal solution.
>
> Right, this is exactly the use case that the PWL cache is supposed to
> address.
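[Editor's note: the etcd workload described above (small synchronous writes at queue depth 1) can be sketched outside of fio with a few lines of Python. This is a hypothetical illustration, not part of the original thread; the file name and write count are arbitrary.]

```python
import os
import tempfile
import time


def fdatasync_latencies(path, writes=256, bs=4096):
    """Issue `writes` sequential writes of `bs` bytes, calling fdatasync
    after each one (i.e. queue depth 1, as etcd does on every commit),
    and return the per-write latencies in nanoseconds."""
    buf = os.urandom(bs)
    lat = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(writes):
            t0 = time.perf_counter_ns()
            os.write(fd, buf)
            os.fdatasync(fd)  # write is not "done" until it is durable
            lat.append(time.perf_counter_ns() - t0)
    finally:
        os.close(fd)
    return lat


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        lat = sorted(fdatasync_latencies(os.path.join(d, "etcd_perf")))
    # Report the same metric the thread focuses on: the 99th percentile.
    print("p99 fdatasync latency: %d ns" % lat[int(len(lat) * 0.99)])
```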
Good to know!

>> My expectation was that adding a PWL cache on a local SSD to an
>> RBD-backed VM would improve write latency to something approaching
>> the write latency of the local SSD. However, in my testing, adding a
>> PWL cache to an RBD-backed VM increased write latency by
>> approximately 4x over not using a PWL cache. This was over 100x the
>> write latency of the underlying SSD.
>>
>> My expectation was based on the documentation here:
>> https://docs.ceph.com/en/quincy/rbd/rbd-persistent-write-log-cache/
>>
>> "The cache provides two different persistence modes. In
>> persistent-on-write mode, the writes are completed only when they are
>> persisted to the cache device and will be readable after a crash. In
>> persistent-on-flush mode, the writes are completed as soon as it no
>> longer needs the caller's data buffer to complete the writes, but does
>> not guarantee that writes will be readable after a crash. The data is
>> persisted to the cache device when a flush request is received."
>>
>> ** Method
>>
>> 2 systems, 1 running single-node Ceph Quincy (17.2.6), the other
>> running libvirt and mounting a VM's disk with librbd (also 17.2.6)
>> from the first node.
>>
>> All performance testing is from the libvirt system. I tested write
>> latency performance:
>>
>> * Inside the VM without a PWL cache
>> * Of the PWL device directly from the host (direct to filesystem, no VM)
>> * Inside the VM with a PWL cache
>>
>> I am testing with fio. Specifically, I am running a containerised
>> test, executed with:
>>
>>   podman run --volume .:/var/lib/etcd:Z quay.io/openshift-scale/etcd-perf
>>
>> This container runs:
>>
>>   fio --rw=write --ioengine=sync --fdatasync=1 \
>>       --directory=/var/lib/etcd --size=100m --bs=8000 --name=etcd_perf \
>>       --output-format=json --runtime=60 --time_based=1
>>
>> and extracts sync.lat_ns.percentile["99.000000"].
>
> Matthew, do you have the rest of the fio output captured?
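[Editor's note: for readers unfamiliar with fio's JSON output, the extraction step above amounts to pulling a single field out of it. A minimal hypothetical sketch, where the abbreviated sample dict stands in for real `fio --output-format=json` output:]

```python
import json


def p99_fdatasync_ns(fio_json: str) -> float:
    """Return the 99th-percentile fdatasync latency in ns from fio JSON output.

    fio reports sync (fdatasync) completion latency percentiles per job
    under jobs[].sync.lat_ns.percentile when --fdatasync is used.
    """
    data = json.loads(fio_json)
    return data["jobs"][0]["sync"]["lat_ns"]["percentile"]["99.000000"]


# Heavily abbreviated stand-in for real fio output:
sample = json.dumps({
    "jobs": [{
        "jobname": "etcd_perf",
        "sync": {"lat_ns": {"percentile": {"99.000000": 5210112}}},
    }]
})
print(p99_fdatasync_ns(sample))  # -> 5210112
```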
> It would be interesting to see if it's just the 99th percentile that
> is bad or the PWL cache is worse in general.

Sure.

With PWL cache: https://paste.openstack.org/show/820504/
Without PWL cache: https://paste.openstack.org/show/b35e71zAwtYR2hjmSRtR/
With PWL cache, rbd_cache=false: https://paste.openstack.org/show/byp8ZITPzb3r9bb06cPf/

>> ** Results
>>
>> All results were stable across multiple runs within a small margin of
>> error.
>>
>> * rbd no cache: 1417216 ns
>> * pwl cache device: 44288 ns
>> * rbd with pwl cache: 5210112 ns
>>
>> Note that by adding a PWL cache we increase write latency by
>> approximately 4x, which is more than 100x that of the underlying
>> device.
>>
>> ** Hardware
>>
>> 2 x Dell R640s, each with a Xeon Silver 4216 CPU @ 2.10GHz and 192G RAM.
>> Storage under test: 2 x SAMSUNG MZ7KH480HAHQ0D3 SSDs attached to a
>> PERC H730P Mini (Embedded).
>>
>> OS installed on rotational disks.
>>
>> N.B. Linux incorrectly detects these disks as rotational, which I
>> assume relates to weird behaviour by the PERC controller. I remembered
>> to manually correct this on the 'client' machine for the PWL cache,
>> but at OSD configuration time Ceph would have detected them as
>> rotational. They are not rotational.
>>
>> ** Ceph Configuration
>>
>> CentOS Stream 9
>>
>> # ceph version
>> ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
>>
>> Single-node installation with cephadm. 2 OSDs, one on each SSD.
>> 1 pool with size 2.
>>
>> ** Client Configuration
>>
>> Fedora 38
>> librbd1-17.2.6-3.fc38.x86_64
>>
>> The PWL cache is on an XFS filesystem with a 4k block size, matching
>> the underlying device. The filesystem uses the whole block device.
>> There is no other load on the system.
>>
>> ** RBD Configuration
>>
>> # rbd config image list libvirt-pool/pwl-test | grep cache
>> rbd_cache                            true                         config
>
> I wonder if rbd_cache should have been set to false here to disable
> the default volatile cache. Other than that, I don't see anything
> obviously wrong with the configuration at first sight.

I added some full output for this above.

>
> --
> Ilya
>
>> rbd_cache_block_writes_upfront       false                        config
>> rbd_cache_max_dirty                  25165824                     config
>> rbd_cache_max_dirty_age              1.000000                     config
>> rbd_cache_max_dirty_object           0                            config
>> rbd_cache_policy                     writeback                    pool
>> rbd_cache_size                       33554432                     config
>> rbd_cache_target_dirty               16777216                     config
>> rbd_cache_writethrough_until_flush   true                         pool
>> rbd_parent_cache_enabled             false                        config
>> rbd_persistent_cache_mode            ssd                          pool
>> rbd_persistent_cache_path            /var/lib/libvirt/images/pwl  pool
>> rbd_persistent_cache_size            1073741824                   config
>> rbd_plugins                          pwl_cache                    pool
>>
>> # rbd status libvirt-pool/pwl-test
>> Watchers:
>>         watcher=10.1.240.27:0/1406459716 client.14475 cookie=140282423200720
>> Persistent cache state:
>>         host: dell-r640-050
>>         path: /var/lib/libvirt/images/pwl/rbd-pwl.libvirt-pool.37e947fd216b.pool
>>         size: 1 GiB
>>         mode: ssd
>>         stats_timestamp: Mon Jun 26 11:29:21 2023
>>         present: true   empty: false   clean: true
>>         allocated: 180 MiB
>>         cached: 135 MiB
>>         dirty: 0 B
>>         free: 844 MiB
>>         hits_full: 1 / 0%
>>         hits_partial: 3 / 0%
>>         misses: 21952
>>         hit_bytes: 6 KiB / 0%
>>         miss_bytes: 349 MiB

--
Matthew Booth

_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
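[Editor's note: for anyone wanting to reproduce this setup, the RBD configuration shown above corresponds roughly to the following commands. This is a hedged sketch derived from the option names, values, and sources ("pool"/"config") in the listing, not commands quoted from the thread; adjust pool and image names to your environment.]

```shell
# Pool-sourced options ("pool" in the SOURCE column above):
rbd config pool set libvirt-pool rbd_plugins pwl_cache
rbd config pool set libvirt-pool rbd_persistent_cache_mode ssd
rbd config pool set libvirt-pool rbd_persistent_cache_path /var/lib/libvirt/images/pwl

# Cluster-config-sourced option ("config" in the SOURCE column above):
ceph config set client rbd_persistent_cache_size 1073741824

# Verify the cache is attached and collecting stats:
rbd status libvirt-pool/pwl-test
```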
