On Mon, Jul 3, 2023 at 6:58 PM Mark Nelson <[email protected]> wrote:
>
>
> On 7/3/23 04:53, Matthew Booth wrote:
> > On Thu, 29 Jun 2023 at 14:11, Mark Nelson <[email protected]> wrote:
> >>>>> This container runs:
> >>>>> fio --rw=write --ioengine=sync --fdatasync=1
> >>>>> --directory=/var/lib/etcd --size=100m --bs=8000 --name=etcd_perf
> >>>>> --output-format=json --runtime=60 --time_based=1
> >>>>>
> >>>>> And extracts sync.lat_ns.percentile["99.000000"]
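
For reference, that value can be pulled straight out of the fio JSON
output. A minimal sketch, assuming a single fio job and that jq is
available (the output filename is a placeholder):

  jq '.jobs[0].sync.lat_ns.percentile."99.000000"' fio-output.json

The result is reported in nanoseconds.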
> >>>> Matthew, do you have the rest of the fio output captured? It would be
> >>>> interesting to see whether it's just the 99th percentile that is bad or
> >>>> whether the PWL cache is worse in general.
> >>> Sure.
> >>>
> >>> With PWL cache: https://paste.openstack.org/show/820504/
> >>> Without PWL cache: https://paste.openstack.org/show/b35e71zAwtYR2hjmSRtR/
> >>> With PWL cache, 'rbd_cache'=false:
> >>> https://paste.openstack.org/show/byp8ZITPzb3r9bb06cPf/
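
For context on what those three runs toggle, here is a rough sketch of
the client-side settings involved, assuming a recent librbd client
(option names per the Ceph docs; the path and size values are
placeholders, not the ones used in these tests):

  [client]
  rbd_cache = false                    # in-memory write-back cache
  rbd_plugins = pwl_cache              # enable the persistent write-log cache
  rbd_persistent_cache_mode = ssd
  rbd_persistent_cache_path = /mnt/pwl
  rbd_persistent_cache_size = 1G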
> >>
> >> Also, how's the CPU usage client side? I would be very curious to see
> >> if unwindpmp shows anything useful (especially lock contention):
> >>
> >>
> >> https://github.com/markhpc/uwpmp
> >>
> >>
> >> Just attach it to the client-side process and start out with something
> >> like 100 samples (more are better but take longer). You can run it like:
> >>
> >>
> >> ./unwindpmp -n 100 -p <pid>
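
For an RBD-backed VM the librbd client is typically the qemu process
(or rbd-nbd if the image is mapped through NBD), so in practice the
invocation ends up looking something like the following, where the
process name is an assumption about the setup:

  ./unwindpmp -n 100 -p $(pidof qemu-kvm) > unwindpmp.out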
> > I've included the output in this gist:
> > https://gist.github.com/mdbooth/2d68b7e081a37e27b78fe396d771427d
> >
> > That gist contains 4 runs: 2 with PWL enabled and 2 without, and also
> > a markdown file explaining the collection method.
> >
> > Matt
>
>
> Thanks Matt! I looked through the output. Looks like the symbols might
> have gotten mangled. I'm not an expert on the RBD client, but I don't
> think we would really be calling into
> rbd_group_snap_rollback_with_progress from
> librbd::cache::pwl::ssd::WriteLogEntry::writeback_bl. Is it possible
> you used the libdw backend for unwindpmp? libdw sometimes gives
> strange/mangled callgraphs, but I haven't seen that before with
> libunwind. Hopefully Congmin Yin or Ilya can confirm whether it's garbage.
>
> So with that said, assuming we can trust these callgraphs at all, it
> looks like it might be worth looking at the latency of the
> AbstractWriteLog, librbd::cache::pwl::ssd::WriteLogEntry::writeback_bl,
> and possibly usage of librados::v14_2_0::IoCtx::object_list. On the

Hi Mark,

Neither the rbd_group_snap_rollback_with_progress nor the
librados::v14_2_0::IoCtx::object_list entries make sense to me, so I'd
say it's garbage.

Thanks,
Ilya