On 06/28/17 16:52, [email protected] wrote:
>
> We were using HP Helion 2.1.5 (OpenStack + Ceph)
>
> The OpenStack version is *Kilo* and Ceph version is *firefly*
>
>  
>
> The way we back up VMs is to create a snapshot with Ceph commands (rbd
> snapshot) and then download it (rbd export).
>
>  
>
> We see very high disk read/write latency while creating/deleting
> snapshots; it can exceed 10000 ms.
>
>  
>
> Even when no backup jobs are running, we often see latency above
> 4000 ms.
>
>  
>
> Users are starting to complain.
>
> Could you please advise us on how to start troubleshooting?
>
>  
>
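For reference, the snapshot-then-export flow described above looks roughly
like this with the standard rbd CLI (pool, image, and path names below are
placeholders, not values from the original mail; this is a sketch to run
against a test image, not a definitive backup script):

```shell
# Sketch of the backup flow: snapshot, export, then clean up.
# POOL, IMAGE, and the /backup path are placeholders -- adjust them.
POOL=volumes
IMAGE=vm-disk-1
SNAP=backup-$(date +%Y%m%d)

rbd snap create $POOL/$IMAGE@$SNAP                       # point-in-time snapshot
rbd export $POOL/$IMAGE@$SNAP /backup/$IMAGE-$SNAP.img   # download the snapshot
rbd snap rm $POOL/$IMAGE@$SNAP                           # remove it once exported
```

Note that it is the "snap create" and "snap rm" steps, and the trimming the
latter triggers, that coincide with the latency spikes reported above.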
For creating snaps and keeping them around, this was marked wontfix:
http://tracker.ceph.com/issues/10823

For deleting, see the recent "Snapshot removed, cluster thrashed" thread
for some config to try.
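One knob that thread discusses is osd_snap_trim_sleep, which throttles
snapshot trimming on the OSDs so it competes less with client I/O. Hedged:
availability and effect vary by release, so verify it against your
version's docs before relying on it. In ceph.conf it would look like:

```ini
[osd]
# Sleep (in seconds) between snap trim operations; slows trimming down
# to reduce its impact on client latency. Tune carefully -- too high and
# trim work backs up, too low and it changes nothing.
osd snap trim sleep = 0.1
```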

And I find this to be a very severe problem. And you haven't even seen
the worst of it: create more snapshots and many operations get slower and
slower (resize, clone, snap revert, etc.), although a fully flattened
image usually seems as fast as normal to a client.

Let's pool some money together as a reward for making snapshots work
properly, the modern way, like on ZFS and btrfs, where they don't have to
copy so much: they "redirect on write" rather than literally "copy on
write". (What would be a good way to pool money like that?) If others are
interested, I certainly am, though I would have to ask the boss about the
money. Even if it were only for bluestore, and therefore only for future
releases, that would be OK with me. And if it keeps the copy on the same
osd/fs as the original, that is acceptable too.


https://storageswiss.com/2016/04/01/snapshot-101-copy-on-write-vs-redirect-on-write/
> Consider a *copy-on-write* system, which /copies/ any blocks before
> they are overwritten with new information (i.e. it copies on writes).
> In other words, if a block in a protected entity is to be modified,
> the system will copy that block to a separate snapshot area before it
> is overwritten with the new information. This approach requires three
> I/O operations for each write: one read and two writes. [...] This
> decision process for each block also comes with some computational
> overhead.

> A *redirect-on-write* system uses pointers to represent all protected
> entities. If a block needs modification, the storage system merely
> /redirects/ the pointer for that block to another block and writes the
> data there. [...] There is zero computational overhead of reading a
> snapshot in a redirect-on-write system.

> The redirect-on-write system uses 1/3 the number of I/O operations
> when modifying a protected block, and it uses no extra computational
> overhead reading a snapshot. Copy-on-write systems can therefore have
> a big impact on the performance of the protected entity. The more
> snapshots are created and the longer they are stored, the greater the
> impact to performance on the protected entity.
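The article's 3-vs-1 I/O claim reduces to simple arithmetic; a toy
illustration (purely illustrative, not how Ceph, ZFS, or btrfs are
implemented):

```shell
# I/Os needed to overwrite one protected block while a snapshot exists.
cow=$((1 + 2))   # copy-on-write: one read + two writes (copy out, then overwrite)
row=1            # redirect-on-write: one write; the pointer is redirected
echo "copy-on-write: $cow I/Os, redirect-on-write: $row I/O"
```

Multiply that per-block difference by every write a VM issues while a
snapshot exists and it is easy to see where the latency complaints above
come from.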

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
