On Thu, Jun 29, 2017 at 12:16 AM Peter Maloney <
[email protected]> wrote:

> On 06/28/17 21:57, Gregory Farnum wrote:
>
> On Wed, Jun 28, 2017 at 9:17 AM Peter Maloney <
> [email protected]> wrote:
>
> On 06/28/17 16:52, [email protected] wrote:
>>
>> [...] backup VMs is to create a snapshot with Ceph commands (rbd snapshot)
>> and then download (rbd export) it.
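The snapshot-then-export backup flow described above can be sketched like this. The pool, image, and snapshot names are illustrative placeholders; the `rbd snap create`, `rbd export`, and `rbd snap rm` subcommands are the standard CLI for this workflow:

```python
# Sketch of the snapshot-then-export backup flow described above.
# Pool/image/snapshot names are illustrative placeholders.
import subprocess

def backup_commands(pool, image, snap, dest):
    """Return the rbd command lines for a snapshot-based backup."""
    spec = f"{pool}/{image}@{snap}"
    return [
        ["rbd", "snap", "create", spec],   # freeze a point-in-time view
        ["rbd", "export", spec, dest],     # download the snapshot to a file
        ["rbd", "snap", "rm", spec],       # clean up the snapshot afterwards
    ]

def run_backup(pool, image, snap, dest):
    """Execute the backup; requires a reachable Ceph cluster."""
    for cmd in backup_commands(pool, image, snap, dest):
        subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Print the commands rather than run them, since this is only a sketch.
    for cmd in backup_commands("rbd", "vm-disk-1", "backup-snap",
                               "/tmp/vm-disk-1.img"):
        print(" ".join(cmd))
```

Note that it is the `rbd snap rm` at the end (and any long-lived snapshots kept around) that triggers the trimming load discussed below.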
>>
>>
>>
>> We found very high disk read/write latency during snapshot creation and
>> deletion; it can be higher than 10000 ms.
>>
>>
>>
>> Even when no backup jobs are running, we often see latency of more than
>> 4000 ms.
>>
>>
>>
>> Users start to complain.
>>
>> Could you please advise us on how to start troubleshooting?
>>
>>
>>
>> For creating snaps and keeping them, this was marked wontfix
>> http://tracker.ceph.com/issues/10823
>>
>> For deleting, see the recent "Snapshot removed, cluster thrashed" thread
>> for some config to try.
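For the deletion case, the throttling discussed in that thread is typically along these lines. The option names are real OSD settings of that era; the values shown are illustrative assumptions, not tested recommendations:

```ini
[osd]
# Pause between snap-trim operations so deletions don't starve client IO
# (seconds; the default of 0 means no sleep).
osd snap trim sleep = 0.1
# Lower the work-queue priority of snap trimming relative to client ops
# (default is 5; smaller is lower priority).
osd snap trim priority = 1
```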
>>
>
> Given he says he's seeing 4 second IOs even without snapshot involvement,
> I think Keynes must be seeing something else in his cluster.
>
>
> If you have few enough OSDs and slow enough journals that things seem OK
> without snaps, then with snaps it can get much worse than 4 s IOs if you
> have any sync-heavy clients, like Ganglia.
>
> Before I figured out that it was exclusive-lock causing VMs to hang, I
> tested many things and spent months on it before finding that out. Also,
> some people in the freenode IRC ##proxmox channel with cheap home Ceph
> setups often complain about such things.
>
>
>
>
>>
>> https://storageswiss.com/2016/04/01/snapshot-101-copy-on-write-vs-redirect-on-write/
>>
>> Consider a *copy-on-write* system, which *copies* any blocks before they
>> are overwritten with new information (i.e. it copies on writes). In other
>> words, if a block in a protected entity is to be modified, the system will
>> copy that block to a separate snapshot area before it is overwritten with
>> the new information. This approach requires three I/O operations for each
>> write: one read and two writes. [...] This decision process for each block
>> also comes with some computational overhead.
>>
>>
>> A *redirect-on-write* system uses pointers to represent all protected
>> entities. If a block needs modification, the storage system merely
>> *redirects* the pointer for that block to another block and writes the
>> data there. [...] There is zero computational overhead of reading a
>> snapshot in a redirect-on-write system.
>>
>>
>> The redirect-on-write system uses 1/3 the number of I/O operations when
>> modifying a protected block, and it uses no extra computational overhead
>> reading a snapshot. Copy-on-write systems can therefore have a big impact
>> on the performance of the protected entity. The more snapshots are created
>> and the longer they are stored, the greater the impact to performance on
>> the protected entity.
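The quoted 3-vs-1 I/O claim can be made concrete with a toy model. This is only a sketch: it counts logical block I/Os for overwriting one protected block under each strategy, and (as Greg's footnote below points out) it ignores the metadata persistence a real redirect-on-write system also has to pay for:

```python
# Toy model of the quoted 3-vs-1 I/O claim: count logical block I/Os for
# overwriting one protected block under each snapshot strategy. Metadata
# persistence costs, which real ROW systems also pay, are not counted.

def cow_write(volume, snap_area, block, new_data):
    """Copy-on-write: read old data, copy it aside, then overwrite in place."""
    old = volume[block]            # 1 read
    snap_area[block] = old         # 1 write (preserve block for the snapshot)
    volume[block] = new_data       # 1 write (overwrite in place)
    return 3                       # I/O operations

def row_write(volume, pointers, block, new_data, next_free):
    """Redirect-on-write: write new data elsewhere and repoint the block."""
    volume[next_free] = new_data   # 1 write to a fresh location
    pointers[block] = next_free    # pointer update (metadata, not counted)
    return 1                       # I/O operations

vol, snap = {0: b"old"}, {}
assert cow_write(vol, snap, 0, b"new") == 3    # old data preserved in snap
assert row_write({0: b"old"}, {}, 0, b"new", next_free=1) == 1
```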
>>
>>
> I wouldn't consider that a very realistic depiction of the tradeoffs
> involved in different snapshotting strategies[1], but BlueStore uses
> "redirect-on-write" under the formulation presented in those quotes. RBD
> clones of protected images will remain copy-on-write forever, I imagine.
> -Greg
>
> It was simply the first link I found that I could quote, and I didn't find
> it too bad... just that it describes things as though all implementations
> are the same.
>
>
> [1]: There's no reason to expect a copy-on-write system will first copy
> the original data and then overwrite it with the new data when it can
> simply inject the new data along the way. *Some* systems will copy the
> "old" block into a new location and then overwrite in the existing location
> (it helps prevent fragmentation), but many don't. And a "redirect-on-write"
> system needs to persist all those block metadata pointers, which may be
> much cheaper or much, much more expensive than just duplicating the blocks.
>
>
> After a snap is unprotected, will the clones be redirect-on-write? Or
> after the image is flattened (like dd if=/dev/zero to the whole disk)?
>
> Are there other cases where you get a copy-on-write behavior?
>
> Glad to hear BlueStore has something better. Is that available and the
> default behavior on Kraken (which I tested, but where it didn't seem to be
> fixed, although all storage backends were less prone to blocking on
> Kraken)?
>
> If it were a true redirect-on-write system, I would expect that when you
> make a snap there is just the overhead of organizing some metadata, and
> that after that any write simply goes to a new place as normal, without
> requiring any of the old data to be copied, even for partially written
> objects. And I don't think I saw that behavior in my Kraken tests, although
> the performance was better (there were no blocked requests, but peak IOPS
> was basically the same; and I didn't measure total IO or anything more
> reliable... I just looked at performance effects and blocking).
>
>
Bluestore was available for dev/testing in Kraken, but not the default. I
think it's going to be the default in Luminous, and yes, it's "just
metadata" with new block locations for updates.

Anything involving RBD clones is fundamentally different from "normal"
snapshots, though — when you clone an RBD volume, you are writing data to a
completely new location so the object has to be copied when you modify that
object. (The only alternative would be to keep a per-block bitmap — ie, to
keep in memory a data structure roughly 1/1000 the size of your volume for
every layer of cloning you have, to indicate if it's in the new overwrite
location or in the parent image.)
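The per-block bitmap alternative described above can be sketched as a back-of-the-envelope calculation: one bit per block recording whether that block has been overwritten in the clone or still lives in the parent image. The block and volume sizes below are illustrative assumptions, not anything Ceph uses:

```python
# Sketch of the per-block bitmap described above: one bit per block saying
# whether the block lives in the clone's overwrite location or in the
# parent image. Block/volume sizes here are illustrative assumptions.

def bitmap_bytes(volume_bytes, block_bytes):
    """Memory needed for a 1-bit-per-block overwrite-tracking bitmap."""
    blocks = -(-volume_bytes // block_bytes)   # ceiling division
    return -(-blocks // 8)                     # 8 blocks tracked per byte

TiB = 1 << 40
# A 1 TiB clone tracked at 4 KiB granularity needs a 32 MiB bitmap,
# and one such bitmap per layer of cloning would stay resident in memory.
print(bitmap_bytes(1 * TiB, 4096))   # prints 33554432 (32 MiB)
```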
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com