On Tue, 24 Nov 2015, Sébastien VALSEMEY wrote:
> Hello Vish,
>
> Please accept my apologies for the delay in my reply.
> Following the conversation you had with my colleague David, here are
> some more details about our work:
>
> We are working on FileStore / NewStore optimizations, studying how we
> could free ourselves from using the journal.
>
> Working with SSDs is very important, but it is also mandatory to
> combine them with regular spinning disks. This is why we are
> combining metadata storage on flash with data storage on disk.
This is pretty common, and something we will support natively with newstore.

> Our main goal is to have control over performance, which is quite
> difficult with NewStore and needs fundamental hacks with FileStore.

Can you clarify what you mean by "quite difficult with NewStore"?  FWIW,
the latest bleeding edge code is currently at github.com/liewegas/wip-bluestore.

sage

> Is Samsung working on ARM boards with embedded flash and a SATA port, in
> order to allow us to work on a hybrid approach? What is your line of
> work with Ceph?
>
> How can we work together?
>
> Regards,
> Sébastien
>
> > Begin forwarded message:
> >
> > From: David Casier <[email protected]>
> > Date: 12 October 2015 20:52:26 UTC+2
> > To: Sage Weil <[email protected]>, Ceph Development
> > <[email protected]>
> > Cc: Sébastien VALSEMEY <[email protected]>,
> > [email protected], Denis Saget <[email protected]>, "luc.petetin"
> > <[email protected]>
> > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> >
> > Ok,
> > Great.
> >
> > With these settings:
> > //
> > newstore_max_dir_size = 4096
> > newstore_sync_io = true
> > newstore_sync_transaction = true
> > newstore_sync_submit_transaction = true
> > newstore_sync_wal_apply = true
> > newstore_overlay_max = 0
> > //
> >
> > and direct IO in the benchmark tool (fio),
> >
> > I see that the HDD is 100% busy and that there is no transfer from /db
> > to /fragments after stopping the benchmark: great!
> >
> > But when I launch a bench with random 256k blocks, I see random blocks
> > between 32k and 256k on the HDD. Any idea?
> >
> > Throughput to the HDD is about 8 MB/s when it could be higher with
> > larger blocks (~30 MB/s).
> > And 70 MB/s without fsync (hard drive cache disabled).
> >
> > Other questions:
> > newstore_sync_io -> true = fsync immediately, false = fsync later (thread
> > fsync_wq)?
> > newstore_sync_transaction -> true = sync in DB?
> > newstore_sync_submit_transaction -> if false then kv_queue (only if
> > newstore_sync_transaction=false)?
> > newstore_sync_wal_apply = true -> if false then WAL later (thread wal_wq)?
> >
> > Is that right?
> >
> > Is there a way to use a battery-backed cache (sync the DB but not the data)?
> >
> > Thanks for everything!
> >
> > On 10/12/2015 03:01 PM, Sage Weil wrote:
> >> On Mon, 12 Oct 2015, David Casier wrote:
> >>> Hello everybody,
> >>> Is a fragment stored in rocksdb before being written to "/fragments"?
> >>> I separated "/db" and "/fragments", but during the bench everything is
> >>> written to "/db".
> >>> I changed the "newstore_sync_*" options without success.
> >>>
> >>> Is there any way to write all metadata to "/db" and all data to
> >>> "/fragments"?
> >>
> >> You can set newstore_overlay_max = 0 to avoid most data landing in db/.
> >> But if you are overwriting an existing object, doing write-ahead logging
> >> is usually unavoidable, because we need to make the update atomic (and the
> >> underlying POSIX fs doesn't provide that). The wip-newstore-frags branch
> >> mitigates this somewhat for larger writes by limiting fragment size, but
> >> for small IOs this is pretty much always going to be the case. For small
> >> IOs, though, putting things in db/ is generally better, since we can
> >> combine many small IOs into a single (rocksdb) journal/wal write, and
> >> often leave them there (via the 'overlay' behavior).
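For orientation, the options David quotes would live in the [osd] section of ceph.conf. The sketch below restates his values, annotated with the semantics discussed in this thread; it is a reading aid, not a recommended configuration (Sage notes these settings make little sense on a hard disk), and the db/ vs fragments/ placement itself is arranged outside ceph.conf, e.g. by mounting flash under the OSD's db/ directory.

```ini
[osd]
; values from David's test, annotated per the discussion in this thread
newstore_max_dir_size = 4096
newstore_sync_io = true                  ; fsync immediately instead of via the fsync_wq thread
newstore_sync_transaction = true         ; do the rocksdb commit synchronously as well
newstore_sync_submit_transaction = true  ; bypass the kv_queue batching path
newstore_sync_wal_apply = true           ; apply the WAL in the commit completion thread
newstore_overlay_max = 0                 ; keep object data out of db/ where possible
```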
> >> sage
> >>
> >
> > --
> > ________________________________________________________
> >
> > Kind regards,
> >
> > David CASIER
> > DCConsulting SARL
> >
> > 4 Trait d'Union
> > 77127 LIEUSAINT
> >
> > Direct line: 01 75 98 53 85
> > Email: [email protected]
> > ________________________________________________________

> > Begin forwarded message:
> >
> > From: David Casier <[email protected]>
> > Date: 2 November 2015 20:02:37 UTC+1
> > To: "Vish (Vishwanath) Maram-SSI" <[email protected]>
> > Cc: benoit LORIOT <[email protected]>, Sébastien VALSEMEY
> > <[email protected]>
> > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> >
> > Hi Vish,
> > In FileStore, data and metadata are stored in files, with FS xattrs and omap.
> > NewStore works with RocksDB.
> > There are a lot of configuration options in RocksDB, but not all of them
> > are implemented.
> >
> > The best way, for me, is not to use the logs, with a cache that is safe
> > against power loss (for example an 845DC SSD).
> > I don't think that is necessary given good metadata optimisation.
> >
> > The problem with RocksDB is that it is not possible to control the I/O
> > block size.
> >
> > We will resume work on NewStore soon.
> >
> > On 10/29/2015 05:30 PM, Vish (Vishwanath) Maram-SSI wrote:
> >> Thanks David for the reply.
> >>
> >> Yes, we just wanted to know how different it is from FileStore, and how
> >> we can contribute. My motive is to first understand the design of
> >> Newstore and find its performance loopholes so that we can try looking
> >> into them.
> >>
> >> It would be helpful if you could share your ideas on using Newstore and
> >> its configuration, and what contributions you are planning, to help us
> >> understand and see if we can work together.
> >> Thanks,
> >> -Vish
> >>
> >> From: David Casier [mailto:[email protected]]
> >> Sent: Thursday, October 29, 2015 4:41 AM
> >> To: Vish (Vishwanath) Maram-SSI
> >> Cc: benoit LORIOT; Sébastien VALSEMEY
> >> Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> >>
> >> Hi Vish,
> >> It's OK.
> >>
> >> We have tested a lot of different configurations with newstore.
> >>
> >> What is your goal with it?
> >>
> >> On 10/28/2015 11:02 PM, Vish (Vishwanath) Maram-SSI wrote:
> >> Hi David,
> >>
> >> Sorry for sending you the mail directly.
> >>
> >> This is Vishwanath Maram from Samsung; I have started to play around
> >> with Newstore and am observing some issues when running fio.
> >>
> >> Can you please share the Ceph configuration file you used to run the
> >> IOs with fio?
> >>
> >> Thanks,
> >> -Vish
> >>
> >> [...]
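The fio setup described in this thread (direct IO, random 256k-block writes against the newstore-backed OSD) corresponds to a job file along these lines; the target filename, size, and runtime are assumptions, not values given in the thread:

```ini
[global]
ioengine=libaio
direct=1        ; direct IO, as used in the tests above
bs=256k         ; random 256k blocks
rw=randwrite
time_based=1
runtime=60

[newstore-bench]
; hypothetical target; point this at the device or image under test
filename=/dev/sdX
size=10G
```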
> >> sage

> > Begin forwarded message:
> >
> > From: Sage Weil <[email protected]>
> > Date: 12 October 2015 21:33:52 UTC+2
> > To: David Casier <[email protected]>
> > Cc: Ceph Development <[email protected]>, Sébastien VALSEMEY
> > <[email protected]>, [email protected], Denis Saget
> > <[email protected]>, "luc.petetin" <[email protected]>
> > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> >
> > Hi David-
> >
> > On Mon, 12 Oct 2015, David Casier wrote:
> >> Ok,
> >> Great.
> >>
> >> With these settings:
> >> //
> >> newstore_max_dir_size = 4096
> >> newstore_sync_io = true
> >> newstore_sync_transaction = true
> >> newstore_sync_submit_transaction = true
> >
> > Is this a hard disk?  Those settings probably don't make sense, since it
> > does every IO synchronously, blocking the submitting IO path...
> >
> >> newstore_sync_wal_apply = true
> >> newstore_overlay_max = 0
> >> //
> >>
> >> and direct IO in the benchmark tool (fio),
> >>
> >> I see that the HDD is 100% busy and that there is no transfer from /db
> >> to /fragments after stopping the benchmark: great!
> >>
> >> But when I launch a bench with random 256k blocks, I see random blocks
> >> between 32k and 256k on the HDD. Any idea?
> >
> > Random IOs have to be write-ahead logged in rocksdb, which has its own IO
> > pattern.  Since you made everything sync above, I think it'll depend on
> > how many osd threads get batched together at a time.. maybe.  Those
> > settings aren't something I've really tested, and probably only make
> > sense with very fast NVMe devices.
> >
> >> Throughput to the HDD is about 8 MB/s when it could be higher with
> >> larger blocks (~30 MB/s).
> >> And 70 MB/s without fsync (hard drive cache disabled).
> >>
> >> Other questions:
> >> newstore_sync_io -> true = fsync immediately, false = fsync later (thread
> >> fsync_wq)?
> >
> > yes
> >
> >> newstore_sync_transaction -> true = sync in DB?
> >
> > synchronously do the rocksdb commit too
> >
> >> newstore_sync_submit_transaction -> if false then kv_queue (only if
> >> newstore_sync_transaction=false)?
> >
> > yeah.. there is an annoying rocksdb behavior that makes an async
> > transaction submit block if a sync one is in progress, so this queues them
> > up and explicitly batches them.
> >
> >> newstore_sync_wal_apply = true -> if false then WAL later (thread wal_wq)?
> >
> > the txn commit completion threads can do the wal work synchronously.. this
> > is only a good idea if it's doing aio (which it generally is).
> >
> >> Is that right?
> >>
> >> Is there a way to use a battery-backed cache (sync the DB but not the data)?
> >
> > ?
> > s
> >
> >> Thanks for everything!
> >>
> >> [...]
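Sage's description of the kv_queue path (async transaction submits are queued and then explicitly batched, so they never block behind an in-progress synchronous commit) can be sketched as follows. This is an illustrative Python model, not NewStore's actual code; `KVBatcher` and `commit_fn` are invented names.

```python
import threading
from queue import Queue

class KVBatcher:
    """Illustrative model of a kv_queue: submitters enqueue transactions
    and return immediately; a single kv thread drains the queue and
    commits whatever has piled up as one batch."""

    def __init__(self, commit_fn):
        self.commit_fn = commit_fn   # stands in for a rocksdb write-batch commit
        self.kv_queue = Queue()
        self.committed = []
        threading.Thread(target=self._kv_sync_thread, daemon=True).start()

    def submit_transaction(self, txn):
        # Never commit inline: an inline submit could block behind a
        # synchronous commit already in progress, so just enqueue.
        self.kv_queue.put(txn)

    def _kv_sync_thread(self):
        while True:
            batch = [self.kv_queue.get()]      # wait for at least one txn
            while not self.kv_queue.empty():   # then drain the backlog
                batch.append(self.kv_queue.get())
            self.commit_fn(batch)              # one synchronous commit per batch
            self.committed.extend(batch)
```

The batching means many small submits can share one commit, the same effect Sage describes for combining small IOs into a single rocksdb journal/wal write.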
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to [email protected]
> >> More majordomo info at http://vger.kernel.org/majordomo-info.html

> > Begin forwarded message:
> >
> > From: David Casier <[email protected]>
> > Date: 14 October 2015 22:03:38 UTC+2
> > To: Sébastien VALSEMEY <[email protected]>, [email protected]
> > Cc: Denis Saget <[email protected]>, "luc.petetin" <[email protected]>
> > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> >
> > Good evening gentlemen,
> > I have just been through my first real Ceph fire.
> > Loic Dachary backed me up well on it.
> >
> > I can tell you one thing: however well you think you know the product,
> > it is during an incident that you realize how many factors you need to
> > know by heart.
> > So, no panic; I really take this evening's experience as a success and
> > as an excellent boost.
> >
> > What happened:
> > - LI played a little too much with the crushmap (I will go into the
> >   finer technical points another day)
> > - The OSDs were updated and restarted
> > - The OSDs no longer knew where the data was
> > - The crushmap was rebuilt by hand, and away it went.
> > Nothing too serious in itself, and a big plus (++++) for our image with
> > LI (I would have lost another hour or two without Loic, who stayed
> > focused).
> >
> > Conclusion:
> > We are going to work together on stress tests, a bit like RedHat
> > validations: one platform, I break it, you repair it.
> > You will have as much time as you need to find the problem (I have
> > sometimes spent several days on some of these).
> >
> > Objectives:
> > - Master a checklist of things to verify
> > - Replay it every week if there are many mistakes
> > - Every month if there are a few mistakes
> > - Every 3 months once it is well mastered
> > - ...
> >
> > We need to be at the top of our game, with certain things becoming
> > reflexes (checking the crushmap, knowing how to find the data without
> > the processes, ...).
> > Above all, the customer must be reassured in case of an incident (or not).
> >
> > And honestly, Ceph is really fascinating!
> >
> > On 10/12/2015 09:33 PM, Sage Weil wrote:
> >> [...]
> >
> > --
> > ________________________________________________________
> >
> > Kind regards,
> >
> > David CASIER
> > DCConsulting SARL
> >
> > 4 Trait d'Union
> > 77127 LIEUSAINT
> >
> > Direct line: 01 75 98 53 85
> > Email: [email protected]
> > ________________________________________________________
