On Tue, 24 Nov 2015, Sébastien VALSEMEY wrote:
> Hello Vish,
>
> Please accept my apologies for the delay in my reply.
> Following the conversation you had with my colleague David, here are
> some more details about our work:
>
> We are working on FileStore / NewStore optimizations, studying how we
> could free ourselves from using the journal.
>
> Working with SSDs is very important, but it is also mandatory to
> combine them with regular spinning disks. This is why we are
> combining metadata storage on flash with data storage on disk.
This is pretty common, and something we will support natively with newstore.

> Our main goal is to have control over performance, which is quite
> difficult with NewStore and needs fundamental hacks with FileStore.

Can you clarify what you mean by "quite difficult with NewStore"?  FWIW,
the latest bleeding edge code is currently at github.com/liewegas/wip-bluestore.

sage

> Is Samsung working on ARM boards with embedded flash and a SATA port, in
> order to allow us to work on a hybrid approach? What is your line of
> work with Ceph?
>
> How can we work together?
>
> Regards,
> Sébastien
>
> > Begin forwarded message:
> >
> > From: David Casier <[email protected]>
> > Date: 12 October 2015 20:52:26 UTC+2
> > To: Sage Weil <[email protected]>, Ceph Development
> > <[email protected]>
> > Cc: Sébastien VALSEMEY <[email protected]>,
> > [email protected], Denis Saget <[email protected]>, "luc.petetin"
> > <[email protected]>
> > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> >
> > Ok,
> > Great.
> >
> > With these settings:
> > //
> > newstore_max_dir_size = 4096
> > newstore_sync_io = true
> > newstore_sync_transaction = true
> > newstore_sync_submit_transaction = true
> > newstore_sync_wal_apply = true
> > newstore_overlay_max = 0
> > //
> >
> > and direct IO in the benchmark tool (fio),
> >
> > I see that the HDD is 100% busy and that there is no transfer from /db
> > to /fragments after stopping the benchmark: great!
> >
> > But when I launch a bench with random 256k blocks, I see random blocks
> > between 32k and 256k on the HDD. Any idea?
> >
> > Throughput to the HDD is about 8 MB/s when it could be higher with
> > larger blocks (~30 MB/s).
> > And 70 MB/s without fsync (hard drive cache disabled).
> >
> > Other questions:
> > newstore_sync_io -> true = fsync immediately, false = fsync later (thread
> > fsync_wq)?
> > newstore_sync_transaction -> true = sync in DB?
> > newstore_sync_submit_transaction -> if false then kv_queue (only if
> > newstore_sync_transaction=false)?
> > newstore_sync_wal_apply = true -> if false then WAL later (thread wal_wq)?
> >
> > Is that right?
> >
> > Is there a way to use a battery-backed cache (sync the DB but not the data)?
> >
> > Thanks for everything!
> >
> > On 10/12/2015 03:01 PM, Sage Weil wrote:
> >> On Mon, 12 Oct 2015, David Casier wrote:
> >>> Hello everybody,
> >>> Is a fragment stored in rocksdb before being written to "/fragments"?
> >>> I separated "/db" and "/fragments", but during the bench everything is
> >>> written to "/db".
> >>> I changed the "newstore_sync_*" options without success.
> >>>
> >>> Is there any way to write all metadata to "/db" and all data to
> >>> "/fragments"?
> >>
> >> You can set newstore_overlay_max = 0 to avoid most data landing in db/.
> >> But if you are overwriting an existing object, doing write-ahead logging
> >> is usually unavoidable, because we need to make the update atomic (and the
> >> underlying POSIX fs doesn't provide that). The wip-newstore-frags branch
> >> mitigates this somewhat for larger writes by limiting fragment size, but
> >> for small IOs this is pretty much always going to be the case. For small
> >> IOs, though, putting things in db/ is generally better, since we can
> >> combine many small IOs into a single (rocksdb) journal/wal write, and
> >> often leave them there (via the 'overlay' behavior).
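For orientation, the options David quotes would live in the [osd] section of ceph.conf. The sketch below restates his values, annotated with the semantics discussed in this thread; it is a reading aid, not a recommended configuration (Sage notes these settings make little sense on a hard disk), and the db/ vs fragments/ placement itself is arranged outside ceph.conf, e.g. by mounting flash under the OSD's db/ directory.

```ini
[osd]
; values from David's test, annotated per the discussion in this thread
newstore_max_dir_size = 4096
newstore_sync_io = true                  ; fsync immediately instead of via the fsync_wq thread
newstore_sync_transaction = true         ; do the rocksdb commit synchronously as well
newstore_sync_submit_transaction = true  ; bypass the kv_queue batching path
newstore_sync_wal_apply = true           ; apply the WAL in the commit completion thread
newstore_overlay_max = 0                 ; keep object data out of db/ where possible
```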
> >> sage
> >>
> >
> > --
> > ________________________________________________________
> >
> > Kind regards,
> >
> > David CASIER
> > DCConsulting SARL
> >
> > 4 Trait d'Union
> > 77127 LIEUSAINT
> >
> > Direct line: 01 75 98 53 85
> > Email: [email protected]
> > ________________________________________________________

> > Begin forwarded message:
> >
> > From: David Casier <[email protected]>
> > Date: 2 November 2015 20:02:37 UTC+1
> > To: "Vish (Vishwanath) Maram-SSI" <[email protected]>
> > Cc: benoit LORIOT <[email protected]>, Sébastien VALSEMEY
> > <[email protected]>
> > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> >
> > Hi Vish,
> > In FileStore, data and metadata are stored in files, with FS xattrs and omap.
> > NewStore works with RocksDB.
> > There are a lot of configuration options in RocksDB, but not all of them
> > are implemented.
> >
> > The best way, for me, is not to use the logs, with a cache that is safe
> > against power loss (for example an 845DC SSD).
> > I don't think that is necessary given good metadata optimisation.
> >
> > The problem with RocksDB is that it is not possible to control the I/O
> > block size.
> >
> > We will resume work on NewStore soon.
> >
> > On 10/29/2015 05:30 PM, Vish (Vishwanath) Maram-SSI wrote:
> >> Thanks David for the reply.
> >>
> >> Yes, we just wanted to know how different it is from FileStore, and how
> >> we can contribute. My motive is to first understand the design of
> >> Newstore and find its performance loopholes so that we can try looking
> >> into them.
> >>
> >> It would be helpful if you could share your ideas on using Newstore and
> >> its configuration, and what contributions you are planning, to help us
> >> understand and see if we can work together.
> >> Thanks,
> >> -Vish
> >>
> >> From: David Casier [mailto:[email protected]]
> >> Sent: Thursday, October 29, 2015 4:41 AM
> >> To: Vish (Vishwanath) Maram-SSI
> >> Cc: benoit LORIOT; Sébastien VALSEMEY
> >> Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> >>
> >> Hi Vish,
> >> It's OK.
> >>
> >> We have tested a lot of different configurations with newstore.
> >>
> >> What is your goal with it?
> >>
> >> On 10/28/2015 11:02 PM, Vish (Vishwanath) Maram-SSI wrote:
> >> Hi David,
> >>
> >> Sorry for sending you the mail directly.
> >>
> >> This is Vishwanath Maram from Samsung; I have started to play around
> >> with Newstore and am observing some issues when running fio.
> >>
> >> Can you please share the Ceph configuration file you used to run the
> >> IOs with fio?
> >>
> >> Thanks,
> >> -Vish
> >>
> >> [...]
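The fio setup described in this thread (direct IO, random 256k-block writes against the newstore-backed OSD) corresponds to a job file along these lines; the target filename, size, and runtime are assumptions, not values given in the thread:

```ini
[global]
ioengine=libaio
direct=1        ; direct IO, as used in the tests above
bs=256k         ; random 256k blocks
rw=randwrite
time_based=1
runtime=60

[newstore-bench]
; hypothetical target; point this at the device or image under test
filename=/dev/sdX
size=10G
```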
> >> sage

> > Begin forwarded message:
> >
> > From: Sage Weil <[email protected]>
> > Date: 12 October 2015 21:33:52 UTC+2
> > To: David Casier <[email protected]>
> > Cc: Ceph Development <[email protected]>, Sébastien VALSEMEY
> > <[email protected]>, [email protected], Denis Saget
> > <[email protected]>, "luc.petetin" <[email protected]>
> > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> >
> > Hi David-
> >
> > On Mon, 12 Oct 2015, David Casier wrote:
> >> Ok,
> >> Great.
> >>
> >> With these settings:
> >> //
> >> newstore_max_dir_size = 4096
> >> newstore_sync_io = true
> >> newstore_sync_transaction = true
> >> newstore_sync_submit_transaction = true
> >
> > Is this a hard disk?  Those settings probably don't make sense, since it
> > does every IO synchronously, blocking the submitting IO path...
> >
> >> newstore_sync_wal_apply = true
> >> newstore_overlay_max = 0
> >> //
> >>
> >> and direct IO in the benchmark tool (fio),
> >>
> >> I see that the HDD is 100% busy and that there is no transfer from /db
> >> to /fragments after stopping the benchmark: great!
> >>
> >> But when I launch a bench with random 256k blocks, I see random blocks
> >> between 32k and 256k on the HDD. Any idea?
> >
> > Random IOs have to be write-ahead logged in rocksdb, which has its own IO
> > pattern.  Since you made everything sync above, I think it'll depend on
> > how many osd threads get batched together at a time.. maybe.  Those
> > settings aren't something I've really tested, and probably only make
> > sense with very fast NVMe devices.
> >
> >> Throughput to the HDD is about 8 MB/s when it could be higher with
> >> larger blocks (~30 MB/s).
> >> And 70 MB/s without fsync (hard drive cache disabled).
> >>
> >> Other questions:
> >> newstore_sync_io -> true = fsync immediately, false = fsync later (thread
> >> fsync_wq)?
> >
> > yes
> >
> >> newstore_sync_transaction -> true = sync in DB?
> >
> > synchronously do the rocksdb commit too
> >
> >> newstore_sync_submit_transaction -> if false then kv_queue (only if
> >> newstore_sync_transaction=false)?
> >
> > yeah.. there is an annoying rocksdb behavior that makes an async
> > transaction submit block if a sync one is in progress, so this queues them
> > up and explicitly batches them.
> >
> >> newstore_sync_wal_apply = true -> if false then WAL later (thread wal_wq)?
> >
> > the txn commit completion threads can do the wal work synchronously.. this
> > is only a good idea if it's doing aio (which it generally is).
> >
> >> Is that right?
> >>
> >> Is there a way to use a battery-backed cache (sync the DB but not the data)?
> >
> > ?
> > s
> >
> >> Thanks for everything!
> >>
> >> [...]
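Sage's description of the kv_queue path (async transaction submits are queued and then explicitly batched, so they never block behind an in-progress synchronous commit) can be sketched as follows. This is an illustrative Python model, not NewStore's actual code; `KVBatcher` and `commit_fn` are invented names.

```python
import threading
from queue import Queue

class KVBatcher:
    """Illustrative model of a kv_queue: submitters enqueue transactions
    and return immediately; a single kv thread drains the queue and
    commits whatever has piled up as one batch."""

    def __init__(self, commit_fn):
        self.commit_fn = commit_fn   # stands in for a rocksdb write-batch commit
        self.kv_queue = Queue()
        self.committed = []
        threading.Thread(target=self._kv_sync_thread, daemon=True).start()

    def submit_transaction(self, txn):
        # Never commit inline: an inline submit could block behind a
        # synchronous commit already in progress, so just enqueue.
        self.kv_queue.put(txn)

    def _kv_sync_thread(self):
        while True:
            batch = [self.kv_queue.get()]      # wait for at least one txn
            while not self.kv_queue.empty():   # then drain the backlog
                batch.append(self.kv_queue.get())
            self.commit_fn(batch)              # one synchronous commit per batch
            self.committed.extend(batch)
```

The batching means many small submits can share one commit, the same effect Sage describes for combining small IOs into a single rocksdb journal/wal write.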
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to [email protected]
> >> More majordomo info at http://vger.kernel.org/majordomo-info.html

> > Begin forwarded message:
> >
> > From: David Casier <[email protected]>
> > Date: 14 October 2015 22:03:38 UTC+2
> > To: Sébastien VALSEMEY <[email protected]>, [email protected]
> > Cc: Denis Saget <[email protected]>, "luc.petetin" <[email protected]>
> > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> >
> > Good evening gentlemen,
> > I have just been through my first real Ceph fire.
> > Loic Dachary backed me up well on it.
> >
> > I can tell you one thing: however well you think you know the product,
> > it is during an incident that you realize how many factors you need to
> > know by heart.
> > So, no panic; I really take this evening's experience as a success and
> > as an excellent boost.
> >
> > What happened:
> > - LI played a little too much with the crushmap (I will go into the
> >   finer technical points another day)
> > - The OSDs were updated and restarted
> > - The OSDs no longer knew where the data was
> > - The crushmap was rebuilt by hand, and away it went.
> > Nothing too serious in itself, and a big plus (++++) for our image with
> > LI (I would have lost another hour or two without Loic, who stayed
> > focused).
> >
> > Conclusion:
> > We are going to work together on stress tests, a bit like RedHat
> > validations: one platform, I break it, you repair it.
> > You will have as much time as you need to find the problem (I have
> > sometimes spent several days on some of these).
> >
> > Objectives:
> > - Master a checklist of things to verify
> > - Replay it every week if there are many mistakes
> > - Every month if there are a few mistakes
> > - Every 3 months once it is well mastered
> > - ...
> >
> > We need to be at the top of our game, with certain things becoming
> > reflexes (checking the crushmap, knowing how to find the data without
> > the processes, ...).
> > Above all, the customer must be reassured in case of an incident (or not).
> >
> > And honestly, Ceph is really fascinating!
> >
> > On 10/12/2015 09:33 PM, Sage Weil wrote:
> >> [...]
> >
> > --
> > ________________________________________________________
> >
> > Kind regards,
> >
> > David CASIER
> > DCConsulting SARL
> >
> > 4 Trait d'Union
> > 77127 LIEUSAINT
> >
> > Direct line: 01 75 98 53 85
> > Email: [email protected]
> > ________________________________________________________
