Ehsan Akhgari <mailto:ehsan.akhg...@gmail.com>
Monday, April 29, 2013 22:33
On 2013-04-29 1:51 PM, Taras Glek wrote:
* How to robustly write/update small datasets?

#3 above is it for small datasets. The correct way to do this is to
write blobs of JSON to disk. End of discussion.

For an API that is meant to be used by add-on authors, I'm afraid the situation is not as simple as this. For example, for a "simple" key/value store intended for small datasets, one cannot enforce the implicit requirements of this solution (for example, that the data fits in a single block on disk) at the API boundary without creating a crappy API which would "fail" some of the time if the value to be written violates those assumptions. In practice it's not easy for the consumer of the API to guarantee the size of the data written to disk if the data is coming from the user, the network, etc.
I'm not saying that the JSON has to fit in a single filesystem block. I'm saying that if it's a few blocks, it's still more efficient to rewrite the data every time.

Prefs are an example of an overused key/value store whose data is usually well under the threshold of a few blocks.

I think if you look at the kinds of data extensions store, it's small enough, especially when compressed.

Writes of data <= ~64K should just be implemented as atomic whole-file
read/write operations. Those are almost always single blocks on disk.

Writing a whole file at once eliminates the risk of data corruption.
Incremental updates are what makes SQLite do the WAL/fsync/etc dance
that causes much of the slowness.

Is that true even if the file is written to more than one physical block on the disk, across all of the filesystems that Firefox can run on?
yes.

As you can see from the above examples, manual IO is not scary.

Only if you trust the consumer of the API to know the trade-offs of what they're doing. That is not the right assumption for a generic key/value store API.
We can add a warning to the API when it crosses some magical boundary. I think small datasets are the most common, so we should focus on that use case.
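For instance, a write helper along these lines could flag oversized payloads without rejecting them (a rough sketch only; the 64 KiB threshold, the helper name and the error reporting are placeholders, not an actual proposal):

// Sketch: warn, but still write, when a payload crosses the "few blocks" boundary.
Components.utils.import("resource://gre/modules/osfile.jsm");

const WARN_THRESHOLD = 64 * 1024; // placeholder boundary

function writeSmallJSON(path, value) {
  let bytes = new TextEncoder().encode(JSON.stringify(value));
  if (bytes.byteLength > WARN_THRESHOLD) {
    Components.utils.reportError("Payload for " + path + " is " + bytes.byteLength +
                                 " bytes; consider compression or another storage strategy.");
  }
  return OS.File.writeAtomic(path, bytes, {tmpPath: path + ".tmp"});
}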

* What about fsync-less writes?
Many log-type, performance-sensitive data-storage operations are OK with
lossy appends. By lossy I mean "data will be lost if there is a power
outage within a few seconds/minutes of the write"; consistency is still
important. For this one should create a directory and write out log
entries as checksummed individual files... but one should really use
compression (and get checksums for free).
https://bugzilla.mozilla.org/show_bug.cgi?id=846410 is about
facilitating such an API.

Use-cases here: telemetry saved-sessions, FHR session-statistics.

This is an interesting use case indeed, but I don't think that it falls under the umbrella of the API being discussed here.
I'm still not sure what the API needs under discussion are. Hopefully we'll narrow down the scope of this in the meeting today.

* What about large datasets?
These should be decided on a case-by-case basis. Universal solutions
will always perform poorly in some dimension.

* What about indexeddb?
IDB is overkill for simple storage needs. It is a restrictive wrapper
over an SQLite schema. Perhaps some large dataset (e.g. an address book) is
a good fit for it. IDB supports filehandles to do raw IO, but that still
requires SQLite to bootstrap, doesn't support compression, etc.
IDB also makes sense as a transitional API for the web due to the need to
move away from DOM Local Storage...

Indexed DB is not a wrapper around SQLite. The fact that our current implementation uses SQLite is an implementation detail which might change. (And it's not true on the web across different browser engines.)

I'm sure that if somebody can provide testcases on bad IndexedDB performance scenarios we can work on fixing them, and that would benefit the web, and Firefox OS as well.
I like solutions that are well-suited to the problem being solved. IndexedDB is not a natural fit, and making it fit sounds like more work than doing a natural fs-based solution.
* Why isn't there a convenience API for all of the above recommendations?
Because speculatively landing APIs that anticipate future consumers is
risky and results in over-engineering and unpleasant surprises... So give us
use cases and we (i.e. Yoric) will make them efficient.

The use case being discussed here is a simple key/value data store, hopefully with asynchronous operations, and safety guarantees against data loss. I do not see the current discussion as speculative at all.
It's speculative until we define concrete consumers of such an API. gps' original email said 'maybe' a key/value store is the way to go. I'm making a case that something lower-level is simpler, and one can layer a key/value store on top.



Taras
Taras Glek <mailto:tg...@mozilla.com>
Monday, April 29, 2013 10:51
So there is no general 'good for performance' way of doing IO.

However, I think most people who need this need to write small bits of data, and there is a good way to do that.



* How to robustly write/update small datasets?

#3 above is it for small datasets. The correct way to do this is to write blobs of JSON to disk. End of discussion.

Writes of data <= ~64K should just be implemented as atomic whole-file read/write operations. Those are almost always single blocks on disk.

Writing a whole file at once eliminates the risk of data corruption. Incremental updates are what makes SQLite do the WAL/fsync/etc dance that causes much of the slowness.

We invested a year's worth of engineering effort into a pure-JS IO library to facilitate efficient application-level IO. See the OS.File docs, e.g. https://developer.mozilla.org/en-US/docs/JavaScript_OS.File/OS.File_for_the_main_thread
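As a rough illustration (the file name and payload below are made up; the calls follow the docs linked above), atomically replacing a small JSON blob and reading it back looks roughly like this with OS.File on the main thread:

// Sketch: atomically rewrite a whole small JSON file, then read it back.
Components.utils.import("resource://gre/modules/osfile.jsm");

let path = OS.Path.join(OS.Constants.Path.profileDir, "mydata.json"); // hypothetical file
let data = {lastSync: Date.now(), items: ["a", "b"]};                 // hypothetical payload

// Serialize and write via a temporary file that is renamed over the target,
// so a crash never leaves a half-written file behind.
let bytes = new TextEncoder().encode(JSON.stringify(data));
OS.File.writeAtomic(path, bytes, {tmpPath: path + ".tmp"})
  .then(function () {
    // Read the whole file at once, then decode and parse.
    return OS.File.read(path);
  })
  .then(function (array) {
    let obj = JSON.parse(new TextDecoder().decode(array));
    // use obj...
  });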

As you can see from the above examples, manual IO is not scary.

If one is into convenience APIs, one can create arbitrary JSON-storage abstractions in ~10 lines of code.
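Something along these lines, say (a sketch only; the JSONStore name is made up, and error handling and write scheduling are omitted):

// Sketch: a tiny JSON-backed key/value store layered on OS.File.
Components.utils.import("resource://gre/modules/osfile.jsm");

function JSONStore(path) {
  this.path = path;
  this.data = {};
}
JSONStore.prototype = {
  load: function () {
    return OS.File.read(this.path).then(function (bytes) {
      this.data = JSON.parse(new TextDecoder().decode(bytes));
    }.bind(this));
  },
  save: function () {
    let bytes = new TextEncoder().encode(JSON.stringify(this.data));
    return OS.File.writeAtomic(this.path, bytes, {tmpPath: this.path + ".tmp"});
  },
  get: function (key) { return this.data[key]; },
  set: function (key, value) { this.data[key] = value; return this.save(); }
};

Note that every set() rewrites the whole file, which is exactly the "rewrite the data every time" behaviour argued for above.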

* What about writes > 64K?
Compression gives you a 5-10x reduction of JSON. https://bugzilla.mozilla.org/show_bug.cgi?id=846410
Compression also means that your read throughput is up to 5x better too.
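Assuming a compression option on writeAtomic/read of the kind bug 846410 is after (the "lz4" option name here is an assumption, not a landed API), the sketch above barely changes:

// Sketch: same whole-file write, but compressed on disk.
// The compression option is assumed per bug 846410, not current API.
Components.utils.import("resource://gre/modules/osfile.jsm");

let path = OS.Path.join(OS.Constants.Path.profileDir, "bigdata.json"); // hypothetical file
let bytes = new TextEncoder().encode(JSON.stringify(someLargeObject)); // someLargeObject: your >64K payload

OS.File.writeAtomic(path, bytes, {tmpPath: path + ".tmp", compression: "lz4"})
  .then(function () {
    return OS.File.read(path, {compression: "lz4"}); // decompressed on read
  })
  .then(function (decompressed) {
    let obj = JSON.parse(new TextDecoder().decode(decompressed));
    // use obj...
  });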


* What about fsync-less writes?
Many log-type, performance-sensitive data-storage operations are OK with lossy appends. By lossy I mean "data will be lost if there is a power outage within a few seconds/minutes of the write"; consistency is still important. For this one should create a directory and write out log entries as checksummed individual files... but one should really use compression (and get checksums for free). https://bugzilla.mozilla.org/show_bug.cgi?id=846410 is about facilitating such an API.

Use-cases here: telemetry saved-sessions, FHR session-statistics.
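A rough sketch of the log-entry-per-file idea (the directory and function names are made up; the checksum/compression part is what bug 846410 would provide, so it's only noted in a comment here):

// Sketch: append-only log where each entry is its own small file, no fsync.
Components.utils.import("resource://gre/modules/osfile.jsm");

let logDir = OS.Path.join(OS.Constants.Path.profileDir, "saved-sessions"); // hypothetical dir

function appendLogEntry(entry) {
  // One file per entry, named by timestamp so entries sort chronologically.
  let name = Date.now() + ".json";
  let bytes = new TextEncoder().encode(JSON.stringify(entry));
  return OS.File.makeDir(logDir, {ignoreExisting: true}).then(function () {
    // writeAtomic without {flush: true} does not fsync: a power outage can lose
    // the most recent entries, but rename-into-place keeps each surviving file
    // internally consistent. Compression (bug 846410) would add a checksum for free.
    return OS.File.writeAtomic(OS.Path.join(logDir, name), bytes,
                               {tmpPath: OS.Path.join(logDir, name + ".tmp")});
  });
}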

* What about large datasets?
These should be decided on a case-by-case basis. Universal solutions will always perform poorly in some dimension.

* What about indexeddb?
IDB is overkill for simple storage needs. It is a restrictive wrapper over an SQLite schema. Perhaps some large dataset (e.g. an address book) is a good fit for it. IDB supports filehandles to do raw IO, but that still requires SQLite to bootstrap, doesn't support compression, etc. IDB also makes sense as a transitional API for the web due to the need to move away from DOM Local Storage...

* Why isn't there a convenience API for all of the above recommendations?
Because speculatively landing APIs that anticipate future consumers is risky and results in over-engineering and unpleasant surprises... So give us use cases and we (i.e. Yoric) will make them efficient.

Taras
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform