On Tue, Oct 6, 2009 at 5:10 PM, Scott Hess <[email protected]> wrote:
> Our use of exclusive locking and page-cache preloading may open us up
> more to this kind of shenanigans.  Basically SQLite will trust those
> pages which we faulted into memory days ago.  We could mitigate
> against that somewhat, but this problem reaches into areas we cannot
> materially impact, such as filesystem caches.  And don't even begin to
> imagine that there are not similar issues with commodity disk drives
> and controllers.
>
> That said, I don't think this is an incremental addition of any kind.
> As I've pointed out before, there are things in the woods which
> corrupt databases.  We could MAYBE reduce occurrences to a suitable
> minimum using check-summing or something of the sort, but in the end
> we still have to detect corruption and decide what course to take from
> there.
>

I do think these are two separate problems. Personally, I don't care as
much if my history or any other database is corrupted and I have to start
from scratch. But random crashes that I can't isolate are something else.

> -scott
>
>
> On Tue, Oct 6, 2009 at 4:59 PM, John Abd-El-Malek <[email protected]>
> wrote:
> >
> >
> > On Tue, Oct 6, 2009 at 4:30 PM, Carlos Pizano <[email protected]> wrote:
> >>
> >> On Tue, Oct 6, 2009 at 4:14 PM, John Abd-El-Malek <[email protected]>
> >> wrote:
> >> > I'm not sure how Carlos is doing it. Will we know if something is
> >> > corrupt just on load/save?
> >>
> >> Many sqlite calls can return SQLITE_CORRUPT, for example a query or
> >> an insert. We just check for error codes 1 to 26, with 5 or 6 of
> >> them being serious errors such as SQLITE_CORRUPT.
> >>
> >> I am sure that random bit flips in memory and on disk are the cause
> >> of some crashes; this is probably the 'limit' factor of how low the
> >> crash rate of a perfect program deployed on millions of computers
> >> can go.
> >
> > The point I was trying to make is that the 'limit' factor, as you put
> > it, is proportional to memory usage.
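[Editor's note: a minimal sketch of the kind of check Carlos describes, expressed with Python's sqlite3 module for illustration. Chromium's actual check is against the C API's numeric result codes; the `is_corrupt` helper and the use of `PRAGMA integrity_check` here are my own illustration, not Chromium code. SQLite reports corruption as result code SQLITE_CORRUPT (11), which the Python binding surfaces as `sqlite3.DatabaseError`.]

```python
import sqlite3

def is_corrupt(db_path):
    """Return True if the file at db_path looks like a corrupt database.

    Runs PRAGMA integrity_check; a healthy database answers "ok".
    SQLITE_CORRUPT / SQLITE_NOTADB surface as sqlite3.DatabaseError.
    """
    try:
        con = sqlite3.connect(db_path)
        row = con.execute("PRAGMA integrity_check").fetchone()
        con.close()
        return row[0] != "ok"
    except sqlite3.DatabaseError:
        return True
```

In practice a steady-state check like this only tells you the on-disk image is bad; as John notes below, it says nothing about when the flip happened or what else it touched in the meantime.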
> > Given our large memory consumption in the browser process, the
> > numbers from the paper imply dozens of corruptions just in sqlite
> > memory per user. Even if only a small fraction of these are harmful,
> > spread over millions of users that's a lot of corruption.
> >
> >> But I am unsure how to calculate whether, for example, a random bit
> >> flip in the backing stores, which add up to at least 10M on most
> >> machines, does not hurt, or one in the middle of a cache entry, or
> >> in the data part of some structure.
> >>
> >> > I imagine there's no way we can know when corruption happens in
> >> > steady state, and the next query leads to some other browser
> >> > memory (or another database) getting corrupted?
> >> >
> >> > On Tue, Oct 6, 2009 at 3:58 PM, Huan Ren <[email protected]> wrote:
> >> >>
> >> >> It will be helpful to get our own measurement on database
> >> >> failures. Carlos just added something like that.
> >> >>
> >> >> Huan
> >> >>
> >> >> On Tue, Oct 6, 2009 at 3:49 PM, John Abd-El-Malek <[email protected]>
> >> >> wrote:
> >> >> > Saw this on slashdot:
> >> >> > http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
> >> >> > The conclusion is "an average of 25,000–75,000 FIT (failures in
> >> >> > time per billion hours of operation) per Mbit".
> >> >> > On my machine the browser process is usually > 100MB, so that
> >> >> > averages out to 176 to 493 errors per year, with those numbers
> >> >> > having big variance depending on the machine. Most users don't
> >> >> > have ECC, which means this will lead to corruption. Sqlite is a
> >> >> > heavy user of memory, so even if it's 1/4 of the 100MB, that
> >> >> > means we'll see an average of 40-120 errors naturally because
> >> >> > of faulty DIMMs.
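[Editor's note: a back-of-the-envelope check of John's arithmetic. This is a sketch under stated assumptions: decimal MB, 8 bits per byte, and an 8760-hour year. It lands in the same ballpark as, though not exactly on, the 176-493 figure in the email, which may have used slightly different unit conventions.]

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def dram_errors_per_year(fit_per_mbit, megabytes):
    """Expected DRAM errors per year.

    FIT = failures per billion device-hours, quoted per Mbit of DRAM
    in the sigmetrics09 paper.
    """
    mbits = megabytes * 8  # decimal MB -> Mbit
    return fit_per_mbit * mbits * HOURS_PER_YEAR / 1e9

low = dram_errors_per_year(25_000, 100)   # ~175 errors/year
high = dram_errors_per_year(75_000, 100)  # ~526 errors/year
print(round(low), round(high))
```

Scaling by the 1/4 share John attributes to sqlite gives roughly 44-131 errors per year, consistent with his 40-120 estimate.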
> >> >> > Given that sqlite corruption means (repeated) crashing of the
> >> >> > browser process, it seems this data heavily suggests we should
> >> >> > separate the sqlite code into a separate process. The IPC
> >> >> > overhead is negligible compared to disk access. My hunch is
> >> >> > that the complexity is also not that high, since the code that
> >> >> > deals with it is already asynchronous, because we don't use
> >> >> > sqlite on the UI/IO threads.
> >> >> > What do others think?
> >> >>
> >> >
> >>
> >

--
Chromium Developers mailing list: [email protected]
View archives, change email options, or unsubscribe:
http://groups.google.com/group/chromium-dev
