On Tue, Oct 6, 2009 at 5:10 PM, Scott Hess <[email protected]> wrote:
> Our use of exclusive locking and page-cache preloading may open us up
> more to this kind of shenanigans.  Basically SQLite will trust those
> pages which we faulted into memory days ago.  We could mitigate
> against that somewhat, but this problem reaches into areas we cannot
> materially impact, such as filesystem caches.  And don't even begin to
> imagine that there are not similar issues with commodity disk drives
> and controllers.
>
> That said, I don't think this is an incremental addition of any kind.
> As I've pointed out before, there are things in the woods which
> corrupt databases.  We could MAYBE reduce occurrences to a suitable
> minimum using check-summing or something of the sort, but in the end
> we still have to detect corruption and decide what course to take from
> there.
>

I do think these are two separate problems. Personally, I don't care as
much if my history or any other database is corrupted and I have to start
from scratch. But random crashes that I can't isolate are something else.

> -scott
>
>
> On Tue, Oct 6, 2009 at 4:59 PM, John Abd-El-Malek <[email protected]>
> wrote:
> >
> >
> > On Tue, Oct 6, 2009 at 4:30 PM, Carlos Pizano <[email protected]> wrote:
> >>
> >> On Tue, Oct 6, 2009 at 4:14 PM, John Abd-El-Malek <[email protected]>
> >> wrote:
> >> > I'm not sure how Carlos is doing it. Will we know if something is
> >> > corrupt just on load/save?
> >>
> >> Many sqlite calls can return SQLITE_CORRUPT, for example a query or
> >> an insert. We just check for error codes 1 to 26, with 5 or 6 of
> >> them being serious errors such as SQLITE_CORRUPT.
> >>
> >> I am sure that random bit flips in memory and on disk are the cause
> >> of some crashes; this is probably the 'limit' factor of how low the
> >> crash rate of a perfect program deployed on millions of computers
> >> can go.
> >
> > The point I was trying to make is that the 'limit' factor, as you put
> > it, is proportional to memory usage.
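[Editor's note: a minimal sketch of the kind of check Carlos describes, expressed with Python's sqlite3 module for illustration. Chromium's actual check is against the C API's numeric result codes; the `is_corrupt` helper and the use of `PRAGMA integrity_check` here are my own illustration, not Chromium code. SQLite reports corruption as result code SQLITE_CORRUPT (11), which the Python binding surfaces as `sqlite3.DatabaseError`.]

```python
import sqlite3

def is_corrupt(db_path):
    """Return True if the file at db_path looks like a corrupt database.

    Runs PRAGMA integrity_check; a healthy database answers "ok".
    SQLITE_CORRUPT / SQLITE_NOTADB surface as sqlite3.DatabaseError.
    """
    try:
        con = sqlite3.connect(db_path)
        row = con.execute("PRAGMA integrity_check").fetchone()
        con.close()
        return row[0] != "ok"
    except sqlite3.DatabaseError:
        return True
```

In practice a steady-state check like this only tells you the on-disk image is bad; as John notes below, it says nothing about when the flip happened or what else it touched in the meantime.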
> > Given our large memory consumption in the browser process, the
> > numbers from the paper imply dozens of corruptions just in sqlite
> > memory per user. Even if only a small fraction of these are harmful,
> > spread over millions of users that's a lot of corruption.
> >
> >> But I am unsure how to calculate whether, for example, a random bit
> >> flip in the backing stores, which add up to at least 10M on most
> >> machines, does not hurt, or one in the middle of a cache entry, or
> >> in the data part of some structure.
> >>
> >> > I imagine there's no way we can know when corruption happens in
> >> > steady state, and the next query leads to some other browser
> >> > memory (or another database) getting corrupted?
> >> >
> >> > On Tue, Oct 6, 2009 at 3:58 PM, Huan Ren <[email protected]> wrote:
> >> >>
> >> >> It will be helpful to get our own measurement on database
> >> >> failures. Carlos just added something like that.
> >> >>
> >> >> Huan
> >> >>
> >> >> On Tue, Oct 6, 2009 at 3:49 PM, John Abd-El-Malek <[email protected]>
> >> >> wrote:
> >> >> > Saw this on slashdot:
> >> >> > http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
> >> >> > The conclusion is "an average of 25,000–75,000 FIT (failures in
> >> >> > time per billion hours of operation) per Mbit".
> >> >> > On my machine the browser process is usually > 100MB, so that
> >> >> > averages out to 176 to 493 errors per year, with those numbers
> >> >> > having big variance depending on the machine. Most users don't
> >> >> > have ECC, which means this will lead to corruption. Sqlite is a
> >> >> > heavy user of memory, so even if it's 1/4 of the 100MB, that
> >> >> > means we'll see an average of 40-120 errors naturally because
> >> >> > of faulty DIMMs.
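[Editor's note: a back-of-the-envelope check of John's arithmetic. This is a sketch under stated assumptions: decimal MB, 8 bits per byte, and an 8760-hour year. It lands in the same ballpark as, though not exactly on, the 176-493 figure in the email, which may have used slightly different unit conventions.]

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def dram_errors_per_year(fit_per_mbit, megabytes):
    """Expected DRAM errors per year.

    FIT = failures per billion device-hours, quoted per Mbit of DRAM
    in the sigmetrics09 paper.
    """
    mbits = megabytes * 8  # decimal MB -> Mbit
    return fit_per_mbit * mbits * HOURS_PER_YEAR / 1e9

low = dram_errors_per_year(25_000, 100)   # ~175 errors/year
high = dram_errors_per_year(75_000, 100)  # ~526 errors/year
print(round(low), round(high))
```

Scaling by the 1/4 share John attributes to sqlite gives roughly 44-131 errors per year, consistent with his 40-120 estimate.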
> >> >> > Given that sqlite corruption means (repeated) crashing of the
> >> >> > browser process, it seems this data heavily suggests we should
> >> >> > separate the sqlite code into a separate process. The IPC
> >> >> > overhead is negligible compared to disk access. My hunch is
> >> >> > that the complexity is also not that high, since the code that
> >> >> > deals with it is already asynchronous, because we don't use
> >> >> > sqlite on the UI/IO threads.
> >> >> > What do others think?
> >> >>
> >> >
> >>
> >

--
Chromium Developers mailing list: [email protected]
View archives, change email options, or unsubscribe:
http://groups.google.com/group/chromium-dev
