As I'm not familiar with the syncing in Lucene, I couldn't say whether there's a specific problem with regard to Win7/Server 2008, etc.
Windows has long had the somewhat odd behaviour of deliberately caching
file handles after an explicit close(). This has been part of NTFS since
the NT 4 days, but there may be some new behaviour introduced in Windows
6.x (and there is a lot of new behaviour) that causes an issue.

I have also seen this problem on Windows Server 2008 (the server version
of Win7 - same file system). I'll try some further testing on previous
Windows versions, but I've never come across a single case of segment
corruption on Win2k3/XP after hard failures. In fact, it was when I first
encountered this problem on Server 2008 that I even discovered CheckIndex
existed!

I guess a good question for the community is: has anyone else
seen/reproduced this problem on Windows 6.x (i.e. Server 2008 or Win7)?

Mike, are there any diagnostics, config settings, etc. that I could try
to help isolate the problem?

Many thanks,
Peter


On Thu, Dec 2, 2010 at 9:28 AM, Michael McCandless
<luc...@mikemccandless.com> wrote:
> On Thu, Dec 2, 2010 at 4:10 AM, Peter Sturge <peter.stu...@gmail.com> wrote:
>> The Win7 crashes aren't from disk drivers - they come from, in this
>> case, a Broadcom wireless adapter driver.
>> The corruption comes as a result of the 'hard stop' of Windows.
>>
>> I would imagine this same problem could/would occur on any OS if the
>> plug was pulled from the machine.
>
> Actually, Lucene should be robust to this -- losing power, OS crash,
> hardware failure (as long as the failure doesn't flip bits), etc.
> This is because we do not delete the files associated with an old
> commit point until all files referenced by the new commit point have
> been successfully fsync'd.
>
> However, it sounds like something is wrong, at least on Windows 7.
>
> I suspect it may be how we do the fsync -- if you look at
> FSDirectory.fsync, you'll see that it takes a String fileName. We
> then open a new read/write RandomAccessFile, and call its
> .getFD().sync().
>
> I think this is potentially risky, i.e., it would be better to call
> .sync() on the original file we had opened for writing (and written
> lots of data to) before closing it, instead of closing it, opening a
> new FileDescriptor, and calling sync on that. We could conceivably
> take this approach, entirely in the Directory impl, by keeping the
> pool of write file handles open even after .close() was called. When
> a file is deleted, we'd remove it from that pool, and when it's
> finally sync'd, we'd sync it and then remove it from the pool.
>
> Could it be that on Windows 7 the way we fsync (opening a new
> FileDescriptor long after the first one was closed) doesn't in fact
> work?
>
> Mike
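
For anyone following along, here's a rough sketch of the two approaches
under discussion. The first method mirrors what FSDirectory.fsync
effectively does today (reopen the file by name and sync the fresh
descriptor); the second is one possible shape for the handle-pool idea
Mike describes. The SyncPool class and its method names (retain, drop,
syncAndRelease) are purely illustrative -- they are not Lucene's actual
API:

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.HashMap;
    import java.util.Map;

    public class FsyncSketch {

        // Roughly what FSDirectory.fsync does today: the data file was
        // already written and closed elsewhere; we reopen it by name and
        // sync a brand-new FileDescriptor.
        static void syncByReopen(File dir, String fileName) throws IOException {
            RandomAccessFile raf =
                new RandomAccessFile(new File(dir, fileName), "rw");
            try {
                raf.getFD().sync();
            } finally {
                raf.close();
            }
        }

        // One possible shape for the handle pool: park the original write
        // handle instead of closing it, then sync that same descriptor.
        static class SyncPool {
            private final Map<String, RandomAccessFile> pool =
                new HashMap<String, RandomAccessFile>();

            // Called in place of close() after writing: keep the handle open.
            synchronized void retain(String name, RandomAccessFile raf) {
                pool.put(name, raf);
            }

            // Called if the file is deleted before it was ever sync'd.
            synchronized void drop(String name) throws IOException {
                RandomAccessFile raf = pool.remove(name);
                if (raf != null) {
                    raf.close();
                }
            }

            // Sync the original descriptor, then close and release it.
            synchronized void syncAndRelease(String name) throws IOException {
                RandomAccessFile raf = pool.remove(name);
                if (raf != null) {
                    try {
                        raf.getFD().sync();
                    } finally {
                        raf.close();
                    }
                }
            }
        }
    }

The key difference is that syncAndRelease() calls sync() on the same
FileDescriptor the data was written through, rather than on one obtained
after the original handle was closed -- which is exactly the behaviour in
question on Windows 6.x.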