As I'm not familiar with the syncing in Lucene, I couldn't say whether there's a specific problem with regard to Win7/Server 2008, etc.
Windows has long had the somewhat odd behaviour of deliberately caching
file handles after an explicit close(). This has been part of NTFS since
the NT 4 days, but there may be some new behaviour introduced in Windows
6.x (and there is a lot of new behaviour) that causes an issue.

I have also seen this problem on Windows Server 2008 (the server version
of Win7 - same file system). I'll try some further testing on previous
Windows versions, but I've never come across a single case of segment
corruption on Win2k3/XP after hard failures. In fact, it was when I first
encountered this problem on Server 2008 that I even discovered CheckIndex
existed!

I guess a good question for the community is: has anyone else
seen/reproduced this problem on Windows 6.x (i.e. Server 2008 or Win7)?

Mike, are there any diagnostics, config settings, etc. that I could try
to help isolate the problem?

Many thanks,
Peter


On Thu, Dec 2, 2010 at 9:28 AM, Michael McCandless
<luc...@mikemccandless.com> wrote:
> On Thu, Dec 2, 2010 at 4:10 AM, Peter Sturge <peter.stu...@gmail.com> wrote:
>> The Win7 crashes aren't from disk drivers - they come from, in this
>> case, a Broadcom wireless adapter driver.
>> The corruption comes as a result of the 'hard stop' of Windows.
>>
>> I would imagine this same problem could/would occur on any OS if the
>> plug was pulled from the machine.
>
> Actually, Lucene should be robust to this -- losing power, OS crash,
> hardware failure (as long as the failure doesn't flip bits), etc.
> This is because we do not delete the files associated with an old
> commit point until all files referenced by the new commit point have
> been successfully fsync'd.
>
> However, it sounds like something is wrong, at least on Windows 7.
>
> I suspect it may be how we do the fsync -- if you look at
> FSDirectory.fsync, you'll see that it takes a String fileName. We
> then open a new read/write RandomAccessFile, and call its
> .getFD().sync().
>
> I think this is potentially risky, i.e., it would be better to call
> .sync() on the original file we had opened for writing (and written
> lots of data to) before closing it, instead of closing it, opening a
> new FileDescriptor, and calling sync on that. We could conceivably
> take this approach, entirely in the Directory impl, by keeping the
> pool of write file handles open even after .close() was called. When
> a file is deleted, we'd remove it from that pool, and when it's
> finally sync'd, we'd sync it and then remove it from the pool.
>
> Could it be that on Windows 7 the way we fsync (opening a new
> FileDescriptor long after the first one was closed) doesn't in fact
> work?
>
> Mike
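
For anyone following along, here's a rough sketch of the two approaches
under discussion. The first method mirrors what FSDirectory.fsync
effectively does today (reopen the file by name and sync the fresh
descriptor); the second is one possible shape for the handle-pool idea
Mike describes. The SyncPool class and its method names (retain, drop,
syncAndRelease) are purely illustrative -- they are not Lucene's actual
API:

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.HashMap;
    import java.util.Map;

    public class FsyncSketch {

        // Roughly what FSDirectory.fsync does today: the data file was
        // already written and closed elsewhere; we reopen it by name and
        // sync a brand-new FileDescriptor.
        static void syncByReopen(File dir, String fileName) throws IOException {
            RandomAccessFile raf =
                new RandomAccessFile(new File(dir, fileName), "rw");
            try {
                raf.getFD().sync();
            } finally {
                raf.close();
            }
        }

        // One possible shape for the handle pool: park the original write
        // handle instead of closing it, then sync that same descriptor.
        static class SyncPool {
            private final Map<String, RandomAccessFile> pool =
                new HashMap<String, RandomAccessFile>();

            // Called in place of close() after writing: keep the handle open.
            synchronized void retain(String name, RandomAccessFile raf) {
                pool.put(name, raf);
            }

            // Called if the file is deleted before it was ever sync'd.
            synchronized void drop(String name) throws IOException {
                RandomAccessFile raf = pool.remove(name);
                if (raf != null) {
                    raf.close();
                }
            }

            // Sync the original descriptor, then close and release it.
            synchronized void syncAndRelease(String name) throws IOException {
                RandomAccessFile raf = pool.remove(name);
                if (raf != null) {
                    try {
                        raf.getFD().sync();
                    } finally {
                        raf.close();
                    }
                }
            }
        }
    }

The key difference is that syncAndRelease() calls sync() on the same
FileDescriptor the data was written through, rather than on one obtained
after the original handle was closed -- which is exactly the behaviour in
question on Windows 6.x.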