Generally it's not safe to run CheckIndex while a writer is also open on the index.

It's not safe because CheckIndex can hit FileNotFoundExceptions when opening files, and, if you use -fix, CheckIndex will change the index out from under your other IndexWriter (which will then cause other kinds of corruption). That said, I don't think the corruption CheckIndex is detecting in your index would be caused by having a writer open on it. Your first CheckIndex run saw a different deletes file (_phe_p3.del, with 44824 deleted docs) than your second run (_phe_p4.del, with 44828 deleted docs), so the problem must somehow be tied to that change.

One question: if you have a corrupt index and run CheckIndex on it several times in a row, does it always fail in the same way, i.e. does the same term hit the exception below?

Is there any way I could get a copy of one of your corrupt cases? I can then dig...
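For reference, CheckIndex can also be driven programmatically once the writer is closed; here's a rough sketch against the 2.9 API (the index path below is a placeholder):

    import java.io.File;
    import org.apache.lucene.index.CheckIndex;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class CheckIndexSafely {
      public static void main(String[] args) throws Exception {
        // No IndexWriter may be open on this directory while we check it.
        Directory dir = FSDirectory.open(new File("/path/to/solr/data/index"));
        CheckIndex checker = new CheckIndex(dir);
        checker.setInfoStream(System.out);  // print per-segment details
        CheckIndex.Status status = checker.checkIndex();
        if (!status.clean) {
          // fixIndex() drops any segments it cannot recover -- back up
          // the index first, and never run it while a writer is open:
          // checker.fixIndex(status);
        }
        dir.close();
      }
    }

The command-line form (java org.apache.lucene.index.CheckIndex /path/to/index [-fix]) does the same thing.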
Mike

On Thu, Jan 13, 2011 at 10:52 AM, Stéphane Delprat
<stephane.delp...@blogspirit.com> wrote:
> I understand less and less what is happening to my Solr.
>
> I did a checkIndex (without -fix) and there was an error...
>
> So I did another checkIndex, with -fix, and then the error was gone. The
> segment was alright.
>
> During checkIndex I do not shut down the Solr server, I just make sure no
> client connects to the server.
>
> Should I shut down the Solr server during checkIndex?
>
> First checkIndex:
>
>  4 of 17: name=_phe docCount=264148
>    compound=false
>    hasProx=true
>    numFiles=9
>    size (MB)=928.977
>    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
> os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
> 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
> java.vendor=Sun Microsystems Inc.}
>    has deletions [delFileName=_phe_p3.del]
>    test: open reader.........OK [44824 deleted docs]
>    test: fields..............OK [51 fields]
>    test: field norms.........OK [51 fields]
>    test: terms, freq, prox...ERROR [term post_id:562 docFreq=1 != num docs
> seen 0 + num docs deleted 0]
> java.lang.RuntimeException: term post_id:562 docFreq=1 != num docs seen 0 +
> num docs deleted 0
>        at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
>        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
>        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>    test: stored fields.......OK [7206878 total field count; avg 32.86 fields
> per doc]
>    test: term vectors........OK [0 total vector count; avg 0 term/freq
> vector fields per doc]
> FAILED
>    WARNING: fixIndex() would remove reference to this segment; full
> exception:
> java.lang.RuntimeException: Term Index test failed
>        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
>        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>
> A few minutes later:
>
>  4 of 18: name=_phe docCount=264148
>    compound=false
>    hasProx=true
>    numFiles=9
>    size (MB)=928.977
>    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
> os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
> 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
> java.vendor=Sun Microsystems Inc.}
>    has deletions [delFileName=_phe_p4.del]
>    test: open reader.........OK [44828 deleted docs]
>    test: fields..............OK [51 fields]
>    test: field norms.........OK [51 fields]
>    test: terms, freq, prox...OK [3200899 terms; 26804334 terms/docs pairs;
> 28919124 tokens]
>    test: stored fields.......OK [7206764 total field count; avg 32.86 fields
> per doc]
>    test: term vectors........OK [0 total vector count; avg 0 term/freq
> vector fields per doc]
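A quick way to test whether the same term fails every run: the docFreq check that CheckIndex reports can be reproduced against a plain reader. A minimal sketch against the Lucene 2.9 API, using the term from the failed run above and a placeholder index path:

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.store.FSDirectory;

    public class DocFreqCheck {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(
            FSDirectory.open(new File("/path/to/solr/data/index")), true);
        Term t = new Term("post_id", "562");  // term from the failed run
        // docFreq comes from the term dictionary...
        System.out.println("docFreq=" + reader.docFreq(t));
        // ...while TermDocs walks the live (non-deleted) postings, roughly
        // the "num docs seen" side of CheckIndex's comparison.
        TermDocs td = reader.termDocs(t);
        int seen = 0;
        while (td.next()) seen++;
        td.close();
        reader.close();
        System.out.println("docs seen=" + seen);
      }
    }

If the same term disagrees on every run, the term dictionary itself is likely bad; if the failing term moves around, something is probably changing underneath (e.g. the .del file).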
>
> On 12/01/2011 16:50, Michael McCandless wrote:
>>
>> Curious... is it always a docFreq=1 != num docs seen 0 + num docs deleted 0?
>>
>> It looks like new deletions were flushed against the segment (the del file
>> changed from _ncc_22s.del to _ncc_24f.del).
>>
>> Are you hitting any exceptions during indexing?
>>
>> Mike
>>
>> On Wed, Jan 12, 2011 at 10:33 AM, Stéphane Delprat
>> <stephane.delp...@blogspirit.com> wrote:
>>>
>>> I got another corruption.
>>>
>>> It sure looks like the same type of error (on a different field).
>>>
>>> It's also not linked to a merge, since the segment size did not change.
>>>
>>> *** Good segment:
>>>
>>>  1 of 9: name=_ncc docCount=1841685
>>>    compound=false
>>>    hasProx=true
>>>    numFiles=9
>>>    size (MB)=6,683.447
>>>    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
>>> os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
>>> 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
>>> java.vendor=Sun Microsystems Inc.}
>>>    has deletions [delFileName=_ncc_22s.del]
>>>    test: open reader.........OK [275881 deleted docs]
>>>    test: fields..............OK [51 fields]
>>>    test: field norms.........OK [51 fields]
>>>    test: terms, freq, prox...OK [17952652 terms; 174113812 terms/docs pairs;
>>> 204561440 tokens]
>>>    test: stored fields.......OK [45511958 total field count; avg 29.066
>>> fields per doc]
>>>    test: term vectors........OK [0 total vector count; avg 0 term/freq
>>> vector fields per doc]
>>>
>>> A few hours later:
>>>
>>> *** Broken segment:
>>>
>>>  1 of 17: name=_ncc docCount=1841685
>>>    compound=false
>>>    hasProx=true
>>>    numFiles=9
>>>    size (MB)=6,683.447
>>>    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
>>> os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
>>> 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
>>> java.vendor=Sun Microsystems Inc.}
>>>    has deletions [delFileName=_ncc_24f.del]
>>>    test: open reader.........OK [278167 deleted docs]
>>>    test: fields..............OK [51 fields]
>>>    test: field norms.........OK [51 fields]
>>>    test: terms, freq, prox...ERROR [term post_id:1599104 docFreq=1 != num
>>> docs seen 0 + num docs deleted 0]
>>> java.lang.RuntimeException: term post_id:1599104 docFreq=1 != num docs seen
>>> 0 + num docs deleted 0
>>>        at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
>>>        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
>>>        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>>>    test: stored fields.......OK [45429565 total field count; avg 29.056
>>> fields per doc]
>>>    test: term vectors........OK [0 total vector count; avg 0 term/freq
>>> vector fields per doc]
>>> FAILED
>>>    WARNING: fixIndex() would remove reference to this segment; full
>>> exception:
>>> java.lang.RuntimeException: Term Index test failed
>>>        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
>>>        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>>>
>>> I'll activate infoStream for next time.
>>>
>>> Thanks,
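On activating infoStream: Solr 1.4's example solrconfig.xml ships with an <infoStream file="INFOSTREAM.txt">false</infoStream> entry under <indexDefaults> that I believe you can flip to true. At the Lucene level it corresponds roughly to the following sketch (paths and analyzer here are placeholders):

    import java.io.File;
    import java.io.PrintStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class InfoStreamExample {
      public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("/path/to/index")),  // placeholder
            new StandardAnalyzer(Version.LUCENE_29),
            IndexWriter.MaxFieldLength.UNLIMITED);
        // Log every flush, merge and delete the writer performs, so the
        // events leading up to a corruption are captured.
        writer.setInfoStream(new PrintStream(new File("infostream.log")));
        // ... index as usual ...
        writer.close();
      }
    }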
>>>
>>> On 12/01/2011 00:49, Michael McCandless wrote:
>>>>
>>>> When you hit corruption, is it always this same problem?
>>>>
>>>>    java.lang.RuntimeException: term source:margolisphil docFreq=1 !=
>>>>    num docs seen 0 + num docs deleted 0
>>>>
>>>> Can you run with Lucene's IndexWriter infoStream turned on, and catch
>>>> the output leading to the corruption? If something is somehow messing
>>>> up the bits in the deletes file, that could cause this.
>>>>
>>>> Mike
>>>>
>>>> On Mon, Jan 10, 2011 at 5:52 AM, Stéphane Delprat
>>>> <stephane.delp...@blogspirit.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> We are using:
>>>>> Solr Specification Version: 1.4.1
>>>>> Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 18:06:42
>>>>> Lucene Specification Version: 2.9.3
>>>>> Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55
>>>>>
>>>>> # java -version
>>>>> java version "1.6.0_20"
>>>>> Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
>>>>> Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)
>>>>>
>>>>> We want to index 4M docs in one core (and when that works fine we will
>>>>> add other cores with 2M docs on the same server). 1 doc ~= 1 kB.
>>>>>
>>>>> We use Solr replication every 5 minutes to update the slave server
>>>>> (queries are executed on the slave only).
>>>>>
>>>>> Documents change very quickly; during a normal day we will have approx:
>>>>> * 200 000 updated docs
>>>>> * 1000 new docs
>>>>> * 200 deleted docs
>>>>>
>>>>> I attached the last good checkIndex: solr20110107.txt
>>>>> And the corrupted one: solr20110110.txt
>>>>>
>>>>> This is not the first time a segment has gotten corrupted on this
>>>>> server; that's why I ran checkIndex frequently. (But as you can see,
>>>>> the first segment is 1,800,000 docs and it works fine!)
>>>>>
>>>>> I can't find any "SEVERE", "FATAL" or "exception" entries in the Solr
>>>>> logs.
>>>>>
>>>>> I also attached my schema.xml and solrconfig.xml.
>>>>>
>>>>> Is there something wrong with what we are doing? Do you need other
>>>>> info?
>>>>>
>>>>> Thanks,