Generally it's not safe to run CheckIndex while a writer is also open on the index.

It's not safe because CheckIndex can hit FileNotFoundExceptions when opening files, and, if you use -fix, CheckIndex will change the index out from under your other IndexWriter (which will then cause other kinds of corruption). That said, I don't think the corruption CheckIndex is detecting in your index would be caused by having a writer open on it. Your first CheckIndex run saw a different deletes file (_phe_p3.del, with 44824 deleted docs) than your second run (_phe_p4.del, with 44828 deleted docs), so the problem must somehow be tied to that change.

One question: if you have a corrupt index and run CheckIndex on it several times in a row, does it always fail in the same way, i.e. does the same term hit the exception below?

Is there any way I could get a copy of one of your corrupt cases? I can then dig...
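For reference, CheckIndex can also be driven programmatically once the writer is closed; here's a rough sketch against the 2.9 API (the index path below is a placeholder):

    import java.io.File;
    import org.apache.lucene.index.CheckIndex;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class CheckIndexSafely {
      public static void main(String[] args) throws Exception {
        // No IndexWriter may be open on this directory while we check it.
        Directory dir = FSDirectory.open(new File("/path/to/solr/data/index"));
        CheckIndex checker = new CheckIndex(dir);
        checker.setInfoStream(System.out);  // print per-segment details
        CheckIndex.Status status = checker.checkIndex();
        if (!status.clean) {
          // fixIndex() drops any segments it cannot recover -- back up
          // the index first, and never run it while a writer is open:
          // checker.fixIndex(status);
        }
        dir.close();
      }
    }

The command-line form (java org.apache.lucene.index.CheckIndex /path/to/index [-fix]) does the same thing.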
Mike

On Thu, Jan 13, 2011 at 10:52 AM, Stéphane Delprat
<stephane.delp...@blogspirit.com> wrote:
> I understand less and less what is happening to my Solr.
>
> I did a checkIndex (without -fix) and there was an error...
>
> So I did another checkIndex, with -fix, and then the error was gone. The
> segment was alright.
>
> During checkIndex I do not shut down the Solr server, I just make sure no
> client connects to the server.
>
> Should I shut down the Solr server during checkIndex?
>
> First checkIndex:
>
>  4 of 17: name=_phe docCount=264148
>    compound=false
>    hasProx=true
>    numFiles=9
>    size (MB)=928.977
>    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
> os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
> 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
> java.vendor=Sun Microsystems Inc.}
>    has deletions [delFileName=_phe_p3.del]
>    test: open reader.........OK [44824 deleted docs]
>    test: fields..............OK [51 fields]
>    test: field norms.........OK [51 fields]
>    test: terms, freq, prox...ERROR [term post_id:562 docFreq=1 != num docs
> seen 0 + num docs deleted 0]
> java.lang.RuntimeException: term post_id:562 docFreq=1 != num docs seen 0 +
> num docs deleted 0
>        at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
>        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
>        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>    test: stored fields.......OK [7206878 total field count; avg 32.86 fields
> per doc]
>    test: term vectors........OK [0 total vector count; avg 0 term/freq
> vector fields per doc]
> FAILED
>    WARNING: fixIndex() would remove reference to this segment; full
> exception:
> java.lang.RuntimeException: Term Index test failed
>        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
>        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>
> A few minutes later:
>
>  4 of 18: name=_phe docCount=264148
>    compound=false
>    hasProx=true
>    numFiles=9
>    size (MB)=928.977
>    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
> os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
> 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
> java.vendor=Sun Microsystems Inc.}
>    has deletions [delFileName=_phe_p4.del]
>    test: open reader.........OK [44828 deleted docs]
>    test: fields..............OK [51 fields]
>    test: field norms.........OK [51 fields]
>    test: terms, freq, prox...OK [3200899 terms; 26804334 terms/docs pairs;
> 28919124 tokens]
>    test: stored fields.......OK [7206764 total field count; avg 32.86 fields
> per doc]
>    test: term vectors........OK [0 total vector count; avg 0 term/freq
> vector fields per doc]
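A quick way to test whether the same term fails every run: the docFreq check that CheckIndex reports can be reproduced against a plain reader. A minimal sketch against the Lucene 2.9 API, using the term from the failed run above and a placeholder index path:

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.store.FSDirectory;

    public class DocFreqCheck {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(
            FSDirectory.open(new File("/path/to/solr/data/index")), true);
        Term t = new Term("post_id", "562");  // term from the failed run
        // docFreq comes from the term dictionary...
        System.out.println("docFreq=" + reader.docFreq(t));
        // ...while TermDocs walks the live (non-deleted) postings, roughly
        // the "num docs seen" side of CheckIndex's comparison.
        TermDocs td = reader.termDocs(t);
        int seen = 0;
        while (td.next()) seen++;
        td.close();
        reader.close();
        System.out.println("docs seen=" + seen);
      }
    }

If the same term disagrees on every run, the term dictionary itself is likely bad; if the failing term moves around, something is probably changing underneath (e.g. the .del file).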
>
> On 12/01/2011 16:50, Michael McCandless wrote:
>>
>> Curious... is it always a docFreq=1 != num docs seen 0 + num docs deleted 0?
>>
>> It looks like new deletions were flushed against the segment (the del file
>> changed from _ncc_22s.del to _ncc_24f.del).
>>
>> Are you hitting any exceptions during indexing?
>>
>> Mike
>>
>> On Wed, Jan 12, 2011 at 10:33 AM, Stéphane Delprat
>> <stephane.delp...@blogspirit.com> wrote:
>>>
>>> I got another corruption.
>>>
>>> It sure looks like the same type of error (on a different field).
>>>
>>> It's also not linked to a merge, since the segment size did not change.
>>>
>>> *** Good segment:
>>>
>>>  1 of 9: name=_ncc docCount=1841685
>>>    compound=false
>>>    hasProx=true
>>>    numFiles=9
>>>    size (MB)=6,683.447
>>>    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
>>> os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
>>> 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
>>> java.vendor=Sun Microsystems Inc.}
>>>    has deletions [delFileName=_ncc_22s.del]
>>>    test: open reader.........OK [275881 deleted docs]
>>>    test: fields..............OK [51 fields]
>>>    test: field norms.........OK [51 fields]
>>>    test: terms, freq, prox...OK [17952652 terms; 174113812 terms/docs pairs;
>>> 204561440 tokens]
>>>    test: stored fields.......OK [45511958 total field count; avg 29.066
>>> fields per doc]
>>>    test: term vectors........OK [0 total vector count; avg 0 term/freq
>>> vector fields per doc]
>>>
>>> A few hours later:
>>>
>>> *** Broken segment:
>>>
>>>  1 of 17: name=_ncc docCount=1841685
>>>    compound=false
>>>    hasProx=true
>>>    numFiles=9
>>>    size (MB)=6,683.447
>>>    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
>>> os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
>>> 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
>>> java.vendor=Sun Microsystems Inc.}
>>>    has deletions [delFileName=_ncc_24f.del]
>>>    test: open reader.........OK [278167 deleted docs]
>>>    test: fields..............OK [51 fields]
>>>    test: field norms.........OK [51 fields]
>>>    test: terms, freq, prox...ERROR [term post_id:1599104 docFreq=1 != num
>>> docs seen 0 + num docs deleted 0]
>>> java.lang.RuntimeException: term post_id:1599104 docFreq=1 != num docs seen
>>> 0 + num docs deleted 0
>>>        at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
>>>        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
>>>        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>>>    test: stored fields.......OK [45429565 total field count; avg 29.056
>>> fields per doc]
>>>    test: term vectors........OK [0 total vector count; avg 0 term/freq
>>> vector fields per doc]
>>> FAILED
>>>    WARNING: fixIndex() would remove reference to this segment; full
>>> exception:
>>> java.lang.RuntimeException: Term Index test failed
>>>        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
>>>        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>>>
>>> I'll activate infoStream for next time.
>>>
>>> Thanks,
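On activating infoStream: Solr 1.4's example solrconfig.xml ships with an <infoStream file="INFOSTREAM.txt">false</infoStream> entry under <indexDefaults> that I believe you can flip to true. At the Lucene level it corresponds roughly to the following sketch (paths and analyzer here are placeholders):

    import java.io.File;
    import java.io.PrintStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class InfoStreamExample {
      public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("/path/to/index")),  // placeholder
            new StandardAnalyzer(Version.LUCENE_29),
            IndexWriter.MaxFieldLength.UNLIMITED);
        // Log every flush, merge and delete the writer performs, so the
        // events leading up to a corruption are captured.
        writer.setInfoStream(new PrintStream(new File("infostream.log")));
        // ... index as usual ...
        writer.close();
      }
    }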
>>>
>>> On 12/01/2011 00:49, Michael McCandless wrote:
>>>>
>>>> When you hit corruption, is it always this same problem?
>>>>
>>>>    java.lang.RuntimeException: term source:margolisphil docFreq=1 !=
>>>>    num docs seen 0 + num docs deleted 0
>>>>
>>>> Can you run with Lucene's IndexWriter infoStream turned on, and catch
>>>> the output leading to the corruption? If something is somehow messing
>>>> up the bits in the deletes file, that could cause this.
>>>>
>>>> Mike
>>>>
>>>> On Mon, Jan 10, 2011 at 5:52 AM, Stéphane Delprat
>>>> <stephane.delp...@blogspirit.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> We are using:
>>>>> Solr Specification Version: 1.4.1
>>>>> Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 18:06:42
>>>>> Lucene Specification Version: 2.9.3
>>>>> Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55
>>>>>
>>>>> # java -version
>>>>> java version "1.6.0_20"
>>>>> Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
>>>>> Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)
>>>>>
>>>>> We want to index 4M docs in one core (and when that works fine we will
>>>>> add other cores with 2M docs on the same server). 1 doc ~= 1 kB.
>>>>>
>>>>> We use Solr replication every 5 minutes to update the slave server
>>>>> (queries are executed on the slave only).
>>>>>
>>>>> Documents change very quickly; during a normal day we will have approx:
>>>>> * 200 000 updated docs
>>>>> * 1000 new docs
>>>>> * 200 deleted docs
>>>>>
>>>>> I attached the last good checkIndex: solr20110107.txt
>>>>> And the corrupted one: solr20110110.txt
>>>>>
>>>>> This is not the first time a segment has gotten corrupted on this
>>>>> server; that's why I ran checkIndex frequently. (But as you can see,
>>>>> the first segment is 1,800,000 docs and it works fine!)
>>>>>
>>>>> I can't find any "SEVERE", "FATAL" or "exception" entries in the Solr
>>>>> logs.
>>>>>
>>>>> I also attached my schema.xml and solrconfig.xml.
>>>>>
>>>>> Is there something wrong with what we are doing? Do you need other
>>>>> info?
>>>>>
>>>>> Thanks,