Robert,

Yes, if the problem were not data-dependent, I indeed wouldn't need to
index anything. However, I've run a small mountain of data through our
tokenizer on my machine and never seen the error, while my customer
gets these errors in the middle of a giant spew of data. As it
happens, I _was_ missing that call to clearAttributes() (and the
usual implementation of end()), but I found and fixed that problem
precisely by creating a random-data test case using checkRandomData().
Unfortunately, fixing that didn't make the customer's errors go away.
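
For the archives, the shape of that fix was just the standard Tokenizer
discipline: call clearAttributes() at the top of incrementToken(), and have
end() report the final offset. A minimal sketch, with placeholder names
standing in for our actual tokenizer (this is not the real code), assuming
the Lucene 4.x Tokenizer(Reader) constructor:

    import java.io.IOException;
    import java.io.Reader;

    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

    public final class OurTokenizer extends Tokenizer {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
      private int finalOffset;

      public OurTokenizer(Reader input) {
        super(input); // 4.x-style constructor; later versions use setReader()
      }

      @Override
      public boolean incrementToken() throws IOException {
        clearAttributes(); // the call that was missing: reset per-token state first
        // ... pull the next token from the underlying engine, then:
        // termAtt.append(tokenText);
        // offsetAtt.setOffset(correctOffset(start), correctOffset(end));
        // finalOffset = correctOffset(end);
        return false; // placeholder; the real tokenization logic is elided
      }

      @Override
      public void end() throws IOException {
        super.end();
        // report the true end of input so trailing offsets stay consistent
        offsetAtt.setOffset(finalOffset, finalOffset);
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        finalOffset = 0;
      }
    }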

So I'm left needing to help them identify the data that provokes this,
because I've so far failed to come up with any.
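
For reference, the random-data test I mentioned is just the stock
BaseTokenStreamTestCase idiom, roughly the sketch below; the class and
tokenizer names are placeholders for ours, the iteration count is arbitrary,
and this assumes the 4.x createComponents(String, Reader) signature:

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.BaseTokenStreamTestCase;
    import org.apache.lucene.analysis.Tokenizer;

    public class TestOurTokenizer extends BaseTokenStreamTestCase {
      public void testRandomStrings() throws Exception {
        Analyzer analyzer = new Analyzer() {
          @Override
          protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer tokenizer = new OurTokenizer(reader); // placeholder for the real tokenizer
            return new TokenStreamComponents(tokenizer);
          }
        };
        // Feeds lots of random unicode text through the analyzer and checks
        // offsets, clearAttributes() discipline, end(), reuse, and so on.
        checkRandomData(random(), analyzer, 1000 * RANDOM_MULTIPLIER);
      }
    }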

--benson


On Fri, Jan 3, 2014 at 2:16 PM, Robert Muir <rcm...@gmail.com> wrote:
> This exception comes from OffsetAttributeImpl (i.e., you don't need to
> index anything to reproduce it).
>
> Maybe you have a missing clearAttributes() call (your tokenizer
> 'returns true' without calling that first)? This could explain it, if
> something like a StopFilter is also present in the chain: basically
> the offsets overflow.
>
> The test stuff in BaseTokenStreamTestCase should be able to detect
> this as well...
>
> On Fri, Jan 3, 2014 at 1:56 PM, Benson Margulies <ben...@basistech.com> wrote:
>> Using SolrCloud with Solr 4.3.1.
>>
>> We've got a problem with a tokenizer that manifests as calling
>> OffsetAtt.setOffset() with invalid inputs. OK, so, we want to figure out
>> what input provokes our code into getting into this pickle.
>>
>> The problem happens on SolrCloud nodes.
>>
>> The problem manifests as this sort of thing:
>>
>> Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
>> SEVERE: java.lang.IllegalArgumentException: startOffset must be
>> non-negative, and endOffset must be >= startOffset,
>> startOffset=-1811581632,endOffset=-1811581632
>>
>> How could we get a document ID so that we can tell which document was being
>> processed?
