Re: Error with highlighter and UTF-8 chars?

Koji Sekiguchi Mon, 23 Feb 2009 05:07:50 -0800

Jacob,

What Solr version are you using? There is a bug in the SolrHighlighter of Solr 1.3;
you may want to look at:


https://issues.apache.org/jira/browse/SOLR-925
https://issues.apache.org/jira/browse/LUCENE-1500

regards,

Koji


Jacob Singh wrote:

Hi,

We ran into a weird one today.  We have a document which is written in
German and every time we make a query which matches it, we get the
following:

java.lang.StringIndexOutOfBoundsException: String index out of range: 2822
        at java.lang.String.substring(String.java:1935)
        at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:274)


From source diving, it looks like Lucene's highlighter is calling
substring with an offset that is outside the bounds of the body field
it is highlighting.  Running an fq against the ID of the
document returns it fine (because no highlighting is done), and I took
the body and tried to cut the first 2822 chars; while that is near
the end of the body, it is still in range.

Here is the related code:

startOffset = tokenGroup.matchStartOffset;
endOffset = tokenGroup.matchEndOffset;
tokenText = text.substring(startOffset, endOffset);


This leads me to believe there is some problem with multi-byte string
encoding and how Lucene counts offsets.
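
To illustrate the suspicion above, here is a minimal sketch, not Lucene's
or Solr's actual code, of how an offset counted in UTF-8 bytes can run past
the end of a Java String, whose length is counted in chars.  The class
OffsetDemo and its methods utf8ByteLength and safeSubstring are hypothetical
names made up for this example, and it assumes Java 7 or later for
StandardCharsets.  It only shows that such a mismatch would produce this
kind of exception, not that it is the cause here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Lucene or Solr.
class OffsetDemo {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8ByteLength(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bounds-checked variant of text.substring(start, end): returns null
    // instead of throwing when the offsets do not fit the text.
    static String safeSubstring(String text, int start, int end) {
        if (start < 0 || end > text.length() || start > end) {
            return null;
        }
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        // "Groesse" spelled with o-umlaut and sharp s, written with
        // Unicode escapes so the source file encoding does not matter.
        String text = "Gr\u00f6\u00dfe";
        int byteLen = utf8ByteLength(text);
        System.out.println(text.length() + " chars, " + byteLen + " bytes");
        try {
            // A byte-based end offset is larger than the char length.
            text.substring(0, byteLen);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("substring failed: " + e.getMessage());
        }
        // The guarded call reports the bad offsets without throwing.
        System.out.println(safeSubstring(text, 0, byteLen));
    }
}
```

A guard like safeSubstring would only hide the symptom; if the offsets
really disagree with the stored text, the highlighted fragments would
still be wrong.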

Any ideas here?  Tomcat is configured with UTF-8 btw.

Best,
Jacob
