Error with highlighter and UTF-8 chars?

Jacob Singh Sun, 22 Feb 2009 22:30:25 -0800

Hi,

We ran into a weird one today.  We have a document written in German,
and every time we make a query which matches it, we get the
following:


java.lang.StringIndexOutOfBoundsException: String index out of range: 2822
        at java.lang.String.substring(String.java:1935)
        at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:274)


From source diving, it looks like Lucene's highlighter is calling
substring with an offset that is outside the bounds of the body field
it is highlighting.  Running an fq against the ID of the document
returns it fine (because no highlighting is done).  I also took the
body and cut the first 2822 chars; while that is near the end of the
body, it is still in range.

Here is the related code:

startOffset = tokenGroup.matchStartOffset;
endOffset = tokenGroup.matchEndOffset;
tokenText = text.substring(startOffset, endOffset);
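
For debugging, a guard like the one below can report the offending
offsets instead of letting substring throw.  This is only a sketch:
OffsetGuard and its methods are made-up names for illustration, not
part of Lucene, and the offsets in main are invented to mimic the
failure.

```java
public class OffsetGuard {
    // Hypothetical helper (not a Lucene API): true if [start, end) is a
    // valid argument pair for text.substring(start, end).
    static boolean offsetsInRange(String text, int start, int end) {
        return start >= 0 && start <= end && end <= text.length();
    }

    // True if String.substring rejects the given offsets.
    static boolean substringThrows(String text, int start, int end) {
        try {
            text.substring(start, end);
            return false;
        } catch (StringIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String text = "Gr\u00FC\u00DFe"; // "Gruesse" with umlaut and sharp s, 5 chars
        int startOffset = 3;
        int endOffset = 7; // invented: past the end, as if counted in different units

        if (offsetsInRange(text, startOffset, endOffset)) {
            System.out.println(text.substring(startOffset, endOffset));
        } else {
            System.out.println("offsets " + startOffset + "-" + endOffset
                    + " fall outside text of length " + text.length());
        }
    }
}
```

Logging the offsets next to text.length() this way should at least
show how far apart the two counts are for the failing document.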


This leads me to believe there is some problem with multibyte string
encoding and how Lucene counts offsets.
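
To illustrate the kind of mismatch suspected here (an assumption, not
a confirmed diagnosis): String.substring indexes by Java chars
(UTF-16 code units), while the same German text takes more bytes in
UTF-8, so an offset counted in one unit can overshoot in the other.
OffsetUnits and its method names are hypothetical, chosen only for
this sketch.

```java
import java.nio.charset.StandardCharsets;

public class OffsetUnits {
    // Length in UTF-16 code units, the unit String.substring() indexes by.
    static int charLength(String s) {
        return s.length();
    }

    // Length in bytes once the string is encoded as UTF-8.
    static int utf8ByteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // German word with o-umlaut (U+00F6) and sharp s (U+00DF),
        // written with escapes so the source file encoding does not matter.
        String word = "Gr\u00F6\u00DFe";
        System.out.println(charLength(word) + " chars, "
                + utf8ByteLength(word) + " bytes");
        // prints "5 chars, 7 bytes"
    }
}
```

If the offsets were computed against a differently sized form of the
text than the stored string, the difference would grow with every
umlaut in the document.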

Any ideas here?  Tomcat is configured with UTF-8 btw.

Best,
Jacob


-- 

+1 510 277-0891 (o)
+91 9999 33 7458 (m)

web: http://pajamadesign.com

Skype: pajamadesign
Yahoo: jacobsingh
AIM: jacobsingh
gTalk: [email protected]
