So something in the highlighting code appears to be counting bytes where it should be counting characters. This looks like a Lucene bug, so I'm surprised others have not hit it before. It is probably this issue: https://issues.apache.org/jira/browse/LUCENE-1500
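
To make the bytes-versus-characters hypothesis concrete, here is a minimal sketch in plain Java. It does not use Lucene or Solr at all; the class name `OffsetDrift`, the helper names, and the sample strings are made up for illustration, and the assumption is simply that match offsets get computed in UTF-8 bytes while `String.substring` expects char offsets. Under that assumption, every "ß" before a match pushes the highlight one position to the right, and a match near the end of the text can run past the end of the string.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only: not Lucene code, just the suspected failure mode.
final class OffsetDrift {

    /** Length of s in UTF-8 bytes; the sharp s (U+00DF) takes 2 bytes but is 1 Java char. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** The correct slice: offsets counted in Java chars, as String.substring expects. */
    static String sliceByCharOffsets(String text, String term, int charStart) {
        return text.substring(charStart, charStart + term.length());
    }

    /**
     * The suspected bug: start and end are counted in UTF-8 bytes but then used
     * as char indices. Throws StringIndexOutOfBoundsException once the byte
     * offsets exceed the char length of the text.
     */
    static String sliceByByteOffsets(String text, String term, int charStart) {
        int start = utf8Length(text.substring(0, charStart));
        int end = start + utf8Length(term);
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String term = "da\u00df";                                   // "dass" spelled with sharp s
        String text = "da\u00df ich; da\u00df das; da\u00df dann";  // three matches

        // Walk over every occurrence and compare the two ways of slicing.
        for (int i = text.indexOf(term); i >= 0; i = text.indexOf(term, i + 1)) {
            System.out.println("char offsets: [" + sliceByCharOffsets(text, term, i) + "]"
                    + "  byte offsets: [" + sliceByByteOffsets(text, term, i) + "]");
        }

        // With enough multibyte characters in front, the byte offsets leave the string.
        String tail = "\u00df\u00df\u00df da\u00df";
        try {
            sliceByByteOffsets(tail, term, tail.indexOf(term));
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("out of range: " + e.getMessage());
        }
    }
}
```

For the three matches in the sample text, the byte-offset slices come out as "daß ", "aß d" and "ß da", which is the same pattern as the misplaced "strong" tags in the highlighting output quoted below. This only shows that the symptoms are consistent with byte counting; it does not establish where in Lucene or Solr the offsets actually go wrong.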

-Peter

On Tue, Feb 24, 2009 at 2:22 PM, Peter Wolanin <peter.wola...@acquia.com> wrote:
> Here you can see a manifestation of it when trying to highlight with ?q=daß
>
> <lst name="highlighting">
> <lst name="ebdcc46ab3791a12dccd0f915a463bd2/node/11622">
> <arr name="body">
> <str>
> -Community" einfach nicht mehr wahrnimmt.
> Hätte mir am letzten Montag Nachmittag jemand gesagt, <strong>daß
> </strong>ich am Abend
> </str>
> <str>
> recht, wenn er sagte, d<strong>aß d</strong>as wirklich wertvolle an
> Drupal schlichtweg seine (Entwickler- und Anwender-)
> </str>
> <str>
> die Entstehungsgeschichte des Portals) auch dokumentiert worden, denn
> Ihr vermutet schon richtig, da<strong>ß da</strong>
> </str>
> </arr>
> </lst>
> </lst>
>
> You can see that each pair of "strong" tags is offset one character
> further from where it is supposed to be.
>
> -Peter
>
> On Mon, Feb 23, 2009 at 8:24 AM, Peter Wolanin <peter.wola...@acquia.com> wrote:
>> We are using Solr trunk (1.4), currently "nightly exported - yonik
>> - 2009-02-05 08:06:00"
>>
>> -Peter
>>
>> On Mon, Feb 23, 2009 at 8:07 AM, Koji Sekiguchi <k...@r.email.ne.jp> wrote:
>>> Jacob,
>>>
>>> What Solr version are you using? There is a bug in the SolrHighlighter
>>> of Solr 1.3; you may want to look at:
>>>
>>> https://issues.apache.org/jira/browse/SOLR-925
>>> https://issues.apache.org/jira/browse/LUCENE-1500
>>>
>>> regards,
>>>
>>> Koji
>>>
>>> Jacob Singh wrote:
>>>>
>>>> Hi,
>>>>
>>>> We ran into a weird one today. We have a document which is written in
>>>> German, and every time we make a query which matches it, we get the
>>>> following:
>>>>
>>>> java.lang.StringIndexOutOfBoundsException: String index out of range: 2822
>>>> at java.lang.String.substring(String.java:1935)
>>>> at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:274)
>>>>
>>>> From source diving, it looks like Lucene's highlighter is trying to
>>>> take a substring at an offset that is outside the bounds of the body
>>>> field which it is highlighting. Running an fq against the ID of the
>>>> document returns it fine (because no highlighting is done). I also took
>>>> the body and tried to cut the first 2822 chars; while that is near
>>>> the end of the body, it is still in range.
>>>>
>>>> Here is the related code:
>>>>
>>>> startOffset = tokenGroup.matchStartOffset;
>>>> endOffset = tokenGroup.matchEndOffset;
>>>> tokenText = text.substring(startOffset, endOffset);
>>>>
>>>> This leads me to believe there is some problem with multibyte string
>>>> encoding and Lucene's counting.
>>>>
>>>> Any ideas here? Tomcat is configured with UTF-8, btw.
>>>>
>>>> Best,
>>>> Jacob

--
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia. Inc.
peter.wola...@acquia.com