Here you can see a manifestation of it when trying to highlight with ?q=daß:
<lst name="highlighting">
  <lst name="ebdcc46ab3791a12dccd0f915a463bd2/node/11622">
    <arr name="body">
      <str>-Community" einfach nicht mehr wahrnimmt. Hätte mir am letzten Montag Nachmittag jemand gesagt, <strong>daß </strong>ich am Abend</str>
      <str>recht, wenn er sagte, d<strong>aß d</strong>as wirklich wertvolle an Drupal schlichtweg seine (Entwickler- und Anwender-)</str>
      <str>die Entstehungsgeschichte des Portals) auch dokumentiert worden, denn Ihr vermutet schon richtig, da<strong>ß da</strong></str>
    </arr>
  </lst>
</lst>

You can see that each successive <strong> tag is offset one character further from where it is supposed to be.

-Peter

On Mon, Feb 23, 2009 at 8:24 AM, Peter Wolanin <peter.wola...@acquia.com> wrote:
> We are using Solr trunk (1.4) - currently "nightly exported - yonik
> - 2009-02-05 08:06:00"
>
> -Peter
>
> On Mon, Feb 23, 2009 at 8:07 AM, Koji Sekiguchi <k...@r.email.ne.jp> wrote:
>> Jacob,
>>
>> What Solr version are you using? There is a bug in the SolrHighlighter of
>> Solr 1.3; you may want to look at:
>>
>> https://issues.apache.org/jira/browse/SOLR-925
>> https://issues.apache.org/jira/browse/LUCENE-1500
>>
>> regards,
>>
>> Koji
>>
>>
>> Jacob Singh wrote:
>>>
>>> Hi,
>>>
>>> We ran into a weird one today. We have a document which is written in
>>> German, and every time we make a query that matches it, we get the
>>> following:
>>>
>>> java.lang.StringIndexOutOfBoundsException: String index out of range: 2822
>>>     at java.lang.String.substring(String.java:1935)
>>>     at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:274)
>>>
>>> From source diving, it looks like Lucene's highlighter is trying to
>>> substring against an offset that is outside the bounds of the body field
>>> it is highlighting against. Running an fq against the ID of the
>>> document returns it fine (because no highlighting is done), and I took
>>> the body and tried cutting it at the first 2822 chars; while that is near
>>> the end of the body, it is still in range.
>>>
>>> Here is the related code:
>>>
>>> startOffset = tokenGroup.matchStartOffset;
>>> endOffset = tokenGroup.matchEndOffset;
>>> tokenText = text.substring(startOffset, endOffset);
>>>
>>> This leads me to believe there is some problem with multibyte string
>>> encoding and Lucene's offset counting.
>>>
>>> Any ideas here? Tomcat is configured with UTF-8, btw.
>>>
>>> Best,
>>> Jacob
>>>
>>
>>
>
>
> --
> Peter M. Wolanin, Ph.D.
> Momentum Specialist, Acquia. Inc.
> peter.wola...@acquia.com
>

--
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia. Inc.
peter.wola...@acquia.com
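
For reference, the one-character-per-match drift shown above is what you would see if token offsets were counted over the UTF-8 byte stream but then applied to a Java String (UTF-16 code units): "ä" and "ß" are each one char but two bytes. Here is a minimal, purely illustrative sketch of that mismatch (the class name and sample text are made up; this is not the actual Solr/Lucene code path):

    import java.nio.charset.StandardCharsets;

    // Illustrative only: compares char offsets (what String.substring expects)
    // with UTF-8 byte offsets for the same match positions.
    public class OffsetDriftSketch {
        public static void main(String[] args) {
            String text = "gesagt, daß ich ... sagte, daß das ... richtig, daß da";
            String term = "daß";

            int from = 0;
            while (true) {
                int charStart = text.indexOf(term, from);
                if (charStart < 0) {
                    break;
                }
                // Byte offset of the same position in the UTF-8 encoding of the text.
                int byteStart = text.substring(0, charStart)
                        .getBytes(StandardCharsets.UTF_8).length;
                // Every multi-byte character ("ß" here) seen before the match adds
                // one to the difference between the two offsets.
                System.out.println("char offset " + charStart
                        + ", utf-8 byte offset " + byteStart
                        + ", drift " + (byteStart - charStart));
                from = charStart + term.length();
            }
        }
    }

Running this prints a drift of 0, 1, and 2 for the three matches, the same pattern as the misplaced <strong> tags above. Over a ~2800-char body, enough accumulated drift could push an end offset past text.length(), which would line up with the StringIndexOutOfBoundsException in the trace; the SOLR-925 / LUCENE-1500 issues Koji linked are the place to check whether this is the same bug.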