Hi,

You must *override* the protected method correct(int offset) in your own CharFilter. It should do the following: call super.correct(offset) (this is important if you chain several filters), and then return a corrected offset according to the transformations you did in your own CharFilter. For example, if the character at offset 3 in your filter's output corresponds to offset 5 in the data being filtered, you must return 5 when the given offset (after calling super) is 3.
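To make the offset mapping concrete, here is a minimal sketch that uses only java.io, not the Lucene API. DroppingReader is a hypothetical stand-in for a CharFilter; the class and its method names are illustrative and merely mirror correct()/correctOffset(). It drops one character on the fly inside read(), records how far the output has shifted, and maps output offsets back to input offsets, passing the result down the chain.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a CharFilter (plain java.io, not the Lucene API).
class DroppingReader extends FilterReader {
    private final char dropped;
    // Each entry is {outputOffset, cumulativeDiff}: from outputOffset onwards,
    // input offset = output offset + cumulativeDiff.
    private final List<int[]> corrections = new ArrayList<>();
    private int outPos = 0; // characters handed to the consumer so far
    private int diff = 0;   // characters dropped so far

    DroppingReader(Reader in, char dropped) {
        super(in);
        this.dropped = dropped;
    }

    // Transform on the fly: skip dropped characters and remember the shift.
    @Override
    public int read() throws IOException {
        int c;
        while ((c = in.read()) == dropped) {
            diff++;
            corrections.add(new int[] {outPos, diff});
        }
        if (c != -1) {
            outPos++;
        }
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) break;
            buf[off + n++] = (char) c;
        }
        return (n == 0 && len > 0) ? -1 : n;
    }

    // Counterpart of correct(int): undo only this reader's own transformation.
    // Only meaningful for offsets that have already been read.
    int correct(int outputOffset) {
        int corrected = outputOffset;
        for (int[] entry : corrections) {
            if (entry[0] > outputOffset) break;
            corrected = outputOffset + entry[1];
        }
        return corrected;
    }

    // Counterpart of correctOffset(int): pass the result down the chain.
    int correctOffset(int outputOffset) {
        int mine = correct(outputOffset);
        return (in instanceof DroppingReader)
                ? ((DroppingReader) in).correctOffset(mine)
                : mine;
    }

    // Helper for the demo: drain a reader, wrapping the checked exception.
    static String readAll(Reader r) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        DroppingReader r = new DroppingReader(new StringReader("ab--cdef"), '-');
        System.out.println(readAll(r));         // abcdef
        System.out.println(r.correctOffset(3)); // 5
    }
}
```

With the input "ab--cdef" and '-' dropped, the character at output offset 3 is 'd', which sits at offset 5 in the input, so the corrected offset is 5. In a real CharFilter the same bookkeeping would live in your read(...) and correct(int) overrides.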
Unrelated to that: catching the IOException and printing it to System.err is not a good way to implement such a filter. Just make your constructor throw IOException itself, so it bubbles up to Solr; in the factory you can re-throw it as a SolrException. As written, your code would silently index nonsense or hit an NPE later.

In general, a CharFilter should *not* read the whole input up-front in its constructor and then transform it. Instead, it should implement the read(...) methods and transform the input on the fly.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

> -----Original Message-----
> From: Osullivan L. [mailto:[email protected]]
> Sent: Thursday, September 13, 2012 12:43 PM
> To: [email protected]
> Subject: RE: charFilter
>
> Hi Folks,
>
> I'm getting the following error after using a custom filter:
>
> SEVERE: org.apache.solr.common.SolrException:
> org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token PR
> 2823.000000 A0.200000 S0.819880 exceeds length of provided text sized 15
>
> As the error suggests, the input value is PR2823.A2S81988 (15 chars). I have
> been informed that correctOffset() method of the CharFilter class can be used
> to resolve this issue but as far as I can tell, all that does is return the
> value - it doesn't set it.
>
> I have included some details below.
>
> Kind Regards,
>
> Luke
>
> In my schema I have:
>
> <fieldType name="LCNormalized" class="solr.TextField"
>            sortMissingLast="true" omitNorms="true">
>   <analyzer>
>     <charFilter class="com.test.solr.analysis.LukesTestCharFilterFactory"/>
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>   </analyzer>
> </fieldType>
>
> and the method is:
>
> public class LukesTestCharFilterFactory extends BaseCharFilterFactory {
>
>     public CharStream create(CharStream input) {
>         return new LukesTestCharFilter(input);
>     }
> }
>
> public final class LukesTestCharFilter extends BaseCharFilter {
>     ...
>     public LukesTestCharFilter(CharStream input) {
>         super(input);
>         try {
>             // Load the whole input into a string
>             StringBuilder sb = new StringBuilder();
>             char[] buf = new char[1024];
>
>             int len;
>             while ((len = input.read(buf)) >= 0) {
>                 sb.append(buf, 0, len);
>             }
>
>             String original = sb.toString();
>             String modified = getLCShelfkey(original);
>             CharStream result = CharReader.get(new StringReader(modified));
>
>             this.input = result;
>             this.input.correctOffset(modified.length());
>         } catch (IOException e) {
>             System.err.println("There was a problem parsing input. Skipping.");
>         }
>     }
>     ...
> }
