Hi,

You must *implement* the protected method correct(int offset) in your own 
CharFilter. It should call super.correct(offset) first (this is important if 
you chain several filters) and then return the corrected offset according to 
the transformations your CharFilter performs. If, for example, the character 
at offset 3 in the filtered output corresponds to offset 5 in the original 
input, you must return 5 when the given offset (after calling super) is 3.
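To make that mapping concrete, here is a minimal, Lucene-free sketch of the same idea in plain java.io (the class name and bookkeeping are made up for illustration): a filter that deletes every '-' from the input and records, for each deletion, how far the filtered offsets have drifted from the original ones, so that correct(outputOffset) can return the matching input offset. Lucene's BaseCharFilter keeps essentially this table for you when you call its addOffCorrectMap(...).

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: strips '-' characters and maps offsets in the
// filtered output back to offsets in the original input, the way
// CharFilter.correct(int) is supposed to.
class DashStrippingFilter extends Reader {
  private final Reader input;
  // {outputOffset, cumulativeDelta} pairs, one per removed character
  private final List<int[]> corrections = new ArrayList<>();
  private int inOff = 0;   // offset in the original input
  private int outOff = 0;  // offset in the filtered output

  DashStrippingFilter(Reader input) { this.input = input; }

  @Override
  public int read(char[] cbuf, int off, int len) throws IOException {
    int written = 0;
    while (written < len) {
      int c = input.read();
      if (c == -1) return written == 0 ? -1 : written;
      if (c == '-') {                  // drop the char, remember the shift
        inOff++;
        corrections.add(new int[] { outOff, inOff - outOff });
      } else {
        cbuf[off + written++] = (char) c;
        inOff++; outOff++;
      }
    }
    return written;
  }

  /** Map an offset in the filtered output to the original input offset. */
  int correct(int currentOff) {
    int delta = 0;
    for (int[] corr : corrections) {
      if (corr[0] <= currentOff) delta = corr[1];
      else break;
    }
    return currentOff + delta;
  }

  @Override public void close() throws IOException { input.close(); }
}
```

With input "ab--cde" the filtered output is "abcde"; the 'd' sits at output offset 3 but at input offset 5, and correct(3) returns 5 — exactly the 3-maps-to-5 case above. (Note the sketch only knows about offsets it has already read past.)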

Unrelated to that: catching the IOException and printing it to System.err is a 
poor way to implement such a filter. Instead, declare IOException on your 
constructor so the exception bubbles up to Solr; in the factory you can 
re-throw it wrapped in a SolrException. As written, your code would silently 
index nonsense or hit a NullPointerException later.
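If it helps, here is a minimal, Solr-free sketch of that error-handling pattern (class names are made up, and a plain RuntimeException stands in for SolrException): the filter's constructor declares IOException instead of swallowing it, and the factory wraps it so the failure surfaces instead of producing a half-initialized filter.

```java
import java.io.IOException;
import java.io.Reader;

// Hypothetical sketch of the suggested error flow; the point is the
// exception propagation, not what the constructor does with the input.
class EagerFilter {
  final String content;

  // Constructor lets IOException bubble up to the caller.
  EagerFilter(Reader input) throws IOException {
    StringBuilder sb = new StringBuilder();
    int c;
    while ((c = input.read()) != -1) sb.append((char) c);
    content = sb.toString();
  }
}

class EagerFilterFactory {
  EagerFilter create(Reader input) {
    try {
      return new EagerFilter(input);
    } catch (IOException e) {
      // In a real factory this would be a SolrException.
      throw new RuntimeException("Could not read input for EagerFilter", e);
    }
  }
}
```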

In general, a CharFilter should *not* read the whole input up-front in the 
constructor and then transform it; instead, it should implement the read(...) 
methods and transform the input on the fly.
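A plain-java.io sketch of that streaming style (hypothetical class, with upper-casing standing in for whatever transformation your filter does): the constructor just keeps a reference to the wrapped reader, and all the work happens inside read(...), one chunk at a time.

```java
import java.io.IOException;
import java.io.Reader;

// Hypothetical sketch: transform characters as they pass through
// read(...) instead of slurping the whole input in the constructor.
class UpperCasingReader extends Reader {
  private final Reader input;

  UpperCasingReader(Reader input) { this.input = input; }

  @Override
  public int read(char[] cbuf, int off, int len) throws IOException {
    int n = input.read(cbuf, off, len);   // pull one chunk from upstream
    for (int i = 0; i < n; i++) {
      cbuf[off + i] = Character.toUpperCase(cbuf[off + i]);
    }
    return n;                             // -1 (EOF) passes through untouched
  }

  @Override public void close() throws IOException { input.close(); }
}
```

Because this particular transformation is one-to-one, offsets never shift and correct(offset) would simply be the identity; a filter that inserts or deletes characters additionally needs the offset bookkeeping described above.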

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]


> -----Original Message-----
> From: Osullivan L. [mailto:[email protected]]
> Sent: Thursday, September 13, 2012 12:43 PM
> To: [email protected]
> Subject: RE: charFilter
> 
> Hi Folks,
> 
> I'm getting the following error after using a custom filter:
> 
> SEVERE: org.apache.solr.common.SolrException:
> org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token PR
> 2823.000000 A0.200000 S0.819880 exceeds length of provided text sized 15
> 
> As the error suggests, the input value is PR2823.A2S81988 (15 chars). I have
> been informed that correctOffset() method of the CharFilter class can be used
> to resolve this issue but as far as I can tell, all that does is return the
> value - it doesn't set it.
> 
> I have included some details below.
> 
> Kind Regards,
> 
> Luke
> 
> In my schema I have:
> 
>     <fieldType name="LCNormalized" class="solr.TextField"
> sortMissingLast="true" omitNorms="true">
>         <analyzer>
>           <charFilter 
> class="com.test.solr.analysis.LukesTestCharFilterFactory"/>
>           <tokenizer class="solr.KeywordTokenizerFactory"/>
>         </analyzer>
>     </fieldType>
> 
> and the method is:
> 
> public class LukesTestCharFilterFactory extends BaseCharFilterFactory {
> 
>       public CharStream create(CharStream input) {
>               return new LukesTestCharFilter(input);
>       }
> }
> 
> public final class LukesTestCharFilter extends BaseCharFilter {  ...
>   public LukesTestCharFilter(CharStream input)  {
>         super(input);
>         try {
>           // Load the whole input into a string
>           StringBuilder sb = new StringBuilder();
>           char[] buf = new char[1024];
> 
>           int len;
>           while ((len = input.read(buf)) >= 0) {
>               sb.append(buf, 0, len);
>           }
> 
>           String original = sb.toString();
>           String modified = getLCShelfkey(original);
>           CharStream result = CharReader.get(new StringReader(modified));
> 
>           this.input = result;
>           this.input.correctOffset(modified.length());
>       } catch (IOException e) {
>           System.err.println("There was a problem parsing input.  Skipping.");
>       }
>   }
>  ...
> }
