RE: Tokenizing and searching named character entity references

Steven A Rowe Mon, 28 Jul 2008 14:40:06 -0700

Hi Frances,

HTMLStripWhitespaceTokenizerFactory wraps a WhitespaceTokenizer around an 
HTMLStripReader.


You could extend HTMLStripReader so that it does not decode named character 
entities. One way is to override HTMLStripReader.read() so that it calls an 
alternative to readEntity() that decodes only numeric entity references and 
leaves named entity references in the text as-is, something like:

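public class MyHTMLStripReader extends HTMLStripReader {

  ///// constructors are not inherited; this assumes HTMLStripReader has a
  ///// constructor taking the Reader to wrap, which the factory relies on
  public MyHTMLStripReader(Reader source) {
    super(source);
  }

  ///// override read() to call myReadEntity(), but no other changes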
  public int read() throws IOException {
    ...
    switch (ch) {
      case '&':
        saveState();
        ch = myReadEntity(); ///// Change this line to call new method
        if (ch>=0) return ch;
        if (ch==MISMATCH) {
          restoreState();
          return '&';
        }
        break;
      ...
    }
  }

  private int myReadEntity() throws IOException {
    int ch = next();
    if (ch=='#') return readNumericEntity();
    return MISMATCH;  ///// Always a mismatch, except for numeric entities
  }
}

Then you could create a new Factory, something like:

public class MyHTMLStripWhitespaceTokenizerFactory
    extends BaseTokenizerFactory {
  public TokenStream create(Reader input) {
    return new WhitespaceTokenizer(new MyHTMLStripReader(input));
  }
}
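
To make the intended effect concrete, here is a minimal standalone sketch that 
uses no Solr classes at all. It only mimics the behaviour described above: 
numeric references are decoded, named references are left alone, and the 
result is split on whitespace. The class and method names are made up for 
illustration, and the parsing is deliberately simpler than whatever 
HTMLStripReader really does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of Solr.
public class EntityPreservingDemo {

  /** Decodes numeric references (&#969; or &#x3C9;); leaves named ones (&delta;) as-is. */
  static String decodeNumericOnly(String text) {
    StringBuilder out = new StringBuilder();
    int i = 0;
    while (i < text.length()) {
      char c = text.charAt(i);
      if (c == '&' && i + 2 < text.length() && text.charAt(i + 1) == '#') {
        int end = text.indexOf(';', i + 2);
        // Require a non-empty, reasonably short body between "&#" and ";".
        if (end > i + 2 && end - i <= 10) {
          String body = text.substring(i + 2, end);
          try {
            int codePoint = (body.startsWith("x") || body.startsWith("X"))
                ? Integer.parseInt(body.substring(1), 16)
                : Integer.parseInt(body);
            if (Character.isValidCodePoint(codePoint)) {
              out.appendCodePoint(codePoint);
              i = end + 1;
              continue;
            }
          } catch (NumberFormatException e) {
            // Not a well-formed numeric reference; copy the '&' literally below.
          }
        }
      }
      out.append(c);
      i++;
    }
    return out.toString();
  }

  /** Splits on runs of whitespace, roughly what a whitespace tokenizer emits. */
  static List<String> whitespaceTokens(String text) {
    List<String> tokens = new ArrayList<>();
    for (String t : text.trim().split("\\s+")) {
      if (!t.isEmpty()) tokens.add(t);
    }
    return tokens;
  }

  public static void main(String[] args) {
    // Named reference survives inside its token: [weakening, H&delta;, absorption]
    System.out.println(whitespaceTokens(decodeNumericOnly("weakening H&delta; absorption")));
    // Numeric reference is decoded to the character it denotes.
    System.out.println(whitespaceTokens(decodeNumericOnly("zero-field gap &#969;")));
  }
}
```

Note that the token in the first example is H&delta; rather than &delta;, 
because splitting happens only on whitespace. Any filters later in the 
analyzer chain will see that token and may split or alter it further.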

Steve

On 07/24/2008 at 9:53 AM, F Knudson wrote:
> 
> Greetings:
> 
> I am working with many different data sources - some sources
> employ "entity references"; others do not.  My goal is to
> make the searching across sources as consistent as possible.
> 
> Example text -
> 
> Source1:   weakening H&delta; absorption
> Source1:   zero-field gap &omega;
> 
> Source2:  weakening H delta absorption
> Source2:  zero-field gap omega
> 
> Using the tokenizer solr.HTMLStripWhitespaceTokenizerFactory
> for Source1 - the entity is replaced with the "named character
> entity" - This works great.
> 
> But I want the searching tokens to be identical for each
> source.  I need to capture &delta;  as a token.
> 
> <fieldType name="text" class="solr.TextField"
> positionIncrementGap="100">
>       <analyzer type="index">
>        <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
>        <filter class="solr.ISOLatin1AccentFilterFactory"/>
>        <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
> </fieldType>
> 
> Is this possible with the SOLR supplied tokenizers?  I
> experimented with different combinations and orders and was
> not successful.
> 
> Is this possible using synonyms?  I also experimented with
> this route but again was not successful.
> 
> Do I need to create a custom tokenizer?
> 
> Thanks 
> Frances
