RE: Tokenizing and searching named character entity references

Chris Hostetter Mon, 28 Jul 2008 16:03:32 -0700

: You could extend HTMLStripReader to not decode named character entities, 
: e.g. by overriding HTMLStripReader.read() so that it calls an 
: alternative readEntity(), which instead of converting entity references 
: to characters would just leave the entity references as-is, something 
: like:


Alternatively: use SynonymFilterFactory to map the entity "names" to the 
real Unicode characters, so your "Source2" style docs get "omega" replaced 
with the same character the HTMLStrip*TokenizerFactories generate when 
they encounter the HTML entity.

Generating the list of synonyms from the comment at the end of 
HTMLStripReader.java should be easy.
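
A rough sketch of that generation step, in case it helps. It assumes you 
have already pulled the entity names and code points out into a map; the 
class name EntitySynonyms and the two hard-coded entries are illustrative 
only, not part of Solr, and the exact layout of the comment in 
HTMLStripReader.java is not assumed here. Each output line uses the 
explicit "a => b" mapping form of the synonyms file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Solr): turns entity name -> code point
// pairs into explicit-mapping lines for a synonyms file.
class EntitySynonyms {

    // Builds one line per entity, e.g. "omega => ω".
    static List<String> toSynonymLines(Map<String, Integer> entities) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entities.entrySet()) {
            // Character.toChars also handles code points above U+FFFF.
            String ch = new String(Character.toChars(e.getValue()));
            lines.add(e.getKey() + " => " + ch);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Illustrative subset; the real list would cover every named entity.
        Map<String, Integer> entities = new LinkedHashMap<>();
        entities.put("delta", 0x03B4); // Greek small letter delta
        entities.put("omega", 0x03C9); // Greek small letter omega
        for (String line : toSynonymLines(entities)) {
            System.out.println(line);
        }
    }
}
```

Save the output as UTF-8 and point the synonyms attribute of 
SynonymFilterFactory at the resulting file.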


: > Source1:   weakening H&delta; absorption
: > Source1:   zero-field gap &omega;
: > 
: > Source2:  weakening H delta absorption
: > Source2:  zero-field gap omega



-Hoss
