: You could extend HTMLStripReader to not decode named character entities, 
: e.g. by overriding HTMLStripReader.read() so that it calls an 
: alternative readEntity(), which instead of converting entity references 
: to characters would just leave the entity references as-is, something 
: like:

Alternately: use SynonymFilterFactory to map any entity "names" to the 
real Unicode character so your "Source2" style docs get "omega" replaced 
with the same character the HTMLStrip*TokenizerFactories generate when 
they encounter the HTML entities.

Generating the list of synonyms from the comment at the end of 
HTMLStripReader.java should be easy.
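
A rough sketch of what that generation step might look like, assuming 
you've already pulled the entity names and code points out of that 
comment into a map. The class and method names here (EntitySynonyms, 
toSynonymLine, buildSynonyms) are made up for illustration and are not 
part of Solr, and the two entries in main() are just the ones from the 
examples in this thread. The output uses the explicit "a => b" mapping 
syntax from synonyms.txt:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Solr: turns (entity name, code point)
// pairs into explicit-mapping lines for a synonyms.txt file.
public class EntitySynonyms {

    // Formats one line such as "omega => <the omega character>".
    public static String toSynonymLine(String entityName, int codePoint) {
        return entityName + " => " + new String(Character.toChars(codePoint));
    }

    // Joins one mapping line per entity, in the map's iteration order.
    public static String buildSynonyms(Map<String, Integer> entities) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : entities.entrySet()) {
            sb.append(toSynonymLine(e.getKey(), e.getValue())).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: in practice this map would be filled from the entity
        // list in HTMLStripReader.java; only two sample entries are shown.
        Map<String, Integer> entities = new LinkedHashMap<>();
        entities.put("delta", 0x03B4); // GREEK SMALL LETTER DELTA
        entities.put("omega", 0x03C9); // GREEK SMALL LETTER OMEGA
        System.out.print(buildSynonyms(entities));
    }
}
```

Entity names are case sensitive ("omega" and "Omega" are different 
characters), so where the SynonymFilterFactory sits relative to any 
lowercasing in your analyzer is something you'd want to check.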


: > Source1:   weakening Hδ absorption
: > Source1:   zero-field gap ω
: > 
: > Source2:  weakening H delta absorption
: > Source2:  zero-field gap omega



-Hoss
