: You could extend HTMLStripReader to not decode named character entities,
: e.g. by overriding HTMLStripReader.read() so that it calls an
: alternative readEntity(), which instead of converting entity references
: to characters would just leave the entity references as-is, something
: like:

Alternatively: use SynonymFilterFactory to map the entity "names" to the real Unicode characters, so that in your "Source2" style docs "omega" gets replaced with the same character the HTMLStrip*TokenizerFactories generate when they encounter the HTML entities. Generating the list of synonyms from the comment at the end of HTMLStripReader.java should be easy.

: > Source1: weakening Hδ absorption
: > Source1: zero-field gap ω
: >
: > Source2: weakening H delta absorption
: > Source2: zero-field gap omega

-Hoss
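
As a rough illustration of generating such a synonyms list, here is a minimal Python sketch. It does not parse the comment in HTMLStripReader.java. It uses Python's standard `html.entities.name2codepoint` table instead, on the assumption that the HTML 4 names there match the ones HTMLStripReader decodes. The function name `synonym_lines` is made up for this example, and the output uses the explicit `name => character` mapping form of a Solr synonyms file.

```python
from html.entities import name2codepoint  # standard HTML 4 entity table


def synonym_lines(names=None):
    """Return explicit synonym mappings of the form 'name => character'.

    `names` is the list of entity names to map; if omitted, every name in
    Python's HTML 4 table is used (an assumption: this table stands in for
    the list in HTMLStripReader.java).
    """
    if names is None:
        names = sorted(name2codepoint)
    return [f"{name} => {chr(name2codepoint[name])}" for name in names]


if __name__ == "__main__":
    # Only the two names from the example docs.
    for line in synonym_lines(["delta", "omega"]):
        print(line)
```

Mapping every name is probably not what you want: the table also contains ordinary words such as "and", "or" and "not", so passing an explicit list of names (Greek letters, say) is likely the safer choice.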