RE: Tokenizing and searching named character entity references

2008-07-28 Thread Chris Hostetter
: You could extend HTMLStripReader to not decode named character entities,
: e.g. by overriding HTMLStripReader.read() so that it calls an
: alternative readEntity(), which instead of converting entity references
: to characters would just leave the entity references as-is, something
: like:

RE: Tokenizing and searching named character entity references

2008-07-28 Thread Steven A Rowe
Hi Frances, HTMLStripWhitespaceTokenizerFactory wraps a WhitespaceTokenizer around an HTMLStripReader. You could extend HTMLStripReader to not decode named character entities, e.g. by overriding HTMLStripReader.read() so that it calls an alternative readEntity(), which instead of converting en
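
A rough, self-contained sketch of the idea described in this message: strip the markup, leave named entity references untouched, then split on whitespace. It deliberately does not use Solr's HTMLStripReader or WhitespaceTokenizer, since the suggested override is not shown in full in this thread. The class name EntityPreservingStripper and its methods stripTags() and tokenize() are hypothetical stand-ins, and the tag handling is much cruder than a real HTML stripper (no comments, scripts, or quoted '>' inside attributes).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Solr or Lucene.
public class EntityPreservingStripper {

    // Drops everything between '<' and '>' and copies all other characters
    // unchanged, so an entity reference such as &eacute; is left as-is
    // instead of being decoded to a character.
    public static String stripTags(String html) {
        StringBuilder out = new StringBuilder();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                if (c == '>') {
                    inTag = false;
                    out.append(' '); // assumption: a tag acts as a token break
                }
            } else if (c == '<') {
                inTag = true;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Splits the stripped text on runs of whitespace, standing in for the
    // whitespace tokenizer that wraps the stripping reader.
    public static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        for (String t : stripTags(html).split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [caf&eacute;, au, lait]
        System.out.println(tokenize("<p>caf&eacute; <b>au</b> lait</p>"));
    }
}
```

With tokens like these, a query for the literal text caf&eacute; can match the indexed token, provided the query side is analyzed the same way.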