Hi Frances,

HTMLStripWhitespaceTokenizerFactory wraps a WhitespaceTokenizer around an HTMLStripReader.

You could extend HTMLStripReader so that it does not decode named character entities, e.g. by overriding HTMLStripReader.read() to call an alternative readEntity() that leaves entity references as-is instead of converting them to characters. Something like:

    public class MyHTMLStripReader extends HTMLStripReader {

      // Constructor so the factory below can wrap a Reader
      public MyHTMLStripReader(Reader source) {
        super(source);
      }

      ///// override read() to call myReadEntity(), but no other changes
      public int read() throws IOException {
        ...
        switch (ch) {
          case '&':
            saveState();
            ch = myReadEntity(); ///// Change this line to call new method
            if (ch>=0) return ch;
            if (ch==MISMATCH) {
              restoreState();
              return '&';
            }
            break;
          ...
        }
      }

      private int myReadEntity() throws IOException {
        int ch = next();
        if (ch=='#') return readNumericEntity();
        return MISMATCH; ///// Always a mismatch, except for numeric entities
      }
    }

Then you could create a new Factory, something like:

    public class MyHTMLStripWhitespaceTokenizerFactory extends BaseTokenizerFactory {
      public TokenStream create(Reader input) {
        return new WhitespaceTokenizer(new MyHTMLStripReader(input));
      }
    }

Steve

On 07/24/2008 at 9:53 AM, F Knudson wrote:
>
> Greetings:
>
> I am working with many different data sources - some sources
> employ "entity references"; others do not. My goal is to
> make the searching across sources as consistent as possible.
>
> Example text -
>
> Source1: weakening H&delta; absorption
> Source1: zero-field gap &omega;
>
> Source2: weakening H delta absorption
> Source2: zero-field gap omega
>
> Using the tokenizer solr.HTMLStripWhitespaceTokenizerFactory
> for Source1 - the entity is replaced with the "named character
> entity" - This works great.
>
> But I want the searching tokens to be identical for each
> source. I need to capture &delta; as a token.
>
> <fieldType name="text" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.ISOLatin1AccentFilterFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1" generateNumberParts="1" catenateWords="1"
>             catenateNumbers="1" catenateAll="0"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> Is this possible with the SOLR supplied tokenizers? I
> experimented with different combinations and orders and was
> not successful.
>
> Is this possible using synonyms? I also experimented with
> this route but again was not successful.
>
> Do I need to create a custom tokenizer?
>
> Thanks
> Frances
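
P.S. In case it helps to see the behaviour myReadEntity() is aiming for in isolation, here is a small standalone sketch: numeric entity references are decoded, named ones are passed through untouched. It uses only java.util.regex and does not touch Solr at all. The class and method names (EntitySketch, decodeNumericOnly) are made up for illustration, and this is not how HTMLStripReader itself is implemented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntitySketch {

    // Matches only numeric references: decimal (&#948;) or hexadecimal (&#x3b4;).
    // Named references such as &delta; never match, so they are left alone.
    private static final Pattern NUMERIC_ENTITY =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    // Hypothetical helper: decode numeric references, keep everything else as-is.
    public static String decodeNumericOnly(String text) {
        Matcher m = NUMERIC_ENTITY.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one unchanged.
            out.append(text, last, m.start());
            int codePoint = parse(m.group(1), m.group(2));
            if (Character.isValidCodePoint(codePoint)) {
                out.appendCodePoint(codePoint);
            } else {
                // Out-of-range number: keep the reference verbatim.
                out.append(m.group());
            }
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    // Returns -1 when the digits do not fit in an int.
    private static int parse(String hex, String dec) {
        try {
            return hex != null ? Integer.parseInt(hex, 16) : Integer.parseInt(dec);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeNumericOnly("weakening H&delta; absorption"));
        System.out.println(decodeNumericOnly("zero-field gap &#969;"));
    }
}
```

Here "weakening H&delta; absorption" comes back unchanged, while "&#969;" is turned into the omega character. Whether the filters after the tokenizer then split H&delta; the way you want is something you would need to check with your own analyzer chain.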