Greetings: I am working with many different data sources - some sources employ "entity references"; others do not. My goal is to make searching across sources as consistent as possible.

Example text -

Source1: weakening H&delta; absorption
Source1: zero-field gap &omega;
Source2: weakening H delta absorption
Source2: zero-field gap omega

Using the tokenizer solr.HTMLStripWhitespaceTokenizerFactory for Source1, the entity is replaced with the "named character entity". This works great, but I want the search tokens to be identical for each source: I need to capture δ as a token.

  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.ISOLatin1AccentFilterFactory"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>

Is this possible with the tokenizers supplied with Solr? I experimented with different combinations and orders of filters and was not successful.

Is this possible using synonyms? I also experimented with this route, but again was not successful.

Do I need to create a custom tokenizer?

Thanks,
Frances
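
P.S. To make the goal concrete, here is a rough Python sketch, outside Solr and purely illustrative, of the normalization I am after: both source styles should end up as the same list of search tokens. The function names are my own invention and not anything Solr provides, and the sketch assumes only plain Greek letters need spelling out.

```python
import html
import re
import unicodedata

# Matches Unicode names such as "GREEK SMALL LETTER DELTA" (hypothetical helper pattern).
_GREEK_NAME = re.compile(r"^GREEK (?:SMALL|CAPITAL) LETTER ([A-Z]+)$")


def spell_out_greek(ch):
    """Return ' delta ' for 'δ' and so on; leave any other character unchanged."""
    match = _GREEK_NAME.match(unicodedata.name(ch, ""))
    return f" {match.group(1).lower()} " if match else ch


def normalize_tokens(text):
    """Decode entity references, spell out Greek letters, lowercase, split on whitespace."""
    decoded = html.unescape(text)  # "&delta;" becomes "δ"
    spelled = "".join(spell_out_greek(ch) for ch in decoded)
    return spelled.lower().split()
```

With this, normalize_tokens("weakening H&delta; absorption") and normalize_tokens("weakening H delta absorption") give the same tokens, which is the behaviour I would like to get from the Solr analyzer chain.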