I am perplexed by how the Solr analyzer and filters behave with regard to underscores.
1) I am trying to get rid of them when shingling, but I seem unable to do so with a stop words filter. And yet the WordDelimiter filter removes them when I am not even trying to.

2) Conversely, I would like to retain '$' symbols when they are adjacent to numbers, but I seem unable to do so without also having to accept all sorts of other punctuation.

My simple example configuration, test data, and results are below.

Most grateful for any guidance,
Christopher

Test Data:

<doc>
  <field name="id">StopWordTestData</field>
  <field name="conSubSec-text_dc">PreShingled ThisIsNotAStopWord ThisIsAStopWord ThisIsAlsoAStopWord beforeaperiod. beforeacomma, beforeacollan: under_Score don't Peter's s $1.00 $1 $1,000 $200 $3,000,000 $3m - # -#- --#-- Yes X No _ __ ___ a and also about</field>
</doc>

Field 1 - Delimited_text:

Index Analyzer: org.apache.solr.analysis.TokenizerChain
Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory
Filters:
  1. org.apache.solr.analysis.WordDelimiterFilterFactory
     args:{splitOnCaseChange: 1 generateNumberParts: 0 catenateWords: 1 generateWordParts: 0 catenateAll: 1 catenateNumbers: 1 }
  2. org.apache.solr.analysis.LowerCaseFilterFactory
     args:{}

Field 1 - Resulting Index Terms:

  Term                   #
  100                    2
  1000                   2
  200                    2
  3                      2
  3000000                2
  3m                     2
  a                      2
  about                  2
  also                   2
  and                    2
  beforeacollan          2
  beforeacomma           2
  beforeaperiod          2
  dont                   2
  m                      2
  no                     2
  peter                  2
  preshingled            2
  s                      2
  thisisalsoastopword    2
  thisisastopword        2
  thisisnotastopword     2
  underscore             2
  x                      2
  yes                    2
  1                      2

Field 2 - Shingled_Text:

Index Analyzer: org.apache.solr.analysis.TokenizerChain
Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory
Filters:
  1. org.apache.solr.analysis.WordDelimiterFilterFactory
     args:{splitOnCaseChange: 1 generateNumberParts: 0 catenateWords: 1 stemEnglishPossessive: 0 generateWordParts: 0 catenateAll: 0 catenateNumbers: 1 }
  2. org.apache.solr.analysis.StopFilterFactory
     args:{words: StopWords-PreShingled.txt ignoreCase: true enablePositionIncrements: true }
  3. org.apache.solr.analysis.LowerCaseFilterFactory
     args:{}
  4. org.apache.solr.analysis.ShingleFilterFactory
     args:{outputUnigrams: false maxShingleSize: 5 }

File: StopWords-PreShingled.txt

  s
  _
  PreShingled
  __
  ThisIsAStopWord
  ThisIsAlsoAStopWord

Field 2 - Resulting Index Terms (Sample):

  Term                                          #
  _ 100                                         1
  _ 100 1 1000                                  1
  _ _                                           1
  _ _ beforeaperiod beforeacomma                1
  _ beforeaperiod                               1
  _ beforeaperiod beforeacomma beforeacollan    1
  _ thisisnotastopword                          1
  _ thisisnotastopword _ _                      1
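
For anyone who wants to reproduce this, the Shingled_Text chain listed above should correspond to roughly the following schema.xml definition. This is only a sketch reconstructed from the listing, not copied from the actual schema.xml: the fieldType name "shingled_text" is a placeholder made up for illustration, and the use of solr.TextField as the class is an assumption. The tokenizer, filter order, and filter arguments are taken from the listing as-is.

```xml
<!-- Sketch only: the name "shingled_text" and class solr.TextField are
     assumptions; the tokenizer, filters, and args mirror the listing above. -->
<fieldType name="shingled_text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- 1. Split/catenate on delimiters and case changes -->
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1"
            generateWordParts="0"
            generateNumberParts="0"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="0"
            stemEnglishPossessive="0"/>
    <!-- 2. Remove the entries listed in StopWords-PreShingled.txt -->
    <filter class="solr.StopFilterFactory"
            words="StopWords-PreShingled.txt"
            ignoreCase="true"
            enablePositionIncrements="true"/>
    <!-- 3. Lowercase whatever is left -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- 4. Build shingles of up to 5 tokens, without emitting single tokens -->
    <filter class="solr.ShingleFilterFactory"
            maxShingleSize="5"
            outputUnigrams="false"/>
  </analyzer>
</fieldType>
```

Note that in this order the stop filter runs after the WordDelimiter filter and before lowercasing, so it only ever sees what the WordDelimiter filter emits.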