I am perplexed by the behavior of the Solr analyzer and filters with regard
to underscores.

 

1) I am trying to get rid of them when shingling, but seem unable to do so
with a Stopwords Filter.

 

And yet the WordDelimiter Filter removes them even when I am not trying to
remove them.

 

2) Conversely, I would like to retain '$' symbols when they are adjacent to
numbers, but seem unable to do so without also having to accept every other
kind of punctuation.

 

My simple example configuration, test data, and results are below.

 

Most grateful for any guidance,

 

Christopher

 

 

Test Data:

 

<doc>

<field name="id">StopWordTestData</field>
<field name="conSubSec-text_dc">PreShingled ThisIsNotAStopWord
ThisIsAStopWord ThisIsAlsoAStopWord beforeaperiod. beforeacomma,
beforeacollan: under_Score don't Peter's s $1.00 $1 $1,000 $200 $3,000,000
$3m - # -#- --#-- Yes X No _ __ ___ a and also about</field>

</doc>

 


 


Field 1 - Delimited_text:


Index Analyzer: org.apache.solr.analysis.TokenizerChain

Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory

Filters: 

1. org.apache.solr.analysis.WordDelimiterFilterFactory
   args:{splitOnCaseChange: 1 generateNumberParts: 0 catenateWords: 1
   generateWordParts: 0 catenateAll: 1 catenateNumbers: 1 }

2. org.apache.solr.analysis.LowerCaseFilterFactory args:{}
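
For reference, the Field 1 chain listed above would correspond roughly to the schema.xml fragment below. This is only a sketch: the fieldType name "delimited_text" and the solr.TextField class are assumptions, since the analysis listing does not show them. The filter arguments are copied from the listing as-is.

```xml
<!-- Sketch only: the fieldType name and class are assumed, not taken from the real schema -->
<fieldType name="delimited_text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Arguments as shown in the Field 1 filter listing -->
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1"
            generateNumberParts="0"
            catenateWords="1"
            generateWordParts="0"
            catenateAll="1"
            catenateNumbers="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```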


 


Field 1 - Resulting Index Terms:


 



Term                   #
100                    2
1000                   2
200                    2
3                      2
3000000                2
3m                     2
a                      2
about                  2
also                   2
and                    2
beforeacollan          2
beforeacomma           2
beforeaperiod          2
dont                   2
m                      2
no                     2
peter                  2
preshingled            2
s                      2
thisisalsoastopword    2
thisisastopword        2
thisisnotastopword     2
underscore             2
x                      2
yes                    2
1                      2


Field 2 - Shingled_Text:


Index Analyzer: org.apache.solr.analysis.TokenizerChain 

Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory

Filters:

1. org.apache.solr.analysis.WordDelimiterFilterFactory
   args:{splitOnCaseChange: 1 generateNumberParts: 0 catenateWords: 1
   stemEnglishPossessive: 0 generateWordParts: 0 catenateAll: 0
   catenateNumbers: 1 }

2. org.apache.solr.analysis.StopFilterFactory args:{words:
   StopWords-PreShingled.txt ignoreCase: true enablePositionIncrements: true }

3. org.apache.solr.analysis.LowerCaseFilterFactory args:{}

4. org.apache.solr.analysis.ShingleFilterFactory
   args:{outputUnigrams: false maxShingleSize: 5 }
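
Written out as a schema.xml fragment, the Field 2 chain would look roughly like the sketch below. The fieldType name "shingled_text" and the solr.TextField class are assumptions, as the analysis listing does not show them. The filter order and arguments are taken directly from the listing above.

```xml
<!-- Sketch only: the fieldType name and class are assumed, not taken from the real schema -->
<fieldType name="shingled_text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- 1. Word delimiter, arguments as listed for Field 2 -->
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1"
            generateNumberParts="0"
            catenateWords="1"
            stemEnglishPossessive="0"
            generateWordParts="0"
            catenateAll="0"
            catenateNumbers="1"/>
    <!-- 2. Stop words are removed before lowercasing and shingling -->
    <filter class="solr.StopFilterFactory"
            words="StopWords-PreShingled.txt"
            ignoreCase="true"
            enablePositionIncrements="true"/>
    <!-- 3. Lowercase -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- 4. Shingles of up to 5 tokens, no unigrams -->
    <filter class="solr.ShingleFilterFactory"
            outputUnigrams="false"
            maxShingleSize="5"/>
  </analyzer>
</fieldType>
```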


 


File: StopWords-PreShingled.txt


s
_
PreShingled
__
ThisIsAStopWord
ThisIsAlsoAStopWord


 


Field 2 - Resulting Index Terms (Sample):


 



Term                                          #
_ 100                                         1
_ 100 1 1000                                  1
_ _                                           1
_ _ beforeaperiod beforeacomma                1
_ beforeaperiod                               1
_ beforeaperiod beforeacomma beforeacollan    1
_ thisisnotastopword                          1
_ thisisnotastopword _ _                      1




 

 

 
