Problem with DictionaryCompoundWordTokenFilterFactory in Solr 3.2

Bernhard Schulz Fri, 10 Jun 2011 18:48:46 -0700

Hello everybody!


I am facing a problem with Solr's DictionaryCompoundWordTokenFilterFactory and 
hope you have some advice for me.
I am using the latest version, Solr 3.2. (I had the same problem with Solr 3.1.)

In the schema, I am using the following settings:

<filter
class="solr.DictionaryCompoundWordTokenFilterFactory"
dictionary="words.german.txt"
minWordSize="5"
minSubwordSize="3"
maxSubwordSize="15"
onlyLongestMatch="true"
/>

When I analyze the word "lederschuh" ("leather shoe" in German) using the 
analyzer interface, I get the following sub-words:
1.) lederschuh
2.) lederschuh
3.) der
4.) er
5.) schuh

Problem 1: I set "minSubwordSize" to 3. Why does entry 4 ("er") appear, even 
though it is shorter than 3 characters?
Problem 2: I set "onlyLongestMatch" to true. There is a "lederschuh" entry in 
my dictionary, so the longest match would be "lederschuh" itself, and I do not 
expect it to be split up any further. Why is Solr still splitting it up? Is 
this a bug, or did I misconfigure something?
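
To make the expected behaviour concrete, here is a rough Python sketch of a 
brute-force dictionary decompounder. It is only an illustration and not 
Solr's actual code: the function name "decompose", the sample dictionary, and 
the choice to apply "only longest match" per start offset (rather than across 
the whole token) are all assumptions.

```python
def decompose(token, dictionary, min_word_size=5, min_subword_size=3,
              max_subword_size=15, only_longest_match=True):
    """Return dictionary sub-words found inside token (simplified sketch)."""
    token = token.lower()
    if len(token) < min_word_size:
        return []  # token too short to be treated as a compound
    found = []
    # Try every start offset that still leaves room for a minimal sub-word.
    for start in range(len(token) - min_subword_size + 1):
        longest = None
        for size in range(min_subword_size, max_subword_size + 1):
            if start + size > len(token):
                break
            candidate = token[start:start + size]
            if candidate in dictionary:
                if only_longest_match:
                    # Sizes increase, so the last hit is the longest one
                    # for this start offset.
                    longest = candidate
                else:
                    found.append(candidate)
        if longest is not None:
            found.append(longest)
    return found


# Hypothetical dictionary; the real words.german.txt is not shown here.
words = {"lederschuh", "leder", "schuh", "der", "er"}
result = decompose("lederschuh", words)
```

With these assumptions, "er" can never be emitted because candidates shorter 
than min_subword_size are not examined at all. "der" and "schuh" would still 
appear, however, if the longest match is only chosen among candidates that 
share the same start offset.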

Any advice is very welcome!

Thank you,
Bernhard
