Hello everybody!
I am facing a problem with Solr's DictionaryCompoundWordTokenFilterFactory and hope you have some advice for me. I am using the latest version, Solr 3.2 (I had the same problem with Solr 3.1).

In the schema, I am using the following settings:

<filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="words.german.txt" minWordSize="5" minSubwordSize="3" maxSubwordSize="15" onlyLongestMatch="true" />

Now, when I analyze the word "lederschuh" ("leather shoe" in German) using the analyzer interface, I get the following sub-words:

1.) lederschuh
2.) lederschuh
3.) der
4.) er
5.) schuh

Problem 1: I set "minSubwordSize" to 3. Why does entry 4 ("er") appear, even though it is shorter than 3 characters?

Problem 2: I set "onlyLongestMatch" to true. There is a "lederschuh" entry in my dictionary, so the longest match would be "lederschuh" itself, and I would not expect it to be split up any further. Why is Solr still splitting it?

Is this a bug, or did I misconfigure something?

Any advice is very welcome!

Thank you,
Bernhard