Sounds like the analysis side needs some work. I would start with the
Solr Admin Analysis tool and experiment with your settings for that
TokenFilter. It sounds to me like you might want a different approach
to compound words. I'm not a German expert, so I can't offer too much
there, but a few thoughts come to mind: use phrases or ngrams, or, if
it is just that one word, put it in a protected words list.
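
One possible "different approach", sketched here as an assumption
rather than a tested fix: apply the compound filter only at index
time, so a query for the full compound is not itself split into
subwords. The field type name "text_de_split" is hypothetical, and the
synonym, stop word and duplicate-removal filters from your schema are
left out only to keep the sketch short.

```xml
<!-- Sketch only: "text_de_split" is a hypothetical field type name. -->
<fieldType name="text_de_split" class="solr.TextField"
           positionIncrementGap="100">
  <!-- Index time: split compounds so the subwords are searchable. -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="dictionary.txt"
            minWordSize="5"
            minSubwordSize="2"
            maxSubwordSize="15"
            onlyLongestMatch="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
  <!-- Query time: no compound filter, so the query stays one term. -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>
```

With a setup along these lines, a query for "Suppe" should still match
documents containing "Spargelcremesuppe", while a query for
"Spargelcremesuppe" should no longer match every document that merely
contains "Spargel", "Creme" or "Suppe". The trade-off is that it would
also stop matching documents where the three words appear separately.
The Analysis tool is a good way to check both sides before reindexing.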
-Grant
On Feb 6, 2009, at 5:23 AM, Kraus, Ralf | pixelhouse GmbH wrote:
Hi,
Now I have run into another problem using the
solr.DictionaryCompoundWordTokenFilterFactory :-(
If I search for the German word "Spargelcremesuppe", which contains
"Spargel", "Creme" and "Suppe", Solr finds way too many results.
That's because Solr finds EVERY entry with any one of the three
words in it :-(
Here is my schema.xml
<fieldType name="text_text" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="dictionary.txt"
            minWordSize="5"
            minSubwordSize="2"
            maxSubwordSize="15"
            onlyLongestMatch="true"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            language="German"/>
  </analyzer>
</fieldType>
Any help?
Greets,
Ralf Kraus
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika) using Solr/
Lucene:
http://www.lucidimagination.com/search