Sounds like the analysis part needs some work. I would start with the Solr Admin Analysis tool and play around with your settings for that TokenFilter to see which tokens it actually emits. It sounds to me like you might want a different approach to compound words. I'm not a German expert, so I can't offer too much there, but one thought that comes to mind is using phrases or ngrams, or, if it is just that one word, putting it in a protected words list.
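
One way to try a different approach, as an untested sketch only: split the field type into separate index-time and query-time analyzers and decompound only at index time. The assumption here is that the extra hits come from the query "Spargelcremesuppe" itself being broken into "Spargel", "Creme" and "Suppe". The filters and file names are copied unchanged from the schema.xml quoted below; only the `type="index"` / `type="query"` split is new.

```xml
<!-- Sketch only: a variant of the quoted text_text field type, not a tested config. -->
<fieldType name="text_text" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: keep the compound and also emit its dictionary subwords. -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="dictionary.txt"
            minWordSize="5"
            minSubwordSize="2"
            maxSubwordSize="15"
            onlyLongestMatch="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
  <!-- Query time: same chain without the compound filter (assumption: the
       query term should stay whole so it is not expanded into its parts). -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>
```

With a setup like this, a search for "Suppe" should still be able to match documents containing "Spargelcremesuppe", while a search for the full compound would no longer be widened to every document containing one of its parts. The Analysis tool can show whether the index and query token streams line up as expected; a reindex is needed after changing the index-time chain.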

-Grant

On Feb 6, 2009, at 5:23 AM, Kraus, Ralf | pixelhouse GmbH wrote:

Hi,

Now I ran into another problem using the solr.DictionaryCompoundWordTokenFilterFactory :-( If I search for the German word "Spargelcremesuppe", which contains "Spargel", "Creme" and "Suppe", Solr finds way too many results. That's because Solr finds EVERY entry that contains any one of the three words :-(

Here is my schema.xml

<fieldType name="text_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="dictionary.txt"
            minWordSize="5"
            minSubwordSize="2"
            maxSubwordSize="15"
            onlyLongestMatch="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>

Any help?

Greets,

Ralf Kraus

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika) using Solr/ Lucene:
http://www.lucidimagination.com/search
