Punctuation marks in documents prevent recognition of synonyms at indexing?

G.S.J. Lobbestael Sat, 26 Sep 2009 10:31:19 -0700

Hi,

The wiki uses the example:


    <fieldtype name="syn" class="solr.TextField">
      <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.SynonymFilterFactory" synonyms="syn.txt"
                  ignoreCase="true" expand="false"/>
      </analyzer>
    </fieldtype>

With "dog, canine" in syn.txt and a document containing "I have a dog, Bob.",
"dog" is not recognized as a synonym, because the whitespace tokenizer leaves
the comma attached to the token ("dog,"). With a document "I have a dog Bob"
it is.

We could replace the WhitespaceTokenizerFactory with a PatternTokenizerFactory 
(in this case with a pattern="\s,"), but this may cause trouble further down 
the line, e.g. with the WordDelimiterFilterFactory if "-" is part of the 
pattern (suppose we have a document with "MRI-scan" and a synonym for "MRI").
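
A minimal sketch of the PatternTokenizerFactory variant, for illustration only.
The pattern [\s,]+ is an assumption (it splits on runs of whitespace and
commas and leaves hyphens alone, so "MRI-scan" stays one token for the later
filters), and the field type name syn_pattern is hypothetical:

    <fieldtype name="syn_pattern" class="solr.TextField">
      <analyzer>
          <!-- assumed pattern: split on whitespace and commas, not on hyphens -->
          <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s,]+"/>
          <filter class="solr.SynonymFilterFactory" synonyms="syn.txt"
                  ignoreCase="true" expand="false"/>
      </analyzer>
    </fieldtype>

With this pattern "dog, Bob." would be split into "dog" and "Bob."; other
punctuation, such as the final period, would still stay attached to its token.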

Or we could try to change the order of the filters (SynonymFilterFactory, 
StopFilterFactory, WordDelimiterFilterFactory, LowerCaseFilterFactory, 
SnowballPorterFilterFactory, RemoveDuplicatesTokenFilterFactory). The analysis 
tool shows that the comma is only removed at the WordDelimiterFilter stage.
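
One possible reordering, sketched as an assumption rather than a
recommendation: run the WordDelimiterFilterFactory before the
SynonymFilterFactory, so the comma is already gone when synonyms are looked
up. The attribute values, stopwords.txt and the field type name syn_reordered
are illustrative:

    <fieldtype name="syn_reordered" class="solr.TextField">
      <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <!-- moved to the front: strips the trailing comma and splits MRI-scan -->
          <filter class="solr.WordDelimiterFilterFactory"
                  generateWordParts="1" generateNumberParts="1"/>
          <filter class="solr.SynonymFilterFactory" synonyms="syn.txt"
                  ignoreCase="true" expand="false"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true"
                  words="stopwords.txt"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.SnowballPorterFilterFactory" language="English"/>
          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldtype>

A likely trade-off of this order is that entries in syn.txt containing
characters the WordDelimiterFilter splits on (a hyphenated term, for example)
may no longer match as written.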

What's the best course of action?

Geert Lobbestael
