Re: synonyms.txt file updated frequently

Grant Ingersoll Wed, 31 Dec 2008 06:08:01 -0800


On Dec 30, 2008, at 4:38 PM, Smiley, David W. wrote:

Grant, the Solr wiki recommends doing expansion at index time and gives reasons:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46

I personally think "recommends" is too strong a word, but the points are valid reasons to do index-time synonyms. In Alexander's case, I think index-time is a bit more problematic, since he is frequently updating the synonym list, meaning he would have to reindex every time; otherwise his stats are going to be even more skewed.

As for multi-word expansions, the query parser can be fixed or an alternate one used.

Query-time doesn't work for multi-word expansion. For everyone's convenience, I'll quote the remainder of the problems:
Even when you aren't worried about multi-word synonyms, idf differences still make index time synonyms a good idea. Consider the following scenario:
   *  An index with a "text" field, which at query time uses the SynonymFilter with the synonym TV, Television and expand="true"
   *  Many thousands of documents containing the term "text:TV"
   *  A few hundred documents containing the term "text:Television"
A query for text:TV will expand into (text:TV text:Television) and the lower docFreq for text:Television will give the documents that match "Television" a much higher score than docs that match "TV" comparably -- which may be somewhat counterintuitive to the client. Index time expansion (or reduction) will result in the same idf for all documents regardless of which term the original text contained.
~ David Smiley
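
To put rough numbers on the idf scenario quoted above, here is a small sketch. It assumes the classic Lucene idf formula, 1 + ln(numDocs / (docFreq + 1)), and the index size and document frequencies are made-up values chosen only to match "many thousands" and "a few hundred". It also assumes no document originally contained both terms.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Classic Lucene-style idf: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Illustrative numbers only (assumptions, not taken from the thread).
NUM_DOCS = 100_000       # total documents in the index
DF_TV = 20_000           # "many thousands" of docs containing text:TV
DF_TELEVISION = 300      # "a few hundred" docs containing text:Television

# Query-time expansion: each term keeps its own docFreq, so the rarer
# term gets a much larger idf and dominates the score.
idf_tv = classic_idf(DF_TV, NUM_DOCS)
idf_television = classic_idf(DF_TELEVISION, NUM_DOCS)

# Index-time expansion: every matching doc is indexed with both terms,
# so both terms end up with the same docFreq and therefore the same idf.
DF_EXPANDED = DF_TV + DF_TELEVISION
idf_expanded = classic_idf(DF_EXPANDED, NUM_DOCS)

print(f"query-time idf(TV)         = {idf_tv:.3f}")
print(f"query-time idf(Television) = {idf_television:.3f}")
print(f"index-time idf(either)     = {idf_expanded:.3f}")
```

With query-time expansion the "Television" clause carries a much larger idf than the "TV" clause; with index-time expansion both terms share one value.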

On 12/30/08 4:33 PM, "Grant Ingersoll" <gsing...@apache.org> wrote:



On Dec 30, 2008, at 11:05 AM, Alexander Ramos Jardim wrote:
Hey Grant,

Thanks for the info!

2008/12/30 Grant Ingersoll <gsing...@apache.org>
I'd probably write a new TokenFilter that was aware of the reload policy (in a generic way) such that I didn't have to go through a whole core reload every time.
Are you just using them during query time or also during indexing?
I am using it at indexing time.
I think that is a bit more problematic.  How do you deal with new
documents having the new synonyms while old docs don't?

Any particular reason you use syns at indexing and not search?  Not
saying there aren't reasons to do it, just query side usually works
better for this very reason.
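
The reload-aware filter idea from earlier in the thread could look roughly like the sketch below. This is not Lucene or Solr code: the class name ReloadingSynonyms and its methods are hypothetical, it handles only comma-separated single-word groups (not "=>" mappings), and it simply re-reads the file whenever its modification time changes.

```python
import os


class ReloadingSynonyms:
    """Hypothetical synonym table that re-reads its file when it changes."""

    def __init__(self, path):
        self.path = path
        self._mtime = None   # modification time of the last load
        self._table = {}     # term -> set of equivalent terms

    @staticmethod
    def _parse(text):
        table = {}
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            group = [t.strip().lower() for t in line.split(",") if t.strip()]
            for term in group:
                table.setdefault(term, set()).update(group)
        return table

    def _maybe_reload(self):
        # Reload policy (an assumption): compare the file's mtime on each call.
        mtime = os.stat(self.path).st_mtime_ns
        if mtime != self._mtime:
            with open(self.path, encoding="utf-8") as fh:
                self._table = self._parse(fh.read())
            self._mtime = mtime

    def expand(self, token):
        """Return the token plus its synonyms, sorted, using the current file."""
        self._maybe_reload()
        key = token.lower()
        return sorted(self._table.get(key, {key}))
```

Used at query time, something like this picks up edits to synonyms.txt without a core reload. Used at index time, it still leaves the problem raised above: documents indexed before the edit do not carry the new synonyms.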


--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
