On Dec 30, 2008, at 4:38 PM, Smiley, David W. wrote:
Grant, the Solr wiki recommends doing expansion at index time and
gives reasons:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46
I personally think "recommends" is too strong a word, but the
points are valid reasons to do index-time synonyms. In Alexander's
case, I think index-time is a bit more problematic, since he is
frequently updating the synonym list, meaning he would have to reindex
every time; otherwise his stats are going to be even more skewed.
As for multi-word expansions, the query parser can be fixed or an
alternate one used.
Query-time doesn't work for multi-word expansion. For everyone's
convenience, I'll quote the remainder of the problems:
Even when you aren't worried about multi-word synonyms, idf
differences still make index time synonyms a good idea. Consider the
following scenario:
* An index with a "text" field, which at query time uses the
SynonymFilter with the synonym TV, Television and expand="true"
* Many thousands of documents containing the term "text:TV"
* A few hundred documents containing the term "text:Television"
A query for text:TV will expand into (text:TV text:Television) and
the lower docFreq for text:Television will give the documents that
match "Television" a much higher score than docs that match "TV"
comparably -- which may be somewhat counterintuitive to the client.
Index time expansion (or reduction) will result in the same idf for
all documents regardless of which term the original text contained.
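
To put rough numbers on the scenario above, here is a minimal sketch. It assumes the classic Lucene idf formula, 1 + ln(numDocs / (docFreq + 1)); the index size and document frequencies are made-up values chosen only to match "many thousands" and "a few hundred", and it assumes no document contains both terms.

```python
import math

def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Classic Lucene-style idf: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical figures, not taken from any real index.
NUM_DOCS = 100_000
DF_TV = 20_000          # "many thousands" of docs contain TV
DF_TELEVISION = 300     # "a few hundred" contain Television

# Query-time expansion: each term keeps its own docFreq, so the rare
# term gets a much larger idf than the common one.
idf_tv = classic_idf(DF_TV, NUM_DOCS)
idf_television = classic_idf(DF_TELEVISION, NUM_DOCS)

# Index-time expansion: every matching doc is indexed under both terms,
# so both terms share one docFreq (the union, assuming no overlap).
idf_merged = classic_idf(DF_TV + DF_TELEVISION, NUM_DOCS)

print(f"query-time  idf(TV)={idf_tv:.3f}  idf(Television)={idf_television:.3f}")
print(f"index-time  idf(TV)=idf(Television)={idf_merged:.3f}")
```

With query-time expansion the rarer spelling dominates the score; with index-time expansion both spellings are weighted identically.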
~ David Smiley
On 12/30/08 4:33 PM, "Grant Ingersoll" <gsing...@apache.org> wrote:
On Dec 30, 2008, at 11:05 AM, Alexander Ramos Jardim wrote:
Hey Grant,
Thanks for the info!
2008/12/30 Grant Ingersoll <gsing...@apache.org>
I'd probably write a new TokenFilter that was aware of the reload
policy (in a generic way) such that I didn't have to go through a whole
core reload every time. Are you just using them during query time or
also during indexing?
I am using it at indexing time.
I think that is a bit more problematic. How do you deal with new
documents having the new synonyms while old docs don't?
Any particular reason you use syns at indexing and not search? Not
saying there aren't reasons to do it, just query side usually works
better for this very reason.
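
The stale-document problem can be shown with a toy inverted index. This is only an illustrative sketch in plain Python, not Solr or Lucene code; the helper names (expand, index_docs, search) and the sample documents are hypothetical.

```python
from collections import defaultdict

def expand(token, synonyms):
    """Return the token plus any synonyms listed for it."""
    return {token} | synonyms.get(token, set())

def index_docs(docs, synonyms):
    """Build term -> doc ids, expanding synonyms at index time."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            for term in expand(token, synonyms):
                postings[term].add(doc_id)
    return postings

def search(postings, term, synonyms=None):
    """Look up a term, optionally expanding synonyms at query time."""
    hits = set()
    for t in expand(term, synonyms or {}):
        hits |= postings.get(t, set())
    return hits

old_synonyms = {}
new_synonyms = {"tv": {"television"}, "television": {"tv"}}

# Index-time expansion: doc 1 was indexed before the synonym existed,
# doc 2 afterwards, and doc 1 was never reindexed.
index_time = index_docs({1: "tv stand"}, old_synonyms)
for term, ids in index_docs({2: "television stand"}, new_synonyms).items():
    index_time[term] |= ids
stale_hits = search(index_time, "television")

# Query-time expansion: raw tokens in the index, synonyms applied on search.
raw = index_docs({1: "tv stand", 2: "television stand"}, {})
query_hits = search(raw, "television", new_synonyms)

print("index-time, no reindex:", sorted(stale_hits))
print("query-time:", sorted(query_hits))
```

Without a reindex, the index-time variant misses the older document, while the query-time variant picks up the new synonym immediately.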
--------------------------
Grant Ingersoll
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ