On Mon, 2010-10-25 at 11:50 +0200, Andrzej Bialecki wrote:
> * there is an exact solution to this problem, namely to make two
> distributed calls instead of one (first call to collect per-shard IDFs
> for given query terms, second call to submit a query rewritten with the
> global IDF-s). This solution is implemented in SOLR-1632, with some
> caching to reduce the cost for common queries.

I must admit that I have not tried the patch myself. Looking at
https://issues.apache.org/jira/browse/SOLR-1632
I see that the last comment is from LiLi with a failed patch, but as
there are no further comments it is unclear if the problem is general or
just with LiLi's setup. I might be a bit harsh here, but the other
comments for the JIRA issue also indicate that one would have to be
somewhat adventurous to run this in production. 
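
For reference, the two-call scheme described in the quoted text can be sketched roughly as below. This is only a toy illustration and not the SOLR-1632 code: the shard contents, the function names and the simple tf * idf score are all made up for the example, and the IDF formula is assumed to be the classic Lucene-style 1 + ln(N / (df + 1)).

```python
import math
from collections import Counter

# Hypothetical toy shards: each shard is a list of documents (lists of terms).
SHARDS = [
    [["solr", "shard"], ["solr", "idf"], ["lucene"]],
    [["idf", "idf", "ranking"], ["solr"]],
]

def shard_stats(shard, terms):
    """Call 1, per shard: report doc count and doc frequency per query term."""
    wanted = set(terms)
    df = Counter()
    for doc in shard:
        for term in set(doc) & wanted:
            df[term] += 1
    return len(shard), df

def global_idf(shards, terms):
    """Merge the per-shard statistics into one IDF value per query term."""
    total_docs, total_df = 0, Counter()
    for shard in shards:
        n, df = shard_stats(shard, terms)
        total_docs += n
        total_df += df
    # Assumed formula: idf = 1 + ln(N / (df + 1))
    return {t: 1.0 + math.log(total_docs / (total_df[t] + 1)) for t in terms}

def search(shards, terms):
    """Call 2: every shard scores its documents with the shared global IDF."""
    idf = global_idf(shards, terms)
    hits = []
    for shard_id, shard in enumerate(shards):
        for doc_id, doc in enumerate(shard):
            tf = Counter(doc)
            score = sum(tf[t] * idf[t] for t in terms)
            if score > 0:
                hits.append((score, shard_id, doc_id))
    return sorted(hits, reverse=True)
```

The point of the sketch is only that scores from different shards become comparable because they share one IDF table. The price is the extra round trip, which is what the caching mentioned above is meant to soften.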

> * another reason is that in many many cases the difference between using
> exact global IDF and per-shard IDFs is not that significant. If shards
> are more or less homogenous (e.g. you assign documents to shards by
> hash(docId)) then term distributions will be also similar.

While I agree that hash-based assignment is a valid solution, it does put
some serious constraints on the shard setup.
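
To make the constraint concrete, here is a small sketch comparing per-shard IDF for a single term under hash-based assignment and under logical grouping. Everything in it is hypothetical: the 1000-document corpus, the "law"/"news" topics, the rule for which documents contain the term, and the choice of MD5 as the hash. The IDF formula is again assumed to be 1 + ln(N / (df + 1)).

```python
import hashlib
import math

def idf(num_docs, doc_freq):
    # Assumed formula: idf = 1 + ln(N / (df + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical corpus: 200 "law" documents followed by 800 "news" documents.
DOCS = [(f"doc{i}", "law" if i < 200 else "news") for i in range(1000)]

def has_term(doc_id, topic):
    # Made-up rule: the term occurs in every law doc and in every 20th doc.
    return topic == "law" or int(doc_id[3:]) % 20 == 0

def per_shard_idf(assign, num_shards=2):
    """Distribute DOCS with assign(doc_id, topic) and return one IDF per shard."""
    shards = [[] for _ in range(num_shards)]
    for doc_id, topic in DOCS:
        shards[assign(doc_id, topic)].append((doc_id, topic))
    return [idf(len(s), sum(has_term(*d) for d in s)) for s in shards]

def stable_hash(doc_id):
    # MD5 is used because Python's built-in hash() of str varies between runs.
    return int(hashlib.md5(doc_id.encode()).hexdigest(), 16)

by_hash = per_shard_idf(lambda doc_id, topic: stable_hash(doc_id) % 2)
by_topic = per_shard_idf(lambda doc_id, topic: 0 if topic == "law" else 1)
merged = idf(len(DOCS), sum(has_term(*d) for d in DOCS))
```

With these made-up numbers, by_topic comes out at roughly 1.0 for the law shard and 4.0 for the news shard, against roughly 2.4 for the merged index, while the two by_hash values should land close to each other. Grouping by topic is exactly the kind of setup that hashing rules out.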

> To summarize, I would qualify your statement with: "...if the
> composition of your shards is drastically different". Otherwise the cost
> of using global IDF is not worth it, IMHO.

Do you know of any studies of the differences in ranking between index
distribution by hashing, logical grouping and distributed IDF?

Regards,
Toke Eskildsen
