On Mon, 2010-10-25 at 11:50 +0200, Andrzej Bialecki wrote:
> * there is an exact solution to this problem, namely to make two
> distributed calls instead of one (first call to collect per-shard IDFs
> for given query terms, second call to submit a query rewritten with the
> global IDF-s). This solution is implemented in SOLR-1632, with some
> caching to reduce the cost for common queries.

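For readers who want the two-call scheme spelled out, here is a minimal sketch in Python. It is not the SOLR-1632 code, and every name in it (shard_stats, global_idf, shard_search) is hypothetical. Shards are modelled as in-memory lists of token lists, the IDF formula 1 + ln(N / (df + 1)) is an assumption borrowed from Lucene's classic similarity, and the tf * idf score is deliberately simplified.

```python
import math
from collections import Counter


def shard_stats(shard_docs, query_terms):
    """Call 1, run on each shard: return (doc count, per-term document frequency)."""
    df = Counter()
    for doc in shard_docs:
        present = set(doc)
        for term in query_terms:
            if term in present:
                df[term] += 1
    return len(shard_docs), df


def global_idf(per_shard_stats, query_terms):
    """Coordinator: merge the per-shard statistics into one global IDF per term.

    Assumes at least one document exists across all shards.
    """
    total_docs = sum(n for n, _ in per_shard_stats)
    total_df = Counter()
    for _, df in per_shard_stats:
        total_df.update(df)  # Counter.update adds counts
    return {t: 1.0 + math.log(total_docs / (total_df[t] + 1)) for t in query_terms}


def shard_search(shard_docs, query_terms, idf):
    """Call 2, run on each shard: score with the supplied IDF, not the local one."""
    hits = []
    for doc_id, doc in enumerate(shard_docs):
        tf = Counter(doc)
        score = sum(tf[t] * idf[t] for t in query_terms)
        if score > 0:
            hits.append((score, doc_id))
    return sorted(hits, reverse=True)


# Two deliberately skewed shards: "solr" occurs only in the first one.
shard_a = [["solr", "idf"], ["solr", "shard"], ["solr"]]
shard_b = [["lucene"], ["lucene", "idf"], ["lucene"], ["lucene"]]
query = ["solr"]

stats = [shard_stats(s, query) for s in (shard_a, shard_b)]  # first round trip
idf = global_idf(stats, query)
results = [shard_search(s, query, idf) for s in (shard_a, shard_b)]  # second round trip
```

With shards this skewed, the IDF that shard_a would compute on its own differs noticeably from the merged value, which is the case where the extra round trip matters. The caching mentioned in the quoted text is left out of the sketch.
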
I must admit that I have not tried the patch myself. Looking at
https://issues.apache.org/jira/browse/SOLR-1632 I see that the last
comment is from LiLi with a failed patch, but as there are no further
comments it is unclear whether the problem is general or specific to
LiLi's setup. I might be a bit harsh here, but the other comments on the
JIRA issue also indicate that one would have to be somewhat adventurous
to run this in production.

> * another reason is that in many many cases the difference between using
> exact global IDF and per-shard IDFs is not that significant. If shards
> are more or less homogenous (e.g. you assign documents to shards by
> hash(docId)) then term distributions will be also similar.

While I agree on the validity of the solution, it does put some serious
constraints on the shard setup.

> To summarize, I would qualify your statement with: "...if the
> composition of your shards is drastically different". Otherwise the cost
> of using global IDF is not worth it, IMHO.

Do you know of any studies that compare ranking across document
distribution by hashing, distribution by logical grouping, and
distributed IDF?

Regards,
Toke Eskildsen