On 2010-10-25 13:37, Toke Eskildsen wrote:
> On Mon, 2010-10-25 at 11:50 +0200, Andrzej Bialecki wrote:
>> * there is an exact solution to this problem, namely to make two
>> distributed calls instead of one (first call to collect per-shard IDFs
>> for given query terms, second call to submit a query rewritten with the
>> global IDF-s). This solution is implemented in SOLR-1632, with some
>> caching to reduce the cost for common queries.
> 
> I must admit that I have not tried the patch myself. Looking at
> https://issues.apache.org/jira/browse/SOLR-1632
> i see that the last comment is from LiLi with a failed patch, but as
> there are no further comments it is unclear if the problem is general or
> just with LiLi's setup. I might be a bit harsh here, but the other
> comments for the JIRA issue also indicate that one would have to be
> somewhat adventurous to run this in production. 

Oh, definitely this is not production quality yet - there are known
bugs, for example, that I need to fix, and then it needs to be
forward-ported to trunk. It shouldn't be too much work to bring it back
into usable state.

>> * another reason is that in many many cases the difference between using
>> exact global IDF and per-shard IDFs is not that significant. If shards
>> are more or less homogenous (e.g. you assign documents to shards by
>> hash(docId)) then term distributions will be also similar.
> 
> While I agree on the validity of the solution, it does put some serious
> constraints on the shard-setup.

True. But this is the simplest setup that just may be enough.

> 
>> To summarize, I would qualify your statement with: "...if the
>> composition of your shards is drastically different". Otherwise the cost
>> of using global IDF is not worth it, IMHO.
> 
> Do you know of any studies of the differences in ranking with regard to
> indexing-distribution by hashing, logical grouping and distributed IDF?

Unfortunately, this information is surprisingly scarce - research
predating year 2000 is often not applicable, and most current research
concentrates on P2P systems, which are really a different ball of wax.
Here's a few papers that I found that are related to this issue:

* Global Term Weights in Distributed Environments, H. Witschel, 2007
(Elsevier)

* KLEE: A Framework for Distributed Top-k Query Algorithms, S. Michel,
P. Triantafillou, G. Weikum, VLDB'05 (ACM)

* Exploring the Stability of IDF Term Weighting, Xin Fu and  Miao Chen,
2008 (Springer Verlag)

* A Comparison of Techniques for Estimating IDF Values to Generate
Lexical Signatures for the Web, M. Klein, M. Nelson, WIDM'08 (ACM)

* Comparison of dierent Collection Fusion Models in Distributed
Information Retrieval, Alexander Steidinger - this paper gives a nice
comparison framework for different strategies for joining partial
results; apparently we use the most primitive strategy explained there,
based on raw scores...

These papers likely don't fully answer your question, but at least they
provide a broader picture of the issue...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Reply via email to