On 2010-10-25 13:37, Toke Eskildsen wrote:
> On Mon, 2010-10-25 at 11:50 +0200, Andrzej Bialecki wrote:
>> * there is an exact solution to this problem, namely to make two
>> distributed calls instead of one (first call to collect per-shard IDFs
>> for given query terms, second call to submit a query rewritten with the
>> global IDF-s). This solution is implemented in SOLR-1632, with some
>> caching to reduce the cost for common queries.
>
> I must admit that I have not tried the patch myself. Looking at
> https://issues.apache.org/jira/browse/SOLR-1632
> i see that the last comment is from LiLi with a failed patch, but as
> there are no further comments it is unclear if the problem is general or
> just with LiLi's setup. I might be a bit harsh here, but the other
> comments for the JIRA issue also indicate that one would have to be
> somewhat adventurous to run this in production.

Oh, definitely this is not production quality yet - there are known
bugs, for example, that I need to fix, and then it needs to be
forward-ported to trunk. It shouldn't be too much work to bring it back
into a usable state.

>> * another reason is that in many many cases the difference between using
>> exact global IDF and per-shard IDFs is not that significant. If shards
>> are more or less homogenous (e.g. you assign documents to shards by
>> hash(docId)) then term distributions will be also similar.
>
> While I agree on the validity of the solution, it does put some serious
> constraints on the shard-setup.

True. But this is the simplest setup that just may be enough.

>
>> To summarize, I would qualify your statement with: "...if the
>> composition of your shards is drastically different". Otherwise the cost
>> of using global IDF is not worth it, IMHO.
>
> Do you know of any studies of the differences in ranking with regard to
> indexing-distribution by hashing, logical grouping and distributed IDF?

Unfortunately, this information is surprisingly scarce - research
predating the year 2000 is often not applicable, and most current
research concentrates on P2P systems, which are really a different ball
of wax. Here are a few papers that I found that are related to this
issue:

* Global Term Weights in Distributed Environments, H. Witschel, 2007
  (Elsevier)

* KLEE: A Framework for Distributed Top-k Query Algorithms, S. Michel,
  P. Triantafillou, G. Weikum, VLDB'05 (ACM)

* Exploring the Stability of IDF Term Weighting, Xin Fu and Miao Chen,
  2008 (Springer Verlag)

* A Comparison of Techniques for Estimating IDF Values to Generate
  Lexical Signatures for the Web, M. Klein, M. Nelson, WIDM'08 (ACM)

* Comparison of different Collection Fusion Models in Distributed
  Information Retrieval, Alexander Steidinger - this paper gives a nice
  comparison framework for different strategies for joining partial
  results; apparently we use the most primitive strategy explained
  there, based on raw scores...

These papers likely don't fully answer your question, but at least they
provide a broader picture of the issue...

-- 
Best regards,
Andrzej Bialecki     <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
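
P.S. For readers who want to see the two-call idea from the top of this
thread in concrete form, here is a minimal sketch in Python. It is not
the SOLR-1632 code and leaves out its caching. The function names, the
crc32-based stand-in for hash(docId), the toy tf * idf scoring and the
smoothed IDF formula are all assumptions made for illustration. Call 1
collects per-shard document frequencies for the query terms, the
coordinator sums them into global IDF weights, and call 2 scores every
shard with those shared weights before merging on raw scores.

```python
import heapq
import math
import zlib
from collections import Counter


def partition(docs, num_shards):
    """Assign documents to shards by a stable hash of the doc ID (assumed scheme)."""
    shards = [{} for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shard_no = zlib.crc32(doc_id.encode("utf-8")) % num_shards
        shards[shard_no][doc_id] = text
    return shards


def shard_stats(docs, terms):
    """Call 1, per shard: document count and document frequency of each term."""
    df = Counter()
    for text in docs.values():
        tokens = set(text.split())
        for term in terms:
            if term in tokens:
                df[term] += 1
    return len(docs), df


def idf(doc_freq, num_docs):
    """Smoothed IDF; this particular formula is an illustrative assumption."""
    return 1.0 + math.log(num_docs / (doc_freq + 1.0))


def global_idf(per_shard_stats, terms):
    """Coordinator: sum the per-shard statistics into one IDF per term."""
    total_docs = sum(count for count, _ in per_shard_stats)
    return {
        term: idf(sum(df[term] for _, df in per_shard_stats), total_docs)
        for term in terms
    }


def search_shard(docs, terms, weights, k):
    """Call 2, per shard: score with the supplied weights, keep the top k."""
    hits = []
    for doc_id, text in docs.items():
        tf = Counter(text.split())
        score = sum(tf[term] * weights[term] for term in terms)
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)


def distributed_search(shards, terms, k=10):
    """Two distributed calls: gather statistics, then query with global IDF."""
    stats = [shard_stats(docs, terms) for docs in shards]
    weights = global_idf(stats, terms)
    partial = [search_shard(docs, terms, weights, k) for docs in shards]
    # Merge on raw scores; they are comparable because the weights are shared.
    return heapq.nlargest(k, (hit for hits in partial for hit in hits))
```

Because every shard scores with the same weights, the merged ranking
should match what a single unsharded index would return, whatever the
shard composition. Replacing `weights` with per-shard values computed
from each shard's own statistics gives the cheaper one-call baseline
discussed above.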