That is true if you randomly distribute the documents. If they are distributed 
according to topic, there can be some big anomalies.

Also, the DFs (document frequencies) for rare terms will have bigger relative 
errors. There is some statistical theorem about this, but I can't remember it 
right now. Thanks to Zipf, most of your terms are rare. Also, the most 
selective terms are rare.
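
A rough way to see the rare-term effect, as a sketch only: assume each document containing a term lands on one of S shards independently with probability 1/S (a binomial model of random distribution, which is my assumption here, not something Solr computes). The function name and the shard count of 4 are made up for illustration. Under that model the relative spread of a shard's local DF works out to sqrt((S-1)/df), so it shrinks as the global DF grows.

```python
import math


def relative_df_error(global_df: int, num_shards: int) -> float:
    """Relative standard deviation of one shard's local DF.

    Assumes random sharding: each of the `global_df` documents that
    contain the term goes to a given shard with probability 1/num_shards
    (binomial model). This is an illustrative assumption only.
    """
    p = 1.0 / num_shards
    mean = global_df * p                            # expected local DF
    std = math.sqrt(global_df * p * (1.0 - p))      # binomial std deviation
    return std / mean


if __name__ == "__main__":
    # Rare terms (small global DF) show a much larger relative spread
    # than common terms on the same number of shards.
    for df in (10, 1_000, 100_000):
        print(df, relative_df_error(df, 4))
```

This says nothing about the topic-distributed case, where the independence assumption breaks down and the local DFs can be off by far more.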

wunder

On Aug 30, 2012, at 5:25 PM, Lance Norskog wrote:

> The math for "confidence values" in probability theory shows that
> distributed DF does not matter after not very many documents. If you
> have 10s of thousands of documents in each shard, don't worry.
> 
> On Thu, Aug 30, 2012 at 1:19 PM, Steven A Rowe <sar...@syr.edu> wrote:
>> Hi Ke,
>> 
>> Have you seen <https://issues.apache.org/jira/browse/SOLR-1632>?
>> 
>> Steve
>> 
>> -----Original Message-----
>> From: Eric Wu [mailto:eirik...@gmail.com]
>> Sent: Thursday, August 30, 2012 3:05 AM
>> To: solr-user@lucene.apache.org
>> Subject: Solr4 distributed IDF
>> 
>> Hi there,
>> 
>> Is there an issue ticket for the distributed IDF feature in Solr4? Or
>> are there already some patches that I can use? Thank you
>> very much.
>> 
>> --
>> Ke Wu,
>> Best Regards



