Re: Skewed IDF in multi lingual index, again

alessandro.benedetti Tue, 05 Dec 2017 07:29:13 -0800

Thanks Yonik and thanks Doug.

I agree with Doug on adding a few generic test corpora that Jenkins
automatically runs some metrics on, to check that Apache Lucene/Solr changes
don't move results too far from a golden truth.
This of course can be very complex, but I think it is a direction the Apache
Lucene/Solr community should work on.


Given that, I do believe that in this case, moving from maxDocs (field
independent) to docCount (field dependent) was a good move (and this
specific multi-language use case is an example).
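
To make the multi-language case concrete, here is a minimal sketch. It
assumes a BM25-style idf, log(1 + (N - n + 0.5) / (n + 0.5)), and the index
sizes and the field name text_it are made up for illustration, not taken
from anyone's real index. The only thing that changes between the two calls
is what is used as N.

```python
import math


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """BM25-style idf: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical index: 1,000,000 documents in total, of which only
# 10,000 have a value in the Italian field "text_it".
max_docs = 1_000_000      # field independent: every document in the index
doc_count_it = 10_000     # field dependent: documents that have text_it
doc_freq = 5_000          # documents whose text_it contains the term

idf_max_docs = bm25_idf(doc_freq, max_docs)
idf_doc_count = bm25_idf(doc_freq, doc_count_it)

print(f"idf with maxDocs : {idf_max_docs:.3f}")
print(f"idf with docCount: {idf_doc_count:.3f}")
```

With these numbers the term occurs in half of the Italian documents, yet
with maxDocs as N it gets a much higher idf, because the 990,000 documents
that have no text_it value at all are counted as documents that lack the
term.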

Actually I also believe that theoretically docCount (field dependent) is
still better than maxDocs (field independent).
This is because docCount (field dependent) represents a state in time
associated with the current index, while maxDocs represents a historical
consideration.
A corpus of documents can change over time, and how rare a term is can
change drastically (take a highly dynamic domain such as news).

Doug, were you able to generalise anything from what happened to your
customers, and from why they got regressions when moving from maxDocs to
docCount (field dependent)?




-----
---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
