It is challenging, as performance will be very dependent on the use case and
domain (there's no one globally perfect
relevance solution). But a good set of metrics to see *generally* how stock
Solr performs across a reasonable set of verticals would be nice.
My philosoph
Thanks Yonik and thanks Doug.
I agree with Doug on adding a few generic test corpora that Jenkins
automatically runs some metrics on, to verify that Apache Lucene/Solr changes
don't affect a golden truth too much.
This of course can be very complex, but I think it is a direction the Apache
Lucene/Solr comm
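
A minimal sketch of the kind of check described above, assuming a small judged corpus and nDCG as the metric. The document ids, the judgments, and the 0.02 tolerance are all hypothetical, and this is not an existing Lucene/Solr or Jenkins job:

```python
import math

def dcg(gains):
    """Discounted cumulative gain for a list of graded relevance values."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg_at_k(ranked_doc_ids, judgments, k=10):
    """nDCG@k of one ranked result list against graded judgments.

    Documents without a judgment are treated as not relevant (gain 0).
    """
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

def regressed(baseline_score, candidate_score, tolerance=0.02):
    """True if the candidate drops below the baseline by more than tolerance."""
    return candidate_score < baseline_score - tolerance

# Hypothetical judgments for one query: doc id -> graded relevance.
judgments = {"d1": 3, "d2": 2, "d3": 0, "d4": 1}
baseline_run = ["d1", "d2", "d4", "d3"]   # ranking before a change
candidate_run = ["d3", "d1", "d2", "d4"]  # ranking after a change

baseline_score = ndcg_at_k(baseline_run, judgments)
candidate_score = ndcg_at_k(candidate_run, judgments)
print(regressed(baseline_score, candidate_score))
```

A real job would average the metric over many queries and several corpora. A single query is shown only to keep the arithmetic visible.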
Just a piece of feedback from clients on the original docCount change.
I have seen several cases with clients where the switch to docCount
surprised and harmed relevance.
More broadly, I’m concerned that when we make these changes there’s not a
testing process against test corpuses with judgments and
On Tue, Dec 5, 2017 at 5:15 AM, alessandro.benedetti
wrote:
> "Lucene/Solr doesn't actually delete documents when you delete them, it
> just marks them as deleted. I'm pretty sure that the difference between
> docCount and maxDoc is deleted documents. Maybe I don't understand what
> I'm talking
"Lucene/Solr doesn't actually delete documents when you delete them, it
just marks them as deleted. I'm pretty sure that the difference between
docCount and maxDoc is deleted documents. Maybe I don't understand what
I'm talking about, but that is the best I can come up with. "
Thanks Shawn, y
On Mon, Dec 4, 2017 at 1:35 PM, Shawn Heisey wrote:
> I'm pretty sure that the difference between docCount and maxDoc is deleted
> documents.
docCount (not the best name) here is the number of documents with the
field being searched. docFreq (df) is the number of documents
actually containing t
On 12/4/2017 7:21 AM, alessandro.benedetti wrote:
the reason docCount was improving things is that it was using a docCount
relative to a specific field, while maxDoc is global across the whole index?
Lucene/Solr doesn't actually delete documents when you delete them, it
just marks them as delet
Furthermore, taking a look at the code for BM25 similarity, it seems to me it
is currently working right:
- docCount is used per field if != -1
/**
* Computes a score factor for a simple term and returns an explanation
* for that score factor.
*
*
* The default implementation us
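
To make the docCount versus maxDoc difference concrete, here is a small Python sketch of the BM25-style idf, log(1 + (N - n + 0.5) / (n + 0.5)), with the fallback to maxDoc when the per-field docCount is -1. It is an illustration of the formula, not the Lucene source, and all the numbers are made up:

```python
import math

def effective_doc_count(field_doc_count, max_doc):
    """Use the per-field docCount unless it is -1 (unavailable), else maxDoc."""
    return max_doc if field_doc_count == -1 else field_doc_count

def bm25_idf(doc_freq, doc_count):
    """BM25-style idf: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical index: one field per language, and this field is sparse.
max_doc = 1_000_000        # all documents in the index
field_doc_count = 10_000   # documents that have this field at all
doc_freq = 500             # documents containing the term in this field

idf_field = bm25_idf(doc_freq, effective_doc_count(field_doc_count, max_doc))
idf_global = bm25_idf(doc_freq, effective_doc_count(-1, max_doc))

print(f"idf with docCount: {idf_field:.3f}")
print(f"idf with maxDoc:   {idf_global:.3f}")
```

With these numbers the term looks much rarer when idf is computed against maxDoc than against the field's own docCount, which is the skew a sparse per-language field would see.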
Hi Markus,
just out of interest, why did
"It was solved back then by using docCount instead of maxDoc when
calculating idf, it worked really well!" solve the problem?
I assume you are using different fields, one per language.
Each field appears in a different number of docs, I guess.
e.g.
t
y relevant documents in foreign languages,
> hence the deboost is not too low.
>
> Thanks,
> Markus
>
>
> -Original message-
>> From:Walter Underwood
>> Sent: Thursday 30th November 2017 17:29
>> To: solr-user@lucene.apache.org
>> Subject: R
uages, hence the deboost is
not too low.
Thanks,
Markus
-Original message-
> From:Walter Underwood
> Sent: Thursday 30th November 2017 17:29
> To: solr-user@lucene.apache.org
> Subject: Re: Skewed IDF in multi lingual index, again
>
> I’ve occasionally considered using U
I’ve occasionally considered using Unicode language tags (U+E0001 and friends)
on each term. That would make a term specific to a language, so we would get
[en]LaserJet, [fr]LaserJet, [de]LaserJet, and so on. But that is a pretty big
hammer, because it restricts matches to the same language. If t
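
A rough Python sketch of the term-tagging idea above. It uses the bracket notation from the message ([en], [fr]) instead of real Unicode tag characters, and the documents are invented for illustration:

```python
from collections import Counter

def tag_terms(tokens, lang):
    """Prefix every token with its language, e.g. "[en]LaserJet"."""
    return [f"[{lang}]{token}" for token in tokens]

def document_frequencies(docs, tagged):
    """Count, per term, how many documents contain it at least once."""
    df = Counter()
    for lang, tokens in docs:
        terms = tag_terms(tokens, lang) if tagged else tokens
        df.update(set(terms))
    return df

# Hypothetical documents: (language, tokens).
docs = [
    ("en", ["LaserJet", "printer"]),
    ("en", ["LaserJet", "toner"]),
    ("fr", ["LaserJet", "imprimante"]),
    ("de", ["Drucker"]),
]

untagged = document_frequencies(docs, tagged=False)
tagged = document_frequencies(docs, tagged=True)

print(untagged["LaserJet"])      # one df shared by every language
print(tagged["[en]LaserJet"])    # df counted within English only
print(tagged["[fr]LaserJet"])    # df counted within French only
```

Each language gets its own document frequency for the same surface term, but a query term tagged [en] no longer matches the French document at all, which is the restriction the message describes.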
Hello,
We already discussed this problem five years ago [1]. In short: documents in
foreign languages are scored higher for some terms.
It was solved back then by using docCount instead of maxDoc when calculating
idf, it worked really well! But, probably due to index changes, the problem is
ba