It is challenging, as performance will be very dependent on the use case
and domain (there's no one globally perfect relevance solution). But a good
set of metrics to see *generally* how stock Solr performs across a
reasonable set of verticals would be nice.
My philosoph
Thanks Yonik and thanks Doug.
I agree with Doug on adding a few generic test corpora that Jenkins
automatically runs some metrics on, to verify that Apache Lucene/Solr changes
don't affect a golden truth too much.
This of course can be very complex, but I think it is a direction the Apache
Lucene/Solr comm
Just a piece of feedback from clients on the original docCount change.
I have seen several cases with clients where the switch to docCount
surprised and harmed relevance.
More broadly, I’m concerned that when we make these changes there’s not a
testing process against test corpora with judgments and
On Tue, Dec 5, 2017 at 5:15 AM, alessandro.benedetti
wrote:
> "Lucene/Solr doesn't actually delete documents when you delete them, it
> just marks them as deleted. I'm pretty sure that the difference between
> docCount and maxDoc is deleted documents. Maybe I don't understand what
> I'm talking
"Lucene/Solr doesn't actually delete documents when you delete them, it
just marks them as deleted. I'm pretty sure that the difference between
docCount and maxDoc is deleted documents. Maybe I don't understand what
I'm talking about, but that is the best I can come up with. "
Thanks Shawn, y
On Mon, Dec 4, 2017 at 1:35 PM, Shawn Heisey wrote:
> I'm pretty sure that the difference between docCount and maxDoc is deleted
> documents.
docCount (not the best name) here is the number of documents with the
field being searched. docFreq (df) is the number of documents
actually containing t
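To make the three statistics concrete, here is a minimal sketch over a toy in-memory "index". The helper names (`max_doc`, `doc_count`, `doc_freq`) and the sample documents are hypothetical and only mirror the definitions above; this is not Lucene's API. In a real Lucene index maxDoc also counts deleted documents that have not been merged away, which this toy ignores.

```python
from typing import Dict, List

Doc = Dict[str, str]  # field name -> field text

def max_doc(docs: List[Doc]) -> int:
    # Global count of documents in the (toy) index, regardless of field.
    return len(docs)

def doc_count(docs: List[Doc], field: str) -> int:
    # Number of documents that have any value for this field.
    return sum(1 for d in docs if d.get(field))

def doc_freq(docs: List[Doc], field: str, term: str) -> int:
    # Number of documents whose field actually contains the term.
    return sum(1 for d in docs if term in d.get(field, "").lower().split())

docs = [
    {"text_en": "laserjet printer"},
    {"text_en": "inkjet printer"},
    {"text_en": "laser pointer"},
    {"text_fr": "imprimante laserjet"},
]

print(max_doc(docs))                        # 4
print(doc_count(docs, "text_en"))           # 3
print(doc_count(docs, "text_fr"))           # 1
print(doc_freq(docs, "text_en", "printer")) # 2
```

The gap between maxDoc (4) and docCount for `text_fr` (1) exists here without any deletes, simply because most documents have no `text_fr` value.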
On 12/4/2017 7:21 AM, alessandro.benedetti wrote:
the reason docCount was improving things is that it was using a docCount
relative to a specific field, while maxDoc is global across the whole index?
Lucene/Solr doesn't actually delete documents when you delete them, it
just marks them as delet
Furthermore, taking a look at the code for BM25 similarity, it seems to me it
is currently working correctly:
- docCount is used per field if != -1
/**
* Computes a score factor for a simple term and returns an explanation
* for that score factor.
*
*
* The default implementation us
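The "docCount is used per field if != -1" behaviour can be sketched roughly as follows. This is a Python paraphrase under assumptions, not the Lucene source: the function names are hypothetical, and the idf expression is the BM25 form log(1 + (N - df + 0.5) / (df + 0.5)) as commonly stated for Lucene's BM25 similarity.

```python
import math

def effective_doc_count(doc_count: int, max_doc: int) -> int:
    # Assumption: -1 means the per-field statistic is unavailable,
    # in which case the global maxDoc is used instead.
    return max_doc if doc_count == -1 else doc_count

def bm25_idf(doc_freq: int, doc_count: int) -> float:
    # BM25-style idf; N is whichever document count was selected above.
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Same term (df = 1), scored against a small field vs. the whole index.
n_field = effective_doc_count(5, 100)    # per-field docCount available
n_global = effective_doc_count(-1, 100)  # falls back to maxDoc
print(bm25_idf(1, n_field) < bm25_idf(1, n_global))  # True
```

With the same docFreq, the larger N yields the larger idf, which is why the choice between docCount and maxDoc matters for sparsely populated fields.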
Hi Markus,
just out of interest, why did
"It was solved back then by using docCount instead of maxDoc when
calculating idf, it worked really well!" solve the problem?
I assume you are using different fields, one per language.
Each field appears in a different number of docs, I guess.
e.g.
t
y relevant documents in foreign languages,
> hence the deboost is not too low.
>
> Thanks,
> Markus
>
>
> -Original message-
>> From:Walter Underwood
>> Sent: Thursday 30th November 2017 17:29
>> To: solr-user@lucene.apache.org
>> Subject: R
uages, hence the deboost is
not too low.
Thanks,
Markus
-Original message-
> From:Walter Underwood
> Sent: Thursday 30th November 2017 17:29
> To: solr-user@lucene.apache.org
> Subject: Re: Skewed IDF in multi lingual index, again
>
> I’ve occasionally considered using U
I’ve occasionally considered using Unicode language tags (U+E0001 and friends)
on each term. That would make a term specific to a language, so we would get
[en]LaserJet, [fr]LaserJet, [de]LaserJet, and so on. But that is a pretty big
hammer, because it restricts matches to the same language. If t
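A rough sketch of the per-language term idea, for illustration only. It uses a visible `[lang]` prefix as in the notation above rather than real Unicode tag characters, and `tag_terms` plus the sample documents are hypothetical.

```python
from collections import Counter
from typing import List

def tag_terms(text: str, lang: str) -> List[str]:
    # Prefix every token with its language so identical surface forms
    # become distinct terms per language.
    return [f"[{lang}]{tok}" for tok in text.split()]

docs = [
    ("en", "LaserJet printer"),
    ("fr", "imprimante LaserJet"),
    ("de", "LaserJet Drucker"),
]

# Document frequency per tagged term.
df = Counter()
for lang, text in docs:
    df.update(set(tag_terms(text, lang)))

print(df["[en]LaserJet"])  # 1
print(df["[fr]LaserJet"])  # 1
print("LaserJet" in df)    # False
```

Each language now has its own document frequency for "LaserJet", but a query term tagged `[en]` can no longer match the French or German documents, which is the "big hammer" trade-off described above.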
ive boosts
> will be lower than the product of boosts similar boosts, lowering the
> document in rank instead of boosting it.
>
> -Original message-
> > From:Markus Jelsma
> > Sent: Fri 09-Nov-2012 10:23
> > To: solr-user@lucene.apache.org
> > Subject: RE: S
r@lucene.apache.org
> Subject: RE: Skewed IDF in multi lingual index
>
> Robert, Tom,
>
> That's it indeed! Using maxDoc as the numerator, as opposed to docCount,
> yields very skewed results for an unevenly distributed multi-lingual index.
> We have one
> language dominatin
-Original message-
> From:Robert Muir
> Sent: Thu 08-Nov-2012 17:44
> To: solr-user@lucene.apache.org
> Subject: Re: Skewed IDF in multi lingual index
>
> Hi Markus: how are the languages distributed across documents?
>
> Imagine I have a text_en field and a text_fr
Hi Markus,
No answers, but I am very interested in what you find out. We currently
index all languages in one index, which presents different IDF issues, but
are interested in exploring alternatives such as the one you describe.
Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search
Hi Markus: how are the languages distributed across documents?
Imagine I have a text_en field and a text_fr field. Let's say I have
100 documents, 95 are english and only 5 are french.
So the text_en field is populated 95% of the time, and the text_fr 5%
of the time.
But the default IDF computatio
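The 95/5 split above can be worked through numerically. The sketch below assumes the classic TF-IDF style idf, 1 + ln((N + 1) / (df + 1)), and an invented French term that occurs in all 5 French documents; both the formula choice and the term are assumptions for illustration, not a quote from the thread or from Lucene.

```python
import math

def classic_idf(doc_freq: int, n_docs: int) -> float:
    # Classic idf form: 1 + ln((N + 1) / (df + 1)).
    return 1.0 + math.log((n_docs + 1) / (doc_freq + 1))

MAX_DOC = 100        # all documents in the index
DOC_COUNT_FR = 5     # documents that have a text_fr value
DF_FR_TERM = 5       # hypothetical term present in every French document

idf_with_max_doc = classic_idf(DF_FR_TERM, MAX_DOC)
idf_with_doc_count = classic_idf(DF_FR_TERM, DOC_COUNT_FR)

print(idf_with_doc_count)                     # 1.0
print(idf_with_max_doc > idf_with_doc_count)  # True
```

With N = maxDoc, a term found in every French document still looks rare (5 of 100) and gets a high idf; with N = docCount for text_fr it is correctly treated as ubiquitous within that field.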