subject:"RE\: Skewed IDF in multi lingual index"

Re: Skewed IDF in multi lingual index, again

2017-12-05 Thread Doug Turnbull

It is challenging as the performance of different use cases and domains will be very dependent on the use case (there's no one globally perfect relevance solution). But a good set of metrics to see *generally* how stock Solr performs across a reasonable set of verticals would be nice. My philosoph

Re: Skewed IDF in multi lingual index, again

2017-12-05 Thread alessandro.benedetti

Thanks Yonik and thanks Doug. I agree with Doug on adding a few generic test corpora that Jenkins automatically runs some metrics on, to check that Apache Lucene/Solr changes don't affect a golden truth too much. This of course can be very complex, but I think it is a direction the Apache Lucene/Solr comm

Re: Skewed IDF in multi lingual index, again

2017-12-05 Thread Doug Turnbull

Just a piece of feedback from clients on the original docCount change. I have seen several cases with clients where the switch to docCount surprised and harmed relevance. More broadly, I’m concerned when we make these changes there’s not a testing process against test corpuses with judgments and

Re: Skewed IDF in multi lingual index, again

2017-12-05 Thread Yonik Seeley

On Tue, Dec 5, 2017 at 5:15 AM, alessandro.benedetti wrote: > "Lucene/Solr doesn't actually delete documents when you delete them, it > just marks them as deleted. I'm pretty sure that the difference between > docCount and maxDoc is deleted documents. Maybe I don't understand what > I'm talking

Re: Skewed IDF in multi lingual index, again

2017-12-05 Thread alessandro.benedetti

"Lucene/Solr doesn't actually delete documents when you delete them, it just marks them as deleted. I'm pretty sure that the difference between docCount and maxDoc is deleted documents. Maybe I don't understand what I'm talking about, but that is the best I can come up with. " Thanks Shawn, y

Re: Skewed IDF in multi lingual index, again

2017-12-04 Thread Yonik Seeley

On Mon, Dec 4, 2017 at 1:35 PM, Shawn Heisey wrote: > I'm pretty sure that the difference between docCount and maxDoc is deleted > documents. docCount (not the best name) here is the number of documents with the field being searched. docFreq (df) is the number of documents actually containing t
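
The three statistics distinguished in this reply can be illustrated on a toy corpus. This is only a minimal Python sketch, not Lucene code: the documents, the field names, and the `field_stats` helper are hypothetical, and deleted documents are ignored entirely.

```python
from typing import Dict, List, Tuple

# Hypothetical toy corpus: each document maps a field name to its terms.
docs: List[Dict[str, List[str]]] = [
    {"text_en": ["laser", "printer"]},
    {"text_en": ["ink", "printer"]},
    {"text_en": ["laser", "pointer"]},
    {"text_fr": ["imprimante", "laser"]},
]

def field_stats(docs: List[Dict[str, List[str]]], field: str, term: str) -> Tuple[int, int, int]:
    """Return (maxDoc, docCount, docFreq) for one field and one term."""
    max_doc = len(docs)                                          # all documents in the index
    doc_count = sum(1 for d in docs if field in d)               # documents that have the field
    doc_freq = sum(1 for d in docs if term in d.get(field, []))  # documents whose field contains the term
    return max_doc, doc_count, doc_freq

print(field_stats(docs, "text_en", "laser"))  # (4, 3, 2)
print(field_stats(docs, "text_fr", "laser"))  # (4, 1, 1)
```

maxDoc is the same for every field, while docCount shrinks to the number of documents that actually carry the field being searched.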

Re: Skewed IDF in multi lingual index, again

2017-12-04 Thread Shawn Heisey

On 12/4/2017 7:21 AM, alessandro.benedetti wrote: > the reason docCount was improving things is because it was using a docCount relative to a specific field while maxDoc is global all over the index ? Lucene/Solr doesn't actually delete documents when you delete them, it just marks them as delet

Re: Skewed IDF in multi lingual index, again

2017-12-04 Thread alessandro.benedetti

Furthermore, taking a look at the code for BM25 similarity, it seems to me it is currently working correctly: - docCount is used per field if != -1 /** * Computes a score factor for a simple term and returns an explanation * for that score factor. * * * The default implementation us
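
The "docCount is used per field if != -1" rule described above can be sketched as follows. This is a Python illustration under stated assumptions, not the Lucene source: the `bm25_idf` name is made up here, and the formula log(1 + (N - n + 0.5) / (n + 0.5)) is assumed to be the BM25 idf variant in use.

```python
import math

def bm25_idf(doc_freq: int, doc_count: int, max_doc: int) -> float:
    """BM25-style idf; N is the per-field docCount unless it is unavailable (-1)."""
    # Assumption: -1 signals that the per-field statistic is missing,
    # in which case the index-wide maxDoc is used instead.
    num_docs = max_doc if doc_count == -1 else doc_count
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

# Same term statistics, with and without a per-field docCount.
print(bm25_idf(doc_freq=2, doc_count=5, max_doc=100))
print(bm25_idf(doc_freq=2, doc_count=-1, max_doc=100))
```

The second call falls back to maxDoc and therefore yields the larger idf, which is the skew discussed in this thread.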

Re: Skewed IDF in multi lingual index, again

2017-12-04 Thread alessandro.benedetti

Hi Markus, just out of interest, why did "It was solved back then by using docCount instead of maxDoc when calculating idf, it worked really well!" solve the problem? I assume you are using different fields, one per language. Each field appears in a different number of docs, I guess. e.g. t

Re: Skewed IDF in multi lingual index, again

2017-11-30 Thread Walter Underwood

y relevant documents in foreign languages, > hence the deboost is not too low. > > Thanks, > Markus > > > -Original message- >> From:Walter Underwood >> Sent: Thursday 30th November 2017 17:29 >> To: solr-user@lucene.apache.org >> Subject: R

RE: Skewed IDF in multi lingual index, again

2017-11-30 Thread Markus Jelsma

uages, hence the deboost is not too low. Thanks, Markus -Original message- > From:Walter Underwood > Sent: Thursday 30th November 2017 17:29 > To: solr-user@lucene.apache.org > Subject: Re: Skewed IDF in multi lingual index, again > > I’ve occasionally considered using U

Re: Skewed IDF in multi lingual index, again

2017-11-30 Thread Walter Underwood

I’ve occasionally considered using Unicode language tags (U+E0001 and friends) on each term. That would make a term specific to a language, so we would get [en]LaserJet, [fr]LaserJet, [de]LaserJet, and so on. But that is a pretty big hammer, because it restricts matches to the same language. If t
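
The effect of language-specific terms can be sketched in a few lines of Python. This is a hypothetical illustration only: it uses the visible `[en]` style prefix from the message rather than real Unicode tag characters, and `tag_terms` is a made-up helper, not a Solr or Lucene filter.

```python
from typing import List, Set

def tag_terms(terms: List[str], lang: str) -> List[str]:
    """Prefix each term with a language marker, e.g. [en]LaserJet."""
    return [f"[{lang}]{term}" for term in terms]

# Terms of an English document as they would be indexed.
indexed: Set[str] = set(tag_terms(["LaserJet", "printer"], "en"))

same_lang_query = tag_terms(["LaserJet"], "en")
other_lang_query = tag_terms(["LaserJet"], "fr")

print(any(t in indexed for t in same_lang_query))   # True
print(any(t in indexed for t in other_lang_query))  # False
```

Each tagged term gets its own statistics per language, but a French query for LaserJet no longer matches the English document, which is the "big hammer" drawback mentioned above.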

Re: Skewed IDF in multi lingual index

2012-11-26 Thread Robert Muir

ive boosts > will be lower than the product of boosts similar boosts, lowering the > document in rank instead of boosting it. > > -Original message- > > From:Markus Jelsma > > Sent: Fri 09-Nov-2012 10:23 > > To: solr-user@lucene.apache.org > > Subject: RE: S

RE: Skewed IDF in multi lingual index

2012-11-12 Thread Markus Jelsma

r@lucene.apache.org > Subject: RE: Skewed IDF in multi lingual index > > Robert, Tom, > > That's it indeed! Using maxDoc as numerator opposed to docCount yields very > skewed results for an unevenly distributed multi-lingual index. We have one > language dominatin

RE: Skewed IDF in multi lingual index

2012-11-09 Thread Markus Jelsma

-Original message- > From:Robert Muir > Sent: Thu 08-Nov-2012 17:44 > To: solr-user@lucene.apache.org > Subject: Re: Skewed IDF in multi lingual index > > Hi Markus: how are the languages distributed across documents? > > Imagine I have a text_en field and a text_fr

Re: Skewed IDF in multi lingual index

2012-11-08 Thread Tom Burton-West

Hi Markus, No answers, but I am very interested in what you find out. We currently index all languages in one index, which presents different IDF issues, but are interested in exploring alternatives such as the one you describe. Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search

Re: Skewed IDF in multi lingual index

2012-11-08 Thread Robert Muir

Hi Markus: how are the languages distributed across documents? Imagine I have a text_en field and a text_fr field. Let's say I have 100 documents, 95 are English and only 5 are French. So the text_en field is populated 95% of the time, and the text_fr 5% of the time. But the default IDF computatio
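
The 95/5 example above can be worked through numerically. The following Python sketch makes several assumptions: the classic idf form 1 + ln(N / (df + 1)) stands in for the "default IDF computation", and the docFreq values (a term present in 40% of each field's documents) are invented for illustration.

```python
import math

def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Classic TF-IDF style idf: 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

MAX_DOC = 100                              # all documents in the index
DOC_COUNT = {"text_en": 95, "text_fr": 5}  # documents that have each field
DOC_FREQ = {"text_en": 38, "text_fr": 2}   # hypothetical: 40% of each field's documents

for field in ("text_en", "text_fr"):
    global_idf = classic_idf(DOC_FREQ[field], MAX_DOC)
    field_idf = classic_idf(DOC_FREQ[field], DOC_COUNT[field])
    print(f"{field}: idf(maxDoc)={global_idf:.3f}  idf(docCount)={field_idf:.3f}")
```

With maxDoc as N, the French term looks far rarer than the English one even though both occur in the same share of their own field's documents; with docCount as N, the two idf values end up close together.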


17 matches
