[
https://issues.apache.org/jira/browse/LUCENE-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anshum Gupta moved SOLR-13752 to LUCENE-8984:
---------------------------------------------
Component/s: (was: MoreLikeThis)
Key: LUCENE-8984 (was: SOLR-13752)
Lucene Fields: New,Patch Available
Project: Lucene - Core (was: Solr)
Security: (was: Public)
> MoreLikeThis MLT is biased for uncommon fields
> ----------------------------------------------
>
> Key: LUCENE-8984
> URL: https://issues.apache.org/jira/browse/LUCENE-8984
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Andy Hind
> Assignee: Anshum Gupta
> Priority: Major
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> MLT always uses the total document count, not the count of documents that
> contain the specific field.
>
> To quote Maria Mestre from the discussion on the mailing list - 29/01/19
>
> {quote}The issue I have is that when retrieving the key scored terms
> (interestingTerms), the code uses the total number of documents in the index,
> not the total number of documents with populated “description” field. This is
> where it’s done in the code:
> [https://github.com/apache/lucene-solr/blob/master/lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThis.java#L651]
> The effect of this choice is that the “idf” does not vary much, given that
> numDocs >> number of documents with “description”, so the key terms end up
> being just the terms with the highest term frequencies.
> It is inconsistent because the MLT-search then uses these extracted key terms
> and scores all documents using an idf which is computed only on the subset of
> documents with “description”. So one part of the MLT uses a different numDocs
> than another part. This sounds like an odd choice, and not expected at all,
> and I wonder if I’m missing something.
> {quote}
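
The effect described in the quote can be illustrated with a small sketch. It is
not the MoreLikeThis code itself. It assumes the classic TF-IDF form of idf,
1 + ln((docCount + 1) / (docFreq + 1)), and the document counts are invented
for illustration only. IdfDemo, idf and idfSpread are hypothetical names. The
sketch compares how strongly a rare term is favored over a common one when
docCount is the whole index versus only the documents that have the field.

```java
// Illustrative sketch only: the counts are made up, and the idf formula is the
// classic TF-IDF form, assumed here rather than taken from MoreLikeThis.
final class IdfDemo {

    /** Classic idf: 1 + ln((docCount + 1) / (docFreq + 1)). */
    static double idf(long docFreq, long docCount) {
        return 1.0 + Math.log((docCount + 1) / (double) (docFreq + 1));
    }

    /** How many times larger the idf of a rare term is than that of a common term. */
    static double idfSpread(long rareDocFreq, long commonDocFreq, long docCount) {
        return idf(rareDocFreq, docCount) / idf(commonDocFreq, docCount);
    }

    public static void main(String[] args) {
        long numDocs = 1_000_000L;   // hypothetical: all documents in the index
        long docsWithField = 1_000L; // hypothetical: documents with "description"
        long rareDf = 10L;           // hypothetical: docFreq of a rare term
        long commonDf = 500L;        // hypothetical: docFreq of a common term

        System.out.printf("spread using numDocs:       %.3f%n",
                idfSpread(rareDf, commonDf, numDocs));
        System.out.printf("spread using docsWithField: %.3f%n",
                idfSpread(rareDf, commonDf, docsWithField));
    }
}
```

With these made-up numbers the spread is clearly smaller when the whole index
is used as docCount, so idf does less to separate the terms and term frequency
dominates the ranking, which matches the behaviour reported above. In Lucene,
the two counts should correspond to IndexReader.numDocs() and
IndexReader.getDocCount(field); how the fix should use them is left to the
patch on this issue.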
--
This message was sent by Atlassian Jira
(v8.3.4#803005)