[ 
https://issues.apache.org/jira/browse/LUCENE-9537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201089#comment-17201089
 ] 

Cameron VandenBerg commented on LUCENE-9537:
--------------------------------------------

Hi Adrien,

Unfortunately, the smoothing score that we use is document-specific, so I am 
not sure whether I could make it "transferable".  I am definitely interested, 
though, in brainstorming ways to make Indri fit better into the Lucene 
architecture.  Perhaps an example of how Indri smoothing scores work would 
be helpful.
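As background for the "document-specific" point, here is a minimal sketch of the textbook Dirichlet-smoothed estimate. It is an assumption for illustration; the patch may differ in detail. The symbols are generic notation, not names from the patch: tf(t, d) is the frequency of term t in document d, |d| is the document length, P(t | C) is the term's relative frequency in the whole collection, and mu is the smoothing parameter.

```latex
P(t \mid d) = \frac{\mathrm{tf}(t, d) + \mu \, P(t \mid C)}{|d| + \mu}
\qquad\Longrightarrow\qquad
P(t \mid d)\Big|_{\mathrm{tf}(t, d) = 0} = \frac{\mu \, P(t \mid C)}{|d| + \mu}
```

Even for a term the document does not contain, the denominator still carries |d|. Under this form the smoothing score therefore has to be evaluated per document rather than once per term.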

 

Suppose we have an index with 4 documents (so sorry for the political nature 
of the documents... it's just what I can easily think of at the moment):

1) Donald Trump is the president of the United States.

2) There are three branches of government.  The president is the head of the 
executive branch.

3) Jane Doe is president of the PTO.

4) Trump was elected in the 2016 election.

 

Say that the query is: President Trump.
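For this index and query, the statistics that a smoothing score depends on can be tallied with a short sketch. It is only there to make the numbers concrete. The tokenization (lowercase, split on anything that is not a letter or digit) and the class name `ExampleIndexStats` are assumptions for illustration, not taken from the patch.

```java
import java.util.Locale;

/** Hypothetical sketch: tally the statistics of the four example documents. */
class ExampleIndexStats {
  static final String[] DOCS = {
    "Donald Trump is the president of the United States.",
    "There are three branches of government.  "
        + "The president is the head of the executive branch.",
    "Jane Doe is president of the PTO.",
    "Trump was elected in the 2016 election."
  };

  /**
   * Assumed analysis: lowercase, then split on runs of non-alphanumerics.
   * None of the example documents starts with a separator, so no empty
   * leading token is produced; trailing empty strings are dropped by split().
   */
  static String[] tokens(String text) {
    return text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
  }

  /** Number of tokens in the document at the given 0-based index. */
  static int length(int doc) {
    return tokens(DOCS[doc]).length;
  }

  /** Total number of tokens across all documents. */
  static int collectionLength() {
    int total = 0;
    for (int d = 0; d < DOCS.length; d++) {
      total += length(d);
    }
    return total;
  }

  /** Total occurrences of an (already lowercased) term across all documents. */
  static int collectionFrequency(String term) {
    int count = 0;
    for (String doc : DOCS) {
      for (String token : tokens(doc)) {
        if (token.equals(term)) {
          count++;
        }
      }
    }
    return count;
  }

  public static void main(String[] args) {
    for (int d = 0; d < DOCS.length; d++) {
      System.out.println("Document " + (d + 1) + " length: " + length(d));
    }
    System.out.println("collection length: " + collectionLength());
    System.out.println("president: " + collectionFrequency("president"));
    System.out.println("trump: " + collectionFrequency("trump"));
  }
}
```

Run as-is, it reports lengths 9, 15, 7 and 7 (38 tokens in total), with president occurring 3 times and trump 2 times. A different analyzer, for example one that removes stop words, would give different lengths.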

In this index, the term president occurs more often (3 times) than the term 
Trump (2 times).  The smoothing score acts like an idf for the query terms, so 
that, all else (such as document length) being equal, a document with just the 
term Trump will be ranked higher than a document with just the term president.
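The idf-like behaviour can be made explicit with a short derivation. This is a sketch under an assumption: it uses the textbook Dirichlet form, in which a term with frequency tf in a document of length L scores log((tf + mu p) / (L + mu)), and it is not necessarily the exact computation in the patch. Take two documents of equal length L, one containing only Trump (once) and one containing only president (once), and compare their scores for the two-term query.

```latex
% d_T contains only "Trump" once, d_P contains only "president" once, |d_T| = |d_P| = L
% p_T = P(\text{Trump} \mid C), \quad p_P = P(\text{president} \mid C)
\begin{aligned}
\mathrm{score}(d_T) &= \log\frac{1 + \mu p_T}{L + \mu} + \log\frac{\mu p_P}{L + \mu} \\
\mathrm{score}(d_P) &= \log\frac{1 + \mu p_P}{L + \mu} + \log\frac{\mu p_T}{L + \mu} \\
\mathrm{score}(d_T) - \mathrm{score}(d_P)
  &= \log\Bigl(1 + \frac{1}{\mu p_T}\Bigr) - \log\Bigl(1 + \frac{1}{\mu p_P}\Bigr)
  > 0 \quad \text{whenever } p_T < p_P
\end{aligned}
```

The gain from matching a term grows as the term gets rarer in the collection. If the two documents differ in length, an extra length-dependent term appears, so the ordering is only guaranteed for documents of equal length.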

 

Consider Documents 3 and 4, which have the same length (7 words each) and each 
contain one search term, but Document 4 contains the rarer one.  Therefore the 
smoothing score for the term Trump, which is missing from Document 3, will be 
lower than the smoothing score for the term president, which is missing from 
Document 4.  Adding the smoothing score for the query term that each document 
lacks allows Document 4 to get a higher score and be ranked above Document 3.
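The comparison can be checked with a small self-contained sketch. It assumes the textbook Dirichlet form and an illustrative mu of 2000; the class `IndriSmoothingSketch`, its methods and the hard-coded statistics are hypothetical and are not code from the patch. The statistics are counted by hand from the four example documents, assuming lowercase tokens split on non-alphanumerics. The program prints only orderings, not score values, because the exact numbers depend on mu.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical sketch: textbook Dirichlet-smoothed scoring of the example. */
class IndriSmoothingSketch {
  // Hand-counted statistics for Documents 1..4 (array index = document - 1).
  static final int[] DOC_LEN = {9, 15, 7, 7};
  static final int[] TF_PRESIDENT = {1, 1, 1, 0};
  static final int[] TF_TRUMP = {1, 0, 0, 1};
  // Collection probability = collection frequency / collection length (38).
  static final double P_PRESIDENT = 3.0 / 38.0;
  static final double P_TRUMP = 2.0 / 38.0;

  /** log P(t|d) with Dirichlet smoothing; tf == 0 gives the smoothing score. */
  static double termScore(int tf, int docLen, double collectionProb, double mu) {
    return Math.log((tf + mu * collectionProb) / (docLen + mu));
  }

  /** Every query term contributes to the score, whether present or not. */
  static double score(int doc, double mu) {
    return termScore(TF_PRESIDENT[doc], DOC_LEN[doc], P_PRESIDENT, mu)
        + termScore(TF_TRUMP[doc], DOC_LEN[doc], P_TRUMP, mu);
  }

  /** Document numbers (1-based), sorted by descending score. */
  static int[] ranking(double mu) {
    Integer[] docs = {0, 1, 2, 3};
    Arrays.sort(docs, Comparator.comparingDouble((Integer d) -> -score(d, mu)));
    return Arrays.stream(docs).mapToInt(d -> d + 1).toArray();
  }

  public static void main(String[] args) {
    double mu = 2000; // assumed value, for illustration only

    // Looking only at the term each document contains, the document that
    // matches the more common term (Document 3) comes out ahead.
    double matched3 = termScore(1, 7, P_PRESIDENT, mu);
    double matched4 = termScore(1, 7, P_TRUMP, mu);
    System.out.println(
        "matched term only, Doc 3 ahead of Doc 4: " + (matched3 > matched4));

    // Adding the smoothing score of the missing term reverses the order.
    System.out.println(
        "with smoothing, Doc 4 ahead of Doc 3: " + (score(3, mu) > score(2, mu)));

    System.out.println("ranking: " + Arrays.toString(ranking(mu)));
  }
}
```

Both comparisons print true, and the ranking is [1, 4, 3, 2]. Document 1 contains both terms, Document 4 beats Document 3 through the smoothing score of the missing term, and Document 2 is last because it is longer and matches only the common term. Under these assumptions the same order holds for any mu greater than zero.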

 

Let me know whether this example makes sense.  Can you see a way that I can 
refactor the smoothing score so that it better fits into Lucene's existing 
architecture?  Or let me know if I misunderstood your comment and you still 
feel that what you suggested will work.

 

Thank you!

> Add Indri Search Engine Functionality to Lucene
> -----------------------------------------------
>
>                 Key: LUCENE-9537
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9537
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Cameron VandenBerg
>            Priority: Major
>              Labels: patch
>         Attachments: LUCENE-INDRI.patch
>
>
> Indri ([http://lemurproject.org/indri.php]) is an academic search engine 
> developed by The University of Massachusetts and Carnegie Mellon University.  
> The major difference between Lucene and Indri is that Indri will give a 
> "smoothing score" to a document that does not contain the search 
> term, which has improved the search ranking accuracy in our experiments.  I 
> have created an Indri patch, which adds the search code needed to implement 
> the Indri AND logic as well as Indri's implementation of Dirichlet Smoothing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
