[GitHub] [lucene] thongnt99 opened a new issue, #11799: Indexing method for learned sparse retrieval

GitBox Wed, 21 Sep 2022 05:26:40 -0700


thongnt99 opened a new issue, #11799:
URL: https://github.com/apache/lucene/issues/11799


   ### Description
   
   Recent learned sparse retrieval methods 
([Splade](https://github.com/naver/splade), 
[uniCOIL](https://github.com/castorini/pyserini/blob/master/docs/experiments-unicoil.md))
 are trained to generate impact scores directly (replacing tf-idf scores).  
   For each document, they generate a JSON object of terms and weights,  
e.g. `{";": 80, "the": 161, "of": 85, "and": 27, "to": 24, "was": 47, "as": 27, 
"their": 96, "what": 40, "over": 123, "only": 123, "important": 186, "project": 
208, "success": 215, "meant": 131, "lives": 140, "presence": 180, "scientific": 
200, "communication": 235, "thousands": 142, "hundreds": 144, "truly": 170, 
"hanging": 141, "cloud": 187, "engineers": 127, "achievement": 192, 
"researchers": 137, "innocent": 181, "manhattan": 244, "impressive": 191, 
"equally": 163, "##rated": 132, "minds": 137, "atomic": 214, "amid": 201, 
"##lite": 120, "intellect": 202, "ob": 140}`
   Could we add a feature that indexes this type of document 
efficiently? 
   The current 
[workaround](https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/collection/JsonVectorCollection.java)
 I am aware of is to create a fake document by repeating each term according 
to its weight, e.g., `"the the the the .... of of of of of "`
   However, this approach becomes inefficient as the impact scores get larger, 
and it also requires quantizing the impact scores before indexing. 
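
To make the cost of the workaround concrete, here is a minimal Python sketch of the repeat-the-term approach. It is not the Anserini implementation; the helper names `quantize` and `to_pseudo_document` and the `scale` parameter are hypothetical, and the three weights are taken from the example above.

```python
import json


def quantize(weights, scale=1.0):
    """Round impact scores to non-negative integers (hypothetical helper)."""
    return {term: max(0, round(w * scale)) for term, w in weights.items()}


def to_pseudo_document(int_weights):
    """Repeat each term `weight` times so its term frequency equals its impact."""
    return " ".join(
        " ".join([term] * w) for term, w in int_weights.items() if w > 0
    )


# Three of the term weights from the example document above.
doc = json.loads('{"the": 161, "of": 85, "manhattan": 244}')
pseudo = to_pseudo_document(quantize(doc))
tokens = pseudo.split()

# The pseudo-document length is the sum of all weights (161 + 85 + 244).
print(len(tokens))  # 490
```

The number of tokens to analyze grows with the sum of the weights rather than with the number of distinct terms, and any fractional part of a float score is lost in the rounding step.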
   I think it would be very useful for many people if we could index the JSON 
files directly with float impact scores. 
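
As a rough illustration of what indexing float impacts directly could mean, the sketch below stores one float per (term, document) posting and scores a query by a weighted sum of impacts. It is a toy in-memory structure, not a Lucene API or a proposed design; the `ImpactIndex` class, the document IDs, and the weights are all made up for illustration, and dot-product scoring is an assumption.

```python
from collections import defaultdict


class ImpactIndex:
    """Toy inverted index storing a float impact per (term, document) posting."""

    def __init__(self):
        # term -> list of (doc_id, impact)
        self.postings = defaultdict(list)

    def add(self, doc_id, weights):
        """Index one document given its {term: impact} mapping."""
        for term, impact in weights.items():
            self.postings[term].append((doc_id, float(impact)))

    def search(self, query_weights, k=10):
        """Score documents by the dot product of query and document weights."""
        scores = defaultdict(float)
        for term, q_weight in query_weights.items():
            for doc_id, impact in self.postings.get(term, ()):
                scores[doc_id] += q_weight * impact
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


index = ImpactIndex()
index.add("d1", {"manhattan": 2.44, "project": 2.08})
index.add("d2", {"project": 1.5, "cloud": 1.87})
results = index.search({"manhattan": 1.0, "project": 0.5})
print(results[0][0])  # d1
```

Here each posting costs one entry regardless of how large the weight is, and no quantization step is needed before indexing.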


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


