thongnt99 opened a new issue, #11799:
URL: https://github.com/apache/lucene/issues/11799

   ### Description
   
   Recent learned sparse retrieval methods 
([SPLADE](https://github.com/naver/splade), 
[uniCOIL](https://github.com/castorini/pyserini/blob/master/docs/experiments-unicoil.md))
 are trained to generate impact scores directly, replacing tf-idf scores.  
   For each document, they produce a JSON file of terms and weights,  
e.g. `{";": 80, "the": 161, "of": 85, "and": 27, "to": 24, "was": 47, "as": 27, 
"their": 96, "what": 40, "over": 123, "only": 123, "important": 186, "project": 
208, "success": 215, "meant": 131, "lives": 140, "presence": 180, "scientific": 
200, "communication": 235, "thousands": 142, "hundreds": 144, "truly": 170, 
"hanging": 141, "cloud": 187, "engineers": 127, "achievement": 192, 
"researchers": 137, "innocent": 181, "manhattan": 244, "impressive": 191, 
"equally": 163, "##rated": 132, "minds": 137, "atomic": 214, "amid": 201, 
"##lite": 120, "intellect": 202, "ob": 140}`
   Could we add a feature that indexes this type of document 
efficiently? 
   The current 
[workaround](https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/collection/JsonVectorCollection.java)
 I am aware of is to create a fake document by repeating each term as many 
times as its weight, e.g., `"the the the the .... of of of of of "`
   However, this becomes inefficient as the impact scores grow, and it also 
requires quantizing the impact scores before indexing. 
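
To make the cost of this workaround concrete, here is a minimal Python sketch of the quantize-then-repeat step. The helper names (`quantize`, `pseudo_document`), the scale factor of 100, and the float weights are assumptions for illustration only; they are not taken from Anserini or Lucene.

```python
def quantize(weights, scale=100.0):
    """Scale float term weights and round them to positive integer impacts.

    The scale factor is a hypothetical choice; terms that round to 0 are dropped.
    """
    impacts = {}
    for term, weight in weights.items():
        impact = int(round(weight * scale))
        if impact > 0:
            impacts[term] = impact
    return impacts


def pseudo_document(impacts):
    """Repeat each term `impact` times so its term frequency equals its impact."""
    return " ".join(" ".join([term] * impact) for term, impact in impacts.items())


# Hypothetical float weights, chosen so they quantize to values from the example above.
float_weights = {"manhattan": 2.44, "project": 2.08, "the": 1.61}
impacts = quantize(float_weights)
text = pseudo_document(impacts)

# The pseudo-document holds one token per unit of impact, so its length
# is the sum of all quantized weights.
print(impacts)
print(len(text.split()))
```

Three terms already expand to several hundred tokens here, and a finer quantization scale grows the pseudo-document proportionally.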
   I think it would be useful for many people if we could index the JSON 
files directly with float impact scores. 
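
As a rough illustration of the requested behavior, the following toy in-memory index stores float impacts directly in its postings and scores a query as the sum of query weight times document impact. The `ImpactIndex` class is hypothetical, it is not a Lucene API, and the dot-product scoring is an assumption about how such weights would be used at query time.

```python
from collections import defaultdict


class ImpactIndex:
    """Toy inverted index: term -> list of (doc_id, float impact)."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, doc_id, weights):
        # Store each float weight as-is: no quantization, no term repetition.
        for term, weight in weights.items():
            self.postings[term].append((doc_id, float(weight)))

    def search(self, query_weights, k=10):
        # Accumulate query_weight * impact over every matching posting.
        scores = defaultdict(float)
        for term, query_weight in query_weights.items():
            for doc_id, impact in self.postings.get(term, ()):
                scores[doc_id] += query_weight * impact
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


index = ImpactIndex()
index.add("d1", {"manhattan": 2.44, "project": 2.08})
index.add("d2", {"project": 1.0, "cloud": 1.87})
results = index.search({"manhattan": 1.0, "project": 0.5})
print(results)
```

Index size in this sketch depends on the number of distinct terms per document rather than on the magnitude of the weights, which is the property the repetition workaround lacks.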

