[GitHub] [lucene] thongnt99 commented on issue #11799: Indexing method for learned sparse retrieval

GitBox Wed, 21 Sep 2022 12:12:01 -0700


thongnt99 commented on issue #11799:
URL: https://github.com/apache/lucene/issues/11799#issuecomment-1254119695

@jtibshirani The query side is the same as the document side: a dictionary
of terms and weights. To make it compatible with Lucene, people just repeat
each term as many times as its frequency. This is fine because queries are
usually much shorter than documents.
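
As a rough illustration of the term-repetition trick for queries, here is a
minimal sketch in plain Java. The class `RepeatedTermQuery`, the method
`expand`, and the `scale` parameter are hypothetical names for this example
and are not part of Lucene; rounding a scaled weight to an integer frequency
is an assumption about how the weights get quantized.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper for illustration only; not a Lucene class. */
final class RepeatedTermQuery {

    /**
     * Repeats each term round(weight * scale) times and joins the result
     * with spaces. Terms whose scaled weight rounds to zero are dropped.
     */
    static String expand(Map<String, Float> weights, float scale) {
        StringJoiner out = new StringJoiner(" ");
        for (Map.Entry<String, Float> e : weights.entrySet()) {
            int freq = Math.round(e.getValue() * scale); // assumed quantization
            for (int i = 0; i < freq; i++) {
                out.add(e.getKey());
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Float> query = new LinkedHashMap<>();
        query.put("sparse", 2.0f);
        query.put("retrieval", 1.0f);
        System.out.println(expand(query, 1.0f)); // prints "sparse sparse retrieval"
    }
}
```

The expanded string can then go through an ordinary analyzer, so the repeated
tokens give the term frequency. The output grows with the sum of the weights,
which is why this is only practical for short inputs such as queries.
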
Yes, FeatureField is something similar, but we want a single field
containing a list of key-value pairs, or a JSON-formatted string.
@msokolov @rmuir @mocobeta: I found
[this](https://github.com/apache/lucene/blob/475fbd0bdde31c6a2ae62c59505cf9e8becd50e4/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/DelimitedTermFrequencyTokenFilter.java),
which could achieve what we want, but I think it is not very flexible: we
need to turn the JSON file into a token stream formatted as
[<term><delimiter><frequency> <term><delimiter><frequency> ...]. I think this
step is redundant. Can we just load the JSON file directly? For that, I think
we might have to move away from the TokenStream pipeline.
What do you think? Your thoughts are very much appreciated, as I am not very
familiar with Lucene.
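
To make the conversion step concrete, here is a minimal sketch of turning an
already-parsed term/frequency dictionary into the
`<term><delimiter><frequency>` string form. It uses only the JDK. The class
`DelimitedTermFreqFormatter` is a hypothetical name, JSON parsing is assumed
to have happened elsewhere, and `|` is used on the assumption that it matches
the delimiter the filter is configured with.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper for illustration only; not a Lucene class. */
final class DelimitedTermFreqFormatter {

    /** Assumed delimiter; it must match the one configured on the filter. */
    static final char DELIMITER = '|';

    /** Formats {term -> frequency} as "term|freq term|freq ...". */
    static String format(Map<String, Integer> termFreqs) {
        StringJoiner out = new StringJoiner(" ");
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            String term = e.getKey();
            int freq = e.getValue();
            if (freq < 1) {
                continue; // a term with no occurrences is simply omitted
            }
            // A term containing the delimiter or whitespace would be split
            // incorrectly downstream, so reject it up front.
            if (term.indexOf(DELIMITER) >= 0
                    || term.chars().anyMatch(Character::isWhitespace)) {
                throw new IllegalArgumentException("Unsupported term: " + term);
            }
            out.add(term + DELIMITER + freq);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> doc = new LinkedHashMap<>();
        doc.put("lucene", 3);
        doc.put("index", 1);
        System.out.println(format(doc)); // prints "lucene|3 index|1"
    }
}
```

This is exactly the intermediate string that feels redundant: the dictionary
is serialized to text only to be tokenized and parsed back into terms and
frequencies.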

We can form a group to work on this if you are interested.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

