jpountz commented on issue #13675: URL: https://github.com/apache/lucene/issues/13675#issuecomment-2303000074
I found this recent paper by well-known people in the IR efficiency space quite interesting: https://arxiv.org/pdf/2405.01117. It builds on inverted indexes and simple, intuitive ideas:
 - BP reordering, which Ben alluded to and which Lucene already supports; it naturally clusters documents with similar terms together.
 - Block-max WAND, which Lucene supports.
 - Anytime ranking on document-ordered indexes (https://arxiv.org/pdf/2104.08976), i.e. ranking ranges of doc IDs that have the best impact scores first in order to optimize pruning. This is something Lucene doesn't support at the moment, but it looks doable and generally useful.
 - Unsafe top-k search via termination conditions and skipping blocks that are barely more competitive than the current top-k-th hit.
 - Query term pruning, which sounds like a good idea in general for learned sparse retrieval when the model generates many terms.
 - Scoring high-frequency / low-scoring terms via a forward index instead of an inverted index.
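
To make the range-ordered ranking and unsafe skipping ideas above more concrete, here is a minimal Python sketch over a toy in-memory index. It is not Lucene code and not the paper's implementation: the function `anytime_top_k`, the `postings` layout (term to doc ID to impact), the fixed `range_size`, and the `theta` over-estimation factor are all assumptions made for illustration. Each doc ID range gets an upper bound equal to the sum of the per-term maximum impacts in that range, ranges are visited from the highest bound down, and the search stops once a bound is no better than `theta` times the current k-th score.

```python
import heapq
from collections import defaultdict


def anytime_top_k(postings, query_terms, range_size, k, theta=1.0):
    """Toy range-ordered top-k search (hypothetical helper, not a Lucene API).

    postings: dict mapping term -> {doc_id: impact}, impacts assumed positive.
    query_terms: distinct query terms.
    theta: 1.0 keeps the search exact; values above 1.0 also skip ranges that
           are only barely competitive, which may drop true top-k hits.
    """
    # Upper bound per range: sum over query terms of the max impact in the range.
    range_bound = defaultdict(float)
    for term in query_terms:
        term_max = defaultdict(float)
        for doc_id, impact in postings.get(term, {}).items():
            r = doc_id // range_size
            term_max[r] = max(term_max[r], impact)
        for r, best in term_max.items():
            range_bound[r] += best

    heap = []  # min-heap of (score, doc_id) holding the current top-k
    # Visit the most promising ranges first.
    for r, bound in sorted(range_bound.items(), key=lambda kv: (-kv[1], kv[0])):
        threshold = heap[0][0] if len(heap) == k else float("-inf")
        if bound <= theta * threshold:
            # Bounds are sorted in decreasing order and the threshold never
            # decreases, so no later range can qualify either.
            break
        lo, hi = r * range_size, (r + 1) * range_size
        scores = defaultdict(float)
        for term in query_terms:
            for doc_id, impact in postings.get(term, {}).items():
                if lo <= doc_id < hi:
                    scores[doc_id] += impact
        for doc_id, score in scores.items():
            if len(heap) < k:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, key=lambda hit: (-hit[0], hit[1]))


# Toy data: docs 4 and 6 each match one term, doc 9 matches both.
postings = {
    "a": {4: 3.0, 9: 2.0},
    "b": {6: 2.5, 9: 1.5},
}
print(anytime_top_k(postings, ["a", "b"], range_size=4, k=1))             # [(3.5, 9)]
print(anytime_top_k(postings, ["a", "b"], range_size=4, k=1, theta=1.2))  # [(3.0, 4)]
```

With `theta=1.0` the range holding doc 9 is still visited because its bound (3.5) beats the current best score (3.0). With `theta=1.2` that range is skipped as only barely competitive, so the result is cheaper to compute but no longer the true top hit. A real implementation would read the per-range maxima from block-max metadata rather than rescanning postings as this sketch does.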