jpountz commented on issue #12696: URL: https://github.com/apache/lucene/issues/12696#issuecomment-1779221543
For reference, Lucene used FOR for postings and PFOR for positions in 8.x. This changed in 9.0 via #69, which switched to PFOR for both postings and positions. That PR reported a 3% smaller index with no performance impact, but I can believe that we are noticing an impact now, as many things have changed in the meantime.

I'm +1 to switching back to FOR if it yields better performance. My preference is to keep PFOR for positions and only move postings to FOR (essentially reverting #69). The benchmark in this issue description used wikimedium, which by design doesn't have much position data since all documents are truncated. Using PFOR for positions and FOR for postings sounds like a good trade-off to me, as positions are typically less important for performance. And if someone wants better performance for their phrase queries, it would likely be a better idea to use a `CommonGramsFilter` than to switch positions from PFOR to FOR.

I remember we observed a [15% reduction](https://twitter.com/jpountz/status/1486608300905009156) in the size of our inverted indexes for logs at Elastic when Lucene moved from FOR to PFOR, but I don't think it should block this change: Elasticsearch can maintain its own postings format that uses PFOR for postings. I'm just mentioning it to highlight that I expect some users will see disk usage increase by more than 3%.

Regarding backward compatibility, let's do it with codecs as usual: fork `Lucene90PostingsFormat` into a new `Lucene99PostingsFormat` that uses FOR for postings. Then the codec infrastructure will make sure to keep using the old postings format for existing segments and the new postings format for new segments (including merged ones).
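
To make the space trade-off between FOR and PFOR concrete, here is a minimal sketch that only counts bits per block. It is not Lucene's implementation: the function names are hypothetical, and the cost model (at most 7 exceptions per block, each costing 16 bits, with the patched high bits fitting in one byte) is an assumption chosen for illustration.

```python
from typing import Sequence


def bits_required(value: int) -> int:
    """Bits needed to store a non-negative integer (at least 1)."""
    return max(1, value.bit_length())


def for_block_bits(block: Sequence[int]) -> int:
    """FOR: every value is packed using the width of the largest value."""
    width = max(bits_required(v) for v in block)
    return width * len(block)


def pfor_block_bits(block: Sequence[int], max_exceptions: int = 7) -> int:
    """Simplified PFOR: pack with a narrower width and patch the outliers.

    Assumed cost model (illustrative only): each exception costs 16 bits,
    one byte for its index in the block and one byte for its high bits.
    """
    max_width = max(bits_required(v) for v in block)
    best = for_block_bits(block)  # falling back to plain FOR is always allowed
    for width in range(1, max_width):
        if max_width - width > 8:
            continue  # high bits of an exception would not fit in one byte
        exceptions = sum(1 for v in block if bits_required(v) > width)
        if exceptions <= max_exceptions:
            best = min(best, width * len(block) + 16 * exceptions)
    return best


# A block of 128 small deltas with two outliers.
block = [3] * 126 + [200, 200]
print(for_block_bits(block))   # 1024: the outliers force 8 bits per value
print(pfor_block_bits(block))  # 288: 2 bits per value plus two patched exceptions
```

The saving comes entirely from blocks with a few outliers; on a uniform block both functions return the same size, while PFOR decoding still has to check for exceptions and patch them when present.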