jpountz commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1779221543

   For reference, Lucene used to use FOR for postings and PFOR for positions in 
8.x. This was changed in 9.0 via #69 to use PFOR for both postings and 
positions. That PR reported a 3% smaller index with no performance 
impact, but I can believe that we are noticing an impact now, as many things 
have changed in the meantime. I'm +1 to switching back to FOR if it yields better 
performance.
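   
   To make the size side of this trade-off concrete, here is a minimal sketch 
that estimates the packed size of one block under FOR and under a simplified 
PFOR model. It is not Lucene's actual encoder: the class name, the cost of 16 
bits per exception, the one-byte limit on an exception's high bits, and the cap 
of 7 exceptions per block are all assumptions made for illustration.

```java
import java.util.Arrays;

// Illustrative size model only; NOT Lucene's ForUtil/PForUtil implementation.
final class ForVsPforSketch {

    // Number of bits needed to represent a non-negative value.
    static int bitsRequired(long value) {
        return value == 0 ? 0 : 64 - Long.numberOfLeadingZeros(value);
    }

    // Bit width of the largest value in the block (values assumed non-negative).
    static int maxBitsRequired(long[] block) {
        long or = 0;
        for (long v : block) {
            or |= v;
        }
        return bitsRequired(or);
    }

    // FOR: every value is packed with the width of the largest value.
    static int forBits(long[] block) {
        return maxBitsRequired(block) * block.length;
    }

    // Simplified PFOR: pick the narrowest width such that at most maxExceptions
    // values overflow it. Assumption: each exception costs 16 extra bits
    // (8 for its index in the block, 8 for its high bits).
    static int pforBits(long[] block, int maxExceptions) {
        int maxWidth = maxBitsRequired(block);
        int best = maxWidth * block.length; // falls back to plain FOR
        for (int width = 0; width < maxWidth; width++) {
            int exceptions = 0;
            boolean highBitsFit = true;
            for (long v : block) {
                int bits = bitsRequired(v);
                if (bits > width) {
                    exceptions++;
                    if (bits - width > 8) {
                        highBitsFit = false; // high bits would not fit in one byte
                    }
                }
            }
            if (highBitsFit && exceptions <= maxExceptions) {
                best = Math.min(best, width * block.length + exceptions * 16);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        long[] block = new long[128];
        Arrays.fill(block, 3L); // 127 small deltas
        block[17] = 200L;       // one outlier
        System.out.println("FOR bits:  " + forBits(block));
        System.out.println("PFOR bits: " + pforBits(block, 7));
    }
}
```

   With 127 small values and a single outlier, FOR pays 8 bits for every value 
(1024 bits) while this PFOR model pays 2 bits per value plus one exception (272 
bits). The price is that exceptions have to be patched back in at decode time, 
which is where a performance difference can come from.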
   
   I have a preference for keeping PFOR for positions and only moving postings 
to FOR (essentially reverting #69). The benchmark in this issue description 
used wikimedium, which by design doesn't have much position data since all 
documents are truncated. Using PFOR for positions and FOR for postings sounds 
like a good trade-off to me, as positions typically matter less for 
performance. And if someone wants better performance for their phrase queries, it 
would likely be a better idea to use a `CommonGramsFilter` than to switch 
positions from PFOR to FOR?
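   
   As a rough illustration of why common grams help phrase queries, the sketch 
below models the idea in plain Java: whenever one token of an adjacent pair is 
a common word, an extra bigram token is emitted, so a phrase such as "the 
quick" can be matched through the much rarer term `the_quick`. This is a 
simplified model, not the real `CommonGramsFilter`, which works on a 
`TokenStream` and also manages position increments; the class and method names 
here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Simplified model of the common-grams idea; NOT Lucene's CommonGramsFilter.
final class CommonGramsSketch {

    // Keeps every unigram and adds "a_b" when a or b is a common word.
    static List<String> commonGrams(List<String> tokens, Set<String> commonWords) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String current = tokens.get(i);
            out.add(current);
            if (i + 1 < tokens.size()) {
                String next = tokens.get(i + 1);
                if (commonWords.contains(current) || commonWords.contains(next)) {
                    out.add(current + "_" + next);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("the", "quick", "brown", "fox");
        // prints [the, the_quick, quick, brown, fox]
        System.out.println(commonGrams(tokens, Set.of("the")));
    }
}
```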
   
   I remember we observed a [15% 
reduction](https://twitter.com/jpountz/status/1486608300905009156) in the size 
of our inverted indexes for logs at Elastic when Lucene moved from FOR to PFOR, 
but I don't think it should block this change: Elasticsearch can maintain its 
own postings format that uses PFOR for postings. I'm just mentioning it to 
highlight that I expect some users will observe a disk usage increase of more 
than 3%.
   
   Regarding backward compatibility, let's do it with codecs as usual: fork 
`Lucene90PostingsFormat` into a new `Lucene99PostingsFormat` that uses FOR for 
postings. The codec infrastructure will then keep using the old 
postings format for existing segments and the new postings format for new 
segments (including merged ones).
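   
   The mechanism can be pictured with a toy model: each segment records the 
name of the postings format it was written with, readers resolve that name 
back to a format, and writers always use the current default. The sketch below 
is only an illustration under those assumptions; the class, the registry map 
and the description strings are hypothetical and are not Lucene's codec API.

```java
import java.util.Map;

// Toy model of name-based format resolution; NOT Lucene's codec API.
final class PostingsFormatLookupSketch {

    // Hypothetical registry: format name -> what the format encodes with.
    static final Map<String, String> FORMATS = Map.of(
        "Lucene90", "PFOR postings, PFOR positions",
        "Lucene99", "FOR postings, PFOR positions");

    // New segments (flushes and merges) are always written with the default.
    static final String WRITE_FORMAT = "Lucene99";

    static String formatNameForNewSegment() {
        return WRITE_FORMAT;
    }

    // Existing segments are read with whatever format name they recorded.
    static String resolve(String recordedFormatName) {
        String encoding = FORMATS.get(recordedFormatName);
        if (encoding == null) {
            throw new IllegalArgumentException(
                "Unknown postings format: " + recordedFormatName);
        }
        return encoding;
    }

    public static void main(String[] args) {
        System.out.println("old segment: " + resolve("Lucene90"));
        System.out.println("new segment: " + resolve(formatNameForNewSegment()));
    }
}
```

   In this model an old segment keeps being decoded with the format it was 
written with until a merge rewrites it, at which point the merged segment 
records the new format name.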


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

