martijnvg opened a new issue, #14881:
URL: https://github.com/apache/lucene/issues/14881

   ### Description
   
   Today, when index sorting is enabled and stored fields are flushed, 
`SortingStoredFieldsConsumer` is used to write the stored fields in the 
configured index sort order. This class writes temp files to disk, which are 
then read completely twice. The first read is an integrity check. The second 
read accesses the temp files in random order, so that the stored fields can be 
written to the new segment in the order defined by the index sort. 
   
   During heavy indexing, reading the stored field temp files twice is 
expensive, especially given that these temp files are removed once flushing 
has completed. In other formats (postings, BKD tree, quantized vectors), temp 
files created during writing seem to be read only once. During that read, 
integrity is checked either using `Directory#openChecksumInput()` (only 
possible if the temp file is read from beginning to end) or with a footer 
check (reads and validates the CRC, footer magic and algorithm ID).
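
   To make the "verify while reading once" idea concrete, here is a minimal sketch using only the JDK, not Lucene's actual code. The class name `SinglePassChecksumSketch`, its methods, and the file layout (payload followed by an 8-byte CRC32 value) are assumptions made for illustration; Lucene's real footer also carries a magic value and an algorithm ID.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

// Hypothetical illustration: checksum the bytes as they are consumed,
// instead of doing a separate full pass over the file first.
public class SinglePassChecksumSketch {

  /** Appends an 8-byte CRC32 footer (assumed layout, simpler than Lucene's footer). */
  public static byte[] withFooter(byte[] payload) {
    CRC32 crc = new CRC32();
    crc.update(payload, 0, payload.length);
    return ByteBuffer.allocate(payload.length + Long.BYTES)
        .put(payload)
        .putLong(crc.getValue()) // big-endian, matches DataInputStream.readLong
        .array();
  }

  /** Reads the payload once from start to end and verifies the CRC at the end. */
  public static byte[] readAndVerify(byte[] file) throws IOException {
    int payloadLength = file.length - Long.BYTES;
    if (payloadLength < 0) {
      throw new IOException("file too short to contain a footer");
    }
    CRC32 crc = new CRC32();
    // CheckedInputStream updates the CRC with every byte that is read through it.
    DataInputStream in =
        new DataInputStream(new CheckedInputStream(new ByteArrayInputStream(file), crc));
    byte[] payload = new byte[payloadLength];
    in.readFully(payload);
    long actual = crc.getValue(); // capture before the footer bytes are consumed
    long expected = in.readLong();
    if (actual != expected) {
      throw new IOException("checksum mismatch: expected=" + expected + " actual=" + actual);
    }
    return payload;
  }

  public static void main(String[] args) throws IOException {
    byte[] file = withFooter("stored fields".getBytes(StandardCharsets.UTF_8));
    byte[] payload = readAndVerify(file);
    System.out.println("verified " + payload.length + " payload bytes in a single pass");
  }
}
```

   This only works because the payload is consumed sequentially from the first byte to the last. A reader that jumps around in the file, as the sorted stored fields pass does, cannot accumulate the checksum this way, which is why the choice there is between a separate full pass and a footer-only check.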
   
   I wonder whether it makes sense to remove the separate full [integrity 
check](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/SortingStoredFieldsConsumer.java#L106)
 in `SortingStoredFieldsConsumer`. This check can be quite costly, especially 
for the temp fdt file, and there is already some light integrity checking via 
`CodecUtil.checkFooter(...)` in 
`Lucene90CompressingStoredFieldsReader`.


