Re: [I] [RFC] Add Random Access Write Support to IndexOutput [lucene]

via GitHub Fri, 14 Nov 2025 00:14:03 -0800


uschindler commented on issue #15420:
URL: https://github.com/apache/lucene/issues/15420#issuecomment-3531498532


   Hi,
   That's all fine if it fits your own ("we"?) search engine. But what you 
are proposing here is a complete change to the fundamentals of how a Lucene 
index works. Phase 3 in your last comment in particular is a no-go for 
indexes, because "write-once" is a basic design principle of the 
transactional system behind Lucene. Lucene supports snapshotting and also 
requires that an IndexReader which is already open does not change its 
contents unless it is reloaded/refreshed to the latest commit point. 
Therefore writing to existing files is a no-go. There is not much more to say.
   
   If you are arguing "but we want updates to existing graphs": yes, that's 
possible. Look at how it works with index deletes or doc values, which are 
updatable. The pattern behind all those features is "delta files": you don't 
start a new segment; instead, the changes to the current segment (e.g., 
deletes or updated doc values) are written to a separate file. The separate 
files disappear during merging.
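
   A minimal sketch of that delta-file idea, using only the JDK. The class 
name, the `.del` extension, and the naming scheme are hypothetical and are 
not Lucene's API or on-disk format. Each generation of deletes goes to a new 
file, so a reader pinned to an older generation never sees its files change:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.BitSet;

/** Hypothetical sketch of the delta-file pattern; not Lucene's API or file format. */
final class DeltaFileSketch {

  /** Hypothetical naming scheme: segment name, generation, made-up ".del" extension. */
  static Path deltaFile(Path dir, String segment, long generation) {
    return dir.resolve(segment + "_" + generation + ".del");
  }

  /** Writes one generation of deleted doc ids to a NEW file; existing files are never reopened. */
  static Path writeDeletes(Path dir, String segment, long generation, BitSet deleted)
      throws IOException {
    // CREATE_NEW fails if the file already exists, which enforces write-once.
    return Files.write(
        deltaFile(dir, segment, generation),
        deleted.toByteArray(),
        StandardOpenOption.CREATE_NEW,
        StandardOpenOption.WRITE);
  }

  /** A reader pinned to a generation only ever reads that generation's delta file. */
  static BitSet readDeletes(Path dir, String segment, long generation) throws IOException {
    return BitSet.valueOf(Files.readAllBytes(deltaFile(dir, segment, generation)));
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("delta-sketch");
    BitSet deleted = new BitSet();
    deleted.set(3);
    writeDeletes(dir, "_0", 1, deleted); // generation 1: doc 3 deleted
    deleted.set(7);
    writeDeletes(dir, "_0", 2, deleted); // generation 2: docs 3 and 7 deleted
    System.out.println("gen 1 still sees: " + readDeletes(dir, "_0", 1));
    System.out.println("gen 2 sees:       " + readDeletes(dir, "_0", 2));
  }
}
```

   In this sketch a merge would simply write a fresh segment without the 
deleted documents and drop all `.del` generations of the old one.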
   
   For phase 1 and phase 2, no random access writes are needed, although 
that is a separate discussion point that does not fit in this issue.
   
   In general, if you want to use parallelism and therefore want to write 
graphs in parallel, there are multiple ways:
   - As Robert said: use multiple threads to write documents. Every thread 
produces a separate segment. This works well with documents, and it also 
allows building the graph in parallel.
   - If you want to parallelize writing when building a single segment, a codec 
could simply write the parts to different files. There's no need to seek 
*within* one file.
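
   The second option could look roughly like the following JDK-only sketch. 
The class name, the `.partN` file naming, and the pool size are assumptions 
made for illustration; this is not how any existing Lucene codec is 
implemented. Each task writes its own file strictly sequentially, so no 
stream ever needs to seek:

```java
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: one writer task per part file, each written append-only. */
final class ParallelPartWriterSketch {

  /** Writes every part to its own new file in parallel and returns the files in part order. */
  static List<Path> writeParts(Path dir, String segment, List<byte[]> parts) throws Exception {
    // Pool size of at most 4 is an arbitrary choice for this sketch.
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, Math.min(parts.size(), 4)));
    try {
      List<Future<Path>> pending = new ArrayList<>();
      for (int i = 0; i < parts.size(); i++) {
        final int part = i;
        pending.add(pool.submit(() -> {
          Path file = dir.resolve(segment + ".part" + part); // hypothetical naming scheme
          try (OutputStream out = Files.newOutputStream(
              file, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
            out.write(parts.get(part)); // sequential write, no seeking
          }
          return file;
        }));
      }
      List<Path> written = new ArrayList<>();
      for (Future<Path> f : pending) {
        written.add(f.get()); // rethrows a failed write as ExecutionException
      }
      return written;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    Path dir = Files.createTempDirectory("parts-sketch");
    List<byte[]> parts = Arrays.asList(new byte[] {1}, new byte[] {2, 3}, new byte[] {4, 5, 6});
    System.out.println(writeParts(dir, "_0", parts));
  }
}
```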
   
   > Can you share the reasoning why it was removed from earlier versions? 
Where these benefits been considered back then? Would be great to have data 
driven discussion around those aspects to weigh cons/pros outside of 
idiomatic/syntactic preferences.
   
   In earlier versions of Lucene there was sometimes the need to write the 
header (with sizes) of a file at the end. Because this does not work with 
checksums and because the writing of index files was otherwise sequential, 
the IndexOutput's seek methods were removed. The Lucene 4+ codecs use the 
same pattern as described above: split index files. E.g., for CFS there's 
now a CFE file holding the additional information that could not be written 
into the CFS file without seeking. The same applies to many other index 
codecs: they have multiple files for the same piece of information. The same 
applies to graphs: you can split a file containing the graph if that 
improves writing speed by working in parallel.
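
   To illustrate why splitting removes the need to seek, here is a small 
JDK-only sketch. The class name, the file layout, and the use of CRC32 are 
assumptions for illustration and are not Lucene's actual formats. The data 
file is written front to back with a running checksum appended as a footer, 
and the sizes/offsets that used to require seeking back to a header go into 
a second file once they are known:

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;
import java.util.zip.CheckedOutputStream;

/** Hypothetical sketch: data file plus separate metadata file, both written sequentially. */
final class SplitFileSketch {

  /** Writes blocks to dataFile and their start offsets to metaFile; returns the data checksum. */
  static long write(Path dataFile, Path metaFile, List<byte[]> blocks) throws IOException {
    long[] offsets = new long[blocks.size()];
    long checksum;
    try (CheckedOutputStream checked =
            new CheckedOutputStream(Files.newOutputStream(dataFile), new CRC32());
        DataOutputStream data = new DataOutputStream(checked)) {
      for (int i = 0; i < blocks.size(); i++) {
        offsets[i] = data.size(); // offset is only known while writing (int-sized in this sketch)
        data.write(blocks.get(i));
      }
      // The checksum covers every byte written so far; it is appended as a footer.
      // Seeking back to patch a header would invalidate such a running checksum.
      checksum = checked.getChecksum().getValue();
      data.writeLong(checksum);
    }
    // The offsets go to a separate file instead of a header that is patched afterwards.
    try (DataOutputStream meta = new DataOutputStream(Files.newOutputStream(metaFile))) {
      meta.writeInt(offsets.length);
      for (long offset : offsets) {
        meta.writeLong(offset);
      }
    }
    return checksum;
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("split-sketch");
    List<byte[]> blocks = Arrays.asList(new byte[] {1, 2, 3}, new byte[] {4, 5});
    long crc = write(dir.resolve("graph.dat"), dir.resolve("graph.meta"), blocks);
    System.out.println("data checksum: " + Long.toHexString(crc));
  }
}
```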
   
   In summary: There's no need to allow random access for writing.
   
   Uwe


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

