sam-herman commented on issue #15420:
URL: https://github.com/apache/lucene/issues/15420#issuecomment-3533499673
> In general, if you want to use parallelism and therefore want to write graphs in parallel, there are multiple ways:
>
> As Robert said: use multiple threads to write documents. Every thread produces a separate segment. This works well with documents, but it also allows building the graph in parallel.
>
> If you want to parallelize writing when building a single segment, a codec could simply write the parts to different files. There's no need to seek within one file.
I want to clarify a few points where it seems we may be talking past each
other, especially regarding graph-based structures and why segments themselves
(not files) are the root of the challenge I'm trying to address.
**1. The core issue is the multiplicity of segments, not multiplicity of
files**
For graph/vector indices, the dominant cost during query execution is not file I/O but the independent traversal of each segment-local graph. An HNSW-style query typically costs O(log N) over the graph in a single segment. But when the same document population is spread across K segments, we effectively incur: `K × log(N/K)`
This becomes linear in K and forces multi-segment search to rely on CPU parallelism rather than algorithmic efficiency. Even when each segment is small, we pay the per-segment entry-point overhead and a separate graph traversal for every segment.
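A toy cost model may make this concrete. This is an illustrative sketch of the asymptotics described above, not a benchmark; the node-visit counts are from the idealized `log N` vs. `K × log(N/K)` expressions, not from measured HNSW traversals:

```python
import math

# Idealized cost model: an HNSW-style search visits roughly O(log N)
# nodes in one graph. Splitting the same N documents across K
# segment-local graphs means every query traverses every graph,
# costing roughly K * log(N / K) visits in total.
def single_graph_cost(n: int) -> float:
    return math.log2(n)

def multi_segment_cost(n: int, k: int) -> float:
    return k * math.log2(n / k)

n = 1_000_000
print(f" 1 segment:  ~{single_graph_cost(n):.0f} node visits")
for k in (4, 16, 64):
    print(f"{k:2d} segments: ~{multi_segment_cost(n, k):.0f} node visits")
```

The per-segment term shrinks only logarithmically while the K factor grows linearly, which is why adding segments quickly dominates the query cost.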
So while parallelism can certainly be applied at the file level, or at the
per-thread writer level, it does not address the fact that each segment is
itself a standalone graph that must be searched in full. Ultimately, segment
count — not file count — is the scalability bottleneck for graph query
workloads. This is the main reason why I'm exploring ways to keep the number of
graph-active segments small.
**2. Why multi-file writes still cannot form a single logical graph**
Even if we shard the graph across multiple files, Lucene still treats all
files belonging to a segment as one isolated search universe. There is
currently:
• no ability to share graph entry points across segments,
• no cross-segment edges,
• no global routing layer,
• no concept of a segment-spanning graph structure.
Advanced graph layouts (e.g., coarse-to-fine partitioned graphs like in
https://dl.acm.org/doi/10.1145/3600006.3613166) fundamentally rely on a single
logical graph with a global routing layer. Under Lucene’s model, the moment
data is split into segments, those partitions cannot cooperate as one graph —
even if they physically reside in multiple files.
So multi-file writing improves I/O performance, but it does not solve the
architectural constraint.
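To illustrate the isolation described above, here is a minimal sketch (not Lucene's actual API; the segment layout and brute-force "traversal" stand in for real graph searches) of why per-segment graphs only meet at result-merge time:

```python
import heapq

# Hypothetical sketch: each segment owns an isolated graph, so a query
# must run one full search per segment; the per-segment top-k lists are
# only combined afterwards. No entry points or edges cross segments.
def search_segment(segment, query, k):
    # Stand-in for a segment-local graph traversal (brute-force scoring
    # over (doc_id, value) pairs; distance = |value - query|).
    scored = [(abs(value - query), doc_id) for doc_id, value in segment]
    return heapq.nsmallest(k, scored)

def search_all_segments(segments, query, k):
    # Every segment is searched in full, regardless of how promising it
    # is; the searches cannot share routing state.
    candidates = []
    for segment in segments:
        candidates.extend(search_segment(segment, query, k))
    return heapq.nsmallest(k, candidates)

segments = [
    [(0, 1.0), (1, 5.0)],
    [(2, 2.5), (3, 9.0)],
    [(4, 3.0), (5, 2.9)],
]
print(search_all_segments(segments, query=3.0, k=2))
```

A single logical graph with a global routing layer could instead steer the query toward the best partition up front; under the per-segment model there is no place for that routing state to live.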
> In earlier versions of Lucene there was sometimes the need to write the header (with sizes) of a file at the end. Because this does not work with checksums, and because the general writing of index files was otherwise sequential, the IndexOutput's seek methods were removed. The Lucene 4+ codecs use the same pattern as said before: split index files. E.g., for CFS there's now a CFE file with additional information which cannot be written without seeking. The same applies to many other index codecs: they have multiple files for the same piece of information. The same applies to graphs: you can split a file containing the graph if it improves writing speed by working in parallel.
Thanks for the context. The motivation this time, however, is different: it is not header rewriting, but enabling parallel construction of a single large graph segment, so that fewer graphs need to be queried, without paying the I/O cost of a full segment rebuild during merges.