sam-herman commented on issue #15420:
URL: https://github.com/apache/lucene/issues/15420#issuecomment-3533499673
> In general, if you want to use parallelism and therefore want to write graphs in parallel, there are multiple ways:
>
> As Robert said: use multiple threads to write documents. Every thread produces a separate segment. This works well with documents, but it also allows building the graph in parallel.
>
> If you want to parallelize writing when building a single segment, a codec could simply write the parts to different files. There's no need to seek within one file.
I want to clarify a few points where it seems we may be talking past each
other, especially regarding graph-based structures and why segments themselves
(not files) are the root of the challenge I'm trying to address.
**1. The core issue is the multiplicity of segments, not multiplicity of
files**
For graph/vector indices, the dominant cost during query execution is not file I/O but the independent traversal of each segment-local graph. An HNSW-style query typically costs O(log N) over the graph in a single segment. But when the same document population is spread across K segments, we effectively incur: `K × log(N/K)`
This becomes linear in K and forces multi-segment search to rely on CPU parallelism rather than algorithmic efficiency. Even when each segment is small, we pay the per-segment entry-point overhead and a separate graph traversal for every segment.
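A toy cost model may make this concrete. This is an illustrative sketch of the asymptotics described above, not a benchmark; the node-visit counts are from the idealized `log N` vs. `K × log(N/K)` expressions, not from measured HNSW traversals:

```python
import math

# Idealized cost model: an HNSW-style search visits roughly O(log N)
# nodes in one graph. Splitting the same N documents across K
# segment-local graphs means every query traverses every graph,
# costing roughly K * log(N / K) visits in total.
def single_graph_cost(n: int) -> float:
    return math.log2(n)

def multi_segment_cost(n: int, k: int) -> float:
    return k * math.log2(n / k)

n = 1_000_000
print(f" 1 segment:  ~{single_graph_cost(n):.0f} node visits")
for k in (4, 16, 64):
    print(f"{k:2d} segments: ~{multi_segment_cost(n, k):.0f} node visits")
```

The per-segment term shrinks only logarithmically while the K factor grows linearly, which is why adding segments quickly dominates the query cost.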
So while parallelism can certainly be applied at the file level, or at the
per-thread writer level, it does not address the fact that each segment is
itself a standalone graph that must be searched in full. Ultimately, segment
count — not file count — is the scalability bottleneck for graph query
workloads. This is the main reason why I'm exploring ways to keep the number of
graph-active segments small.
**2. Why multi-file writes still cannot form a single logical graph**
Even if we shard the graph across multiple files, Lucene still treats all
files belonging to a segment as one isolated search universe. There is
currently:
• no ability to share graph entry points across segments,
• no cross-segment edges,
• no global routing layer,
• no concept of a segment-spanning graph structure.
Advanced graph layouts (e.g., coarse-to-fine partitioned graphs like in
https://dl.acm.org/doi/10.1145/3600006.3613166) fundamentally rely on a single
logical graph with a global routing layer. Under Lucene’s model, the moment
data is split into segments, those partitions cannot cooperate as one graph —
even if they physically reside in multiple files.
So multi-file writing improves I/O performance, but it does not solve the
architectural constraint.
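To illustrate the isolation described above, here is a minimal sketch (not Lucene's actual API; the segment layout and brute-force "traversal" stand in for real graph searches) of why per-segment graphs only meet at result-merge time:

```python
import heapq

# Hypothetical sketch: each segment owns an isolated graph, so a query
# must run one full search per segment; the per-segment top-k lists are
# only combined afterwards. No entry points or edges cross segments.
def search_segment(segment, query, k):
    # Stand-in for a segment-local graph traversal (brute-force scoring
    # over (doc_id, value) pairs; distance = |value - query|).
    scored = [(abs(value - query), doc_id) for doc_id, value in segment]
    return heapq.nsmallest(k, scored)

def search_all_segments(segments, query, k):
    # Every segment is searched in full, regardless of how promising it
    # is; the searches cannot share routing state.
    candidates = []
    for segment in segments:
        candidates.extend(search_segment(segment, query, k))
    return heapq.nsmallest(k, candidates)

segments = [
    [(0, 1.0), (1, 5.0)],
    [(2, 2.5), (3, 9.0)],
    [(4, 3.0), (5, 2.9)],
]
print(search_all_segments(segments, query=3.0, k=2))
```

A single logical graph with a global routing layer could instead steer the query toward the best partition up front; under the per-segment model there is no place for that routing state to live.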
> In earlier versions of Lucene there was sometimes the need to write the header (with sizes) of a file at the end. Because this does not work with checksums, and because the general writing of index files was otherwise sequential, the IndexOutput's seek methods were removed. The Lucene 4+ codecs use the same pattern as said before: split index files. E.g., for CFS there's now a CFE file with additional information which cannot be written without seeking. The same applies to many other index codecs: they have multiple files for the same piece of information. The same applies to graphs: you can split a file containing the graph if it improves writing speed by working in parallel.
Thanks for the context. The motivation this time, however, is different: it is not header rewriting, but enabling parallel construction of a single large graph segment, so that fewer graphs need to be queried, without paying the I/O cost of a full segment rebuild during merges.