uschindler commented on issue #15420:
URL: https://github.com/apache/lucene/issues/15420#issuecomment-3531498532
Hi,
That's all fine if it fits your own ("we"?) search engine. But what you
are proposing here is a complete change to the fundamentals of how a Lucene
index works. Phase 3 in your last comment in particular is a no-go for
indexes, because "write-once" is a basic design principle of the
transactional system behind Lucene. Lucene supports snapshotting and also
requires that an IndexReader which is already open does not change its contents
unless it is reloaded/refreshed to the last commit point. Therefore writing to
existing files is a no-go. There is not much more to say.
If you are arguing "but we want updates to existing graphs": yes, that's
possible. Look at how it works with index deletes or doc values, which are
updatable. The pattern behind all those features is "delta files": you
don't start a new segment; instead, the changes to the current segment (e.g.,
deletes or updated doc values) are written to a separate file. These separate
files disappear during merging.
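To make the delta-file idea concrete, here is a minimal sketch using only the JDK. It is not Lucene's real file format or API: the class name, the `.del` extension, and the `segment_generation` naming are made up for illustration. The point is only that each change goes to a brand-new file, so a reader pinned to an older generation never sees its files change.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.BitSet;

/** Hypothetical sketch of write-once delta files; names and layout are assumptions. */
class DeltaFilesSketch {

    /** Writes the deleted doc ids of one generation to a brand-new file. */
    static Path writeDeletes(Path dir, String segment, long gen, BitSet deleted)
            throws IOException {
        Path file = dir.resolve(segment + "_" + gen + ".del");
        // CREATE_NEW fails if the file already exists, which enforces write-once.
        Files.write(file, deleted.toByteArray(),
                StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
        return file;
    }

    /** A reader pinned to a generation loads that delta file and nothing newer. */
    static BitSet openDeletes(Path dir, String segment, long gen) throws IOException {
        return BitSet.valueOf(Files.readAllBytes(dir.resolve(segment + "_" + gen + ".del")));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("delta-sketch");

        BitSet gen1 = new BitSet();
        gen1.set(3);
        writeDeletes(dir, "_0", 1, gen1);
        BitSet snapshot = openDeletes(dir, "_0", 1);  // reader opened at generation 1

        BitSet gen2 = (BitSet) gen1.clone();
        gen2.set(7);
        writeDeletes(dir, "_0", 2, gen2);             // new delta file, old one untouched

        System.out.println(snapshot.get(7));                   // false
        System.out.println(openDeletes(dir, "_0", 2).get(7));  // true
    }
}
```

In this sketch a merge would simply write a new segment without the deleted documents and drop the `.del` files along with the old segment.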
For phase 1 and phase 2, no random-access writes are needed, although
that is a separate discussion point that does not belong in this issue.
In general, if you want to use parallelism and therefore want to write graphs
in parallel, there are multiple ways:
- As Robert said: use multiple threads to write documents. Every thread
produces a separate segment. This works well for documents, and it also
allows building the graph in parallel.
- If you want to parallelize writing when building a single segment, a codec
could simply write the parts to different files. There's no need to seek
*within* one file.
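The second option can be sketched with plain JDK classes. This is not a Lucene codec and does not use Lucene APIs: the class name, the `graph.partN` file names, and the simple "count followed by neighbor ids" layout are assumptions for illustration. Each worker owns one file and only ever appends to it, so no seeking is involved.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: one graph, written in parallel as several sequential part files. */
class ParallelPartsSketch {

    /** Appends the neighbor lists of nodes [from, to) to a single part file. */
    static Path writePart(Path dir, int part, int[][] neighbors, int from, int to)
            throws IOException {
        Path file = dir.resolve("graph.part" + part);
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (int node = from; node < to; node++) {
                out.writeInt(neighbors[node].length);
                for (int neighbor : neighbors[node]) {
                    out.writeInt(neighbor);
                }
            }
        }
        return file;
    }

    /** Splits the node range into contiguous chunks and writes each chunk on its own thread. */
    static List<Path> writeInParallel(Path dir, int[][] neighbors, int parts) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            int chunk = (neighbors.length + parts - 1) / parts;
            List<Future<Path>> futures = new ArrayList<>();
            for (int p = 0; p < parts; p++) {
                final int part = p;
                final int from = Math.min(neighbors.length, p * chunk);
                final int to = Math.min(neighbors.length, from + chunk);
                futures.add(pool.submit(() -> writePart(dir, part, neighbors, from, to)));
            }
            List<Path> files = new ArrayList<>();
            for (Future<Path> future : futures) {
                files.add(future.get());  // rethrows any failure from the worker
            }
            return files;
        } finally {
            pool.shutdown();
        }
    }
}
```

A reader would need to know which part holds which node range; in this sketch that follows from the chunk size, but a small metadata file would work equally well.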
> Can you share the reasoning why it was removed from earlier versions?
> Where these benefits been considered back then? Would be great to have data
> driven discussion around those aspects to weigh cons/pros outside of
> idiomatic/syntactic preferences.
In earlier versions of Lucene there was sometimes a need to write the
header (with sizes) of a file at the end. Because this does not work with
checksums and because index files were otherwise written sequentially anyway,
the IndexOutput's seek methods were removed. The Lucene 4+ codecs use the same
pattern as described above: split index files. E.g., for CFS there's now a CFE
file holding the additional information which could not be written into the
CFS file without seeking. The same applies to many other index codecs: they
have multiple files for the same piece of information. The same applies to
graphs: you can split the file containing the graph if that improves writing
speed by working in parallel.
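As a rough illustration of why a split file removes the need to seek, here is a JDK-only sketch. It is not the actual CFS/CFE format and uses no Lucene classes; the class name and the layout (raw blocks plus an 8-byte CRC32 footer in the data file, block offsets in a second file) are assumptions. The data file is written strictly front to back with a running checksum, and the offsets, which are only known once the data is written, go into the separate entries file.

```java
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;
import java.util.zip.CheckedOutputStream;

/** Hypothetical sketch: sequential data file with checksum footer plus an entries file. */
class SplitFilesSketch {

    /** Appends all blocks to dataFile in one pass, then writes their offsets to entriesFile. */
    static void write(Path dataFile, Path entriesFile, byte[][] blocks) throws IOException {
        long[] offsets = new long[blocks.length];
        CRC32 crc = new CRC32();
        try (DataOutputStream out = new DataOutputStream(new CheckedOutputStream(
                new BufferedOutputStream(Files.newOutputStream(dataFile)), crc))) {
            for (int i = 0; i < blocks.length; i++) {
                offsets[i] = out.size();  // bytes written so far = start of this block
                out.write(blocks[i], 0, blocks[i].length);
            }
            // Footer: checksum of everything before it; no seek back to a header.
            out.writeLong(crc.getValue());
        }
        try (DataOutputStream meta = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(entriesFile)))) {
            meta.writeInt(offsets.length);
            for (long offset : offsets) {
                meta.writeLong(offset);
            }
        }
    }

    /** Reads the block offsets back from the entries file. */
    static long[] readOffsets(Path entriesFile) throws IOException {
        try (DataInputStream in = new DataInputStream(Files.newInputStream(entriesFile))) {
            long[] offsets = new long[in.readInt()];
            for (int i = 0; i < offsets.length; i++) {
                offsets[i] = in.readLong();
            }
            return offsets;
        }
    }

    /** Recomputes the CRC32 over the data and compares it with the 8-byte footer. */
    static boolean verify(Path dataFile) throws IOException {
        byte[] all = Files.readAllBytes(dataFile);
        CRC32 crc = new CRC32();
        crc.update(all, 0, all.length - 8);
        return crc.getValue() == ByteBuffer.wrap(all, all.length - 8, 8).getLong();
    }
}
```

Had the offsets been placed in a header of the data file instead, the writer would have had to seek back after the last block, and the running checksum would no longer match the bytes in file order.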
In summary: There's no need to allow random access for writing.
Uwe