Re: [PR] Add a de-duplicating flat vector format [lucene]

via GitHub Thu, 11 Jun 2026 08:37:30 -0700


mikemccand commented on PR #15979:
URL: https://github.com/apache/lucene/pull/15979#issuecomment-4682279492


   I'm working through my slow human review process still, but I asked my local 
CC Opus 4.7 to review the PR.  I haven't reviewed its review!  Even THAT is 
time consuming!  Though I see it leaks some of my "steering doc" preferences, 
heh.
   
   Not sure I agree with all of it, e.g. the code dup across all Codec impls 
... we need to be cautious there (changing the format of Codec B should not 
accidentally then change the format of Codec A).
   
   Humans reviewing AI stuff has become the bottleneck now!
   
   >   Review of PR #15979 — DedupFlatVectorsFormat
   > 
   >   High-level concerns
   > 
   >  1. Format versioning / class naming breaks Lucene convention. Lucene 
formats are named LuceneXXFooFormat and frozen on release; new versions live in 
new packages. DedupFlatVectorsFormat in codecs.dedup has no version prefix, so 
there's no migration path if the
   >  on-disk layout (.dvc/.dvm) ever changes. This is likely a blocker for 
committing as-is. Either name it Lucene10XDedupFlatVectorsFormat (or similar, 
matching whatever release it lands in) and place it under codecs/lucene10X/, or 
land it in the sandbox module rather
   >  than core while still experimental.
   >
   >  2. ~3.6K added LOC duplicates a lot of Lucene99FlatVectorsFormat. 
OffHeapDedup{Float,Byte}VectorValues reimplements the entire Dense/Sparse/Empty 
hierarchy that already exists in OffHeap{Float,Byte}VectorValues. Reviewing, 
maintaining, and porting future
   >  improvements doubles this surface. Before committing, please look at 
whether the dedup variants can either subclass the existing off-heap classes or 
compose them by holding an int[] docOrdToVecOrd + a KnnVectorValues pool view. 
The Sparse/Dense/Empty/PoolView
   >  boilerplate is mostly mechanical in both files.
   >
   >  3. Cross-format generality. This is a wrapper over 
Lucene99HnswVectorsWriter/Reader specifically. If we add a new HNSW format 
(Lucene10X), do we add DedupHnswVectorsFormatLucene10X? The flat format is 
reusable, so this works for the flat case, but the HNSW wrapper
   >  doesn't compose nicely. Worth thinking about how it interacts with 
quantized formats (msokolov already asked) before we commit to the API surface.
   >
   >  4. Missing CHANGES.txt entry. New public format requires one.
   >
   >  5. PR description says "Details about the format are included in a 
markdown doc in the PR" — I don't see any .md file in the diff. Either it was 
not pushed, or the description should be updated. The format docs in the format 
Javadoc are decent but a design doc
   >  justifying the cross-field-pool design (vs. e.g. a single shared field) 
would help reviewers.
   >
   >  Correctness
   >
   >  6. MergePool.intern byte-verify allocates ~`byteSize per probe 
(readSourceBytesdoesnew byte[byteSize]`). For 1024-dim FLOAT32 = 4KB per call. 
Hash collisions are rare, but probe chains over many same-hash entries on 
hash-collision aren't, and probing past
   >  non-empty unrelated slots also needs no allocation. Reuse a single 
scratch byte[] held on the MergePool.
   >
   >  7. MergePool.intern calls vectorValue(refSourceOrd[candidateOrd]) on a 
KnnVectorValues instance that may be the same one currently being iterated by 
the merge (because srcValues is forwarded straight from the merger). The 
contract is that vectorValue(int) is
   >  random-access, but the implementations reuse a value buffer and a 
lastVecOrd cache. Confirm by inspection that no caller retains a returned 
vector across an intermediate vectorValue call on the same instance — 
mergeFloatField/mergeByteField do
   >  System.arraycopy/bb.put immediately, so this looks fine, but please add a 
comment that documents the invariant. A single misuse here would silently 
corrupt the dedup pool.
   >
   >  8. DedupHash substitute-for-zero (0xDEDDEADBEEFCAFE0L) is itself a 
legitimate murmur output. A vector that genuinely hashes to that substitute 
will collide-probe alongside the (vanishingly rare) zero-hashing ones. Behavior 
is correct (byte-verify catches it), so
   >  this is more of a "be honest in the comment" — the substitute doesn't 
avoid collisions, it just remaps a bias. The text "itself can never match a 
legitimate murmur output for a different vector with non-trivial probability" 
overstates this.
   >
   >  9. Float-equality verify in FlushPool.internFloats (line 1145): comment 
says Arrays.equals(float[]...) is OK because NaN/0.0 cases "would yield 
different bytes → different hash → different slot." That's true for the first 
stored candidate, but the verify runs
   >  against an existing vectorFloats.get(candidateOrd) whose bytes may have 
hash-collided with bytesCandidate. If the colliding stored vector and the 
candidate have bytes differing only by +0.0/-0.0 patterns... they have 
different bytes, different hashes, different
   >  slots — so they wouldn't collide at all. So the comment's conclusion 
holds, but the reasoning needs reordering. Cleaner: just compare bytes (you 
have them), and drop the parallel vectorFloats byte-equality logic. The float[] 
cache is for read-back, not
   >  verification.
   >
   >  10. vecOrdScratch = new int[ArrayUtil.oversize(numNodes, Integer.BYTES)] 
in TranslatingScorer/TranslatingUpdateableScorer: the bytesPerElement arg to 
oversize is per-element bytes, so 4 is correct for int[]. Just confirming since 
the name Integer.BYTES is
   >  misleading-looking next to numNodes.
   >
   >  Test code
   >
   >  11. TestDedupFlatVectorsFormat.poolUniqueCount reflects through three 
layers (PerFieldKnnVectorsFormat$FieldsReader, Asserting...$delegate|in, 
Lucene99HnswVectorsReader.flatVectorsReader). This is brittle — any rename in 
the wrapping chain silently breaks tests
   >  that aren't testing what they look like they're testing. Add a 
package-private accessor on the relevant readers and unwrap via that, or expose 
a stat through FieldInfos/ramBytesUsed.
   >
   >  12. for (boolean u : used) if (u) usedCount++; (line 3516) — one-liner. 
Lucene style-wise OK, but readable equivalent would be cleaner.
   >
   >  Smaller items
   >
   >  - MergePool.srcInternCache is Map<Integer, int[]> with autoboxing. The 
reader uses IntObjectHashMap<FieldEntry> — same here would avoid boxing on 
every fast-path call.
   >  - static record PoolKey — static is redundant on a nested record (always 
implicitly static).
   >  - getMaxDimensions returns hardcoded 1024 with no comment why; Lucene99 
returns 1024 too, so this is consistent, but worth a @inheritDoc or one-line 
note.
   >  - if (vals == null) continue; (writer L878, L933) — fine, but Lucene 
tends to brace one-liners.
   >  - assert cached.length == srcUniqueCount in srcVecToMergedMap: good 
defensive check — but its message references "readerIdx mismatch" which is 
wrong; the failure means the source pool size changed between calls, which 
shouldn't happen.
   >
   >  Author's AI disclosure
   >
   >  The author was upfront that this is largely AI-authored. Reviewers should 
weight that toward reading every code path manually — AI-authored code is 
fluent but tends to over-document, over-defend, and sometimes encode invariants 
in comments rather than asserts. The
   >  "Level A" naming, the parenthetical "(perf win)" / "(correctness)" 
asides, the long discursive Javadocs about why caches exist — these are AI 
tells. Tighten before committing per Lucene convention (Mike: matches your "no 
dwelling on implementation details in
   >  javadocs" rule).
   >
   >  Verdict
   >
   >  Direction is good and benchmark numbers (10% faster on the 1M Cohere 
benchmark) are encouraging. Not ready to commit as-is — needs (1) format 
versioning/naming aligned to Lucene convention, (2) significant deduplication 
against existing OffHeap*VectorValues, (3)
   >  CHANGES entry and the missing design doc, (4) the brittle reflective test 
plumbing fixed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Add a de-duplicating flat vector format [lucene]

Reply via email to