mikemccand commented on PR #15979:
URL: https://github.com/apache/lucene/pull/15979#issuecomment-4682279492
I'm working through my slow human review process still, but I asked my local
CC Opus 4.7 to review the PR. I haven't reviewed its review! Even THAT is
time consuming! Though I see it leaks some of my "steering doc" preferences,
heh.
Not sure I agree with all of it, e.g. the code dup across all Codec impls
... we need to be cautious there (changing the format of Codec B should not
accidentally then change the format of Codec A).
Humans reviewing AI stuff has become the bottleneck now!
> Review of PR #15979 — DedupFlatVectorsFormat
>
> High-level concerns
>
> 1. Format versioning / class naming breaks Lucene convention. Lucene
formats are named LuceneXXFooFormat and frozen on release; new versions live in
new packages. DedupFlatVectorsFormat in codecs.dedup has no version prefix, so
there's no migration path if the
> on-disk layout (.dvc/.dvm) ever changes. This is likely a blocker for
committing as-is. Either name it Lucene10XDedupFlatVectorsFormat (or similar,
matching whatever release it lands in) and place it under codecs/lucene10X/, or
land it in the sandbox module rather
> than core while still experimental.
>
> 2. ~3.6K added LOC duplicates a lot of Lucene99FlatVectorsFormat.
OffHeapDedup{Float,Byte}VectorValues reimplements the entire Dense/Sparse/Empty
hierarchy that already exists in OffHeap{Float,Byte}VectorValues. Reviewing,
maintaining, and porting future
> improvements doubles this surface. Before committing, please look at
whether the dedup variants can either subclass the existing off-heap classes or
compose them by holding an int[] docOrdToVecOrd + a KnnVectorValues pool view.
The Sparse/Dense/Empty/PoolView
> boilerplate is mostly mechanical in both files.
>
> 3. Cross-format generality. This is a wrapper over
Lucene99HnswVectorsWriter/Reader specifically. If we add a new HNSW format
(Lucene10X), do we add DedupHnswVectorsFormatLucene10X? The flat format is
reusable, so this works for the flat case, but the HNSW wrapper
> doesn't compose nicely. Worth thinking about how it interacts with
quantized formats (msokolov already asked) before we commit to the API surface.
>
> 4. Missing CHANGES.txt entry. New public format requires one.
>
> 5. PR description says "Details about the format are included in a
markdown doc in the PR" — I don't see any .md file in the diff. Either it was
not pushed, or the description should be updated. The format docs in the format
Javadoc are decent but a design doc
> justifying the cross-field-pool design (vs. e.g. a single shared field)
would help reviewers.
>
> Correctness
>
> 6. MergePool.intern byte-verify allocates ~`byteSize per probe
(readSourceBytesdoesnew byte[byteSize]`). For 1024-dim FLOAT32 = 4KB per call.
Hash collisions are rare, but probe chains over many same-hash entries on
hash-collision aren't, and probing past
> non-empty unrelated slots also needs no allocation. Reuse a single
scratch byte[] held on the MergePool.
>
> 7. MergePool.intern calls vectorValue(refSourceOrd[candidateOrd]) on a
KnnVectorValues instance that may be the same one currently being iterated by
the merge (because srcValues is forwarded straight from the merger). The
contract is that vectorValue(int) is
> random-access, but the implementations reuse a value buffer and a
lastVecOrd cache. Confirm by inspection that no caller retains a returned
vector across an intermediate vectorValue call on the same instance —
mergeFloatField/mergeByteField do
> System.arraycopy/bb.put immediately, so this looks fine, but please add a
comment that documents the invariant. A single misuse here would silently
corrupt the dedup pool.
>
> 8. DedupHash substitute-for-zero (0xDEDDEADBEEFCAFE0L) is itself a
legitimate murmur output. A vector that genuinely hashes to that substitute
will collide-probe alongside the (vanishingly rare) zero-hashing ones. Behavior
is correct (byte-verify catches it), so
> this is more of a "be honest in the comment" — the substitute doesn't
avoid collisions, it just remaps a bias. The text "itself can never match a
legitimate murmur output for a different vector with non-trivial probability"
overstates this.
>
> 9. Float-equality verify in FlushPool.internFloats (line 1145): comment
says Arrays.equals(float[]...) is OK because NaN/0.0 cases "would yield
different bytes → different hash → different slot." That's true for the first
stored candidate, but the verify runs
> against an existing vectorFloats.get(candidateOrd) whose bytes may have
hash-collided with bytesCandidate. If the colliding stored vector and the
candidate have bytes differing only by +0.0/-0.0 patterns... they have
different bytes, different hashes, different
> slots — so they wouldn't collide at all. So the comment's conclusion
holds, but the reasoning needs reordering. Cleaner: just compare bytes (you
have them), and drop the parallel vectorFloats byte-equality logic. The float[]
cache is for read-back, not
> verification.
>
> 10. vecOrdScratch = new int[ArrayUtil.oversize(numNodes, Integer.BYTES)]
in TranslatingScorer/TranslatingUpdateableScorer: the bytesPerElement arg to
oversize is per-element bytes, so 4 is correct for int[]. Just confirming since
the name Integer.BYTES is
> misleading-looking next to numNodes.
>
> Test code
>
> 11. TestDedupFlatVectorsFormat.poolUniqueCount reflects through three
layers (PerFieldKnnVectorsFormat$FieldsReader, Asserting...$delegate|in,
Lucene99HnswVectorsReader.flatVectorsReader). This is brittle — any rename in
the wrapping chain silently breaks tests
> that aren't testing what they look like they're testing. Add a
package-private accessor on the relevant readers and unwrap via that, or expose
a stat through FieldInfos/ramBytesUsed.
>
> 12. for (boolean u : used) if (u) usedCount++; (line 3516) — one-liner.
Lucene style-wise OK, but readable equivalent would be cleaner.
>
> Smaller items
>
> - MergePool.srcInternCache is Map<Integer, int[]> with autoboxing. The
reader uses IntObjectHashMap<FieldEntry> — same here would avoid boxing on
every fast-path call.
> - static record PoolKey — static is redundant on a nested record (always
implicitly static).
> - getMaxDimensions returns hardcoded 1024 with no comment why; Lucene99
returns 1024 too, so this is consistent, but worth a @inheritDoc or one-line
note.
> - if (vals == null) continue; (writer L878, L933) — fine, but Lucene
tends to brace one-liners.
> - assert cached.length == srcUniqueCount in srcVecToMergedMap: good
defensive check — but its message references "readerIdx mismatch" which is
wrong; the failure means the source pool size changed between calls, which
shouldn't happen.
>
> Author's AI disclosure
>
> The author was upfront that this is largely AI-authored. Reviewers should
weight that toward reading every code path manually — AI-authored code is
fluent but tends to over-document, over-defend, and sometimes encode invariants
in comments rather than asserts. The
> "Level A" naming, the parenthetical "(perf win)" / "(correctness)"
asides, the long discursive Javadocs about why caches exist — these are AI
tells. Tighten before committing per Lucene convention (Mike: matches your "no
dwelling on implementation details in
> javadocs" rule).
>
> Verdict
>
> Direction is good and benchmark numbers (10% faster on the 1M Cohere
benchmark) are encouraging. Not ready to commit as-is — needs (1) format
versioning/naming aligned to Lucene convention, (2) significant deduplication
against existing OffHeap*VectorValues, (3)
> CHANGES entry and the missing design doc, (4) the brittle reflective test
plumbing fixed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]