Re: [PR] Taxonomy counts are incorrect due to ordinal sorting (#14008) [lucene]

2024-11-22 Thread via GitHub
stefanvodita commented on PR #14010: URL: https://github.com/apache/lucene/pull/14010#issuecomment-2493291352 At the same time, I don't think we need to rush bug fix releases, since this functionality was broken from the time it was released. -- This is an automated message from the Apach

Re: [PR] Introduces IndexInput#updateReadAdvice to change the ReadAdvice while merging vectors [lucene]

2024-11-22 Thread via GitHub
ChrisHegarty commented on PR #13985: URL: https://github.com/apache/lucene/pull/13985#issuecomment-2493554763 > .. > Took a look, The mismatch between `mergeInstanceCount` and `mergeInstance` is because mergeInstanceCount is being updated in parent and mergeInstance is updated to true du

Re: [PR] Introduces IndexInput#updateReadAdvice to change the ReadAdvice while merging vectors [lucene]

2024-11-22 Thread via GitHub
ChrisHegarty merged PR #13985: URL: https://github.com/apache/lucene/pull/13985 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucen

Re: [I] Can we store only quantized vectors to reduce disk footprint? [lucene]

2024-11-22 Thread via GitHub
jpountz commented on issue #14007: URL: https://github.com/apache/lucene/issues/14007#issuecomment-2494379468 In my opinion, we should not have lossy codecs. This creates weird situations where the errors could compound in weird ways over time, e.g. when you switch file formats. I'd

Re: [I] How to configure TieredMergePolicy for very low segment count? [lucene]

2024-11-22 Thread via GitHub
jpountz commented on issue #14004: URL: https://github.com/apache/lucene/issues/14004#issuecomment-2494383950 I ran the IndexGeoNames benchmark with 1 indexing thread, SerialMergeScheduler, 10k buffered docs, 100MB floor segment size, 2 segments per tier. This made the total indexing time g

Re: [I] Can we store only quantized vectors to reduce disk footprint? [lucene]

2024-11-22 Thread via GitHub
benwtrent commented on issue #14007: URL: https://github.com/apache/lucene/issues/14007#issuecomment-2493691474 > I think we still need for indexing and merging as vigyasharma@ comment. I don't know if it's strictly necessary to keep the raw vectors for merging. Once a certain limit i

Re: [PR] Reduce allocation rate in HNSW concurrent merge [lucene]

2024-11-22 Thread via GitHub
benwtrent commented on PR #14011: URL: https://github.com/apache/lucene/pull/14011#issuecomment-2494478036 @msokolov we should have a `CHANGES` entry for this, and it could be backported to 10.1. It's a nice optimization that we should track.

Re: [I] Can we store only quantized vectors to reduce disk footprint? [lucene]

2024-11-22 Thread via GitHub
benwtrent commented on issue #14007: URL: https://github.com/apache/lucene/issues/14007#issuecomment-2494527716 > In my opinion, we should not have lossy codecs. This creates weird situations where the errors could compound in weird ways over time, e.g. when you switch file formats.

Re: [I] Can we store only quantized vectors to reduce disk footprint? [lucene]

2024-11-22 Thread via GitHub
mikemccand commented on issue #14007: URL: https://github.com/apache/lucene/issues/14007#issuecomment-2494487655 Ahh sorry @dungba88 also referenced the issue above! https://github.com/apache/lucene/issues/13158

[PR] Simplify logic in ScoreCachingWrappingScorer [lucene]

2024-11-22 Thread via GitHub
msfroh opened a new pull request, #14012: URL: https://github.com/apache/lucene/pull/14012 ### Description This is functionally equivalent to the logic that was present, but makes the behavior clearer.

[PR] Run filtered disjunctions with MaxScoreBulkScorer. [lucene]

2024-11-22 Thread via GitHub
jpountz opened a new pull request, #14014: URL: https://github.com/apache/lucene/pull/14014 Running filtered disjunctions with a specialized bulk scorer seems to yield a good speedup. For what it's worth, I also tried to implement a MAXSCORE-based scorer to see if it had to do with the `Bul

Re: [I] Can we store only quantized vectors to reduce disk footprint? [lucene]

2024-11-22 Thread via GitHub
mikemccand commented on issue #14007: URL: https://github.com/apache/lucene/issues/14007#issuecomment-2494478945 > I'd rather like it to be done on top of the codec API. E.g. computing a good scalar quantization for a given model offline, and then using it in the way in to index vectors dir

Re: [PR] Update lastDoc in ScoreCachingWrappingScorer [lucene]

2024-11-22 Thread via GitHub
msfroh commented on PR #13987: URL: https://github.com/apache/lucene/pull/13987#issuecomment-2494486990 > @msfroh FWIW I'm happy to merge this PR when we remove the double call to LeafCollector#collect on the same doc ID in tests. In that case, the unit test that I added can be remove

Re: [I] Can we store only quantized vectors to reduce disk footprint? [lucene]

2024-11-22 Thread via GitHub
mikemccand commented on issue #14007: URL: https://github.com/apache/lucene/issues/14007#issuecomment-2494484887 Also, note that, at least for the current scalar quantization (`int7`, `int4`), those full precision `float[]` vectors remain on disk during searching. They are only used during

Re: [I] [Discuss] Reducing allocations in HnswUtil::markRooted [lucene]

2024-11-22 Thread via GitHub
msokolov commented on issue #14002: URL: https://github.com/apache/lucene/issues/14002#issuecomment-2494609732 I do think it's worth improving. Another way could be to measure empirically the stack depth - maybe it scales in a predictable way with total number of vectors? And then we can us

Re: [PR] Improve checksum calculations [lucene]

2024-11-22 Thread via GitHub
jpountz commented on code in PR #13989: URL: https://github.com/apache/lucene/pull/13989#discussion_r1854638074 ## lucene/core/src/test/org/apache/lucene/store/TestBufferedChecksum.java: ## @@ -63,4 +67,127 @@ public void testRandom() { } assertEquals(c1.getValue(), c2

[PR] fix JavaDoc: use TopDocs instead of Hits [lucene]

2024-11-22 Thread via GitHub
saschaszott opened a new pull request, #14015: URL: https://github.com/apache/lucene/pull/14015 ### Description Class `Hits` was removed from the Lucene API with Lucene version 3.0. Use class `TopDocs` instead.

Re: [PR] fix JavaDoc: use TopDocs instead of Hits [lucene]

2024-11-22 Thread via GitHub
vigyasharma merged PR #14015: URL: https://github.com/apache/lucene/pull/14015

Re: [PR] Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces [lucene]

2024-11-22 Thread via GitHub
github-actions[bot] commented on PR #13872: URL: https://github.com/apache/lucene/pull/13872#issuecomment-2495139672 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contributi

Re: [PR] Introduces IndexInput#updateReadAdvice to change the ReadAdvice while merging vectors [lucene]

2024-11-22 Thread via GitHub
shatejas commented on PR #13985: URL: https://github.com/apache/lucene/pull/13985#issuecomment-2494021476 Thanks a lot @ChrisHegarty for adding tight tests and merging this!

Re: [PR] Reduce allocation rate in HNSW concurrent merge [lucene]

2024-11-22 Thread via GitHub
msokolov commented on PR #14011: URL: https://github.com/apache/lucene/pull/14011#issuecomment-2494249748 Thanks @villam-durina! This looks good. I guess we were trying to enforce access to the rows, requiring that callers acquire a lock to obtain them, but it was really just a fig leaf any

Re: [PR] Add IndexInput isLoaded [lucene]

2024-11-22 Thread via GitHub
jpountz commented on PR #13998: URL: https://github.com/apache/lucene/pull/13998#issuecomment-2494266353 Sorry for derailing the PR, let's not implement it on ByteBuffersIndexInput then. We can look into it in a separate PR if we want.

Re: [PR] Update lastDoc in ScoreCachingWrappingScorer [lucene]

2024-11-22 Thread via GitHub
jpountz commented on PR #13987: URL: https://github.com/apache/lucene/pull/13987#issuecomment-2494272651 @msfroh FWIW I'm happy to merge this PR when we remove the double call to LeafCollector#collect on the same doc ID in tests.

Re: [PR] Make CombinedFieldQuery eligible for WAND/MAXSCORE. [lucene]

2024-11-22 Thread via GitHub
jpountz merged PR #13999: URL: https://github.com/apache/lucene/pull/13999

Re: [PR] Only consider clauses whose cost is less than the lead cost to compute block boundaries in WANDScorer. [lucene]

2024-11-22 Thread via GitHub
jpountz merged PR #14003: URL: https://github.com/apache/lucene/pull/14003

Re: [PR] Add IndexInput isLoaded [lucene]

2024-11-22 Thread via GitHub
ChrisHegarty commented on PR #13998: URL: https://github.com/apache/lucene/pull/13998#issuecomment-2494090457 > > yeah, I think that this prob makes sense. Lemme satisfy myself that it will always be true. > > it won't be in core if currently swapped out, no? I don't think a hardcode

Re: [PR] Add IndexInput isLoaded [lucene]

2024-11-22 Thread via GitHub
rmuir commented on PR #13998: URL: https://github.com/apache/lucene/pull/13998#issuecomment-2493975727 > yeah, I think that this prob makes sense. Lemme satisfy myself that it will always be true. it won't be in core if currently swapped out, no? I don't think a hardcoded `true` work

Re: [I] How to configure TieredMergePolicy for very low segment count? [lucene]

2024-11-22 Thread via GitHub
mikemccand commented on issue #14004: URL: https://github.com/apache/lucene/issues/14004#issuecomment-2494470229 > Interestingly, an index that is less than 1GB can still have 10 segments with the above merge policy because of the constraint to not run merges where the resulting segment is