Re: [PR] Inline skip data into postings lists [lucene]

2024-08-01 Thread via GitHub
jpountz commented on PR #13585: URL: https://github.com/apache/lucene/pull/13585#issuecomment-2262631387 Things got a bit better later on (https://github.com/apache/lucene/pull/13585#issuecomment-2246112137), but your reading is correct that some queries get slower. This seems to especially

Re: [PR] SparseFixedBitSet#firstDoc: reduce number of `indices` iterations for a bit set that is not fully built yet. [lucene]

2024-08-01 Thread via GitHub
epotyom commented on code in PR #13559: URL: https://github.com/apache/lucene/pull/13559#discussion_r1699871277 ## lucene/core/src/java/org/apache/lucene/util/BitSet.java: ## @@ -92,6 +92,12 @@ public void clear() { */ public abstract int nextSetBit(int index); + /** +

Re: [PR] Optimize binary search call [lucene]

2024-08-01 Thread via GitHub
dungba88 commented on PR #13595: URL: https://github.com/apache/lucene/pull/13595#issuecomment-2262677876 The `advance` will keep reducing the array size and we will generally advance small steps ahead right? Then I think exponential search makes sense. I'll try to use `IntArrayDocIdSetIter

Re: [PR] Reduce memory usage of SkipListWriter [lucene]

2024-08-01 Thread via GitHub
mikemccand commented on PR #13576: URL: https://github.com/apache/lucene/pull/13576#issuecomment-2262801593 This looks correct to me too -- we know the max number of skip levels for this particular segment and should size the arrays based on that, not based on the global worst case max ever

Re: [PR] Inline skip data into postings lists [lucene]

2024-08-01 Thread via GitHub
mikemccand commented on PR #13585: URL: https://github.com/apache/lucene/pull/13585#issuecomment-2262812551 Nice pop in the nightly benchmarks from this! [`OrHighMedium`](https://home.apache.org/~mikemccand/lucenebench/OrHighMed.html) jumped. Even [`Phrase`](https://home.apache.org/~mike

Re: [PR] Reduce memory usage of SkipListWriter [lucene]

2024-08-01 Thread via GitHub
bugmakerr commented on PR #13576: URL: https://github.com/apache/lucene/pull/13576#issuecomment-2262823557 @mikemccand OK, I will close this PR. Btw, I think we can do the same optimization for the skip reader, and we usually need to read old indices. I want to know if I should o

Re: [PR] Inline skip data into postings lists [lucene]

2024-08-01 Thread via GitHub
jpountz commented on PR #13585: URL: https://github.com/apache/lucene/pull/13585#issuecomment-2262842204 Hmm, [`CombinedHighHigh`](https://home.apache.org/~mikemccand/lucenebench/CombinedHighHigh.html) is angry. I had not benchmarked it while developing, I'll check it out. Some spee

Re: [PR] SparseFixedBitSet#firstDoc: reduce number of `indices` iterations for a bit set that is not fully built yet. [lucene]

2024-08-01 Thread via GitHub
gsmiller merged PR #13559: URL: https://github.com/apache/lucene/pull/13559 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.ap

Re: [PR] SparseFixedBitSet#firstDoc: reduce number of `indices` iterations for a bit set that is not fully built yet. [lucene]

2024-08-01 Thread via GitHub
gsmiller commented on PR #13559: URL: https://github.com/apache/lucene/pull/13559#issuecomment-2263190898 @epotyom just merged. Thanks for the change! I just now noticed you put the changes entry under 10.0 but I don't see any reason we can't backport to 9.12. I'm going to backport now and

Re: [PR] HnswLock: access locks via hash and only use for concurrent indexing [lucene]

2024-08-01 Thread via GitHub
msokolov commented on PR #13581: URL: https://github.com/apache/lucene/pull/13581#issuecomment-2263206986 I'm going to merge as-is and we can follow up with the additional safety measure
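The thread title above mentions accessing locks via a hash. As a rough illustration of that general idea (lock striping), and not of the actual `HnswLock` code, here is a minimal sketch. The class name `StripedLocks`, the `lockFor(level, node)` method, and the hash formula are all hypothetical; the assumption is that a fixed pool of locks is shared by all (level, node) pairs instead of allocating one lock per node.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of lock striping; not Lucene's HnswLock implementation.
final class StripedLocks {
  private final ReentrantReadWriteLock[] locks;

  StripedLocks(int stripes) {
    locks = new ReentrantReadWriteLock[stripes];
    for (int i = 0; i < stripes; i++) {
      locks[i] = new ReentrantReadWriteLock();
    }
  }

  // Maps a (level, node) pair to one of the shared locks.
  // Distinct pairs may collide on the same lock, which is safe but adds contention.
  ReentrantReadWriteLock lockFor(int level, int node) {
    int hash = 31 * level + node; // illustrative hash, chosen for this sketch only
    return locks[Math.floorMod(hash, locks.length)];
  }
}
```

The trade-off in a scheme like this is memory versus contention: fewer stripes means less memory per graph but more unrelated nodes sharing a lock.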

Re: [PR] HnswLock: access locks via hash and only use for concurrent indexing [lucene]

2024-08-01 Thread via GitHub
msokolov merged PR #13581: URL: https://github.com/apache/lucene/pull/13581

Re: [PR] Inline skip data into postings lists [lucene]

2024-08-01 Thread via GitHub
jpountz commented on PR #13585: URL: https://github.com/apache/lucene/pull/13585#issuecomment-2263216720 I found the problem with `CombinedHighHigh`: the logic for lazily decoding frequencies was broken and we'd decode the whole block of frequencies on every `freq()` call. It's now fixed so
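To illustrate the kind of bug described in the comment above, here is a minimal sketch of lazy block decoding guarded by a flag. It is not the postings reader from the PR: the class `LazyFreqBlock`, the toy "frequency minus one" encoding, and the `decodeCalls()` counter are assumptions made up for this example. The point is only that the block should be decoded once on first access, not on every `freq()` call.

```java
// Hypothetical sketch of lazy decoding; not the actual Lucene postings code.
final class LazyFreqBlock {
  private final int[] encoded; // toy encoding: each slot stores freq - 1
  private final int[] freqs;
  private boolean decoded;     // without this flag, every freq() call would decode the block
  private int decodeCalls;

  LazyFreqBlock(int[] encoded) {
    this.encoded = encoded;
    this.freqs = new int[encoded.length];
  }

  // Returns the frequency at the given position, decoding the block on first use only.
  int freq(int index) {
    if (!decoded) {
      decodeBlock();
      decoded = true;
    }
    return freqs[index];
  }

  // Number of times the whole block was decoded; should stay at 1 per block.
  int decodeCalls() {
    return decodeCalls;
  }

  private void decodeBlock() {
    decodeCalls++;
    for (int i = 0; i < encoded.length; i++) {
      freqs[i] = encoded[i] + 1;
    }
  }
}
```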

Re: [PR] Optimize binary search call [lucene]

2024-08-01 Thread via GitHub
gsmiller commented on PR #13595: URL: https://github.com/apache/lucene/pull/13595#issuecomment-2263219440 I _think_ exponential search will only outperform binary search in this case if we expect the next target to be relatively close to the "min" we're constantly "pushing up" (thanks to yo

Re: [PR] HnswLock: access locks via hash and only use for concurrent indexing [lucene]

2024-08-01 Thread via GitHub
msokolov commented on PR #13581: URL: https://github.com/apache/lucene/pull/13581#issuecomment-2263229682 Also backported to 9x

Re: [PR] Inline skip data into postings lists [lucene]

2024-08-01 Thread via GitHub
mikemccand commented on PR #13585: URL: https://github.com/apache/lucene/pull/13585#issuecomment-2263239458 Phew, thanks for catching the performance regression and tracking it down @jpountz. GO BENCHMARKING!

Re: [PR] Optimize binary search call [lucene]

2024-08-01 Thread via GitHub
jpountz commented on PR #13595: URL: https://github.com/apache/lucene/pull/13595#issuecomment-2263244861 If `DocIdSetIterator#advance` gets called on large increments, then there are only so many calls that can be done because the doc ID space is quickly exhausted. However, if you only adva
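The exponential (galloping) search discussed in this thread can be sketched as follows. This is a generic illustration over a sorted `int[]` of doc IDs, not the code from the PR; the class name `GallopingAdvance` and the `advance(docs, from, to, target)` signature are hypothetical. The search doubles the probe distance from the current position until it passes the target, then runs a binary search over the last gap, so a target that is `k` slots ahead costs roughly `O(log k)` comparisons rather than `O(log n)`.

```java
import java.util.Arrays;

// Hypothetical sketch of exponential search; not the implementation from the PR.
final class GallopingAdvance {
  // Returns the index of the first element >= target in docs[from, to), or `to` if none.
  // Assumes docs is sorted in ascending order.
  static int advance(int[] docs, int from, int to, int target) {
    int remaining = to - from;
    int bound = 1;
    // Gallop: double the probe distance while the probed element is still below the target.
    while (bound < remaining && docs[from + bound] < target) {
      // Clamp to `remaining` instead of doubling past the end of the range.
      bound = bound <= (remaining >> 1) ? bound << 1 : remaining;
    }
    // Everything before `lo` is known to be below the target.
    int lo = from + (bound >> 1);
    int hi = bound < remaining ? from + bound + 1 : to;
    int idx = Arrays.binarySearch(docs, lo, hi, target);
    return idx >= 0 ? idx : -idx - 1;
  }
}
```

For example, with `docs = {1, 3, 5, 7, 9, 11, 13}`, calling `advance(docs, 0, 7, 8)` probes indices 1, 2 and 4, then binary-searches indices 2 to 4 and lands on index 4 (doc ID 9).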

Re: [PR] Optimize binary search call [lucene]

2024-08-01 Thread via GitHub
gsmiller commented on PR #13595: URL: https://github.com/apache/lucene/pull/13595#issuecomment-2263350507 Ah yeah, OK thanks @jpountz. Makes sense.

[PR] Remove some BitSet#nextSetBit code duplication [lucene]

2024-08-01 Thread via GitHub
gsmiller opened a new pull request, #13625: URL: https://github.com/apache/lucene/pull/13625 After merging #13559 I noticed an opportunity to remove some redundant code in the `nextSetBit` implementations.

Re: [PR] KMeans clustering algorithm [lucene]

2024-08-01 Thread via GitHub
mayya-sharipova commented on code in PR #13604: URL: https://github.com/apache/lucene/pull/13604#discussion_r1700548368 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/quantization/KMeans.java: ## @@ -0,0 +1,348 @@ +/* + * Licensed to the Apache Software Foundation (

Re: [PR] KMeans clustering algorithm [lucene]

2024-08-01 Thread via GitHub
mayya-sharipova commented on code in PR #13604: URL: https://github.com/apache/lucene/pull/13604#discussion_r1700550981 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/quantization/KMeans.java: ## @@ -0,0 +1,348 @@ +/* + * Licensed to the Apache Software Foundation (

Re: [PR] KMeans clustering algorithm [lucene]

2024-08-01 Thread via GitHub
mayya-sharipova commented on code in PR #13604: URL: https://github.com/apache/lucene/pull/13604#discussion_r1700550122 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/quantization/KMeans.java: ## @@ -0,0 +1,348 @@ +/* + * Licensed to the Apache Software Foundation (

Re: [PR] KMeans clustering algorithm [lucene]

2024-08-01 Thread via GitHub
mayya-sharipova commented on code in PR #13604: URL: https://github.com/apache/lucene/pull/13604#discussion_r1700551363 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/quantization/KMeans.java: ## @@ -0,0 +1,348 @@ +/* + * Licensed to the Apache Software Foundation (

Re: [PR] KMeans clustering algorithm [lucene]

2024-08-01 Thread via GitHub
mayya-sharipova commented on code in PR #13604: URL: https://github.com/apache/lucene/pull/13604#discussion_r1700552308 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/quantization/KMeans.java: ## @@ -0,0 +1,348 @@ +/* + * Licensed to the Apache Software Foundation (

Re: [PR] KMeans clustering algorithm [lucene]

2024-08-01 Thread via GitHub
mayya-sharipova commented on PR #13604: URL: https://github.com/apache/lucene/pull/13604#issuecomment-2263589226 @tveasey Thank you for your detailed feedback, it was addressed in the last commit.

Re: [PR] KMeans clustering algorithm [lucene]

2024-08-01 Thread via GitHub
benwtrent commented on code in PR #13604: URL: https://github.com/apache/lucene/pull/13604#discussion_r1700553805 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/quantization/KMeans.java: ## @@ -0,0 +1,344 @@ +/* + * Licensed to the Apache Software Foundation (ASF) u

Re: [PR] Revert cosine deprecation [lucene]

2024-08-01 Thread via GitHub
benwtrent merged PR #13613: URL: https://github.com/apache/lucene/pull/13613

Re: [I] Are we properly accounting for `NeighborArray.rwlock`? [lucene]

2024-08-01 Thread via GitHub
msokolov closed issue #13580: Are we properly accounting for `NeighborArray.rwlock`? URL: https://github.com/apache/lucene/issues/13580

Re: [I] Are we properly accounting for `NeighborArray.rwlock`? [lucene]

2024-08-01 Thread via GitHub
msokolov commented on issue #13580: URL: https://github.com/apache/lucene/issues/13580#issuecomment-2264000352 There's still a question about whether to add more centralized lock enforcement since it is up to the caller to decide how to place locks around `OnHeapHnswGraph.getNeighbors()` bu

Re: [PR] Remove some BitSet#nextSetBit code duplication [lucene]

2024-08-01 Thread via GitHub
gsmiller commented on PR #13625: URL: https://github.com/apache/lucene/pull/13625#issuecomment-2264023189 @epotyom this minor refactoring occurred to me after merging your recent work. WDYT?

Re: [I] DocumentsWriterDeleteQueue.getNextSequenceNumber assertion failure seqNo=9 vs maxSeqNo=8 [lucene]

2024-08-01 Thread via GitHub
benwtrent commented on issue #13571: URL: https://github.com/apache/lucene/issues/13571#issuecomment-2264033141 I dug a little bit into this. I tried protecting it by putting `synchronized` on `getNextSequenceNumber`, but that didn't work. I tried putting `synchronized` on the DW when it flush

Re: [I] DocumentsWriterDeleteQueue.getNextSequenceNumber assertion failure seqNo=9 vs maxSeqNo=8 [lucene]

2024-08-01 Thread via GitHub
aoli-al commented on issue #13571: URL: https://github.com/apache/lucene/issues/13571#issuecomment-2264062711 Thanks for confirming this! Yes, I found the bug extremely tricky to trigger while trying to reproduce. Making `DocumentsWriterFlushControl:obtainAndLock` synchronized will m

Re: [PR] Remove some BitSet#nextSetBit code duplication [lucene]

2024-08-01 Thread via GitHub
epotyom commented on code in PR #13625: URL: https://github.com/apache/lucene/pull/13625#discussion_r1700906253 ## lucene/core/src/java/org/apache/lucene/util/SparseFixedBitSet.java: ## @@ -337,34 +337,23 @@ private int firstDoc(int i4096, int i4096upper) { @Override pub

Re: [PR] Aggregate files from the same segment into a single Arena [lucene]

2024-08-01 Thread via GitHub
uschindler commented on PR #13570: URL: https://github.com/apache/lucene/pull/13570#issuecomment-2264150313 Do we have a backport PR? Should I work on it?

Re: [PR] Move synonym map off-heap for SynonymGraphFilter [lucene]

2024-08-01 Thread via GitHub
msfroh commented on code in PR #13054: URL: https://github.com/apache/lucene/pull/13054#discussion_r1700987077 ## lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymMapDirectory.java: ## @@ -0,0 +1,191 @@ +/* + * Licensed to the Apache Software Foundation

Re: [PR] Move synonym map off-heap for SynonymGraphFilter [lucene]

2024-08-01 Thread via GitHub
msfroh commented on code in PR #13054: URL: https://github.com/apache/lucene/pull/13054#discussion_r1700999887 ## lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymMap.java: ## @@ -291,11 +306,35 @@ public SynonymMap build() throws IOException {

Re: [PR] Move synonym map off-heap for SynonymGraphFilter [lucene]

2024-08-01 Thread via GitHub
msfroh commented on code in PR #13054: URL: https://github.com/apache/lucene/pull/13054#discussion_r1701014637 ## lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymMapDirectory.java: ## @@ -0,0 +1,191 @@ +/* + * Licensed to the Apache Software Foundation

Re: [PR] Move synonym map off-heap for SynonymGraphFilter [lucene]

2024-08-01 Thread via GitHub
msfroh commented on code in PR #13054: URL: https://github.com/apache/lucene/pull/13054#discussion_r1701025192 ## lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymMapDirectory.java: ## @@ -0,0 +1,191 @@ +/* + * Licensed to the Apache Software Foundation

Re: [PR] Move synonym map off-heap for SynonymGraphFilter [lucene]

2024-08-01 Thread via GitHub
msfroh commented on code in PR #13054: URL: https://github.com/apache/lucene/pull/13054#discussion_r1701032411 ## lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymMapDirectory.java: ## @@ -0,0 +1,191 @@ +/* + * Licensed to the Apache Software Foundation

Re: [I] TestIDVersionPostingsFormat failure [lucene]

2024-08-01 Thread via GitHub
benwtrent commented on issue #13127: URL: https://github.com/apache/lucene/issues/13127#issuecomment-2264272195 OK, I added a bunch of logging and it seems like the issue is around `DWPTP#getAndLock`. I can see the following occurring, new DWPTs being created, each with the first ge

Re: [I] TestIDVersionPostingsFormat failure [lucene]

2024-08-01 Thread via GitHub
benwtrent commented on issue #13127: URL: https://github.com/apache/lucene/issues/13127#issuecomment-2264278817 Yeah, looking at `markForFullFlush`, it seems like we mark the generation to gather that `seqNo`, then unlock DWFC, and this allows new DWPT to be returned with the old generation

Re: [I] TestIDVersionPostingsFormat failure [lucene]

2024-08-01 Thread via GitHub
benwtrent commented on issue #13127: URL: https://github.com/apache/lucene/issues/13127#issuecomment-2264281268 Another option is to continue to lock after the deleteQueue generation creation until the DWPTs are removed.

[I] NullPointerException: Cannot read field "vectorEncoding" because "fieldEntry" is null [lucene]

2024-08-01 Thread via GitHub
david-sitsky opened a new issue, #13626: URL: https://github.com/apache/lucene/issues/13626 ### Description One of our internal users hit this error when merging their index after loading all documents which contain some vector fields. I couldn't reproduce this myself. Is there a c

Re: [I] TestIDVersionPostingsFormat failure [lucene]

2024-08-01 Thread via GitHub
benwtrent commented on issue #13127: URL: https://github.com/apache/lucene/issues/13127#issuecomment-2264299983 Ah, another option is to switch the logic that is used to mark it free and send it back to the freeList. We can check if the deleteQueue is advanced, and just unlock it instead of

Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]

2024-08-01 Thread via GitHub
goankur commented on code in PR #13572: URL: https://github.com/apache/lucene/pull/13572#discussion_r1701070122 ## lucene/core/src/c/dotProduct.h: ## @@ -0,0 +1,4 @@ + +int32_t vdot8s_sve(int8_t* vec1[], int8_t* vec2, int32_t limit); +int32_t vdot8s_neon(int8_t* vec1[], int8_t*
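For readers unfamiliar with what an int8 dot product computes, here is a scalar reference sketch in Java. It is not the `vdot8s_sve` or `vdot8s_neon` code from the PR, which is native C using SIMD intrinsics; the class `Int8DotProduct` and its `dot` method are hypothetical names used only to show the arithmetic that such a kernel would be benchmarked against.

```java
// Hypothetical scalar baseline; not the native implementation from the PR.
final class Int8DotProduct {
  // Sums the pairwise products of two equal-length signed 8-bit vectors.
  // Each product is widened to int before accumulation, so single terms cannot overflow.
  static int dot(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector lengths differ: " + a.length + " vs " + b.length);
    }
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}
```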

Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]

2024-08-01 Thread via GitHub
goankur commented on code in PR #13572: URL: https://github.com/apache/lucene/pull/13572#discussion_r1701070880 ## lucene/core/src/c/dotProduct.c: ## @@ -0,0 +1,143 @@ +// dotProduct.c + +#include +#include + +#ifdef __ARM_ACLE +#include +#endif + +#if (defined(__ARM_FEATURE_

Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]

2024-08-01 Thread via GitHub
goankur commented on code in PR #13572: URL: https://github.com/apache/lucene/pull/13572#discussion_r1701071456 ## lucene/core/src/c/dotProduct.c: ## @@ -0,0 +1,143 @@ +// dotProduct.c + +#include +#include + +#ifdef __ARM_ACLE +#include +#endif + +#if (defined(__ARM_FEATURE_

[PR] Fix race condition on flush for DWPT seqNo generation [lucene]

2024-08-01 Thread via GitHub
benwtrent opened a new pull request, #13627: URL: https://github.com/apache/lucene/pull/13627 There is a tricky race condition with DWPT threads. It is possible that a flush starts by advancing the deleteQueue (in charge of creating seqNo). Thus, the referenced deleteQueue, there should be

Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]

2024-08-01 Thread via GitHub
goankur commented on code in PR #13572: URL: https://github.com/apache/lucene/pull/13572#discussion_r1701088228 ## lucene/core/src/c/dotProduct.c: ## @@ -0,0 +1,143 @@ +// dotProduct.c + +#include +#include + +#ifdef __ARM_ACLE +#include +#endif + +#if (defined(__ARM_FEATURE_

Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]

2024-08-01 Thread via GitHub
goankur commented on code in PR #13572: URL: https://github.com/apache/lucene/pull/13572#discussion_r1701099939 ## lucene/core/src/c/dotProduct.c: ## @@ -0,0 +1,143 @@ +// dotProduct.c + +#include +#include + +#ifdef __ARM_ACLE +#include +#endif + +#if (defined(__ARM_FEATURE_

Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]

2024-08-01 Thread via GitHub
goankur commented on code in PR #13572: URL: https://github.com/apache/lucene/pull/13572#discussion_r1701104686 ## lucene/core/src/c/dotProduct.c: ## @@ -0,0 +1,143 @@ +// dotProduct.c + +#include +#include + +#ifdef __ARM_ACLE +#include +#endif + +#if (defined(__ARM_FEATURE_

Re: [PR] Remove some BitSet#nextSetBit code duplication [lucene]

2024-08-01 Thread via GitHub
gsmiller merged PR #13625: URL: https://github.com/apache/lucene/pull/13625

Re: [PR] Remove some BitSet#nextSetBit code duplication [lucene]

2024-08-01 Thread via GitHub
gsmiller commented on PR #13625: URL: https://github.com/apache/lucene/pull/13625#issuecomment-2264403347 Thanks @epotyom for the feedback!

Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]

2024-08-01 Thread via GitHub
goankur commented on code in PR #13572: URL: https://github.com/apache/lucene/pull/13572#discussion_r1701126777 ## lucene/core/src/c/dotProduct.c: ## @@ -0,0 +1,143 @@ +// dotProduct.c + +#include +#include + +#ifdef __ARM_ACLE +#include +#endif + +#if (defined(__ARM_FEATURE_

Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]

2024-08-01 Thread via GitHub
goankur commented on PR #13572: URL: https://github.com/apache/lucene/pull/13572#issuecomment-2264420154 > But I think it makes the build more straightforward: it builds native as you expect, if you want to use different compiler set `CC` env vars etc differently. I still get compila

Re: [PR] Optimize binary search call [lucene]

2024-08-01 Thread via GitHub
dungba88 commented on PR #13595: URL: https://github.com/apache/lucene/pull/13595#issuecomment-2264454968 @jpountz I was reading `IntArrayDocIdSetIterator`, it is a private class only exposed through `IntArrayDocIdSet`. I think we need to extend the capability here (storing the score, havin

Re: [PR] Add reopen method in PerThreadPKLookup [lucene]

2024-08-01 Thread via GitHub
vsop-479 commented on code in PR #13596: URL: https://github.com/apache/lucene/pull/13596#discussion_r1701382048 ## lucene/test-framework/src/java/org/apache/lucene/tests/index/PerThreadPKLookup.java: ## @@ -97,5 +111,82 @@ public int lookup(BytesRef id) throws IOException {