Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]

2023-10-25 Thread via GitHub
ChrisHegarty commented on PR #12703: URL: https://github.com/apache/lucene/pull/12703#issuecomment-1778821678 > > Well... as simple wrapping of float[] into MemorySegment is not going to work out, the Vector API does not like it due to alignment constraints (which seems overly pedantic sinc

Re: [PR] Run merge-on-full-flush even though no changes got flushed. [lucene]

2023-10-25 Thread via GitHub
s1monw commented on PR #12549: URL: https://github.com/apache/lucene/pull/12549#issuecomment-1778953836 I think this is only triggered because of your change but the problem was already there. We hold the lock in MDW#close() such that we can not run a concurrent merge. We could either preve

Re: [PR] Make IndexSearcher#getSlices final and clarify docs [lucene]

2023-10-25 Thread via GitHub
s1monw commented on code in PR #12718: URL: https://github.com/apache/lucene/pull/12718#discussion_r1371505561 ## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ## @@ -425,11 +425,12 @@ public int count(Query query) throws IOException { } /** - * Ret

Re: [PR] Prevent users from using document block APIs when sort is configured [lucene]

2023-10-25 Thread via GitHub
s1monw commented on PR #12711: URL: https://github.com/apache/lucene/pull/12711#issuecomment-1778962096 > Another question: do we have any testing around this sort-stability / block-preservation today? I'm getting nervous now that we are relying on an undocumented feature that just happens

[I] [Sort] Numeric field sort query performance degrades dramatically with more deleted entries in segment [lucene]

2023-10-25 Thread via GitHub
gashutos opened a new issue, #12720: URL: https://github.com/apache/lucene/issues/12720 ### Description ### Problem With higher number of deleted entries in a segment, the sort query shows up to `10x` degradation after one point. We did this experiment using [nyc_taxis](https://gi

Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]

2023-10-25 Thread via GitHub
ChrisHegarty commented on PR #12703: URL: https://github.com/apache/lucene/pull/12703#issuecomment-1779052160 For what it's worth, the changes currently in this PR do not perform generally well, since we can have a mix of how we represent the underlying vector values, and where they come fr

[I] Compute gain with vector API in BPIndexReorderer [lucene]

2023-10-25 Thread via GitHub
gf2121 opened a new issue, #12721: URL: https://github.com/apache/lucene/issues/12721 ### Description An immature idea ! :) I noticed that `BPIndexReorderer$ComputeGainsTask#computeGain()` took a lot in CPU profile: ``` PERCENT CPU SAMPLES STACK 4.75%

Re: [PR] Run merge-on-full-flush even though no changes got flushed. [lucene]

2023-10-25 Thread via GitHub
jpountz commented on PR #12549: URL: https://github.com/apache/lucene/pull/12549#issuecomment-1779155292 Thanks! I was thinking of something along the lines of the diff you shared, I had not thought of the SerialMergeScheduler approach. I'll check it works and push this change. -- This i

Re: [I] Should we handle negative scores due to floating point arithmetic errors? [lucene]

2023-10-25 Thread via GitHub
jpountz commented on issue #12700: URL: https://github.com/apache/lucene/issues/12700#issuecomment-1779179636 We made changes to similarities to guarantee monotonicity with tf and norm (e.g. https://github.com/apache/lucene/issues/9063) despite floating-point rounding errors. I think we sho

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

2023-10-25 Thread via GitHub
jpountz commented on issue #12696: URL: https://github.com/apache/lucene/issues/12696#issuecomment-1779221543 For reference, Lucene used to use FOR for postings and PFOR for positions in 8.x. This was changed in 9.0 via #69 to use PFOR for both postings and positions. This PR says it made t

Re: [I] [Sort] Numeric field sort query performance degrades dramatically with more deleted entries in segment [lucene]

2023-10-25 Thread via GitHub
jpountz commented on issue #12720: URL: https://github.com/apache/lucene/issues/12720#issuecomment-1779265288 Having many deleted documents competitive is definitely a worst-case scenario for any kind of dynamic pruning that Lucene does. I'm not sure if there is something that we can do abo

Re: [I] MultiSimilarity.MultiSimScorer should sum up scores into a double [lucene]

2023-10-25 Thread via GitHub
jpountz commented on issue #12675: URL: https://github.com/apache/lucene/issues/12675#issuecomment-1779267844 This has been addressed by #12682. Thanks @KunalSanghvi for contributing and @benwtrent for merging! -- This is an automated message from the Apache Git Service. To respond to the

Re: [I] Compute gain with vector API in BPIndexReorderer [lucene]

2023-10-25 Thread via GitHub
gf2121 commented on issue #12721: URL: https://github.com/apache/lucene/issues/12721#issuecomment-1779316042 > did something like intVector = intVector.max(BROAD_1) Great idea! Here is the benchmark result : ``` Benchmark (maxTerm) (termsNum) Mode

Re: [I] Optimize FST suffix sharing for block tree index [lucene]

2023-10-25 Thread via GitHub
gf2121 commented on issue #12702: URL: https://github.com/apache/lucene/issues/12702#issuecomment-1779338184 on `wikimediumall` **Queries (Nothing changed obviously):** ``` TaskQPS baseline StdDevQPS my_modified_version StdDev

Re: [I] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-10-25 Thread via GitHub
dweiss commented on issue #12704: URL: https://github.com/apache/lucene/issues/12704#issuecomment-1779342648 If you'd like to do so, I'd suggest moving such a "scattering remix" utility to a separate class and reusing it elsewhere, much like here: https://github.com/carrotsearch/hppc/blo

[PR] Disable suffix sharing for block tree index [lucene]

2023-10-25 Thread via GitHub
gf2121 opened a new pull request, #12722: URL: https://github.com/apache/lucene/pull/12722 closes https://github.com/apache/lucene/issues/12702

Re: [PR] Disable suffix sharing for block tree index [lucene]

2023-10-25 Thread via GitHub
mikemccand commented on code in PR #12722: URL: https://github.com/apache/lucene/pull/12722#discussion_r1371870374 ## lucene/CHANGES.txt: ## @@ -227,6 +227,8 @@ Optimizations * GITHUB#12712: Speed up sorting postings file with an offline radix sorter in BPIndexReader. (Guo F

Re: [PR] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-10-25 Thread via GitHub
mikemccand commented on PR #12716: URL: https://github.com/apache/lucene/pull/12716#issuecomment-1779416642 > Thank you so much for the help Mike ! Thank you! > I have never run the fst benchmark but seems like its straightforward java script?. I could give it a try as well (so

Re: [PR] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-10-25 Thread via GitHub
mikemccand commented on PR #12716: URL: https://github.com/apache/lucene/pull/12716#issuecomment-1779443135 OK I ran it twice on `main`: ``` saved FST to "fst.bin": 294815624 bytes; 59.874 sec saved FST to "fst.bin": 294815624 bytes; 60.255 sec ``` And twice with th

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-25 Thread via GitHub
benwtrent commented on code in PR #12660: URL: https://github.com/apache/lucene/pull/12660#discussion_r1371954603 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsWriter.java: ## @@ -557,6 +566,12 @@ public void close() throws IOException {

Re: [I] [Sort] Numeric field sort query performance degrades dramatically with more deleted entries in segment [lucene]

2023-10-25 Thread via GitHub
RS146BIJAY commented on issue #12720: URL: https://github.com/apache/lucene/issues/12720#issuecomment-1779548832 @jpountz so is it safe to conclude that user increasing merging rate (which will remove these obsolete entries) (by tuning some parameters or doing a force merge) is the only way

Re: [PR] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-10-25 Thread via GitHub
bruno-roustant commented on PR #12716: URL: https://github.com/apache/lucene/pull/12716#issuecomment-1779668350 Oh, the numbers are disappointing. I expected it to be both a little more compact and a little faster. I wonder what is the cause, the rehash threshold, the linear scan, or the mult

Re: [I] xml.TestCoreParser#testSpanNearQueryWithoutSlopXML fails because of changed exception message [lucene]

2023-10-25 Thread via GitHub
dweiss commented on issue #12708: URL: https://github.com/apache/lucene/issues/12708#issuecomment-1779762455 Should we add an assumption to this test so that it is ignored on JDK22, at least until the issue is resolved? Causes some noise on the builds mailing list.

Re: [PR] Add support for similarity-based vector searches [lucene]

2023-10-25 Thread via GitHub
kaivalnp commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1779796454 Hi @benwtrent! Curious to hear if you've been able to reproduce the benchmark?

Re: [PR] Run merge-on-full-flush even though no changes got flushed. [lucene]

2023-10-25 Thread via GitHub
jpountz commented on PR #12549: URL: https://github.com/apache/lucene/pull/12549#issuecomment-1779816530 I ended up implementing your other suggestion. MDW generally expects that this IndexWriter instantiation will not do merges.

Re: [PR] Record if block API has been used in SegmentInfo [lucene]

2023-10-25 Thread via GitHub
jpountz commented on PR #12685: URL: https://github.com/apache/lucene/pull/12685#issuecomment-1779837662 FYI we've seen failures on TestIndexWriter recently, which are reproducible (e.g. https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-9.x/720/). I ran git bisect and it poin

Re: [PR] Add support for similarity-based vector searches [lucene]

2023-10-25 Thread via GitHub
benwtrent commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1779866529 @kaivalnp I have been busy doing other things. I hope to look into this in the next week or so.

Re: [I] xml.TestCoreParser#testSpanNearQueryWithoutSlopXML fails because of changed exception message [lucene]

2023-10-25 Thread via GitHub
uschindler commented on issue #12708: URL: https://github.com/apache/lucene/issues/12708#issuecomment-1780013296 I will update JDK tomorrow or Friday and the issue should be gone.

Re: [PR] Record if block API has been used in SegmentInfo [lucene]

2023-10-25 Thread via GitHub
s1monw commented on PR #12685: URL: https://github.com/apache/lucene/pull/12685#issuecomment-1780091805 I pushed fixes... thanks @jpountz

Re: [PR] Add support for similarity-based vector searches [lucene]

2023-10-25 Thread via GitHub
kaivalnp commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1780186180 Thank you! I'll try to incorporate earlier suggestions in the meanwhile

Re: [PR] Consolidate FSTStore and BytesStore in FST [lucene]

2023-10-25 Thread via GitHub
mikemccand merged PR #12709: URL: https://github.com/apache/lucene/pull/12709

Re: [PR] Consolidate FSTStore and BytesStore in FST [lucene]

2023-10-25 Thread via GitHub
mikemccand commented on PR #12709: URL: https://github.com/apache/lucene/pull/12709#issuecomment-1780274592 Thanks @dungba88 -- I just merged. We can open a new PR when it's time to backport ...

Re: [I] FSTCompiler's NodeHash should fully duplicate `byte[]` slices from the growing FST [lucene]

2023-10-25 Thread via GitHub
mikemccand commented on issue #12714: URL: https://github.com/apache/lucene/issues/12714#issuecomment-1780282563 I made a quick hackity change, just to measure the number of additional bytes we'd "typically" have to copy in order to duplicate suffix bytes from the growing (forced append-onl

Re: [PR] Disable suffix sharing for block tree index [lucene]

2023-10-25 Thread via GitHub
gf2121 commented on code in PR #12722: URL: https://github.com/apache/lucene/pull/12722#discussion_r1372542501 ## lucene/CHANGES.txt: ## @@ -227,6 +227,8 @@ Optimizations * GITHUB#12712: Speed up sorting postings file with an offline radix sorter in BPIndexReader. (Guo Feng)

Re: [PR] Disable suffix sharing for block tree index [lucene]

2023-10-25 Thread via GitHub
gf2121 merged PR #12722: URL: https://github.com/apache/lucene/pull/12722

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-25 Thread via GitHub
jmazanec15 commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1372593680 ## lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java: ## @@ -0,0 +1,267 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-25 Thread via GitHub
zhaih commented on code in PR #12660: URL: https://github.com/apache/lucene/pull/12660#discussion_r1372595375 ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java: ## @@ -151,61 +159,128 @@ public OnHeapHnswGraph build(int maxOrd) throws IOException {

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-25 Thread via GitHub
zhaih commented on code in PR #12660: URL: https://github.com/apache/lucene/pull/12660#discussion_r1372606067 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsFormat.java: ## @@ -198,14 +218,25 @@ public Lucene99HnswVectorsFormat( + ";