Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-23 Thread via GitHub
zhaih commented on code in PR #12660: URL: https://github.com/apache/lucene/pull/12660#discussion_r1369644221 ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java: ## @@ -221,34 +296,50 @@ private long printGraphBuildStatus(int node, long start, long t) {

Re: [I] Specialize arc store for continuous label in FST [lucene]

2023-10-23 Thread via GitHub
gf2121 commented on issue #12701: URL: https://github.com/apache/lucene/issues/12701#issuecomment-1776577174 @mikemccand Thanks for feedback, glad you like this :) > did you close it because it's similar / same as the direct addressing case? Yes, this case could be considered as

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-23 Thread via GitHub
zhaih commented on code in PR #12660: URL: https://github.com/apache/lucene/pull/12660#discussion_r1369658959 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -35,6 +38,9 @@ public class NeighborArray { float[] score; int[] node; private int

[PR] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-10-24 Thread via GitHub
shubhamvishu opened a new pull request, #12716: URL: https://github.com/apache/lucene/pull/12716 ### Description Addresses #12704. Below is the comment that inspired this ([link](https://github.com/apache/lucene/pull/12633#discussion_r1366847986)), ``` Instead, we shoul

Re: [I] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-10-24 Thread via GitHub
shubhamvishu commented on issue #12704: URL: https://github.com/apache/lucene/issues/12704#issuecomment-1776661739 I opened a PR to make use of this constant : #12716 Also, I was thinking if this constant could be utilised in other hash function implementations as well in the codebas

[I] Use max BPV encoding in postings if doc buffer size less than ForUtil.BLOCK_SIZE [lucene]

2023-10-24 Thread via GitHub
easyice opened a new issue, #12717: URL: https://github.com/apache/lucene/issues/12717 ### Description Currently we use vint encoding for the doc IDs if the doc buffer < 128, then decode in `Lucene90PostingsReader#readVIntBlock`. In the high cardinality field, it is possibly slow to
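
For readers unfamiliar with the vint encoding this issue refers to, here is a minimal, self-contained sketch of variable-length integer encoding (7 payload bits per byte, high bit as a continuation flag) applied to doc ID deltas. The class `VIntSketch` and its method names are hypothetical and chosen for illustration only; this is not the code in `Lucene90PostingsReader`.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration class, not part of Lucene.
final class VIntSketch {
  /** Writes value using 7 payload bits per byte; a set high bit means "more bytes follow". */
  static void writeVInt(ByteArrayOutputStream out, int value) {
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.write(value);
  }

  /** Reads one vint starting at pos[0] and advances pos[0] past it. */
  static int readVInt(byte[] buf, int[] pos) {
    int value = 0;
    int shift = 0;
    byte b;
    do {
      b = buf[pos[0]++];
      value |= (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return value;
  }

  /** Encodes ascending doc IDs as vint-encoded deltas from the previous doc ID. */
  static byte[] encodeDocIds(int[] sortedDocIds) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int prev = 0;
    for (int doc : sortedDocIds) {
      writeVInt(out, doc - prev);
      prev = doc;
    }
    return out.toByteArray();
  }

  /** Decodes count doc IDs; each value needs a byte-by-byte loop with a branch per byte. */
  static int[] decodeDocIds(byte[] buf, int count) {
    int[] docs = new int[count];
    int[] pos = {0};
    int prev = 0;
    for (int i = 0; i < count; i++) {
      prev += readVInt(buf, pos);
      docs[i] = prev;
    }
    return docs;
  }
}
```

Each delta takes between one and five bytes, so the decoder cannot know where a value ends without inspecting every byte. A fixed bits-per-value layout avoids that per-byte inspection.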

Re: [PR] Prevent users from using document block APIs when sort is configured [lucene]

2023-10-24 Thread via GitHub
s1monw commented on PR #12711: URL: https://github.com/apache/lucene/pull/12711#issuecomment-1776831638 I spoke to @jpountz about this topic and we discussed a different approach. We could get away with not having the check at all and make blocks a first class citizen by recording the paren

Re: [PR] Make IndexSearcher#getSlices final and clarify docs [lucene]

2023-10-24 Thread via GitHub
javanna commented on code in PR #12718: URL: https://github.com/apache/lucene/pull/12718#discussion_r1369871488 ## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ## @@ -425,11 +425,12 @@ public int count(Query query) throws IOException { } /** - * Re

Re: [PR] Make IndexSearcher#getSlices final and clarify docs [lucene]

2023-10-24 Thread via GitHub
javanna commented on code in PR #12718: URL: https://github.com/apache/lucene/pull/12718#discussion_r1369871845 ## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ## @@ -115,10 +115,10 @@ public class IndexSearcher { protected final List leafContexts; /

Re: [PR] Clean up ByteBlockPool [lucene]

2023-10-24 Thread via GitHub
iverase commented on PR #12506: URL: https://github.com/apache/lucene/pull/12506#issuecomment-1776973771 I like the introduction of `ByteSlicePool` but I wonder if the naming is correct as it does not feel a generic slicer class but very tied to the format used by TermsHashPerField. Just

Re: [PR] Speed up the sort when building forward index [lucene]

2023-10-24 Thread via GitHub
gf2121 commented on PR #12712: URL: https://github.com/apache/lucene/pull/12712#issuecomment-1776994927 ``` BPIndexReorderer reorderer = new BPIndexReorderer(); reorderer.setMinDocFreq(16384); reorderer.setMaxIters(3); reorderer.setMinPartitionSize(8192); mp = new BPReorderingM

Re: [PR] Clean up ByteBlockPool [lucene]

2023-10-24 Thread via GitHub
iverase commented on code in PR #12506: URL: https://github.com/apache/lucene/pull/12506#discussion_r1369987969 ## lucene/core/src/java/org/apache/lucene/util/ByteSlicePool.java: ## @@ -0,0 +1,138 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + *

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-24 Thread via GitHub
benwtrent commented on code in PR #12660: URL: https://github.com/apache/lucene/pull/12660#discussion_r1370023343 ## lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java: ## @@ -146,18 +148,24 @@ public final class Lucene95HnswVectorsFormat exten

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-24 Thread via GitHub
benwtrent commented on code in PR #12660: URL: https://github.com/apache/lucene/pull/12660#discussion_r1370028382 ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java: ## @@ -221,34 +296,50 @@ private long printGraphBuildStatus(int node, long start, long t)

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-24 Thread via GitHub
benwtrent commented on code in PR #12660: URL: https://github.com/apache/lucene/pull/12660#discussion_r1370034532 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -35,6 +38,9 @@ public class NeighborArray { float[] score; int[] node; private

Re: [PR] Speed up the sort when building forward index [lucene]

2023-10-24 Thread via GitHub
gf2121 commented on PR #12712: URL: https://github.com/apache/lucene/pull/12712#issuecomment-1777056056 I indexed `wikimediumall` with: * BPIndexReorderer config mentioned [here](https://github.com/apache/lucene/issues/12665#issuecomment-1770827026). * BPMergePolicy on this [commit](htt

Re: [PR] Prevent users from using document block APIs when sort is configured [lucene]

2023-10-24 Thread via GitHub
msokolov commented on PR #12711: URL: https://github.com/apache/lucene/pull/12711#issuecomment-1777060875 The idea of more explicitly modeling doc-blocks makes sense to me, but I wonder if it would really enable "That way we can sort only on the parent document". What about the case (suppor

Re: [PR] Prevent users from using document block APIs when sort is configured [lucene]

2023-10-24 Thread via GitHub
msokolov commented on PR #12711: URL: https://github.com/apache/lucene/pull/12711#issuecomment-1777062768 Another question: do we have any testing around this sort-stability / block-preservation today? I'm getting nervous now that we are relying on an undocumented feature that just happens

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-24 Thread via GitHub
msokolov commented on code in PR #12660: URL: https://github.com/apache/lucene/pull/12660#discussion_r1370046773 ## lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java: ## @@ -635,17 +667,31 @@ private static DocsWithFieldSet writeVectorData(

[PR] Add a specialized bulk scorer for regular conjunctions. [lucene]

2023-10-24 Thread via GitHub
jpountz opened a new pull request, #12719: URL: https://github.com/apache/lucene/pull/12719 PR #12382 added a bulk scorer for top-k hits on conjunctions that yielded a significant speedup (annotation [FP](http://people.apache.org/~mikemccand/lucenebench/AndHighHigh.html)). This change pr

Re: [PR] Add a specialized bulk scorer for regular conjunctions. [lucene]

2023-10-24 Thread via GitHub
jpountz commented on PR #12719: URL: https://github.com/apache/lucene/pull/12719#issuecomment-1777078920 Wikibigall: ``` TaskQPS baseline StdDevQPS my_modified_version StdDevPct diff p-value Prefix3

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-24 Thread via GitHub
msokolov commented on code in PR #12660: URL: https://github.com/apache/lucene/pull/12660#discussion_r1370054483 ## lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java: ## @@ -146,18 +148,24 @@ public final class Lucene95HnswVectorsFormat extend

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-10-24 Thread via GitHub
dungba88 commented on PR #12624: URL: https://github.com/apache/lucene/pull/12624#issuecomment-1777157731 @mikemccand I rebased and created some implementation of DataOutput-based FSTWriter. I think I need to write tests, but let me know what you think. -- This is an automated mes

Re: [PR] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-10-24 Thread via GitHub
mikemccand commented on code in PR #12716: URL: https://github.com/apache/lucene/pull/12716#discussion_r1370148505 ## lucene/CHANGES.txt: ## @@ -190,6 +190,8 @@ Improvements * GITHUB#12705, GITHUB#12705: Improve handling of NullPointerException and IllegalStateException in

Re: [PR] Speed up the sort when building forward index [lucene]

2023-10-24 Thread via GitHub
jpountz commented on code in PR #12712: URL: https://github.com/apache/lucene/pull/12712#discussion_r1370176150 ## lucene/misc/src/java/org/apache/lucene/misc/index/BPIndexReorderer.java: ## @@ -991,4 +939,233 @@ static int readMonotonicInts(DataInput in, int[] ints) throws IOE

Re: [PR] Capture build scans on ge.apache.org to benefit from deep build insights [lucene]

2023-10-24 Thread via GitHub
dsmiley commented on PR #12293: URL: https://github.com/apache/lucene/pull/12293#issuecomment-1777343300 @clayburn An espoused benefit to this PR is that, as an Apache committer, I could do builds on my own machine and have the analysis be published for viewing on ge.apache.org. How do I d

Re: [PR] Create a task executor when executor is not provided [lucene]

2023-10-24 Thread via GitHub
javanna commented on code in PR #12606: URL: https://github.com/apache/lucene/pull/12606#discussion_r1370308080 ## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ## @@ -420,13 +418,12 @@ public int count(Query query) throws IOException { } /** - * Re

Re: [PR] Consolidate FSTStore and BytesStore in FST [lucene]

2023-10-24 Thread via GitHub
dungba88 commented on PR #12709: URL: https://github.com/apache/lucene/pull/12709#issuecomment-1777396050 I added an entry in the CHANGES.txt under Lucene 10.0 (as we are not backporting)

Re: [PR] Run merge-on-full-flush even though no changes got flushed. [lucene]

2023-10-24 Thread via GitHub
jpountz merged PR #12549: URL: https://github.com/apache/lucene/pull/12549 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apa

Re: [PR] Speed up the sort when building forward index [lucene]

2023-10-24 Thread via GitHub
gf2121 commented on code in PR #12712: URL: https://github.com/apache/lucene/pull/12712#discussion_r1370377164 ## lucene/misc/src/java/org/apache/lucene/misc/index/BPIndexReorderer.java: ## @@ -991,4 +939,233 @@ static int readMonotonicInts(DataInput in, int[] ints) throws IOEx

Re: [PR] Speed up the sort when building forward index [lucene]

2023-10-24 Thread via GitHub
jpountz commented on code in PR #12712: URL: https://github.com/apache/lucene/pull/12712#discussion_r1370397139 ## lucene/misc/src/java/org/apache/lucene/misc/index/BPIndexReorderer.java: ## @@ -991,4 +939,166 @@ static int readMonotonicInts(DataInput in, int[] ints) throws IOE

Re: [PR] Capture build scans on ge.apache.org to benefit from deep build insights [lucene]

2023-10-24 Thread via GitHub
clayburn commented on PR #12293: URL: https://github.com/apache/lucene/pull/12293#issuecomment-1777496879 @dsmiley - Excellent questions: > An espoused benefit to this PR is that, as an Apache committer, I could do builds on my own machine and have the analysis be published for viewin

Re: [PR] TaskExecutor to cancel all tasks on exception [lucene]

2023-10-24 Thread via GitHub
javanna merged PR #12689: URL: https://github.com/apache/lucene/pull/12689

Re: [PR] TaskExecutor to cancel all tasks on exception [lucene]

2023-10-24 Thread via GitHub
javanna commented on PR #12689: URL: https://github.com/apache/lucene/pull/12689#issuecomment-1777505662 Thanks for the review @jpountz !

Re: [PR] Run merge-on-full-flush even though no changes got flushed. [lucene]

2023-10-24 Thread via GitHub
jpountz commented on PR #12549: URL: https://github.com/apache/lucene/pull/12549#issuecomment-1777530886 I reverted as this causes a deadlock in TestStressIndexing: ``` "TEST-TestStressIndexing.testStressIndexAndSearching-seed#[D4B60FA81EB58FF3]" #42 [282516] prio=5 os_prio=0 cpu=

Re: [PR] Sometimes intersect the essential clause and the best non-essential clause. [lucene]

2023-10-24 Thread via GitHub
jpountz merged PR #12589: URL: https://github.com/apache/lucene/pull/12589

Re: [PR] Capture build scans on ge.apache.org to benefit from deep build insights [lucene]

2023-10-24 Thread via GitHub
dsmiley commented on PR #12293: URL: https://github.com/apache/lucene/pull/12293#issuecomment-1777598613 Here's [my build I ran locally](https://ge.apache.org/s/nsaeazuf3tkcu/timeline) for anyone who is interested. I observe that Lucene tests seem well balanced (judging from the cool time

Re: [PR] Speed up the sort when building forward index [lucene]

2023-10-24 Thread via GitHub
jpountz commented on PR #12712: URL: https://github.com/apache/lucene/pull/12712#issuecomment-1777605586 Unrelated to this change: I sent you an email to your Apache address, can you check it out? (Sorry for the noise on this PR, I don't know how else to contact you).

Re: [PR] Capture build scans on ge.apache.org to benefit from deep build insights [lucene]

2023-10-24 Thread via GitHub
dsmiley merged PR #12293: URL: https://github.com/apache/lucene/pull/12293

Re: [PR] Speed up the sort when building forward index [lucene]

2023-10-24 Thread via GitHub
gf2121 commented on PR #12712: URL: https://github.com/apache/lucene/pull/12712#issuecomment-1777640134 > I sent you an email to your Apache address, can you check it out? Sorry for missing the email. Thank you so much for reminding me here!

Re: [PR] Deprecated public constructor of FSTCompiler in favor of the Builder. [lucene]

2023-10-24 Thread via GitHub
cavorite commented on code in PR #12715: URL: https://github.com/apache/lucene/pull/12715#discussion_r1370542730 ## lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java: ## @@ -122,8 +122,11 @@ public class FSTCompiler { /** * Instantiates an FST/FSA builder w

Re: [PR] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-10-24 Thread via GitHub
shubhamvishu commented on code in PR #12716: URL: https://github.com/apache/lucene/pull/12716#discussion_r1370588249 ## lucene/CHANGES.txt: ## @@ -190,6 +190,8 @@ Improvements * GITHUB#12705, GITHUB#12705: Improve handling of NullPointerException and IllegalStateException i

Re: [PR] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-10-24 Thread via GitHub
shubhamvishu commented on PR #12716: URL: https://github.com/apache/lucene/pull/12716#issuecomment-130978 Thanks for the quick review @mikemccand! I have addressed the comments in the new revision. > Could you also change the probe from quadratic (what it is now) to a simple line
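
The quoted review comment asks to change the probe "from quadratic" to something simpler; the preview is cut off, but a later comment in this listing mentions "the linear scan", so the sketch below assumes linear probing is meant. It contrasts the two probe sequences on a power-of-two open-addressing table. `ProbeSketch` and its methods are hypothetical names for illustration and are not taken from the PR.

```java
// Hypothetical illustration class, not part of Lucene.
final class ProbeSketch {
  /** Linear probing: advance one slot at a time, wrapping with the table mask. */
  static int[] linearProbe(int hash, int mask, int steps) {
    int[] seq = new int[steps];
    int pos = hash & mask;
    for (int i = 0; i < steps; i++) {
      seq[i] = pos;
      pos = (pos + 1) & mask;
    }
    return seq;
  }

  /** Quadratic (triangular) probing: the step grows by one on every collision. */
  static int[] quadraticProbe(int hash, int mask, int steps) {
    int[] seq = new int[steps];
    int pos = hash & mask;
    int c = 0;
    for (int i = 0; i < steps; i++) {
      seq[i] = pos;
      pos = (pos + (++c)) & mask;
    }
    return seq;
  }
}
```

Linear probing touches adjacent slots, which tends to be friendlier to CPU caches, but it relies more heavily on the hash function spreading keys well. That is one reason it is usually paired with a stronger hash mix.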

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-24 Thread via GitHub
benwtrent merged PR #12582: URL: https://github.com/apache/lucene/pull/12582

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-10-24 Thread via GitHub
dungba88 commented on PR #12624: URL: https://github.com/apache/lucene/pull/12624#issuecomment-1778336652 (A small note: Tantivy uses a value-based LRU cache with 2-item buckets; items are evicted per bucket: https://github.com/BurntSushi/fst/blob/a0936e9b25a888a0d5b9f94b91997216253e7088/

Re: [PR] Use Arrays#mismatch for Outputs#common operations [lucene]

2023-10-24 Thread via GitHub
gf2121 merged PR #12710: URL: https://github.com/apache/lucene/pull/12710
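
As a rough illustration of the technique named in this PR's title, the sketch below computes the common prefix length of two byte ranges with the JDK's `Arrays.mismatch` (available since Java 9). The class `CommonPrefixSketch` and the method `commonPrefixLength` are hypothetical; the actual change to the `Outputs#common` implementations may look different.

```java
import java.util.Arrays;

// Hypothetical illustration class, not part of Lucene.
final class CommonPrefixSketch {
  /**
   * Returns how many leading bytes the two ranges share.
   * Arrays.mismatch returns the relative index of the first difference,
   * the shorter length if one range is a proper prefix of the other,
   * or -1 if both ranges are identical.
   */
  static int commonPrefixLength(byte[] a, int aOff, int aLen, byte[] b, int bOff, int bLen) {
    int mismatch = Arrays.mismatch(a, aOff, aOff + aLen, b, bOff, bOff + bLen);
    return mismatch == -1 ? aLen : mismatch;
  }
}
```

Compared with a hand-written byte-by-byte loop, `Arrays.mismatch` gives the JVM the chance to compare several bytes per step.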

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-10-24 Thread via GitHub
dungba88 commented on PR #12624: URL: https://github.com/apache/lucene/pull/12624#issuecomment-1778455090 One thing that baffles me is that we are writing the metadata, including the numBytes & start node, at the beginning of the DataOutput. That means once the FST is completed, we

Re: [PR] Speed up the sort when building forward index [lucene]

2023-10-24 Thread via GitHub
gf2121 merged PR #12712: URL: https://github.com/apache/lucene/pull/12712

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-24 Thread via GitHub
zhaih commented on code in PR #12660: URL: https://github.com/apache/lucene/pull/12660#discussion_r1371207682 ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java: ## @@ -221,34 +296,50 @@ private long printGraphBuildStatus(int node, long start, long t) {

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-24 Thread via GitHub
zhaih commented on code in PR #12660: URL: https://github.com/apache/lucene/pull/12660#discussion_r1371210641 ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java: ## @@ -221,34 +296,50 @@ private long printGraphBuildStatus(int node, long start, long t) {

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-24 Thread via GitHub
zhaih commented on code in PR #12660: URL: https://github.com/apache/lucene/pull/12660#discussion_r1371211783 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -35,6 +38,9 @@ public class NeighborArray { float[] score; int[] node; private int

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-24 Thread via GitHub
zhaih commented on code in PR #12660: URL: https://github.com/apache/lucene/pull/12660#discussion_r1371235805 ## lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java: ## @@ -146,18 +148,24 @@ public final class Lucene95HnswVectorsFormat extends

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-24 Thread via GitHub
zhaih commented on PR #12660: URL: https://github.com/apache/lucene/pull/12660#issuecomment-1778626211 @msokolov @benwtrent I removed almost all `nocommit` (except the renaming one) and rebased to main, please take a look if you have time. @benwtrent Please check whether the rebase an

Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]

2023-10-25 Thread via GitHub
ChrisHegarty commented on PR #12703: URL: https://github.com/apache/lucene/pull/12703#issuecomment-1778821678 > > Well... as simple wrapping of float[] into MemorySegment is not going to work out, the Vector API does not like it due to alignment constraints (which seems overly pedantic sinc

Re: [PR] Run merge-on-full-flush even though no changes got flushed. [lucene]

2023-10-25 Thread via GitHub
s1monw commented on PR #12549: URL: https://github.com/apache/lucene/pull/12549#issuecomment-1778953836 I think this is only triggered because of your change but the problem was already there. We hold the lock in MDW#close() such that we can not run a concurrent merge. We could either preve

Re: [PR] Make IndexSearcher#getSlices final and clarify docs [lucene]

2023-10-25 Thread via GitHub
s1monw commented on code in PR #12718: URL: https://github.com/apache/lucene/pull/12718#discussion_r1371505561 ## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ## @@ -425,11 +425,12 @@ public int count(Query query) throws IOException { } /** - * Ret

Re: [PR] Prevent users from using document block APIs when sort is configured [lucene]

2023-10-25 Thread via GitHub
s1monw commented on PR #12711: URL: https://github.com/apache/lucene/pull/12711#issuecomment-1778962096 > Another question: do we have any testing around this sort-stability / block-preservation today? I'm getting nervous now that we are relying on an undocumented feature that just happens

[I] [Sort] Numeric field sort query performance degrades dramatically with more deleted entries in segment [lucene]

2023-10-25 Thread via GitHub
gashutos opened a new issue, #12720: URL: https://github.com/apache/lucene/issues/12720 ### Description ### Problem With higher number of deleted entries in a segment, the sort query shows up to `10x` degradation after one point. We did this experiment using [nyc_taxis](https://gi

Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]

2023-10-25 Thread via GitHub
ChrisHegarty commented on PR #12703: URL: https://github.com/apache/lucene/pull/12703#issuecomment-1779052160 For what it's worth, the changes currently in this PR do not perform generally well, since we can have a mix of how we represent the underlying vector values, and where they come fr

[I] Compute gain with vector API in BPIndexReorderer [lucene]

2023-10-25 Thread via GitHub
gf2121 opened a new issue, #12721: URL: https://github.com/apache/lucene/issues/12721 ### Description An immature idea ! :) I noticed that `BPIndexReorderer$ComputeGainsTask#computeGain()` took a lot in CPU profile: ``` PERCENT CPU SAMPLES STACK 4.75%

Re: [PR] Run merge-on-full-flush even though no changes got flushed. [lucene]

2023-10-25 Thread via GitHub
jpountz commented on PR #12549: URL: https://github.com/apache/lucene/pull/12549#issuecomment-1779155292 Thanks! I was thinking of something along the lines of the diff you shared, I had not thought of the SerialMergeScheduler approach. I'll check it works and push this change.

Re: [I] Should we handle negative scores due to floating point arithmetic errors? [lucene]

2023-10-25 Thread via GitHub
jpountz commented on issue #12700: URL: https://github.com/apache/lucene/issues/12700#issuecomment-1779179636 We made changes to similarities to guarantee monotonicity with tf and norm (e.g. https://github.com/apache/lucene/issues/9063) despite floating-point rounding errors. I think we sho

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

2023-10-25 Thread via GitHub
jpountz commented on issue #12696: URL: https://github.com/apache/lucene/issues/12696#issuecomment-1779221543 For reference, Lucene used to use FOR for postings and PFOR for positions in 8.x. This was changed in 9.0 via #69 to use PFOR for both postings and positions. This PR says it made t

Re: [I] [Sort] Numeric field sort query performance degrades dramatically with more deleted entries in segment [lucene]

2023-10-25 Thread via GitHub
jpountz commented on issue #12720: URL: https://github.com/apache/lucene/issues/12720#issuecomment-1779265288 Having many deleted documents competitive is definitely a worst-case scenario for any kind of dynamic pruning that Lucene does. I'm not sure if there is something that we can do abo

Re: [I] MultiSimilarity.MultiSimScorer should sum up scores into a double [lucene]

2023-10-25 Thread via GitHub
jpountz commented on issue #12675: URL: https://github.com/apache/lucene/issues/12675#issuecomment-1779267844 This has been addressed by #12682. Thanks @KunalSanghvi for contributing and @benwtrent for merging!
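
To show why the issue title asks for summing scores into a double, here is a small sketch contrasting float and double accumulation. `ScoreSumSketch` is a hypothetical class, not the `MultiSimScorer` code, and the input values in the note below are deliberately extreme so that the rounding effect is exact and easy to see.

```java
// Hypothetical illustration class, not part of Lucene.
final class ScoreSumSketch {
  /** Accumulates in float: every addition rounds to 24 bits of precision. */
  static float sumInFloat(float[] scores) {
    float sum = 0f;
    for (float s : scores) {
      sum += s;
    }
    return sum;
  }

  /** Accumulates in double and rounds to float only once, at the end. */
  static float sumInDouble(float[] scores) {
    double sum = 0d;
    for (float s : scores) {
      sum += s;
    }
    return (float) sum;
  }
}
```

With the inputs `{16777216f, 1f, 1f}`, the float accumulator stays at `16777216f` because each `+ 1f` is rounded away, while the double accumulator yields `16777218f`.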

Re: [I] Compute gain with vector API in BPIndexReorderer [lucene]

2023-10-25 Thread via GitHub
gf2121 commented on issue #12721: URL: https://github.com/apache/lucene/issues/12721#issuecomment-1779316042 > did something like intVector = intVector.max(BROAD_1) Great idea! Here is the benchmark result : ``` Benchmark (maxTerm) (termsNum) Mode

Re: [I] Optimize FST suffix sharing for block tree index [lucene]

2023-10-25 Thread via GitHub
gf2121 commented on issue #12702: URL: https://github.com/apache/lucene/issues/12702#issuecomment-1779338184 on `wikimediumall` **Queries (Nothing changed obviously):** ``` TaskQPS baseline StdDevQPS my_modified_version StdDev

Re: [I] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-10-25 Thread via GitHub
dweiss commented on issue #12704: URL: https://github.com/apache/lucene/issues/12704#issuecomment-1779342648 If you'd like to do so, I'd suggest moving such a "scattering remix" utility to a separate class and reusing it elsewhere, much like here: https://github.com/carrotsearch/hppc/blo

[PR] Disable suffix sharing for block tree index [lucene]

2023-10-25 Thread via GitHub
gf2121 opened a new pull request, #12722: URL: https://github.com/apache/lucene/pull/12722 closes https://github.com/apache/lucene/issues/12702

Re: [PR] Disable suffix sharing for block tree index [lucene]

2023-10-25 Thread via GitHub
mikemccand commented on code in PR #12722: URL: https://github.com/apache/lucene/pull/12722#discussion_r1371870374 ## lucene/CHANGES.txt: ## @@ -227,6 +227,8 @@ Optimizations * GITHUB#12712: Speed up sorting postings file with an offline radix sorter in BPIndexReader. (Guo F

Re: [PR] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-10-25 Thread via GitHub
mikemccand commented on PR #12716: URL: https://github.com/apache/lucene/pull/12716#issuecomment-1779416642 > Thank you so much for the help Mike ! Thank you! > I have never run the fst benchmark but seems like its straightforward java script?. I could give it a try as well (so

Re: [PR] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-10-25 Thread via GitHub
mikemccand commented on PR #12716: URL: https://github.com/apache/lucene/pull/12716#issuecomment-1779443135 OK I ran it twice on `main`: ``` saved FST to "fst.bin": 294815624 bytes; 59.874 sec saved FST to "fst.bin": 294815624 bytes; 60.255 sec ``` And twice with th

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-25 Thread via GitHub
benwtrent commented on code in PR #12660: URL: https://github.com/apache/lucene/pull/12660#discussion_r1371954603 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsWriter.java: ## @@ -557,6 +566,12 @@ public void close() throws IOException {

Re: [I] [Sort] Numeric field sort query performance degrades dramatically with more deleted entries in segment [lucene]

2023-10-25 Thread via GitHub
RS146BIJAY commented on issue #12720: URL: https://github.com/apache/lucene/issues/12720#issuecomment-1779548832 @jpountz so is it safe to conclude that user increasing merging rate (which will remove these obsolete entries) (by tuning some parameters or doing a force merge) is the only way

Re: [PR] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-10-25 Thread via GitHub
bruno-roustant commented on PR #12716: URL: https://github.com/apache/lucene/pull/12716#issuecomment-1779668350 Oh, the numbers are disappointing. I expected it to be both a little more compact and a little faster. I wonder what the cause is: the rehash threshold, the linear scan, or the mult

Re: [I] xml.TestCoreParser#testSpanNearQueryWithoutSlopXML fails because of changed exception message [lucene]

2023-10-25 Thread via GitHub
dweiss commented on issue #12708: URL: https://github.com/apache/lucene/issues/12708#issuecomment-1779762455 Should we add an assumption to this test so that it is ignored on JDK22, at least until the issue is resolved? Causes some noise on the builds mailing list.

Re: [PR] Add support for similarity-based vector searches [lucene]

2023-10-25 Thread via GitHub
kaivalnp commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1779796454 Hi @benwtrent! Curious to hear if you've been able to reproduce the benchmark?

Re: [PR] Run merge-on-full-flush even though no changes got flushed. [lucene]

2023-10-25 Thread via GitHub
jpountz commented on PR #12549: URL: https://github.com/apache/lucene/pull/12549#issuecomment-1779816530 I ended up implementing your other suggestion. MDW generally expects that this IndexWriter instantiation will not do merges.

Re: [PR] Record if block API has been used in SegmentInfo [lucene]

2023-10-25 Thread via GitHub
jpountz commented on PR #12685: URL: https://github.com/apache/lucene/pull/12685#issuecomment-1779837662 FYI we've seen failures on TestIndexWriter recently, which are reproducible (e.g. https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-9.x/720/). I ran git bisect and it poin

Re: [PR] Add support for similarity-based vector searches [lucene]

2023-10-25 Thread via GitHub
benwtrent commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1779866529 @kaivalnp I have been busy doing other things. I hope to look into this in the next week or so.

Re: [I] xml.TestCoreParser#testSpanNearQueryWithoutSlopXML fails because of changed exception message [lucene]

2023-10-25 Thread via GitHub
uschindler commented on issue #12708: URL: https://github.com/apache/lucene/issues/12708#issuecomment-1780013296 I will update JDK tomorrow or Friday and the issue should be gone.

Re: [PR] Record if block API has been used in SegmentInfo [lucene]

2023-10-25 Thread via GitHub
s1monw commented on PR #12685: URL: https://github.com/apache/lucene/pull/12685#issuecomment-1780091805 I pushed fixes... thanks @jpountz

Re: [PR] Add support for similarity-based vector searches [lucene]

2023-10-25 Thread via GitHub
kaivalnp commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1780186180 Thank you! I'll try to incorporate earlier suggestions in the meanwhile

Re: [PR] Consolidate FSTStore and BytesStore in FST [lucene]

2023-10-25 Thread via GitHub
mikemccand merged PR #12709: URL: https://github.com/apache/lucene/pull/12709

Re: [PR] Consolidate FSTStore and BytesStore in FST [lucene]

2023-10-25 Thread via GitHub
mikemccand commented on PR #12709: URL: https://github.com/apache/lucene/pull/12709#issuecomment-1780274592 Thanks @dungba88 -- I just merged. We can open a new PR when it's time to backport ...

Re: [I] FSTCompiler's NodeHash should fully duplicate `byte[]` slices from the growing FST [lucene]

2023-10-25 Thread via GitHub
mikemccand commented on issue #12714: URL: https://github.com/apache/lucene/issues/12714#issuecomment-1780282563 I made a quick hackity change, just to measure the number of additional bytes we'd "typically" have to copy in order to duplicate suffix bytes from the growing (forced append-onl

Re: [PR] Disable suffix sharing for block tree index [lucene]

2023-10-25 Thread via GitHub
gf2121 commented on code in PR #12722: URL: https://github.com/apache/lucene/pull/12722#discussion_r1372542501 ## lucene/CHANGES.txt: ## @@ -227,6 +227,8 @@ Optimizations * GITHUB#12712: Speed up sorting postings file with an offline radix sorter in BPIndexReader. (Guo Feng)

Re: [PR] Disable suffix sharing for block tree index [lucene]

2023-10-25 Thread via GitHub
gf2121 merged PR #12722: URL: https://github.com/apache/lucene/pull/12722

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-25 Thread via GitHub
jmazanec15 commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1372593680 ## lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java: ## @@ -0,0 +1,267 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-25 Thread via GitHub
zhaih commented on code in PR #12660: URL: https://github.com/apache/lucene/pull/12660#discussion_r1372595375 ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java: ## @@ -151,61 +159,128 @@ public OnHeapHnswGraph build(int maxOrd) throws IOException {

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-25 Thread via GitHub
zhaih commented on code in PR #12660: URL: https://github.com/apache/lucene/pull/12660#discussion_r1372606067 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsFormat.java: ## @@ -198,14 +218,25 @@ public Lucene99HnswVectorsFormat( + ";

Re: [PR] Make IndexSearcher#getSlices final and clarify docs [lucene]

2023-10-26 Thread via GitHub
javanna commented on PR #12718: URL: https://github.com/apache/lucene/pull/12718#issuecomment-1780674911 thanks @s1monw for the review! I will add a changes entry and merge this.

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-26 Thread via GitHub
tveasey commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1372862574 ## lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java: ## @@ -0,0 +1,267 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more +

[PR] Feat: Use DocIdSetIterator Reduuce bkd docvalues iteration [lucene]

2023-10-26 Thread via GitHub
luyuncheng opened a new pull request, #12723: URL: https://github.com/apache/lucene/pull/12723 ### Description I see some hot_thread like following stack, ``` java.lang.Thread.State: RUNNABLE at org.apache.lucene.store.DataInput.readVInt(DataInput.java:112)

Re: [PR] Make IndexSearcher#getSlices final and clarify docs [lucene]

2023-10-26 Thread via GitHub
javanna merged PR #12718: URL: https://github.com/apache/lucene/pull/12718

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-26 Thread via GitHub
jmazanec15 commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1373421235 ## lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java: ## @@ -0,0 +1,267 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more

Re: [PR] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-10-26 Thread via GitHub
shubhamvishu commented on PR #12716: URL: https://github.com/apache/lucene/pull/12716#issuecomment-1781810333 @mikemccand @bruno-roustant So I also ran the FST construction time benchmarks on `wikimediumall` index using `IndexToFST` over a couple of combinations of tweakable parameters to

Re: [PR] Random access term dictionary [lucene]

2023-10-26 Thread via GitHub
Tony-X commented on code in PR #12688: URL: https://github.com/apache/lucene/pull/12688#discussion_r1373799812 ## lucene/core/src/java/module-info.java: ## @@ -35,6 +35,7 @@ exports org.apache.lucene.codecs.lucene95; exports org.apache.lucene.codecs.lucene90.blocktree;
