Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-24 Thread via GitHub
zhaih commented on PR #12660: URL: https://github.com/apache/lucene/pull/12660#issuecomment-1778626211 @msokolov @benwtrent I removed almost all `nocommit` (except the renaming one) and rebased to main, please take a look if you have time. @benwtrent Please check whether the rebase an

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-24 Thread via GitHub
zhaih commented on code in PR #12660: URL: https://github.com/apache/lucene/pull/12660#discussion_r1371235805 ## lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java: ## @@ -146,18 +148,24 @@ public final class Lucene95HnswVectorsFormat extends

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-24 Thread via GitHub
zhaih commented on code in PR #12660: URL: https://github.com/apache/lucene/pull/12660#discussion_r1371211783 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -35,6 +38,9 @@ public class NeighborArray { float[] score; int[] node; private int

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-24 Thread via GitHub
zhaih commented on code in PR #12660: URL: https://github.com/apache/lucene/pull/12660#discussion_r1371210641 ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java: ## @@ -221,34 +296,50 @@ private long printGraphBuildStatus(int node, long start, long t) {

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-24 Thread via GitHub
zhaih commented on code in PR #12660: URL: https://github.com/apache/lucene/pull/12660#discussion_r1371207682 ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java: ## @@ -221,34 +296,50 @@ private long printGraphBuildStatus(int node, long start, long t) {

Re: [PR] Speed up the sort when building forward index [lucene]

2023-10-24 Thread via GitHub
gf2121 merged PR #12712: URL: https://github.com/apache/lucene/pull/12712 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apac

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-10-24 Thread via GitHub
dungba88 commented on PR #12624: URL: https://github.com/apache/lucene/pull/12624#issuecomment-1778455090 One thing that baffled me is that we are writing the metadata, including the numBytes and start node, at the beginning of the DataOutput. That means once the FST is completed, we

Re: [PR] Use Arrays#mismatch for Outputs#common operations [lucene]

2023-10-24 Thread via GitHub
gf2121 merged PR #12710: URL: https://github.com/apache/lucene/pull/12710

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-10-24 Thread via GitHub
dungba88 commented on PR #12624: URL: https://github.com/apache/lucene/pull/12624#issuecomment-1778336652 (A small note: Tantivy uses a value-based LRU cache with 2-item buckets; items are evicted per bucket: https://github.com/BurntSushi/fst/blob/a0936e9b25a888a0d5b9f94b91997216253e7088/

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-24 Thread via GitHub
benwtrent merged PR #12582: URL: https://github.com/apache/lucene/pull/12582

Re: [PR] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-10-24 Thread via GitHub
shubhamvishu commented on PR #12716: URL: https://github.com/apache/lucene/pull/12716#issuecomment-130978 Thanks for the quick review @mikemccand! I have addressed the comments in the new revision. > Could you also change the probe from quadratic (what it is now) to a simple line

Re: [PR] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-10-24 Thread via GitHub
shubhamvishu commented on code in PR #12716: URL: https://github.com/apache/lucene/pull/12716#discussion_r1370588249 ## lucene/CHANGES.txt: ## @@ -190,6 +190,8 @@ Improvements * GITHUB#12705, GITHUB#12705: Improve handling of NullPointerException and IllegalStateException i

Re: [PR] Deprecated public constructor of FSTCompiler in favor of the Builder. [lucene]

2023-10-24 Thread via GitHub
cavorite commented on code in PR #12715: URL: https://github.com/apache/lucene/pull/12715#discussion_r1370542730 ## lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java: ## @@ -122,8 +122,11 @@ public class FSTCompiler { /** * Instantiates an FST/FSA builder w

Re: [PR] Speed up the sort when building forward index [lucene]

2023-10-24 Thread via GitHub
gf2121 commented on PR #12712: URL: https://github.com/apache/lucene/pull/12712#issuecomment-1777640134 > I sent you an email to your Apache address, can you check it out? Sorry for missing the email. Thank you so much for reminding me here!

Re: [PR] Capture build scans on ge.apache.org to benefit from deep build insights [lucene]

2023-10-24 Thread via GitHub
dsmiley merged PR #12293: URL: https://github.com/apache/lucene/pull/12293

Re: [PR] Speed up the sort when building forward index [lucene]

2023-10-24 Thread via GitHub
jpountz commented on PR #12712: URL: https://github.com/apache/lucene/pull/12712#issuecomment-1777605586 Unrelated to this change: I sent you an email to your Apache address, can you check it out? (Sorry for the noise on this PR, I don't know how else to contact you).

Re: [PR] Capture build scans on ge.apache.org to benefit from deep build insights [lucene]

2023-10-24 Thread via GitHub
dsmiley commented on PR #12293: URL: https://github.com/apache/lucene/pull/12293#issuecomment-1777598613 Here's [my build I ran locally](https://ge.apache.org/s/nsaeazuf3tkcu/timeline) for anyone who is interested. I observe that Lucene tests seem well balanced (judging from the cool time

Re: [PR] Sometimes intersect the essential clause and the best non-essential clause. [lucene]

2023-10-24 Thread via GitHub
jpountz merged PR #12589: URL: https://github.com/apache/lucene/pull/12589

Re: [PR] Run merge-on-full-flush even though no changes got flushed. [lucene]

2023-10-24 Thread via GitHub
jpountz commented on PR #12549: URL: https://github.com/apache/lucene/pull/12549#issuecomment-1777530886 I reverted as this causes a deadlock in TestStressIndexing: ``` "TEST-TestStressIndexing.testStressIndexAndSearching-seed#[D4B60FA81EB58FF3]" #42 [282516] prio=5 os_prio=0 cpu=

Re: [PR] TaskExecutor to cancel all tasks on exception [lucene]

2023-10-24 Thread via GitHub
javanna commented on PR #12689: URL: https://github.com/apache/lucene/pull/12689#issuecomment-1777505662 Thanks for the review @jpountz!

Re: [PR] TaskExecutor to cancel all tasks on exception [lucene]

2023-10-24 Thread via GitHub
javanna merged PR #12689: URL: https://github.com/apache/lucene/pull/12689

Re: [PR] Capture build scans on ge.apache.org to benefit from deep build insights [lucene]

2023-10-24 Thread via GitHub
clayburn commented on PR #12293: URL: https://github.com/apache/lucene/pull/12293#issuecomment-1777496879 @dsmiley - Excellent questions: > An espoused benefit to this PR is that, as an Apache committer, I could do builds on my own machine and have the analysis be published for viewin

Re: [PR] Speed up the sort when building forward index [lucene]

2023-10-24 Thread via GitHub
jpountz commented on code in PR #12712: URL: https://github.com/apache/lucene/pull/12712#discussion_r1370397139 ## lucene/misc/src/java/org/apache/lucene/misc/index/BPIndexReorderer.java: ## @@ -991,4 +939,166 @@ static int readMonotonicInts(DataInput in, int[] ints) throws IOE

Re: [PR] Speed up the sort when building forward index [lucene]

2023-10-24 Thread via GitHub
gf2121 commented on code in PR #12712: URL: https://github.com/apache/lucene/pull/12712#discussion_r1370377164 ## lucene/misc/src/java/org/apache/lucene/misc/index/BPIndexReorderer.java: ## @@ -991,4 +939,233 @@ static int readMonotonicInts(DataInput in, int[] ints) throws IOEx

Re: [PR] Run merge-on-full-flush even though no changes got flushed. [lucene]

2023-10-24 Thread via GitHub
jpountz merged PR #12549: URL: https://github.com/apache/lucene/pull/12549

Re: [PR] Consolidate FSTStore and BytesStore in FST [lucene]

2023-10-24 Thread via GitHub
dungba88 commented on PR #12709: URL: https://github.com/apache/lucene/pull/12709#issuecomment-1777396050 I added an entry in the CHANGES.txt under Lucene 10.0 (as we are not backporting)

Re: [PR] Create a task executor when executor is not provided [lucene]

2023-10-24 Thread via GitHub
javanna commented on code in PR #12606: URL: https://github.com/apache/lucene/pull/12606#discussion_r1370308080 ## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ## @@ -420,13 +418,12 @@ public int count(Query query) throws IOException { } /** - * Re

Re: [PR] Capture build scans on ge.apache.org to benefit from deep build insights [lucene]

2023-10-24 Thread via GitHub
dsmiley commented on PR #12293: URL: https://github.com/apache/lucene/pull/12293#issuecomment-1777343300 @clayburn An espoused benefit to this PR is that, as an Apache committer, I could do builds on my own machine and have the analysis be published for viewing on ge.apache.org. How do I d

Re: [PR] Speed up the sort when building forward index [lucene]

2023-10-24 Thread via GitHub
jpountz commented on code in PR #12712: URL: https://github.com/apache/lucene/pull/12712#discussion_r1370176150 ## lucene/misc/src/java/org/apache/lucene/misc/index/BPIndexReorderer.java: ## @@ -991,4 +939,233 @@ static int readMonotonicInts(DataInput in, int[] ints) throws IOE

Re: [PR] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-10-24 Thread via GitHub
mikemccand commented on code in PR #12716: URL: https://github.com/apache/lucene/pull/12716#discussion_r1370148505 ## lucene/CHANGES.txt: ## @@ -190,6 +190,8 @@ Improvements * GITHUB#12705, GITHUB#12705: Improve handling of NullPointerException and IllegalStateException in

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-10-24 Thread via GitHub
dungba88 commented on PR #12624: URL: https://github.com/apache/lucene/pull/12624#issuecomment-1777157731 @mikemccand I rebased and created some implementation of DataOutput-based FSTWriter. I think I need to write tests, but let me know what you think.

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-24 Thread via GitHub
msokolov commented on code in PR #12660: URL: https://github.com/apache/lucene/pull/12660#discussion_r1370054483 ## lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java: ## @@ -146,18 +148,24 @@ public final class Lucene95HnswVectorsFormat extend

Re: [PR] Add a specialized bulk scorer for regular conjunctions. [lucene]

2023-10-24 Thread via GitHub
jpountz commented on PR #12719: URL: https://github.com/apache/lucene/pull/12719#issuecomment-1777078920 Wikibigall: ``` Task QPS baseline StdDev QPS my_modified_version StdDev Pct diff p-value Prefix3

[PR] Add a specialized bulk scorer for regular conjunctions. [lucene]

2023-10-24 Thread via GitHub
jpountz opened a new pull request, #12719: URL: https://github.com/apache/lucene/pull/12719 PR #12382 added a bulk scorer for top-k hits on conjunctions that yielded a significant speedup (annotation [FP](http://people.apache.org/~mikemccand/lucenebench/AndHighHigh.html)). This change pr

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-24 Thread via GitHub
msokolov commented on code in PR #12660: URL: https://github.com/apache/lucene/pull/12660#discussion_r1370046773 ## lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java: ## @@ -635,17 +667,31 @@ private static DocsWithFieldSet writeVectorData(

Re: [PR] Prevent users from using document block APIs when sort is configured [lucene]

2023-10-24 Thread via GitHub
msokolov commented on PR #12711: URL: https://github.com/apache/lucene/pull/12711#issuecomment-1777062768 Another question: do we have any testing around this sort-stability / block-preservation today? I'm getting nervous now that we are relying on an undocumented feature that just happens

Re: [PR] Prevent users from using document block APIs when sort is configured [lucene]

2023-10-24 Thread via GitHub
msokolov commented on PR #12711: URL: https://github.com/apache/lucene/pull/12711#issuecomment-1777060875 The idea of more explicitly modeling doc-blocks makes sense to me, but I wonder if it would really enable "That way we can sort only on the parent document". What about the case (suppor

Re: [PR] Speed up the sort when building forward index [lucene]

2023-10-24 Thread via GitHub
gf2121 commented on PR #12712: URL: https://github.com/apache/lucene/pull/12712#issuecomment-1777056056 I indexed `wikimidumall` with: * BPIndexReorderer config mentioned [here](https://github.com/apache/lucene/issues/12665#issuecomment-1770827026). * BPMergePolicy on this [commit](htt

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-24 Thread via GitHub
benwtrent commented on code in PR #12660: URL: https://github.com/apache/lucene/pull/12660#discussion_r1370034532 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -35,6 +38,9 @@ public class NeighborArray { float[] score; int[] node; private

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-24 Thread via GitHub
benwtrent commented on code in PR #12660: URL: https://github.com/apache/lucene/pull/12660#discussion_r1370028382 ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java: ## @@ -221,34 +296,50 @@ private long printGraphBuildStatus(int node, long start, long t)

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-24 Thread via GitHub
benwtrent commented on code in PR #12660: URL: https://github.com/apache/lucene/pull/12660#discussion_r1370023343 ## lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java: ## @@ -146,18 +148,24 @@ public final class Lucene95HnswVectorsFormat exten

Re: [PR] Clean up ByteBlockPool [lucene]

2023-10-24 Thread via GitHub
iverase commented on code in PR #12506: URL: https://github.com/apache/lucene/pull/12506#discussion_r1369987969 ## lucene/core/src/java/org/apache/lucene/util/ByteSlicePool.java: ## @@ -0,0 +1,138 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + *

Re: [PR] Speed up the sort when building forward index [lucene]

2023-10-24 Thread via GitHub
gf2121 commented on PR #12712: URL: https://github.com/apache/lucene/pull/12712#issuecomment-1776994927 ``` BPIndexReorderer reorderer = new BPIndexReorderer(); reorderer.setMinDocFreq(16384); reorderer.setMaxIters(3); reorderer.setMinPartitionSize(8192); mp = new BPReorderingM

Re: [PR] Clean up ByteBlockPool [lucene]

2023-10-24 Thread via GitHub
iverase commented on PR #12506: URL: https://github.com/apache/lucene/pull/12506#issuecomment-1776973771 I like the introduction of `ByteSlicePool`, but I wonder if the naming is correct, as it does not feel like a generic slicer class but one very tied to the format used by TermsHashPerField. Just

Re: [PR] Make IndexSearcher#getSlices final and clarify docs [lucene]

2023-10-24 Thread via GitHub
javanna commented on code in PR #12718: URL: https://github.com/apache/lucene/pull/12718#discussion_r1369871845 ## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ## @@ -115,10 +115,10 @@ public class IndexSearcher { protected final List leafContexts; /

Re: [PR] Make IndexSearcher#getSlices final and clarify docs [lucene]

2023-10-24 Thread via GitHub
javanna commented on code in PR #12718: URL: https://github.com/apache/lucene/pull/12718#discussion_r1369871488 ## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ## @@ -425,11 +425,12 @@ public int count(Query query) throws IOException { } /** - * Re

Re: [PR] Prevent users from using document block APIs when sort is configured [lucene]

2023-10-24 Thread via GitHub
s1monw commented on PR #12711: URL: https://github.com/apache/lucene/pull/12711#issuecomment-1776831638 I spoke to @jpountz about this topic and we discussed a different approach. We could get away with not having the check at all and make blocks a first-class citizen by recording the paren

[I] Use max BPV encoding in postings if doc buffer size less than ForUtil.BLOCK_SIZE [lucene]

2023-10-24 Thread via GitHub
easyice opened a new issue, #12717: URL: https://github.com/apache/lucene/issues/12717 ### Description Currently we use vint encoding for the doc IDs if the doc buffer < 128, then decode in `Lucene90PostingsReader#readVIntBlock`. For a high-cardinality field, it is possibly slow to

Re: [I] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-10-24 Thread via GitHub
shubhamvishu commented on issue #12704: URL: https://github.com/apache/lucene/issues/12704#issuecomment-1776661739 I opened a PR to make use of this constant : #12716 Also, I was thinking if this constant could be utilised in other hash function implementations as well in the codebas

[PR] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-10-24 Thread via GitHub
shubhamvishu opened a new pull request, #12716: URL: https://github.com/apache/lucene/pull/12716 ### Description Addresses #12704. Below is the comment that inspired this ([link](https://github.com/apache/lucene/pull/12633#discussion_r1366847986)), ``` Instead, we shoul