Re: [PR] Fix jacoco coverage tests (add createClassLoader to replicator permissions) [lucene]

2023-10-16 Thread via GitHub
dweiss merged PR #12684: URL: https://github.com/apache/lucene/pull/12684 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apac

Re: [PR] Extract the hnsw graph merging from being part of the vector writer [lucene]

2023-10-16 Thread via GitHub
benwtrent commented on code in PR #12657: URL: https://github.com/apache/lucene/pull/12657#discussion_r1360502250 ## lucene/core/src/java/org/apache/lucene/util/hnsw/InitializedHnswGraphBuilder.java: ## @@ -0,0 +1,78 @@ +/* + * Licensed to the Apache Software Foundation (ASF) un

Re: [PR] Record if block API has been used in SegmentsInfo [lucene]

2023-10-16 Thread via GitHub
s1monw commented on code in PR #12685: URL: https://github.com/apache/lucene/pull/12685#discussion_r1360623450 ## lucene/core/src/java/org/apache/lucene/index/IndexWriter.java: ## @@ -3368,9 +3368,15 @@ public void addIndexesReaderMerge(MergePolicy.OneMerge merge) throws IOExce

Re: [PR] Record if block API has been used in SegmentsInfo [lucene]

2023-10-16 Thread via GitHub
s1monw commented on code in PR #12685: URL: https://github.com/apache/lucene/pull/12685#discussion_r1360627901 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99SegmentInfoFormat.java: ## @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache Software Foundation (ASF)

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-10-16 Thread via GitHub
mikemccand commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1360701669 ## lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java: ## @@ -21,19 +21,18 @@ import java.util.List; import org.apache.lucene.store.DataInput; impor

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-10-16 Thread via GitHub
mikemccand commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1360715823 ## lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java: ## @@ -21,19 +21,18 @@ import java.util.List; import org.apache.lucene.store.DataInput; impor

Re: [PR] Record if block API has been used in SegmentInfo [lucene]

2023-10-16 Thread via GitHub
s1monw commented on code in PR #12685: URL: https://github.com/apache/lucene/pull/12685#discussion_r1360837043 ## lucene/core/src/java/org/apache/lucene/index/IndexWriter.java: ## @@ -3368,9 +3368,15 @@ public void addIndexesReaderMerge(MergePolicy.OneMerge merge) throws IOExce

Re: [PR] [WIP] first cut at bounding the NodeHash size during FST compilation [lucene]

2023-10-16 Thread via GitHub
dungba88 commented on code in PR #12633: URL: https://github.com/apache/lucene/pull/12633#discussion_r1360866889 ## lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java: ## @@ -99,31 +87,23 @@ public class FSTCompiler { * tuning and tweaking, see {@link Builder}.

Re: [PR] [WIP] first cut at bounding the NodeHash size during FST compilation [lucene]

2023-10-16 Thread via GitHub
dungba88 commented on code in PR #12633: URL: https://github.com/apache/lucene/pull/12633#discussion_r1360875178 ## lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java: ## @@ -17,50 +17,80 @@ package org.apache.lucene.util.fst; import java.io.IOException; -import o

Re: [PR] Fix segmentInfos replace doesn't set userData [lucene]

2023-10-16 Thread via GitHub
msfroh commented on code in PR #12626: URL: https://github.com/apache/lucene/pull/12626#discussion_r1360878505 ## lucene/core/src/test/org/apache/lucene/index/TestIndexWriter.java: ## @@ -1996,6 +1996,41 @@ public void testGetCommitData() throws Exception { dir.close();

Re: [PR] Fix segmentInfos replace doesn't set userData [lucene]

2023-10-16 Thread via GitHub
msfroh commented on code in PR #12626: URL: https://github.com/apache/lucene/pull/12626#discussion_r1360880697 ## lucene/core/src/test/org/apache/lucene/index/TestIndexWriter.java: ## @@ -1996,6 +1996,41 @@ public void testGetCommitData() throws Exception { dir.close();

Re: [PR] read MSB VLong in new way [lucene]

2023-10-16 Thread via GitHub
gf2121 commented on PR #12661: URL: https://github.com/apache/lucene/pull/12661#issuecomment-1764814636 I made some effort to speed up the `add` operation for `BytesRef`, getting a tiny improvement: > Baseline: after https://github.com/apache/lucene/pull/12631; Candidate: this patch;

[I] `IndexOrDocValuesQuery` does not support query highlighting [lucene]

2023-10-16 Thread via GitHub
harshavamsi opened a new issue, #12686: URL: https://github.com/apache/lucene/issues/12686 ### Description While working with the `IndexOrDocValuesQuery`, I noticed that highlighting was broken. This is potentially caused by the extract function that does not check if the query is in

[PR] Add timeouts to github jobs. [lucene]

2023-10-16 Thread via GitHub
dweiss opened a new pull request, #12687: URL: https://github.com/apache/lucene/pull/12687 Estimates taken from empirical run times (actions history), with a generous buffer added.

Re: [PR] Extract the hnsw graph merging from being part of the vector writer [lucene]

2023-10-16 Thread via GitHub
benwtrent commented on code in PR #12657: URL: https://github.com/apache/lucene/pull/12657#discussion_r136233 ## lucene/core/src/java/org/apache/lucene/codecs/lucene95/IncrementalHnswGraphMerger.java: ## @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Foundation (A

Re: [PR] Extract the hnsw graph merging from being part of the vector writer [lucene]

2023-10-16 Thread via GitHub
benwtrent commented on code in PR #12657: URL: https://github.com/apache/lucene/pull/12657#discussion_r1361112199 ## lucene/core/src/java/org/apache/lucene/codecs/lucene95/IncrementalHnswGraphMerger.java: ## @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Foundation (A

Re: [PR] Fix SynonymQuery equals implementation [lucene]

2023-10-16 Thread via GitHub
mingshl commented on PR #12260: URL: https://github.com/apache/lucene/pull/12260#issuecomment-1765091198 @romseygeek @mkhludnev, this bug was introduced in version 9.4; can this PR be back-ported to 9.4.2 to fix the issue?

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-16 Thread via GitHub
jmazanec15 commented on PR #12582: URL: https://github.com/apache/lucene/pull/12582#issuecomment-1765145453 Hey @benwtrent, sorry for the delay, I'm still looking through the change. But a 4x space improvement with minimal recall loss is awesome.

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-16 Thread via GitHub
benwtrent commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1361186978 ## lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java: ## @@ -0,0 +1,267 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-16 Thread via GitHub
jmazanec15 commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1357440025 ## lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java: ## @@ -0,0 +1,267 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more

Re: [PR] Record if block API has been used in SegmentInfo [lucene]

2023-10-16 Thread via GitHub
jpountz commented on code in PR #12685: URL: https://github.com/apache/lucene/pull/12685#discussion_r1361188738 ## lucene/core/src/java/org/apache/lucene/index/SegmentInfo.java: ## @@ -153,6 +157,16 @@ public boolean getUseCompoundFile() { return isCompoundFile; } + /

Re: [PR] read MSB VLong in new way [lucene]

2023-10-16 Thread via GitHub
jpountz commented on PR #12661: URL: https://github.com/apache/lucene/pull/12661#issuecomment-1765186395 If we're specializing the format anyway, I wonder if we could try different layouts. E.g. another option could be to encode the number of supplementary bytes using unary coding (like UTF
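
As a rough illustration of the layout suggested above, here is one way a long could be stored with the number of supplementary bytes unary-coded in the leading bits of the first byte, in the style of UTF-8. This is a sketch under assumptions, not code from PR #12661: the class name `UnaryPrefixVLong` and its methods are hypothetical, and the exact bit layout (n leading one-bits, a zero bit, then payload, most significant byte first) is only one possible reading of the comment.

```java
/** Hypothetical sketch: a long prefixed by a unary-coded count of supplementary bytes. */
final class UnaryPrefixVLong {

  /** Number of bytes that follow the first byte (0..8) for v, treated as unsigned. */
  static int supplementaryBytes(long v) {
    int bits = 64 - Long.numberOfLeadingZeros(v); // significant bits, 0 for v == 0
    if (bits <= 7) {
      return 0;
    }
    // The first byte carries 7 - n payload bits and each extra byte carries 8, so
    // n extra bytes hold 7 * (n + 1) bits for n < 8; n == 8 holds all 64 bits.
    return Math.min(8, (bits - 1) / 7);
  }

  /** Writes v at out[pos...] and returns the position just past the last byte written. */
  static int encode(long v, byte[] out, int pos) {
    int n = supplementaryBytes(v);
    if (n == 8) {
      out[pos++] = (byte) 0xFF; // eight one-bits: no payload fits in the first byte
      for (int shift = 56; shift >= 0; shift -= 8) {
        out[pos++] = (byte) (v >>> shift);
      }
      return pos;
    }
    int prefix = (0xFF << (8 - n)) & 0xFF; // n leading one-bits, then a zero bit
    int highBits = (int) (v >>> (8 * n)); // fits in the remaining 7 - n bits
    out[pos++] = (byte) (prefix | highBits);
    for (int shift = 8 * (n - 1); shift >= 0; shift -= 8) {
      out[pos++] = (byte) (v >>> shift); // remaining bytes, most significant first
    }
    return pos;
  }

  /** Reads a value written by encode, starting at in[pos]. */
  static long decode(byte[] in, int pos) {
    int first = in[pos++] & 0xFF;
    // Count the leading one-bits of the first byte; this is the full length
    // information, so no per-byte continuation flag has to be tested.
    int n = Math.min(8, Integer.numberOfLeadingZeros((~first & 0xFF) << 24));
    long value = (n == 8) ? 0L : (first & (0x7F >>> n));
    for (int i = 0; i < n; i++) {
      value = (value << 8) | (in[pos++] & 0xFF);
    }
    return value;
  }

  public static void main(String[] args) {
    byte[] buf = new byte[9];
    long value = 300L;
    int end = encode(value, buf, 0);
    System.out.println(value + " -> " + end + " byte(s), decoded " + decode(buf, 0));
  }
}
```

In this sketch the decoder learns the total length from the first byte alone, which is the property that distinguishes it from a continuation-bit VLong; whether that helps in practice would need the kind of benchmarking discussed in the PR.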

Re: [PR] Optimize OnHeapHnswGraph's data structure [lucene]

2023-10-16 Thread via GitHub
zhaih merged PR #12651: URL: https://github.com/apache/lucene/pull/12651

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-16 Thread via GitHub
uschindler commented on PR #12582: URL: https://github.com/apache/lucene/pull/12582#issuecomment-1765287960 Hi, why do we need a new Codec? The Lucene main file format does not change, only the HNSW format was exchanged. Because like postingsformats and docvaluesformats, the SPI can detect

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-16 Thread via GitHub
benwtrent commented on PR #12582: URL: https://github.com/apache/lucene/pull/12582#issuecomment-1765316000 @uschindler so I should just add a new format? It would be a new Lucene99 HNSW format, but keep the default Lucene95 HNSW format? Or can we change the default vector form

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-16 Thread via GitHub
jmazanec15 commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1361261168 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java: ## @@ -0,0 +1,782 @@ +/* + * Licensed to the Apache Software Fo

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-16 Thread via GitHub
jimczi commented on PR #12582: URL: https://github.com/apache/lucene/pull/12582#issuecomment-1765330759 > why do we need a new top-level Codec? The Lucene main file format does not change, only the HNSW format was exchanged. Because like postingsformats and docvaluesformats, the SPI can d

[PR] Random access term dictionary [lucene]

2023-10-16 Thread via GitHub
Tony-X opened a new pull request, #12688: URL: https://github.com/apache/lucene/pull/12688 ### Description Related issue https://github.com/apache/lucene/issues/12513 Opening this PR early to avoid massive diffs in one-shot - [x] Encode (term type, local ord) in FST T

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-16 Thread via GitHub
uschindler commented on PR #12582: URL: https://github.com/apache/lucene/pull/12582#issuecomment-1765355363 > > why do we need a new top-level Codec? The Lucene main file format does not change, only the HNSW format was exchanged. Because like postingsformats and docvaluesformats, the SPI

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-16 Thread via GitHub
uschindler commented on PR #12582: URL: https://github.com/apache/lucene/pull/12582#issuecomment-1765362790 > @uschindler so I should just add a new format? > > It would be a new Lucene99 HNSW format, but keep the default Lucene95 HNSW format? > > Or can we change the default v

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-16 Thread via GitHub
uschindler commented on PR #12582: URL: https://github.com/apache/lucene/pull/12582#issuecomment-1765382016 I just checked the code, the 9.5 top-level codec addition was useless. Just code duplication. We can't revert it anymore, but we should not repeat that. The only required top-level Fo

Re: [PR] Create a task executor when executor is not provided [lucene]

2023-10-16 Thread via GitHub
sohami commented on code in PR #12606: URL: https://github.com/apache/lucene/pull/12606#discussion_r1361333154 ## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ## @@ -420,13 +418,12 @@ public int count(Query query) throws IOException { } /** - * Ret

Re: [PR] [BROKEN, for reference only] concurrent hnsw [lucene]

2023-10-16 Thread via GitHub
zhaih commented on code in PR #12683: URL: https://github.com/apache/lucene/pull/12683#discussion_r1361334258 ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java: ## @@ -59,11 +60,26 @@ protected HnswGraph() {} * * @param level level of the graph * @pa

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-16 Thread via GitHub
uschindler commented on PR #12582: URL: https://github.com/apache/lucene/pull/12582#issuecomment-1765386547 The simplest change is: - Remove Lucene99Codec - In Lucene95Codec just change this: `this.defaultKnnVectorsFormat = new Lucene95HnswVectorsFormat();` to the new format. Do

Re: [PR] Extract the hnsw graph merging from being part of the vector writer [lucene]

2023-10-16 Thread via GitHub
zhaih commented on code in PR #12657: URL: https://github.com/apache/lucene/pull/12657#discussion_r1361341835 ## lucene/core/src/java/org/apache/lucene/util/hnsw/IncrementalHnswGraphMerger.java: ## @@ -0,0 +1,197 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

Re: [I] Exception rising while using QueryTimeout [lucene]

2023-10-16 Thread via GitHub
msfroh commented on issue #12032: URL: https://github.com/apache/lucene/issues/12032#issuecomment-1765587096 I started to work on making DrillSidewaysScorer work on windows of doc IDs, when I noticed the following comment added in TestDrillSideways as part of https://github.com/apache/lucen

Re: [PR] Add timeouts to github jobs. [lucene]

2023-10-16 Thread via GitHub
dweiss merged PR #12687: URL: https://github.com/apache/lucene/pull/12687

Re: [PR] Fix SynonymQuery equals implementation [lucene]

2023-10-16 Thread via GitHub
mkhludnev commented on PR #12260: URL: https://github.com/apache/lucene/pull/12260#issuecomment-1765741713 Hi @mingshl, I can cherry-pick this fix into branch_9_4, but I'm not sure there will ever be a 9.4.2 release.