Re: [PR] TestIndexWriterOnVMError.testUnknownError times out (fixes potential IW deadlock on tragic exceptions). [lucene]
dweiss merged PR #12751: URL: https://github.com/apache/lucene/pull/12751 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] TestIndexWriterOnVMError.testUnknownError times out (potential IndexWriter deadlock with tragic exceptions) [lucene]
dweiss closed issue #12654: TestIndexWriterOnVMError.testUnknownError times out (potential IndexWriter deadlock with tragic exceptions) URL: https://github.com/apache/lucene/issues/12654 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Introduce the similarity as boost functionality to the Word2VecSynonyFilter [lucene]
dantuzi commented on PR #12433: URL: https://github.com/apache/lucene/pull/12433#issuecomment-1801394824 Thanks @mikemccand for your feedback. I had to address some comments from @alessandrobenedetti, that's why this PR is still WIP. At the moment I have other priorities at work but I'll resolve all conflicts and address your comments in the next few days -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]
mikemccand commented on code in PR #240: URL: https://github.com/apache/lucene/pull/240#discussion_r1386407230 ## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ## @@ -602,34 +579,10 @@ private TopFieldDocs searchAfter( final Sort rewrittenSort = sort.rewrite(this); final LeafSlice[] leafSlices = getSlices(); +final boolean supportsConcurrency = leafSlices.length > 1; final CollectorManager manager = -new CollectorManager<>() { - - private final HitsThresholdChecker hitsThresholdChecker = - leafSlices.length <= 1 - ? HitsThresholdChecker.create(Math.max(TOTAL_HITS_THRESHOLD, numHits)) - : HitsThresholdChecker.createShared(Math.max(TOTAL_HITS_THRESHOLD, numHits)); - - private final MaxScoreAccumulator minScoreAcc = - leafSlices.length <= 1 ? null : new MaxScoreAccumulator(); - - @Override - public TopFieldCollector newCollector() throws IOException { -// TODO: don't pay the price for accurate hit counts by default -return TopFieldCollector.create( -rewrittenSort, cappedNumHits, after, hitsThresholdChecker, minScoreAcc); - } - - @Override - public TopFieldDocs reduce(Collection collectors) throws IOException { -final TopFieldDocs[] topDocs = new TopFieldDocs[collectors.size()]; -int i = 0; -for (TopFieldCollector collector : collectors) { - topDocs[i++] = collector.topDocs(); -} -return TopDocs.merge(rewrittenSort, 0, cappedNumHits, topDocs); - } -}; +new TopFieldCollectorManager( Review Comment: Wonderful to see all this redundant code go away! ## lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java: ## @@ -429,106 +432,29 @@ public static TopFieldCollector create(Sort sort, int numHits, int totalHitsThre * field is indexed both with doc values and points. In this case, there is an assumption that * the same data is stored in these points and doc values. * @return a {@link TopFieldCollector} instance which will sort the results by the sort criteria. + * @deprecated This method is being deprecated in favor of using the constructor of {@link Review Comment: Remove `being` and `using`? Makes the language more punchy (avoids passive voice). ## lucene/core/src/java/org/apache/lucene/search/TopDocs.java: ## @@ -232,8 +232,8 @@ public static TopDocs merge( /** * Returns a new TopFieldDocs, containing topN results across the provided TopFieldDocs, sorting * by the specified {@link Sort}. Each of the TopDocs must have been sorted by the same Sort, and - * sort field values must have been filled (ie, fillFields=true must be passed to - * {@link TopFieldCollector#create}). + * sort field values must have been filled (ie, fillFields=true must be passed to the + * constructor of {@link TopFieldCollectorManager}). Review Comment: Hmm I don't think I see a `boolean fillFields` parameter to `TopFieldCollectorManager`? ## lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java: ## @@ -44,7 +43,7 @@ public void setScorer(Scorable scorer) throws IOException { } Review Comment: Should we add a comment to the javadoc for this class that one should use `TopScoreDocCollectorManager` instead of the static factory methods here? Is it too soon to deprecate this class (so we can make it package private eventually)? ## lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/SearchWithCollectorTask.java: ## @@ -45,20 +43,6 @@ public boolean withCollector() { return true; } - @Override - protected Collector createCollector() throws Exception { Review Comment: Hmm the purpose of this benchmark component is explicitly to test ones own `Collector` impl -- which I think is no longer possible with this diff (neutering it). Maybe we can move this to a `SearchWithCollectorManagerTask` instead? Or we could defer this to a later PR -- this one is already great. Progress not perfection (PNP)! ## lucene/core/src/test/org/apache/lucene/document/BaseSpatialTestCase.java: ## @@ -695,8 +695,8 @@ protected void verifyRandomDistanceQueries(IndexReader reader, Object... shapes) } } - private FixedBitSet searchIndex(IndexSearcher s, Query query, int maxDoc) throws IOException { -return s.search(query, FixedBitSetCollector.createManager(maxDoc)); + private CollectorManager searchIndex(int maxDoc) { Review Comment: Hmm rename method to `createCollectorManager` or so? It's no longer searching. ## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ## @@ -527,7 +503,10 @@ public TopDocs search(Query query, int n) throws IOException { *
Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]
mikemccand commented on PR #240: URL: https://github.com/apache/lucene/pull/240#issuecomment-1801638506 > We are still a ways away For example: query concurrency is still tied to segment geometry, which is insane. An "optimized" (`forceMerge to 1 segment`) index loses all of its concurrency! Hideously, at my team (Amazon product search) we had to create a neutered merge policy that intentionally avoids "accidentally" making a lopsided index (where too many docs are in a single segment) because it harms our long pole latencies, due to this silly Lucene limitation. We have a longstanding issue open for this: https://github.com/apache/lucene/issues/9721, and it ought not be so difficult because Lucene already has strong support for one thread to search a "slice" of a big segment, i.e. we can make the thread work units slices of segments instead of whole segments or collections of segments (what IndexSearcher now calls slices, confusingly). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Re-adding the backward_codecs.lucene90 TestPForUtil + TestForUtil [lucene]
slow-J opened a new pull request, #12781: URL: https://github.com/apache/lucene/pull/12781 Clean-up from adding the Lucene99PostingsFormat in https://github.com/apache/lucene/pull/12741 These test cases were moved to Lucene99 test directory and I forgot to copy the unmodified versions for the backward_codecs.lucene90 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]
mikemccand commented on code in PR #12778: URL: https://github.com/apache/lucene/pull/12778#discussion_r1386504215 ## lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java: ## @@ -289,21 +273,34 @@ public long getNodeAddress(long hashSlot) { } /** - * Set the node address and bytes from the provided hash slot + * Set the node address from the provided hash slot * * @param hashSlot the hash slot to write to * @param nodeAddress the node address - * @param bytes the node bytes to be copied */ -public void setNode(long hashSlot, long nodeAddress, byte[] bytes) { +public void setNodeAddress(long hashSlot, long nodeAddress) { assert fstNodeAddress.get(hashSlot) == 0; fstNodeAddress.set(hashSlot, nodeAddress); count++; +} +private void setBytes(long hashSlot, byte[] bytes) { Review Comment: Can we rename to maybe `copyNodeBytes` or so? ## lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java: ## @@ -289,21 +273,34 @@ public long getNodeAddress(long hashSlot) { } /** - * Set the node address and bytes from the provided hash slot + * Set the node address from the provided hash slot * * @param hashSlot the hash slot to write to * @param nodeAddress the node address - * @param bytes the node bytes to be copied */ -public void setNode(long hashSlot, long nodeAddress, byte[] bytes) { +public void setNodeAddress(long hashSlot, long nodeAddress) { assert fstNodeAddress.get(hashSlot) == 0; fstNodeAddress.set(hashSlot, nodeAddress); count++; +} +private void setBytes(long hashSlot, byte[] bytes) { + assert copiedNodeAddress.get(hashSlot) == 0; copiedNodes.append(bytes); // write the offset, which points to the last byte of the node we copied since we later read // this node in reverse + copiedNodeAddress.set(hashSlot, copiedNodes.getPosition() - 1); +} + +/** promote the node bytes from the fallback table */ +private void setBytes( Review Comment: Can we rename to `copyFallbackNodeBytes` or so? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Speed up BytesRefHash#sort [lucene]
jpountz commented on PR #12775: URL: https://github.com/apache/lucene/pull/12775#issuecomment-1801748121 FWIW I think it's fine if this yields a speedup on some platforms but not other platforms. From your results, it doesn't seem to slow down flushing either? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Speed up BytesRefHash#sort [lucene]
gf2121 commented on PR #12775: URL: https://github.com/apache/lucene/pull/12775#issuecomment-1801756914 > it's fine if this yields a speedup on some platforms but not other platforms +1 > it doesn't seem to slow down flushing either? Yes! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Deprecated public constructor of FSTCompiler in favor of the Builder. [lucene]
dungba88 commented on PR #12715: URL: https://github.com/apache/lucene/pull/12715#issuecomment-1801792656 @mikemccand can you help to review this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Normalize written scalar quantized vectors when using cosine similarity [lucene]
benwtrent commented on code in PR #12780: URL: https://github.com/apache/lucene/pull/12780#discussion_r1386577696 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java: ## @@ -143,6 +143,13 @@ private void writeQuantizedVectors(QuantizationFieldVectorWriter fieldData) thro byte[] vector = new byte[fieldData.dim]; final ByteBuffer offsetBuffer = ByteBuffer.allocate(Float.BYTES).order(ByteOrder.LITTLE_ENDIAN); for (float[] v : fieldData.floatVectors) { + if (fieldData.normalize) { +float[] copy = new float[v.length]; Review Comment: we should only have one `copy` array created. Outside the loop check if we need it, if so, initialize and then you can use it here in the loop. Seems better to do that than create a new `float[dim]` for every vector. ## lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java: ## @@ -81,7 +81,8 @@ public void init() { protected void addRandomFields(Document doc) { switch (vectorEncoding) { case BYTE -> doc.add(new KnnByteVectorField("v2", randomVector8(30), similarityFunction)); - case FLOAT32 -> doc.add(new KnnFloatVectorField("v2", randomVector(30), similarityFunction)); + case FLOAT32 -> doc.add( + new KnnFloatVectorField("v2", randomNormalizedVector(30), similarityFunction)); Review Comment: Were you getting many failures with your change when you put non-normalized vectors? The fix to normalize on cosine should be transparent. The caller should never be aware that it happens as it only occurs for things that are then quantized. ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java: ## @@ -196,6 +203,13 @@ void writeSortedQuantizedVectors(QuantizationFieldVectorWriter fieldData, int[] final ByteBuffer offsetBuffer = ByteBuffer.allocate(Float.BYTES).order(ByteOrder.LITTLE_ENDIAN); for (int ordinal : ordMap) { float[] v = fieldData.floatVectors.get(ordinal); + if (fieldData.normalize) { +float[] copy = new float[v.length]; Review Comment: Similar to above ^ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Use group-varint encoding for the tail of postings [lucene]
easyice opened a new pull request, #12782: URL: https://github.com/apache/lucene/pull/12782 As discussed in issue https://github.com/apache/lucene/issues/12717 the read performance of group-varint is 14-30%% faster than vint, the `Mode` 16-248 is the number of ints will be read. feel free to close the PR if the performance improves is not enough :) ``` Benchmark(size) Mode Cnt Score Error Units GroupVInt.readGroupVInt 16 thrpt5 30.743 ± 5.054 ops/us GroupVInt.readGroupVInt 32 thrpt5 14.495 ± 0.606 ops/us GroupVInt.readGroupVInt 64 thrpt5 6.930 ± 4.679 ops/us GroupVInt.readGroupVInt 128 thrpt5 3.593 ± 0.687 ops/us GroupVInt.readGroupVInt 248 thrpt5 2.356 ± 0.073 ops/us GroupVInt.readVInt 16 thrpt5 21.437 ± 1.102 ops/us GroupVInt.readVInt 32 thrpt5 10.482 ± 3.620 ops/us GroupVInt.readVInt 64 thrpt5 5.966 ± 0.707 ops/us GroupVInt.readVInt 128 thrpt5 2.750 ± 1.668 ops/us GroupVInt.readVInt 248 thrpt5 1.606 ± 0.042 ops/us ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]
dungba88 commented on code in PR #12778: URL: https://github.com/apache/lucene/pull/12778#discussion_r1386637002 ## lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java: ## @@ -289,21 +273,34 @@ public long getNodeAddress(long hashSlot) { } /** - * Set the node address and bytes from the provided hash slot + * Set the node address from the provided hash slot * * @param hashSlot the hash slot to write to * @param nodeAddress the node address - * @param bytes the node bytes to be copied */ -public void setNode(long hashSlot, long nodeAddress, byte[] bytes) { +public void setNodeAddress(long hashSlot, long nodeAddress) { assert fstNodeAddress.get(hashSlot) == 0; fstNodeAddress.set(hashSlot, nodeAddress); count++; +} +private void setBytes(long hashSlot, byte[] bytes) { Review Comment: Thank you, I have renamed these 2 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Re-adding the backward_codecs.lucene90 TestPForUtil + TestForUtil [lucene]
mikemccand merged PR #12781: URL: https://github.com/apache/lucene/pull/12781 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] HnwsGraph creates disconnected components [lucene]
nitirajrathore commented on issue #12627: URL: https://github.com/apache/lucene/issues/12627#issuecomment-1801982741 I was able to conduct some perf-test as well. I would like to propose following changes With example of adding node `a` to 'b, c, d' | node | neighbours| |---|---| | b | c, d, e| | c | b, d| | e | e, x, y, z | ### Proposed heuristic 1. At the time of adding a node always add all (upto maxConn) edges to highest scoring neighbours (disregarding the diversity check) eg. adding edges a-->b, a-->c etc. Currently we check for diverse connections at this time also. 2. At the time of adding reverse edges from neighbours to the node, if the given neighbour has more than maxConn number of connections then use new diversity check formula to calculate the edge that can be removed. For example at the time of adding edge b --> a if we see that `b` has more than maxConn connections then apply new hiuristic and try to remove one edge. 3. Check for diversity only between the common neighbours of candidate node and the node. Example, if `b` has more than maxConn connections then we iterate over b's neighbours (which now includes 'a' as well) considering each one as a possible candidate for disconnection. When deciding if we should disconnect an edge say b-->c or not. We will consider the only the common neighbours of c and b for diversity check, so we will compare the score(c,b) with only score(c,d) and not with score(c,e) as only `d` is the common neighbour of c and b. If we find a non diverse node we return that, otherwise 4. we return a node which has highest number of common neighbours with `b`, otherwise 5. we return a node with maximum number of edges (eg. `e` in above case). 6. If none of the above condition exists, then we return -1, meaning we should not remove any connection. 7. If we receive a valid node for removal, we remove that and the other half of the connection i.e. we remove both bi-directional connections. so if `c` is returned as non-diverse node, then we remove both b---> c and c ---> b connections. Performance number: |Branch| recall | avgCpuTime | numDocs | reindexTimeMsec | fanout | maxConn | beamWidth | totalVisited | selectivity | prefilter | |---| -- | -- | --- | -- | --- | - | | --- | --- | --- | |Baseline | 0.451 | 17.78 | 100 | 370916 | 0 | 16 | 100 | 10 | 1.00| post-filter | |Candidate| 0.599 (33% increase) | 21.63 (22% increase) | 100 | 624797 (68% increase) | 0 | 16 | 100 | 10 | 1.00| post-filter | Since now we are increasing the number of connections per node, we also see a good increase in the recall for same max-conn. For the same reason we do expect the avgCpuTime to increase at query time. I can work a little bit on improving the indexing time as the my current implementation is very naive and finds common nodes by two for loops and some other places recalculates the scores where not needed. But overall the indexing time will increase for sure. Another data that I will post next is the number of (histogram) number of edges each node has. Expect it to increase from lower number to max-conn on average. I will upload my (rough) draft PR now. How I reached this heuristic: Before coming up with this heuristic I tried quite some simpler ones, but in those cases I was either getting much much higher disconnections than the mainline or if very less and even just 1 disconnection, but some disconnection was happening. I am open for removing or adding any other condition which is considered for edge removal. Initially, I tried with just (1) and left the rest of the algorithm same as in mainline. This "increased" the disconnections by large factor instead of decreasing it. The issue was that say 0 to 32 nodes are added with star kind of formation, now if 33rd node comes in, it will form connections with 1-32 nodes and each may find 0 as the non diverse node, so each one removes 0 leaving it disconnected from the graph. Other than this extreme case there were others too. That is where (3) came in handy where we don't check diverseness of an edge unless we are sure that there is an alternate path to the disconnecting node via a neighbouring node. This coupled with (7) reduced the disconnectedness from few thousand nodes to just couple of nodes at level-1 and only occassional disconnected node at level-0. Then to achieve no disconnection at all, I initially tried (1, 2, 3) with (6) to remove even slightest chance of creating disconnectedness, but with this we end up not removing any connection a large number of times and the number of connections per node increased from max-conn of 32 at level-0 to upwards of 250 for more than 5% of nod
Re: [I] HnwsGraph creates disconnected components [lucene]
benwtrent commented on issue #12627: URL: https://github.com/apache/lucene/issues/12627#issuecomment-1802098274 `68% increase` in index time is untenable, I would be a hard no on a change that slows things down this much. Maybe we can find something better. @nitirajrathore i know https://github.com/nmslib/nmslib/blob/master/similarity_search/src/method/hnsw.cc has various heuristics, have you explored those to see how they can be modified or added here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]
mikemccand merged PR #12778: URL: https://github.com/apache/lucene/pull/12778 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]
mikemccand commented on PR #12778: URL: https://github.com/apache/lucene/pull/12778#issuecomment-1802213131 Hmm running `./gradlew check` after pulling this change into my dev box I hit: ``` org.apache.lucene.util.fst.TestFSTs > test suite's output saved to /s1/l/trunk/lucene/core/build/test-results/test/outputs/OUTPUT-org.apache.lucene.util.fst.TestFSTs.txt, copied below: > java.lang.AssertionError > at __randomizedtesting.SeedInfo.seed([CD812A59C5A455F0:D3A718D42274A0C2]:0) > at org.apache.lucene.util.fst.FST$Arc$BitTable.assertIsValid(FST.java:376) > at org.apache.lucene.util.fst.FST.readNextRealArc(FST.java:907) > at org.apache.lucene.util.fst.FST.readFirstRealTargetArc(FST.java:763) > at org.apache.lucene.util.fst.NodeHash$PagedGrowableHash.nodesEqual(NodeHash.java:387) > at org.apache.lucene.util.fst.NodeHash.getFallback(NodeHash.java:94) > at org.apache.lucene.util.fst.NodeHash.add(NodeHash.java:121) > at org.apache.lucene.util.fst.FSTCompiler.compileNode(FSTCompiler.java:295) > at org.apache.lucene.util.fst.FSTCompiler.freezeTail(FSTCompiler.java:703) > at org.apache.lucene.util.fst.FSTCompiler.add(FSTCompiler.java:777) > at org.apache.lucene.util.fst.TestFSTs.testRealTerms(TestFSTs.java:404) > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) > at java.base/java.lang.reflect.Method.invoke(Method.java:580) > at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758) > at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946) > at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982) > at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996) > at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48) > at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43) > at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45) > at org.apache.lucene.tests.util.TestRuleIgnoreAfterM
Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]
mikemccand commented on PR #12778: URL: https://github.com/apache/lucene/pull/12778#issuecomment-1802214263 And it does reproduce for me. I'll revert this change for now! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]
mikemccand commented on PR #12778: URL: https://github.com/apache/lucene/pull/12778#issuecomment-1802219722 (And does not reproduce once I revert). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Normalize written scalar quantized vectors when using cosine similarity [lucene]
kevindrosendahl commented on code in PR #12780: URL: https://github.com/apache/lucene/pull/12780#discussion_r1386980331 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java: ## @@ -143,6 +143,13 @@ private void writeQuantizedVectors(QuantizationFieldVectorWriter fieldData) thro byte[] vector = new byte[fieldData.dim]; final ByteBuffer offsetBuffer = ByteBuffer.allocate(Float.BYTES).order(ByteOrder.LITTLE_ENDIAN); for (float[] v : fieldData.floatVectors) { + if (fieldData.normalize) { +float[] copy = new float[v.length]; Review Comment: Good catch, changed in https://github.com/apache/lucene/pull/12780/commits/1d9f68be208ad2b111640d370bee8f410203099f -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Normalize written scalar quantized vectors when using cosine similarity [lucene]
kevindrosendahl commented on code in PR #12780: URL: https://github.com/apache/lucene/pull/12780#discussion_r1386980454 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java: ## @@ -196,6 +203,13 @@ void writeSortedQuantizedVectors(QuantizationFieldVectorWriter fieldData, int[] final ByteBuffer offsetBuffer = ByteBuffer.allocate(Float.BYTES).order(ByteOrder.LITTLE_ENDIAN); for (int ordinal : ordMap) { float[] v = fieldData.floatVectors.get(ordinal); + if (fieldData.normalize) { +float[] copy = new float[v.length]; Review Comment: Also changed in https://github.com/apache/lucene/pull/12780/commits/1d9f68be208ad2b111640d370bee8f410203099f -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Normalize written scalar quantized vectors when using cosine similarity [lucene]
kevindrosendahl commented on code in PR #12780: URL: https://github.com/apache/lucene/pull/12780#discussion_r1386992189 ## lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java: ## @@ -81,7 +81,8 @@ public void init() { protected void addRandomFields(Document doc) { switch (vectorEncoding) { case BYTE -> doc.add(new KnnByteVectorField("v2", randomVector8(30), similarityFunction)); - case FLOAT32 -> doc.add(new KnnFloatVectorField("v2", randomVector(30), similarityFunction)); + case FLOAT32 -> doc.add( + new KnnFloatVectorField("v2", randomNormalizedVector(30), similarityFunction)); Review Comment: The issue I was having was that I couldn't reproduce the issue in `testQuantizedVectorsWriteAndRead()` because `BaseKnnVectorsFormatTestCase::randomVector()` [always normalizes the vectors it creates](https://github.com/apache/lucene/blob/20d5de448a739beb85e380d153fb13bf1817a8d6/lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java#L1236). That means that the persisted vectors matched the expected vectors (even after normalizing the expected vectors when using cosine). To isolate the effects of the test changes from normalized to potentially non-normalized vectors to just this test, I kept the existing behavior (i.e. existing tests use normalized vectors) by renaming the existing method from `randomVector()` to `randomNormalizedVector()`, then adding a new method `randomVector()` which does not normalize the vectors. This commit https://github.com/apache/lucene/pull/12780/commits/7cb2375e850f963630788028af34e3275fc5af09 has those changes. Not sure if that really answers your question, I didn't try changing any other tests to use non-normalized vectors. I'm happy to structure all of these tests however makes sense to you. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Normalize written scalar quantized vectors when using cosine similarity [lucene]
kevindrosendahl commented on code in PR #12780: URL: https://github.com/apache/lucene/pull/12780#discussion_r1386992189 ## lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java: ## @@ -81,7 +81,8 @@ public void init() { protected void addRandomFields(Document doc) { switch (vectorEncoding) { case BYTE -> doc.add(new KnnByteVectorField("v2", randomVector8(30), similarityFunction)); - case FLOAT32 -> doc.add(new KnnFloatVectorField("v2", randomVector(30), similarityFunction)); + case FLOAT32 -> doc.add( + new KnnFloatVectorField("v2", randomNormalizedVector(30), similarityFunction)); Review Comment: The issue I was having was that I couldn't reproduce the issue in `testQuantizedVectorsWriteAndRead()` because `BaseKnnVectorsFormatTestCase::randomVector()` [always normalizes the vectors it creates](https://github.com/apache/lucene/blob/20d5de448a739beb85e380d153fb13bf1817a8d6/lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java#L1236). That means that the persisted vectors matched the expected vectors even though the vectors were not explicitly normalized by the `Writer` (even after normalizing the expected vectors when using cosine). To isolate the effects of the test changes from normalized to potentially non-normalized vectors to just this test, I kept the existing behavior (i.e. existing tests use normalized vectors) by renaming the existing method from `randomVector()` to `randomNormalizedVector()`, then adding a new method `randomVector()` which does not normalize the vectors. This commit https://github.com/apache/lucene/pull/12780/commits/7cb2375e850f963630788028af34e3275fc5af09 has those changes. Not sure if that really answers your question, I didn't try changing any other tests to use non-normalized vectors. I'm happy to structure all of these tests however makes sense to you. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Normalize written scalar quantized vectors when using cosine similarity [lucene]
benwtrent commented on code in PR #12780: URL: https://github.com/apache/lucene/pull/12780#discussion_r1386996018 ## lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java: ## @@ -81,7 +81,8 @@ public void init() { protected void addRandomFields(Document doc) { switch (vectorEncoding) { case BYTE -> doc.add(new KnnByteVectorField("v2", randomVector8(30), similarityFunction)); - case FLOAT32 -> doc.add(new KnnFloatVectorField("v2", randomVector(30), similarityFunction)); + case FLOAT32 -> doc.add( + new KnnFloatVectorField("v2", randomNormalizedVector(30), similarityFunction)); Review Comment: @kevindrosendahl I see now! thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Normalize written scalar quantized vectors when using cosine similarity [lucene]
benwtrent commented on code in PR #12780: URL: https://github.com/apache/lucene/pull/12780#discussion_r1386996238 ## lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java: ## @@ -81,7 +81,8 @@ public void init() { protected void addRandomFields(Document doc) { switch (vectorEncoding) { case BYTE -> doc.add(new KnnByteVectorField("v2", randomVector8(30), similarityFunction)); - case FLOAT32 -> doc.add(new KnnFloatVectorField("v2", randomVector(30), similarityFunction)); + case FLOAT32 -> doc.add( + new KnnFloatVectorField("v2", randomNormalizedVector(30), similarityFunction)); Review Comment: I will merge and backport once this is green. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Updated heuristic to remove non diverse edges keeping overall graph c… [lucene]
nitirajrathore opened a new pull request, #12783: URL: https://github.com/apache/lucene/pull/12783 …onnected. No test cases, unoptimized, draft only version. ### Description Details in this comment : https://github.com/apache/lucene/issues/12627#issuecomment-1801982741 I will first update the code for performance, then add / update test cases and then create a PR for review. This is just a draft code. Have some ideas commented inline for improving indexing performance. Also, need to check if this change works well with the parallel graph merge/construction that got added recently(?) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] HnwsGraph creates disconnected components [lucene]
nitirajrathore commented on issue #12627: URL: https://github.com/apache/lucene/issues/12627#issuecomment-1802424885 @benwtrent : I have added draft PR. The code is not at all optimized right now for performance and I am hoping to fix some obvious stuff and will post perf results here. > have you explored those to see how they can be modified or added here? No, I haven't checked those. Thanks for pointing it out. I will check and revert. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Speed up BytesRefHash#sort (another approach) [lucene]
gf2121 opened a new pull request, #12784: URL: https://github.com/apache/lucene/pull/12784 Following https://github.com/apache/lucene/pull/12775, this PR tries another approach to speed up `BytesRefHash#sort`: The idea is that since we have extra ints in this map, we can cache the bucket when building the histograms, and reuse them when `reorder`. I checked this approach on intel chip, showing ~30% speed up. I'll check M2 chip and wikimedium data tomorrow. ``` BASELINE: sort 5169965 terms, build histogram took: 1968ms, reorder took: 2132ms, total took: 5470ms. BASELINE: sort 5169965 terms, build histogram took: 1975ms, reorder took: 2133ms, total took: 5526ms. BASELINE: sort 5169965 terms, build histogram took: 1999ms, reorder took: 2157ms, total took: 5573ms. BASELINE: sort 5169965 terms, build histogram took: 1955ms, reorder took: 2138ms, total took: 5446ms. BASELINE: sort 5169965 terms, build histogram took: 1990ms, reorder took: 2161ms, total took: 5528ms. BASELINE: sort 5169965 terms, build histogram took: 1997ms, reorder took: 2175ms, total took: 5571ms. BASELINE: sort 5169965 terms, build histogram took: 2004ms, reorder took: 2119ms, total took: 5477ms. BASELINE: sort 5169965 terms, build histogram took: 1978ms, reorder took: 2155ms, total took: 5501ms. BASELINE: sort 5169965 terms, build histogram took: 2015ms, reorder took: 2169ms, total took: 5572ms. BASELINE: sort 5169965 terms, build histogram took: 1941ms, reorder took: 2138ms, total took: 5400ms. BASELINE: sort 5169965 terms, build histogram took: 2000ms, reorder took: 2155ms, total took: 5558ms. CANDIDATE: sort 5169965 terms, build histogram took: 1996ms, reorder took: 133ms, total took: 3734ms. CANDIDATE: sort 5169965 terms, build histogram took: 1989ms, reorder took: 142ms, total took: 3655ms. CANDIDATE: sort 5169965 terms, build histogram took: 2031ms, reorder took: 155ms, total took: 3762ms. CANDIDATE: sort 5169965 terms, build histogram took: 2016ms, reorder took: 145ms, total took: 3739ms. CANDIDATE: sort 5169965 terms, build histogram took: 1994ms, reorder took: 142ms, total took: 3667ms. CANDIDATE: sort 5169965 terms, build histogram took: 2010ms, reorder took: 140ms, total took: 3651ms. CANDIDATE: sort 5169965 terms, build histogram took: 2021ms, reorder took: 154ms, total took: 3731ms. CANDIDATE: sort 5169965 terms, build histogram took: 2019ms, reorder took: 144ms, total took: 3727ms. CANDIDATE: sort 5169965 terms, build histogram took: 2064ms, reorder took: 138ms, total took: 3784ms. CANDIDATE: sort 5169965 terms, build histogram took: 2043ms, reorder took: 142ms, total took: 3727ms. CANDIDATE: sort 5169965 terms, build histogram took: 1964ms, reorder took: 140ms, total took: 3630ms. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Speed up BytesRefHash#sort [lucene]
gf2121 commented on PR #12775: URL: https://github.com/apache/lucene/pull/12775#issuecomment-1802480151 I came up with https://github.com/apache/lucene/pull/12784 as another idea to speed up `BytesRefHash#sort`, which has been shown to have performance improvements running on Intel chips. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Normalize written scalar quantized vectors when using cosine similarity [lucene]
benwtrent merged PR #12780: URL: https://github.com/apache/lucene/pull/12780 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Clean up ordinal map in default SSDV reader state [lucene]
gsmiller commented on PR #12454: URL: https://github.com/apache/lucene/pull/12454#issuecomment-1802519348 > @gsmiller, I think this PR is ready. Is there anything else you'd like to see changed? Gah! I'm sorry I missed this. I'll have a look here shortly. Apologies again. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]
benwtrent commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1802705393 @kevindrosendahl if I am reading the code correctly, it does the following: - Write int8 quantized vectors along side the vector ordinals in the graph (`.vex` or whatever has a copy of each vector). - Continue to write vectors in `.vec`. I am guessing this is a stop-gap and you are thinking of removing this? Maybe not? I have a concern around index sorting. How does building the graph & subsequent `getFloatVectors()` play with index sorting? Usually when folks sort an index, they expect to be able to iterate values in that given order. Is it horrifically slow to iterate `.vex` in this scenario? What do we think about always keeping `.vec` around? Probably should for re-ranking purposes once more extreme quantization measures are used. One more question, have you tested your implementation in the situation where `.vex` cannot all be paged into memory and it faired ok? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]
dungba88 commented on PR #12778: URL: https://github.com/apache/lucene/pull/12778#issuecomment-1802753207 Thank you for reproducing this! I found the bug, it's quite silly. The node address is the last address, so I should have do this ``` copiedNodes.append(fallbackTable.copiedNodes, fallbackAddress - nodeLength + 1, nodeLength); ``` instead of ``` copiedNodes.append(fallbackTable.copiedNodes, fallbackAddress, nodeLength); ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Redo #12707: Do not rely on isAlive() status of MemorySegment#Scope [lucene]
uschindler opened a new pull request, #12785: URL: https://github.com/apache/lucene/pull/12785 Unfortunately the solution in #12707 was not working well with concurrency. The is alive status of `MemorySegment.Scope` may be stale. In that case the `IllegalStateException` was catched, but the confirmation using `isAlive()` did not work, so the original `IllegalStateException` was rethrown instead of the required `AlreadyClosedException`. This PR reverts #12707 and falls back to the naive approach by checking exception message for "closed". I am in contact with @mcimadamore to find a better solution to confirm the closed status without relying on Exception message. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] MMapDirectory with MemorySegment: Confirm that scope/session is no longer alive before throwing AlreadyClosedException [lucene]
uschindler commented on PR #12707: URL: https://github.com/apache/lucene/pull/12707#issuecomment-1802765955 This code did not work well as the `isAlive` status may be stale in other threads. I reworked this one here: #12785 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]
dungba88 opened a new pull request, #12786: URL: https://github.com/apache/lucene/pull/12786 ### Description See the previous PR: https://github.com/apache/lucene/pull/12778 There was a bug in the PR, the copiedNodeAddress is the last address (inclusively) of the node, thus the offset we want to copy from should be `fallbackAddress - nodeLength + 1` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Redo #12707: Do not rely on isAlive() status of MemorySegment#Scope [lucene]
uschindler commented on PR #12785: URL: https://github.com/apache/lucene/pull/12785#issuecomment-1802841410 The bug in JDK is here: https://bugs.openjdk.org/browse/JDK-8319756 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]
dungba88 commented on PR #12786: URL: https://github.com/apache/lucene/pull/12786#issuecomment-1802854503 > Could not copy file '/home/runner/work/lucene/lucene/lucene/JRE_VERSION_MIGRATION.md' to '/home/runner/work/lucene/lucene/lucene/documentation/build/site/JRE_VERSION_MIGRATION.html'. This seems to be some intermittent error -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]
dungba88 commented on code in PR #12786: URL: https://github.com/apache/lucene/pull/12786#discussion_r1387298308 ## lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java: ## @@ -289,21 +273,38 @@ public long getNodeAddress(long hashSlot) { } /** - * Set the node address and bytes from the provided hash slot + * Set the node address from the provided hash slot * * @param hashSlot the hash slot to write to * @param nodeAddress the node address - * @param bytes the node bytes to be copied */ -public void setNode(long hashSlot, long nodeAddress, byte[] bytes) { +public void setNodeAddress(long hashSlot, long nodeAddress) { assert fstNodeAddress.get(hashSlot) == 0; fstNodeAddress.set(hashSlot, nodeAddress); count++; +} +/** copy the node bytes from the FST */ +private void copyNodeBytes(long hashSlot, byte[] bytes) { + assert copiedNodeAddress.get(hashSlot) == 0; copiedNodes.append(bytes); // write the offset, which points to the last byte of the node we copied since we later read // this node in reverse + copiedNodeAddress.set(hashSlot, copiedNodes.getPosition() - 1); +} + +/** promote the node bytes from the fallback table */ +private void copyFallbackNodeBytes( Review Comment: I think we can have a test for this class to be safe, but that means we need to expose it (package private maybe) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Clean up ordinal map in default SSDV reader state [lucene]
gsmiller merged PR #12454: URL: https://github.com/apache/lucene/pull/12454 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]
zacharymorn commented on code in PR #240: URL: https://github.com/apache/lucene/pull/240#discussion_r1387436327 ## lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/SearchWithCollectorTask.java: ## @@ -45,20 +43,6 @@ public boolean withCollector() { return true; } - @Override - protected Collector createCollector() throws Exception { Review Comment: Hmm after some more thought about this, I feel maybe I should revert this change here for now given the task tests not just the `TopScoreDocCollector`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]
zacharymorn commented on code in PR #240: URL: https://github.com/apache/lucene/pull/240#discussion_r1387438590 ## lucene/core/src/java/org/apache/lucene/search/TopDocs.java: ## @@ -232,8 +232,8 @@ public static TopDocs merge( /** * Returns a new TopFieldDocs, containing topN results across the provided TopFieldDocs, sorting * by the specified {@link Sort}. Each of the TopDocs must have been sorted by the same Sort, and - * sort field values must have been filled (ie, fillFields=true must be passed to - * {@link TopFieldCollector#create}). + * sort field values must have been filled (ie, fillFields=true must be passed to the + * constructor of {@link TopFieldCollectorManager}). Review Comment: Good catch! This most likely was an updated doc as `TopFieldCollector#create` also did not take boolean field. Removed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]
zacharymorn commented on code in PR #240: URL: https://github.com/apache/lucene/pull/240#discussion_r1387443316 ## lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java: ## @@ -174,7 +173,7 @@ private static boolean canEarlyTerminateOnPrefix(Sort searchSort, Sort indexSort * Implements a TopFieldCollector over one SortField criteria, with tracking Review Comment: I feel extending or instantiating collectors should still be allowed, as long as they are done inside collector managers, and this would become enforceable once we fully deprecate `IndexSearch#search(Query, Collector)`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]
zacharymorn commented on code in PR #240: URL: https://github.com/apache/lucene/pull/240#discussion_r1387449164 ## lucene/core/src/java/org/apache/lucene/search/TopFieldCollectorManager.java: ## @@ -0,0 +1,198 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Collection; +import java.util.List; + +/** + * Create a TopFieldCollectorManager which uses a shared hit counter to maintain number of hits and + * a shared {@link MaxScoreAccumulator} to propagate the minimum score across segments if the + * primary sort is by relevancy. + * + * Note that a new collectorManager should be created for each search due to its internal states. + */ +public class TopFieldCollectorManager implements CollectorManager { + private final Sort sort; + private final int numHits; + private final FieldDoc after; + private final HitsThresholdChecker hitsThresholdChecker; + private final MaxScoreAccumulator minScoreAcc; + private final List collectors; + private final boolean supportsConcurrency; + private boolean collectorCreated; + + /** + * Creates a new {@link TopFieldCollectorManager} from the given arguments. + * + * NOTE: The instances returned by this method pre-allocate a full array of length + * numHits. + * + * @param sort the sort criteria (SortFields). + * @param numHits the number of results to collect. + * @param after the previous doc after which matching docs will be collected. + * @param totalHitsThreshold the number of docs to count accurately. If the query matches more + * than {@code totalHitsThreshold} hits then its hit count will be a lower bound. On the other + * hand if the query matches less than or exactly {@code totalHitsThreshold} hits then the hit + * count of the result will be accurate. {@link Integer#MAX_VALUE} may be used to make the hit + * count accurate, but this will also make query processing slower. + * @param supportsConcurrency to use thread-safe and slower internal states for count tracking. + */ + public TopFieldCollectorManager( + Sort sort, int numHits, FieldDoc after, int totalHitsThreshold, boolean supportsConcurrency) { Review Comment: Hmm that's a good question. I'll try to setup some benchmark later on to see if I can measure this. Technically speaking the current expected usage would be `supportsConcurrency = search.getSlices().length > 1` (I didn't add doc for this though as I'm not sure if it makes sense to tie the parameter to this condition this early on), so setting `supportsConcurrency` to `true` while using single thread / slice seems to be a misuse (that we should maybe warn about)? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]
zacharymorn commented on code in PR #240: URL: https://github.com/apache/lucene/pull/240#discussion_r1387450681 ## lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java: ## @@ -429,106 +432,29 @@ public static TopFieldCollector create(Sort sort, int numHits, int totalHitsThre * field is indexed both with doc values and points. In this case, there is an assumption that * the same data is stored in these points and doc values. * @return a {@link TopFieldCollector} instance which will sort the results by the sort criteria. + * @deprecated This method is being deprecated in favor of using the constructor of {@link Review Comment: Done. ## lucene/core/src/java/org/apache/lucene/search/TopFieldCollectorManager.java: ## @@ -0,0 +1,198 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Collection; +import java.util.List; + +/** + * Create a TopFieldCollectorManager which uses a shared hit counter to maintain number of hits and + * a shared {@link MaxScoreAccumulator} to propagate the minimum score across segments if the + * primary sort is by relevancy. + * + * Note that a new collectorManager should be created for each search due to its internal states. + */ +public class TopFieldCollectorManager implements CollectorManager { + private final Sort sort; + private final int numHits; + private final FieldDoc after; + private final HitsThresholdChecker hitsThresholdChecker; + private final MaxScoreAccumulator minScoreAcc; + private final List collectors; + private final boolean supportsConcurrency; + private boolean collectorCreated; + + /** + * Creates a new {@link TopFieldCollectorManager} from the given arguments. + * + * NOTE: The instances returned by this method pre-allocate a full array of length + * numHits. + * + * @param sort the sort criteria (SortFields). + * @param numHits the number of results to collect. + * @param after the previous doc after which matching docs will be collected. + * @param totalHitsThreshold the number of docs to count accurately. If the query matches more + * than {@code totalHitsThreshold} hits then its hit count will be a lower bound. On the other + * hand if the query matches less than or exactly {@code totalHitsThreshold} hits then the hit + * count of the result will be accurate. {@link Integer#MAX_VALUE} may be used to make the hit + * count accurate, but this will also make query processing slower. + * @param supportsConcurrency to use thread-safe and slower internal states for count tracking. + */ + public TopFieldCollectorManager( + Sort sort, int numHits, FieldDoc after, int totalHitsThreshold, boolean supportsConcurrency) { +if (totalHitsThreshold < 0) { + throw new IllegalArgumentException( + "totalHitsThreshold must be >= 0, got " + totalHitsThreshold); +} + +if (numHits <= 0) { + throw new IllegalArgumentException( + "numHits must be > 0; please use TotalHitCountCollector if you just need the total hit count"); +} + +if (sort.getSort().length == 0) { + throw new IllegalArgumentException("Sort must contain at least one field"); +} + +if (after != null) { + if (after.fields == null) { +throw new IllegalArgumentException( +"after.fields wasn't set; you must pass fillFields=true for the previous search"); + } + + if (after.fields.length != sort.getSort().length) { +throw new IllegalArgumentException( +"after.fields has " ++ after.fields.length ++ " values but sort has " ++ sort.getSort().length); + } +} + +this.sort = sort; +this.numHits = numHits; +this.after = after; +this.supportsConcurrency = supportsConcurrency; +this.hitsThresholdChecker = +supportsConcurrency +? HitsThresholdChecker.createShared(Math.max(totalHitsThreshold, numHits)) +: HitsThresholdChecker.create(Math.max(totalHitsThreshold, numHits)); +this.minScor
Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]
zacharymorn commented on code in PR #240: URL: https://github.com/apache/lucene/pull/240#discussion_r1387453179 ## lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java: ## @@ -44,7 +43,7 @@ public void setScorer(Scorable scorer) throws IOException { } Review Comment: See https://github.com/apache/lucene/pull/240#discussion_r1387443316 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]
zacharymorn commented on code in PR #240: URL: https://github.com/apache/lucene/pull/240#discussion_r1387453543 ## lucene/core/src/test/org/apache/lucene/document/BaseSpatialTestCase.java: ## @@ -695,8 +695,8 @@ protected void verifyRandomDistanceQueries(IndexReader reader, Object... shapes) } } - private FixedBitSet searchIndex(IndexSearcher s, Query query, int maxDoc) throws IOException { -return s.search(query, FixedBitSetCollector.createManager(maxDoc)); + private CollectorManager searchIndex(int maxDoc) { Review Comment: Ah this change should have been reverted. Fixed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]
zacharymorn commented on PR #240: URL: https://github.com/apache/lucene/pull/240#issuecomment-1803126121 > > We are still a ways away (from seeing Lucene fully utilize available hardware concurrency available at search time to reduce query latencies) > > For example: query concurrency is still tied to segment geometry, which is insane. An "optimized" (`forceMerge to 1 segment`) index loses all of its concurrency! Hideously, at my team (Amazon product search) we had to create a neutered merge policy that intentionally avoids "accidentally" making a lopsided index (where too many docs are in a single segment) because it harms our long pole query latencies (we run all queries concurrently, except close to red-line), due to this silly Lucene limitation. > > We have a longstanding issue open for this: #9721, and it ought not be so difficult to fix because Lucene already has strong support for one thread to search a "slice" of a big segment, i.e. we can make the thread work units slices of segments instead of whole segments or collections of segments (what IndexSearcher now calls slices, confusingly). Thanks @mikemccand for sharing the additional insight and context! Looks like there's already a lot of good discussions in that ticket as well. Will definitely take a look at that and see what I can contribute, once most of the work for deprecating `IndexSearch#search(Query, Collector)` has been completed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Speed up BytesRefHash#sort (another approach) [lucene]
gf2121 commented on PR #12784: URL: https://github.com/apache/lucene/pull/12784#issuecomment-1803144990 Even faster than the original approach on M2: ``` BASELINE: sort 5169965 terms, build histogram took: 489ms, reorder took: 1359ms, total took: 2381ms. BASELINE: sort 5169965 terms, build histogram took: 449ms, reorder took: 1290ms, total took: 2249ms. BASELINE: sort 5169965 terms, build histogram took: 458ms, reorder took: 1279ms, total took: 2238ms. BASELINE: sort 5169965 terms, build histogram took: 462ms, reorder took: 1302ms, total took: 2260ms. BASELINE: sort 5169965 terms, build histogram took: 455ms, reorder took: 1282ms, total took: 2239ms. BASELINE: sort 5169965 terms, build histogram took: 449ms, reorder took: 1327ms, total took: 2282ms. BASELINE: sort 5169965 terms, build histogram took: 474ms, reorder took: 1320ms, total took: 2303ms. BASELINE: sort 5169965 terms, build histogram took: 448ms, reorder took: 1260ms, total took: 2227ms. BASELINE: sort 5169965 terms, build histogram took: 448ms, reorder took: 1307ms, total took: 2258ms. BASELINE: sort 5169965 terms, build histogram took: 464ms, reorder took: 1313ms, total took: 2280ms. BASELINE: sort 5169965 terms, build histogram took: 444ms, reorder took: 1294ms, total took: 2261ms. CANDIDATE: sort 5169965 terms, build histogram took: 549ms, reorder took: 84ms, total took: 1287ms. CANDIDATE: sort 5169965 terms, build histogram took: 535ms, reorder took: 83ms, total took: 1205ms. CANDIDATE: sort 5169965 terms, build histogram took: 531ms, reorder took: 65ms, total took: 1197ms. CANDIDATE: sort 5169965 terms, build histogram took: 552ms, reorder took: 70ms, total took: 1217ms. CANDIDATE: sort 5169965 terms, build histogram took: 576ms, reorder took: 73ms, total took: 1255ms. CANDIDATE: sort 5169965 terms, build histogram took: 551ms, reorder took: 66ms, total took: 1207ms. CANDIDATE: sort 5169965 terms, build histogram took: 551ms, reorder took: 68ms, total took: 1208ms. CANDIDATE: sort 5169965 terms, build histogram took: 553ms, reorder took: 85ms, total took: 1240ms. CANDIDATE: sort 5169965 terms, build histogram took: 562ms, reorder took: 66ms, total took: 1235ms. CANDIDATE: sort 5169965 terms, build histogram took: 557ms, reorder took: 76ms, total took: 1233ms. CANDIDATE: sort 5169965 terms, build histogram took: 541ms, reorder took: 66ms, total took: 1204ms. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Skip docs with Docvalues in NumericLeafComparator [lucene]
LuXugang merged PR #12405: URL: https://github.com/apache/lucene/pull/12405 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Skip docs with Docvalues in NumericLeafComparator [lucene]
LuXugang closed issue #12401: Skip docs with Docvalues in NumericLeafComparator URL: https://github.com/apache/lucene/issues/12401 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Multiple ClassNotFoundExceptions in IntelliJ Fat Jar on ARM64 Java 20 [lucene]
davido commented on issue #12307: URL: https://github.com/apache/lucene/issues/12307#issuecomment-1803192152 @uschindler We are using [Bazel](https://bazel.build) build system, and merging the two JARs like this: # Merge jars so # META-INF/services/org.apache.lucene.codecs.Codec # contains the union of both Codec collections. java_binary( name = "lucene-core-and-backward-codecs-merged", data = ["//lib:LICENSE-Apache2.0"], main_class = "NotImportant", runtime_deps = [ # in case of conflict, we want the implementation of backwards-codecs # first. "@backward-codecs//jar", "@lucene-core//jar", ], ) java_import( name = "lucene-core-and-backward-codecs", jars = [ ":lucene-core-and-backward-codecs-merged_deploy.jar", ], ) And it works with JDk11 and JDK17 as expected, but is failing on JDK21: 7) offlineReindexForAllAvailableIndicesIsPossibleInSlaveMode(com.google.gerrit.acceptance.pgm.LuceneReindexIT) java.lang.NoClassDefFoundError: Could not initialize class org.apache.lucene.store.MMapDirectory at org.apache.lucene.store.FSDirectory.open(FSDirectory.java:161) at org.apache.lucene.store.FSDirectory.open(FSDirectory.java:156) at com.google.gerrit.lucene.LuceneAccountIndex.dir(LuceneAccountIndex.java:91) at com.google.gerrit.lucene.LuceneAccountIndex.(LuceneAccountIndex.java:105) at com.google.gerrit.lucene.LuceneAccountIndex$$FastClassByGuice$$416172e5.GUICE$TRAMPOLINE() at com.google.gerrit.lucene.LuceneAccountIndex$$FastClassByGuice$$416172e5.apply() at com.google.inject.internal.DefaultConstructionProxyFactory$FastClassProxy.newInstance(DefaultConstructionProxyFactory.java:82) at com.google.inject.internal.ConstructorInjector.provision(ConstructorInjector.java:114) at com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:91) at com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:300) at com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1148) at com.google.inject.assistedinject.FactoryProvider2.invoke(FactoryProvider2.java:907) at jdk.proxy2/jdk.proxy2.$Proxy59.create(Unknown Source) at com.google.gerrit.server.index.SingleVersionModule$SingleVersionListener.start(SingleVersionModule.java:98) at com.google.gerrit.server.index.SingleVersionModule$SingleVersionListener.start(SingleVersionModule.java:79) at com.google.gerrit.pgm.init.index.IndexManagerOnInit.start(IndexManagerOnInit.java:40) at com.google.gerrit.pgm.init.BaseInit.run(BaseInit.java:121) at com.google.gerrit.pgm.util.AbstractProgram.main(AbstractProgram.java:62) at com.google.gerrit.acceptance.GerritServer.init(GerritServer.java:344) at com.google.gerrit.acceptance.StandaloneSiteTest.beforeTest(StandaloneSiteTest.java:140) at com.google.gerrit.acceptance.StandaloneSiteTest$1.evaluate(StandaloneSiteTest.java:117) at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) at org.junit.rules.RunRules.evaluate(RunRules.java:20) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at org.junit.runners.ParentRunner.run(ParentRunner.java:363) at org.junit.runners.Suite.runChild(Suite.java:128) at org.junit.runners.Suite.runChild(Suite.java:27) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at org.junit.runners.ParentRunner.run(ParentRunner.java:363) at org.junit.runners.Suite.runChild(Suite.java:128) at org.junit.runners.Suite.runChild(Suite.java:27) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at org
Re: [PR] Enable executing using NFA in RegexpQuery [lucene]
zhaih commented on code in PR #12767: URL: https://github.com/apache/lucene/pull/12767#discussion_r1387529653 ## lucene/core/src/test/org/apache/lucene/search/TestRegexpQuery.java: ## @@ -80,7 +80,10 @@ private long caseInsensitiveRegexQueryNrHits(String regex) throws IOException { newTerm(regex), RegExp.ALL, RegExp.ASCII_CASE_INSENSITIVE, -Operations.DEFAULT_DETERMINIZE_WORK_LIMIT); +RegexpQuery.DEFAULT_PROVIDER, +Operations.DEFAULT_DETERMINIZE_WORK_LIMIT, +MultiTermQuery.CONSTANT_SCORE_BLENDED_REWRITE, +random().nextBoolean()); Review Comment: OK, I added the NFA version to `TestRegexpRandom2` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Speed up BytesRefHash#sort (another approach) [lucene]
gf2121 commented on PR #12784: URL: https://github.com/apache/lucene/pull/12784#issuecomment-1803223146 As "reorder" gets faster, I'm considering lowering the fallback threshold and letting radix sort do more of the work. ### Benchmark result: MAC Intel ``` BASELINE: sort 5169965 terms, build histogram took: 1969ms, reorder took: 2103ms, total took: 5524ms. BASELINE: sort 5169965 terms, build histogram took: 1975ms, reorder took: 2118ms, total took: 5435ms. BASELINE: sort 5169965 terms, build histogram took: 1977ms, reorder took: 2140ms, total took: 5441ms. BASELINE: sort 5169965 terms, build histogram took: 1982ms, reorder took: 2168ms, total took: 5452ms. BASELINE: sort 5169965 terms, build histogram took: 1987ms, reorder took: 2164ms, total took: 5475ms. BASELINE: sort 5169965 terms, build histogram took: 1963ms, reorder took: 2157ms, total took: 5456ms. BASELINE: sort 5169965 terms, build histogram took: 1979ms, reorder took: 2162ms, total took: 5452ms. BASELINE: sort 5169965 terms, build histogram took: 1984ms, reorder took: 2172ms, total took: 5493ms. BASELINE: sort 5169965 terms, build histogram took: 1986ms, reorder took: 2166ms, total took: 5476ms. BASELINE: sort 5169965 terms, build histogram took: 2016ms, reorder took: 2287ms, total took: 5660ms. BASELINE: sort 5169965 terms, build histogram took: 1761ms, reorder took: 2109ms, total took: 5183ms. CANDIDATE( fallback threshold = 100 ): sort 5169965 terms, build histogram took: 2018ms, reorder took: 125ms, total took: 3760ms. CANDIDATE( fallback threshold = 100 ): sort 5169965 terms, build histogram took: 1998ms, reorder took: 118ms, total took: 3630ms. CANDIDATE( fallback threshold = 100 ): sort 5169965 terms, build histogram took: 1987ms, reorder took: 115ms, total took: 3596ms. CANDIDATE( fallback threshold = 100 ): sort 5169965 terms, build histogram took: 2001ms, reorder took: 111ms, total took: 3622ms. CANDIDATE( fallback threshold = 100 ): sort 5169965 terms, build histogram took: 2017ms, reorder took: 110ms, total took: 3610ms. CANDIDATE( fallback threshold = 100 ): sort 5169965 terms, build histogram took: 2021ms, reorder took: 118ms, total took: 3712ms. CANDIDATE( fallback threshold = 100 ): sort 5169965 terms, build histogram took: 2006ms, reorder took: 108ms, total took: 3634ms. CANDIDATE( fallback threshold = 100 ): sort 5169965 terms, build histogram took: 2015ms, reorder took: 106ms, total took: 3620ms. CANDIDATE( fallback threshold = 100 ): sort 5169965 terms, build histogram took: 2006ms, reorder took: 101ms, total took: 3611ms. CANDIDATE( fallback threshold = 100 ): sort 5169965 terms, build histogram took: 1993ms, reorder took: 115ms, total took: 3609ms. CANDIDATE( fallback threshold = 100 ): sort 5169965 terms, build histogram took: 2024ms, reorder took: 106ms, total took: 3633ms. CANDIDATE( fallback threshold = 50 ): sort 5169965 terms, build histogram took: 2972ms, reorder took: 199ms, total took: 3288ms. CANDIDATE( fallback threshold = 50 ): sort 5169965 terms, build histogram took: 2941ms, reorder took: 211ms, total took: 3279ms. CANDIDATE( fallback threshold = 50 ): sort 5169965 terms, build histogram took: 2991ms, reorder took: 178ms, total took: 3283ms. CANDIDATE( fallback threshold = 50 ): sort 5169965 terms, build histogram took: 2932ms, reorder took: 195ms, total took: 3233ms. CANDIDATE( fallback threshold = 50 ): sort 5169965 terms, build histogram took: 2954ms, reorder took: 202ms, total took: 3259ms. CANDIDATE( fallback threshold = 50 ): sort 5169965 terms, build histogram took: 2938ms, reorder took: 201ms, total took: 3249ms. CANDIDATE( fallback threshold = 50 ): sort 5169965 terms, build histogram took: 2949ms, reorder took: 195ms, total took: 3295ms. CANDIDATE( fallback threshold = 50 ): sort 5169965 terms, build histogram took: 2961ms, reorder took: 208ms, total took: 3274ms. CANDIDATE( fallback threshold = 50 ): sort 5169965 terms, build histogram took: 2965ms, reorder took: 201ms, total took: 3257ms. CANDIDATE( fallback threshold = 50 ): sort 5169965 terms, build histogram took: 2957ms, reorder took: 188ms, total took: 3285ms. CANDIDATE( fallback threshold = 50 ): sort 5169965 terms, build histogram took: 2975ms, reorder took: 182ms, total took: 3291ms. ``` MAC M2 ``` BASELINE: sort 5169965 terms, build histogram took: 478ms, reorder took: 1332ms, total took: 2334ms. BASELINE: sort 5169965 terms, build histogram took: 483ms, reorder took: 1351ms, total took: 2333ms. BASELINE: sort 5169965 terms, build histogram took: 462ms, reorder took: 1319ms, total took: 2284ms. BASELINE: sort 5169965 terms, build histogram took: 463ms, reorder took: 1272ms, total took: 2246ms. BASELINE: sort 5169965 terms, build histogram took: 466ms, reorder took: 1285ms, total took: 2257ms.
Re: [PR] speedup arm int functions? [lucene]
rmuir closed pull request #12743: speedup arm int functions? URL: https://github.com/apache/lucene/pull/12743 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] speedup arm int functions? [lucene]
rmuir commented on PR #12743: URL: https://github.com/apache/lucene/pull/12743#issuecomment-1803265618 speeds up as many machines as it slows down. cascadelake: `['0', 'GenuineIntel', 'Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz', '1', 'GenuineIntel', 'Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz']` main ``` Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.binaryDotProductScalar1024 thrpt 15 2.504 ± 0.005 ops/us ``` patch ``` Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.binaryDotProductScalar1024 thrpt 15 1.850 ± 0.013 ops/us ``` graviton2: `['0', '1']` main ``` Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.binaryDotProductScalar1024 thrpt 15 2.387 ± 0.001 ops/us ``` patch ``` Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.binaryDotProductScalar1024 thrpt 15 2.121 ± 0.004 ops/us ``` graviton3: `['0', '1']` main ``` Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.binaryDotProductScalar1024 thrpt 15 2.370 ± 0.002 ops/us ``` patch ``` Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.binaryDotProductScalar1024 thrpt 15 3.346 ± 0.025 ops/us ``` haswell: `['0', 'GenuineIntel', 'Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz', '1', 'GenuineIntel', 'Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz']` main ``` Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.binaryDotProductScalar1024 thrpt 15 2.096 ± 0.016 ops/us ``` patch ``` Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.binaryDotProductScalar1024 thrpt 15 1.678 ± 0.039 ops/us ``` icelake: `['0', 'GenuineIntel', 'Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz', '1', 'GenuineIntel', 'Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz']` main ``` Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.binaryDotProductScalar1024 thrpt 15 2.586 ± 0.010 ops/us ``` patch ``` Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.binaryDotProductScalar1024 thrpt 15 2.240 ± 0.006 ops/us ``` sapphirerapids: `['0', 'GenuineIntel', 'Intel(R) Xeon(R) Platinum 8488C', '1', 'GenuineIntel', 'Intel(R) Xeon(R) Platinum 8488C']` main ``` Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.binaryDotProductScalar1024 thrpt 15 2.408 ± 0.009 ops/us ``` patch ``` Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.binaryDotProductScalar1024 thrpt 15 2.729 ± 0.058 ops/us ``` zen2: `['0', 'AuthenticAMD', 'AMD EPYC 7R32', '1', 'AuthenticAMD', 'AMD EPYC 7R32']` main ``` Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.binaryDotProductScalar1024 thrpt 15 1.544 ± 0.004 ops/us ``` patch ``` Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.binaryDotProductScalar1024 thrpt 15 1.435 ± 0.004 ops/us ``` zen3: `['0', 'AuthenticAMD', 'AMD EPYC 7R13 Processor', '1', 'AuthenticAMD', 'AMD EPYC 7R13 Processor']` main ``` Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.binaryDotProductScalar1024 thrpt 15 1.580 ± 0.002 ops/us ``` patch ``` Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.binaryDotProductScalar1024 thrpt 15 1.791 ± 0.004 ops/us ``` zen4: `['0', 'AuthenticAMD', 'AMD EPYC 9R14', '1', 'AuthenticAMD', 'AMD EPYC 9R14']` main ``` Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.binaryDotProductScalar1024 thrpt 15 1.856 ± 0.002 ops/us ``` patch ``` Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.binaryDotProductScalar1024 thrpt 15 2.364 ± 0.007 ops/us ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-ma
[PR] script to run microbenchmarks across different ec2 instance types [lucene]
rmuir opened a new pull request, #12787: URL: https://github.com/apache/lucene/pull/12787 This saves me a lot of time and prevents making bad changes that help some cpus and hurt others. Case in point: #12743 You run a command such as: ``` make PATCH_BRANCH=rmuir:some-speedup ``` The benchmark procedure is run across different instance types in parallel. "large" instances are used, although I try to keep it inexpensive, it is java and needs RAM. At the end, you get a combined `build/report.txt` which can be pasted into a PR comment. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Speed up BytesRefHash#sort (another approach) [lucene]
gf2121 commented on PR #12784: URL: https://github.com/apache/lucene/pull/12784#issuecomment-1803319176 I try this approach with `wikimedium10m` on the M2 mac, the sort took sum decreased ~60%. Details https://bytedance.larkoffice.com/sheets/XfVCsZL5phx9letbDEQcaw0snnf"; data-importRangeRawData-range="'Sheet1'!D1:I62"> flush round | sort field name | sort term count | baseline | candidate | diff -- | -- | -- | -- | -- | -- 1 | date | 45925 | 32 | 15 | -53.13% 1 | groupend | 1 | 0 | 0 | #DIV/0! 1 | titleTokenized | 33250 | 10 | 6 | -40.00% 1 | id | 989620 | 356 | 167 | -53.09% 1 | body | 3444798 | 1991 | 698 | -64.94% 1 | title | 46261 | 10 | 7 | -30.00% 2 | date | 84517 | 52 | 21 | -59.62% 2 | groupend | 1 | 0 | 0 | #DIV/0! 2 | titleTokenized | 50134 | 15 | 7 | -53.33% 2 | id | 1059900 | 680 | 212 | -68.82% 2 | body | 3210318 | 1730 | 660 | -61.85% 2 | title | 85009 | 22 | 13 | -40.91% 3 | date | 86188 | 37 | 21 | -43.24% 3 | groupend | 1 | 0 | 0 | #DIV/0! 3 | titleTokenized | 60502 | 14 | 9 | -35.71% 3 | id | 1011210 | 397 | 185 | -53.40% 3 | body | 3389689 | 1865 | 692 | -62.90% 3 | title | 86756 | 24 | 15 | -37.50% 4 | date | 123222 | 76 | 39 | -48.68% 4 | groupend | 1 | 0 | 0 | #DIV/0! 4 | titleTokenized | 84573 | 23 | 16 | -30.43% 4 | id | 1030670 | 533 | 205 | -61.54% 4 | body | 3442602 | 1985 | 717 | -63.88% 4 | title | 124156 | 41 | 24 | -41.46% 5 | date | 150957 | 85 | 38 | -55.29% 5 | groupend | 1 | 0 | 0 | #DIV/0! 5 | titleTokenized | 100910 | 37 | 16 | -56.76% 5 | id | 1045550 | 595 | 209 | -64.87% 5 | body | 3267300 | 2031 | 689 | -66.08% 5 | title | 152061 | 58 | 32 | -44.83% 6 | date | 162374 | 77 | 41 | -46.75% 6 | groupend | 1 | 0 | 0 | #DIV/0! 6 | titleTokenized | 107464 | 30 | 17 | -43.33% 6 | id | 1057430 | 385 | 194 | -49.61% 6 | body | 3140161 | 1834 | 633 | -65.49% 6 | title | 163347 | 68 | 36 | -47.06% 7 | date | 177647 | 116 | 45 | -61.21% 7 | groupend | 1 | 0 | 0 | #DIV/0! 7 | titleTokenized | 120037 | 59 | 18 | -69.49% 7 | id | 1062920 | 652 | 211 | -67.64% 7 | body | 3173313 | 2373 | 638 | -73.11% 7 | title | 178886 | 79 | 37 | -53.16% 8 | date | 199744 | 106 | 50 | -52.83% 8 | groupend | 1 | 0 | 0 | #DIV/0! 8 | titleTokenized | 133801 | 42 | 21 | -50.00% 8 | id | 1072970 | 409 | 193 | -52.81% 8 | body | 3090582 | 1817 | 641 | -64.72% 8 | title | 201130 | 84 | 45 | -46.43% 9 | date | 209298 | 104 | 62 | -40.38% 9 | groupend | 1 | 0 | 0 | #DIV/0! 9 | titleTokenized | 142697 | 50 | 25 | -50.00% 9 | id | 1084580 | 571 | 225 | -60.60% 9 | body | 3032046 | 1642 | 631 | -61.57% 9 | title | 210958 | 92 | 46 | -50.00% 10 | date | 119014 | 47 | 30 | -36.17% 10 | groupend | 1 | 0 | 0 | #DIV/0! 10 | titleTokenized | 93308 | 26 | 16 | -38.46% 10 | id | 585150 | 208 | 111 | -46.63% 10 | body | 1909153 | 918 | 380 | -58.61% 10 | title | 119806 | 50 | 27 | -46.00% sort took sum | | | 24538 | 9086 | -62.97% -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Multiple ClassNotFoundExceptions in IntelliJ Fat Jar on ARM64 Java 20 [lucene]
uschindler commented on issue #12307: URL: https://github.com/apache/lucene/issues/12307#issuecomment-1803320694 Hi, this has nothing to do with the minimum Java version in your Java built, it only has to do with the runtime Java version. If you use Java 19 or later, the classloader needs to be able to find the version specific support classes for MemorySegment (foreign API) and Vector API in the JAR file, which it clearly says: Caused by: java.lang.ExceptionInInitializerError: Exception java.lang.LinkageError: MemorySegmentIndexInputProvider is missing in Lucene JAR file [in thread "main"] Lucene's JAR files are modularized (https://openjdk.org/jeps/261) MR-JAR (https://openjdk.org/jeps/238) files with service provider (SPI, https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/ServiceLoader.html). If you want to merge different JAR files and Lucene's JAR files are part of it, please read ALL the three specs above and then ensure the following: - All files in `META-INF/services` _with same name need to be merged together_ and placed in resulting JAR (https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/ServiceLoader.html) - All files in `META-INF/versions` need to be copied to resulting JAR with exactly same directory structure. Those files are the version specific classes and are only loaded depending on the Runtime JDK (https://openjdk.org/jeps/238). - META-INF/MANIFEST.MF must contain `Multi-Release: true` (https://openjdk.org/jeps/238). - All class files named `module-info.class` should be removed as the resulting JAR file is no longer modularized. Keeping them inside may fail to load the application in modularized runtime (https://openjdk.org/jeps/261). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org