[GitHub] [lucene] ChrisHegarty commented on issue #12302: vector API integration, plan B
ChrisHegarty commented on issue #12302: URL: https://github.com/apache/lucene/issues/12302#issuecomment-1554223493 ++ to all of what @uschindler and @rmuir said relating to which JDK versions we add support for. Specifically, let's start with JDK 20 **only**. After which we can prepare for, and test with, JDK 21 EA. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] ChrisHegarty commented on a diff in pull request #12311: Integrate the Incubating Panama Vector API
ChrisHegarty commented on code in PR #12311: URL: https://github.com/apache/lucene/pull/12311#discussion_r1198703579 ## gradle/testing/defaults-tests.gradle: ## @@ -122,7 +122,7 @@ allprojects { // Lucene needs to optional modules at runtime, which we want to enforce for testing // (if the runner JVM does not support them, it will fail tests): - jvmArgs '--add-modules', 'jdk.unsupported,jdk.management' + jvmArgs '--add-modules', 'jdk.unsupported,jdk.management,jdk.incubator.vector' Review Comment: We're going to need a way to test both the default and the VectorAPI implementations. Currently the code just checks for the presence of the jdk.incubator.vector module. Either we add some additional configuration, like say a system prop to forcibly disable the VectorAPI implementation, or else we could have some gradle-foo that controls the addition of the module in `--add-modules` ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] ChrisHegarty commented on pull request #12311: Integrate the Incubating Panama Vector API
ChrisHegarty commented on PR #12311: URL: https://github.com/apache/lucene/pull/12311#issuecomment-1554440332 I refactored the provider and impl's: 1. So as to separate them out from VectorUtil - this should improve readability, etc, as we move beyond dotProduct. 2. I also moved them into a it's own non-exported package. I'm less sure about no.2. The general thought was that the code might be more reusable from there, but now that I think about it, it might be better as package-private where it was, since the "interface" is through VectorUtils - not directly to the imp. Thoughts? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] ChrisHegarty commented on a diff in pull request #12311: Integrate the Incubating Panama Vector API
ChrisHegarty commented on code in PR #12311: URL: https://github.com/apache/lucene/pull/12311#discussion_r1198869676 ## gradle/testing/defaults-tests.gradle: ## @@ -119,10 +119,13 @@ allprojects { if (rootProject.runtimeJavaVersion < JavaVersion.VERSION_16) { jvmArgs '--illegal-access=deny' } + + // Disable assertions to workaround JDK-8301190 + jvmArgs '-da:jdk.incubator.vector.LaneType' Review Comment: Argh! we're running into a JDK bug, fixed in the not-yet-released JDK 20.0.2! :-( The JDK bug is https://bugs.openjdk.org/browse/JDK-8301190 - it's an incorrect assertion when default locale is say tr. Reproduce with: `./gradlew :lucene:core:test --tests "org.apache.lucene.util.TestVectorUtil" -Dtests.locale=tr-TR` I've just disabled assertions in this one JDK class to workaround the issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] ChrisHegarty commented on a diff in pull request #12311: Integrate the Incubating Panama Vector API
ChrisHegarty commented on code in PR #12311: URL: https://github.com/apache/lucene/pull/12311#discussion_r1198879166 ## gradle/testing/defaults-tests.gradle: ## @@ -119,10 +119,13 @@ allprojects { if (rootProject.runtimeJavaVersion < JavaVersion.VERSION_16) { jvmArgs '--illegal-access=deny' } + + // Disable assertions to workaround JDK-8301190 + jvmArgs '-da:jdk.incubator.vector.LaneType' Review Comment: I'm now questioning how safe it is to just suppress this. :-( -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] contrebande-labs commented on issue #12302: vector API integration, plan B
contrebande-labs commented on issue #12302: URL: https://github.com/apache/lucene/issues/12302#issuecomment-1554466920 Guys, relax. My intention is to _**HELP**_. I was trying to find something that can be done in parallel to what @ChrisHegarty is doing so I don't step on his toes. If you want help, let me know. And be specific about it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on a diff in pull request #12311: Integrate the Incubating Panama Vector API
rmuir commented on code in PR #12311: URL: https://github.com/apache/lucene/pull/12311#discussion_r1198991426 ## gradle/testing/defaults-tests.gradle: ## @@ -119,10 +119,13 @@ allprojects { if (rootProject.runtimeJavaVersion < JavaVersion.VERSION_16) { jvmArgs '--illegal-access=deny' } + + // Disable assertions to workaround JDK-8301190 + jvmArgs '-da:jdk.incubator.vector.LaneType' Review Comment: Yeah, i think its not good to ignore it since not all uses of lucene are server-side and someone might run it e.g. in their IDE on a Turkish machine. Couple of alternatives: * start with JDK-21 as our first supported release. avoids the problem easily, but means nobody can use this stuff until September * fall back to scalar impl (e.g. pretend vector api is not enabled) if the user has Turkish or Azeri locale and jdk version < 21? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on a diff in pull request #12311: Integrate the Incubating Panama Vector API
uschindler commented on code in PR #12311: URL: https://github.com/apache/lucene/pull/12311#discussion_r1199003878 ## gradle/testing/defaults-tests.gradle: ## @@ -119,10 +119,13 @@ allprojects { if (rootProject.runtimeJavaVersion < JavaVersion.VERSION_16) { jvmArgs '--illegal-access=deny' } + + // Disable assertions to workaround JDK-8301190 + jvmArgs '-da:jdk.incubator.vector.LaneType' Review Comment: Forbiddenapis should be used inside openjdk. Suggested this several times. Is always ignored, although it works also with Ant (used in openjdk for compiling). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on a diff in pull request #12311: Integrate the Incubating Panama Vector API
uschindler commented on code in PR #12311: URL: https://github.com/apache/lucene/pull/12311#discussion_r1199008182 ## gradle/testing/defaults-tests.gradle: ## @@ -122,7 +122,7 @@ allprojects { // Lucene needs to optional modules at runtime, which we want to enforce for testing // (if the runner JVM does not support them, it will fail tests): - jvmArgs '--add-modules', 'jdk.unsupported,jdk.management' + jvmArgs '--add-modules', 'jdk.unsupported,jdk.management,jdk.incubator.vector' Review Comment: I don't think we need to explicitly test both. If the build runs with java 17 and no runtime of 20 given it uses legacy code. The same us true for mmapdir. There it is tested by Jenkins based on the used jfk version. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] ChrisHegarty commented on a diff in pull request #12311: Integrate the Incubating Panama Vector API
ChrisHegarty commented on code in PR #12311: URL: https://github.com/apache/lucene/pull/12311#discussion_r1199035703 ## gradle/testing/defaults-tests.gradle: ## @@ -119,10 +119,13 @@ allprojects { if (rootProject.runtimeJavaVersion < JavaVersion.VERSION_16) { jvmArgs '--illegal-access=deny' } + + // Disable assertions to workaround JDK-8301190 + jvmArgs '-da:jdk.incubator.vector.LaneType' Review Comment: I went with option 2 - fall back to scalar impl (e.g. pretend vector api is not enabled) if the user has Turkish or Azeri locale and jdk version < 21. Clearly we can revisit this if it turns out to be problematic. I'm hoping not - and we should be able to do something smarter for JDK 20.0.2, which has this JDK bug fixed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler merged pull request #12308: Wrap Query rewrite backwards layer with AccessController
uschindler merged PR #12308: URL: https://github.com/apache/lucene/pull/12308 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on a diff in pull request #12311: Integrate the Incubating Panama Vector API
uschindler commented on code in PR #12311: URL: https://github.com/apache/lucene/pull/12311#discussion_r1199048935 ## lucene/core/src/java/org/apache/lucene/internal/vector/DefaultVectorUtilProvider.java: ## @@ -0,0 +1,92 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.internal.vector; + +/** + * The default VectorUtil provider implementation. + * + * @lucene.internal + */ +public final class DefaultVectorUtilProvider implements VectorUtilProvider { Review Comment: IMHO, this one should be package private like the Java 20 one. In general I am not really happy with the additional package. I know you are a fan of the module system, but most users still don't use it and our Javadocs also show all packages (including internal ones). Is it really needed to have the vector stuff in a separate package. For MMAPDirectory I made sure to hide everything, including the provider as it is no public API. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on a diff in pull request #12311: Integrate the Incubating Panama Vector API
uschindler commented on code in PR #12311: URL: https://github.com/apache/lucene/pull/12311#discussion_r1199053205 ## lucene/core/src/java/org/apache/lucene/internal/vector/VectorUtilProvider.java: ## @@ -76,4 +77,10 @@ static boolean vectorModulePresentAndReadable() { } return false; } + + // Workaround for JDK-8301190, avoids assertion when default locale is say tr. + @SuppressForbidden(reason = "required to determine if non-workable locale") + static boolean useVectorAPI() { +return 'I' == int.class.getSimpleName().toUpperCase().charAt(0); Review Comment: this confused me at first, it's really funny. How about a simpler: `return Objects.equals('I', "i".toUpperCase());` ## lucene/core/src/java/org/apache/lucene/internal/vector/VectorUtilProvider.java: ## @@ -76,4 +77,10 @@ static boolean vectorModulePresentAndReadable() { } return false; } + + // Workaround for JDK-8301190, avoids assertion when default locale is say tr. + @SuppressForbidden(reason = "required to determine if non-workable locale") + static boolean useVectorAPI() { +return 'I' == int.class.getSimpleName().toUpperCase().charAt(0); Review Comment: this confused me at first, it's really funny. How about a simpler: `return Objects.equals("I", "i".toUpperCase());` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] tang-hi commented on pull request #12255: allocate one NeighborQueue per search for results
tang-hi commented on PR #12255: URL: https://github.com/apache/lucene/pull/12255#issuecomment-1554708934 @msokolov, thank you! I have successfully run the test and it confirms what I mentioned earlier. I believe that #12303 by @jbellis could resolve this issue. **baseline** (this pull request) compared to the **candidate** version (version prior to this pull request).  **baseline**(version prior to this pull request) compared to the **candidate** version(fix this PR)  -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on a diff in pull request #12290: Make memory fence in `ByteBufferGuard` explicit
uschindler commented on code in PR #12290: URL: https://github.com/apache/lucene/pull/12290#discussion_r1199073297 ## lucene/core/src/java/org/apache/lucene/store/ByteBufferGuard.java: ## @@ -65,14 +62,8 @@ public ByteBufferGuard(String resourceDescription, BufferCleaner cleaner) { public void invalidateAndUnmap(ByteBuffer... bufs) throws IOException { if (cleaner != null) { invalidated = true; - // This call should hopefully flush any CPU caches and as a result make - // the "invalidated" field update visible to other threads. We specifically - // don't make "invalidated" field volatile for performance reasons, hoping the - // JVM won't optimize away reads of that field and hardware should ensure - // caches are in sync after this call. This isn't entirely "fool-proof" - // (see LUCENE-7409 discussion), but it has been shown to work in practice - // and we count on this behavior. - barrier.lazySet(0); + // Makes "invalidated" field visible to other threads. + VarHandle.fullFence(); Review Comment: I'd like to merge this, but could we restore most of the previous comment. The comment added is not correct, because the invalidated field is not definitely visible to other threads (only in code not yet optimized). So please refer to the "fool-proof" discussion too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #12294: Implement MMapDirectory with Java 21 Project Panama Preview API
uschindler commented on PR #12294: URL: https://github.com/apache/lucene/pull/12294#issuecomment-1554737939 I noticed that the apijar files are quite large, because the extraction code can't remove package private superclasses. Therefore all package private classes stay alive as "empty" fragments. For panama-foreign this was not an issue, as in `java.base` the rules for separating implementation from api is very strict. I will update the code to do a 2pass scan: first collect all referenced classes (superclasses and interfaces) and then extract APIs. I will commit directly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #12311: Integrate the Incubating Panama Vector API
uschindler commented on PR #12311: URL: https://github.com/apache/lucene/pull/12311#issuecomment-1554738661 I noticed that the apijar files are quite large, because the extraction code can't remove package private superclasses. Therefore all package private classes stay alive as "empty" fragments. For panama-foreign this was not an issue, as in java.base the rules for separating implementation from api is very strict. I will update the code to do a 2pass scan: first collect all referenced classes (superclasses and interfaces) and then extract APIs. I will commit directly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #12311: Integrate the Incubating Panama Vector API
uschindler commented on PR #12311: URL: https://github.com/apache/lucene/pull/12311#issuecomment-1554742703 In addition, as we do not implement java 19 vector support yet, I would add some code to don't extract it dependning on java version. So we can control separately which of the 2 api areas (foreign / vector) are extracted per version. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #12311: Integrate the Incubating Panama Vector API
uschindler commented on PR #12311: URL: https://github.com/apache/lucene/pull/12311#issuecomment-1554743815 In addition at some point we should rename the files, but thats not urgent because naming is not so important. We should then also rename the extraction gradle script, as it will be used not only for panama-foreign. But this is cosmetic only. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] alessandrobenedetti opened a new issue, #12313: Multi-value Support for KnnVectorField
alessandrobenedetti opened a new issue, #12313: URL: https://github.com/apache/lucene/issues/12313 ### Description It would be nice to support multiple values in a Knn vector field. This must be compatible with both the Exact and Approximate Nearest Neighbor search. There are two sides to the coin: 1) Index time support - allowing to add in the indexing data structures multiple vectors for the same field and docID 2) Query time support - how to retrieve a topK list of documents, where each document my have multiple neighbors to the query The problem is more complicated than it seems. An initial tentative design and draft implementation is attached -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] alessandrobenedetti opened a new pull request, #12314: Multi-value support for KnnVectorField
alessandrobenedetti opened a new pull request, #12314: URL: https://github.com/apache/lucene/pull/12314 ### Description This pull request aims to introduce support for multiple values in a single Knn vector field. The adopted solution relies on: **Index time** Sparse vector values approach where an Ordinal(vectorId) to DocId map is used to keep the relation between a DocId and all its vectors. In the current sparse vector approach, we have just one vectorId per docID In this proposed contribution, multiple vectorIds are mapped to the same docID **Query time** A multi-valued strategy choice is offered to the user: MAX/SUM In exact nearest neighbor, for each document accepted by the query/filter : MAX = the similarity score between the query and each vector is computed, the max score is chosen for the search result SUM = the similarity score between the query and each vector is computed, all scores are summed to get the final score In aproximate nearest neighbor, for each document accepted by the query/filter : MAX = every time we find a nearest neighbor vector to be added to the topK, if the document is already there, its score is updated keeping the maximum between what it was there and the new score SUM = every time we find a nearest neighbor vector to be added to the topK, if the document is already there, its score is updated summing the old and new score N.B. This Pull Request is not meant to be ready to be merged at this stage. I can identify at least this set of activities before this draft can move to a 'production ready' version: 1) validate the overall idea and approach 2) validate index time usage of sparse vector values for the multi-valued use case 3) validate merge policy for the multi-valued use case 4) validate query time MAX/SUM approach 5) validate query time modifiable heap and neighborQueue usage 6) validate regressions 7) introduce more tests It's a big contribution and It will take time and effort to be completed. Any help is welcome. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] tang-hi commented on pull request #12314: Multi-value support for KnnVectorField
tang-hi commented on PR #12314: URL: https://github.com/apache/lucene/pull/12314#issuecomment-1554789228 This concept is interesting, but I am curious about its practical uses. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler closed issue #12304: VirtualMethod does unprivileged reflection access
uschindler closed issue #12304: VirtualMethod does unprivileged reflection access URL: https://github.com/apache/lucene/issues/12304 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on a diff in pull request #12312: [DRAFT] GH#12176: TermInSetQuery extends AutomatonQuery
rmuir commented on code in PR #12312: URL: https://github.com/apache/lucene/pull/12312#discussion_r1199137356 ## lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java: ## @@ -308,17 +290,83 @@ private void replaceOrRegister(State state) { } } - /** - * Add a suffix of current starting at fromIndex (inclusive) to state - * state. - */ - private void addSuffix(State state, CharSequence current, int fromIndex) { -final int len = current.length(); -while (fromIndex < len) { - int cp = Character.codePointAt(current, fromIndex); - state = state.newState(cp); - fromIndex += Character.charCount(cp); + private static class CharacterBasedBuilder extends DaciukMihovAutomatonBuilder { +private final CharsRefBuilder scratch = new CharsRefBuilder(); + +@Override +protected void add(BytesRef current) { + // Convert the input UTF-8 bytes to CharsRef so we can use the code points as our transition + // labels. Review Comment: hmm, this won't work if someone has binary terms that aren't UTF-8 encoded. I think we shouldn't be building an int32 automaton since we are passing `binary=true` to AutomatonQuery constructor. Instead we just want to build a binary automaton, where transitions are just bytes. PrefixQuery is an example of an automatonquery that supports binary terms and builds a binary automaton directly in this way. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on a diff in pull request #12312: [DRAFT] GH#12176: TermInSetQuery extends AutomatonQuery
rmuir commented on code in PR #12312: URL: https://github.com/apache/lucene/pull/12312#discussion_r1199139773 ## lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java: ## @@ -308,17 +290,83 @@ private void replaceOrRegister(State state) { } } - /** - * Add a suffix of current starting at fromIndex (inclusive) to state - * state. - */ - private void addSuffix(State state, CharSequence current, int fromIndex) { -final int len = current.length(); -while (fromIndex < len) { - int cp = Character.codePointAt(current, fromIndex); - state = state.newState(cp); - fromIndex += Character.charCount(cp); + private static class CharacterBasedBuilder extends DaciukMihovAutomatonBuilder { +private final CharsRefBuilder scratch = new CharsRefBuilder(); + +@Override +protected void add(BytesRef current) { + // Convert the input UTF-8 bytes to CharsRef so we can use the code points as our transition + // labels. Review Comment: nevermind, sorry for the noise. i read the diff wrong and got the char/binary mixed up. i think you are doing it right -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #12312: [DRAFT] GH#12176: TermInSetQuery extends AutomatonQuery
rmuir commented on PR #12312: URL: https://github.com/apache/lucene/pull/12312#issuecomment-1554807614 thanks for getting this started! Will be interested to see how the use of `Terms.intersect` impacts the performance. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] ChrisHegarty commented on a diff in pull request #12311: Integrate the Incubating Panama Vector API
ChrisHegarty commented on code in PR #12311: URL: https://github.com/apache/lucene/pull/12311#discussion_r1199144041 ## lucene/core/src/java/org/apache/lucene/internal/vector/DefaultVectorUtilProvider.java: ## @@ -0,0 +1,92 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.internal.vector; + +/** + * The default VectorUtil provider implementation. + * + * @lucene.internal + */ +public final class DefaultVectorUtilProvider implements VectorUtilProvider { Review Comment: I was on the fence about this too. I’ll revert it - put it back in the original package and depend upon package-private access. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] alessandrobenedetti commented on pull request #12314: Multi-value support for KnnVectorField
alessandrobenedetti commented on PR #12314: URL: https://github.com/apache/lucene/pull/12314#issuecomment-1554825253 I'll follow up with many clean up and tidy up on my own in the next few weeks. I should have a bit of bandwidth from now till Berlin Buzzword (mid June). Any feedback is welcome, I am pretty sure there are a *lot* of wrong/imperfect stuff around. I worked on this for many months, in my spare time, I probably spent most of my time merging from the main as this area changed a lot :) Didn't bother with pre-commit and stuff, I'll do it in the next few days. @tang-hi the idea is to allow multiple vectors in a field, sometimes you have multiple paragraphs and long texts you may want to split before encoding them to vectors, rather than concatenate everything to a single vector -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler opened a new pull request, #12315: Make sure APIJAR reproduces with different timezone
uschindler opened a new pull request, #12315: URL: https://github.com/apache/lucene/pull/12315 this PR just fixes the apijar generator to reproduce the file exactly (unfortunately java encodes the date using local timezone). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #12311: Integrate the Incubating Panama Vector API
uschindler commented on PR #12311: URL: https://github.com/apache/lucene/pull/12311#issuecomment-1554866016 ... I had to first the reason why my computer produced a different APIJAR from beginning I HATE DEFAULT TIMEZONE, DIE; DIE; DIE -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler merged pull request #12315: Make sure APIJAR reproduces with different timezone
uschindler merged PR #12315: URL: https://github.com/apache/lucene/pull/12315 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #12311: Integrate the Incubating Panama Vector API
uschindler commented on PR #12311: URL: https://github.com/apache/lucene/pull/12311#issuecomment-1554917991 I fixed it and merged the main branch into this one. Proceeding with fixing API generator to exclude unreferenced, private classes -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] donnerpeter opened a new pull request, #12316: hunspell (minor): reduce allocations when processing compound rules
donnerpeter opened a new pull request, #12316: URL: https://github.com/apache/lucene/pull/12316 IntelliJ's allocation profiler shows some non-zero numbers from stream allocation and traversal. While JIT might be able to eliminate this sometimes, I'd prefer to avoid the doubt completely. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on a diff in pull request #12311: Integrate the Incubating Panama Vector API
uschindler commented on code in PR #12311: URL: https://github.com/apache/lucene/pull/12311#discussion_r1199198237 ## lucene/core/src/java/org/apache/lucene/internal/vector/VectorUtilProvider.java: ## @@ -76,4 +77,10 @@ static boolean vectorModulePresentAndReadable() { } return false; } + + // Workaround for JDK-8301190, avoids assertion when default locale is say tr. + @SuppressForbidden(reason = "required to determine if non-workable locale") + static boolean useVectorAPI() { +return 'I' == int.class.getSimpleName().toUpperCase().charAt(0); Review Comment: You can remove the supress forbidden, if you use `toUppercase(Locale.getDefault())` (being explicit is allowed). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on pull request #12314: Multi-value support for KnnVectorField
benwtrent commented on PR #12314: URL: https://github.com/apache/lucene/pull/12314#issuecomment-1554999300 @alessandrobenedetti thank you for kick starting this! You are absolutely correct, this is a large, but pivotal and necessary change for vector search in Lucene. I have not yet reviewed or looked over the entire design, but two things stuck out to me immediately. 1. To simplify the review, design, testing, etc. could you simplify and reduce scope? Meaning, for the first iteration of this, Lucene only supports `max` and is used by default for multi-vector fields. This should: - remove the need for the additional search parameter (can be added as a part 2 of this feature in the future). - reduce testing requirements - reduce user overhead as the closest document based on the closest vector is returned. This behavior parallels nicely with a single vector field. 2. Declaring a vector field as "multi-vector" seems...weird and has a very blast radius on the entire code base. It seems to me that the vector codec should seamlessly handle multiple vectors per field, just like it seamlessly handles sparse vector encoding. Though, thinking about it more, it seems difficult to support both sparse & multi-vector values in this way. All that said, I am curious about to your reasoning here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on a diff in pull request #12311: Integrate the Incubating Panama Vector API
uschindler commented on code in PR #12311: URL: https://github.com/apache/lucene/pull/12311#discussion_r1199048935 ## lucene/core/src/java/org/apache/lucene/internal/vector/DefaultVectorUtilProvider.java: ## @@ -0,0 +1,92 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.internal.vector; + +/** + * The default VectorUtil provider implementation. + * + * @lucene.internal + */ +public final class DefaultVectorUtilProvider implements VectorUtilProvider { Review Comment: IMHO, this one should be package private like the Java 20 one. In general I am not really happy with the additional package? I know you are a fan of the module system, but most users still don't use it and our Javadocs also show all packages (including internal ones). Is it really needed to have the vector stuff in a separate package. For MMAPDirectory I made sure to hide everything, including the provider as it is no public API. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] ChrisHegarty commented on a diff in pull request #12311: Integrate the Incubating Panama Vector API
ChrisHegarty commented on code in PR #12311: URL: https://github.com/apache/lucene/pull/12311#discussion_r1199236929 ## lucene/core/src/java/org/apache/lucene/internal/vector/VectorUtilProvider.java: ## @@ -76,4 +77,10 @@ static boolean vectorModulePresentAndReadable() { } return false; } + + // Workaround for JDK-8301190, avoids assertion when default locale is say tr. + @SuppressForbidden(reason = "required to determine if non-workable locale") + static boolean useVectorAPI() { +return 'I' == int.class.getSimpleName().toUpperCase().charAt(0); Review Comment: Yeah, sorry for this - I was a bit distracted! I just committed your suggestion, thanks. [0779fcb](https://github.com/apache/lucene/pull/12311/commits/0779fcb801cc668a943bb5bbca44df5272321f59) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] ChrisHegarty commented on pull request #12311: Integrate the Incubating Panama Vector API
ChrisHegarty commented on PR #12311: URL: https://github.com/apache/lucene/pull/12311#issuecomment-1555062236 > In addition at some point we should rename the files, but thats not urgent because naming is not so important. We should then also rename the extraction gradle script, as it will be used not only for panama-foreign. But this is cosmetic only. Agreed. I had a similar thought when starting out on this, but avoided it then because of the "noise" - but it's certainly worth doing now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] ChrisHegarty commented on a diff in pull request #12311: Integrate the Incubating Panama Vector API
ChrisHegarty commented on code in PR #12311: URL: https://github.com/apache/lucene/pull/12311#discussion_r1199250652 ## lucene/core/src/java/org/apache/lucene/internal/vector/DefaultVectorUtilProvider.java: ## @@ -0,0 +1,92 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.internal.vector; + +/** + * The default VectorUtil provider implementation. + * + * @lucene.internal + */ +public final class DefaultVectorUtilProvider implements VectorUtilProvider { Review Comment: I put things back, with a little renaming. All should be package-private. [d3fe388](https://github.com/apache/lucene/pull/12311/commits/d3fe38815b4d918577704e5a5df06d904001ce7f) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss commented on a diff in pull request #12316: hunspell (minor): reduce allocations when processing compound rules
dweiss commented on code in PR #12316: URL: https://github.com/apache/lucene/pull/12316#discussion_r1199252099 ## lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Dictionary.java: ## @@ -155,7 +155,7 @@ public class Dictionary { boolean checkCompoundCase, checkCompoundDup, checkCompoundRep; boolean checkCompoundTriple, simplifiedTriple; int compoundMin = 3, compoundMax = Integer.MAX_VALUE; - List compoundRules; // nullable + CompoundRule[] compoundRules; // nullable Review Comment: LGTM. Would it be any different if you passed the size to ArrayList's constructor? I'm guessing streams have an associated cost but if it's nearly the same then I'd go with less verbose version (collections). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] donnerpeter commented on a diff in pull request #12316: hunspell (minor): reduce allocations when processing compound rules
donnerpeter commented on code in PR #12316: URL: https://github.com/apache/lucene/pull/12316#discussion_r1199286526 ## lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Dictionary.java: ## @@ -155,7 +155,7 @@ public class Dictionary { boolean checkCompoundCase, checkCompoundDup, checkCompoundRep; boolean checkCompoundTriple, simplifiedTriple; int compoundMin = 3, compoundMax = Integer.MAX_VALUE; - List compoundRules; // nullable + CompoundRule[] compoundRules; // nullable Review Comment: With the list, the question arose whether to iterate over it with foreach or with indexed access. The latter definitely has no allocations but results in wordier code (including an IDE suppression comment). So I chose an array to get the best of both worlds :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss commented on a diff in pull request #12316: hunspell (minor): reduce allocations when processing compound rules
dweiss commented on code in PR #12316: URL: https://github.com/apache/lucene/pull/12316#discussion_r1199299390 ## lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Dictionary.java: ## @@ -155,7 +155,7 @@ public class Dictionary { boolean checkCompoundCase, checkCompoundDup, checkCompoundRep; boolean checkCompoundTriple, simplifiedTriple; int compoundMin = 3, compoundMax = Integer.MAX_VALUE; - List compoundRules; // nullable + CompoundRule[] compoundRules; // nullable Review Comment: Yup, sure. Do those allocations really show up in benchmarks? I'd think it to be almost non-detectable as long as it doesn't go beyond local thread allocation buffers. Just wondering, I'm fine with the patch. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] donnerpeter merged pull request #12316: hunspell (minor): reduce allocations when processing compound rules
donnerpeter merged PR #12316: URL: https://github.com/apache/lucene/pull/12316 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] RS146BIJAY commented on issue #12228: IndexWriter should clean up unreferenced files when segment merge fails due to disk full
RS146BIJAY commented on issue #12228: URL: https://github.com/apache/lucene/issues/12228#issuecomment-1555145366 As of now is there is any way User can delete these unreferenced files on their end? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] RS146BIJAY commented on issue #12228: IndexWriter should clean up unreferenced files when segment merge fails due to disk full
RS146BIJAY commented on issue #12228: URL: https://github.com/apache/lucene/issues/12228#issuecomment-1555155442 Also Lucene provides a way to rollback to previous commit using ```rollback``` function call. But as of now, it also closes the IndexWriter as well. I think ```close``` function almost does the same thing. Any reason why ```rollback``` also closes IndexWriter? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #12312: [DRAFT] GH#12176: TermInSetQuery extends AutomatonQuery
gsmiller commented on PR #12312: URL: https://github.com/apache/lucene/pull/12312#issuecomment-1555164152 Here's what I'm seeing so far in benchmarking... I took a custom benchmarking approach for this, similar to #12151 and other related issues. I did this because, 1) we don't really have benchmark coverage in `luceneutil` (I know, we should address this!), and 2) it lets me test a number of specific scenarios. Honestly, it was just the easiest path for me to get some numbers. My benchmarked is here: [TiSBench.java.txt](https://github.com/apache/lucene/files/11519960/TiSBench.java.txt). If you're interested in running it, you need a little setup. Most of it is obvious, but I'm happy to answer questions if helpful. The benchmark indexes geonames data (~12MM records). It includes an ID field and a Country Code field (both postings and doc values for each). The ID field is a primary key. The benchmark tasks break down into: 1. Term disjunctions over the country code field, with varying cardinality: "all" country codes (254 terms), "medium cardinality" country codes (20 different terms), and "low cardinality" country codes (10 different terms). The medium/low cardinality tasks also break down into a set with high and low costs (i.e., common terms and rare terms). 2. Term disjunctions over the ID field. There are high/medium/low versions of this task (500 terms, 20 terms, 10 terms) For each task, there are four runs: 1. Typical postings-based approach with the current TermInSetQuery that extends MultiTermQuery (using prefix coding of terms and the ping-pong intersection with index terms) 2. Postings-based approach with a new TermInSetQuery that extends AutomatonQuery (delegating to Term#intersect to intersect query terms with index terms) 3. DocValues-based approach with the current TermInSetQuery 4. DocValues-based approach with the new TermInSetQuery I ran postings- and docvalues-based approaches since the term dictionary implementations are different. In general, the two approaches demonstrate similar latency characteristics, but the automaton approach is a bit worse on the PK field. I dug in a bit with a profiler and I _think_ we're just seeing the overhead of building the automaton. I think this overhead is showing up because the PK query processing is so cheap in general vs. the other tasks that can "hide" the overhead. So... as of now, I don't see any performance benefits to moving to this approach, and maybe see some regressions. On the other hand, it would be nice to move to this implementation so we could have codec-dependent intersection techniques, which would help address issues like #12280 (bloom filter implementation could have a specific intersection implementation that leverages the bloom filter). I'll try to run some benchmarks on our Amazon product search application next week just to gather some additional data points. Maybe we can learn more about how this technique might behave on another benchmark data set. Here are the benchmark results (numbers are query time in ms): | Task | Postings MTQ | Postings Automata | DV MTQ | DV Automata | |---|---|---|---|---| | All Country Code Filter Terms | 507.47 | 506.98 | 82.43 | 104.75 | | Task | Postings MTQ | Postings Automata | DV MTQ | DV Automata | |---|---|---|---|---| | Medium Cardinality + High Cost Country Code Filter Terms | 328.24 | 328.16 | 167.88 | 167.87 | | Task | Postings MTQ | Postings Automata | DV MTQ | DV Automata | |---|---|---|---|---| | Low Cardinality + High Cost Country Code Filter Terms | 70.85 | 70.93 | 169.59 | 169.71 | | Task | Postings MTQ | Postings Automata | DV MTQ | DV Automata | |---|---|---|---|---| | Medium Cardinality + Low Cost Country Code Filter Terms | 0.31 | 0.27 | 73.97 | 73.99 | | Task | Postings MTQ | Postings Automata | DV MTQ | DV Automata | |---|---|---|---|---| | Low Cardinality + Low Cost Country Code Filter Terms | 0.42 | 0.41 | 73.85 | 73.99 | | Task | Postings MTQ | Postings Automata | DV MTQ | DV Automata | |---|---|---|---|---| | High Cardinality PK Filter Terms | 6.34 | 8.57 | 127.34 | 129.63 | | Task | Postings MTQ | Postings Automata | DV MTQ | DV Automata | |---|---|---|---|---| | Medium Cardinality PK Filter Terms | 0.46 | 0.64 | 110.12 | 110.51 | | Task | Postings MTQ | Postings Automata | DV MTQ | DV Automata | |---|---|---|---|---| | Low Cardinality PK Filter Terms | 0.19 | 0.31 | 66.41 | 66.51 | -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mai
[GitHub] [lucene] rmuir commented on pull request #12312: [DRAFT] GH#12176: TermInSetQuery extends AutomatonQuery
rmuir commented on PR #12312: URL: https://github.com/apache/lucene/pull/12312#issuecomment-1555172675 hmm, disappointing. Was hoping to see gains on the terms dictionary since it optimizes `intersect`. wonder what is going on. Of course docvalues impl doesn't optimize `intersect` in any way, so it isn't surprising that we don't see gains there, but it is something we could improve later. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #12311: Integrate the Incubating Panama Vector API
uschindler commented on PR #12311: URL: https://github.com/apache/lucene/pull/12311#issuecomment-1555188725 > > In addition at some point we should rename the files, but thats not urgent because naming is not so important. We should then also rename the extraction gradle script, as it will be used not only for panama-foreign. But this is cosmetic only. > > Agreed. I had a similar thought when starting out on this, but avoided it then because of the "noise" - but it's certainly worth doing now. I'll leave this to you @uschindler, unless I hear otherwise. Thanks. Yes. Let's wait with doing this. It's just cosmetic and would conflict with the other open PR. Let me first remove useless classes from jar file. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #12312: [DRAFT] GH#12176: TermInSetQuery extends AutomatonQuery
gsmiller commented on PR #12312: URL: https://github.com/apache/lucene/pull/12312#issuecomment-1555244204 OK, here's a method profiler diff for the "High Cardinality PK" task, comparing two postings approaches—one that's using the current MultiTermQuery version, and one using AutomatonQuery ("LegacyTermInSetQuery" is our current version extending MultiTermQuery). The green colored frames show where the AutomatonQuery is spending less time compared to the MultiTermQuery version, and red is the opposite. Eyeballing this, it seems to confirm that it's a little more expensive to build the automaton than to do the prefix-coding, but a little less expensive to do the term intersection with the automaton approach (down in the codec's optimized intersect). The net/net though is that the AutomatonQuery approach in this case is slower by ~35%. https://github.com/apache/lucene/assets/16479560/c251415c-877a-4c5e-b8a1-ecb5c8546282";> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #12312: [DRAFT] GH#12176: TermInSetQuery extends AutomatonQuery
gsmiller commented on PR #12312: URL: https://github.com/apache/lucene/pull/12312#issuecomment-1555337627 Hmm... not sure if I've got something setup incorrectly with my JFR settings, but trying to dig into the other tasks, I can't even get the relevant methods to show up in the profiled calls. I attached a debugger and made sure the call flow is what I expect, but I'm not seeing any of the term intersection logic in what's getting profiled. My theory is that the term intersection is so trivial relative to the actually work of computing the postings disjunction that it's just not showing up in the samples, but maybe there's a setting I'm missing that's not sampling in a fine-enough grain? Not sure. But as far as I can tell, with the non-PK tasks, it looks like the term intersection may just be such a trivial part of the query cost that it doesn't matter what approach we take. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on pull request #12255: allocate one NeighborQueue per search for results
msokolov commented on PR #12255: URL: https://github.com/apache/lucene/pull/12255#issuecomment-1555366279 Thanks everyone for testing and fixing. I had reverted this yesterday and I believe what we have on main now has recovered the performance we had before. I also ran luceneutil a few times and wasn't able to observe any change from clear() vs creating new NeighborQueue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jbellis commented on pull request #12255: allocate one NeighborQueue per search for results
jbellis commented on PR #12255: URL: https://github.com/apache/lucene/pull/12255#issuecomment-1555369584 The performance impact to building is more meaningful because that is where you are allocating large queues for multiple levels -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jainankitk opened a new issue, #12317: Option for disabling term dictionary compression
jainankitk opened a new issue, #12317: URL: https://github.com/apache/lucene/issues/12317 ### Description While working on a customer issue, I noticed that memory allocations for recently added [term dictionary compression](https://github.com/apache/lucene-solr/commit/33a7af9cbfb9f668b4aee433906ee93d55e1e709) is significant. After disabling the compression using patch, I was able to notice some reduction in the memory allocation. Generally the cost of storage is significantly lower than memory/CPU, but can be useful once the segment/index is being archived. But during live data ingestion when segments merge frequently, the cost of compression/decompression is paid more than once. Wondering couple of things here: * Should we expose an option to disable term dictionary compression? * Does it make sense to initialize the HighCompressionHashTable lazily in TermsWriter? As some code paths (non-compression) don't end up using this. For context, the customer workload is running on instance having 32G memory with 16G allocated for heap. Attaching the memory allocation profile below:  -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mikemccand commented on pull request #12310: #12276: rename DaciukMihovAutomatonBuilder to StringsToAutomaton
mikemccand commented on PR #12310: URL: https://github.com/apache/lucene/pull/12310#issuecomment-1555414840 > LGTM. Apologies for the merge conflict I created for you (but thanks for the review on that PR!). No worries! I'll resolve and merge soon! Thanks for the review @gsmiller. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mikemccand commented on pull request #12310: #12276: rename DaciukMihovAutomatonBuilder to StringsToAutomaton
mikemccand commented on PR #12310: URL: https://github.com/apache/lucene/pull/12310#issuecomment-1555415525 > Yeah we should explore a binary version. Even if it doesn't speedup TermInSetQuery. +1 > Pretty sure I added a comment along the lines of "we should not do this" Oh yeah! I think you are referring to this super scary code! ``` @Override public void visit(QueryVisitor visitor) { if (visitor.acceptField(field) == false) { return; } if (termData.size() == 1) { visitor.consumeTerms(this, new Term(field, termData.iterator().next())); } if (termData.size() > 1) { visitor.consumeTermsMatching(this, field, this::asByteRunAutomaton); } } // TODO: this is extremely slow. we should not be doing this. private ByteRunAutomaton asByteRunAutomaton() { TermIterator iterator = termData.iterator(); List automata = new ArrayList<>(); for (BytesRef term = iterator.next(); term != null; term = iterator.next()) { automata.add(Automata.makeBinary(term)); } Automaton automaton = Operations.determinize( Operations.union(automata), Operations.DEFAULT_DETERMINIZE_WORK_LIMIT); return new CompiledAutomaton(automaton).runAutomaton; } ``` We should indeed make a `BytesRefsToAutomaton`! I'll open a spinoff. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mikemccand commented on pull request #12310: #12276: rename DaciukMihovAutomatonBuilder to StringsToAutomaton
mikemccand commented on PR #12310: URL: https://github.com/apache/lucene/pull/12310#issuecomment-1555416664 Actually, the terms must be sorted in Unicode code point order, and, we do have the builder for `BytesRef` already: `public static Automaton build(Collection input) {`. So I think we should just fix `TermInSetQuery`'s scary code to use this binary builder? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] hydrogen666 commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal
hydrogen666 commented on PR #816: URL: https://github.com/apache/lucene/pull/816#issuecomment-1555456315 Any progress of this PR? @uschindler -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org