[GitHub] [lucene] matthias-mueller commented on a diff in pull request #12299: GITHUB-12291: Skip blank lines from stopwords list.

2023-05-17 Thread via GitHub
matthias-mueller commented on code in PR #12299: URL: https://github.com/apache/lucene/pull/12299#discussion_r1196179445 ## lucene/core/src/java/org/apache/lucene/analysis/WordlistLoader.java: ## @@ -53,7 +53,10 @@ public static CharArraySet getWordSet(Reader reader, CharArrayS

[GitHub] [lucene] rmuir commented on issue #12302: vector API integration, plan B

2023-05-17 Thread via GitHub
rmuir commented on issue #12302: URL: https://github.com/apache/lucene/issues/12302#issuecomment-1551026488 > Of course we can also include the implementation classes into the main JAR file. To be clear this is what I propose, not making things pluggable. That's not related and not w

[GitHub] [lucene] uschindler commented on a diff in pull request #12299: GITHUB-12291: Skip blank lines from stopwords list.

2023-05-17 Thread via GitHub
uschindler commented on code in PR #12299: URL: https://github.com/apache/lucene/pull/12299#discussion_r1196362845 ## lucene/core/src/java/org/apache/lucene/analysis/WordlistLoader.java: ## @@ -53,7 +53,10 @@ public static CharArraySet getWordSet(Reader reader, CharArraySet res

[GitHub] [lucene] matthias-mueller commented on a diff in pull request #12299: GITHUB-12291: Skip blank lines from stopwords list.

2023-05-17 Thread via GitHub
matthias-mueller commented on code in PR #12299: URL: https://github.com/apache/lucene/pull/12299#discussion_r1196380870 ## lucene/core/src/java/org/apache/lucene/analysis/WordlistLoader.java: ## @@ -53,7 +53,10 @@ public static CharArraySet getWordSet(Reader reader, CharArrayS

[GitHub] [lucene] msokolov commented on issue #12302: vector API integration, plan B

2023-05-17 Thread via GitHub
msokolov commented on issue #12302: URL: https://github.com/apache/lucene/issues/12302#issuecomment-1551254446 +1 to this - I would be grateful to you all if you are able to get this working. Don't really want to have to die with a sword in my hand to get to the promised land of vector API.

[GitHub] [lucene] benwtrent commented on pull request #12255: allocate one NeighborQueue per search for results

2023-05-17 Thread via GitHub
benwtrent commented on PR #12255: URL: https://github.com/apache/lucene/pull/12255#issuecomment-1551287769 I don't see how it is that commit as the change was only detected after this PR was merged and has persisted since then. Though I may be reading the graph incorrectly. We should

[GitHub] [lucene] msokolov commented on pull request #12255: allocate one NeighborQueue per search for results

2023-05-17 Thread via GitHub
msokolov commented on PR #12255: URL: https://github.com/apache/lucene/pull/12255#issuecomment-1551293959 Here is the timeline graph with one sample per day, roughly https://home.apache.org/~mikemccand/lucenebench/VectorSearch.html -- This is an automated message from the Apache Git Servi

[GitHub] [lucene] uschindler commented on pull request #12296: Seal `IndexReader` and `IndexReaderContext`

2023-05-17 Thread via GitHub
uschindler commented on PR #12296: URL: https://github.com/apache/lucene/pull/12296#issuecomment-1551296276 Javadocs look fine: https://ci-builds.apache.org/job/Lucene/job/Lucene-Artifacts-main/javadoc/core/org/apache/lucene/index/IndexReader.html ![image](https://github.com/apache/l

[GitHub] [lucene] rmuir commented on issue #12302: vector API integration, plan B

2023-05-17 Thread via GitHub
rmuir commented on issue #12302: URL: https://github.com/apache/lucene/issues/12302#issuecomment-1551298163 I am at a conference this week but i will try to make a prototype, building upon work uwe has done. for example using the apijar technique to prevent having to juggle N jvms on every

[GitHub] [lucene] uschindler commented on issue #12302: vector API integration, plan B

2023-05-17 Thread via GitHub
uschindler commented on issue #12302: URL: https://github.com/apache/lucene/issues/12302#issuecomment-1551363152 > I realize if we do this, it will make the gradle uglier and less elegant. I think theres no way around it, we are re-implementing mr-jar basically. we just have to try to conta

[GitHub] [lucene] fsparv commented on a diff in pull request #12301: Improve error message if codec not found Fixes #12300

2023-05-17 Thread via GitHub
fsparv commented on code in PR #12301: URL: https://github.com/apache/lucene/pull/12301#discussion_r1196537557 ## lucene/core/src/java/org/apache/lucene/util/NamedSPILoader.java: ## @@ -110,7 +112,14 @@ public S lookup(String name) { + "' does not exist."

[GitHub] [lucene] ChrisHegarty commented on issue #12302: vector API integration, plan B

2023-05-17 Thread via GitHub
ChrisHegarty commented on issue #12302: URL: https://github.com/apache/lucene/issues/12302#issuecomment-1551468645 I agree with the high-level idea - use the Incubating Vector API internally in Lucene, in a way that is opt-in and also constrained by module/SPI/loading techniques (so as to r

[GitHub] [lucene] ChrisHegarty commented on issue #12302: vector API integration, plan B

2023-05-17 Thread via GitHub
ChrisHegarty commented on issue #12302: URL: https://github.com/apache/lucene/issues/12302#issuecomment-1551473467 Correction. The incubating Vector API must be loaded as part of the boot layer - it must be added by the command line --add-modules flag **OR _required_ by the implemen

[GitHub] [lucene] benwtrent commented on pull request #12255: allocate one NeighborQueue per search for results

2023-05-17 Thread via GitHub
benwtrent commented on PR #12255: URL: https://github.com/apache/lucene/pull/12255#issuecomment-1551495918 > Also, is it actually this commit, or is it the HashMap commit https://github.com/apache/lucene/commit/3c163745bb07aed51b52878de0666f1405696147 ? @jbellis that change seems to

[GitHub] [lucene] jbellis opened a new pull request, #12303: Address HNSW Searcher performance regression

2023-05-17 Thread via GitHub
jbellis opened a new pull request, #12303: URL: https://github.com/apache/lucene/pull/12303 Referencing the regression discussed in #12255 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the spec

[GitHub] [lucene] jbellis commented on pull request #12255: allocate one NeighborQueue per search for results

2023-05-17 Thread via GitHub
jbellis commented on PR #12255: URL: https://github.com/apache/lucene/pull/12255#issuecomment-1551523734 ![image](https://github.com/apache/lucene/assets/42158/102f1123-d205-4c93-8353-ec92a30243ea) ^ This is almost the entire diff, there is also the same change for byte[] and then `p

[GitHub] [lucene] nknize commented on issue #12302: vector API integration, plan B

2023-05-17 Thread via GitHub
nknize commented on issue #12302: URL: https://github.com/apache/lucene/issues/12302#issuecomment-1551636108 > 2\. Java is dying and becoming the next COBOL 😆 This is timely and > let's carve out a path where we use the vector api IFF the user opts in via the command-line.

[GitHub] [lucene] uschindler commented on issue #12302: vector API integration, plan B

2023-05-17 Thread via GitHub
uschindler commented on issue #12302: URL: https://github.com/apache/lucene/issues/12302#issuecomment-1551695383 > Correction. > > The incubating Vector API must be loaded as part of the boot layer - it must be added by the command line --add-modules flag **OR explicitly _required_ b

[GitHub] [lucene] uschindler commented on issue #12302: vector API integration, plan B

2023-05-17 Thread via GitHub
uschindler commented on issue #12302: URL: https://github.com/apache/lucene/issues/12302#issuecomment-1551698369 > > 2. Java is dying and becoming the next COBOL > > 😆 > > > let's carve out a path where we use the vector api IFF the user opts in via the command-line. > >

[GitHub] [lucene] JarvisCraft commented on pull request #12295: Use `instanceof` pattern-matching where possible

2023-05-17 Thread via GitHub
JarvisCraft commented on PR #12295: URL: https://github.com/apache/lucene/pull/12295#issuecomment-1551711212 > So I tend to change the equals methods with instanceof to follow the more modern approach consistently. In this case, I may roll back changes to `equals` methods and create a

[GitHub] [lucene] markrmiller commented on issue #11507: Increase the number of dims for KNN vectors to 2048 [LUCENE-10471]

2023-05-17 Thread via GitHub
markrmiller commented on issue #11507: URL: https://github.com/apache/lucene/issues/11507#issuecomment-1551779674 While this is not a critique on Lucene's attempt to utilize SIMD via OpenJDK, or any proposed ideas here, it's challenging to envision Lucene emerging as the leading solution fo

[GitHub] [lucene] mikemccand commented on pull request #12255: allocate one NeighborQueue per search for results

2023-05-17 Thread via GitHub
mikemccand commented on PR #12255: URL: https://github.com/apache/lucene/pull/12255#issuecomment-1551807578 Hmm, curiously, clicking through the datapoint with the ~20% QPS drop, [these were the changes to Lucene](https://github.com/apache/lucene/compare/963ed7ce888724c2dd55fff8c13a08b81b20

[GitHub] [lucene] benwtrent commented on issue #12219: Making vector similarity functions pluggable

2023-05-17 Thread via GitHub
benwtrent commented on issue #12219: URL: https://github.com/apache/lucene/issues/12219#issuecomment-1551846428 @rmuir I was not suggesting it as a way to only get performance for "some big company". I just thought using an incubating API was out of the question in Lucene (you have implied

[GitHub] [lucene] reta opened a new issue, #12304: VirtualMethod does unprivileged reflection access

2023-05-17 Thread via GitHub
reta opened a new issue, #12304: URL: https://github.com/apache/lucene/issues/12304 ### Description In OpenSearch, we've updated to the recent Apache Lucene 9.7 snapshots and got an number of tests failing with `java.security.AccessControlException`. The culprit is `org.apache.lucene

[GitHub] [lucene] benwtrent commented on pull request #12255: allocate one NeighborQueue per search for results

2023-05-17 Thread via GitHub
benwtrent commented on PR #12255: URL: https://github.com/apache/lucene/pull/12255#issuecomment-1551891133 So, using lucene util, I have been comparing this PR's commit vs. the one previous. I am consistently getting lower QPS with this PR's change but not near the 20% slow down (only aroun

[GitHub] [lucene] benwtrent commented on pull request #12255: allocate one NeighborQueue per search for results

2023-05-17 Thread via GitHub
benwtrent commented on PR #12255: URL: https://github.com/apache/lucene/pull/12255#issuecomment-1551959695 Well, with just a single thread, Luca's commit changed nothing to my local tests. I do seem to have a single threaded slow down (though minor) due to this change. @mikem

[GitHub] [lucene] uschindler commented on issue #12304: VirtualMethod does unprivileged reflection access

2023-05-17 Thread via GitHub
uschindler commented on issue #12304: URL: https://github.com/apache/lucene/issues/12304#issuecomment-1551968070 Hi, VirtualMethod is stone-aged class which was recently introduced again to handle Yes it may miss doPrivileged(). In fact the missing permission is granted by Opensearc

[GitHub] [lucene] reta commented on issue #12304: VirtualMethod does unprivileged reflection access

2023-05-17 Thread via GitHub
reta commented on issue #12304: URL: https://github.com/apache/lucene/issues/12304#issuecomment-1551969539 >This is new in 9.7, the recently released 9.6 does not use this. If I provide a PR for 9.x can you test my "better fix"? I am not 100% sure it can be fixed without permissions, but gi

[GitHub] [lucene] jbellis commented on pull request #12255: allocate one NeighborQueue per search for results

2023-05-17 Thread via GitHub
jbellis commented on PR #12255: URL: https://github.com/apache/lucene/pull/12255#issuecomment-1551970248 Can you check the performance with the changes at https://github.com/apache/lucene/pull/12303 ? -- This is an automated message from the Apache Git Service. To respond to the message,

[GitHub] [lucene] uschindler commented on issue #12304: VirtualMethod does unprivileged reflection access

2023-05-17 Thread via GitHub
uschindler commented on issue #12304: URL: https://github.com/apache/lucene/issues/12304#issuecomment-1552013584 Thanks for finding the issue so early (before a release). Backport #12197 uses VirtualMethod class for the first time since Very long time. It was heavily used at Lucene 3 an

[GitHub] [lucene] uschindler commented on pull request #12299: GITHUB-12291: Skip blank lines from stopwords list.

2023-05-17 Thread via GitHub
uschindler commented on PR #12299: URL: https://github.com/apache/lucene/pull/12299#issuecomment-1552064934 Looks fine. I will merge this, but please add a CHANGES.txt entry in the 9.7 section. Thanks for taking care of the issue! 👍 -- This is an automated message from the Apache G

[GitHub] [lucene] msokolov commented on pull request #12255: allocate one NeighborQueue per search for results

2023-05-17 Thread via GitHub
msokolov commented on PR #12255: URL: https://github.com/apache/lucene/pull/12255#issuecomment-1552065169 I see that `constants.SEARCH_NUM_THREADS=2` and this is what is passed to Competitor() by default as `numThreads`, and is then passed to `perf.SearchPerfTest` as `-searchThreadCount`,

[GitHub] [lucene] msokolov commented on pull request #12255: allocate one NeighborQueue per search for results

2023-05-17 Thread via GitHub
msokolov commented on PR #12255: URL: https://github.com/apache/lucene/pull/12255#issuecomment-1552073475 One thing that';s confusing here is that luceneutil reports "commits since last successful run" relative to a 5/10 run, but there is no graph point showing that run. The last successful

[GitHub] [lucene] gsmiller opened a new pull request, #12305: Minor cleanup and improvements to DaciukMihovAutomatonBuilder

2023-05-17 Thread via GitHub
gsmiller opened a new pull request, #12305: URL: https://github.com/apache/lucene/pull/12305 ### Description This proposes some minor cleanup and improvements to `DaciukMihovAutomatonBuilder` I came across while starting to explore #12176 (exploring the idea of allowing this algorith

[GitHub] [lucene] dsmiley opened a new pull request, #12306: Make MAX_DIMENSIONS configurable via a system property.

2023-05-17 Thread via GitHub
dsmiley opened a new pull request, #12306: URL: https://github.com/apache/lucene/pull/12306 The principle objective of this PR is to make it easier for users to use Lucene with higher dimensions. While it's [technically possible to circumvent the existing limit](https://lists.apache.org/t

[GitHub] [lucene] msokolov commented on pull request #12255: allocate one NeighborQueue per search for results

2023-05-17 Thread via GitHub
msokolov commented on PR #12255: URL: https://github.com/apache/lucene/pull/12255#issuecomment-1552164269 I tried running luceneutil before/after this change using this command: ``` comp = competition.Competition() index = comp.newIndex('baseline', sourceData,

[GitHub] [lucene] msokolov commented on pull request #12255: allocate one NeighborQueue per search for results

2023-05-17 Thread via GitHub
msokolov commented on PR #12255: URL: https://github.com/apache/lucene/pull/12255#issuecomment-1552173316 Re-running with comp = competition.Competition(verifyScores=False) lets the benchmark complete without errors and I get this result: ```

[GitHub] [lucene] dsmiley commented on pull request #12306: Make MAX_DIMENSIONS configurable via a system property.

2023-05-17 Thread via GitHub
dsmiley commented on PR #12306: URL: https://github.com/apache/lucene/pull/12306#issuecomment-1552175880 I get your point on supportability, and I appreciate that you and others take that so seriously! Then let's have documentation around this setting to make it explicitly clear that it is

[GitHub] [lucene] rmuir commented on pull request #12306: Make MAX_DIMENSIONS configurable via a system property.

2023-05-17 Thread via GitHub
rmuir commented on PR #12306: URL: https://github.com/apache/lucene/pull/12306#issuecomment-1552179361 Sorry, I really don't believe we should do this in the default codec. Again you can just make an alternative non default codec in lucene/codecs with a different limit than the default code

[GitHub] [lucene] dsmiley commented on pull request #12306: Make MAX_DIMENSIONS configurable via a system property.

2023-05-17 Thread via GitHub
dsmiley commented on PR #12306: URL: https://github.com/apache/lucene/pull/12306#issuecomment-1552179995 I could also imagine some users hosting search for others might want to configure the limit lower! -- This is an automated message from the Apache Git Service. To respond to the messag

[GitHub] [lucene] rmuir commented on pull request #12306: Make MAX_DIMENSIONS configurable via a system property.

2023-05-17 Thread via GitHub
rmuir commented on PR #12306: URL: https://github.com/apache/lucene/pull/12306#issuecomment-1552183640 The default codec with the index back compact guarantees that we have is just no place for leniency or "you are on your own". You should trust the index is supported. The lucene/co

[GitHub] [lucene] dsmiley commented on pull request #12306: Make MAX_DIMENSIONS configurable via a system property.

2023-05-17 Thread via GitHub
dsmiley commented on PR #12306: URL: https://github.com/apache/lucene/pull/12306#issuecomment-1552202819 BTW thanks for your prompt and friendly responses. I have two lines of inquiry here: (A) I don't accept the rule/belief that you have in which our default codecs shall not h

[GitHub] [lucene] rmuir commented on pull request #12306: Make MAX_DIMENSIONS configurable via a system property.

2023-05-17 Thread via GitHub
rmuir commented on PR #12306: URL: https://github.com/apache/lucene/pull/12306#issuecomment-1552206013 I gave you the technical reasons I am against the system property. I don't want to argue on and on because you don't agree with it, we don't have to agree. -- This is an automated messag

[GitHub] [lucene] contrebande-labs commented on issue #11507: Increase the number of dims for KNN vectors to 2048 [LUCENE-10471]

2023-05-17 Thread via GitHub
contrebande-labs commented on issue #11507: URL: https://github.com/apache/lucene/issues/11507#issuecomment-1552208982 > [...] it's challenging to envision Lucene emerging as the leading solution for large-scale vector similarity search [...] For anyone out there who knows IR's SOTA

[GitHub] [lucene] contrebande-labs opened a new issue, #12307: Multiple ClassNotFoundExceptions in IntelliJ Fat Jar on ARM64 Java 20

2023-05-17 Thread via GitHub
contrebande-labs opened a new issue, #12307: URL: https://github.com/apache/lucene/issues/12307 ### Description We use IntelliJ to build a fat jar to run on ARM64 cloud VMs with Java 20 on Oracle Linux 8.5 to build Lucene BM25+HNSW indices. Our first run gave us this error mess

[GitHub] [lucene] gsmiller commented on issue #4350: CharsRef has confusing methods/needs tests/bugs [LUCENE-3277]

2023-05-17 Thread via GitHub
gsmiller commented on issue #4350: URL: https://github.com/apache/lucene/issues/4350#issuecomment-1552229102 I came across a reference to this issue while working on #12305. It looks like the `CharsRef` family is pretty consistent with `BytesRef` now (including the builders). I don't see an

[GitHub] [lucene] contrebande-labs commented on issue #12302: vector API integration, plan B

2023-05-17 Thread via GitHub
contrebande-labs commented on issue #12302: URL: https://github.com/apache/lucene/issues/12302#issuecomment-1552245148 Is there room for a helping hand here? Tests, benchmarks, continuous integration, documentation, anything ? -- This is an automated message from the Apache Git Service. T

[GitHub] [lucene] dsmiley commented on pull request #12306: Make MAX_DIMENSIONS configurable via a system property.

2023-05-17 Thread via GitHub
dsmiley commented on PR #12306: URL: https://github.com/apache/lucene/pull/12306#issuecomment-1552322741 The only argument you've given against a system property specifically is: > system property is not used appropriately (e.g. proper security checks). If there is a security manager, and

[GitHub] [lucene] JerryChin commented on pull request #12299: GITHUB-12291: Skip blank lines from stopwords list.

2023-05-17 Thread via GitHub
JerryChin commented on PR #12299: URL: https://github.com/apache/lucene/pull/12299#issuecomment-1552327703 Hi @uschindler, which category should I put it under? how about `Improvements`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on

[GitHub] [lucene] michaelwechner commented on pull request #12306: Make MAX_DIMENSIONS configurable via a system property.

2023-05-17 Thread via GitHub
michaelwechner commented on PR #12306: URL: https://github.com/apache/lucene/pull/12306#issuecomment-1552471997 Just to make sure I understand the -1 correctly: The concern is that if a user can set the dimension by himself, because the Lucene version works well with this dimension at that