[GitHub] [lucene] mocobeta commented on pull request #920: LUCENE-10589: increase upper bound of test range query to the maximum value + 1
mocobeta commented on PR #920: URL: https://github.com/apache/lucene/pull/920#issuecomment-1136964941 I'll merge this, thanks for your quick response! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta merged pull request #920: LUCENE-10589: increase upper bound of test range query to the maximum value + 1
mocobeta merged PR #920: URL: https://github.com/apache/lucene/pull/920
[jira] [Commented] (LUCENE-10589) Fix corner case in TestKnnVectorQuery.testRandomWithFilter
[ https://issues.apache.org/jira/browse/LUCENE-10589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541898#comment-17541898 ]

ASF subversion and git services commented on LUCENE-10589:
----------------------------------------------------------

Commit 2620b5669f9a3ccb90439309723314295a850b29 in lucene's branch refs/heads/main from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=2620b5669f9 ]

LUCENE-10589: increase upper bound of test range query (#920)

> Fix corner case in TestKnnVectorQuery.testRandomWithFilter
> ----------------------------------------------------------
>
> Key: LUCENE-10589
> URL: https://issues.apache.org/jira/browse/LUCENE-10589
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Tomoko Uchida
> Priority: Minor
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> {{TestKnnVectorQuery.testRandomWithFilter}} can fail with java.lang.UnsupportedOperationException.
> Reproducible command
> {code:java}
> ./gradlew test --tests TestKnnVectorQuery.testRandomWithFilter -Dtests.seed=1DA39B92702DAC45 -Dtests.multiplier=3
> {code}
> {code:java}
> org.apache.lucene.search.TestKnnVectorQuery > testRandomWithFilter FAILED
>     java.lang.UnsupportedOperationException: exact search is not supported
>         at __randomizedtesting.SeedInfo.seed([1DA39B92702DAC45:6BEAC2197AD96AE0]:0)
>         at org.apache.lucene.search.TestKnnVectorQuery$ThrowingKnnVectorQuery.exactSearch(TestKnnVectorQuery.java:715)
>         at org.apache.lucene.search.KnnVectorQuery.searchLeaf(KnnVectorQuery.java:151)
>         at org.apache.lucene.search.KnnVectorQuery.rewrite(KnnVectorQuery.java:108)
>         at org.apache.lucene.search.ConstantScoreQuery.rewrite(ConstantScoreQuery.java:44)
>         at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:789)
>         at org.apache.lucene.tests.search.AssertingIndexSearcher.rewrite(AssertingIndexSearcher.java:69)
>         at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:803)
>         at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:685)
>         at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:667)
>         at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:584)
>         at org.apache.lucene.search.TestKnnVectorQuery.testRandomWithFilter(TestKnnVectorQuery.java:556)
> {code}
> In some edge cases (depending on the random seed), [KnnVectorQuery.java#147|https://github.com/apache/lucene/blob/fe9d26178d033f585c08a5e86708063ac0ec0c9e/lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java#L147] becomes false, and then `exactSearch()` is called.
> The upper bound of [the test range query (filter)|https://github.com/apache/lucene/blob/fe9d26178d033f585c08a5e86708063ac0ec0c9e/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L554] could be 200 (the max value of "tag" field + 1) instead of lower + 150 to make it "unrestrictive"?

--
This message was sent by Atlassian Jira (v8.20.7#820007)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10589) Fix corner case in TestKnnVectorQuery.testRandomWithFilter
[ https://issues.apache.org/jira/browse/LUCENE-10589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541902#comment-17541902 ] ASF subversion and git services commented on LUCENE-10589: -- Commit 9188b7f4c49f6a3c6e9a2580916230e56c4a41d1 in lucene's branch refs/heads/branch_9x from Tomoko Uchida [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=9188b7f4c49 ] LUCENE-10589: increase upper bound of test range query (#920)
[jira] [Resolved] (LUCENE-10589) Fix corner case in TestKnnVectorQuery.testRandomWithFilter
[ https://issues.apache.org/jira/browse/LUCENE-10589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida resolved LUCENE-10589. Fix Version/s: 10.0 (main), 9.3 Resolution: Fixed
[GitHub] [lucene] mocobeta commented on a diff in pull request #923: Replace classpath with modulepath in the demo tutorial
mocobeta commented on code in PR #923: URL: https://github.com/apache/lucene/pull/923#discussion_r881413568 ## lucene/demo/src/java/overview.html: ## @@ -49,36 +49,35 @@ About the Demo demonstrates various functionalities of Lucene and how you can add Lucene to your applications. - -Setting your CLASSPATH + +Setting your MODULEPATH First, you should http://www.apache.org/dyn/closer.cgi/lucene/java/";>download the latest Lucene distribution and then extract it to a working directory. -You need four JARs: the Lucene JAR, the queryparser JAR, the common analysis JAR, and the Lucene -demo JAR. You should see the Lucene JAR file in the modules/ directory you created +You need Lucene demo and a few dependent modules. +You should see the Lucene module (JAR) files in the modules/ and modules-thirdparty/ directory you created when you extracted the archive -- it should be named something like lucene-core-{version}.jar. You should also see files called lucene-queryparser-{version}.jar, lucene-analysis-common-{version}.jar and lucene-demo-{version}.jar under queryparser, analysis/common/ and demo/, -respectively. -Put all four of these files in your Java CLASSPATH. +"codefrag">lucene-demo-{version}.jar under modules directory. +Put all of these files in your Java MODULEPATH. Indexing Files Once you've gotten this far you're probably itching to go. Let's build an -index! Assuming you've set your CLASSPATH correctly, just type: +index! Just type: -java org.apache.lucene.demo.IndexFiles -docs {path-to-lucene} +java --module-path modules:modules-thirdparty --module org.apache.lucene.demo/org.apache.lucene.demo.IndexFiles -docs {path-to-lucene} This will produce a subdirectory called index which will contain an index of all of the Lucene source code. 
To search the index type: -java org.apache.lucene.demo.SearchFiles +java --module-path modules:modules-thirdparty --add-modules jdk.unsupported --module org.apache.lucene.demo/org.apache.lucene.demo.SearchFiles You'll be prompted for a query. Type in a gibberish or made up word (for example: "supercalifragilisticexpialidocious"). Review Comment: Fixed in https://github.com/apache/lucene/pull/923/commits/d400d7bed703999c050407f9f5b6cf9c0f66b748 - I still can't detect typos in English by just quickly skimming :)
[GitHub] [lucene] mocobeta commented on pull request #923: LUCENE-10200: Replace classpath with modulepath in the demo tutorial
mocobeta commented on PR #923: URL: https://github.com/apache/lucene/pull/923#issuecomment-1137010121 I hooked this onto LUCENE-10200 - actually, the tutorial has been obsoleted by the change in the way the binary distribution is assembled. We could still stick to the classpath, but I feel it would be clearer to switch to the module path to align with the binary release structure (as well as the Luke launch script).
[jira] [Commented] (LUCENE-10200) Restructure and modernize the release artifacts
[ https://issues.apache.org/jira/browse/LUCENE-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541987#comment-17541987 ]

Tomoko Uchida commented on LUCENE-10200:
----------------------------------------

Just wanted to leave a quick note. I happened to notice that the "lucene-demo" tutorial has been outdated by the change in the binary package structure. I opened a PR to correct the instructions there (and also to switch to using the module path instead of the classpath). This is a small correction in overview.html; I will merge it if there are no objections.
https://github.com/apache/lucene/pull/923

> Restructure and modernize the release artifacts
> -----------------------------------------------
>
> Key: LUCENE-10200
> URL: https://issues.apache.org/jira/browse/LUCENE-10200
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Dawid Weiss
> Assignee: Dawid Weiss
> Priority: Major
> Fix For: 9.0
> Time Spent: 40m
> Remaining Estimate: 0h
>
> This is an umbrella issue for various sub-tasks as per my e-mail [1].
> [1] [https://markmail.org/thread/f7yrggnynq2ijgmy]
> In this order, perhaps:
> * (/) Apply small text file changes (LUCENE-10163)
> * (/) Simplify artifacts (LUCENE-10199 drop ZIP binary),
> * (/) LUCENE-10192 drop third party JARs.
> * -Create an additional binary artifact for Luke (LUCENE-9978).-
> * (-) -only include relevant binary license/ notice files-
> * (/) make sure source package can be compiled (no .git folder).
> * (/) Test everything with the smoke tester.
[jira] [Created] (LUCENE-10591) Invalid character in SortableSingleDocSource.java
Andras Salamon created LUCENE-10591:
------------------------------------

Summary: Invalid character in SortableSingleDocSource.java
Key: LUCENE-10591
URL: https://issues.apache.org/jira/browse/LUCENE-10591
Project: Lucene - Core
Issue Type: Bug
Reporter: Andras Salamon

There are invalid UTF-8 characters in SortableSingleDocSource.java "S�o Tom� and Pr�ncipe"
Sonar gave me a warning because of this.
[GitHub] [lucene] asalamon74 opened a new pull request, #925: LUCENE-10591: Fix UTF-8 encoding
asalamon74 opened a new pull request, #925: URL: https://github.com/apache/lucene/pull/925 ### Description (or a Jira issue link if you have one) Fixing invalid UTF-8 characters in SortableSingleDocSource.java
[GitHub] [lucene] msokolov commented on pull request #923: LUCENE-10200: Replace classpath with modulepath in the demo tutorial
msokolov commented on PR #923: URL: https://github.com/apache/lucene/pull/923#issuecomment-1137160783 Thanks for fixing those (pre-existing) typos. On the classpath/module path change I have mixed feelings. On the one hand, we should broadcast that we are now fully modularized and support using module paths to declare dependencies on code. On the other hand, I don't even know how to "put jars on my module path" yet. Maybe I'm just a stick-in-the-mud, but modules still seem very new, and I suspect many (most) Java devs probably haven't figured them out yet and are still using the classpath? So, I'm not sure what that means for this documentation. It doesn't seem as if this is the right place to explain modules and classpaths, but we want to make this accessible and easy to use. Maybe we could include both sets of directions?
[GitHub] [lucene] dweiss commented on pull request #923: LUCENE-10200: Replace classpath with modulepath in the demo tutorial
dweiss commented on PR #923: URL: https://github.com/apache/lucene/pull/923#issuecomment-1137165723 Modules are not new, they're just not widespread... I agree with @mocobeta that if you provide an explicit command line then there is little harm in not explaining all the options. The problem with classpath is that you need to include all JARs individually; this is handled much better by modules since you include the directory path (not each individual module JAR). This said, I don't have a strong opinion about going either way.
[GitHub] [lucene] msokolov commented on a diff in pull request #873: LUCENE-10397: KnnVectorQuery doesn't tie break by doc ID
msokolov commented on code in PR #873:
URL: https://github.com/apache/lucene/pull/873#discussion_r881581848

## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborQueue.java: ##
@@ -90,26 +92,34 @@ public boolean insertWithOverflow(int newNode, float newScore) {
   }

   private long encode(int node, float score) {
-    return order.apply((((long) NumericUtils.floatToSortableInt(score)) << 32) | node);
+    int nodeReverse = reversed ? node : Integer.MAX_VALUE - node;
+    return order.apply((((long) NumericUtils.floatToSortableInt(score)) << 32) | nodeReverse);
   }

   /** Removes the top element and returns its node id. */
   public int pop() {
-    return (int) order.apply(heap.pop());
+    return reversed

Review Comment:
can we move this logic into `Order.apply`? If we do that we can avoid a conditional in this hot spot
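For readers following the `encode` change in the diff above, here is a minimal plain-Java sketch of the packing idea: the score goes into the high 32 bits and a reversed node id into the low 32 bits, so that comparing the resulting longs orders by score first and breaks ties by node id. This is not Lucene's code. The class name, `decodeNode`, and the local `floatToSortableInt` (a stand-in for `NumericUtils.floatToSortableInt`) are assumptions made for illustration, and only the non-reversed branch (`Integer.MAX_VALUE - node`) is shown.

```java
public class NeighborEncodingSketch {

    // Local stand-in (assumption) for Lucene's NumericUtils.floatToSortableInt:
    // maps float bits to an int whose signed ordering matches the float ordering.
    public static int floatToSortableInt(float value) {
        int bits = Float.floatToIntBits(value);
        return bits ^ ((bits >> 31) & 0x7fffffff);
    }

    // Packs the score into the high 32 bits and the reversed node id into the
    // low 32 bits. For equal scores, a smaller node id yields a larger long.
    public static long encode(int node, float score) {
        int nodeReverse = Integer.MAX_VALUE - node; // non-negative for node >= 0
        return (((long) floatToSortableInt(score)) << 32) | nodeReverse;
    }

    // Hypothetical helper: recovers the node id from the low 32 bits.
    public static int decodeNode(long encoded) {
        return Integer.MAX_VALUE - (int) encoded;
    }

    public static void main(String[] args) {
        long a = encode(1, 0.5f);
        long b = encode(2, 0.5f);
        long c = encode(7, 0.6f);
        // Same score: compare which node id ranks higher.
        System.out.println("node 1 ranks above node 2 on a tie: " + (a > b));
        // Different scores: the score dominates the node id.
        System.out.println("higher score ranks above lower score: " + (c > a));
        System.out.println("decoded node: " + decodeNode(c));
    }
}
```

Because the reversal lives inside `encode` and its inverse inside the decode step, the review question above is essentially about where that pair of transformations should sit so that the hot path needs no extra branch.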
[GitHub] [lucene] mocobeta commented on pull request #923: LUCENE-10200: Replace classpath with modulepath in the demo tutorial
mocobeta commented on PR #923: URL: https://github.com/apache/lucene/pull/923#issuecomment-1137171363 Anyway the outdated instruction would need to be corrected. Maybe we can write both working commands for classpath and module path? Let me adjust it...
[GitHub] [lucene] msokolov commented on pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial
msokolov commented on PR #923: URL: https://github.com/apache/lucene/pull/923#issuecomment-1137209317
> The problem with classpath is that you need to include all JARs individually; this is handled much better by modules since you include the directory path (not each individual module JAR).
Oh, that is better! Although I have gotten used to using `*.jar` in classpaths, which helps shrink them down to be more manageable.
[GitHub] [lucene] mocobeta commented on a diff in pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial
mocobeta commented on code in PR #923: URL: https://github.com/apache/lucene/pull/923#discussion_r881678572 ## lucene/demo/src/java/overview.html: ## @@ -49,40 +49,49 @@ About the Demo demonstrates various functionalities of Lucene and how you can add Lucene to your applications. - -Setting your CLASSPATH + +Setting your module path (or classpath) First, you should http://www.apache.org/dyn/closer.cgi/lucene/java/";>download the latest Lucene distribution and then extract it to a working directory. -You need four JARs: the Lucene JAR, the queryparser JAR, the common analysis JAR, and the Lucene -demo JAR. You should see the Lucene JAR file in the modules/ directory you created -when you extracted the archive -- it should be named something like +You need Lucene demo and a few dependent modules. +You should see the Lucene modules (JARs) in the modules/ and third party modules in the modules-thirdparty/ directory +you created when you extracted the archive -- it should be named something like lucene-core-{version}.jar. You should also see files called lucene-queryparser-{version}.jar, lucene-analysis-common-{version}.jar and lucene-demo-{version}.jar under queryparser, analysis/common/ and demo/, -respectively. -Put all four of these files in your Java CLASSPATH. +"codefrag">lucene-demo-{version}.jar under modules directory. +There are two ways to run Java program: with module path or with classpath. Either way is fine. Put all of these files in your Java module path or classpath. Indexing Files Once you've gotten this far you're probably itching to go. Let's build an -index! Assuming you've set your CLASSPATH correctly, just type: +index! 
Just type either command: +With module path -java org.apache.lucene.demo.IndexFiles -docs {path-to-lucene} +java --module-path modules:modules-thirdparty --module org.apache.lucene.demo/org.apache.lucene.demo.IndexFiles -docs {path-to-lucene} + +With classpath + +java -cp "modules/*:modules-thirdparty/*" org.apache.lucene.demo.IndexFiles -docs {path-to-lucene} Review Comment: It's not good practice to put all JARs on the classpath with a wildcard; however, I don't think we can maintain the correct JAR list (this was shown by the latest tutorial: the JAR list there had been outdated long before 9.0, and this is another reason why I'd prefer the module path).
[GitHub] [lucene] mocobeta commented on pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial
mocobeta commented on PR #923: URL: https://github.com/apache/lucene/pull/923#issuecomment-1137271336 I updated the text so that we have both working commands for module path and classpath in it. Please see the updated screenshot in the PR description to see how it looks, thanks.
[jira] [Commented] (LUCENE-10590) Indexing all zero vectors leads to heat death of the universe
[ https://issues.apache.org/jira/browse/LUCENE-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542068#comment-17542068 ]

Michael Sokolov commented on LUCENE-10590:
------------------------------------------

> Does the indexing logic rely on tie breaking by node ID? If not, maybe index-time graph search could stop as soon as the k-th nearest vector is equal to the input vector?

Seems like that could work, although to date we use the same search implementation at index time and search time, which is a nice simplification. Perhaps in such a case we could also sacrifice the docid tiebreaking, given that it is going to be best effort only.

> Indexing all zero vectors leads to heat death of the universe
> -------------------------------------------------------------
>
> Key: LUCENE-10590
> URL: https://issues.apache.org/jira/browse/LUCENE-10590
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Michael Sokolov
> Priority: Major
>
> By accident while testing something else, I ran a luceneutil test indexing 1M 100d vectors where all the vectors were all zeroes. This caused indexing to take a very long time (~40x normal - it did eventually complete) and the search performance was similarly bad. We should not degrade by orders of magnitude with even the worst data though.
> I'm not entirely sure what the issue is, but perhaps as long as we keep finding hits that are "better" we keep exploring the graph, where better means (score, -docid) >= (lowest score, -docid). If that's right and all docs have the same score, then we probably need to either switch to > (but this could lead to poorer recall in normal cases) or introduce some kind of minimum score threshold?
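The hypothesis in the issue description (that a `>=` comparison keeps admitting candidates when every score ties) can be illustrated with a toy collector. This is only a sketch under simplifying assumptions, not Lucene's HNSW search: it compares on score alone and ignores the docid component, and the class and method names are hypothetical. It counts how many candidates a bounded top-k collector admits, which stands in for how many nodes a graph search would go on to expand.

```java
import java.util.PriorityQueue;

public class TieExplorationSketch {

    // Counts the candidates admitted into a bounded top-k collection.
    // strict == true  admits a candidate only if score >  current worst;
    // strict == false admits a candidate if      score >= current worst.
    public static int countAdmitted(float[] scores, int k, boolean strict) {
        // Min-heap: the head is the current worst of the retained top-k.
        PriorityQueue<Float> topK = new PriorityQueue<>();
        int admitted = 0;
        for (float score : scores) {
            boolean better;
            if (topK.size() < k) {
                better = true; // the collection is not full yet
            } else {
                float worst = topK.peek();
                better = strict ? score > worst : score >= worst;
            }
            if (better) {
                admitted++;
                topK.add(score);
                if (topK.size() > k) {
                    topK.poll(); // evict the worst to stay at size k
                }
            }
        }
        return admitted;
    }

    public static void main(String[] args) {
        // All-zero vectors would produce identical scores; model that with
        // 1000 equal scores (a new float[] is filled with 0.0f).
        float[] tiedScores = new float[1000];
        System.out.println(">= admits: " + countAdmitted(tiedScores, 10, false)); // 1000
        System.out.println(">  admits: " + countAdmitted(tiedScores, 10, true));  // 10
    }
}
```

With tied scores the non-strict comparison never stops admitting candidates, while the strict one stops after the first k. The trade-off noted in the issue still applies: a strict comparison may cut exploration short in normal data and reduce recall.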
[jira] [Comment Edited] (LUCENE-10590) Indexing all zero vectors leads to heat death of the universe
[ https://issues.apache.org/jira/browse/LUCENE-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542068#comment-17542068 ] Michael Sokolov edited comment on LUCENE-10590 at 5/25/22 2:09 PM: --- bq. Does the indexing logic rely on tie breaking by node ID? If not, maybe index-time graph search could stop as soon as the k-th nearest vector is equal to the input vector? Seems like that could work although to date we use the same search implementation at index time and search time, which is a nice simplification. Perhaps in such a case we could also sacrifice the docid tiebreaking given that is going to be best effort only
[GitHub] [lucene] msokolov commented on pull request #924: Create Lucene93 Codec and move Lucene92 to backwards_codecs
msokolov commented on PR #924: URL: https://github.com/apache/lucene/pull/924#issuecomment-1137306689 Thanks for the reminder about the unit tests - I will add. As for the approach of using a feature branch, I'm ambivalent. It seems better to me to separate out the "new codec version" commit with all of its boilerplate from the actual changes to be made to the codec, to make it easier to review and understand. Certainly that can be done on a feature branch, but I'm not sure why we need a branch.
[GitHub] [lucene] dweiss commented on a diff in pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial
dweiss commented on code in PR #923: URL: https://github.com/apache/lucene/pull/923#discussion_r881755153

## lucene/demo/src/java/overview.html: ##
@@ -49,40 +49,49 @@ About the Demo
 demonstrates various functionalities of Lucene and how you can add Lucene to your applications.
-
-Setting your CLASSPATH
+
+Setting your module path (or classpath)
 First, you should download (http://www.apache.org/dyn/closer.cgi/lucene/java/) the latest Lucene distribution and then extract it to a working directory.
-You need four JARs: the Lucene JAR, the queryparser JAR, the common analysis JAR, and the Lucene
-demo JAR. You should see the Lucene JAR file in the modules/ directory you created
-when you extracted the archive -- it should be named something like
+You need Lucene demo and a few dependent modules.
+You should see the Lucene modules (JARs) in the modules/ and third party modules in the modules-thirdparty/ directory
+you created when you extracted the archive -- it should be named something like
 lucene-core-{version}.jar. You should also see files called lucene-queryparser-{version}.jar, lucene-analysis-common-{version}.jar and lucene-demo-{version}.jar under queryparser, analysis/common/ and demo/,
-respectively.
-Put all four of these files in your Java CLASSPATH.
+lucene-demo-{version}.jar under modules directory.
+There are two ways to run Java program: with module path or with classpath. Either way is fine. Put all of these files in your Java module path or classpath.
 Indexing Files
 Once you've gotten this far you're probably itching to go. Let's build an
-index! Assuming you've set your CLASSPATH correctly, just type:
+index! Just type either command:
+With module path
-java org.apache.lucene.demo.IndexFiles -docs {path-to-lucene}
+java --module-path modules:modules-thirdparty --module org.apache.lucene.demo/org.apache.lucene.demo.IndexFiles -docs {path-to-lucene}
+
+With classpath
+
+java -cp "modules/*:modules-thirdparty/*" org.apache.lucene.demo.IndexFiles -docs {path-to-lucene}

Review Comment: I don't think it'll work. Java doesn't expand wildcards in arguments. Also, the path delimiter varies between platforms (Windows uses a semicolon)...
[GitHub] [lucene] dweiss commented on a diff in pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial
dweiss commented on code in PR #923: URL: https://github.com/apache/lucene/pull/923#discussion_r881759841

## lucene/demo/src/java/overview.html: ##
+java -cp "modules/*:modules-thirdparty/*" org.apache.lucene.demo.IndexFiles -docs {path-to-lucene}

Review Comment: I double checked and, with some surprise, discovered that it does support a quirky glob format (has to be a single *, not a full glob). Anyway, I wouldn't bet this works across platforms with the colon and slashes in the cp argument...
[GitHub] [lucene] mocobeta commented on a diff in pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial
mocobeta commented on code in PR #923: URL: https://github.com/apache/lucene/pull/923#discussion_r881765516

## lucene/demo/src/java/overview.html: ##
+java -cp "modules/*:modules-thirdparty/*" org.apache.lucene.demo.IndexFiles -docs {path-to-lucene}

Review Comment: It works, I confirmed this command. There is an "Understanding class path wildcards" section in the documentation: https://docs.oracle.com/javase/7/docs/technotes/tools/windows/classpath.html As for the delimiter, I didn't think we should list all commands for both Windows and Linux/Mac, so I omitted Windows here... sorry, but the tutorial has been written for Unix-like platforms from the beginning.
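The two portability points raised in this thread (the `dir/*` class path wildcard and the platform-specific delimiter) can be sketched in a few lines. The class `ClassPathArgSketch` and its `classPathArg` helper are hypothetical names used only for illustration; `java.io.File.separator` and `java.io.File.pathSeparator` are standard JDK constants.

```java
import java.io.File;

/** Hypothetical helper that builds a -cp argument from directories of JARs. */
final class ClassPathArgSketch {

  /**
   * Joins the directories into a class path in which each entry ends with the
   * wildcard '*', which the Java launcher expands to the JAR files in that directory.
   */
  static String classPathArg(String fileSep, String pathSep, String... dirs) {
    StringBuilder sb = new StringBuilder();
    for (String dir : dirs) {
      if (sb.length() > 0) {
        sb.append(pathSep); // ':' on Unix-like systems, ';' on Windows
      }
      sb.append(dir).append(fileSep).append('*');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Uses the separators of the platform the program is running on.
    System.out.println(
        classPathArg(File.separator, File.pathSeparator, "modules", "modules-thirdparty"));
  }
}
```

With `"/"` and `":"` the helper returns `modules/*:modules-thirdparty/*`, the argument used in the tutorial; with `"\\"` and `";"` it returns the Windows form `modules\*;modules-thirdparty\*`. On a Unix-like shell the argument should stay quoted so that the shell does not expand `*` before the launcher sees it.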
[GitHub] [lucene] mocobeta commented on a diff in pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial
mocobeta commented on code in PR #923: URL: https://github.com/apache/lucene/pull/923#discussion_r881797523

## lucene/demo/src/java/overview.html: ##
+java -cp "modules/*:modules-thirdparty/*" org.apache.lucene.demo.IndexFiles -docs {path-to-lucene}

Review Comment: But of course, we can add a section for the Windows platform. I'm not sure how far we should care, but if we want to provide "working" commands that need no previous knowledge on both Unix-like and Windows platforms, should we allow the verboseness?
[GitHub] [lucene] mocobeta commented on pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial
mocobeta commented on PR #923: URL: https://github.com/apache/lucene/pull/923#issuecomment-1137448834 I first thought it would be sufficient to have a concrete working command for a Unix-like platform that is based on the module path, but it seems things are not so obvious. I don't know what I should do here - I'll leave it for now.
[GitHub] [lucene] msokolov commented on pull request #924: Create Lucene93 Codec and move Lucene92 to backwards_codecs
msokolov commented on PR #924: URL: https://github.com/apache/lucene/pull/924#issuecomment-1137491647

I updated this PR so that:
1. the back-compat Lucene92 Codec no longer has the ability to write HNSW vector format
2. there are unit tests that verify we can still read the Lucene92 vector format, and the vectors writer has been moved into the tests to support that.
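The arrangement described in the two points above (a back-compat format that can only read, with the writer kept alive in test code) follows a general pattern that can be sketched without any Lucene classes. Everything below is an assumption made for illustration: `VectorFormat`, `OldVectorFormat` and `OldVectorFormatForTests` are hypothetical stand-ins, and the byte layout is not the Lucene92 vector format.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical stand-ins; not the Lucene92 classes touched by the PR. */
final class ReadOnlyFormatSketch {

  interface VectorFormat {
    byte[] write(float[] vector);

    float[] read(byte[] bytes);
  }

  /** Back-compat format as shipped: it still decodes old data but refuses to encode. */
  static class OldVectorFormat implements VectorFormat {
    @Override
    public byte[] write(float[] vector) {
      throw new UnsupportedOperationException("old format is read-only");
    }

    @Override
    public float[] read(byte[] bytes) {
      ByteBuffer buf = ByteBuffer.wrap(bytes);
      float[] vector = new float[bytes.length / Float.BYTES];
      for (int i = 0; i < vector.length; i++) {
        vector[i] = buf.getFloat();
      }
      return vector;
    }
  }

  /** Test-only subclass that restores the writer so tests can produce old-format data. */
  static class OldVectorFormatForTests extends OldVectorFormat {
    @Override
    public byte[] write(float[] vector) {
      ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES);
      for (float f : vector) {
        buf.putFloat(f);
      }
      return buf.array();
    }
  }

  public static void main(String[] args) {
    float[] original = {0.5f, -1.0f, 2.0f};
    byte[] oldData = new OldVectorFormatForTests().write(original); // only tests can write
    float[] decoded = new OldVectorFormat().read(oldData); // production code can still read
    System.out.println(Arrays.equals(original, decoded));
  }
}
```

Keeping the writer in a test-only subclass means production code cannot create new data in the old format, while round-trip tests can still check that old data remains readable.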
[GitHub] [lucene] dweiss commented on a diff in pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial
dweiss commented on code in PR #923: URL: https://github.com/apache/lucene/pull/923#discussion_r881886986

## lucene/demo/src/java/overview.html: ##
+java -cp "modules/*:modules-thirdparty/*" org.apache.lucene.demo.IndexFiles -docs {path-to-lucene}

Review Comment: Ok, let's leave Windows out of it. People on Windows will know what to do, I think.
[GitHub] [lucene] mocobeta commented on pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial
mocobeta commented on PR #923: URL: https://github.com/apache/lucene/pull/923#issuecomment-1137531284 Just a note... the thing I wanted to fix is that the current tutorial is outdated on many points - I don't think people can run the demo app without trial and error. Classpath vs module path shouldn't be the main interest here, though I think the module-path based explanation would be reasonable from several viewpoints. I may have misled the conversation from the start if we have fallen into bikeshedding.
[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
mdmarshmallow commented on code in PR #841: URL: https://github.com/apache/lucene/pull/841#discussion_r881945764

## lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/LongPointFacetField.java: ##
@@ -0,0 +1,35 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.hyperrectangle;
+
+import org.apache.lucene.document.BinaryDocValuesField;
+import org.apache.lucene.document.LongPoint;
+
+/** Packs an array of longs into a {@link BinaryDocValuesField} */
+public class LongPointFacetField extends BinaryDocValuesField {

Review Comment: I was wondering what your thoughts were on just using separate numeric fields rather than packing them. I think this would make the API "nicer" to be honest, but the big drawback would be that we would need some hacky multivalued implementation. I can think of some ways to build some sort of UnsortedNumericDV on top of SortedNumericDV, but they would all be super hacky and have limitations and probably not worth implementing.
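The "packing" that the Javadoc in the quoted diff refers to can be pictured as follows. This is a minimal sketch, assuming a fixed 8 bytes per value in big-endian order; the class `LongPackingSketch` and its methods are hypothetical, and the byte layout actually used by the PR is not shown in the quoted hunk and may differ.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical illustration of packing a multi-dimensional long point into one binary value. */
final class LongPackingSketch {

  /** Encodes each dimension as 8 big-endian bytes, one after another. */
  static byte[] pack(long... point) {
    ByteBuffer buf = ByteBuffer.allocate(point.length * Long.BYTES); // big-endian by default
    for (long value : point) {
      buf.putLong(value);
    }
    return buf.array();
  }

  /** Decodes a value produced by pack back into its dimensions. */
  static long[] unpack(byte[] packed) {
    if (packed.length % Long.BYTES != 0) {
      throw new IllegalArgumentException("length must be a multiple of " + Long.BYTES);
    }
    ByteBuffer buf = ByteBuffer.wrap(packed);
    long[] point = new long[packed.length / Long.BYTES];
    for (int i = 0; i < point.length; i++) {
      point[i] = buf.getLong();
    }
    return point;
  }

  public static void main(String[] args) {
    byte[] packed = pack(3L, -7L, 42L);
    System.out.println(packed.length + " bytes -> " + Arrays.toString(unpack(packed)));
  }
}
```

Storing all dimensions in one binary value keeps a point's coordinates together per document, which is the property the review comment weighs against using separate numeric fields.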
[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira?
[ https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542180#comment-17542180 ]

Tomoko Uchida commented on LUCENE-10557:

We are having a long discussion thread on the dev list and many issues have been raised. Here is a short summary (with my brief thoughts/opinions).

* Concerns for political neutrality of GitHub - in other words, concerns for account bans with no good reason
** Seems there are several cases (including rumors) of GitHub account bans. It's unclear whether they violated its terms of policy or not, and to me, we won't be able to correctly assess the risk. I would defer the judgment to the individuals.
** For developers who don't use GitHub regardless of the reason, we will always support contribution paths that do not rely on GitHub. Patches via Jira will be a decent option for good.
* Concerns for its parent company, Microsoft
** I'd defer the judgment on that to the individuals for the same reason as for the previous subject. One thing I could say is that the recent trend in their direction is GOOD - they support/sponsor OSS and Java communities and even publish very popular open-source software (VSCode and LightGBM are outstanding examples I think).
* Concerns for lack of issue workflow and simpler metadata management
** From the practical viewpoint, it fully makes sense to me that many people talked about it. We would need to carefully think of how to control versions and issue/PR metadata. Large projects that are fully operated on GitHub overcome this shortcoming in various ways - organized issue templates with fixed label sets would be an example. I think we will have a sandbox repository outside ASF, then try some experiments on it before actual migration.
* Security issues that only PMC members are allowed to access
** We will be able to continue to use Jira for this purpose, or we could even have an issue-only private GitHub repository for Lucene?
* Concerns for migration of whole Jira issue history to GitHub issue
** I don't think it is possible. I'm almost sure there will be some information losses if we attempt to migrate the whole Jira issue with metadata/history into GitHub. Rather than trying to do that, I would prefer to leave Jira issues as is, then simply refer to them.
** If we don't aim at perfection, I think we'll be able to migrate all (or part of) issues with APIs as Shad Storhaug kindly shared in this comment.

Aside from those concerns, there seems to be no disagreement that GitHub is superior to Jira in terms of overall UX design, and most new developers like it.

> Migrate to GitHub issue from Jira?
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
> Issue Type: Sub-task
> Reporter: Tomoko Uchida
> Priority: Major
>
> A few (not the majority) Apache projects already use the GitHub issue instead of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible that we move to GitHub issue. I have little knowledge of how to proceed with it, I'd like to discuss whether we should migrate to it, and if so, how to smoothly handle the migration.
> The major tasks would be:
> * Get a consensus about the migration among committers
> * Enable Github issue on the lucene's repository (currently, it is disabled on it)
> * Build the convention or rules for issue label/milestone management
> * Choose issues that should be moved to GitHub (I think too old or obsolete issues can remain Jira.)
[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
mdmarshmallow commented on code in PR #841: URL: https://github.com/apache/lucene/pull/841#discussion_r881945764 ## lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/LongPointFacetField.java: ## @@ -0,0 +1,35 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.facet.hyperrectangle; + +import org.apache.lucene.document.BinaryDocValuesField; +import org.apache.lucene.document.LongPoint; + +/** Packs an array of longs into a {@link BinaryDocValuesField} */ +public class LongPointFacetField extends BinaryDocValuesField { Review Comment: I was wondering what your thoughts were on just using separate numeric fields rather than packing them. I think this would make the API "nicer" to be honest, but the big drawback would that we would need some hacky multivalued implementation. I can think of some ways to build some sort of UnsortedNumericDV on top of SortedNumericDV, but they would all be super hacky and have limitations and probably not worth implementing. Edit: Upon thinking about this further, my suggestion doesn't make sense when we have multi-valued fields -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
mdmarshmallow commented on code in PR #841: URL: https://github.com/apache/lucene/pull/841#discussion_r881945764 ## lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/LongPointFacetField.java: ## @@ -0,0 +1,35 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.facet.hyperrectangle; + +import org.apache.lucene.document.BinaryDocValuesField; +import org.apache.lucene.document.LongPoint; + +/** Packs an array of longs into a {@link BinaryDocValuesField} */ +public class LongPointFacetField extends BinaryDocValuesField { Review Comment: ~~I was wondering what your thoughts were on just using separate numeric fields rather than packing them. I think this would make the API "nicer" to be honest, but the big drawback would that we would need some hacky multivalued implementation. I can think of some ways to build some sort of UnsortedNumericDV on top of SortedNumericDV, but they would all be super hacky and have limitations and probably not worth implementing.~~ Edit: Upon thinking about this further, my suggestion doesn't make sense when we have multi-valued fields -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
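For readers following this review thread, here is a minimal sketch of what "packing an array of longs" into one binary value can look like. It is not the PR's implementation: the class `PackedLongs` and its methods are hypothetical, and it assumes plain big-endian `ByteBuffer` encoding, which may differ from the encoding `LongPointFacetField` actually uses.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical helper, not part of Lucene: packs a long[] into one byte[] and back. */
final class PackedLongs {

  private PackedLongs() {}

  /** Encodes each long as 8 big-endian bytes, preserving array order. */
  static byte[] pack(long... values) {
    ByteBuffer buffer = ByteBuffer.allocate(values.length * Long.BYTES);
    for (long value : values) {
      buffer.putLong(value);
    }
    return buffer.array();
  }

  /** Decodes a byte[] produced by pack(...) back into the original longs. */
  static long[] unpack(byte[] packed) {
    if (packed.length % Long.BYTES != 0) {
      throw new IllegalArgumentException("length must be a multiple of " + Long.BYTES);
    }
    ByteBuffer buffer = ByteBuffer.wrap(packed);
    long[] values = new long[packed.length / Long.BYTES];
    for (int i = 0; i < values.length; i++) {
      values[i] = buffer.getLong();
    }
    return values;
  }

  public static void main(String[] args) {
    byte[] packed = pack(3L, -7L, 42L);
    System.out.println(packed.length);                   // 24
    System.out.println(Arrays.toString(unpack(packed))); // [3, -7, 42]
  }
}
```

A single packed value keeps all dimensions of one point together and in order, which is presumably why separate numeric fields stop working once a document carries more than one point, as the edit to the comment concedes.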
[jira] [Comment Edited] (LUCENE-10557) Migrate to GitHub issue from Jira?
[ https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542180#comment-17542180 ] Tomoko Uchida edited comment on LUCENE-10557 at 5/25/22 6:31 PM: - We are having a long discuss thread on the dev list and many issues are posed. Here is a short summary (with my brief thoughts/opinions). * Concerns for political neutrality of GitHub - in other words, concerns for account bans with no good reason ** Seems there are several cases (including rumors) of GitHub account bans. It's unclear whether they violate its terms of policy or not, and we won't be able to correctly assess the risk to me. I would defer the judgment to the individuals. ** For developers who don't use GitHub regardless of the reason, we will always support contribution paths that do not rely on GitHub. Patches via Jira will be a decent option for good. * Concerns for its parent company, Microsoft ** I'd defer the judgment on that to the individuals for the same reason for the previous subject. One thing I could say is, that the recent trend in their direction is GOOD - they support/sponsor OSS and Java communities and even publish very popular open-source software (VSCode and LightGBM are outstanding examples I think). * Concerns for lack of issue workflow and simpler metadata management ** From the practical viewpoint, it fully makes sense to me that many people talked about it. We would need to carefully think of how to control versions and issue/PR metadata. Large projects that are fully operated on GitHub overcome this shortcoming in various ways - organized issue templates with fixed label sets would be an example. I think we will have a sandbox repository outside ASF, then try some experiments on it before actual migration. 
* Security issues that only PMC members are allowed to be accessed ** We will be able to continue to use Jira for this purpose, or we could even have an issue-only private GitHub repository for Lucene? * Concerns for migration of whole Jira issue history to GitHub issue ** I don't think it is possible. I'm almost sure there will be some information losses if we attempt to migrate the whole Jira issue with metadata/history into Github. Rather than trying to do that, I would prefer to let Jira issues as is, then simply refer them. ** If we don't aim at perfection, I think we'll be able to migrate all (or part of) issues with APIs as Shad Storhaug kindly shared in [this comment|https://issues.apache.org/jira/browse/LUCENE-10557?focusedCommentId=17535898&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17535898]. Aside from those concerns, there seems no disagreement with GitHub is superior to Jira in terms of overall UX design, and most new developers like it. was (Author: tomoko uchida): We are having a long discuss thread on the dev list and many issues are posed. Here is a short summary (with my brief thoughts/opinions). * Concerns for political neutrality of GitHub - in other words, concerns for account bans with no good reason ** Seems there are several cases (including rumors) of GitHub account bans. It's unclear whether they violate its terms of policy or not, and we won't be able to correctly assess the risk to me. I would defer the judgment to the individuals. ** For developers who don't use GitHub regardless of the reason, we will always support contribution paths that do not rely on GitHub. Patches via Jira will be a decent option for good. * Concerns for its parent company, Microsoft ** I'd defer the judgment on that to the individuals for the same reason for the previous subject. 
One thing I could say is, that the recent trend in their direction is GOOD - they support/sponsor OSS and Java communities and even publish very popular open-source software (VSCode and LightGBM are outstanding examples I think). * Concerns for lack of issue workflow and simpler metadata management ** From the practical viewpoint, it fully makes sense to me that many people talked about it. We would need to carefully think of how to control versions and issue/PR metadata. Large projects that are fully operated on GitHub overcome this shortcoming in various ways - organized issue templates with fixed label sets would be an example. I think we will have a sandbox repository outside ASF, then try some experiments on it before actual migration. * Security issues that only PMC members are allowed to be accessed ** We will be able to continue to use Jira for this purpose, or we could even have an issue-only private GitHub repository for Lucene? * Concerns for migration of whole Jira issue history to GitHub issue ** I don't think it is possible. I'm almost sure there will be some information losses if we attempt to migrate the whole Jira issue with metadata/history into Github. Rather t
[jira] [Comment Edited] (LUCENE-10557) Migrate to GitHub issue from Jira?
[ https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542180#comment-17542180 ]

Tomoko Uchida edited comment on LUCENE-10557 at 5/25/22 6:37 PM:

We are having a long discussion thread on the dev list, and many issues have been raised. Here is a short summary (with my brief thoughts/opinions).

* Concerns about the political neutrality of GitHub - in other words, concerns about account bans with no good reason
** There seem to be several cases (including rumors) of GitHub account bans. It's unclear whether they violated its terms and policies or not, and I don't think we will be able to assess the risk correctly. I would defer the judgment to individuals.
** For developers who don't use GitHub for whatever reason, we will always support contribution paths that do not rely on GitHub. Patches via Jira will remain a decent option for good.
* Concerns about its parent company, Microsoft
** I'd defer the judgment on that to individuals, for the same reason as for the previous subject. One thing I can say is that the recent trend in their direction is GOOD - they support/sponsor OSS and Java communities and even publish very popular open-source software (VSCode and LightGBM are outstanding examples, I think).
* Concerns about the lack of issue workflow and the simpler metadata management
** From a practical viewpoint, it makes complete sense to me that many people raised this. We would need to think carefully about how to manage versions and issue/PR metadata. Large projects that are fully operated on GitHub overcome this shortcoming in various ways - organized issue templates with fixed label sets would be an example. I think we will have a sandbox repository outside ASF and try some experiments on it before the actual migration.
* Security issues that only PMC members are allowed to access
** We will be able to continue to use Jira for this purpose, or we could even have an issue-only private GitHub repository for Lucene?
* Concerns about migrating the whole Jira issue history to GitHub issues
** I don't think it is possible. I'm almost sure there will be some information loss if we attempt to migrate all Jira issues with metadata/history into GitHub. Rather than trying to do that, I would prefer to leave the Jira issues as they are and simply refer to them.
** If we don't aim at perfection, I think we'll be able to migrate all (or part of) the issues with the APIs, as Shad Storhaug kindly shared in this comment.

Aside from those concerns, there seems to be no disagreement that GitHub is superior to Jira in terms of overall UX design, and most new developers like it.

was (Author: tomoko uchida):

We are having a long discussion thread on the dev list, and many issues have been raised. Here is a short summary (with my brief thoughts/opinions).

* Concerns about the political neutrality of GitHub - in other words, concerns about account bans with no good reason
** There seem to be several cases (including rumors) of GitHub account bans. It's unclear whether they violated its terms and policies or not, and I don't think we will be able to assess the risk correctly. I would defer the judgment to individuals.
** For developers who don't use GitHub regardless of the reason, we will always support contribution paths that do not rely on GitHub. Patches via Jira will remain a decent option for good.
* Concerns about its parent company, Microsoft
** I'd defer the judgment on that to individuals, for the same reason as for the previous subject. One thing I can say is that the recent trend in their direction is GOOD - they support/sponsor OSS and Java communities and even publish very popular open-source software (VSCode and LightGBM are outstanding examples, I think).
* Concerns about the lack of issue workflow and the simpler metadata management
** From a practical viewpoint, it makes complete sense to me that many people raised this. We would need to think carefully about how to manage versions and issue/PR metadata. Large projects that are fully operated on GitHub overcome this shortcoming in various ways - organized issue templates with fixed label sets would be an example. I think we will have a sandbox repository outside ASF and try some experiments on it before the actual migration.
* Security issues that only PMC members are allowed to access
** We will be able to continue to use Jira for this purpose, or we could even have an issue-only private GitHub repository for Lucene?
* Concerns about migrating the whole Jira issue history to GitHub issues
** I don't think it is possible. I'm almost sure there will be some information loss if we attempt to migrate all Jira issues with metadata/history into GitHub. Rather than trying to do that, I would prefer to leave the Jira issues as they are and simply refer to them.
** If we don't aim at perfection, I think we'll be able to migrate all (or part of)
[jira] [Created] (LUCENE-10592) Should we build HNSW graph on the fly during indexing
Mayya Sharipova created LUCENE-10592:

Summary: Should we build HNSW graph on the fly during indexing
Key: LUCENE-10592
URL: https://issues.apache.org/jira/browse/LUCENE-10592
Project: Lucene - Core
Issue Type: Improvement
Reporter: Mayya Sharipova

Currently, when we index vectors for KnnVectorField, we buffer those vectors in memory, and on flush, during segment construction, we build an HNSW graph. Because building an HNSW graph is very expensive, this makes the flush operation take a long time. It also makes overall indexing performance quite unpredictable (the number of flushes is determined by the memory used and by the presence of concurrent searches): some indexing operations return almost instantly, while others that trigger a flush take a long time.

Building the HNSW graph on the fly as we index vectors would avoid this problem and spread the load of HNSW graph construction evenly.

-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10592) Should we build HNSW graph on the fly during indexing
[ https://issues.apache.org/jira/browse/LUCENE-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova updated LUCENE-10592: - Description: Currently, when we index vectors for KnnVectorField, we buffer those vectors in memory and on flush during a segment construction we build an HNSW graph. As building an HNSW graph is very expensive, this makes flush operation take a lot of time. This also makes overall indexing performance quite unpredictable (as the number of flushes are defined by memory used, and the presence of concurrent searches), e.g. some indexing operations return almost instantly while others that trigger flush take a lot of time. Building an HNSW graph on the fly as we index vectors allows to avoid this problem, and spread a load of HNSW graph construction evenly during indexing. was: Currently, when we index vectors for KnnVectorField, we buffer those vectors in memory and on flush during a segment construction we build an HNSW graph. As building an HNSW graph is very expensive, this makes flush operation take a lot of time. This also makes overall indexing performance quite unpredictable (as the number of flushes are defined by memory used, and the presence of concurrent searches), e.g. some indexing operations return almost instantly while others that trigger flush take a lot of time. Building an HNSW graph on the fly as we index vectors allows to avoid this problem, and spread a load of HNSW graph construction evenly. > Should we build HNSW graph on the fly during indexing > - > > Key: LUCENE-10592 > URL: https://issues.apache.org/jira/browse/LUCENE-10592 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Mayya Sharipova >Priority: Minor > > Currently, when we index vectors for KnnVectorField, we buffer those vectors > in memory and on flush during a segment construction we build an HNSW graph. > As building an HNSW graph is very expensive, this makes flush operation take > a lot of time. 
This also makes overall indexing performance quite > unpredictable (as the number of flushes are defined by memory used, and the > presence of concurrent searches), e.g. some indexing operations return almost > instantly while others that trigger flush take a lot of time. > Building an HNSW graph on the fly as we index vectors allows to avoid this > problem, and spread a load of HNSW graph construction evenly during indexing. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10592) Should we build HNSW graph on the fly during indexing
[ https://issues.apache.org/jira/browse/LUCENE-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova updated LUCENE-10592: - Description: Currently, when we index vectors for KnnVectorField, we buffer those vectors in memory and on flush during a segment construction we build an HNSW graph. As building an HNSW graph is very expensive, this makes flush operation take a lot of time. This also makes overall indexing performance quite unpredictable (as the number of flushes are defined by memory used, and the presence of concurrent searches), e.g. some indexing operations return almost instantly while others that trigger flush take a lot of time. Building an HNSW graph on the fly as we index vectors allows to avoid this problem, and spread a load of HNSW graph construction evenly during indexing. This will also supersede LUCENE-10194 was: Currently, when we index vectors for KnnVectorField, we buffer those vectors in memory and on flush during a segment construction we build an HNSW graph. As building an HNSW graph is very expensive, this makes flush operation take a lot of time. This also makes overall indexing performance quite unpredictable (as the number of flushes are defined by memory used, and the presence of concurrent searches), e.g. some indexing operations return almost instantly while others that trigger flush take a lot of time. Building an HNSW graph on the fly as we index vectors allows to avoid this problem, and spread a load of HNSW graph construction evenly during indexing. > Should we build HNSW graph on the fly during indexing > - > > Key: LUCENE-10592 > URL: https://issues.apache.org/jira/browse/LUCENE-10592 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Mayya Sharipova >Priority: Minor > > Currently, when we index vectors for KnnVectorField, we buffer those vectors > in memory and on flush during a segment construction we build an HNSW graph. 
> As building an HNSW graph is very expensive, this makes flush operation take > a lot of time. This also makes overall indexing performance quite > unpredictable (as the number of flushes are defined by memory used, and the > presence of concurrent searches), e.g. some indexing operations return almost > instantly while others that trigger flush take a lot of time. > Building an HNSW graph on the fly as we index vectors allows to avoid this > problem, and spread a load of HNSW graph construction evenly during indexing. > This will also supersede LUCENE-10194 -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
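The trade-off described in this issue can be illustrated with a toy cost model. This is a sketch under loud assumptions, not Lucene code: `GraphBuildTiming` and `ToyGraph` are hypothetical names, and the "graph" is a brute-force nearest-neighbor link list in which inserting into a graph of n nodes costs n distance computations. Real HNSW insertion is far cheaper per node, so only the shape of the comparison carries over: the same total work either lands in one flush or is spread across the individual add operations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Toy cost model (hypothetical, not Lucene): where does graph-building work land? */
final class GraphBuildTiming {

  /** Toy graph: each inserted node is linked to its closest earlier node by brute force. */
  static final class ToyGraph {
    private final List<float[]> vectors = new ArrayList<>();
    private final List<Integer> nearestEarlierNode = new ArrayList<>();
    private long distanceComputations = 0;

    void insert(float[] vector) {
      int best = -1; // -1 means "no earlier node", used for the very first insert
      float bestDistance = Float.POSITIVE_INFINITY;
      for (int i = 0; i < vectors.size(); i++) {
        float distance = squaredDistance(vectors.get(i), vector);
        distanceComputations++;
        if (distance < bestDistance) {
          bestDistance = distance;
          best = i;
        }
      }
      vectors.add(vector);
      nearestEarlierNode.add(best);
    }

    long distanceComputations() {
      return distanceComputations;
    }

    private static float squaredDistance(float[] a, float[] b) {
      float sum = 0f;
      for (int i = 0; i < a.length; i++) {
        float diff = a[i] - b[i];
        sum += diff * diff;
      }
      return sum;
    }
  }

  /** Buffered strategy: adds only buffer, so the whole graph is built inside flush. */
  static long flushCostWhenBuffered(List<float[]> docs) {
    ToyGraph graph = new ToyGraph();
    for (float[] vector : docs) {
      graph.insert(vector);
    }
    return graph.distanceComputations();
  }

  /** On-the-fly strategy: returns the cost of the most expensive single add. */
  static long maxAddCostWhenIncremental(List<float[]> docs) {
    ToyGraph graph = new ToyGraph();
    long max = 0;
    for (float[] vector : docs) {
      long before = graph.distanceComputations();
      graph.insert(vector);
      max = Math.max(max, graph.distanceComputations() - before);
    }
    return max;
  }

  static List<float[]> randomVectors(int count, int dimension, long seed) {
    Random random = new Random(seed);
    List<float[]> docs = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      float[] vector = new float[dimension];
      for (int d = 0; d < dimension; d++) {
        vector[d] = random.nextFloat();
      }
      docs.add(vector);
    }
    return docs;
  }

  public static void main(String[] args) {
    List<float[]> docs = randomVectors(1000, 4, 42L);
    System.out.println(flushCostWhenBuffered(docs));     // 499500
    System.out.println(maxAddCostWhenIncremental(docs)); // 999
  }
}
```

In this model the buffered flush performs all 499500 distance computations in one operation, while the costliest single add in the on-the-fly variant performs 999. That gap is the latency spike the issue wants to smooth out.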
[GitHub] [lucene] mayya-sharipova commented on pull request #728: LUCENE-10194 Buffer KNN vectors on disk
mayya-sharipova commented on PR #728: URL: https://github.com/apache/lucene/pull/728#issuecomment-1137833962 @LuXugang Thanks for looking into this. I was thinking of closing this issue and this PR. As @jtibshirani noted, the problem with this approach is that a flush or a segment creation may take a very substantial amount of time. Instead, I was thinking of taking a different approach to how we index vectors: building an HNSW graph on the fly while indexing, as explained in [LUCENE-10592](https://issues.apache.org/jira/browse/LUCENE-10592). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] kiranchitturi opened a new pull request, #2662: SOLR-16215 Escape query characters in Solr SQL Array UDF functions
kiranchitturi opened a new pull request, #2662: URL: https://github.com/apache/lucene-solr/pull/2662 * Backport of https://github.com/apache/solr/pull/879 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jtibshirani commented on a diff in pull request #924: Create Lucene93 Codec and move Lucene92 to backwards_codecs
jtibshirani commented on code in PR #924: URL: https://github.com/apache/lucene/pull/924#discussion_r882189288 ## lucene/backward-codecs/src/test/org/apache/lucene/backward_codecs/lucene92/TestLucene92HnswVectorsFormat.java: ## @@ -0,0 +1,42 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.backward_codecs.lucene92; + +import org.apache.lucene.codecs.Codec; +import org.apache.lucene.codecs.KnnVectorsFormat; +import org.apache.lucene.tests.index.BaseKnnVectorsFormatTestCase; + +public class TestLucene92HnswVectorsFormat extends BaseKnnVectorsFormatTestCase { + @Override + protected Codec getCodec() { +return new Lucene92RWCodec(); + } + + public void testToString() { +Codec customCodec = +new Lucene92RWCodec() { + @Override + public KnnVectorsFormat getKnnVectorsFormatForField(String field) { +return new Lucene92RWHnswVectorsFormat(); + } +}; +String expectedString = "Lucene92RWHnswVectorsFormat"; Review Comment: Is there a reason to take a different approach to `toString` for this format than we did for the older ones like `Lucene91RWHnswVectorsFormat`? 
## lucene/backward-codecs/src/test/org/apache/lucene/backward_codecs/lucene92/Lucene92RWHnswVectorsFormat.java: ## @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.backward_codecs.lucene92; + +import java.io.IOException; +import org.apache.lucene.codecs.KnnVectorsReader; +import org.apache.lucene.codecs.KnnVectorsWriter; +import org.apache.lucene.index.SegmentReadState; +import org.apache.lucene.index.SegmentWriteState; +import org.apache.lucene.util.hnsw.HnswGraph; + +public final class Lucene92RWHnswVectorsFormat extends Lucene92HnswVectorsFormat { + + /** Default number of maximum connections per node */ + public static final int DEFAULT_MAX_CONN = 16; + + /** + * Default number of the size of the queue maintained while searching during a graph construction. + */ + public static final int DEFAULT_BEAM_WIDTH = 100; + + static final int DIRECT_MONOTONIC_BLOCK_SHIFT = 16; + + /** + * Controls how many of the nearest neighbor candidates are connected to the new node. Defaults to + * {@link #DEFAULT_MAX_CONN}. See {@link HnswGraph} for more details. 
+ */ + private final int maxConn; + + /** + * The number of candidate neighbors to track while searching the graph for each newly inserted + * node. Defaults to to {@link #DEFAULT_BEAM_WIDTH}. See {@link HnswGraph} for details. + */ + private final int beamWidth; + + /** Constructs a format using default graph construction parameters. */ + public Lucene92RWHnswVectorsFormat() { Review Comment: Small comment -- for other test formats like `Lucene91RWHnswVectorsFormat` we accepted beamWidth and maxConn as parameters and referred to the static defaults (like `Lucene91HnswVectorsFormat.DEFAULT_MAX_CONN`). Also we didn't have local variables `beamWidth` and `maxConn`. It'd be nice to keep the same pattern for consistency. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --
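The pattern the reviewer asks for can be sketched with stand-in classes. Everything below is hypothetical: `ReadOnlyFormat` and `RWFormat` are placeholders, not the real Lucene91/Lucene92 classes, and the assumption that the parent class stores `maxConn` and `beamWidth` is mine. Only the default values 16 and 100 come from the diff above.

```java
/** Hypothetical stand-ins illustrating the constructor pattern from the review comment. */
final class FormatPatternDemo {

  /** Stand-in for the read-only format that owns the static defaults and the fields. */
  static class ReadOnlyFormat {
    static final int DEFAULT_MAX_CONN = 16;
    static final int DEFAULT_BEAM_WIDTH = 100;

    final int maxConn;
    final int beamWidth;

    ReadOnlyFormat(int maxConn, int beamWidth) {
      this.maxConn = maxConn;
      this.beamWidth = beamWidth;
    }
  }

  /** Stand-in for the test-only read/write format: no duplicated constants or fields. */
  static final class RWFormat extends ReadOnlyFormat {

    /** Uses the parent's static defaults instead of redeclaring them locally. */
    RWFormat() {
      this(ReadOnlyFormat.DEFAULT_MAX_CONN, ReadOnlyFormat.DEFAULT_BEAM_WIDTH);
    }

    /** Accepts the graph construction parameters and hands them to the parent. */
    RWFormat(int maxConn, int beamWidth) {
      super(maxConn, beamWidth);
    }

    @Override
    public String toString() {
      return "RWFormat(maxConn=" + maxConn + ", beamWidth=" + beamWidth + ")";
    }
  }

  public static void main(String[] args) {
    System.out.println(new RWFormat());       // RWFormat(maxConn=16, beamWidth=100)
    System.out.println(new RWFormat(32, 50)); // RWFormat(maxConn=32, beamWidth=50)
  }
}
```

Keeping the constants and fields in one place means the read/write test variant cannot drift from the format it wraps, which seems to be the consistency argument behind the comment.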
[GitHub] [lucene] mikemccand merged pull request #925: LUCENE-10591: Fix UTF-8 encoding
mikemccand merged PR #925: URL: https://github.com/apache/lucene/pull/925 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10591) Invalid character in SortableSingleDocSource.java
[ https://issues.apache.org/jira/browse/LUCENE-10591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542289#comment-17542289 ] ASF subversion and git services commented on LUCENE-10591: -- Commit 3a80968ddf30293ddf55c62f8f2f8a6915028408 in lucene's branch refs/heads/main from András Salamon [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=3a80968ddf3 ] LUCENE-10591 Invalid character in SortableSingleDocSource.java (#925) > Invalid character in SortableSingleDocSource.java > - > > Key: LUCENE-10591 > URL: https://issues.apache.org/jira/browse/LUCENE-10591 > Project: Lucene - Core > Issue Type: Bug >Reporter: Andras Salamon >Priority: Trivial > Time Spent: 20m > Remaining Estimate: 0h > > There are invalid UTF-8 characters in SortableSingleDocSource.java > "S�o Tom� and Pr�ncipe" > Sonar gave me a warning because of this. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10591) Invalid character in SortableSingleDocSource.java
[ https://issues.apache.org/jira/browse/LUCENE-10591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542290#comment-17542290 ] ASF subversion and git services commented on LUCENE-10591: -- Commit eecf8ea63b90e1f77bb329a1d6e9d8cd6ad8aeb2 in lucene's branch refs/heads/branch_9x from András Salamon [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=eecf8ea63b9 ] LUCENE-10591 Invalid character in SortableSingleDocSource.java (#925) > Invalid character in SortableSingleDocSource.java > - > > Key: LUCENE-10591 > URL: https://issues.apache.org/jira/browse/LUCENE-10591 > Project: Lucene - Core > Issue Type: Bug >Reporter: Andras Salamon >Priority: Trivial > Time Spent: 20m > Remaining Estimate: 0h > > There are invalid UTF-8 characters in SortableSingleDocSource.java > "S�o Tom� and Pr�ncipe" > Sonar gave me a warning because of this. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10591) Invalid character in SortableSingleDocSource.java
[ https://issues.apache.org/jira/browse/LUCENE-10591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-10591. - Fix Version/s: 10.0 (main) 9.3 Resolution: Fixed Thank you for the attention to detail [~asalamon74]! I merged the PR to main/10.0 and cherry-picked to 9.x (eventually 9.3).
[GitHub] [lucene] mikemccand commented on pull request #915: LUCENE-10585: Scrub copy/paste code in the facets module and attempt to simplify a bit
mikemccand commented on PR #915: URL: https://github.com/apache/lucene/pull/915#issuecomment-1138026556 Thank you @Yuti-G for running the dedicated `luceneutil` faceting benchmark! But: the `getAllDims` time for SSDV seems to have gotten much faster with this PR, which is great! Was that expected? Or is this some horrible noise? Is it repeatable?
[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542341#comment-17542341 ] Ming Zhu commented on LUCENE-10562: --- I'm encountering a similar issue, but the impact is more than performance. My case is, I have a wildcard query with filter, let's say, *wildcard:'*searchvalue*' and term filter 'status':'open'* And I'm using TopTermsScoringBooleanQueryRewrite to hopefully get some meaningful relevance scores to sort all the hits. For my data set, there are millions of documents where status is NOT open, and a handful of them with status:open. So the issue here is with the rewrite with top terms, all the terms which are relevant for documents with *status:open* are ranked very low (because of their low frequencies), but apparently I can't keep increasing the size of terms to be taken in the rewrite phase, as that may lead to the max clause issue. So this query+filter ended up with not hitting anything. Any idea how to get out of this situation? Thanks. [~uschindler] [~tomoko] > Large system: Wildcard search leads to full index scan despite filter query > --- > > Key: LUCENE-10562 > URL: https://issues.apache.org/jira/browse/LUCENE-10562 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 8.11.1 >Reporter: Henrik Hertel >Priority: Major > Labels: performance > > I use Solr and have a large system with 1TB in one core and about 5 million > documents. The textual content of large PDF files is indexed there. My query > is extremely slow (more than 30 seconds) as soon as I use wildcards e.g. > {code:java} > *searchvalue* > {code} > , even though I put a filter query in front of it that reduces to less than > 20 documents. 
> searchvalue -> less than 1 second > searchvalue* -> less than 1 second > My query: > {code:java} > select?defType=lucene&q=content_t:*searchvalue*&fq=metadataitemids_is:20950&fl=id&rows=50&start=0 > {code} > I've tried everything imaginable. It doesn't make sense to me why a search > over a small subset should take so long. If I omit the filter query > metadataitemids_is:20950, so search the entire inventory, then it also takes > the same amount of time. Therefore, I suspect that despite the filter query, > the main query runs over the entire index.
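One possible reading of the report above: a pattern with a leading wildcard cannot be located by seeking in a sorted term dictionary, so every term has to be tested, and that cost does not shrink when the filter query keeps only a few documents. The sketch below illustrates the difference with a plain `TreeSet` standing in for the term dictionary. The class `WildcardScanSketch`, its methods, and the sample terms are hypothetical; they are not Lucene or Solr APIs, and Lucene's real term dictionary and automaton matching are considerably more sophisticated.

```java
import java.util.TreeSet;

class WildcardScanSketch {

    /** Hypothetical sorted term dictionary: 1000 filler terms plus three containing "searchvalue". */
    static TreeSet<String> sampleTerms() {
        TreeSet<String> terms = new TreeSet<>();
        for (int i = 0; i < 1000; i++) {
            terms.add("term" + i);
        }
        terms.add("searchvalue");
        terms.add("searchvalues");
        terms.add("xsearchvaluex");
        return terms;
    }

    /** prefix*: seek to the prefix, stop at the first term that no longer starts with it. Returns {visited, matched}. */
    static int[] prefixScan(TreeSet<String> terms, String prefix) {
        int visited = 0;
        int matched = 0;
        for (String term : terms.tailSet(prefix, true)) {
            visited++;
            if (!term.startsWith(prefix)) {
                break; // sorted order guarantees no later term can match
            }
            matched++;
        }
        return new int[] {visited, matched};
    }

    /** *infix*: there is no position to seek to, so every term is tested. Returns {visited, matched}. */
    static int[] infixScan(TreeSet<String> terms, String infix) {
        int visited = 0;
        int matched = 0;
        for (String term : terms) {
            visited++;
            if (term.contains(infix)) {
                matched++;
            }
        }
        return new int[] {visited, matched};
    }

    public static void main(String[] args) {
        TreeSet<String> terms = sampleTerms();
        int[] prefix = prefixScan(terms, "searchvalue");
        int[] infix = infixScan(terms, "searchvalue");
        System.out.println("searchvalue*  visited=" + prefix[0] + " matched=" + prefix[1]);
        System.out.println("*searchvalue* visited=" + infix[0] + " matched=" + infix[1]);
    }
}
```

In this toy model the prefix scan touches only the terms around the seek position, while the infix scan touches the whole dictionary no matter how many documents a later filter would keep. Whether this is the actual cause of the 30-second query in the report is not confirmed in this thread.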
[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542341#comment-17542341 ] Ming Zhu edited comment on LUCENE-10562 at 5/26/22 4:39 AM: I'm encountering a similar issue, but the impact is more than performance. My case is, I have a wildcard query with filter, let's say, *wildcard:'*searchvalue*' and term filter 'status':'open'* And I'm using TopTermsScoringBooleanQueryRewrite to hopefully get some meaningful relevance scores to sort all the hits. For my data set, there are millions of documents where status is NOT open, and a handful of them with status:open. So the issue here is with the rewrite with top terms, all the terms which are relevant for documents with *status:open* are ranked very low (because of their low frequencies), but apparently I can't keep increasing the size of terms to be taken in the rewrite phase, as that may lead to the max clause issue. So this query+filter ended up with not hitting anything. Any idea how to get out of this situation? Thanks. [~uschindler] [~tomoko] was (Author: JIRAUSER290042): I'm encountering a similar issue, but the impact is more than performance. My case is, I have a wildcard query with filter, let's say, *wildcard:'*searchvalue*' and term filter 'status':'open'* And I'm using TopTermsScoringBooleanQueryRewrite to hopefully get some meaningful relevance scores to sort all the hits. For my data set, there are millions of documents where status is NOT open, and a handful of them with status:open. So the issue here is with the rewrite with top terms, all the terms which are relevant for documents with *status:open* are ranked very low (because of their low frequencies), but apparently I can't keep increasing the size of terms to be taken in the rewrite phase, as that may lead to the max clause issue. So this query+filter ended up with not hitting anything. Any idea how to get out of this situation? Thanks. 
[~uschindler] [~tomoko]
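The situation described in the comment above can be sketched with a toy inverted index: the wildcard is expanded to its matching terms, only the top N terms are kept, and the filter is applied afterwards. The class `TopTermsFilterSketch`, its methods, and the data are hypothetical and use no Lucene API. Ranking terms by document frequency is an assumption taken from the comment's wording, not a statement about how `TopTermsScoringBooleanQueryRewrite` actually orders terms.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class TopTermsFilterSketch {

    /** Toy index: term -> ids of the documents that contain it. */
    static Map<String, Set<Integer>> buildIndex() {
        Map<String, Set<Integer>> postings = new TreeMap<>();
        // Two frequent terms that occur only in documents whose status is NOT open (ids 0..99).
        Set<Integer> closedA = new HashSet<>();
        Set<Integer> closedB = new HashSet<>();
        for (int id = 0; id < 100; id++) {
            closedA.add(id);
            if (id % 2 == 0) {
                closedB.add(id);
            }
        }
        postings.put("asearchvalue1", closedA);
        postings.put("bsearchvalue2", closedB);
        // One rare term that occurs only in the single open document (id 100).
        postings.put("csearchvalue3", Set.of(100));
        return postings;
    }

    /** Expands *infix*, keeps the maxTerms most frequent terms, then intersects the hits with the filter. */
    static Set<Integer> search(Map<String, Set<Integer>> postings, String infix,
                               int maxTerms, Set<Integer> filter) {
        List<String> matching = new ArrayList<>();
        for (String term : postings.keySet()) {
            if (term.contains(infix)) {
                matching.add(term);
            }
        }
        // Assumed ranking: higher document frequency first.
        matching.sort(Comparator.comparingInt((String t) -> postings.get(t).size()).reversed());

        Set<Integer> hits = new HashSet<>();
        for (String term : matching.subList(0, Math.min(maxTerms, matching.size()))) {
            hits.addAll(postings.get(term));
        }
        hits.retainAll(filter); // the filter only sees documents of the surviving terms
        return hits;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index = buildIndex();
        Set<Integer> open = Set.of(100);
        System.out.println(search(index, "searchvalue", 2, open)); // prints []
        System.out.println(search(index, "searchvalue", 3, open)); // prints [100]
    }
}
```

With a limit of two terms, the rare term that belongs to the only open document is cut before the filter runs, so the filtered result is empty. Raising the limit recovers the hit in this toy case, which mirrors the trade-off against the max clause limit that the comment mentions.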
[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542341#comment-17542341 ] Ming Zhu edited comment on LUCENE-10562 at 5/26/22 4:40 AM: I'm encountering a similar issue, but the impact is more than performance. My case is, I have a wildcard query with filter, let's say, {*}wildcard:'\*{*}{*}searchvalue\*{*}{*}' and term filter 'status':'open'{*} And I'm using TopTermsScoringBooleanQueryRewrite to hopefully get some meaningful relevance scores to sort all the hits. For my data set, there are millions of documents where status is NOT open, and a handful of them with status:open. So the issue here is with the rewrite with top terms, all the terms which are relevant for documents with *status:open* are ranked very low (because of their low frequencies), but apparently I can't keep increasing the size of terms to be taken in the rewrite phase, as that may lead to the max clause issue. So this query+filter ended up with not hitting anything. Any idea how to get out of this situation? Thanks. [~uschindler] [~tomoko] was (Author: JIRAUSER290042): I'm encountering a similar issue, but the impact is more than performance. My case is, I have a wildcard query with filter, let's say, {*}wildcard:'\{*}{*}searchvalue{*}{*}{*}{*}' and term filter 'status':'open'{*} And I'm using TopTermsScoringBooleanQueryRewrite to hopefully get some meaningful relevance scores to sort all the hits. For my data set, there are millions of documents where status is NOT open, and a handful of them with status:open. So the issue here is with the rewrite with top terms, all the terms which are relevant for documents with *status:open* are ranked very low (because of their low frequencies), but apparently I can't keep increasing the size of terms to be taken in the rewrite phase, as that may lead to the max clause issue. So this query+filter ended up with not hitting anything. 
Any idea how to get out of this situation? Thanks. [~uschindler] [~tomoko]
[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542341#comment-17542341 ] Ming Zhu edited comment on LUCENE-10562 at 5/26/22 4:40 AM: I'm encountering a similar issue, but the impact is more than performance. My case is, I have a wildcard query with filter, let's say, {*}*wildcard:'\*searchvalue\*{*}' and term filter 'status':'open'* And I'm using TopTermsScoringBooleanQueryRewrite to hopefully get some meaningful relevance scores to sort all the hits. For my data set, there are millions of documents where status is NOT open, and a handful of them with status:open. So the issue here is with the rewrite with top terms, all the terms which are relevant for documents with *status:open* are ranked very low (because of their low frequencies), but apparently I can't keep increasing the size of terms to be taken in the rewrite phase, as that may lead to the max clause issue. So this query+filter ended up with not hitting anything. Any idea how to get out of this situation? Thanks. [~uschindler] [~tomoko] was (Author: JIRAUSER290042): I'm encountering a similar issue, but the impact is more than performance. My case is, I have a wildcard query with filter, let's say, *wildcard:'*searchvalue*' and term filter 'status':'open'* And I'm using TopTermsScoringBooleanQueryRewrite to hopefully get some meaningful relevance scores to sort all the hits. For my data set, there are millions of documents where status is NOT open, and a handful of them with status:open. So the issue here is with the rewrite with top terms, all the terms which are relevant for documents with *status:open* are ranked very low (because of their low frequencies), but apparently I can't keep increasing the size of terms to be taken in the rewrite phase, as that may lead to the max clause issue. So this query+filter ended up with not hitting anything. Any idea how to get out of this situation? Thanks. 
[~uschindler] [~tomoko]
[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542341#comment-17542341 ] Ming Zhu edited comment on LUCENE-10562 at 5/26/22 4:40 AM: I'm encountering a similar issue, but the impact is more than performance. My case is, I have a wildcard query with filter, let's say, {*}wildcard:'\{*}{*}searchvalue{*}{*}{*}{*}' and term filter 'status':'open'{*} And I'm using TopTermsScoringBooleanQueryRewrite to hopefully get some meaningful relevance scores to sort all the hits. For my data set, there are millions of documents where status is NOT open, and a handful of them with status:open. So the issue here is with the rewrite with top terms, all the terms which are relevant for documents with *status:open* are ranked very low (because of their low frequencies), but apparently I can't keep increasing the size of terms to be taken in the rewrite phase, as that may lead to the max clause issue. So this query+filter ended up with not hitting anything. Any idea how to get out of this situation? Thanks. [~uschindler] [~tomoko] was (Author: JIRAUSER290042): I'm encountering a similar issue, but the impact is more than performance. My case is, I have a wildcard query with filter, let's say, *wildcard:'\{*}searchvalue\{*}' and term filter 'status':'open'* And I'm using TopTermsScoringBooleanQueryRewrite to hopefully get some meaningful relevance scores to sort all the hits. For my data set, there are millions of documents where status is NOT open, and a handful of them with status:open. So the issue here is with the rewrite with top terms, all the terms which are relevant for documents with *status:open* are ranked very low (because of their low frequencies), but apparently I can't keep increasing the size of terms to be taken in the rewrite phase, as that may lead to the max clause issue. So this query+filter ended up with not hitting anything. Any idea how to get out of this situation? 
Thanks. [~uschindler] [~tomoko]
[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542341#comment-17542341 ] Ming Zhu edited comment on LUCENE-10562 at 5/26/22 4:40 AM: I'm encountering a similar issue, but the impact is more than performance. My case is, I have a wildcard query with filter, let's say, *wildcard:'\{*}searchvalue\{*}' and term filter 'status':'open'* And I'm using TopTermsScoringBooleanQueryRewrite to hopefully get some meaningful relevance scores to sort all the hits. For my data set, there are millions of documents where status is NOT open, and a handful of them with status:open. So the issue here is with the rewrite with top terms, all the terms which are relevant for documents with *status:open* are ranked very low (because of their low frequencies), but apparently I can't keep increasing the size of terms to be taken in the rewrite phase, as that may lead to the max clause issue. So this query+filter ended up with not hitting anything. Any idea how to get out of this situation? Thanks. [~uschindler] [~tomoko] was (Author: JIRAUSER290042): I'm encountering a similar issue, but the impact is more than performance. My case is, I have a wildcard query with filter, let's say, {*}*wildcard:'\*searchvalue\*{*}' and term filter 'status':'open'* And I'm using TopTermsScoringBooleanQueryRewrite to hopefully get some meaningful relevance scores to sort all the hits. For my data set, there are millions of documents where status is NOT open, and a handful of them with status:open. So the issue here is with the rewrite with top terms, all the terms which are relevant for documents with *status:open* are ranked very low (because of their low frequencies), but apparently I can't keep increasing the size of terms to be taken in the rewrite phase, as that may lead to the max clause issue. So this query+filter ended up with not hitting anything. Any idea how to get out of this situation? Thanks. 
[~uschindler] [~tomoko]
[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542341#comment-17542341 ] Ming Zhu edited comment on LUCENE-10562 at 5/26/22 4:49 AM: I'm encountering a similar issue, but the impact is more than performance. My case is, I have a wildcard query with filter, let's say, *name:'\*searchvalue\*' and term filter 'status':'open'* And I'm using TopTermsScoringBooleanQueryRewrite to hopefully get some meaningful relevance scores to sort all the hits. For my data set, there are millions of documents where status is NOT open, and a handful of them with status:open. So the issue here is with the rewrite with top terms, all the terms which are relevant for documents with *status:open* are ranked very low (because of their low frequencies), but apparently I can't keep increasing the size of terms to be taken in the rewrite phase, as that may lead to the max clause issue. So this query+filter ended up with not hitting anything. Any idea how to get out of this situation? Thanks. [~uschindler] [~tomoko] was (Author: JIRAUSER290042): I'm encountering a similar issue, but the impact is more than performance. My case is, I have a wildcard query with filter, let's say, {*}name:'*{*}{*}searchvalue*{*}{*}' and term filter 'status':'open'{*} And I'm using TopTermsScoringBooleanQueryRewrite to hopefully get some meaningful relevance scores to sort all the hits. For my data set, there are millions of documents where status is NOT open, and a handful of them with status:open. So the issue here is with the rewrite with top terms, all the terms which are relevant for documents with *status:open* are ranked very low (because of their low frequencies), but apparently I can't keep increasing the size of terms to be taken in the rewrite phase, as that may lead to the max clause issue. So this query+filter ended up with not hitting anything. Any idea how to get out of this situation? Thanks. 
[~uschindler] [~tomoko]
[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query
[ https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542341#comment-17542341 ] Ming Zhu edited comment on LUCENE-10562 at 5/26/22 4:49 AM: I'm encountering a similar issue, but the impact is more than performance. My case is, I have a wildcard query with filter, let's say, {*}name:'*{*}{*}searchvalue*{*}{*}' and term filter 'status':'open'{*} And I'm using TopTermsScoringBooleanQueryRewrite to hopefully get some meaningful relevance scores to sort all the hits. For my data set, there are millions of documents where status is NOT open, and a handful of them with status:open. So the issue here is with the rewrite with top terms, all the terms which are relevant for documents with *status:open* are ranked very low (because of their low frequencies), but apparently I can't keep increasing the size of terms to be taken in the rewrite phase, as that may lead to the max clause issue. So this query+filter ended up with not hitting anything. Any idea how to get out of this situation? Thanks. [~uschindler] [~tomoko] was (Author: JIRAUSER290042): I'm encountering a similar issue, but the impact is more than performance. My case is, I have a wildcard query with filter, let's say, {*}wildcard:'\*{*}{*}searchvalue\*{*}{*}' and term filter 'status':'open'{*} And I'm using TopTermsScoringBooleanQueryRewrite to hopefully get some meaningful relevance scores to sort all the hits. For my data set, there are millions of documents where status is NOT open, and a handful of them with status:open. So the issue here is with the rewrite with top terms, all the terms which are relevant for documents with *status:open* are ranked very low (because of their low frequencies), but apparently I can't keep increasing the size of terms to be taken in the rewrite phase, as that may lead to the max clause issue. So this query+filter ended up with not hitting anything. Any idea how to get out of this situation? 
Thanks. [~uschindler] [~tomoko]