date:20220525



mocobeta commented on PR #920:
URL: https://github.com/apache/lucene/pull/920#issuecomment-1136964941

   I'll merge this, thanks for your quick response!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] mocobeta merged pull request #920: LUCENE-10589: increase upper bound of test range query to the maximum value + 1

2022-05-25 Thread ASF subversion and git services (Jira)



mocobeta merged PR #920:
URL: https://github.com/apache/lucene/pull/920



[jira] [Commented] (LUCENE-10589) Fix corner case in TestKnnVectorQuery.testRandomWithFilter



[ 
https://issues.apache.org/jira/browse/LUCENE-10589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541898#comment-17541898
 ] 

ASF subversion and git services commented on LUCENE-10589:
--

Commit 2620b5669f9a3ccb90439309723314295a850b29 in lucene's branch 
refs/heads/main from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=2620b5669f9 ]

LUCENE-10589: increase upper bound of test range query (#920)



> Fix corner case in TestKnnVectorQuery.testRandomWithFilter
> --
>
> Key: LUCENE-10589
> URL: https://issues.apache.org/jira/browse/LUCENE-10589
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Tomoko Uchida
>Priority: Minor
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> {{TestKnnVectorQuery.testRandomWithFilter}} can fail with 
> java.lang.UnsupportedOperationException.
> Reproducible command
> {code:java}
> ./gradlew test --tests TestKnnVectorQuery.testRandomWithFilter 
> -Dtests.seed=1DA39B92702DAC45 -Dtests.multiplier=3
> {code}
> {code:java}
> org.apache.lucene.search.TestKnnVectorQuery > testRandomWithFilter FAILED
> java.lang.UnsupportedOperationException: exact search is not supported
> at 
> __randomizedtesting.SeedInfo.seed([1DA39B92702DAC45:6BEAC2197AD96AE0]:0)
> at 
> org.apache.lucene.search.TestKnnVectorQuery$ThrowingKnnVectorQuery.exactSearch(TestKnnVectorQuery.java:715)
> at 
> org.apache.lucene.search.KnnVectorQuery.searchLeaf(KnnVectorQuery.java:151)
> at 
> org.apache.lucene.search.KnnVectorQuery.rewrite(KnnVectorQuery.java:108)
> at 
> org.apache.lucene.search.ConstantScoreQuery.rewrite(ConstantScoreQuery.java:44)
> at 
> org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:789)
> at 
> org.apache.lucene.tests.search.AssertingIndexSearcher.rewrite(AssertingIndexSearcher.java:69)
> at 
> org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:803)
> at 
> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:685)
> at 
> org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:667)
> at 
> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:584)
> at 
> org.apache.lucene.search.TestKnnVectorQuery.testRandomWithFilter(TestKnnVectorQuery.java:556)
> {code}
> In some edge cases (depending on the random seed), 
> [KnnVectorQuery.java#147|https://github.com/apache/lucene/blob/fe9d26178d033f585c08a5e86708063ac0ec0c9e/lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java#L147]
>  becomes false, and then `exactSearch()` is called.
> The upper bound of [the test range query 
> (filter)|https://github.com/apache/lucene/blob/fe9d26178d033f585c08a5e86708063ac0ec0c9e/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L554]
>  could be 200 (the max value of "tag" field + 1) instead of lower + 150 to 
> make it "unrestrictive"?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-10589) Fix corner case in TestKnnVectorQuery.testRandomWithFilter

2022-05-25 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541902#comment-17541902
 ] 

ASF subversion and git services commented on LUCENE-10589:
--

Commit 9188b7f4c49f6a3c6e9a2580916230e56c4a41d1 in lucene's branch 
refs/heads/branch_9x from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=9188b7f4c49 ]

LUCENE-10589: increase upper bound of test range query (#920)







[jira] [Resolved] (LUCENE-10589) Fix corner case in TestKnnVectorQuery.testRandomWithFilter



 [ 
https://issues.apache.org/jira/browse/LUCENE-10589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida resolved LUCENE-10589.

Fix Version/s: 10.0 (main)
   9.3
   Resolution: Fixed

> Fix corner case in TestKnnVectorQuery.testRandomWithFilter
> --
>
> Key: LUCENE-10589
> URL: https://issues.apache.org/jira/browse/LUCENE-10589
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Tomoko Uchida
>Priority: Minor
> Fix For: 10.0 (main), 9.3
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>




[GitHub] [lucene] mocobeta commented on a diff in pull request #923: Replace classpath with modulepath in the demo tutorial



mocobeta commented on code in PR #923:
URL: https://github.com/apache/lucene/pull/923#discussion_r881413568


##
lucene/demo/src/java/overview.html:
##
@@ -49,36 +49,35 @@ About the Demo
 demonstrates various functionalities of Lucene and how you can add Lucene to
 your applications.
 
-
-Setting your CLASSPATH
+
+Setting your MODULEPATH
 
 First, you should http://www.apache.org/dyn/closer.cgi/lucene/java/";>download the latest
 Lucene distribution and then extract it to a working directory.
-You need four JARs: the Lucene JAR, the queryparser JAR, the common 
analysis JAR, and the Lucene
-demo JAR. You should see the Lucene JAR file in the modules/ directory you 
created
+You need Lucene demo and a few dependent modules.
+You should see the Lucene module (JAR) files in the modules/ and 
modules-thirdparty/ directory you created
 when you extracted the archive -- it should be named something like
 lucene-core-{version}.jar. You should also see
 files called lucene-queryparser-{version}.jar,
 lucene-analysis-common-{version}.jar and lucene-demo-{version}.jar under queryparser, 
analysis/common/ and demo/,
-respectively.
-Put all four of these files in your Java CLASSPATH.
+"codefrag">lucene-demo-{version}.jar under modules directory.
+Put all of these files in your Java MODULEPATH.
 
 
 Indexing Files
 
 Once you've gotten this far you're probably itching to go. Let's build an
-index! Assuming you've set your CLASSPATH correctly, just type:
+index! Just type:
 
-java org.apache.lucene.demo.IndexFiles -docs {path-to-lucene}
+java --module-path modules:modules-thirdparty --module 
org.apache.lucene.demo/org.apache.lucene.demo.IndexFiles -docs {path-to-lucene}
 
 This will produce a subdirectory called index
 which will contain an index of all of the Lucene source code.
 To search the index type:
 
-java org.apache.lucene.demo.SearchFiles
+java --module-path modules:modules-thirdparty --add-modules 
jdk.unsupported --module 
org.apache.lucene.demo/org.apache.lucene.demo.SearchFiles
 
 You'll be prompted for a query. Type in a gibberish or made up word (for 
example: 
 "supercalifragilisticexpialidocious").

Review Comment:
   Fixed in 
https://github.com/apache/lucene/pull/923/commits/d400d7bed703999c050407f9f5b6cf9c0f66b748
 - I still can't detect typos in English by just quickly skimming :)




[GitHub] [lucene] mocobeta commented on pull request #923: LUCENE-10200: Replace classpath with modulepath in the demo tutorial



mocobeta commented on PR #923:
URL: https://github.com/apache/lucene/pull/923#issuecomment-1137010121

   I hooked this on LUCENE-10200 - actually, the tutorial was made obsolete by 
the change in how the binary distribution is assembled.
   We could still stick to classpath, but I feel it'd be clearer to 
switch to module path to align with the binary release structure (as well as 
the Luke launch script).
   



[jira] [Commented] (LUCENE-10200) Restructure and modernize the release artifacts



[ 
https://issues.apache.org/jira/browse/LUCENE-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541987#comment-17541987
 ] 

Tomoko Uchida commented on LUCENE-10200:


Just wanted to leave a quick note.
I happened to notice that the "lucene-demo" tutorial has been outdated by the 
change in the binary package structure. I opened a PR to correct the 
instructions there (and also to switch to using module path instead of 
classpath). This is a small correction in overview.html; I will merge it if 
there are no objections. 
https://github.com/apache/lucene/pull/923

> Restructure and modernize the release artifacts
> ---
>
> Key: LUCENE-10200
> URL: https://issues.apache.org/jira/browse/LUCENE-10200
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: 9.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> This is an umbrella issue for various sub-tasks as per my e-mail [1].
>  [1] [https://markmail.org/thread/f7yrggnynq2ijgmy]
> In this order, perhaps:
>  * (/) Apply small text file changes (LUCENE-10163)
>  * (/) Simplify artifacts (LUCENE-10199 drop ZIP binary),
>  * (/) LUCENE-10192 drop third party JARs.
>  * -Create an additional binary artifact for Luke (LUCENE-9978).-
>  * (-) -only include relevant binary license/ notice files-
>  * (/) make sure source package can be compiled (no .git folder).
>  * (/) Test everything with the smoke tester.




[jira] [Created] (LUCENE-10591) Invalid character in SortableSingleDocSource.java

2022-05-25 Thread Andras Salamon (Jira)

Andras Salamon created LUCENE-10591:
---

 Summary: Invalid character in SortableSingleDocSource.java
 Key: LUCENE-10591
 URL: https://issues.apache.org/jira/browse/LUCENE-10591
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Andras Salamon


There are invalid UTF-8 characters in SortableSingleDocSource.java

"S�o Tom� and Pr�ncipe"

Sonar gave me a warning because of this.
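
As an illustration of how such bytes can be detected, here is a minimal sketch (not taken from the issue or the PR) that uses a strict CharsetDecoder to check whether a byte sequence is well-formed UTF-8. The class name Utf8Check is hypothetical, and the idea that the offending bytes were ISO-8859-1 encoded is only an assumption made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
  // Returns true only if the bytes decode as UTF-8 without any malformed input.
  static boolean isValidUtf8(byte[] bytes) {
    CharsetDecoder decoder =
        StandardCharsets.UTF_8
            .newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
      decoder.decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Unicode escapes keep this source file pure ASCII.
    String name = "S\u00e3o Tom\u00e9 and Pr\u00edncipe";
    // Assumption: the broken file stored the name as ISO-8859-1 bytes.
    System.out.println(isValidUtf8(name.getBytes(StandardCharsets.ISO_8859_1))); // false
    System.out.println(isValidUtf8(name.getBytes(StandardCharsets.UTF_8))); // true
  }
}
```

Reading a source file with Files.readAllBytes and passing the result to isValidUtf8 would flag a file like the one reported here, provided the assumption about its encoding holds.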




[GitHub] [lucene] asalamon74 opened a new pull request, #925: LUCENE-10591: Fix UTF-8 encoding



asalamon74 opened a new pull request, #925:
URL: https://github.com/apache/lucene/pull/925

   ### Description (or a Jira issue link if you have one)
   
   Fixing invalid UTF-8 characters in SortableSingleDocSource.java
   



[GitHub] [lucene] msokolov commented on pull request #923: LUCENE-10200: Replace classpath with modulepath in the demo tutorial



msokolov commented on PR #923:
URL: https://github.com/apache/lucene/pull/923#issuecomment-1137160783

   Thanks for fixing those (pre-existing) typos. On the classpath/module path 
change I have mixed feelings. On the one hand, we should broadcast that we are 
now fully modularized and support using module paths to declare dependencies on 
code. On the other hand, I don't even know how to "put jars on my module path" 
yet. Maybe I'm just a stick-in-the-mud, but modules still seem very new and I 
suspect many (most) Java devs probably haven't yet figured it out and are still 
using class-path? So, I'm not sure what that means for this documentation. It 
doesn't seem as if this is the right place to explain modules and class paths, 
but we want to make this accessible and easy to use. Maybe we could include 
both sets of directions?



[GitHub] [lucene] dweiss commented on pull request #923: LUCENE-10200: Replace classpath with modulepath in the demo tutorial



dweiss commented on PR #923:
URL: https://github.com/apache/lucene/pull/923#issuecomment-1137165723

   Modules are not new, they're just not widespread... I agree with @mocobeta 
that if you provide an explicit command line then there is little harm in not 
explaining all the options. The problem with classpath is that you need to 
include all JARs individually; this is handled much better by modules since you 
include the directory path (not each individual module JAR).
   
   This said, I don't have a strong opinion about going either way. 
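
   To see which modules a directory-based module path would pick up, a small sketch along these lines may help. It relies on the standard java.lang.module.ModuleFinder API, which accepts directories of modules as entries. The class name ListModules is made up for illustration; the directory names modules and modules-thirdparty are the ones used in the commands in the PR diff.

```java
import java.lang.module.ModuleFinder;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

public class ListModules {
  // Returns the sorted names of all modules found in the given module path entries.
  static List<String> moduleNames(Path... entries) {
    return ModuleFinder.of(entries).findAll().stream()
        .map(ref -> ref.descriptor().name())
        .sorted()
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Run from the root of an extracted binary distribution (assumed layout).
    moduleNames(Path.of("modules"), Path.of("modules-thirdparty"))
        .forEach(System.out::println);
  }
}
```

   Only the two directories are passed in, with no per-JAR list, which is the convenience described above.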



[GitHub] [lucene] msokolov commented on a diff in pull request #873: LUCENE-10397: KnnVectorQuery doesn't tie break by doc ID



msokolov commented on code in PR #873:
URL: https://github.com/apache/lucene/pull/873#discussion_r881581848


##
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborQueue.java:
##
@@ -90,26 +92,34 @@ public boolean insertWithOverflow(int newNode, float 
newScore) {
   }
 
   private long encode(int node, float score) {
-return order.apply((((long) NumericUtils.floatToSortableInt(score)) << 32) 
| node);
+int nodeReverse = reversed ? node : Integer.MAX_VALUE - node;
+return order.apply((((long) NumericUtils.floatToSortableInt(score)) << 32) 
| nodeReverse);
   }
 
   /** Removes the top element and returns its node id. */
   public int pop() {
-return (int) order.apply(heap.pop());
+return reversed

Review Comment:
   can we move this logic into `Order.apply`? If we do that we can avoid a 
conditional in this hot spot
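
   For readers following the diff, here is a minimal standalone sketch of the packing idea: the score goes into the high 32 bits of a long and the node id into the low 32 bits, so that equal scores are ordered by node id. This is not the PR's code. It covers only the non-reversed branch, the class name TieBreakEncoding is hypothetical, and floatToSortableInt is a stand-in written for this sketch rather than a call into Lucene's NumericUtils.

```java
public class TieBreakEncoding {
  // Stand-in for NumericUtils.floatToSortableInt: maps a float to an int whose
  // signed ordering matches the ordering of the original float values.
  static int floatToSortableInt(float value) {
    int bits = Float.floatToIntBits(value);
    return bits ^ ((bits >> 31) & 0x7fffffff);
  }

  // Packs the score into the high 32 bits and (Integer.MAX_VALUE - node) into
  // the low 32 bits, so that for equal scores a smaller node id encodes larger.
  static long encode(int node, float score) {
    int nodeReverse = Integer.MAX_VALUE - node;
    return (((long) floatToSortableInt(score)) << 32) | nodeReverse;
  }

  // Recovers the node id from the low 32 bits.
  static int decodeNode(long encoded) {
    return Integer.MAX_VALUE - (int) encoded;
  }

  public static void main(String[] args) {
    System.out.println(encode(3, 0.5f) > encode(7, 0.5f)); // true: same score, lower id wins
    System.out.println(encode(7, 0.6f) > encode(3, 0.5f)); // true: higher score dominates
    System.out.println(decodeNode(encode(42, 0.25f))); // 42
  }
}
```

   Because the node transformation lives entirely in encode and decodeNode here, the same shape would let the ordering-specific part sit in one place rather than behind a conditional at each call site.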




[GitHub] [lucene] mocobeta commented on pull request #923: LUCENE-10200: Replace classpath with modulepath in the demo tutorial



mocobeta commented on PR #923:
URL: https://github.com/apache/lucene/pull/923#issuecomment-1137171363

   Anyway, the outdated instructions need to be corrected.
   Maybe we can write both working commands for classpath and module path? Let 
me adjust it...



[GitHub] [lucene] msokolov commented on pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial



msokolov commented on PR #923:
URL: https://github.com/apache/lucene/pull/923#issuecomment-1137209317

   >  The problem with classpath is that you need to include all JARs 
individually; this is handled much better by modules since you include the 
directory path (not each individual module JAR).
   
   Oh, that is better! Although I have gotten used to using `*.jar` in 
classpaths, which helps shrink them down to be more manageable.



[GitHub] [lucene] mocobeta commented on a diff in pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial



mocobeta commented on code in PR #923:
URL: https://github.com/apache/lucene/pull/923#discussion_r881678572


##
lucene/demo/src/java/overview.html:
##
@@ -49,40 +49,49 @@ About the Demo
 demonstrates various functionalities of Lucene and how you can add Lucene to
 your applications.
 
-
-Setting your CLASSPATH
+
+Setting your module path (or classpath)
 
 First, you should http://www.apache.org/dyn/closer.cgi/lucene/java/";>download the latest
 Lucene distribution and then extract it to a working directory.
-You need four JARs: the Lucene JAR, the queryparser JAR, the common 
analysis JAR, and the Lucene
-demo JAR. You should see the Lucene JAR file in the modules/ directory you 
created
-when you extracted the archive -- it should be named something like
+You need Lucene demo and a few dependent modules.
+You should see the Lucene modules (JARs) in the modules/ and third party 
modules in the modules-thirdparty/ directory
+you created when you extracted the archive -- it should be named something like
 lucene-core-{version}.jar. You should also see
 files called lucene-queryparser-{version}.jar,
 lucene-analysis-common-{version}.jar and lucene-demo-{version}.jar under queryparser, 
analysis/common/ and demo/,
-respectively.
-Put all four of these files in your Java CLASSPATH.
+"codefrag">lucene-demo-{version}.jar under modules directory.
+There are two ways to run Java program: with module path or with classpath. 
Either way is fine. Put all of these files in your Java module path or 
classpath.
 
 
 Indexing Files
 
 Once you've gotten this far you're probably itching to go. Let's build an
-index! Assuming you've set your CLASSPATH correctly, just type:
+index! Just type either command:
+With module path
 
-java org.apache.lucene.demo.IndexFiles -docs {path-to-lucene}
+java --module-path modules:modules-thirdparty --module 
org.apache.lucene.demo/org.apache.lucene.demo.IndexFiles -docs {path-to-lucene}
+
+With classpath
+
+java -cp "modules/*:modules-thirdparty/*" 
org.apache.lucene.demo.IndexFiles -docs {path-to-lucene}

Review Comment:
   Putting all jars on the classpath with a wildcard is not good practice; 
however, I don't think we can maintain a correct jar list (the latest tutorial 
proved this: its jar list had been outdated long before 9.0, which is another 
reason why I'd prefer module path).




[GitHub] [lucene] mocobeta commented on pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial



mocobeta commented on PR #923:
URL: https://github.com/apache/lucene/pull/923#issuecomment-1137271336

   I updated the text so that we have both working commands for module path and 
classpath in it. Please see the updated screenshot in the PR description to see 
how it looks, thanks.



[jira] [Commented] (LUCENE-10590) Indexing all zero vectors leads to heat death of the universe

2022-05-25 Thread Michael Sokolov (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542068#comment-17542068
 ] 

Michael Sokolov commented on LUCENE-10590:
--

> Does the indexing logic rely on tie breaking by node ID? If not, maybe 
> index-time graph search could stop as soon as the k-th nearest vector is 
> equal to the input vector?

Seems like that could work, although to date we use the same search 
implementation at index time and search time, which is a nice simplification. 
Perhaps in such a case we could also sacrifice the docid tiebreaking, given that 
it is going to be best effort only.

> Indexing all zero vectors leads to heat death of the universe
> -
>
> Key: LUCENE-10590
> URL: https://issues.apache.org/jira/browse/LUCENE-10590
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Michael Sokolov
>Priority: Major
>
> By accident while testing something else, I ran a luceneutil test indexing 1M 
> 100d vectors where all the vectors were all zeroes. This caused indexing to 
> take a very long time (~40x normal - it did eventually complete) and the 
> search performance was similarly bad.  We should not degrade by orders of 
> magnitude with even the worst data though.
> I'm not entirely sure what the issue is, but perhaps as long as we keep 
> finding hits that are "better" we keep exploring the graph, where better 
> means (score, -docid) >= (lowest score, -docid). If that's right and all docs 
> have the same score, then we probably need to either switch to > (but this 
> could lead to poorer recall in normal cases) or introduce some kind of 
> minimum score threshold?




[jira] [Comment Edited] (LUCENE-10590) Indexing all zero vectors leads to heat death of the universe

2022-05-25 Thread Michael Sokolov (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542068#comment-17542068
 ] 

Michael Sokolov edited comment on LUCENE-10590 at 5/25/22 2:09 PM:
---

bq. Does the indexing logic rely on tie breaking by node ID? If not, maybe 
index-time graph search could stop as soon as the k-th nearest vector is equal 
to the input vector?

Seems like that could work, although to date we use the same search 
implementation at index time and search time, which is a nice simplification. 
Perhaps in such a case we could also sacrifice the docid tiebreaking, given that 
it is going to be best effort only.


was (Author: sokolov):
> Does the indexing logic rely on tie breaking by node ID? If not, maybe 
> index-time graph search could stop as soon as the k-th nearest vector is 
> equal to the input vector?

Seems like that could work although to date we use the same search 
implementation at index time and search time, which is a nice simplification. 
Perhaps in such a case we could also sacrifice the docid tiebreaking given that 
is going to be best effort only





[GitHub] [lucene] msokolov commented on pull request #924: Create Lucene93 Codec and move Lucene92 to backwards_codecs



msokolov commented on PR #924:
URL: https://github.com/apache/lucene/pull/924#issuecomment-1137306689

   Thanks for the reminder about the unit tests - I will add. As for the 
approach of using a feature branch, I'm ambivalent. It seems better to me to 
separate out the "new codec version" commit with all of its boilerplate from 
the actual changes to be made to the codec, to make it easier to review and 
understand. Certainly that can be done on a feature branch, but I'm not sure 
why we need a branch.



[GitHub] [lucene] dweiss commented on a diff in pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial



dweiss commented on code in PR #923:
URL: https://github.com/apache/lucene/pull/923#discussion_r881755153


##
lucene/demo/src/java/overview.html:
##
@@ -49,40 +49,49 @@ About the Demo
 demonstrates various functionalities of Lucene and how you can add Lucene to
 your applications.
 
-
-Setting your CLASSPATH
+
+Setting your module path (or classpath)
 
 First, you should <a href="http://www.apache.org/dyn/closer.cgi/lucene/java/">download</a> the latest
 Lucene distribution and then extract it to a working directory.
-You need four JARs: the Lucene JAR, the queryparser JAR, the common analysis JAR, and the Lucene
-demo JAR. You should see the Lucene JAR file in the modules/ directory you created
-when you extracted the archive -- it should be named something like
+You need Lucene demo and a few dependent modules.
+You should see the Lucene modules (JARs) in the modules/ and third party modules in the modules-thirdparty/ directory
+you created when you extracted the archive -- it should be named something like
 lucene-core-{version}.jar. You should also see
 files called lucene-queryparser-{version}.jar,
 lucene-analysis-common-{version}.jar and lucene-demo-{version}.jar under queryparser, analysis/common/ and demo/,
-respectively.
-Put all four of these files in your Java CLASSPATH.
+lucene-demo-{version}.jar under modules directory.
+There are two ways to run Java program: with module path or with classpath. Either way is fine. Put all of these files in your Java module path or classpath.
 
 
 Indexing Files
 
 Once you've gotten this far you're probably itching to go. Let's build an
-index! Assuming you've set your CLASSPATH correctly, just type:
+index! Just type either command:
+With module path
 
-java org.apache.lucene.demo.IndexFiles -docs {path-to-lucene}
+java --module-path modules:modules-thirdparty --module org.apache.lucene.demo/org.apache.lucene.demo.IndexFiles -docs {path-to-lucene}
+
+With classpath
+
+java -cp "modules/*:modules-thirdparty/*" org.apache.lucene.demo.IndexFiles -docs {path-to-lucene}

Review Comment:
   I don't think it'll work. Java doesn't expand wildcards in arguments. Also, 
the path delimiter varies between platforms (Windows uses a semicolon)...




[GitHub] [lucene] dweiss commented on a diff in pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial



dweiss commented on code in PR #923:
URL: https://github.com/apache/lucene/pull/923#discussion_r881759841


##
lucene/demo/src/java/overview.html:
##

Review Comment:
   I double checked and, with some surprise, discovered that it does support a 
quirky glob format (has to be a single *, not a full glob). Anyway, I wouldn't 
bet this works across platforms with the colon and slashes in the cp 
argument... 
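
The delimiter concern above can be sidestepped when the class path is built programmatically. The following is a minimal sketch, assuming the `modules` and `modules-thirdparty` directory names from the diff; `ClasspathSketch` and `wildcardClasspath` are hypothetical names introduced only for illustration. It joins `dir/*` entries with `java.io.File.pathSeparator`, which is `:` on Unix-like systems and `;` on Windows.

```java
import java.io.File;

/** Sketch (hypothetical helper): builds a -cp argument with the platform's path separator. */
final class ClasspathSketch {

  /** Joins directory wildcard entries ("dir/*") using the given separator. */
  static String wildcardClasspath(String separator, String... dirs) {
    StringBuilder sb = new StringBuilder();
    for (String dir : dirs) {
      if (sb.length() > 0) {
        sb.append(separator);
      }
      // A trailing "*" asks the launcher to include every JAR file in the directory.
      sb.append(dir).append("/*");
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // File.pathSeparator resolves to ":" on Unix-like systems and ";" on Windows.
    System.out.println(wildcardClasspath(File.pathSeparator, "modules", "modules-thirdparty"));
  }
}
```

On a Unix-like system this prints `modules/*:modules-thirdparty/*`, the same value the tutorial passes to `-cp`. The sketch keeps forward slashes inside each entry on the assumption that the Windows launcher accepts them; that has not been verified here.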




[GitHub] [lucene] mocobeta commented on a diff in pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial



mocobeta commented on code in PR #923:
URL: https://github.com/apache/lucene/pull/923#discussion_r881765516


##
lucene/demo/src/java/overview.html:
##

Review Comment:
   It works; I confirmed this command. There is an "Understanding class path 
wildcards" section in the documentation.
   https://docs.oracle.com/javase/7/docs/technotes/tools/windows/classpath.html
   
   As for the delimiter, I didn't think we should list all the commands for 
both Windows and Linux/Mac, so I omitted Windows here. Sorry, but the tutorial 
has been written for Unix-like platforms from the beginning.




[GitHub] [lucene] mocobeta commented on a diff in pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial



mocobeta commented on code in PR #923:
URL: https://github.com/apache/lucene/pull/923#discussion_r881797523


##
lucene/demo/src/java/overview.html:
##

Review Comment:
   But of course, we can add a section for the Windows platform. I'm not sure 
how far we should go, but if we want to provide "working" commands that require 
no prior knowledge on both Unix-like and Windows platforms, should we accept 
the verbosity?




[GitHub] [lucene] mocobeta commented on pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial



mocobeta commented on PR #923:
URL: https://github.com/apache/lucene/pull/923#issuecomment-1137448834

   I first thought it would be sufficient to have a concrete working command 
for a Unix-like platform that is based on the module path, but it seems things 
are not so obvious. I don't know what I should do here - I'll leave it for now.



[GitHub] [lucene] msokolov commented on pull request #924: Create Lucene93 Codec and move Lucene92 to backwards_codecs



msokolov commented on PR #924:
URL: https://github.com/apache/lucene/pull/924#issuecomment-1137491647

   I updated this PR so that:
   1. the back-compat Lucene92 Codec no longer has the ability to write the 
HNSW vector format
   2. there are unit tests that verify we can still read the Lucene92 vector 
format, and the vectors writer has been moved into the tests to support that.



[GitHub] [lucene] dweiss commented on a diff in pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial



dweiss commented on code in PR #923:
URL: https://github.com/apache/lucene/pull/923#discussion_r881886986


##
lucene/demo/src/java/overview.html:
##

Review Comment:
   Ok, let's leave Windows out of it. People on Windows will know what to do, I 
think.




[GitHub] [lucene] mocobeta commented on pull request #923: LUCENE-10200: Correct outdated instruction in the demo tutorial



mocobeta commented on PR #923:
URL: https://github.com/apache/lucene/pull/923#issuecomment-1137531284

   Just a note... what I wanted to fix is that the current tutorial is outdated 
on many points - I don't think people can run the demo app without trial and 
error. 
   Classpath vs. module path shouldn't be the main concern here, although I 
think the module-path-based explanation is reasonable from several viewpoints. 
I may have misled the conversation from the start if we have fallen into 
bikeshedding.



[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities



mdmarshmallow commented on code in PR #841:
URL: https://github.com/apache/lucene/pull/841#discussion_r881945764


##
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/LongPointFacetField.java:
##
@@ -0,0 +1,35 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.hyperrectangle;
+
+import org.apache.lucene.document.BinaryDocValuesField;
+import org.apache.lucene.document.LongPoint;
+
+/** Packs an array of longs into a {@link BinaryDocValuesField} */
+public class LongPointFacetField extends BinaryDocValuesField {

Review Comment:
   I was wondering what your thoughts were on just using separate numeric 
fields rather than packing them. I think this would make the API "nicer", to be 
honest, but the big drawback would be that we would need some hacky multivalued 
implementation. I can think of some ways to build some sort of 
UnsortedNumericDV on top of SortedNumericDV, but they would all be super hacky, 
have limitations, and probably not be worth implementing.
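
For readers unfamiliar with the packing approach under discussion, here is a minimal sketch of storing an array of longs in a single binary value. It is an assumption-laden illustration, not the PR's implementation: `LongPackingSketch`, `pack` and `unpack` are hypothetical names, and the encoding is a plain big-endian `ByteBuffer` layout, whereas the PR imports `LongPoint` and may encode values differently.

```java
import java.nio.ByteBuffer;

/** Sketch (hypothetical, not the PR's code): packs longs into one byte array and back. */
final class LongPackingSketch {

  /** Writes each value as 8 big-endian bytes, preserving the caller's order. */
  static byte[] pack(long... values) {
    ByteBuffer buf = ByteBuffer.allocate(values.length * Long.BYTES);
    for (long v : values) {
      buf.putLong(v);
    }
    return buf.array();
  }

  /** Reads the values back in the order they were packed. */
  static long[] unpack(byte[] packed) {
    if (packed.length % Long.BYTES != 0) {
      throw new IllegalArgumentException("length must be a multiple of " + Long.BYTES);
    }
    ByteBuffer buf = ByteBuffer.wrap(packed);
    long[] values = new long[packed.length / Long.BYTES];
    for (int i = 0; i < values.length; i++) {
      values[i] = buf.getLong();
    }
    return values;
  }

  public static void main(String[] args) {
    byte[] packed = pack(3L, -1L, 42L);
    System.out.println(packed.length + " bytes, first value " + unpack(packed)[0]);
  }
}
```

The point of packing is that the dimensions of one point stay together and keep their order, which is what a multivalued numeric doc values field that sorts its values per document would not guarantee.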




[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira?



[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542180#comment-17542180
 ] 

Tomoko Uchida commented on LUCENE-10557:


We are having a long discussion thread on the dev list, and many issues have 
been raised. Here is a short summary (with my brief thoughts/opinions).
 * Concerns about the political neutrality of GitHub - in other words, concerns 
about account bans with no good reason
 ** There seem to be several cases (including rumors) of GitHub account bans. 
It's unclear whether they violated GitHub's terms or not, and it seems to me 
that we won't be able to correctly assess the risk. I would defer the judgment 
to individuals.
 ** For developers who don't use GitHub, regardless of the reason, we will 
always support contribution paths that do not rely on GitHub. Patches via Jira 
will remain a decent option indefinitely.
 * Concerns about its parent company, Microsoft
 ** I'd defer the judgment on that to individuals, for the same reason as for 
the previous subject. One thing I can say is that the recent trend in their 
direction is GOOD - they support/sponsor OSS and Java communities and even 
publish very popular open-source software (VSCode and LightGBM are outstanding 
examples, I think).
 * Concerns about the lack of an issue workflow and the simpler metadata 
management
 ** From a practical viewpoint, it fully makes sense to me that many people 
raised this. We would need to think carefully about how to manage versions and 
issue/PR metadata. Large projects that operate fully on GitHub overcome this 
shortcoming in various ways - organized issue templates with fixed label sets 
would be one example. I think we will set up a sandbox repository outside the 
ASF and try some experiments on it before the actual migration.
 * Security issues that only PMC members are allowed to access
 ** We will be able to continue to use Jira for this purpose, or we could even 
have an issue-only private GitHub repository for Lucene?
 * Concerns about migrating the whole Jira issue history to GitHub issues
 ** I don't think it is possible. I'm almost sure there will be some 
information loss if we attempt to migrate all Jira issues with their 
metadata/history into GitHub. Rather than trying to do that, I would prefer to 
leave the Jira issues as they are and simply refer to them.
 ** If we don't aim for perfection, I think we'll be able to migrate all (or 
some) of the issues with the APIs that Shad Storhaug kindly shared in this 
comment.

Aside from those concerns, there seems to be no disagreement that GitHub is 
superior to Jira in terms of overall UX design, and most new developers like it.

> Migrate to GitHub issue from Jira?
> --
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Priority: Major
>
> A few (not the majority) Apache projects already use the GitHub issue instead 
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible that we move to GitHub issue. I have 
> little knowledge of how to proceed with it, I'd like to discuss whether we 
> should migrate to it, and if so, how to smoothly handle the migration.
> The major tasks would be:
>  * Get a consensus about the migration among committers
>  * Enable Github issue on the lucene's repository (currently, it is disabled 
> on it)
>  * Build the convention or rules for issue label/milestone management
>  * Choose issues that should be moved to GitHub (I think too old or obsolete 
> issues can remain Jira.)




[jira] [Comment Edited] (LUCENE-10557) Migrate to GitHub issue from Jira?

[
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542180#comment-17542180
]

Tomoko Uchida edited comment on LUCENE-10557 at 5/25/22 6:27 PM:
-

We are having a long discussion thread on the dev list, and many issues have
been raised. Here is a short summary (with my brief thoughts/opinions).
* Concerns about the political neutrality of GitHub - in other words, concerns
about account bans with no good reason
** There seem to be several cases (including rumors) of GitHub account bans.
It's unclear whether they violated GitHub's terms or not, and it seems to me
that we won't be able to correctly assess the risk. I would defer the judgment
to individuals.
** For developers who don't use GitHub, regardless of the reason, we will
always support contribution paths that do not rely on GitHub. Patches via Jira
will remain a decent option indefinitely.
* Concerns about its parent company, Microsoft
** I'd defer the judgment on that to individuals, for the same reason as for
the previous subject. One thing I can say is that the recent trend in their
direction is GOOD - they support/sponsor OSS and Java communities and even
publish very popular open-source software (VSCode and LightGBM are outstanding
examples, I think).
* Concerns about the lack of an issue workflow and the simpler metadata
management
** From a practical viewpoint, it fully makes sense to me that many people
raised this. We would need to think carefully about how to manage versions and
issue/PR metadata. Large projects that operate fully on GitHub overcome this
shortcoming in various ways - organized issue templates with fixed label sets
would be one example. I think we will set up a sandbox repository outside the
ASF and try some experiments on it before the actual migration.
* Security issues that only PMC members are allowed to access
** We will be able to continue to use Jira for this purpose, or we could even
have an issue-only private GitHub repository for Lucene?
* Concerns about migrating the whole Jira issue history to GitHub issues
** I don't think it is possible. I'm almost sure there will be some
information loss if we attempt to migrate all Jira issues with their
metadata/history into GitHub. Rather than trying to do that, I would prefer to
leave the Jira issues as they are and simply refer to them.
** If we don't aim for perfection, I think we'll be able to migrate all (or
some) of the issues with the APIs that Shad Storhaug kindly shared in this
comment.

Aside from those concerns, there seems to be no disagreement that GitHub is
superior to Jira in terms of overall UX design, and most new developers like it.


[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities



mdmarshmallow commented on code in PR #841:
URL: https://github.com/apache/lucene/pull/841#discussion_r881945764


##
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/LongPointFacetField.java:
##
@@ -0,0 +1,35 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.hyperrectangle;
+
+import org.apache.lucene.document.BinaryDocValuesField;
+import org.apache.lucene.document.LongPoint;
+
+/** Packs an array of longs into a {@link BinaryDocValuesField} */
+public class LongPointFacetField extends BinaryDocValuesField {

Review Comment:
   I was wondering what your thoughts were on just using separate numeric 
fields rather than packing them. I think this would make the API "nicer" to be 
honest, but the big drawback would be that we would need some hacky multivalued 
implementation. I can think of some ways to build some sort of 
UnsortedNumericDV on top of SortedNumericDV, but they would all be super hacky 
and have limitations and probably not worth implementing.
   
   Edit: Upon thinking about this further, my suggestion doesn't make sense 
when we have multi-valued fields
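
   For readers following the packing-versus-separate-fields question, here is a minimal sketch of what "packing an array of longs" into a single binary value can look like. `PackedLongs` and its methods are hypothetical names used only for illustration; the encoding the PR actually uses is not shown in this message and may differ (for example, it may use a sortable byte encoding).

```java
import java.nio.ByteBuffer;

/** Hypothetical helper, not part of Lucene: packs long values into one byte array. */
final class PackedLongs {
  private PackedLongs() {}

  /** Encodes each long as 8 big-endian bytes, preserving the order of the values. */
  static byte[] pack(long... values) {
    ByteBuffer buffer = ByteBuffer.allocate(values.length * Long.BYTES);
    for (long value : values) {
      buffer.putLong(value);
    }
    return buffer.array();
  }

  /** Decodes a byte array produced by pack back into its long values. */
  static long[] unpack(byte[] packed) {
    if (packed.length % Long.BYTES != 0) {
      throw new IllegalArgumentException("length must be a multiple of " + Long.BYTES);
    }
    ByteBuffer buffer = ByteBuffer.wrap(packed);
    long[] values = new long[packed.length / Long.BYTES];
    for (int i = 0; i < values.length; i++) {
      values[i] = buffer.getLong();
    }
    return values;
  }
}
```

   Because the whole point is stored as one value, the order of the dimensions is kept, which is the property that separate multi-valued numeric fields would not give without extra work.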




[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities



mdmarshmallow commented on code in PR #841:
URL: https://github.com/apache/lucene/pull/841#discussion_r881945764


##
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/LongPointFacetField.java:
##
@@ -0,0 +1,35 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.hyperrectangle;
+
+import org.apache.lucene.document.BinaryDocValuesField;
+import org.apache.lucene.document.LongPoint;
+
+/** Packs an array of longs into a {@link BinaryDocValuesField} */
+public class LongPointFacetField extends BinaryDocValuesField {

Review Comment:
   ~~I was wondering what your thoughts were on just using separate numeric 
fields rather than packing them. I think this would make the API "nicer" to be 
honest, but the big drawback would be that we would need some hacky multivalued 
implementation. I can think of some ways to build some sort of 
UnsortedNumericDV on top of SortedNumericDV, but they would all be super hacky 
and have limitations and probably not worth implementing.~~
   
   Edit: Upon thinking about this further, my suggestion doesn't make sense 
when we have multi-valued fields




[jira] [Comment Edited] (LUCENE-10557) Migrate to GitHub issue from Jira?

[
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542180#comment-17542180
]

Tomoko Uchida edited comment on LUCENE-10557 at 5/25/22 6:31 PM:
-

Aside from those concerns, there seems to be no disagreement that GitHub is superior
to Jira in terms of overall UX design, and most new developers like it.

[jira] [Comment Edited] (LUCENE-10557) Migrate to GitHub issue from Jira?

[
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542180#comment-17542180
]

Tomoko Uchida edited comment on LUCENE-10557 at 5/25/22 6:37 PM:
-

We are having a long discussion thread on the dev list, and many issues have been raised.
Here is a short summary (with my brief thoughts/opinions).
* Concerns about the political neutrality of GitHub - in other words, concerns about
account bans with no good reason
** There seem to be several cases (including rumors) of GitHub account bans.
It's unclear to me whether those accounts violated GitHub's terms or not, and we won't be
able to correctly assess the risk. I would defer the judgment to the
individuals.
** For developers who don't use GitHub for whatever reason, we will always
support contribution paths that do not rely on GitHub. Patches via Jira will
remain a decent option for good.
* Concerns about its parent company, Microsoft
** I'd defer the judgment on that to the individuals, for the same reason as for
the previous subject. One thing I can say is that the recent trend in their
direction is GOOD - they support/sponsor OSS and Java communities and even
publish very popular open-source software (VSCode and LightGBM are outstanding
examples, I think).
* Concerns about the lack of issue workflow and the simpler metadata management
** From a practical viewpoint, it fully makes sense to me that many people
raised this. We would need to think carefully about how to manage versions
and issue/PR metadata. Large projects that are fully operated on GitHub
overcome this shortcoming in various ways - organized issue templates with
fixed label sets would be an example. I think we will have a sandbox repository
outside ASF, then try some experiments on it before the actual migration.
* Security issues that only PMC members are allowed to access
** We will be able to continue to use Jira for this purpose, or we could even
have an issue-only private GitHub repository for Lucene?
* Concerns about migrating the whole Jira issue history to GitHub issues
** I don't think it is possible. I'm almost sure there will be some
information loss if we attempt to migrate all Jira issues with their
metadata/history into GitHub. Rather than trying to do that, I would prefer to
leave Jira issues as they are, then simply refer to them.
** If we don't aim at perfection, I think we'll be able to migrate all (or
part of) the issues with APIs, as Shad Storhaug kindly shared in this comment.

Aside from those concerns, there seems to be no disagreement that GitHub is superior
to Jira in terms of overall UX design, and most new developers like it.

[jira] [Created] (LUCENE-10592) Should we build HNSW graph on the fly during indexing

2022-05-25 Thread Mayya Sharipova (Jira)

Mayya Sharipova created LUCENE-10592:


 Summary: Should we build HNSW graph on the fly during indexing
 Key: LUCENE-10592
 URL: https://issues.apache.org/jira/browse/LUCENE-10592
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Mayya Sharipova


Currently, when we index vectors for KnnVectorField, we buffer those vectors in 
memory, and on flush, during segment construction, we build an HNSW graph.  As 
building an HNSW graph is very expensive, this makes the flush operation take a lot 
of time. This also makes overall indexing performance quite unpredictable (as 
the number of flushes is determined by memory used and the presence of 
concurrent searches), e.g. some indexing operations return almost instantly 
while others that trigger a flush take a lot of time. 

Building an HNSW graph on the fly as we index vectors would allow us to avoid this 
problem and spread the load of HNSW graph construction evenly. 
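
To make the idea concrete, here is a toy sketch of linking each vector into a neighbor graph at the moment it is added, so that no graph construction is left for flush time. It is an illustration under loud assumptions: `IncrementalNeighborGraph` is a hypothetical class, the graph has a single layer, and a brute-force scan stands in for the beam search a real HNSW builder would use. It is not Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy single-layer neighbor graph built incrementally; NOT Lucene's HNSW code. */
final class IncrementalNeighborGraph {
  private final int maxConn;
  private final List<float[]> vectors = new ArrayList<>();
  private final List<List<Integer>> neighbors = new ArrayList<>();

  IncrementalNeighborGraph(int maxConn) {
    this.maxConn = maxConn;
  }

  /** Adds one vector and links it immediately; returns the new node id. */
  int addVector(float[] vector) {
    int node = vectors.size();
    // Assumption: brute-force ranking of all existing nodes replaces HNSW's beam search.
    List<Integer> candidates = new ArrayList<>();
    for (int i = 0; i < node; i++) {
      candidates.add(i);
    }
    candidates.sort(Comparator.comparingDouble(i -> squaredDistance(vectors.get(i), vector)));
    List<Integer> links =
        new ArrayList<>(candidates.subList(0, Math.min(maxConn, candidates.size())));
    vectors.add(vector);
    neighbors.add(links);
    for (int other : links) {
      List<Integer> back = neighbors.get(other);
      back.add(node);
      if (back.size() > maxConn) {
        // Keep the degree bounded by dropping the farthest neighbor.
        float[] base = vectors.get(other);
        back.sort(Comparator.comparingDouble(i -> squaredDistance(vectors.get(i), base)));
        back.remove(back.size() - 1);
      }
    }
    return node;
  }

  /** Returns a copy of the current neighbor list of a node. */
  List<Integer> neighborsOf(int node) {
    return new ArrayList<>(neighbors.get(node));
  }

  int size() {
    return vectors.size();
  }

  private static double squaredDistance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}
```

In this model every `addVector` call pays a bounded share of the graph-building cost, which is the "spread the load" behavior described above; the trade-off is that each individual indexing call becomes somewhat more expensive.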




[jira] [Updated] (LUCENE-10592) Should we build HNSW graph on the fly during indexing

2022-05-25 Thread Mayya Sharipova (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayya Sharipova updated LUCENE-10592:
-
Description: 
Currently, when we index vectors for KnnVectorField, we buffer those vectors in 
memory, and on flush, during segment construction, we build an HNSW graph.  As 
building an HNSW graph is very expensive, this makes the flush operation take a lot 
of time. This also makes overall indexing performance quite unpredictable (as 
the number of flushes is determined by memory used and the presence of 
concurrent searches), e.g. some indexing operations return almost instantly 
while others that trigger a flush take a lot of time. 

Building an HNSW graph on the fly as we index vectors would allow us to avoid this 
problem and spread the load of HNSW graph construction evenly during indexing.

  was:
Currently, when we index vectors for KnnVectorField, we buffer those vectors in 
memory and on flush during a segment construction we build an HNSW graph.  As 
building an HNSW graph is very expensive, this makes flush operation take a lot 
of time. This also makes overall indexing performance quite unpredictable (as 
the number of flushes are defined by memory used, and the presence of 
concurrent searches), e.g. some indexing operations return almost instantly 
while others that trigger flush take a lot of time. 

Building an HNSW graph on the fly as we index vectors allows to avoid this 
problem, and spread a load of HNSW graph construction evenly. 


> Should we build HNSW graph on the fly during indexing
> -
>
> Key: LUCENE-10592
> URL: https://issues.apache.org/jira/browse/LUCENE-10592
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Priority: Minor
>
> Currently, when we index vectors for KnnVectorField, we buffer those vectors 
> in memory and on flush during a segment construction we build an HNSW graph.  
> As building an HNSW graph is very expensive, this makes flush operation take 
> a lot of time. This also makes overall indexing performance quite 
> unpredictable (as the number of flushes are defined by memory used, and the 
> presence of concurrent searches), e.g. some indexing operations return almost 
> instantly while others that trigger flush take a lot of time. 
> Building an HNSW graph on the fly as we index vectors allows to avoid this 
> problem, and spread a load of HNSW graph construction evenly during indexing.




[jira] [Updated] (LUCENE-10592) Should we build HNSW graph on the fly during indexing

2022-05-25 Thread Mayya Sharipova (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayya Sharipova updated LUCENE-10592:
-
Description: 
Currently, when we index vectors for KnnVectorField, we buffer those vectors in 
memory, and on flush, during segment construction, we build an HNSW graph.  As 
building an HNSW graph is very expensive, this makes the flush operation take a lot 
of time. This also makes overall indexing performance quite unpredictable (as 
the number of flushes is determined by memory used and the presence of 
concurrent searches), e.g. some indexing operations return almost instantly 
while others that trigger a flush take a lot of time. 

Building an HNSW graph on the fly as we index vectors would allow us to avoid this 
problem and spread the load of HNSW graph construction evenly during indexing.

This will also supersede LUCENE-10194

  was:
Currently, when we index vectors for KnnVectorField, we buffer those vectors in 
memory and on flush during a segment construction we build an HNSW graph.  As 
building an HNSW graph is very expensive, this makes flush operation take a lot 
of time. This also makes overall indexing performance quite unpredictable (as 
the number of flushes are defined by memory used, and the presence of 
concurrent searches), e.g. some indexing operations return almost instantly 
while others that trigger flush take a lot of time. 

Building an HNSW graph on the fly as we index vectors allows to avoid this 
problem, and spread a load of HNSW graph construction evenly during indexing.


> Should we build HNSW graph on the fly during indexing
> -
>
> Key: LUCENE-10592
> URL: https://issues.apache.org/jira/browse/LUCENE-10592
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Priority: Minor
>
> Currently, when we index vectors for KnnVectorField, we buffer those vectors 
> in memory and on flush during a segment construction we build an HNSW graph.  
> As building an HNSW graph is very expensive, this makes flush operation take 
> a lot of time. This also makes overall indexing performance quite 
> unpredictable (as the number of flushes are defined by memory used, and the 
> presence of concurrent searches), e.g. some indexing operations return almost 
> instantly while others that trigger flush take a lot of time. 
> Building an HNSW graph on the fly as we index vectors allows to avoid this 
> problem, and spread a load of HNSW graph construction evenly during indexing.
> This will also supersede LUCENE-10194




[GitHub] [lucene] mayya-sharipova commented on pull request #728: LUCENE-10194 Buffer KNN vectors on disk



mayya-sharipova commented on PR #728:
URL: https://github.com/apache/lucene/pull/728#issuecomment-1137833962

   @LuXugang  Thanks for looking into this.  I was thinking of closing this issue 
and this PR. As @jtibshirani noted, the problem with this approach is that a flush 
or a segment creation may take a very substantial amount of time. 
   
   Instead, I was thinking of taking a different approach to how we index vectors: 
building an HNSW graph on the fly while indexing, as explained in 
[LUCENE-10592](https://issues.apache.org/jira/browse/LUCENE-10592). 
   
   



[GitHub] [lucene-solr] kiranchitturi opened a new pull request, #2662: SOLR-16215 Escape query characters in Solr SQL Array UDF functions



kiranchitturi opened a new pull request, #2662:
URL: https://github.com/apache/lucene-solr/pull/2662

   * Backport of https://github.com/apache/solr/pull/879



[GitHub] [lucene] jtibshirani commented on a diff in pull request #924: Create Lucene93 Codec and move Lucene92 to backwards_codecs



jtibshirani commented on code in PR #924:
URL: https://github.com/apache/lucene/pull/924#discussion_r882189288


##
lucene/backward-codecs/src/test/org/apache/lucene/backward_codecs/lucene92/TestLucene92HnswVectorsFormat.java:
##
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.backward_codecs.lucene92;
+
+import org.apache.lucene.codecs.Codec;
+import org.apache.lucene.codecs.KnnVectorsFormat;
+import org.apache.lucene.tests.index.BaseKnnVectorsFormatTestCase;
+
+public class TestLucene92HnswVectorsFormat extends BaseKnnVectorsFormatTestCase {
+  @Override
+  protected Codec getCodec() {
+return new Lucene92RWCodec();
+  }
+
+  public void testToString() {
+Codec customCodec =
+new Lucene92RWCodec() {
+  @Override
+  public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
+return new Lucene92RWHnswVectorsFormat();
+  }
+};
+String expectedString = "Lucene92RWHnswVectorsFormat";

Review Comment:
   Is there a reason to take a different approach to `toString` for this format 
than we did for the older ones like `Lucene91RWHnswVectorsFormat`?



##
lucene/backward-codecs/src/test/org/apache/lucene/backward_codecs/lucene92/Lucene92RWHnswVectorsFormat.java:
##
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.backward_codecs.lucene92;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.util.hnsw.HnswGraph;
+
+public final class Lucene92RWHnswVectorsFormat extends Lucene92HnswVectorsFormat {
+
+  /** Default number of maximum connections per node */
+  public static final int DEFAULT_MAX_CONN = 16;
+
+  /**
+   * Default number of the size of the queue maintained while searching during a graph construction.
+   */
+  public static final int DEFAULT_BEAM_WIDTH = 100;
+
+  static final int DIRECT_MONOTONIC_BLOCK_SHIFT = 16;
+
+  /**
+   * Controls how many of the nearest neighbor candidates are connected to the new node. Defaults to
+   * {@link #DEFAULT_MAX_CONN}. See {@link HnswGraph} for more details.
+   */
+  private final int maxConn;
+
+  /**
+   * The number of candidate neighbors to track while searching the graph for each newly inserted
+   * node. Defaults to {@link #DEFAULT_BEAM_WIDTH}. See {@link HnswGraph} for details.
+   */
+  private final int beamWidth;
+
+  /** Constructs a format using default graph construction parameters. */
+  public Lucene92RWHnswVectorsFormat() {

Review Comment:
   Small comment -- for other test formats like `Lucene91RWHnswVectorsFormat` 
we accepted beamWidth and maxConn as parameters and referred to the static 
defaults (like `Lucene91HnswVectorsFormat.DEFAULT_MAX_CONN`). Also we didn't 
have local variables `beamWidth` and `maxConn`. It'd be nice to keep the same 
pattern for consistency.
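
   As an illustration of the pattern described in the comment above, here is a minimal sketch in which the read-write test class takes `maxConn` and `beamWidth` as constructor parameters, reuses the static defaults of its parent, and declares no fields of its own. `ExampleHnswFormat` and `ExampleRWHnswFormat` are hypothetical stand-ins, not the actual Lucene classes, and their constructors and `toString` are assumptions made for the example.

```java
/** Hypothetical stand-in for a read-only format class that owns the defaults. */
class ExampleHnswFormat {
  static final int DEFAULT_MAX_CONN = 16;
  static final int DEFAULT_BEAM_WIDTH = 100;

  final int maxConn;
  final int beamWidth;

  ExampleHnswFormat(int maxConn, int beamWidth) {
    this.maxConn = maxConn;
    this.beamWidth = beamWidth;
  }

  @Override
  public String toString() {
    // One shared toString, so every subclass reports its parameters the same way.
    return getClass().getSimpleName() + "(maxConn=" + maxConn + ", beamWidth=" + beamWidth + ")";
  }
}

/** Hypothetical read-write test subclass: no duplicated constants or fields. */
final class ExampleRWHnswFormat extends ExampleHnswFormat {
  /** Uses the parent's static defaults instead of redefining them. */
  ExampleRWHnswFormat() {
    this(DEFAULT_MAX_CONN, DEFAULT_BEAM_WIDTH);
  }

  /** Accepts the graph construction parameters explicitly. */
  ExampleRWHnswFormat(int maxConn, int beamWidth) {
    super(maxConn, beamWidth);
  }
}
```

   Keeping the defaults and fields in one place means the test formats for different codec versions cannot drift apart silently.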




[GitHub] [lucene] mikemccand merged pull request #925: LUCENE-10591: Fix UTF-8 encoding

2022-05-25 Thread ASF subversion and git services (Jira)



mikemccand merged PR #925:
URL: https://github.com/apache/lucene/pull/925



[jira] [Commented] (LUCENE-10591) Invalid character in SortableSingleDocSource.java



[ 
https://issues.apache.org/jira/browse/LUCENE-10591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542289#comment-17542289
 ] 

ASF subversion and git services commented on LUCENE-10591:
--

Commit 3a80968ddf30293ddf55c62f8f2f8a6915028408 in lucene's branch 
refs/heads/main from András Salamon
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=3a80968ddf3 ]

LUCENE-10591 Invalid character in SortableSingleDocSource.java (#925)



> Invalid character in SortableSingleDocSource.java
> -
>
> Key: LUCENE-10591
> URL: https://issues.apache.org/jira/browse/LUCENE-10591
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Andras Salamon
>Priority: Trivial
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> There are invalid UTF-8 characters in SortableSingleDocSource.java
> "S�o Tom� and Pr�ncipe"
> Sonar gave me a warning because of this.




[jira] [Commented] (LUCENE-10591) Invalid character in SortableSingleDocSource.java

2022-05-25 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542290#comment-17542290
 ] 

ASF subversion and git services commented on LUCENE-10591:
--

Commit eecf8ea63b90e1f77bb329a1d6e9d8cd6ad8aeb2 in lucene's branch 
refs/heads/branch_9x from András Salamon
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=eecf8ea63b9 ]

LUCENE-10591 Invalid character in SortableSingleDocSource.java (#925)



> Invalid character in SortableSingleDocSource.java
> -
>
> Key: LUCENE-10591
> URL: https://issues.apache.org/jira/browse/LUCENE-10591
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Andras Salamon
>Priority: Trivial
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> There are invalid UTF-8 characters in SortableSingleDocSource.java
> "S�o Tom� and Pr�ncipe"
> Sonar gave me a warning because of this.




[jira] [Resolved] (LUCENE-10591) Invalid character in SortableSingleDocSource.java

2022-05-25 Thread Michael McCandless (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-10591.
-
Fix Version/s: 10.0 (main)
   9.3
   Resolution: Fixed

Thank you for the attention to detail [~asalamon74]!  I merged the PR to 
main/10.0 and cherry-picked to 9.x (eventually 9.3).

> Invalid character in SortableSingleDocSource.java
> -
>
> Key: LUCENE-10591
> URL: https://issues.apache.org/jira/browse/LUCENE-10591
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Andras Salamon
>Priority: Trivial
> Fix For: 10.0 (main), 9.3
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> There are invalid UTF-8 characters in SortableSingleDocSource.java
> "S�o Tom� and Pr�ncipe"
> Sonar gave me a warning because of this.




[GitHub] [lucene] mikemccand commented on pull request #915: LUCENE-10585: Scrub copy/paste code in the facets module and attempt to simplify a bit



mikemccand commented on PR #915:
URL: https://github.com/apache/lucene/pull/915#issuecomment-1138026556

   Thank you @Yuti-G for running the dedicated `luceneutil` faceting benchmark!
   
   But: the `getAllDims` time for SSDV seems to have gotten much faster with 
this PR, which is great!  Was that expected?  Or is this some horrible noise?  
Is it repeatable?



[jira] [Commented] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query



[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542341#comment-17542341
 ] 

Ming Zhu commented on LUCENE-10562:
---

I'm encountering a similar issue, but the impact is more than performance.

My case is, I have a wildcard query with filter, let's say,

*wildcard:'*searchvalue*' and term filter 'status':'open'*

And I'm using TopTermsScoringBooleanQueryRewrite to hopefully get some 
meaningful relevance scores to sort all the hits.

For my data set, there are millions of documents where status is NOT open, and 
a handful of them with status:open. So the issue here is that, with the top-terms 
rewrite, all the terms that are relevant for documents with *status:open* 
are ranked very low (because of their low frequencies), and I can't 
keep increasing the number of terms taken in the rewrite phase, as that may 
lead to the max clause issue.

So this query+filter ended up not matching anything.

Any idea how to get out of this situation? Thanks.
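
The effect described above can be reproduced with a toy model. The sketch below is not Lucene code: `TopTermsToy` is a hypothetical class, and ranking the expanded terms by document frequency is an assumption taken from the description in this comment, not a statement about how the actual rewrite ranks terms. It only shows that when the term list is cut to N entries before the filter is applied, rare terms that are the only ones matching the filtered documents can be dropped, leaving no hits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of "expand a wildcard to the top N terms, then filter"; not Lucene code. */
final class TopTermsToy {
  private TopTermsToy() {}

  /**
   * Keeps the maxTerms matching terms with the highest document frequency (assumed ranking),
   * then intersects their postings with the filter.
   */
  static Set<Integer> search(
      Map<String, Set<Integer>> postings, String infix, int maxTerms, Set<Integer> filterDocs) {
    List<String> matching = new ArrayList<>();
    for (String term : postings.keySet()) {
      if (term.contains(infix)) {
        matching.add(term);
      }
    }
    // Most frequent terms first; ties are broken alphabetically for determinism.
    matching.sort(
        (a, b) -> {
          int byFrequency = Integer.compare(postings.get(b).size(), postings.get(a).size());
          return byFrequency != 0 ? byFrequency : a.compareTo(b);
        });
    Set<Integer> hits = new TreeSet<>();
    for (String term : matching.subList(0, Math.min(maxTerms, matching.size()))) {
      for (int doc : postings.get(term)) {
        if (filterDocs.contains(doc)) {
          hits.add(doc);
        }
      }
    }
    return hits;
  }
}
```

With two frequent terms that occur only in non-open documents and one rare term that occurs only in an open document, `maxTerms = 2` returns an empty set while `maxTerms = 3` returns the open document, which mirrors the situation reported here.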

[~uschindler]  [~tomoko] 

 

> Large system: Wildcard search leads to full index scan despite filter query
> ---
>
> Key: LUCENE-10562
> URL: https://issues.apache.org/jira/browse/LUCENE-10562
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/search
> Affects Versions: 8.11.1
> Reporter: Henrik Hertel
> Priority: Major
> Labels: performance
>
> I use Solr and have a large system with 1TB in one core and about 5 million 
> documents. The textual content of large PDF files is indexed there. My query 
> is extremely slow (more than 30 seconds) as soon as I use wildcards, e.g. 
> {code:java}
> *searchvalue*
> {code}
> even though I put a filter query in front of it that reduces the result set to 
> fewer than 20 documents.
> searchvalue -> less than 1 second
> searchvalue* -> less than 1 second
> My query:
> {code:java}
> select?defType=lucene&q=content_t:*searchvalue*&fq=metadataitemids_is:20950&fl=id&rows=50&start=0
>  {code}
> I've tried everything imaginable. It doesn't make sense to me why a search 
> over a small subset should take so long. If I omit the filter query 
> metadataitemids_is:20950 and search the entire inventory, it still takes 
> the same amount of time. Therefore, I suspect that despite the filter query, 
> the main query runs over the entire index.
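
A toy model of why the reporter's suspicion is plausible. The sketch below is not how Lucene or Solr is implemented; `prefix_terms`, `infix_terms`, and the `terms` list are hypothetical, and a real term dictionary is far more sophisticated than a sorted Python list. It only illustrates the general point that a trailing wildcard can seek to its prefix in a sorted term list, while a leading wildcard offers no seek point, and that neither step consults the filter.

```python
import bisect


def prefix_terms(sorted_terms, prefix):
    """Trailing wildcard (prefix*): seek to the prefix, stop when it ends."""
    start = bisect.bisect_left(sorted_terms, prefix)
    hits, visited = [], 0
    for i in range(start, len(sorted_terms)):
        visited += 1
        if not sorted_terms[i].startswith(prefix):
            break
        hits.append(sorted_terms[i])
    return hits, visited


def infix_terms(sorted_terms, substring):
    """Leading wildcard (*substring*): every term has to be inspected."""
    hits = [t for t in sorted_terms if substring in t]
    return hits, len(sorted_terms)


# Hypothetical term dictionary for one field.
terms = sorted([f"term{i:05d}" for i in range(10_000)]
               + ["searchvalue", "searchvalues", "xsearchvaluey"])

print(prefix_terms(terms, "searchvalue"))     # (['searchvalue', 'searchvalues'], 3)
print(infix_terms(terms, "searchvalue")[1])   # 10003
```

In this model the number of terms visited depends only on the term list and the wildcard shape, not on how many documents the filter would keep.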



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query



[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542341#comment-17542341
 ] 

Ming Zhu edited comment on LUCENE-10562 at 5/26/22 4:39 AM:


I'm encountering a similar issue, but the impact goes beyond performance.

My case: I have a wildcard query with a filter, say,

*wildcard:'*searchvalue*' and term filter 'status':'open'*

I'm using TopTermsScoringBooleanQueryRewrite in the hope of getting 
meaningful relevance scores to sort all the hits.

In my data set, there are millions of documents where status is NOT open, and 
only a handful with status:open. The issue is that, with the top-terms rewrite, 
all the terms relevant to documents with *status:open* are ranked very low 
(because of their low frequencies), and I can't keep increasing the number of 
terms taken in the rewrite phase, as that may lead to the max clause issue.

As a result, this query+filter ended up matching nothing.

Any idea how to get out of this situation? Thanks.

[~uschindler]  [~tomoko] 

 



[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query



[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542341#comment-17542341
 ] 

Ming Zhu edited comment on LUCENE-10562 at 5/26/22 4:40 AM:


I'm encountering a similar issue, but the impact goes beyond performance.

My case: I have a wildcard query with a filter, say,

*wildcard:'*searchvalue*' and term filter 'status':'open'*

I'm using TopTermsScoringBooleanQueryRewrite in the hope of getting 
meaningful relevance scores to sort all the hits.

In my data set, there are millions of documents where status is NOT open, and 
only a handful with status:open. The issue is that, with the top-terms rewrite, 
all the terms relevant to documents with *status:open* are ranked very low 
(because of their low frequencies), and I can't keep increasing the number of 
terms taken in the rewrite phase, as that may lead to the max clause issue.

As a result, this query+filter ended up matching nothing.

Any idea how to get out of this situation? Thanks.

[~uschindler]  [~tomoko] 

 



[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query



[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542341#comment-17542341
 ] 

Ming Zhu edited comment on LUCENE-10562 at 5/26/22 4:40 AM:


I'm encountering a similar issue, but the impact goes beyond performance.

My case: I have a wildcard query with a filter, say,

*wildcard:'*searchvalue*' and term filter 'status':'open'*

I'm using TopTermsScoringBooleanQueryRewrite in the hope of getting 
meaningful relevance scores to sort all the hits.

In my data set, there are millions of documents where status is NOT open, and 
only a handful with status:open. The issue is that, with the top-terms rewrite, 
all the terms relevant to documents with *status:open* are ranked very low 
(because of their low frequencies), and I can't keep increasing the number of 
terms taken in the rewrite phase, as that may lead to the max clause issue.

As a result, this query+filter ended up matching nothing.

Any idea how to get out of this situation? Thanks.

[~uschindler]  [~tomoko] 

 



[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query



[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542341#comment-17542341
 ] 

Ming Zhu edited comment on LUCENE-10562 at 5/26/22 4:40 AM:


I'm encountering a similar issue, but the impact goes beyond performance.

My case: I have a wildcard query with a filter, say,

*wildcard:'*searchvalue*' and term filter 'status':'open'*

I'm using TopTermsScoringBooleanQueryRewrite in the hope of getting 
meaningful relevance scores to sort all the hits.

In my data set, there are millions of documents where status is NOT open, and 
only a handful with status:open. The issue is that, with the top-terms rewrite, 
all the terms relevant to documents with *status:open* are ranked very low 
(because of their low frequencies), and I can't keep increasing the number of 
terms taken in the rewrite phase, as that may lead to the max clause issue.

As a result, this query+filter ended up matching nothing.

Any idea how to get out of this situation? Thanks.

[~uschindler]  [~tomoko] 

 



[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query



[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542341#comment-17542341
 ] 

Ming Zhu edited comment on LUCENE-10562 at 5/26/22 4:40 AM:


I'm encountering a similar issue, but the impact goes beyond performance.

My case: I have a wildcard query with a filter, say,

*wildcard:'*searchvalue*' and term filter 'status':'open'*

I'm using TopTermsScoringBooleanQueryRewrite in the hope of getting 
meaningful relevance scores to sort all the hits.

In my data set, there are millions of documents where status is NOT open, and 
only a handful with status:open. The issue is that, with the top-terms rewrite, 
all the terms relevant to documents with *status:open* are ranked very low 
(because of their low frequencies), and I can't keep increasing the number of 
terms taken in the rewrite phase, as that may lead to the max clause issue.

As a result, this query+filter ended up matching nothing.

Any idea how to get out of this situation? Thanks.

[~uschindler]  [~tomoko] 

 



[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query



[ 
https://issues.apache.org/jira/browse/LUCENE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542341#comment-17542341
 ] 

Ming Zhu edited comment on LUCENE-10562 at 5/26/22 4:49 AM:


I'm encountering a similar issue, but the impact goes beyond performance.

My case: I have a wildcard query with a filter, say,

*name:'*searchvalue*' and term filter 'status':'open'*

I'm using TopTermsScoringBooleanQueryRewrite in the hope of getting 
meaningful relevance scores to sort all the hits.

In my data set, there are millions of documents where status is NOT open, and 
only a handful with status:open. The issue is that, with the top-terms rewrite, 
all the terms relevant to documents with *status:open* are ranked very low 
(because of their low frequencies), and I can't keep increasing the number of 
terms taken in the rewrite phase, as that may lead to the max clause issue.

As a result, this query+filter ended up matching nothing.

Any idea how to get out of this situation? Thanks.

[~uschindler]  [~tomoko] 

 



[jira] [Comment Edited] (LUCENE-10562) Large system: Wildcard search leads to full index scan despite filter query