[jira] [Created] (LUCENE-10588) Make Luke launching code faster
Tomoko Uchida created LUCENE-10588: -- Summary: Make Luke launching code faster Key: LUCENE-10588 URL: https://issues.apache.org/jira/browse/LUCENE-10588 Project: Lucene - Core Issue Type: Improvement Reporter: Tomoko Uchida Starting Luke can take multiple seconds because it renders all GUI components at launch. It may be possible to make it faster (sub-second) by rendering panels lazily, which avoids loading too many classes at startup. This typically becomes an issue in CI jobs, but a quicker launch would also be good for humans. https://github.com/apache/lucene/pull/917 -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
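The lazy-rendering idea in this issue can be sketched as a memoizing supplier: a panel is built the first time it is requested rather than at startup. This is only an illustration under assumptions; `LazyPanel` and the string standing in for a panel are hypothetical and are not taken from Luke or from the linked pull request.

```java
import java.util.function.Supplier;

/** Hypothetical memoizing holder; not part of Luke or Lucene. */
final class LazyPanel<T> implements Supplier<T> {
  private final Supplier<T> factory;
  private T value;
  private boolean built;

  LazyPanel(Supplier<T> factory) {
    this.factory = factory;
  }

  /** Reports whether the wrapped component has been constructed yet. */
  synchronized boolean isBuilt() {
    return built;
  }

  /** Builds the component on first access and returns the cached instance afterwards. */
  @Override
  public synchronized T get() {
    if (!built) {
      value = factory.get();
      built = true;
    }
    return value;
  }

  public static void main(String[] args) {
    // The string is a stand-in for an expensive GUI component.
    LazyPanel<String> search = new LazyPanel<>(() -> "search panel");
    System.out.println(search.isBuilt());
    System.out.println(search.get());
    System.out.println(search.isBuilt());
  }
}
```

In a Swing application the factory would typically be invoked on the event dispatch thread, for example when a tab is first selected.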
[GitHub] [lucene] mocobeta commented on pull request #917: LUCENE-10531: Disable distribution test (gui test) on windows.
mocobeta commented on PR #917: URL: https://github.com/apache/lucene/pull/917#issuecomment-1134306924 > One change I think we could try is to run the forked command with a higher priority Thanks for your suggestion. We could try this workaround, though I feel it would be better to keep the test disabled and try to solve the slowness of launching the app. https://issues.apache.org/jira/browse/LUCENE-10588 It's great to know we can run the GUI test with a (virtual) display on GitHub Actions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] shaie merged pull request #919: Update dev-docs
shaie merged PR #919: URL: https://github.com/apache/lucene/pull/919
[GitHub] [lucene] romseygeek commented on pull request #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests
romseygeek commented on PR #909: URL: https://github.com/apache/lucene/pull/909#issuecomment-1134360770 Hiya, this is making the elasticsearch CI cross; all builds are failing with this message:
```
* Where:
08:09:37 Script '/var/lib/jenkins/workspace/apache+lucene+main/gradle/java/modules.gradle' line: 215
08:09:37
08:09:37 * What went wrong:
08:09:37 Execution failed for task ':lucene:core.tests:test'.
08:09:37 > java.nio.file.NoSuchFileException: /var/lib/jenkins/workspace/apache+lucene+main/lucene/core.tests/build/tmp/test/jvm-forking.properties
```
I think the problem is the `core.tests` in the middle there, which should instead be `core/tests`, but I'm not sure if that's something wrong with our Jenkins environment or if it's a bug in the gradle logic that is constructing the path.
[GitHub] [lucene] romseygeek commented on pull request #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests
romseygeek commented on PR #909: URL: https://github.com/apache/lucene/pull/909#issuecomment-1134364141 Aha, and it's failing locally for me as well. I'll see if I can work out where the issue is!
[GitHub] [lucene] dweiss commented on pull request #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests
dweiss commented on PR #909: URL: https://github.com/apache/lucene/pull/909#issuecomment-1134368823 Hi Alan. Is the Lucene build failing for you locally?
[GitHub] [lucene] romseygeek commented on pull request #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests
romseygeek commented on PR #909: URL: https://github.com/apache/lucene/pull/909#issuecomment-1134370810 Hey Dawid, yes I get local failures if I run `./gradlew clean check`.
```
* Where:
Script '/Users/romseygeek/projects/lucene/gradle/java/modules.gradle' line: 215

* What went wrong:
Execution failed for task ':lucene:backward-codecs:test'.
> java.nio.file.NoSuchFileException: /Users/romseygeek/projects/lucene/lucene/backward-codecs/build/tmp/test/jvm-forking.properties
```
I think possibly the problem is that it's not creating the `jvm-forking.properties` file before trying to write to it?
[GitHub] [lucene] dweiss commented on pull request #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests
dweiss commented on PR #909: URL: https://github.com/apache/lucene/pull/909#issuecomment-1134373557 No. I think it's the parent path that is missing here.
[GitHub] [lucene] dweiss commented on pull request #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests
dweiss commented on PR #909: URL: https://github.com/apache/lucene/pull/909#issuecomment-1134374008 Try `Files.createDirectories(forkProperties.toPath().getParent());`
[GitHub] [lucene] dweiss commented on pull request #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests
dweiss commented on PR #909: URL: https://github.com/apache/lucene/pull/909#issuecomment-1134375202 I know why you're getting it. `clean` executes after configuration and wipes the temporary task directory for `test`. We'll have to recreate it properly. I'll commit a fix soon.
[GitHub] [lucene] romseygeek commented on pull request #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests
romseygeek commented on PR #909: URL: https://github.com/apache/lucene/pull/909#issuecomment-1134378224 Can confirm that adding the `createDirectories` line before the `writeString` fixes the problem. Thanks!
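The fix confirmed above (create the parent directory, then write the file) can be sketched in plain Java. The real change lives in a Gradle script, so this is a stand-in under assumptions: `ForkPropertiesWriter`, its `write` method, and the property content are hypothetical names for illustration only.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper; the name and signature are not taken from the Lucene build. */
final class ForkPropertiesWriter {
  /** Writes content to file, recreating any missing parent directories first. */
  static Path write(Path file, String content) throws IOException {
    Path parent = file.getParent();
    if (parent != null) {
      // No-op if the directory exists; recreates it if a "clean" wiped it.
      Files.createDirectories(parent);
    }
    return Files.writeString(file, content);
  }

  public static void main(String[] args) throws IOException {
    Path root = Files.createTempDirectory("fork-demo");
    // The nested directories do not exist yet, as after a clean.
    Path target = root.resolve("build/tmp/test/jvm-forking.properties");
    write(target, "example.key=value\n");
    System.out.println(Files.readString(target));
  }
}
```

Without the `createDirectories` call, `Files.writeString` on a path whose parent is missing fails with `NoSuchFileException`, which matches the error reported in this thread.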
[GitHub] [lucene] uschindler commented on pull request #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests
uschindler commented on PR #909: URL: https://github.com/apache/lucene/pull/909#issuecomment-1134378329 Yes, your Jenkins seems to call `gradlew clean test`, too, so this is failing. On ASF and Policeman Jenkins we do not do this, so it passes. The jobs on my managed Jenkins instances have the "reset git repository to clean checkout" feature enabled, so when a job starts it has a completely clean git checkout and `gradlew clean` is not needed.
[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this
[ https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540835#comment-17540835 ] Adrien Grand commented on LUCENE-10574: --- The stored fields benchmark aimed at reproducing a pathological case, but I don't think this case is uncommon. The only thing you need to be affected by O(n^2) merges is to flush segments that are significantly smaller than the default floor segment size of TieredMergePolicy (2MB). We almost never see this in our benchmarks because our indexing logic always tries to max out indexing speed, so even with the default RAM buffer size of 16MB, the smallest segments in the index would be above 2MB. However in the real world where there are frequent reopens, this wouldn't be unlikely. For instance, if your documents require ~100 bytes of disk space each in the index, and your indexing/refresh rate trigger creation of segments of ~100 documents each, then you'll end up with ~10kB flush segments and hit pathological merges. > Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't > do this > --- > > Key: LUCENE-10574 > URL: https://issues.apache.org/jira/browse/LUCENE-10574 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Priority: Major > Fix For: 9.3 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge > policy that doesn't merge in an O(n^2) way. > I have the feeling it might have to be the latter, as folks seem really wed > to this crazy O(n^2) behavior.
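The arithmetic in the comment above can be checked with a few lines of Java. This is a rough sketch: `FlushSegmentEstimate` and its methods are hypothetical names, and the 2MB floor is assumed here to mean 2 * 1024 * 1024 bytes.

```java
/** Hypothetical back-of-the-envelope calculator; not a Lucene class. */
final class FlushSegmentEstimate {
  // Assumption: the 2MB default floor segment size expressed in bytes.
  static final long FLOOR_SEGMENT_BYTES = 2L * 1024 * 1024;

  /** Approximate on-disk size of a segment produced by one flush. */
  static long flushSegmentBytes(long bytesPerDoc, long docsPerFlush) {
    return bytesPerDoc * docsPerFlush;
  }

  /** True when a segment is smaller than the floor size. */
  static boolean belowFloor(long segmentBytes) {
    return segmentBytes < FLOOR_SEGMENT_BYTES;
  }

  public static void main(String[] args) {
    // ~100 bytes per document and ~100 documents per flush, as in the comment.
    long size = flushSegmentBytes(100, 100);
    System.out.println("flush segment bytes: " + size);
    System.out.println("below floor: " + belowFloor(size));
    System.out.println("flush segments per floor-sized segment: " + FLOOR_SEGMENT_BYTES / size);
  }
}
```

With these inputs a flush segment is about 10kB, roughly two orders of magnitude below the floor, which is the situation the comment describes as triggering pathological merges.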
[jira] [Commented] (LUCENE-10370) Fix classpath/module path of tests forking their own Java (TestNRTReplication)
[ https://issues.apache.org/jira/browse/LUCENE-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540836#comment-17540836 ] ASF subversion and git services commented on LUCENE-10370: -- Commit 5b92002fed3ca316e98c822c1afdccd30f00feb7 in lucene's branch refs/heads/main from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=5b92002fed3 ] LUCENE-10370: recreate temporary location in case it's wiped by a clean. > Fix classpath/module path of tests forking their own Java (TestNRTReplication) > -- > > Key: LUCENE-10370 > URL: https://issues.apache.org/jira/browse/LUCENE-10370 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Minor > Fix For: 9.3 > > Time Spent: 2.5h > Remaining Estimate: 0h > > TestNRTReplication fails because it assumes classpath can just be copied to a > sub-process - this is no longer the case. > PR at: > https://github.com/apache/lucene/pull/909
[jira] [Commented] (LUCENE-10370) Fix classpath/module path of tests forking their own Java (TestNRTReplication)
[ https://issues.apache.org/jira/browse/LUCENE-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540837#comment-17540837 ] ASF subversion and git services commented on LUCENE-10370: -- Commit fa411e053f690a9f3087c5112150d7b08477aa73 in lucene's branch refs/heads/branch_9x from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=fa411e053f6 ] LUCENE-10370: recreate temporary location in case it's wiped by a clean.
[GitHub] [lucene] dweiss commented on pull request #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests
dweiss commented on PR #909: URL: https://github.com/apache/lucene/pull/909#issuecomment-1134379491 I've committed a fix.
[GitHub] [lucene] romseygeek commented on pull request #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests
romseygeek commented on PR #909: URL: https://github.com/apache/lucene/pull/909#issuecomment-1134383958 Thanks Dawid!
[jira] [Created] (LUCENE-10589) Fix corner case in TestKnnVectorQuery.testRandomWithFilter
Tomoko Uchida created LUCENE-10589: -- Summary: Fix corner case in TestKnnVectorQuery.testRandomWithFilter Key: LUCENE-10589 URL: https://issues.apache.org/jira/browse/LUCENE-10589 Project: Lucene - Core Issue Type: Improvement Reporter: Tomoko Uchida {{TestKnnVectorQuery.testRandomWithFilter}} can fail with java.lang.UnsupportedOperationException. Reproducible command
{code:java}
./gradlew test --tests TestKnnVectorQuery.testRandomWithFilter -Dtests.seed=1DA39B92702DAC45 -Dtests.multiplier=3
{code}
{code:java}
org.apache.lucene.search.TestKnnVectorQuery > testRandomWithFilter FAILED
    java.lang.UnsupportedOperationException: exact search is not supported
        at __randomizedtesting.SeedInfo.seed([1DA39B92702DAC45:6BEAC2197AD96AE0]:0)
        at org.apache.lucene.search.TestKnnVectorQuery$ThrowingKnnVectorQuery.exactSearch(TestKnnVectorQuery.java:715)
        at org.apache.lucene.search.KnnVectorQuery.searchLeaf(KnnVectorQuery.java:151)
        at org.apache.lucene.search.KnnVectorQuery.rewrite(KnnVectorQuery.java:108)
        at org.apache.lucene.search.ConstantScoreQuery.rewrite(ConstantScoreQuery.java:44)
        at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:789)
        at org.apache.lucene.tests.search.AssertingIndexSearcher.rewrite(AssertingIndexSearcher.java:69)
        at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:803)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:685)
        at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:667)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:584)
        at org.apache.lucene.search.TestKnnVectorQuery.testRandomWithFilter(TestKnnVectorQuery.java:556)
{code}
In some edge cases (depending on the random seed), [KnnVectorQuery.java#147|https://github.com/apache/lucene/blob/fe9d26178d033f585c08a5e86708063ac0ec0c9e/lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java#L147] becomes false, and then `exactSearch()` is called. The upper bound of [the test range query (filter)|https://github.com/apache/lucene/blob/fe9d26178d033f585c08a5e86708063ac0ec0c9e/lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java#L554] could be 200 (the max value of "tag" field + 1) instead of lower + 150 to make it "unrestrictive"?
[GitHub] [lucene] mocobeta opened a new pull request, #920: LUCENE-10589: increase upper bound of test range query to the maximum value + 1
mocobeta opened a new pull request, #920: URL: https://github.com/apache/lucene/pull/920 This is a small tweak for `TestKnnVectorQuery.testRandomWithFilter()`. See https://issues.apache.org/jira/browse/LUCENE-10589. On main:
```
./gradlew test --tests TestKnnVectorQuery.testRandomWithFilter -Dtests.seed=1DA39B92702DAC45 -Dtests.multiplier=3

org.apache.lucene.search.TestKnnVectorQuery > testRandomWithFilter FAILED
    java.lang.UnsupportedOperationException: exact search is not supported
        at __randomizedtesting.SeedInfo.seed([1DA39B92702DAC45:6BEAC2197AD96AE0]:0)
        at org.apache.lucene.search.TestKnnVectorQuery$ThrowingKnnVectorQuery.exactSearch(TestKnnVectorQuery.java:715)
```
With this patch:
```
./gradlew test --tests TestKnnVectorQuery.testRandomWithFilter -Dtests.seed=1DA39B92702DAC45 -Dtests.multiplier=3

:lucene:core:test (SUCCESS): 1 test(s)
```
[jira] [Commented] (LUCENE-10589) Fix corner case in TestKnnVectorQuery.testRandomWithFilter
[ https://issues.apache.org/jira/browse/LUCENE-10589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540935#comment-17540935 ] Dawid Weiss commented on LUCENE-10589: -- I don't know anything about this code area but thank you for following up on jenkins failures, [~tomoko]!
[GitHub] [lucene] mocobeta commented on pull request #920: LUCENE-10589: increase upper bound of test range query to the maximum value + 1
mocobeta commented on PR #920: URL: https://github.com/apache/lucene/pull/920#issuecomment-1134641142 It looks like the test can be tweaked not to fall into the corner cases but I'm not fully sure if this is correct - is there a better way to fix it?
[GitHub] [lucene] jpountz opened a new pull request, #921: LUCENE-10078: Enable merge-on-refresh by default.
jpountz opened a new pull request, #921: URL: https://github.com/apache/lucene/pull/921 This adds implementations of `findFullFlushMerges` to `LogMergePolicy` and `TieredMergePolicy` and enables merge-on-refresh with a default timeout of 500ms. The idea behind the 500ms default is that it felt both high enough to leave time to run merges of small segments, and low enough that the freshness of the data wouldn't look badly affected for users who have high refresh rates (e.g. refreshing every second). For `findFullFlushMerges`, `LogMergePolicy` looks at tail segments to see if it can find at least `mergeFactor` flush segments below the min segment size, and `TieredMergePolicy` looks for a merge that has at least `segmentsPerTier` segments where the largest segment of the merge is a flush segment and below the floor size.
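The `LogMergePolicy` rule described above (find at least `mergeFactor` small flush segments at the tail) can be sketched as follows. This is a simplified reading of the PR description, not the actual Lucene implementation: `TailFlushMergeSketch`, `Segment`, and both methods are hypothetical, and the sketch assumes the qualifying segments must be consecutive at the end of the list.

```java
import java.util.List;

/** Hypothetical illustration of the tail-segment check; not Lucene code. */
final class TailFlushMergeSketch {
  /** Stand-in for a segment: its size and whether it was produced by a flush. */
  static final class Segment {
    final long sizeBytes;
    final boolean fromFlush;

    Segment(long sizeBytes, boolean fromFlush) {
      this.sizeBytes = sizeBytes;
      this.fromFlush = fromFlush;
    }
  }

  /** Counts consecutive segments at the tail that are flush segments below the minimum size. */
  static int smallFlushTail(List<Segment> segments, long minSegmentBytes) {
    int count = 0;
    for (int i = segments.size() - 1; i >= 0; i--) {
      Segment s = segments.get(i);
      if (!s.fromFlush || s.sizeBytes >= minSegmentBytes) {
        break; // stop at the first segment that does not qualify
      }
      count++;
    }
    return count;
  }

  /** True when the tail holds at least mergeFactor small flush segments. */
  static boolean shouldMergeOnRefresh(List<Segment> segments, int mergeFactor, long minSegmentBytes) {
    return smallFlushTail(segments, minSegmentBytes) >= mergeFactor;
  }

  public static void main(String[] args) {
    List<Segment> segments = List.of(
        new Segment(50_000_000L, false), // large, previously merged segment
        new Segment(10_000L, true),
        new Segment(12_000L, true),
        new Segment(9_000L, true));
    System.out.println(shouldMergeOnRefresh(segments, 3, 2L * 1024 * 1024));
  }
}
```

The `TieredMergePolicy` variant described in the PR uses different criteria (`segmentsPerTier` and the floor size) and is not modelled here.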
[jira] [Commented] (LUCENE-10589) Fix corner case in TestKnnVectorQuery.testRandomWithFilter
[ https://issues.apache.org/jira/browse/LUCENE-10589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540950#comment-17540950 ] Tomoko Uchida commented on LUCENE-10589: You’re welcome - debugging this was a good chance to follow/play around with the code for me.
[GitHub] [lucene] rmuir merged pull request #901: remove commented-out/obselete AwaitsFix
rmuir merged PR #901: URL: https://github.com/apache/lucene/pull/901
[jira] [Commented] (LUCENE-10229) Match offsets should be consistent for fields with positions and fields with offsets
[ https://issues.apache.org/jira/browse/LUCENE-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540955#comment-17540955 ] ASF subversion and git services commented on LUCENE-10229: -- Commit c86f9b2d8c1ccdb85a33b64ace70a1b1d3a4e2d4 in lucene's branch refs/heads/main from Robert Muir [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=c86f9b2d8c1 ] remove commented-out/obselete AwaitsFix (#901) * remove commented-out/obselete AwaitsFix All of these issues are fixed, but the AwaitsFix annotation is still there, just commented out. This causes confusion and makes it harder to keep an eye/review the AwaitsFix tests, e.g. false positives when running 'git grep AwaitsFix' * Remove @AwaitsFix from TestMatchRegionRetriever. The problem has been fixed in LUCENE-10229. Co-authored-by: Dawid Weiss > Match offsets should be consistent for fields with positions and fields with > offsets > > > Key: LUCENE-10229 > URL: https://issues.apache.org/jira/browse/LUCENE-10229 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Minor > Fix For: 9.2 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > This is a follow-up of LUCENE-10223 in which it was discovered that fields > with > offsets don't highlight some more complex interval queries properly. Alan > says: > {quote} > It's because it returns the position of the inner match, but the offsets of > the outer. And so if you're re-analyzing and retrieving offsets by looking > at the positions, you get the 'right' thing. It's not obvious to me what the > correct response is here, but thinking about it the current behaviour is kind > of the worst of both worlds, and perhaps we should change it so that you get > offsets of the inner match as standard, and then the outer match is returned > as part of the sub matches. 
> {quote} > Intervals are nicely separated into "basic intervals" and "filters" which > restrict some other source of intervals, here is the original documentation: > https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/intervals/package-info.java#L29-L50 > My experience from an extended period of using interval queries in a frontend > where they're highlighted is that filters are restrictions that should not be > highlighted - it's the source intervals that people care about. Filters are > what you remove or where you give proper context to source intervals. > The test code contributed in LUCENE-10223 contains numerous query-highlight > examples (on fields with positions) where this intuition is demonstrated on > all kinds of interval functions: > https://github.com/apache/lucene/blob/main/lucene/highlighter/src/test/org/apache/lucene/search/matchhighlight/TestMatchHighlighter.java#L335-L542 > This issue is about making the internals work consistently for fields with > positions and fields with offsets.
[jira] [Commented] (LUCENE-10229) Match offsets should be consistent for fields with positions and fields with offsets
[ https://issues.apache.org/jira/browse/LUCENE-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540961#comment-17540961 ] ASF subversion and git services commented on LUCENE-10229: -- Commit 6edc8a4cff5fc6bb2aca8847d8edd2d6eb01ec13 in lucene's branch refs/heads/branch_9x from Robert Muir [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=6edc8a4cff5 ] remove commented-out/obselete AwaitsFix (#901) * remove commented-out/obselete AwaitsFix All of these issues are fixed, but the AwaitsFix annotation is still there, just commented out. This causes confusion and makes it harder to keep an eye/review the AwaitsFix tests, e.g. false positives when running 'git grep AwaitsFix' * Remove @AwaitsFix from TestMatchRegionRetriever. The problem has been fixed in LUCENE-10229. Co-authored-by: Dawid Weiss
[jira] [Commented] (LUCENE-10078) Enable merge-on-refresh by default?
[ https://issues.apache.org/jira/browse/LUCENE-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540979#comment-17540979 ] Adrien Grand commented on LUCENE-10078: --- We had discussions about this in the context of the O(n^2) merging that {{floorSegmentSize}} introduces (LUCENE-10574), so I took a stab at this issue, so that users fully benefit from the trade-off we're making of creating unbalanced merges for the sake of having fewer segments to deal with at search time. > Enable merge-on-refresh by default? > --- > > Key: LUCENE-10078 > URL: https://issues.apache.org/jira/browse/LUCENE-10078 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > This is a spinoff from the discussion in LUCENE-10073. > The newish merge-on-refresh ([crazy origin > story|https://blog.mikemccandless.com/2021/03/open-source-collaboration-or-how-we.html]) > feature is a powerful way to reduce searched segment counts, especially > helpful for applications using many indexing threads. Such usage will write > many tiny segments on each refresh, which could quickly be merged up during > the {{refresh}} operation. > We would have to implement a default for {{findFullFlushMerges}} > (LUCENE-10064 is open for this), and then we would need > {{IndexWriterConfig.getMaxFullFlushMergeWaitMillis}} to default to a non-zero value (this > issue).
[GitHub] [lucene] mocobeta commented on pull request #918: LUCENE-10586: Minor cleanup for local variables in BlockTreeTermsReader
mocobeta commented on PR #918: URL: https://github.com/apache/lucene/pull/918#issuecomment-1134773996 Thanks @mikemccand for confirming this. I'll keep this open for a few more days for others to review it.
[jira] [Commented] (LUCENE-10586) Minor refactoring in Lucene90BlockTreeTermsReader local variables: metaIn, indexMetaIn, termsMetaIn
[ https://issues.apache.org/jira/browse/LUCENE-10586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540996#comment-17540996 ] Adrien Grand commented on LUCENE-10586: --- +1 The reason is historical indeed. The earlier version of this class, Lucene40BlockTreeTermsReader, used to record metadata interleaved with the actual data. At some point, we moved metadata to a dedicated file, so that we could verify checksums upon opening the segment, so this required assigning `indexMetaIn` and `termsMetaIn` to either the data files or the metadata file depending on the version. It's good we can clean this up now that we're always reading metadata from the metadata file! > Minor refactoring in Lucene90BlockTreeTermsReader local variables: metaIn, > indexMetaIn, termsMetaIn > --- > > Key: LUCENE-10586 > URL: https://issues.apache.org/jira/browse/LUCENE-10586 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Tomoko Uchida >Priority: Trivial > Time Spent: 10m > Remaining Estimate: 0h > > Those three local variables refer to the same {{IndexInput}} object (no > clone() is called). > {code} > indexMetaIn = termsMetaIn = metaIn; > {code} > I'm not sure, but maybe there are some historical reasons. I wonder if it > would be better to have only one reference for the underlying {{IndexInput}} > object to make the code a little easier to follow.
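The aliasing described in the issue above can be illustrated with a minimal sketch. `FakeInput` is a hypothetical stand-in for `IndexInput` (it only tracks a read position) and is not Lucene code; the point is only that the chained assignment copies a reference, so a read through any of the three names advances the one shared position.

```java
// Hypothetical stand-in for Lucene's IndexInput; it only tracks a read position.
final class FakeInput {
  private final byte[] data;
  private int pos;

  FakeInput(byte[] data) {
    this.data = data;
  }

  byte readByte() {
    return data[pos++];
  }

  int position() {
    return pos;
  }
}

final class AliasDemo {
  /** Reads one byte through each alias and returns the shared position afterwards. */
  static int readThroughAliases() {
    FakeInput metaIn = new FakeInput(new byte[] {1, 2, 3});
    FakeInput indexMetaIn;
    FakeInput termsMetaIn;
    // Same shape as the assignment quoted in the issue: no copy (clone) is made.
    indexMetaIn = termsMetaIn = metaIn;
    indexMetaIn.readByte();
    termsMetaIn.readByte();
    // Both reads moved the single underlying object.
    return metaIn.position();
  }

  public static void main(String[] args) {
    System.out.println(readThroughAliases());
  }
}
```

Because all three names observe the same position, keeping a single name loses nothing, which seems to be the simplification the issue proposes.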
[GitHub] [lucene] msokolov commented on pull request #920: LUCENE-10589: increase upper bound of test range query to the maximum value + 1
msokolov commented on PR #920: URL: https://github.com/apache/lucene/pull/920#issuecomment-1134805154 I suspect what's happening is RandomIndexWriter is causing some very small segment to be written, and within that segment the query *is* highly selective causing us to fall back to brute force scan. I would probably fix by using a more "normal" IndexWriter and always indexing a single segment?
[jira] [Commented] (LUCENE-8519) MultiDocValues.getNormValues should not call getMergedFieldInfos
[ https://issues.apache.org/jira/browse/LUCENE-8519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541000#comment-17541000 ] Rushabh Shah commented on LUCENE-8519: -- [~dsmiley] Thank you for the review and the merge. > MultiDocValues.getNormValues should not call getMergedFieldInfos > > > Key: LUCENE-8519 > URL: https://issues.apache.org/jira/browse/LUCENE-8519 > Project: Lucene - Core > Issue Type: Improvement >Reporter: David Smiley >Priority: Minor > Fix For: 9.3 > > Time Spent: 2h 50m > Remaining Estimate: 0h > > {{MultiDocValues.getNormValues}} should not call {{getMergedFieldInfos}} > because it's a needless expense. getNormValues simply wants to know if each > LeafReader that has this field has norms as well; that's all.
[jira] [Created] (LUCENE-10590) Indexing all zero vectors leads to heat death of the universe
Michael Sokolov created LUCENE-10590: Summary: Indexing all zero vectors leads to heat death of the universe Key: LUCENE-10590 URL: https://issues.apache.org/jira/browse/LUCENE-10590 Project: Lucene - Core Issue Type: Bug Reporter: Michael Sokolov By accident while testing something else, I ran a luceneutil test indexing 1M 100d vectors where all the vectors were all zeroes. This caused indexing to take a very long time (~40x normal - it did eventually complete) and the search performance was similarly bad. We should not degrade by orders of magnitude with even the worst data though. I'm not entirely sure what the issue is, but perhaps as long as we keep finding hits that are "better" we keep exploring the graph, where better means (score, -docid) >= (lowest score, -docid). If that's right and all docs have the same score, then we probably need to either switch to > (but this could lead to poorer recall in normal cases) or introduce some kind of minimum score threshold?
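The hypothesis in the report above (ties count as "better", so the search never stops expanding) can be sketched with a toy best-first search. This is a simplified simulation under stated assumptions, not Lucene's HNSW code: the graph is a hypothetical ring where each node links to the next `degree` nodes, every score is identical, and the docid tie-break from the report is ignored. The class and method names are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.PriorityQueue;

final class TieSearchDemo {
  /**
   * Counts how many nodes a best-first search scores on a ring graph where every node has the
   * same score. With acceptTies a neighbor is kept when score >= worst kept score; otherwise
   * only when score > worst kept score.
   */
  static int visited(int numNodes, int degree, int topK, boolean acceptTies) {
    boolean[] seen = new boolean[numNodes];
    // All scores are identical, so FIFO order is a valid best-first order here.
    ArrayDeque<Integer> candidates = new ArrayDeque<>();
    // Min-heap of kept scores: the head is the worst result currently kept.
    PriorityQueue<Double> results = new PriorityQueue<>();
    final double score = 0.0; // all vectors are zero, so every similarity ties

    seen[0] = true;
    candidates.add(0);
    results.add(score);
    int scored = 1;

    while (!candidates.isEmpty()) {
      int node = candidates.poll();
      for (int i = 1; i <= degree; i++) {
        int neighbor = (node + i) % numNodes;
        if (seen[neighbor]) {
          continue;
        }
        seen[neighbor] = true;
        scored++;
        boolean full = results.size() >= topK;
        double worst = results.peek();
        boolean better = acceptTies ? score >= worst : score > worst;
        if (!full || better) {
          // Accepted neighbors are expanded later, which is what keeps the search alive.
          candidates.add(neighbor);
          results.add(score);
          if (results.size() > topK) {
            results.poll();
          }
        }
      }
    }
    return scored;
  }

  public static void main(String[] args) {
    System.out.println("ties accepted: " + visited(1000, 4, 10, true));
    System.out.println("ties rejected: " + visited(1000, 4, 10, false));
  }
}
```

In this toy setting, accepting ties makes the search score every node in the graph, while the strict comparison stops shortly after the result queue fills. Whether that is the actual cause in Lucene, and what a strict comparison would cost in recall on normal data, is left open by the report.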
[GitHub] [lucene] rmuir commented on a diff in pull request #916: Refine contribution guide and pull request template
rmuir commented on code in PR #916: URL: https://github.com/apache/lucene/pull/916#discussion_r879605545 ## CONTRIBUTING.md: ## @@ -78,8 +78,11 @@ Please be patient. Committers are busy people too. If no one responds to your pa Please refer to [GitHub's documentation](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests) for an explanation of how to create a pull request. +You should open a pull request against the `main` branch. It is also recommended to give Lucene maintainers [access](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to your PR branch. Review Comment: Is this "access" step required anymore? Isn't it the default in github these days?
[GitHub] [lucene] mocobeta commented on pull request #920: LUCENE-10589: increase upper bound of test range query to the maximum value + 1
mocobeta commented on PR #920: URL: https://github.com/apache/lucene/pull/920#issuecomment-1134837438 According to the javadocs of the test, using a randomly skewed index with RandomIndexWriter is an intentional choice I think? ``` /** Tests with random vectors and a random filter. Uses RandomIndexWriter. */ ```
[GitHub] [lucene] msokolov commented on pull request #920: LUCENE-10589: increase upper bound of test range query to the maximum value + 1
msokolov commented on PR #920: URL: https://github.com/apache/lucene/pull/920#issuecomment-1134842970 Yeah, I'm just not sure that all the implications are desirable for this test. For example if we have a segment with 5 docs, their "tags" might all be > the threshold of the filter?
[GitHub] [lucene] mocobeta commented on a diff in pull request #916: Refine contribution guide and pull request template
mocobeta commented on code in PR #916: URL: https://github.com/apache/lucene/pull/916#discussion_r879634562 ## CONTRIBUTING.md: ## @@ -78,8 +78,11 @@ Please be patient. Committers are busy people too. If no one responds to your pa Please refer to [GitHub's documentation](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests) for an explanation of how to create a pull request. +You should open a pull request against the `main` branch. It is also recommended to give Lucene maintainers [access](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to your PR branch. Review Comment: For sure. Since we already have a good default, the request to give access is no longer needed.
[GitHub] [lucene] shahrs87 commented on a diff in pull request #897: LUCENE-10266 Move nearest-neighbor search on points to core
shahrs87 commented on code in PR #897: URL: https://github.com/apache/lucene/pull/897#discussion_r879640507 ## lucene/core/src/java/org/apache/lucene/document/LatLonPoint.java: ## @@ -362,4 +377,72 @@ public static Query newDistanceFeatureQuery( } return query; } + + /** + * Finds the {@code n} nearest indexed points to the provided point, according to Haversine + * distance. + * + * This is functionally equivalent to running {@link MatchAllDocsQuery} with a {@link + * LatLonDocValuesField#newDistanceSort}, but is far more efficient since it takes advantage of + * properties the indexed BKD tree. Currently this only works with {@link Lucene90PointsFormat} Review Comment: Removed in the latest revision.
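The Javadoc quoted in the diff above describes the result as functionally equivalent to sorting every document by Haversine distance. A minimal sketch of that brute-force baseline follows; it is not the BKD-based implementation from the PR. The class name, method names, and the Earth radius constant are assumptions made for illustration, and points are plain `{lat, lon}` pairs in degrees.

```java
import java.util.Arrays;
import java.util.Comparator;

final class HaversineNearestDemo {
  // Mean Earth radius in meters; an assumption of this sketch, not taken from the PR.
  private static final double EARTH_RADIUS_METERS = 6_371_008.8;

  /** Great-circle distance in meters between two lat/lon points given in degrees. */
  static double haversineMeters(double lat1, double lon1, double lat2, double lon2) {
    double sinLat = Math.sin(Math.toRadians(lat2 - lat1) / 2);
    double sinLon = Math.sin(Math.toRadians(lon2 - lon1) / 2);
    double a =
        sinLat * sinLat
            + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2)) * sinLon * sinLon;
    return 2 * EARTH_RADIUS_METERS * Math.asin(Math.min(1.0, Math.sqrt(a)));
  }

  /** Returns the indexes of the n points closest to (lat, lon), nearest first. */
  static int[] nearest(double lat, double lon, double[][] points, int n) {
    Integer[] order = new Integer[points.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    // Brute force: score every point, then sort. A tree-based search avoids this full scan.
    Arrays.sort(
        order,
        Comparator.comparingDouble(
            (Integer i) -> haversineMeters(lat, lon, points[i][0], points[i][1])));
    int k = Math.min(n, order.length);
    int[] out = new int[k];
    for (int i = 0; i < k; i++) {
      out[i] = order[i];
    }
    return out;
  }

  public static void main(String[] args) {
    double[][] cities = {{35.6762, 139.6503}, {48.8566, 2.3522}, {52.52, 13.405}};
    System.out.println(Arrays.toString(nearest(51.5074, -0.1278, cities, 2)));
  }
}
```

The full sort costs O(N log N) per query, which is the expense the quoted Javadoc says the indexed tree avoids.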
[GitHub] [lucene] shahrs87 commented on a diff in pull request #897: LUCENE-10266 Move nearest-neighbor search on points to core
shahrs87 commented on code in PR #897: URL: https://github.com/apache/lucene/pull/897#discussion_r879640718 ## lucene/core/src/java/org/apache/lucene/search/NearestNeighbor.java: ## @@ -31,12 +31,8 @@ import org.apache.lucene.util.Bits; import org.apache.lucene.util.SloppyMath; -/** - * KNN search on top of 2D lat/lon indexed points. - * - * @lucene.experimental - */ -class NearestNeighbor { +/** KNN search on top of 2D lat/lon indexed points. */ +public class NearestNeighbor { Review Comment: Yes. Changed in the latest revision.
[GitHub] [lucene] shahrs87 commented on pull request #897: LUCENE-10266 Move nearest-neighbor search on points to core
shahrs87 commented on PR #897: URL: https://github.com/apache/lucene/pull/897#issuecomment-1134875992 @jpountz Thank you for the feedback. I have addressed your comments in the latest revision. I have one question. For the CHANGES.txt entry, should this change go in the `API changes` section or in `Other`? Please advise.
[jira] [Commented] (LUCENE-10590) Indexing all zero vectors leads to heat death of the universe
[ https://issues.apache.org/jira/browse/LUCENE-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541100#comment-17541100 ] Dawid Weiss commented on LUCENE-10590: -- Love the title, [~sokolov]. Very Douglas-y Adams-y.
[GitHub] [lucene] Yuti-G commented on a diff in pull request #915: LUCENE-10585: Scrub copy/paste code in the facets module and attempt to simplify a bit
Yuti-G commented on code in PR #915: URL: https://github.com/apache/lucene/pull/915#discussion_r879786369 ## lucene/facet/src/java/org/apache/lucene/facet/sortedset/AbstractSortedSetDocValueFacetCounts.java: ## @@ -0,0 +1,333 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.facet.sortedset; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Comparator; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.PrimitiveIterator; +import org.apache.lucene.facet.FacetResult; +import org.apache.lucene.facet.Facets; +import org.apache.lucene.facet.FacetsConfig; +import org.apache.lucene.facet.FacetsConfig.DimConfig; +import org.apache.lucene.facet.LabelAndValue; +import org.apache.lucene.facet.TopOrdAndIntQueue; +import org.apache.lucene.facet.sortedset.SortedSetDocValuesReaderState.DimTree; +import org.apache.lucene.facet.sortedset.SortedSetDocValuesReaderState.OrdRange; +import org.apache.lucene.index.SortedSetDocValues; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.PriorityQueue; + +/** Base class for SSDV faceting implementations. 
*/ +abstract class AbstractSortedSetDocValueFacetCounts extends Facets { + + private static final Comparator FACET_RESULT_COMPARATOR = + new Comparator<>() { +@Override +public int compare(FacetResult a, FacetResult b) { + if (a.value.intValue() > b.value.intValue()) { +return -1; + } else if (b.value.intValue() > a.value.intValue()) { +return 1; + } else { +return a.dim.compareTo(b.dim); + } +} + }; + + final SortedSetDocValuesReaderState state; + final FacetsConfig stateConfig; + final SortedSetDocValues dv; + final String field; + + AbstractSortedSetDocValueFacetCounts(SortedSetDocValuesReaderState state) throws IOException { +this.state = state; +this.field = state.getField(); +this.stateConfig = state.getFacetsConfig(); +this.dv = state.getDocValues(); + } + + @Override + public FacetResult getTopChildren(int topN, String dim, String... path) throws IOException { +validateTopN(topN); +TopChildrenForPath topChildrenForPath = getTopChildrenForPath(topN, dim, path); +return createFacetResult(topChildrenForPath, dim, path); + } + + @Override + public Number getSpecificValue(String dim, String... 
path) throws IOException { +if (path.length != 1) { + throw new IllegalArgumentException("path must be length=1"); +} +int ord = (int) dv.lookupTerm(new BytesRef(FacetsConfig.pathToString(dim, path))); +if (ord < 0) { + return -1; +} + +return getCount(ord); + } + + @Override + public List getAllDims(int topN) throws IOException { +validateTopN(topN); +List results = new ArrayList<>(); +for (String dim : state.getDims()) { + TopChildrenForPath topChildrenForPath = getTopChildrenForPath(topN, dim); + FacetResult facetResult = createFacetResult(topChildrenForPath, dim); + if (facetResult != null) { +results.add(facetResult); + } +} + +// Sort by highest count: +results.sort(FACET_RESULT_COMPARATOR); +return results; + } + + @Override + public List getTopDims(int topNDims, int topNChildren) throws IOException { +validateTopN(topNDims); +validateTopN(topNChildren); + +// Creates priority queue to store top dimensions and sort by their aggregated values/hits and +// string values. +PriorityQueue pq = +new PriorityQueue<>(topNDims) { + @Override + protected boolean lessThan(DimValue a, DimValue b) { +if (a.value > b.value) { + return false; +} else if (a.value < b.value) { + return true; +} else { + return a.dim.compareTo(b.dim) > 0; +} + } +}; + +// Keep track of intermediate results, if we compute them, so we can reuse them later: +Map intermediateResults = null; + +for (String dim : state.getDims()) { + DimConfig dimConfig = stateConfig.getDimConfig(dim); + int d
[GitHub] [lucene] Yuti-G commented on pull request #915: LUCENE-10585: Scrub copy/paste code in the facets module and attempt to simplify a bit
Yuti-G commented on PR #915: URL: https://github.com/apache/lucene/pull/915#issuecomment-1135050190 Hi @gsmiller, thanks for making a lot of improvements to the code, and it looks great to me! I also ran the benchmarks for facet and do not observe much difference from the main branch. I added getTopDims to benchmarks but the PR hasn't merged yet, so the attached results are from my local run. Thanks! Main: https://user-images.githubusercontent.com/4710/169889777-cf059966-a38d-49b8-8699-e8ff5172967c.png pr/915: https://user-images.githubusercontent.com/4710/169890719-ec235cb4-4f49-47f9-9ca0-a1f822073bf5.png
[GitHub] [lucene] gsmiller commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
gsmiller commented on code in PR #841: URL: https://github.com/apache/lucene/pull/841#discussion_r879866385 ## lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangleFacetCounts.java: ## @@ -0,0 +1,149 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.facet.hyperrectangle; + +import java.io.IOException; +import java.util.Arrays; +import java.util.Collections; +import java.util.List; +import org.apache.lucene.document.LongPoint; +import org.apache.lucene.facet.FacetResult; +import org.apache.lucene.facet.Facets; +import org.apache.lucene.facet.FacetsCollector; +import org.apache.lucene.facet.LabelAndValue; +import org.apache.lucene.index.BinaryDocValues; +import org.apache.lucene.index.DocValues; +import org.apache.lucene.search.DocIdSetIterator; + +/** Get counts given a list of HyperRectangles (which must be of the same type) */ +public class HyperRectangleFacetCounts extends Facets { + /** Hypper rectangles passed to constructor. */ + protected final HyperRectangle[] hyperRectangles; + + /** Counts, initialized in subclass. */ + protected final int[] counts; + + /** Our field name. 
*/ + protected final String field; + + /** Number of dimensions for field */ + protected final int dims; + + /** Total number of hits. */ + protected int totCount; + + /** + * Create HyperRectangleFacetCounts using this + * + * @param field Field name + * @param hits Hits to facet on + * @param hyperRectangles List of hyper rectangle facets + * @throws IOException If there is a problem reading the field + */ + public HyperRectangleFacetCounts( + String field, FacetsCollector hits, HyperRectangle... hyperRectangles) throws IOException { +assert hyperRectangles.length > 0 : "Hyper rectangle ranges cannot be empty"; +assert areHyperRectangleDimsConsistent(hyperRectangles) +: "All hyper rectangles must be the same dimensionality"; +this.field = field; +this.hyperRectangles = hyperRectangles; +this.dims = hyperRectangles[0].dims; +this.counts = new int[hyperRectangles.length]; +count(field, hits.getMatchingDocs()); + } + + private boolean areHyperRectangleDimsConsistent(HyperRectangle[] hyperRectangles) { +int dims = hyperRectangles[0].dims; +return Arrays.stream(hyperRectangles).allMatch(hyperRectangle -> hyperRectangle.dims == dims); + } + + /** Counts from the provided field. */ + private void count(String field, List matchingDocs) + throws IOException { + +for (int i = 0; i < matchingDocs.size(); i++) { + + FacetsCollector.MatchingDocs hits = matchingDocs.get(i); + + BinaryDocValues binaryDocValues = DocValues.getBinary(hits.context.reader(), field); + + final DocIdSetIterator it = hits.bits.iterator(); + if (it == null) { +continue; + } Review Comment: Yeah, this convenience is nice. It also might optimize a little internally by figuring out what to lead with, etc. for doing the conjunction. So definitely nice to use. ## lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangleFacetCounts.java: ## @@ -0,0 +1,149 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. 
See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.facet.hyperrectangle; + +import java.io.IOException; +import java.util.Arrays; +import java.util.Collections; +import java.util.List; +import org.apache.lucene.document.LongPoint; +import org.apac
[GitHub] [lucene] gsmiller commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
gsmiller commented on code in PR #841:
URL: https://github.com/apache/lucene/pull/841#discussion_r879869751

## lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/LongPointFacetField.java: ##
@@ -0,0 +1,35 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.hyperrectangle;
+
+import org.apache.lucene.document.BinaryDocValuesField;
+import org.apache.lucene.document.LongPoint;
+
+/** Packs an array of longs into a {@link BinaryDocValuesField} */
+public class LongPointFacetField extends BinaryDocValuesField {

Review Comment:
Full transparency: Marc and I had a discussion about this offline so I wanted to circle back here with a suggestion I made to him so it's fully out in the open and we can carry a conversation forward with the community. While I initially suggested adding this as a sub-class of `BinaryRangeDocValuesField` (similar to what `LongRangeDocValuesField` does), I wonder if the right thing would be to actually formalize a new doc values format type. If we're building faceting, and potentially "slow range query" support on top of these, it seems like formalizing the format encoding might be the right thing to do.

I'd be really curious what the community thinks of this though, and recommended that Marc start that discussion. I'm personally leaning towards formalizing the format, and maybe even having single-valued and multi-valued versions (analogous to `(Sorted)NumericDocValues`).
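For readers unfamiliar with what packing an array of longs into one binary doc value involves, here is a minimal JDK-only sketch. It is not the code from PR #841: `LongPacking`, `pack` and `unpack` are hypothetical names, and the sign-bit flip is an assumption made so that unsigned byte-wise comparison of the packed form follows numeric order in each dimension.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper (not from the PR): packs a long[] point into one byte[] payload. */
final class LongPacking {
  private LongPacking() {}

  /** Encodes each dimension as 8 big-endian bytes with the sign bit flipped. */
  static byte[] pack(long... point) {
    // ByteBuffer is big-endian by default.
    ByteBuffer buf = ByteBuffer.allocate(point.length * Long.BYTES);
    for (long value : point) {
      // Flipping the sign bit makes negative values sort before positive ones
      // when the bytes are compared as unsigned.
      buf.putLong(value ^ Long.MIN_VALUE);
    }
    return buf.array();
  }

  /** Reverses pack(): reads 8 bytes per dimension and restores the sign bit. */
  static long[] unpack(byte[] packed) {
    if (packed.length % Long.BYTES != 0) {
      throw new IllegalArgumentException("length must be a multiple of " + Long.BYTES);
    }
    ByteBuffer buf = ByteBuffer.wrap(packed);
    long[] point = new long[packed.length / Long.BYTES];
    for (int i = 0; i < point.length; i++) {
      point[i] = buf.getLong() ^ Long.MIN_VALUE;
    }
    return point;
  }
}
```

A fixed-width, order-preserving layout of this kind is one example of what a formalized encoding could pin down, since a per-dimension range check can then work on the bytes directly.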
[GitHub] [lucene] gsmiller commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
gsmiller commented on code in PR #841:
URL: https://github.com/apache/lucene/pull/841#discussion_r879870847

## lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangle.java: ##
@@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.hyperrectangle;
+
+/** Holds the name and the number of dims for a HyperRectangle */
+public abstract class HyperRectangle {

Review Comment:
Does `HyperRectangle` itself actually need to be part of the public API though? Users certainly need the definitions for `Long/DoubleHyperRectangle` but do they need the `HyperRectangle` definition itself? Like would they need a generic reference to `HyperRectangle`? I'm not sure?
[GitHub] [lucene] gsmiller commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
gsmiller commented on code in PR #841:
URL: https://github.com/apache/lucene/pull/841#discussion_r879871471

## lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangleFacetCounts.java: ##
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.hyperrectangle;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.lucene.document.LongPoint;
+import org.apache.lucene.facet.FacetResult;
+import org.apache.lucene.facet.Facets;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.facet.LabelAndValue;
+import org.apache.lucene.index.BinaryDocValues;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.search.DocIdSetIterator;
+
+/** Get counts given a list of HyperRectangles (which must be of the same type) */
+public class HyperRectangleFacetCounts extends Facets {
+  /** Hypper rectangles passed to constructor. */
+  protected final HyperRectangle[] hyperRectangles;
+
+  /** Counts, initialized in subclass. */
+  protected final int[] counts;
+
+  /** Our field name. */
+  protected final String field;
+
+  /** Number of dimensions for field */
+  protected final int dims;
+
+  /** Total number of hits. */
+  protected int totCount;

Review Comment:
That makes sense. I think leaving it `private` until there's a need is good.
[jira] [Commented] (LUCENE-10590) Indexing all zero vectors leads to heat death of the universe
[ https://issues.apache.org/jira/browse/LUCENE-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541180#comment-17541180 ]

Michael Sokolov commented on LUCENE-10590:
--

> Love the title, Michael Sokolov. Very Douglas-y Adams-y. :starry eyes:

So I wrote a unit test, wrapped the `RandomVectorValues.vectorValue(int)` method to see where it was being called, and fiddled around with `BoundsChecker` to see what would happen if we swapped its `<` with a `<=`, and what I found is that in the existing situation, indeed, we crawl over the entire graph every time we insert a node, because every node looks like a viable candidate (we only exclude nodes whose scores are `<` the current least score (or `>` for the inverse scoring functions)).

But ... if we change to using `<=` (resp. `>=`) then the cost shifts over to `HnswGraphBuilder.findWorstNonDiverse` since there we early terminate in the opposite way.

Anyway that isn't very clear but the point is that these boundary conditions are sensitive to this equality case (where everything is equally distant to everything else) and they explode in different directions! Basically what we need to do is bias them to give up when stuff is exactly ==. Possibly BoundsChecker should get a new parameter (open/closed)

> Indexing all zero vectors leads to heat death of the universe
> -
>
> Key: LUCENE-10590
> URL: https://issues.apache.org/jira/browse/LUCENE-10590
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Michael Sokolov
> Priority: Major
>
> By accident while testing something else, I ran a luceneutil test indexing 1M
> 100d vectors where all the vectors were all zeroes. This caused indexing to
> take a very long time (~40x normal - it did eventually complete) and the
> search performance was similarly bad. We should not degrade by orders of
> magnitude with even the worst data though.
> I'm not entirely sure what the issue is, but perhaps as long as we keep
> finding hits that are "better" we keep exploring the graph, where better
> means (score, -docid) >= (lowest score, -docid). If that's right and all docs
> have the same score, then we probably need to either switch to > (but this
> could lead to poorer recall in normal cases) or introduce some kind of
> minimum score threshold?
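The sensitivity to the equality case can be reproduced in a toy model. The sketch below is not Lucene's `HnswGraphBuilder` or `BoundsChecker`: `BoundSketch`, the ring-shaped graph and the `rejectTies` flag (standing in for the open/closed parameter suggested above) are all assumptions for illustration. It counts how many nodes one best-first search scores, so the two comparison modes can be compared when every node has the same score.

```java
import java.util.PriorityQueue;
import java.util.function.IntToDoubleFunction;

/** Toy model (hypothetical, not Lucene code) of a bounded best-first graph search. */
final class BoundSketch {
  private BoundSketch() {}

  /** Open bound rejects only strictly worse scores; closed bound also rejects ties. */
  static boolean rejected(double score, double worstKept, boolean rejectTies) {
    return rejectTies ? score <= worstKept : score < worstKept;
  }

  /**
   * Searches a ring graph in which node i links to the next `degree` nodes,
   * starting at node 0, and returns how many nodes were scored.
   */
  static int countVisited(
      int numNodes, int degree, int topK, IntToDoubleFunction score, boolean rejectTies) {
    if (numNodes < 1 || topK < 1) {
      throw new IllegalArgumentException("numNodes and topK must be positive");
    }
    boolean[] visited = new boolean[numNodes];
    // Candidates as {score, node}: best score first, ties broken by lower node id.
    PriorityQueue<double[]> candidates =
        new PriorityQueue<>(
            (a, b) -> a[0] != b[0] ? Double.compare(b[0], a[0]) : Double.compare(a[1], b[1]));
    // Results: worst kept score on top, capped at topK entries.
    PriorityQueue<Double> results = new PriorityQueue<>();

    double entryScore = score.applyAsDouble(0);
    visited[0] = true;
    int visitedCount = 1;
    candidates.add(new double[] {entryScore, 0});
    results.add(entryScore);

    while (!candidates.isEmpty()) {
      double[] top = candidates.poll();
      // Stop when the best remaining candidate cannot improve a full result queue.
      if (results.size() >= topK && rejected(top[0], results.peek(), rejectTies)) {
        break;
      }
      int node = (int) top[1];
      for (int step = 1; step <= degree; step++) {
        int neighbor = (node + step) % numNodes;
        if (visited[neighbor]) {
          continue;
        }
        visited[neighbor] = true;
        visitedCount++;
        double s = score.applyAsDouble(neighbor);
        if (results.size() >= topK && rejected(s, results.peek(), rejectTies)) {
          continue;
        }
        candidates.add(new double[] {s, neighbor});
        results.add(s);
        if (results.size() > topK) {
          results.poll(); // drop the worst kept score
        }
      }
    }
    return visitedCount;
  }
}
```

With 1000 nodes, 4 links per node, topK = 10 and a constant score, `rejectTies = false` scores all 1000 nodes, while `rejectTies = true` stops shortly after the result queue fills. The toy does not model the diversity check, so it says nothing about the cost shift to `findWorstNonDiverse` described above.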
[GitHub] [lucene] jtibshirani commented on pull request #920: LUCENE-10589: increase upper bound of test range query to the maximum value + 1
jtibshirani commented on PR #920: URL: https://github.com/apache/lucene/pull/920#issuecomment-1135164651 Thank you @mocobeta for looking into this! I don't think the failure is caused by having multiple segments, since we make sure to force merge to one segment before starting the searches. Stepping through what happens, it looks like we just hit a really unlucky query + data combination where it takes more than 150 steps to conclude the search. Your proposed fix makes sense to me. Another option is to decrease `k` to make the search more restrictive (currently it's set to 5, I think 1 would work instead).
[jira] [Commented] (LUCENE-10590) Indexing all zero vectors leads to heat death of the universe
[ https://issues.apache.org/jira/browse/LUCENE-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541185#comment-17541185 ] Julie Tibshirani commented on LUCENE-10590: --- I don't have a deep understanding of what's happening, but wanted to share this discussion from hnswlib: [https://github.com/nmslib/hnswlib/issues/263#issuecomment-739549454.] It looks like HNSW can really fall apart if there are a lot of duplicate vectors. The duplicates all link to each other, creating a highly disconnected graph. I've often seen libraries recommend that users deduplicate vectors before indexing them ([https://github.com/facebookresearch/faiss/wiki/FAQ#searching-duplicate-vectors-is-slow).] I guess indexing all zero vectors is an extreme version of this!
[jira] [Commented] (LUCENE-10590) Indexing all zero vectors leads to heat death of the universe
[ https://issues.apache.org/jira/browse/LUCENE-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541193#comment-17541193 ] Michael Sokolov commented on LUCENE-10590: -- Thanks Julie, this is definitely the same problem. I fiddled around with bounds checking but it's not so obvious how to fix this. I wonder if we can impose a default visitedLimit to avoid this kind of runaway explosion. Duplicate detection sounds challenging
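A visit budget of the kind suggested here could look roughly like the following. This is a sketch under assumptions, not Lucene's search code: `VisitBudgetSketch`, `explore` and `Outcome` are hypothetical names, and scoring is left out entirely because in the all-equal case the score bound never prunes anything. The traversal stops as soon as it would exceed `visitedLimit` and reports that it was cut short, so a caller can decide what to do next.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of a graph traversal with a cap on visited nodes. */
final class VisitBudgetSketch {
  private VisitBudgetSketch() {}

  /** How many nodes were visited and whether the cap ended the traversal early. */
  static final class Outcome {
    final int visited;
    final boolean truncated;

    Outcome(int visited, boolean truncated) {
      this.visited = visited;
      this.truncated = truncated;
    }
  }

  /**
   * Breadth-first traversal from entryPoint over an adjacency list.
   * The entry point always counts as visited, even if visitedLimit is below 1.
   */
  static Outcome explore(int[][] neighbors, int entryPoint, int visitedLimit) {
    boolean[] seen = new boolean[neighbors.length];
    Deque<Integer> frontier = new ArrayDeque<>();
    seen[entryPoint] = true;
    frontier.add(entryPoint);
    int visited = 1;

    while (!frontier.isEmpty()) {
      int node = frontier.poll();
      for (int next : neighbors[node]) {
        if (seen[next]) {
          continue;
        }
        if (visited >= visitedLimit) {
          // Budget exhausted before reaching a new node: give up and say so.
          return new Outcome(visited, true);
        }
        seen[next] = true;
        visited++;
        frontier.add(next);
      }
    }
    return new Outcome(visited, false);
  }
}
```

Returning a `truncated` flag instead of throwing leaves the policy to the caller, for example accepting partial results or switching to a different strategy.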
[jira] [Comment Edited] (LUCENE-10590) Indexing all zero vectors leads to heat death of the universe
[ https://issues.apache.org/jira/browse/LUCENE-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541185#comment-17541185 ] Julie Tibshirani edited comment on LUCENE-10590 at 5/23/22 10:26 PM: - I don't have a deep understanding of what's happening, but wanted to share this discussion from hnswlib: [https://github.com/nmslib/hnswlib/issues/263#issuecomment-739549454]. It looks like HNSW can really fall apart if there are a lot of duplicate vectors. The duplicates all link to each other, creating a highly disconnected graph. I've often seen libraries recommend that users deduplicate vectors before indexing them ([https://github.com/facebookresearch/faiss/wiki/FAQ#searching-duplicate-vectors-is-slow]). I guess indexing all zero vectors is an extreme version of this! was (Author: julietibs): I don't have a deep understanding of what's happening, but wanted to share this discussion from hnswlib: [https://github.com/nmslib/hnswlib/issues/263#issuecomment-739549454]. It looks like HNSW can really fall apart if there are a lot of duplicate vectors. The duplicates all link to each other, creating a highly disconnected graph. I've often seen libraries recommend that users deduplicate vectors before indexing them ([https://github.com/facebookresearch/faiss/wiki/FAQ#searching-duplicate-vectors-is-slow).] I guess indexing all zero vectors is an extreme version of this!
[GitHub] [lucene] mocobeta commented on a diff in pull request #916: Refine contribution guide and pull request template
mocobeta commented on code in PR #916:
URL: https://github.com/apache/lucene/pull/916#discussion_r879964992

## CONTRIBUTING.md: ##
@@ -78,8 +78,11 @@ Please be patient. Committers are busy people too. If no one responds to your pa
 Please refer to [GitHub's documentation](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests) for an explanation of how to create a pull request.
+You should open a pull request against the `main` branch. It is also recommended to give Lucene maintainers [access](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to your PR branch.

Review Comment:
```suggestion
You should open a pull request against the `main` branch.
```
[GitHub] [lucene] mocobeta commented on a diff in pull request #916: Refine contribution guide and pull request template
mocobeta commented on code in PR #916:
URL: https://github.com/apache/lucene/pull/916#discussion_r879967339

## CONTRIBUTING.md: ##
@@ -78,8 +78,11 @@ Please be patient. Committers are busy people too. If no one responds to your pa
 Please refer to [GitHub's documentation](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests) for an explanation of how to create a pull request.
+You should open a pull request against the `main` branch.

Review Comment:
```suggestion
You should open a pull request against the `main` branch. Committers will backport it to the maintenance branches once the change is merged into `main` (as far as it is possible).
```
[jira] [Updated] (LUCENE-10590) Indexing all zero vectors leads to heat death of the universe
[ https://issues.apache.org/jira/browse/LUCENE-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Xugang updated LUCENE-10590: --- Attachment: image.png
[jira] [Updated] (LUCENE-10590) Indexing all zero vectors leads to heat death of the universe
[ https://issues.apache.org/jira/browse/LUCENE-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Xugang updated LUCENE-10590: --- Attachment: (was: image.png)
[jira] [Commented] (LUCENE-10559) Add preFilter/postFilter options to KnnGraphTester
[ https://issues.apache.org/jira/browse/LUCENE-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541282#comment-17541282 ]

Kaival Parikh commented on LUCENE-10559:

The graph construction parameters were:
docs = path_to_vec_file, ndoc = 100, dim = 100, fanout = 0, maxConn = 150, beamWidthIndex = 300

All these were the same for search time, with additional:
search = path_to_query_file, niter = 1000, selectivity = (as required, 0.01 ~ 0.8), prefilter (as required)

Also you were right about the search vectors, there was an overlap with the training set. I created a fresh query file excluding trained vectors and re-ran the utility:

||selectivity||effective topK||post-filter recall||post-filter time||pre-filter recall||pre-filter time||
|0.8|125|0.965|1.57|0.976|1.61|
|0.6|166|0.959|2.07|0.981|2.00|
|0.4|250|0.962|2.71|0.986|2.65|
|0.2|500|0.958|4.80|0.992|4.51|
|0.1|1000|0.954|8.61|0.994|7.74|
|0.01|1|0.971|58.78|1.000|9.44|

The recall and time seem to be in the same range as before. The high recall for selective queries (selectivity = 0.01, prefilter, recall = 1.000) may be due to performing an exact search when the nodes visited limit is reached

> Add preFilter/postFilter options to KnnGraphTester
> --
>
> Key: LUCENE-10559
> URL: https://issues.apache.org/jira/browse/LUCENE-10559
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael Sokolov
> Priority: Major
>
> We want to be able to test the efficacy of pre-filtering in KnnVectorQuery:
> if you (say) want the top K nearest neighbors subject to a constraint Q, are
> you better off over-selecting (say 2K) top hits and *then* filtering
> (post-filtering), or incorporating the filtering into the query
> (pre-filtering). How does it depend on the selectivity of the filter?
> I think we can get a reasonable testbed by generating a uniform random filter
> with some selectivity (that is consistent and repeatable). Possibly we'd also
> want to try filters that are correlated with index order, but it seems they'd
> be unlikely to be correlated with vector values in a way that the graph
> structure would notice, so random is a pretty good starting point for this.
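The effective topK column in the first five rows of the table is consistent with over-selecting `topK / selectivity` hits for a topK of 100, rounded down. Both the topK of 100 and the rounding are inferred from the table, not stated in the comment, and `OverSelection.effectiveTopK` below is a hypothetical helper, not KnnGraphTester code. Selectivity is passed as a whole percentage to keep the arithmetic in integers.

```java
/** Hypothetical helper: how many hits to over-select before post-filtering. */
final class OverSelection {
  private OverSelection() {}

  /**
   * Returns topK divided by the selectivity, rounded down.
   * selectivityPercent is the share of documents that pass the filter, from 1 to 100.
   */
  static int effectiveTopK(int topK, int selectivityPercent) {
    if (topK < 1 || selectivityPercent < 1 || selectivityPercent > 100) {
      throw new IllegalArgumentException("topK must be >= 1 and selectivityPercent in [1, 100]");
    }
    // Widen to long so topK * 100 cannot overflow; the result is clamped to int range.
    long effective = (long) topK * 100 / selectivityPercent;
    return (int) Math.min(effective, Integer.MAX_VALUE);
  }
}
```

For example, `OverSelection.effectiveTopK(100, 60)` yields 166, matching the 0.6 row. The intuition is that if only a fraction of hits survive the filter, roughly `topK / selectivity` unfiltered hits are needed to end up with about topK filtered ones.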
[GitHub] [lucene] mocobeta commented on pull request #916: Refine contribution guide and pull request template
mocobeta commented on PR #916: URL: https://github.com/apache/lucene/pull/916#issuecomment-1135442476 Thank you @rmuir for reviewing. I'd merge this - let's restart from a blank sheet (all necessary information should be written in the contribution guide), but if anyone has suggestions to make the pull request template helpful for committers/contributors, please make another issue/PR.
[GitHub] [lucene] mocobeta merged pull request #916: Refine contribution guide and pull request template
mocobeta merged PR #916: URL: https://github.com/apache/lucene/pull/916
[GitHub] [lucene] mocobeta merged pull request #918: LUCENE-10586: Minor cleanup for local variables in BlockTreeTermsReader
mocobeta merged PR #918: URL: https://github.com/apache/lucene/pull/918
[jira] [Commented] (LUCENE-10586) Minor refactoring in Lucene90BlockTreeTermsReader local variables: metaIn, indexMetaIn, termsMetaIn
[ https://issues.apache.org/jira/browse/LUCENE-10586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541290#comment-17541290 ]

ASF subversion and git services commented on LUCENE-10586:
--

Commit f5c1f11a2afeb685d919a47904d525f076c90fda in lucene's branch refs/heads/main from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=f5c1f11a2af ]

LUCENE-10586: Minor cleanup local variables in BlockTreeTermsReader (#918)

> Minor refactoring in Lucene90BlockTreeTermsReader local variables: metaIn, indexMetaIn, termsMetaIn
> ---
>
> Key: LUCENE-10586
> URL: https://issues.apache.org/jira/browse/LUCENE-10586
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Tomoko Uchida
> Priority: Trivial
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Those three local variables refer to the same {{IndexInput}} object (no
> clone() is called).
> {code}
> indexMetaIn = termsMetaIn = metaIn;
> {code}
> I'm not sure but maybe there are some historical reasons. I wonder if it
> would be better to have only one reference for the underlying {{IndexInput}}
> object to make it a little easy to follow the code.
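The aliasing described in the issue can be illustrated with `java.nio.ByteBuffer` standing in for `IndexInput` (an assumption made to keep the example JDK-only; `AliasSketch` and its methods are hypothetical names). Three names assigned from one reference share a single read position, whereas a duplicate, loosely analogous to calling `clone()`, keeps its own.

```java
import java.nio.ByteBuffer;

/** Hypothetical illustration of aliased references versus an independent duplicate. */
final class AliasSketch {
  private AliasSketch() {}

  /** Reads three ints through three names that all refer to the same buffer. */
  static int[] readThroughAliases(byte[] data) {
    ByteBuffer metaIn = ByteBuffer.wrap(data);
    ByteBuffer indexMetaIn;
    ByteBuffer termsMetaIn;
    indexMetaIn = termsMetaIn = metaIn; // same object, no copy: one shared position
    int first = metaIn.getInt();
    int second = indexMetaIn.getInt(); // continues where metaIn stopped
    int third = termsMetaIn.getInt(); // continues where indexMetaIn stopped
    return new int[] {first, second, third};
  }

  /** Reads one int from the buffer and one from a duplicate with its own position. */
  static int[] readThroughDuplicate(byte[] data) {
    ByteBuffer metaIn = ByteBuffer.wrap(data);
    ByteBuffer copy = metaIn.duplicate(); // shares the bytes, not the position
    int fromOriginal = metaIn.getInt();
    int fromCopy = copy.getInt(); // starts from the beginning again
    return new int[] {fromOriginal, fromCopy};
  }
}
```

Because the aliases behave as one object, collapsing them into a single variable does not change what is read, which is in line with the cleanup the issue proposes.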
[jira] [Commented] (LUCENE-10586) Minor refactoring in Lucene90BlockTreeTermsReader local variables: metaIn, indexMetaIn, termsMetaIn
[ https://issues.apache.org/jira/browse/LUCENE-10586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541292#comment-17541292 ] ASF subversion and git services commented on LUCENE-10586: -- Commit 2cd9eb13262a0598c6f3c0409103121e72256772 in lucene's branch refs/heads/branch_9x from Tomoko Uchida [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=2cd9eb13262 ] LUCENE-10586: Minor cleanup local variables in BlockTreeTermsReader (#918)