[jira] [Commented] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
[ https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529831#comment-17529831 ] Dawid Weiss commented on LUCENE-10292:
--

> All I was really trying to do with these tests was demonstrate that data you get out of the Lookup before you call build(), can still be gotten from the Lookup while build() is incrementally consuming an iterator (which may take a long time if you are building up from a long iterator) and that this behavior is consistent across Lookup impls (as opposed to before I filed this issue, when most Lookups worked that way, but AnalyzingInfixSuggester would throw an ugly exception – which was certainly confusing to users who might switch from one impl to another).

I guess I am not comfortable with the fact that this test works only by a lucky coincidence and tests behavior that isn't guaranteed or documented by the Lookup class - this got me confused, and I guess it'll confuse people looking at this code after me. It's not a personal stab at you; it's just something that smells fishy around this code in general.

When I was looking at the failure and tried to debug the test, I didn't see the reason why this test was necessary (I looked at the Lookup class documentation). When I understood what the test did, I looked at the implementations, and they seemed to be designed with a single-threaded model in mind (external synchronization between lookups and rebuilds). For example, even now, if you had a tight loop in one thread calling lookup on an FSTCompletionLookup and this loop got compiled, there's nothing preventing the compiler from reading the higherWeightsCompletion and normalCompletion fields once and never again (they're regular fields in FSTCompletionLookup), even if you call build there multiple times in between... Is this likely to happen? I don't know. Is this possible? Sure.
Maybe I'm oversensitive because I grew up on machines with much less strict cache coherency protocols, but code like this makes me itchy.

> I didn't set out to make any hard & fast guarantee about the thread safety of all lookups – just improve the one that was obviously inconsistent with the others (progress, not perfection)

That's my point. Either we should make the Lookup interface explicitly state that it's safe to call the build method from another thread, or we shouldn't really guarantee (or test) this behavior. I don't want you to revert the changes you made, but my gut feeling is that lookup implementations should be designed to be single-threaded or at least immutable (one publisher-multiple readers model), as it makes implementing them much easier - no volatiles, synchronization blocks, etc. Concurrency concerns should be handled by the code that uses Lookups - this code should know whether synchronization or two concurrent instances are required (one doing the lookups, potentially via multiple threads, one rebuilding). Perhaps a change in the API is needed to separate those two phases (build-use), and then the downstream code has to take care of handling/swapping out the Lookup reference where it's used - I don't know, I just state what I think.

> AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
>
> Key: LUCENE-10292
> URL: https://issues.apache.org/jira/browse/LUCENE-10292
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Chris M. Hostetter
> Assignee: Chris M. Hostetter
> Priority: Major
> Fix For: 10.0 (main), 9.2
>
> Attachments: LUCENE-10292-1.patch, LUCENE-10292-2.patch, LUCENE-10292-3.patch, LUCENE-10292.patch
>
> I'm filing this based on anecdotal information from a Solr user w/o experiencing it first hand (and I don't have a test case to demonstrate it) but based on a reading of the code the underlying problem seems self evident...
> With all other Lookup implementations I've examined, it is possible to call {{lookup()}} regardless of whether another thread is concurrently calling {{build()}} – in all cases I've seen, it is even possible to call {{lookup()}} even if {{build()}} has never been called: the result is just an "empty" {{List}}
> Typically this works because the {{build()}} method uses temporary datastructures until its "build logic" is complete, at which point it atomically replaces the datastructures used by the {{lookup()}} method. In the case of {{AnalyzingInfixSuggester}} however, the {{build()}} method starts by closing & null'ing out the {{protected SearcherManager searcherMgr}} (which it only populates again once it has completed building up its index) and then the lookup method starts with...
> {code:java}
> if (searcherMgr
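The "temporary datastructures, then atomic replace" pattern described in the issue - together with the field-visibility concern Dawid raises about FSTCompletionLookup - can be sketched in plain Java. This is a hypothetical illustration, not Lucene's actual Lookup code: the class and method names below are stand-ins.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the pattern the "safe" Lookup impls follow: build()
// fills a private temporary structure and only publishes it once complete.
// The AtomicReference provides both atomicity and the happens-before edge a
// plain (non-volatile) field would lack, so a JIT-compiled loop calling
// lookup() cannot hoist the read and miss a concurrent rebuild.
public class SwapOnBuildSketch {
    private final AtomicReference<List<String>> current =
        new AtomicReference<>(List.of()); // before any build(): an empty list

    public void build(Iterator<String> entries) {
        List<String> temp = new ArrayList<>(); // invisible to readers while building
        while (entries.hasNext()) {
            temp.add(entries.next());
        }
        current.set(List.copyOf(temp)); // atomic publish at the very end
    }

    public List<String> lookup() {
        // Never null and never half-built: readers see either the old data
        // or the new data, even while build() is consuming a long iterator.
        return current.get();
    }
}
```

With this shape, calling lookup() before or during build() simply returns the previously published (possibly empty) list, which is the cross-impl behavior the test in question was trying to pin down.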
[GitHub] [lucene] dweiss commented on pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
dweiss commented on PR #850: URL: https://github.com/apache/lucene/pull/850#issuecomment-1112959221 Thanks @mocobeta ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?
[ https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529837#comment-17529837 ] ASF subversion and git services commented on LUCENE-10541: -- Commit 6e6c61eb13479c4cc8470cd74573009ef2faaf6c in lucene's branch refs/heads/main from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=6e6c61eb134 ] LUCENE-10541: Test-framework: limit the default length of MockTokenizer tokens to 255. > What to do about massive terms in our Wikipedia EN LineFileDocs? > > > Key: LUCENE-10541 > URL: https://issues.apache.org/jira/browse/LUCENE-10541 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > Time Spent: 3h > Remaining Estimate: 0h > > Spinoff from this fun build failure that [~dweiss] root caused: > [https://lucene.markmail.org/thread/pculfuazll4oebra] > Thank you and sorry [~dweiss]!! > This test failure happened because the test case randomly indexed a chunk of > the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's > ~32 KB limit) term, and IW threw an {{IllegalArgumentException}} failing the > test. > It's crazy that it took so long for Lucene's randomized tests to discover > this too-massive term in Lucene's nightly benchmarks. It's like searching > for Nessie, or > [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence]. > We need to prevent such false failures, somehow, and there are multiple > options: fix this test to not use {{{}LineFileDocs{}}}, remove all "massive" > terms from all tests (nightly and git) {{{}LineFileDocs{}}}, fix > {{MockTokenizer}} to trim such ridiculous terms (I think this is the best > option?), ... -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
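The committed fix caps MockTokenizer's default token length at 255. A minimal sketch of the idea (hypothetical code, not MockTokenizer's actual implementation - whether the real tokenizer truncates or splits over-long runs is an implementation detail):

```java
// Hypothetical sketch: cap a token's length so that no single term can exceed
// an indexer's term-size limit (IndexWriter rejects terms over ~32 KB; the
// MockTokenizer fix uses a much smaller default cap of 255 chars).
public class TokenLengthCapSketch {
    static final int MAX_TOKEN_LENGTH = 255;

    /** Returns the token, truncated to at most MAX_TOKEN_LENGTH characters. */
    static String cap(String token) {
        return token.length() <= MAX_TOKEN_LENGTH
            ? token
            : token.substring(0, MAX_TOKEN_LENGTH);
    }
}
```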
[jira] [Commented] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?
[ https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529839#comment-17529839 ] ASF subversion and git services commented on LUCENE-10541: -- Commit e7684708935a859c2255912457f95a616010cfea in lucene's branch refs/heads/branch_9x from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e7684708935 ] LUCENE-10541: Test-framework: limit the default length of MockTokenizer tokens to 255.
[jira] [Commented] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?
[ https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529844#comment-17529844 ] ASF subversion and git services commented on LUCENE-10541: -- Commit 47ca4bc21c8c75226f7ba9878044a9f3a5e99833 in lucene's branch refs/heads/branch_9_1 from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=47ca4bc21c8 ] LUCENE-10541: Test-framework: limit the default length of MockTokenizer tokens to 255.
[GitHub] [lucene] dweiss commented on pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
dweiss commented on PR #850: URL: https://github.com/apache/lucene/pull/850#issuecomment-1112986527 I've applied this patch on main, branch_9x and branch_9_1.
[GitHub] [lucene] dweiss closed pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
dweiss closed pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255. URL: https://github.com/apache/lucene/pull/850
[jira] [Commented] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?
[ https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529845#comment-17529845 ] Dawid Weiss commented on LUCENE-10541: -- I've applied the PR - we can close this issue (for now)?
[GitHub] [lucene] iverase opened a new pull request, #855: Use MIN_WIDE_EXTENT for GeoWideDegenerateHorizontalLine
iverase opened a new pull request, #855: URL: https://github.com/apache/lucene/pull/855 Follow up of https://github.com/apache/lucene/pull/845 where I missed GeoWideDegenerateHorizontalLine.
[GitHub] [lucene] iverase merged pull request #855: LUCENE-10508: Use MIN_WIDE_EXTENT for GeoWideDegenerateHorizontalLine
iverase merged PR #855: URL: https://github.com/apache/lucene/pull/855
[jira] [Commented] (LUCENE-10508) GeoArea failure with degenerated latitude
[ https://issues.apache.org/jira/browse/LUCENE-10508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529853#comment-17529853 ] ASF subversion and git services commented on LUCENE-10508: -- Commit 0dad9ddae87beb888b9374f9fb5f223e5940e586 in lucene's branch refs/heads/main from Ignacio Vera [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=0dad9ddae87 ] LUCENE-10508: Use MIN_WIDE_EXTENT for GeoWideDegenerateHorizontalLine (#855) > GeoArea failure with degenerated latitude > - > > Key: LUCENE-10508 > URL: https://issues.apache.org/jira/browse/LUCENE-10508 > Project: Lucene - Core > Issue Type: Bug > Components: modules/spatial3d >Reporter: Ignacio Vera >Assignee: Ignacio Vera >Priority: Major > Fix For: 9.2 > > Time Spent: 1.5h > Remaining Estimate: 0h > > I hit a failure when trying to build a GeoArea using the GeoAreaFactory. The > issue seems to happen when you have an almost degenerated minLatitude and > maxLatitude and you are close to the poles. Then you might hit the following > exception: > {code} > java.lang.IllegalArgumentException: Cannot determine sidedness because check > point is on plane.
> at __randomizedtesting.SeedInfo.seed([EA56BB13E754A996:C7560EE2BA56A507]:0)
> at org.apache.lucene.spatial3d.geom.SidedPlane.<init>(SidedPlane.java:137)
> at org.apache.lucene.spatial3d.geom.GeoDegenerateVerticalLine.<init>(GeoDegenerateVerticalLine.java:110)
> at org.apache.lucene.spatial3d.geom.GeoBBoxFactory.makeGeoBBox(GeoBBoxFactory.java:100)
> at org.apache.lucene.spatial3d.geom.GeoAreaFactory.makeGeoArea(GeoAreaFactory.java:43)
> {code}
> The situation is easy to reproduce with the following test:
> {code:java}
> public void testBBoxRandomDegenerate() {
>   double minX = Geo3DUtil.fromDegrees(GeoTestUtil.nextLongitude());
>   double maxX = Math.nextUp(minX + Vector.MINIMUM_ANGULAR_RESOLUTION);
>   double minY = Geo3DUtil.fromDegrees(GeoTestUtil.nextLatitude());
>   double maxY = Math.nextUp(minY + Vector.MINIMUM_ANGULAR_RESOLUTION);
>   assertNotNull(GeoAreaFactory.makeGeoArea(PlanetModel.SPHERE, maxY, minY, minX, maxX));
> }
> {code}
[jira] [Commented] (LUCENE-10508) GeoArea failure with degenerated latitude
[ https://issues.apache.org/jira/browse/LUCENE-10508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529854#comment-17529854 ] ASF subversion and git services commented on LUCENE-10508: -- Commit a0fc1921fde103749daa666b0641754de60718a0 in lucene's branch refs/heads/branch_9x from Ignacio Vera [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a0fc1921fde ] LUCENE-10508: Use MIN_WIDE_EXTENT for GeoWideDegenerateHorizontalLine (#855)
[GitHub] [lucene] dweiss commented on pull request #844: LUCENE-10539: Return a stream of completions from FSTCompletion.
dweiss commented on PR #844: URL: https://github.com/apache/lucene/pull/844#issuecomment-1113025766 I plan to commit this in if there are no objections.
[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
[ https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529875#comment-17529875 ] Adrien Grand commented on LUCENE-10544: --- Right, conjunctions of high-frequency clauses that have few documents in common have the same issue. In my opinion, a better solution that has less overhead and would still support cancelling such slow queries consists of leveraging {{BulkScorer#score}} to score small-ish ranges of doc IDs at a time. Long-term I'd like ExitableDirectoryReader and other tooling to handle cancellation/timeout to become mostly implementation details, and have proper support directly on IndexSearcher (LUCENE-10151). > Should ExitableTermsEnum wrap postings and impacts? > --- > > Key: LUCENE-10544 > URL: https://issues.apache.org/jira/browse/LUCENE-10544 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Reporter: Greg Miller >Priority: Major > > While looking into options for LUCENE-10151, I noticed that > {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you > start iterating postings/impacts. It *does* create a {{ExitableTermsEnum}} > wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do > anything to wrap postings or impacts. So timeouts will be enforced when > moving to the "next" term, but not when iterating the postings/impacts > associated with a term. > I think we ought to wrap the postings/impacts as well with some form of > timeout checking so timeouts can be enforced on long-running queries. I'm not > sure why this wasn't done originally (back in 2014), but it was questioned > back in 2020 on the original Jira SOLR-5986. Does anyone know of a good > reason why we shouldn't enforce timeouts in this way? > Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} > given that only {{next}} is being wrapped currently. 
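Adrien's suggestion - scoring small-ish ranges of doc IDs at a time and checking the clock between ranges - might look roughly like the sketch below. This is a hypothetical illustration: `RangeScorer` and the window size are stand-ins, while the real Lucene hook would be {{BulkScorer#score}} with its min/max doc arguments.

```java
// Hypothetical sketch of deadline checking between small scoring windows.
// Checking once per window (not once per document) keeps the overhead low
// even for high-frequency terms or dense conjunctions.
public class WindowedTimeoutSketch {
    interface RangeScorer {
        void score(int minDoc, int maxDoc); // scores docs in [minDoc, maxDoc)
    }

    static final int WINDOW_SIZE = 4096;

    /** Scores docs [0, maxDoc) in windows; returns the first unscored doc ID
     *  (== maxDoc if the deadline was never hit). */
    static int scoreWithDeadline(RangeScorer scorer, int maxDoc, long deadlineNanos) {
        int doc = 0;
        while (doc < maxDoc) {
            // Timeout check between windows: the caller can then report
            // partial results or throw a timeout exception.
            if (System.nanoTime() >= deadlineNanos) {
                return doc;
            }
            int windowEnd = (int) Math.min((long) doc + WINDOW_SIZE, maxDoc);
            scorer.score(doc, windowEnd);
            doc = windowEnd;
        }
        return doc;
    }
}
```

As Deepika notes below, a query supplying its own BulkScorer complicates this, since the searcher would need to drive that scorer in windows rather than in one call.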
[GitHub] [lucene] mikemccand commented on a diff in pull request #853: LUCENE-10524 Add benchmark suite details to CONTRIBUTING.md
mikemccand commented on code in PR #853: URL: https://github.com/apache/lucene/pull/853#discussion_r861733565

## CONTRIBUTING.md: ##
@@ -66,6 +66,16 @@ In case your contribution fixes a bug, please create a new test case that fails - *Eclipse* - Basic support ([help/IDEs.txt](https://github.com/apache/lucene/blob/main/help/IDEs.txt#L7)). - *Netbeans* - Not tested. +## Benchmarking + +Use the tool suite at [luceneutil](https://github.com/mikemccand/luceneutil) to benchmark your code changes +if you think that your change may have measurably changed the performance of a task.

Review Comment: Maybe add "These are the same benchmarks that run in [nightly benchmarks](https://home.apache.org/~mikemccand/lucenebench/)" or so?

## CONTRIBUTING.md: ##
@@ -66,6 +66,16 @@ In case your contribution fixes a bug, please create a new test case that fails - *Eclipse* - Basic support ([help/IDEs.txt](https://github.com/apache/lucene/blob/main/help/IDEs.txt#L7)). - *Netbeans* - Not tested. +## Benchmarking + +Use the tool suite at [luceneutil](https://github.com/mikemccand/luceneutil) to benchmark your code changes +if you think that your change may have measurably changed the performance of a task. + +The instructions for running the benchmarks can be found in the luceneutil [README](https://github.com/mikemccand/luceneutil/blob/master/README.md). + +The Lucene community is also interested in other implementations of these benchmark tasks. +Feel free to share your findings (especially if your implementation performs better!) through the [Lucene mailing lists](https://lucene.apache.org/core/discussion.html).

Review Comment: Maybe also suggest "...or open an issue or pr in the luceneutil project"?
[GitHub] [lucene] mocobeta commented on pull request #854: LUCENE-10545: allow to link github pr from changes
mocobeta commented on PR #854: URL: https://github.com/apache/lucene/pull/854#issuecomment-1113258104 @rmuir Thanks for confirming. I'd love to commit this in and simplify the contribution workflow. I plan to make a post to the dev list to let others know about this change.
[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
[ https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529965#comment-17529965 ] Deepika Sharma commented on LUCENE-10544: - Yeah, I think you’re right [~jpountz] about the BulkScorer#score. One edge case though would probably be if a user passes their own BulkScorer, in which case this approach might not work properly. I guess what we could do is to allow a user to use a custom BulkScorer, when timeout is enabled, but this might not be a desirable restriction.
[GitHub] [lucene] dweiss commented on pull request #856: gradle 7.3.3 quick upgrade
dweiss commented on PR #856: URL: https://github.com/apache/lucene/pull/856#issuecomment-1113307043 Thanks. This is a quick try to help Tobias Hartman who gets a very strange exception when running gradlew with (Oracle) Java 17. Doesn't help though. :( Uwe knows more. https://bugs.openjdk.java.net/browse/JDK-8285835 I'll try to reproduce with Oracle's JDK - so far no luck.
[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
[ https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1753#comment-1753 ] Greg Miller commented on LUCENE-10544: -- {quote}In my opinion, a better solution that has less overhead and would still support cancelling such slow queries consists of leveraging {{BulkScorer#score}} to score small-ish ranges of doc IDs at a time. {quote} +1. We've had success by implementing a "timeout enforcing" Query that does timeout enforcement within the Scorer it provides as a short-term solution, but there are a number of flaws with this approach. Hooking into the BulkScorer makes sense but does need some thought as [~dpsharma] mentions since Queries may (and do!) provide their own BulkScorers in some cases (e.g., {{{}BooleanScorer{}}}). {quote}Long-term I'd like ExitableDirectoryReader and other tooling to handle cancellation/timeout to become mostly implementation details, and have proper support directly on IndexSearcher (LUCENE-10151). {quote} +1. For full disclosure, [~dpsharma] and I work together at Amazon and she is working on LUCENE-10151. One idea is to use {{ExitableDirectoryReader}} as an internal implementation detail of {{IndexSearcher}} to add first-class timeout support. While we were debugging some prototype code, we ran into this issue with {{ExitableDirectoryReader}} and I thought it warranted a spin-off issue since it seems like something we might want to generally fix.
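The missing piece the issue describes - timeout checks during postings/impacts iteration, not just between terms - could be sketched as follows. This is hypothetical code: `DocIdIterator` and the sampling interval are stand-ins, whereas a real fix would wrap Lucene's {{PostingsEnum}} inside {{ExitableDirectoryReader}}.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch: a postings-like iterator wrapper that samples a timeout
// condition every few advances, mirroring what an ExitableTermsEnum-style
// wrapper around PostingsEnum could do for long posting lists.
public class TimeoutPostingsSketch {
    interface DocIdIterator {
        int NO_MORE_DOCS = Integer.MAX_VALUE;
        int nextDoc();
    }

    /** Wraps 'in', checking 'timedOut' once every 'sampleInterval' calls. */
    static DocIdIterator wrap(DocIdIterator in, BooleanSupplier timedOut, int sampleInterval) {
        return new DocIdIterator() {
            private int calls = 0;

            @Override
            public int nextDoc() {
                // Sampling keeps the per-document overhead small - the same
                // trick ExitableTermsEnum already uses for next().
                if (++calls % sampleInterval == 0 && timedOut.getAsBoolean()) {
                    throw new RuntimeException("query timed out during postings iteration");
                }
                return in.nextDoc();
            }
        };
    }
}
```

The same wrapping idea would apply to {{seekExact}} and {{seekCeil}}, which the issue notes are also unwrapped today.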
[GitHub] [lucene] rmuir commented on pull request #856: gradle 7.3.3 quick upgrade
rmuir commented on PR #856: URL: https://github.com/apache/lucene/pull/856#issuecomment-1113334476 yeah, i saw his struggles on the ticket. groovy strikes again. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren in RangeFacetCounts
gsmiller commented on PR #843: URL: https://github.com/apache/lucene/pull/843#issuecomment-1113335265 @Yuti-G thank you!
[jira] [Created] (LUCENE-10546) Update Faceting user guide
Greg Miller created LUCENE-10546: Summary: Update Faceting user guide Key: LUCENE-10546 URL: https://issues.apache.org/jira/browse/LUCENE-10546 Project: Lucene - Core Issue Type: Wish Components: modules/facet Reporter: Greg Miller The [facet user guide|https://lucene.apache.org/core/4_1_0/facet/org/apache/lucene/facet/doc-files/userguide.html] was written based on 4.1. Since there's been a fair amount of active facet-related development over the last year+, it would be nice to review the guide and see what updates make sense.
[GitHub] [lucene] janhoy commented on a diff in pull request #854: LUCENE-10545: allow to link github pr from changes
janhoy commented on code in PR #854: URL: https://github.com/apache/lucene/pull/854#discussion_r861861134 ## lucene/CHANGES.txt: ## @@ -53,7 +53,7 @@ Other * LUCENE-10253: The @BadApple annotation has been removed from the test framework. (Adrien Grand) -* LUCENE-10393: Unify binary dictionary and dictionary writer in Kuromoji and Nori. +* GITHUB-740: Unify binary dictionary and dictionary writer in Kuromoji and Nori. Review Comment: Interesting. In Solr we recently fixed this to work with the new repo as well, but there we only support the `GitHub #123` or `gh-123` forms. I kind of like the hash syntax, which further distinguishes it from JIRA keys (which would suggest bug number 740 in some ASF "github" project).
[jira] [Created] (LUCENE-10547) Implement "flattened" Facets#getTopChildren
Greg Miller created LUCENE-10547: Summary: Implement "flattened" Facets#getTopChildren Key: LUCENE-10547 URL: https://issues.apache.org/jira/browse/LUCENE-10547 Project: Lucene - Core Issue Type: New Feature Components: modules/facet Reporter: Greg Miller The current implementation of {{Facets#getTopChildren}} only considers the immediate children of the user-provided path. In many cases, this is probably what the user is looking for, but it would be useful to also have an implementation that considers any descendant of the path, regardless of "level." This would allow the user to build a deeper set of facet path options in "one shot," instead of having to iteratively call {{getTopChildren}}. Of course the shallower paths, and specifically the immediate children of the provided path, will always outweigh "deeper" paths due to counts/weights accumulating along the ancestry paths, but by providing a topN value larger than the number of immediate children, the user could build up a more complete view of path options in a taxonomy with a lot of depth.
[GitHub] [lucene] gsmiller commented on pull request #848: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations
gsmiller commented on PR #848: URL: https://github.com/apache/lucene/pull/848#issuecomment-1113476269 Since it doesn't look like there's any opposition to the approach I took in fixing this, and since this is a test case bug that is occasionally showing up and would be good to address soon, I'm going to go ahead and merge this fix. Happy to rework the solution if anyone has additional feedback. Thanks!
[GitHub] [lucene] gsmiller merged pull request #848: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations
gsmiller merged PR #848: URL: https://github.com/apache/lucene/pull/848
[jira] [Commented] (LUCENE-10530) TestTaxonomyFacetAssociations test failure
[ https://issues.apache.org/jira/browse/LUCENE-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530075#comment-17530075 ] ASF subversion and git services commented on LUCENE-10530: -- Commit 902a7df0e500f1e986918b9fd697f208bd04ef85 in lucene's branch refs/heads/main from Greg Miller [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=902a7df0e50 ] LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations (#848) > TestTaxonomyFacetAssociations test failure > -- > > Key: LUCENE-10530 > URL: https://issues.apache.org/jira/browse/LUCENE-10530 > Project: Lucene - Core > Issue Type: Bug >Reporter: Vigya Sharma >Priority: Major > Time Spent: 2h > Remaining Estimate: 0h > > TestTaxonomyFacetAssociations.testFloatAssociationRandom seems to have some > flakiness, it fails on the following random seed. > {code:java} > ./gradlew test --tests > TestTaxonomyFacetAssociations.testFloatAssociationRandom \ > -Dtests.seed=4DFBA8209AC82EB2 -Dtests.slow=true -Dtests.locale=fr-VU \ > -Dtests.timezone=Europe/Athens -Dtests.asserts=true > -Dtests.file.encoding=UTF-8 {code} > This is because of a mismatch in (SUM) aggregated multi-valued, > {{float_random}} facet field. We accept an error delta of 1 in this > aggregation, but for the failing random seed, the delta is 1.3. Maybe we > should change this delta to 1.5? > My hunch is that it is some floating point approximation error. I'm unable to > repro it without the randomization seed.
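The 1.3 delta the test hit is ordinary float-summation drift rather than a logic bug. As a quick self-contained illustration (toy numbers chosen for this sketch, not the test's actual data): summing many small values in 32-bit precision diverges from the 64-bit sum by far more than 1.0 once the running total is large relative to the addend.

```java
// Accumulate the same values in float vs. double. Once floatSum is large
// relative to the addend, each add rounds to the nearest representable float
// and the rounding bias compounds -- the same effect that makes a SUM over
// many multi-valued float associations miss a fixed error delta of 1.
float floatSum = 0f;
double doubleSum = 0d;
for (int i = 0; i < 1_000_000; i++) {
  floatSum += 0.1f;
  doubleSum += 0.1f; // 0.1f widens to ~0.10000000149 as a double
}
// The gap between the two accumulations grows well past 1.0.
double drift = Math.abs(doubleSum - (double) floatSum);
```

This is why a proportional error bound, or computing the expected value with double accumulation, is more robust than repeatedly bumping a fixed delta.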
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #842: LUCENE-10518: Relax field consistency check for old indices
mayya-sharipova commented on code in PR #842: URL: https://github.com/apache/lucene/pull/842#discussion_r861981910 ## lucene/core/src/java/org/apache/lucene/index/FieldInfos.java: ## @@ -178,7 +179,15 @@ public static FieldInfos getMergedFieldInfos(IndexReader reader) { .filter(Objects::nonNull) .findAny() .orElse(null); - final Builder builder = new Builder(new FieldNumbers(softDeletesField)); + final int indexCreatedVersionMajor = + leaves.stream() + .map(l -> l.reader().getMetaData()) + .filter(Objects::nonNull) + .mapToInt(r -> r.getCreatedVersionMajor()) + .min() + .orElse(Version.LATEST.major); Review Comment: I'm curious: when is it possible for segments to have different `getCreatedVersionMajor()`? Looking at the [code](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java#L1125) of `IndexWriter`, it looks like if we open an `IndexWriter` for append, new segments will always have the same `indexCreatedVersionMajor` as previous segments?
[GitHub] [lucene] gsmiller commented on pull request #849: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations
gsmiller commented on PR #849: URL: https://github.com/apache/lucene/pull/849#issuecomment-1113534664 Going ahead and merging since there didn't seem to be any major feedback on the approach and there are occasional test failures due to this. Happy to revisit the solution if anyone has further suggestions. Thanks!
[GitHub] [lucene] gsmiller merged pull request #849: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations
gsmiller merged PR #849: URL: https://github.com/apache/lucene/pull/849
[jira] [Commented] (LUCENE-10530) TestTaxonomyFacetAssociations test failure
[ https://issues.apache.org/jira/browse/LUCENE-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530133#comment-17530133 ] ASF subversion and git services commented on LUCENE-10530: -- Commit 92833d98fdbf85485f254e8e3514cf51e258d9b2 in lucene's branch refs/heads/branch_9x from Greg Miller [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=92833d98fdb ] LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations (#849) > TestTaxonomyFacetAssociations test failure > -- > > Key: LUCENE-10530 > URL: https://issues.apache.org/jira/browse/LUCENE-10530 > Project: Lucene - Core > Issue Type: Bug >Reporter: Vigya Sharma >Priority: Major > Time Spent: 2h 20m > Remaining Estimate: 0h > > TestTaxonomyFacetAssociations.testFloatAssociationRandom seems to have some > flakiness, it fails on the following random seed. > {code:java} > ./gradlew test --tests > TestTaxonomyFacetAssociations.testFloatAssociationRandom \ > -Dtests.seed=4DFBA8209AC82EB2 -Dtests.slow=true -Dtests.locale=fr-VU \ > -Dtests.timezone=Europe/Athens -Dtests.asserts=true > -Dtests.file.encoding=UTF-8 {code} > This is because of a mismatch in (SUM) aggregated multi-valued, > {{float_random}} facet field. We accept an error delta of 1 in this > aggregation, but for the failing random seed, the delta is 1.3. Maybe we > should change this delta to 1.5? > My hunch is that it is some floating point approximation error. I'm unable to > repro it without the randomization seed.
[jira] [Resolved] (LUCENE-10529) TestTaxonomyFacetAssociations may have floating point issues
[ https://issues.apache.org/jira/browse/LUCENE-10529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller resolved LUCENE-10529. -- Fix Version/s: 10.0 (main) 9.2 Resolution: Fixed > TestTaxonomyFacetAssociations may have floating point issues > > > Key: LUCENE-10529 > URL: https://issues.apache.org/jira/browse/LUCENE-10529 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > Fix For: 10.0 (main), 9.2 > > > Hit this in a jenkins CI build while testing something else: > {noformat} > gradlew test --tests TestTaxonomyFacetAssociations.testFloatAssociationRandom > -Dtests.seed=B39C450F4870F7F1 -Dtests.locale=ar-IQ > -Dtests.timezone=America/Rankin_Inlet -Dtests.asserts=true > -Dtests.file.encoding=UTF-8 > ... > org.apache.lucene.facet.taxonomy.TestTaxonomyFacetAssociations > > testFloatAssociationRandom FAILED > java.lang.AssertionError: expected:<2605996.5> but was:<2605995.2> > {noformat}
[jira] [Resolved] (LUCENE-10530) TestTaxonomyFacetAssociations test failure
[ https://issues.apache.org/jira/browse/LUCENE-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller resolved LUCENE-10530. -- Fix Version/s: 10.0 (main) 9.2 Resolution: Fixed > TestTaxonomyFacetAssociations test failure > -- > > Key: LUCENE-10530 > URL: https://issues.apache.org/jira/browse/LUCENE-10530 > Project: Lucene - Core > Issue Type: Bug >Reporter: Vigya Sharma >Priority: Major > Fix For: 10.0 (main), 9.2 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > TestTaxonomyFacetAssociations.testFloatAssociationRandom seems to have some > flakiness, it fails on the following random seed. > {code:java} > ./gradlew test --tests > TestTaxonomyFacetAssociations.testFloatAssociationRandom \ > -Dtests.seed=4DFBA8209AC82EB2 -Dtests.slow=true -Dtests.locale=fr-VU \ > -Dtests.timezone=Europe/Athens -Dtests.asserts=true > -Dtests.file.encoding=UTF-8 {code} > This is because of a mismatch in (SUM) aggregated multi-valued, > {{float_random}} facet field. We accept an error delta of 1 in this > aggregation, but for the failing random seed, the delta is 1.3. Maybe we > should change this delta to 1.5? > My hunch is that it is some floating point approximation error. I'm unable to > repro it without the randomization seed.
[GitHub] [lucene] msokolov commented on a diff in pull request #854: LUCENE-10545: allow to link github pr from changes
msokolov commented on code in PR #854: URL: https://github.com/apache/lucene/pull/854#discussion_r861995439 ## gradle/documentation/changes-to-html/changes2html.pl: ## @@ -572,7 +572,7 @@ $item =~ s{((LUCENE|SOLR|INFRA)\s+(\d{3,}))} {$1}gi; # Link "[ github | gh ] pull request [ # ] X+" to Github pull request - $item =~ s{((?:(?:(?:github|gh)\s+)?pull\s+request\s*(?:\#?\s*)?|gh-)(\d+))} + $item =~ s{((?:(?:(?:github|gh)\s+)?pull\s+request\s*(?:\#?\s*)?|gh-|github-)(\d+))} Review Comment: it seems from your examples this is case-insensitive (ie GITHUB works) but I don't see how - the regex has no `i`?
[jira] [Updated] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval
[ https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Risden updated LUCENE-10542: -- Description: While looking at LUCENE-10534, found that *FieldSource exists implementation after LUCENE-7407 can avoid value lookup when just checking for exists. Flamegraphs - x axis = time spent as a percentage of time being profiled, y axis = stack trace bottom being first call top being last call Looking only at the left most getValueForDoc highlight only (and it helps to make it bigger or download the original) !flamegraph_getValueForDoc.png|height=410,width=1000! LongFieldSource#exists spends MOST of its time doing a LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its time doing two things primarily: * FilterNumericDocValues#longValue() * advance() This makes sense based on looking at the code (copied below to make it easier to see at once) https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72 {code:java} private long getValueForDoc(int doc) throws IOException { if (doc < lastDocID) { throw new IllegalArgumentException( "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc); } lastDocID = doc; int curDocID = arr.docID(); if (doc > curDocID) { curDocID = arr.advance(doc); } if (doc == curDocID) { return arr.longValue(); } else { return 0; } } {code} LongFieldSource#exists - doesn't care about the actual longValue. Just that there was a value found when iterating through the doc values. 
https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95 {code:java} @Override public boolean exists(int doc) throws IOException { getValueForDoc(doc); return arr.docID() == doc; } {code} So putting this all together for exists calling getValueForDoc, we spent ~50% of the time trying to get the long value when we don't need it in exists. We can save that 50% of time making exists not care about the actual value and just return if doc == curDocID basically. This 50% extra is exaggerated in MaxFloatFunction (and other places) since exists() is being called a bunch. Eventually the value will be needed from longVal(), but if we call exists() say 3 times for every longVal(), we are spending a lot of time computing the value when we only need to check for existence. I found the same pattern in DoubleFieldSource, EnumFieldSource, FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change showing what this would look like: Simple JMH performance tests comparing the original FloatFieldSource to the new ones from PR #847. 
| Benchmark | Mode | Cnt | Score and Error | Units | |-|---|-|--|---| | MyBenchmark.testMaxFloatFunction| thrpt | 25 | 64.159 ± 2.031 | ops/s | | MyBenchmark.testNewMaxFloatFunction | thrpt | 25 | 94.997 ± 2.365 | ops/s | | MyBenchmark.testMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 123.191 ± 9.291 | ops/s | | MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 123.817 ± 6.191 | ops/s | | MyBenchmark.testMaxFloatFunctionRareField | thrpt | 25 | 244.921 ± 6.439 | ops/s | | MyBenchmark.testNewMaxFloatFunctionRareField| thrpt | 25 | 239.288 ± 5.136 | ops/s | | MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField| thrpt | 25 | 271.521 ± 3.870 | ops/s | | MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSourceRareField | thrpt | 25 | 279.334 ± 10.511 | ops/s | Source: https://github.com/risdenk/lucene-jmh was: While looking at LUCENE-10534, found that *FieldSource exists implementation after LUCENE-7407 can avoid value lookup when just checking for exists. Flamegraphs - x axis = time spent as a percentage of time being profiled, y axis = stack trace bottom being first call top being last call Looking only at the left most getValueForDoc highlight only (and it helps to make it bigger or download the original) !flamegraph_getValueForDoc.png|height=410,width=1000! LongFieldSource#exists spends MOST of its time doing a LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its time doing two things primarily: * FilterNumericDocValues#longValue() * advance() This makes sense based on looking at the code (copied below to make it easier to see at once) https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/q
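A minimal sketch of the proposed change, using a simplified stand-in for the doc-values iterator (a hypothetical `SparseLongValues` class, not Lucene's `NumericDocValues`): `exists()` only needs the forward advance and never the value decode, which is the ~50% saving described above.

```java
// Toy sparse doc-values iterator: docsWithValue is ascending, and longValue()
// stands in for the "expensive" value decode that exists() can skip.
// Forward-only access, like the real doc-values API.
class SparseLongValues {
  private final int[] docsWithValue;
  private final long[] values;
  private int idx = -1;
  private int doc = -1;
  int valueDecodes = 0; // instrumented so the saving is observable

  SparseLongValues(int[] docsWithValue, long[] values) {
    this.docsWithValue = docsWithValue;
    this.values = values;
  }

  int docID() { return doc; }

  int advance(int target) {
    idx++;
    while (idx < docsWithValue.length && docsWithValue[idx] < target) idx++;
    doc = idx < docsWithValue.length ? docsWithValue[idx] : Integer.MAX_VALUE;
    return doc;
  }

  long longValue() {
    valueDecodes++; // the cost exists() should not pay
    return values[idx];
  }

  // Proposed exists(): advance if needed, but never call longValue().
  boolean exists(int target) {
    int cur = doc;
    if (target > cur) cur = advance(target);
    return target == cur;
  }
}
```

The old pattern (`getValueForDoc(doc); return arr.docID() == doc;`) would bump `valueDecodes` on every matching doc even when the caller only asked about existence.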
[jira] [Comment Edited] (LUCENE-10534) MinFloatFunction / MaxFloatFunction exists check can be slow
[ https://issues.apache.org/jira/browse/LUCENE-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529572#comment-17529572 ] Kevin Risden edited comment on LUCENE-10534 at 4/29/22 5:15 PM: Updated metrics - there is a benefit to the new maxfloatfunction logic to avoid duplicate exists() even with using the new fieldsource stuff in LUCENE-10542. | Benchmark | Mode | Cnt | Score and Error | Units | |-|---|-|--|---| | MyBenchmark.testMaxFloatFunction| thrpt | 25 | 64.159 ± 2.031 | ops/s | | MyBenchmark.testNewMaxFloatFunction | thrpt | 25 | 94.997 ± 2.365 | ops/s | | MyBenchmark.testMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 123.191 ± 9.291 | ops/s | | MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 123.817 ± 6.191 | ops/s | | MyBenchmark.testMaxFloatFunctionRareField | thrpt | 25 | 244.921 ± 6.439 | ops/s | | MyBenchmark.testNewMaxFloatFunctionRareField| thrpt | 25 | 239.288 ± 5.136 | ops/s | | MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField| thrpt | 25 | 271.521 ± 3.870 | ops/s | | MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSourceRareField | thrpt | 25 | 279.334 ± 10.511 | ops/s | was (Author: risdenk): Updated metrics - there is a benefit to the new maxfloatfunction logic to avoid duplicate exists() even with using the new fieldsource stuff in LUCENE-10542. 
| Benchmark | Mode | Cnt | Score and Error | Units | |-|---|-|--|---| | MyBenchmark.testMaxFloatFunction| thrpt | 25 | 69.949 ± 4.043 | ops/s | | MyBenchmark.testMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 112.326 ± 3.228 | ops/s | | MyBenchmark.testNewMaxFloatFunction | thrpt | 25 | 93.216 ± 2.757 | ops/s | | MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 123.364 ± 7.861 | ops/s | | MyBenchmark.testMaxFloatFunctionRareField | thrpt | 25 | 257.339 ± 33.849 | ops/s | | MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField| thrpt | 25 | 287.175 ± 22.840 | ops/s | | MyBenchmark.testNewMaxFloatFunctionRareField| thrpt | 25 | 235.268 ± 4.103 | ops/s | | MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSourceRareField | thrpt | 25 | 272.397 ± 8.406 | ops/s | > MinFloatFunction / MaxFloatFunction exists check can be slow > > > Key: LUCENE-10534 > URL: https://issues.apache.org/jira/browse/LUCENE-10534 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Kevin Risden >Assignee: Kevin Risden >Priority: Minor > Attachments: flamegraph.png, flamegraph_getValueForDoc.png > > Time Spent: 1h 10m > Remaining Estimate: 0h > > MinFloatFunction > (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java) > and MaxFloatFunction > (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java) > both check if values exist. This is needed since the underlying valuesource > returns 0.0f as either a valid value or as a value when the document doesn't > have a value. > Even though this is changed to anyExists and short circuits in the case a > value is found in any document, the worst case is that there is no value > found and requires checking all the way through to the raw data. This is only > needed when 0.0f is returned and need to determine if it is a valid value or > the not found case. 
[jira] [Commented] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval
[ https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530141#comment-17530141 ] Kevin Risden commented on LUCENE-10542: --- Thanks [~rcmuir] thats a good idea. > FieldSource exists implementation can avoid value retrieval > --- > > Key: LUCENE-10542 > URL: https://issues.apache.org/jira/browse/LUCENE-10542 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Kevin Risden >Assignee: Kevin Risden >Priority: Minor > Attachments: flamegraph_getValueForDoc.png > > Time Spent: 20m > Remaining Estimate: 0h > > While looking at LUCENE-10534, found that *FieldSource exists implementation > after LUCENE-7407 can avoid value lookup when just checking for exists. > Flamegraphs - x axis = time spent as a percentage of time being profiled, y > axis = stack trace bottom being first call top being last call > Looking only at the left most getValueForDoc highlight only (and it helps to > make it bigger or download the original) > !flamegraph_getValueForDoc.png|height=410,width=1000! > LongFieldSource#exists spends MOST of its time doing a > LongFieldSource#getValueForDoc. 
LongFieldSource#getValueForDoc spends its > time doing two things primarily: > * FilterNumericDocValues#longValue() > * advance() > This makes sense based on looking at the code (copied below to make it easier > to see at once) > https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72 > {code:java} > private long getValueForDoc(int doc) throws IOException { > if (doc < lastDocID) { > throw new IllegalArgumentException( > "docs were sent out-of-order: lastDocID=" + lastDocID + " vs > docID=" + doc); > } > lastDocID = doc; > int curDocID = arr.docID(); > if (doc > curDocID) { > curDocID = arr.advance(doc); > } > if (doc == curDocID) { > return arr.longValue(); > } else { > return 0; > } > } > {code} > LongFieldSource#exists - doesn't care about the actual longValue. Just that > there was a value found when iterating through the doc values. > https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95 > {code:java} > @Override > public boolean exists(int doc) throws IOException { > getValueForDoc(doc); > return arr.docID() == doc; > } > {code} > So putting this all together for exists calling getValueForDoc, we spent ~50% > of the time trying to get the long value when we don't need it in exists. We > can save that 50% of time making exists not care about the actual value and > just return if doc == curDocID basically. > This 50% extra is exaggerated in MaxFloatFunction (and other places) since > exists() is being called a bunch. Eventually the value will be needed from > longVal(), but if we call exists() say 3 times for every longVal(), we are > spending a lot of time computing the value when we only need to check for > existence. > I found the same pattern in DoubleFieldSource, EnumFieldSource, > FloatFieldSource, IntFieldSource, LongFieldSource. 
I put together a change > showing what this would look like: > > Simple JMH performance tests comparing the original FloatFieldSource to the > new ones from PR #847. > > | Benchmark | Mode | > Cnt | Score and Error | Units | > |-|---|-|--|---| > | MyBenchmark.testMaxFloatFunction| thrpt | > 25 | 64.159 ± 2.031 | ops/s | > | MyBenchmark.testNewMaxFloatFunction | thrpt | > 25 | 94.997 ± 2.365 | ops/s | > | MyBenchmark.testMaxFloatFunctionNewFloatFieldSource | thrpt | > 25 | 123.191 ± 9.291 | ops/s | > | MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSource | thrpt | > 25 | 123.817 ± 6.191 | ops/s | > | MyBenchmark.testMaxFloatFunctionRareField | thrpt | > 25 | 244.921 ± 6.439 | ops/s | > | MyBenchmark.testNewMaxFloatFunctionRareField| thrpt | > 25 | 239.288 ± 5.136 | ops/s | > | MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField| thrpt | > 25 | 271.521 ± 3.870 | ops/s | > | MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSourceRareField | thrpt | > 25 | 279.334 ± 10.511 | ops/s | > Source: https://github.com/risdenk/lucene-jmh
[jira] [Commented] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?
[ https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530144#comment-17530144 ] Michael Sokolov commented on LUCENE-10541: -- I think the probability of choosing a particular item is based on the length of the item just prior (ignoring the skip points, which I don't understand what Mike is talking about!). But if there is no correlation among length(line N) and length(line N+1) we could probably ignore that. In other words, the item following the longest line L is the most likely item to be chosen. However its expected length is no different from the expected length of all the lines, right? In which case I don't think the seek-and-scan method changes the probabilities at all. So I think we can simply look at the number of lines of a given length (or above some threshold) and divide by the total number of lines to get the P(line length). > What to do about massive terms in our Wikipedia EN LineFileDocs? > > > Key: LUCENE-10541 > URL: https://issues.apache.org/jira/browse/LUCENE-10541 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > Time Spent: 3h 20m > Remaining Estimate: 0h > > Spinoff from this fun build failure that [~dweiss] root caused: > [https://lucene.markmail.org/thread/pculfuazll4oebra] > Thank you and sorry [~dweiss]!! > This test failure happened because the test case randomly indexed a chunk of > the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's > ~32 KB limit) term, and IW threw an {{IllegalArgumentException}} failing the > test. > It's crazy that it took so long for Lucene's randomized tests to discover > this too-massive term in Lucene's nightly benchmarks. It's like searching > for Nessie, or > [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence]. 
> We need to prevent such false failures, somehow, and there are multiple options: fix this test to not use {{LineFileDocs}}, remove all "massive" terms from all tests (nightly and git) {{LineFileDocs}}, fix {{MockTokenizer}} to trim such ridiculous terms (I think this is the best option?), ...

-- This message was sent by Atlassian Jira (v8.20.7#820007)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on a diff in pull request #844: LUCENE-10539: Return a stream of completions from FSTCompletion.
msokolov commented on code in PR #844: URL: https://github.com/apache/lucene/pull/844#discussion_r862001926

## lucene/suggest/src/java/org/apache/lucene/search/suggest/fst/FSTCompletion.java:

@@ -184,110 +195,174 @@ public List lookup(CharSequence key, int num) {
       return EMPTY_RESULT;
     }
-    try {
-      BytesRef keyUtf8 = new BytesRef(key);
-      if (!higherWeightsFirst && rootArcs.length > 1) {
-        // We could emit a warning here (?). An optimal strategy for
-        // alphabetically sorted
-        // suggestions would be to add them with a constant weight -- this saves
-        // unnecessary
-        // traversals and sorting.
-        return lookupSortedAlphabetically(keyUtf8, num);
-      } else {
-        return lookupSortedByWeight(keyUtf8, num, false);
-      }
-    } catch (IOException e) {
-      // Should never happen, but anyway.
-      throw new RuntimeException(e);
+    if (!higherWeightsFirst && rootArcs.length > 1) {
+      // We could emit a warning here (?). An optimal strategy for
+      // alphabetically sorted
+      // suggestions would be to add them with a constant weight -- this saves
+      // unnecessary
+      // traversals and sorting.
+      return lookup(key).sorted().limit(num).collect(Collectors.toList());
+    } else {
+      return lookup(key).limit(num).collect(Collectors.toList());
     }
   }

   /**
-   * Lookup suggestions sorted alphabetically if weights are not constant. This is a
-   * workaround: in general, use constant weights for alphabetically sorted result.
+   * Lookup suggestions to key and return a stream of matching completions. The stream
+   * fetches completions dynamically - it can be filtered and limited to acquire the desired number
+   * of completions without collecting all of them.
+   *
+   * @param key The prefix to which suggestions should be sought.
+   * @return Returns the suggestions
    */
-  private List lookupSortedAlphabetically(BytesRef key, int num) throws IOException {
-    // Greedily get num results from each weight branch.
-    List res = lookupSortedByWeight(key, num, true);
-
-    // Sort and trim.
-    Collections.sort(res);
-    if (res.size() > num) {
-      res = res.subList(0, num);
+  public Stream lookup(CharSequence key) {
+    if (key.length() == 0 || automaton == null) {
+      return Stream.empty();
+    }
+
+    try {
+      return lookupSortedByWeight(new BytesRef(key));
+    } catch (IOException e) {
+      throw new RuntimeException(e);
     }
-    return res;
   }

-  /**
-   * Lookup suggestions sorted by weight (descending order).
-   *
-   * @param collectAll If true, the routine terminates immediately when num
-   *     suggestions have been collected. If false, it will collect suggestions
-   *     from all weight arcs (needed for {@link #lookupSortedAlphabetically}.
-   */
-  private ArrayList lookupSortedByWeight(BytesRef key, int num, boolean collectAll)
-      throws IOException {
-    // Don't overallocate the results buffers. This also serves the purpose of
-    // allowing the user of this class to request all matches using Integer.MAX_VALUE as
-    // the number of results.
-    final ArrayList res = new ArrayList<>(Math.min(10, num));
-
-    final BytesRef output = BytesRef.deepCopyOf(key);
-    for (int i = 0; i < rootArcs.length; i++) {
-      final FST.Arc rootArc = rootArcs[i];
-      final FST.Arc arc = new FST.Arc<>().copyFrom(rootArc);
-
-      // Descend into the automaton using the key as prefix.
-      if (descendWithPrefix(arc, key)) {
-        // A subgraph starting from the current node has the completions
-        // of the key prefix. The arc we're at is the last key's byte,
-        // so we will collect it too.
-        output.length = key.length - 1;
-        if (collect(res, num, rootArc.label(), output, arc) && !collectAll) {
-          // We have enough suggestions to return immediately. Keep on looking
-          // for an
-          // exact match, if requested.
-          if (exactFirst) {
-            if (!checkExistingAndReorder(res, key)) {
-              int exactMatchBucket = getExactMatchStartingFromRootArc(i, key);
-              if (exactMatchBucket != -1) {
-                // Insert as the first result and truncate at num.
-                while (res.size() >= num) {
-                  res.remove(res.size() - 1);
-                }
-                res.add(0, new Completion(key, exactMatchBucket));
-              }
-            }
-          }
+  /** Lookup suggestions sorted by weight (descending order). */
+  private Stream lookupSortedByWeight(BytesRef key) throws IOException {
+
+    // Look for an exact match first.
+    Completion exactCompletion;
+    if (exactFirst) {
+      Completion c = null;
+      for (int i = 0; i < rootArcs.length; i++) {
+        int exactMatchBucket = getExactMatchStartingFromRootArc(i, key);
+        if (exactMatchBucket != -1) {
+          // root arcs are sorted by decr
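The motivation for the change reviewed above is that a lazy Stream composes with filter/limit without guessing a result count up front. A self-contained stand-in in plain Java (Completion and lookup here are simplified mock-ups, not FSTCompletion's real types):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StreamLookupSketch {
  record Completion(String utf8, int bucket) {}

  /** Stand-in for FSTCompletion#lookup(CharSequence): a lazy stream of prefix matches. */
  static Stream<Completion> lookup(List<Completion> all, String prefix) {
    return all.stream().filter(c -> c.utf8().startsWith(prefix));
  }

  public static void main(String[] args) {
    List<Completion> data =
        List.of(new Completion("run", 3), new Completion("running", 2),
                new Completion("runner", 1), new Completion("rust", 5));

    // Secondary condition plus a limit, applied lazily; no "num" to guess up front.
    List<String> result = lookup(data, "run")
        .filter(c -> c.utf8().contains("nn")) // e.g. keep only completions with an infix
        .limit(1)
        .map(Completion::utf8)
        .collect(Collectors.toList());
    System.out.println(result); // prints [running]
  }
}
```

With a fixed-size `lookup(key, num)` API, the caller cannot pick `num` when a post-filter discards an unknown fraction of the matches; the stream form sidesteps that entirely.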
[GitHub] [lucene-solr] magibney commented on pull request #989: SOLR-12457: improve compatibility/support for sort by field-function
magibney commented on PR #989: URL: https://github.com/apache/lucene-solr/pull/989#issuecomment-1113580029 Superseded by: https://github.com/apache/solr/pull/828 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] magibney closed pull request #989: SOLR-12457: improve compatibility/support for sort by field-function
magibney closed pull request #989: SOLR-12457: improve compatibility/support for sort by field-function URL: https://github.com/apache/lucene-solr/pull/989
[jira] [Updated] (LUCENE-10534) MinFloatFunction / MaxFloatFunction calls exists twice
[ https://issues.apache.org/jira/browse/LUCENE-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Risden updated LUCENE-10534: -- Summary: MinFloatFunction / MaxFloatFunction calls exists twice (was: MinFloatFunction / MaxFloatFunction exists check can be slow) > MinFloatFunction / MaxFloatFunction calls exists twice > -- > > Key: LUCENE-10534 > URL: https://issues.apache.org/jira/browse/LUCENE-10534 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Kevin Risden >Assignee: Kevin Risden >Priority: Minor > Attachments: flamegraph.png, flamegraph_getValueForDoc.png > > Time Spent: 1h 10m > Remaining Estimate: 0h > > MinFloatFunction > (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java) > and MaxFloatFunction > (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java) > both check if values exist twice. This change prevents the duplicate exists > check. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10534) MinFloatFunction / MaxFloatFunction exists check can be slow
[ https://issues.apache.org/jira/browse/LUCENE-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Risden updated LUCENE-10534: -- Description: MinFloatFunction (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java) and MaxFloatFunction (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java) both check if values exist twice. This change prevents the duplicate exists check. (was: MinFloatFunction (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java) and MaxFloatFunction (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java) both check if values exist. This is needed since the underlying valuesource returns 0.0f as either a valid value or as a value when the document doesn't have a value. Even though this is changed to anyExists and short circuits in the case a value is found in any document, the worst case is that there is no value found and requires checking all the way through to the raw data. This is only needed when 0.0f is returned and need to determine if it is a valid value or the not found case.) 
> MinFloatFunction / MaxFloatFunction exists check can be slow > > > Key: LUCENE-10534 > URL: https://issues.apache.org/jira/browse/LUCENE-10534 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Kevin Risden >Assignee: Kevin Risden >Priority: Minor > Attachments: flamegraph.png, flamegraph_getValueForDoc.png > > Time Spent: 1h 10m > Remaining Estimate: 0h > > MinFloatFunction > (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java) > and MaxFloatFunction > (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java) > both check if values exist twice. This change prevents the duplicate exists > check. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
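The double existence check described above can be illustrated with a stand-in value source (illustrative names, not Lucene's actual FunctionValues API). Because floatVal() returns 0.0f for both a stored 0.0f and a missing value, the naive max guards with exists() and then pays a second, hidden existence check inside retrieval; resolving the value and its presence in one lookup avoids that:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class ExistsOnceSketch {
  /** Stand-in for a per-document float source where 0.0f doubles as "missing". */
  static class FloatSource {
    final Map<Integer, Float> values = new HashMap<>();
    final AtomicInteger existsCalls = new AtomicInteger();

    boolean exists(int doc) {
      existsCalls.incrementAndGet();
      return values.containsKey(doc);
    }

    float floatVal(int doc) {
      // Mirrors the pattern in the issue: retrieval re-checks existence.
      return exists(doc) ? values.get(doc) : 0.0f;
    }

    /** Single-pass variant: one lookup resolves both existence and value. */
    Float floatValOrNull(int doc) {
      return values.get(doc);
    }
  }

  /** Naive shape: an explicit exists() guard, then a second check inside floatVal(). */
  static float maxNaive(FloatSource a, FloatSource b, int doc) {
    if (!a.exists(doc) && !b.exists(doc)) return 0.0f;
    return Math.max(a.floatVal(doc), b.floatVal(doc));
  }

  /** Fixed shape: existence and value come from a single lookup per source. */
  static float maxSinglePass(FloatSource a, FloatSource b, int doc) {
    Float va = a.floatValOrNull(doc);
    Float vb = b.floatValOrNull(doc);
    if (va == null && vb == null) return 0.0f;
    return Math.max(va == null ? -Float.MAX_VALUE : va,
                    vb == null ? -Float.MAX_VALUE : vb);
  }

  public static void main(String[] args) {
    FloatSource a = new FloatSource();
    FloatSource b = new FloatSource();
    a.values.put(0, 3.5f);
    System.out.println(maxNaive(a, b, 0));   // 3.5, but a.existsCalls is now 2
    a.existsCalls.set(0);
    System.out.println(maxSinglePass(a, b, 0)); // 3.5 with zero exists() calls on a
  }
}
```

The cost matters most in the issue's worst case: when a field is missing, every existence probe has to scan all the way to the raw data, so doing it twice per document doubles that work.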
[jira] [Updated] (LUCENE-10534) MinFloatFunction / MaxFloatFunction calls exists twice
[ https://issues.apache.org/jira/browse/LUCENE-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Risden updated LUCENE-10534:
----------------------------------
Description:

MinFloatFunction (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java) and MaxFloatFunction (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java) both check if values exist twice. This change prevents the duplicate exists check.

Tested with JMH here: https://github.com/risdenk/lucene-jmh

| Benchmark | Mode | Cnt | Score and Error | Units |
|---|---|---|---|---|
| MyBenchmark.testMaxFloatFunction | thrpt | 25 | 64.159 ± 2.031 | ops/s |
| MyBenchmark.testNewMaxFloatFunction | thrpt | 25 | 94.997 ± 2.365 | ops/s |
| MyBenchmark.testMaxFloatFunctionRareField | thrpt | 25 | 244.921 ± 6.439 | ops/s |
| MyBenchmark.testNewMaxFloatFunctionRareField | thrpt | 25 | 239.288 ± 5.136 | ops/s |

was: MinFloatFunction (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java) and MaxFloatFunction (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java) both check if values exist twice. This change prevents the duplicate exists check.
> MinFloatFunction / MaxFloatFunction calls exists twice
> ------------------------------------------------------
>
> Key: LUCENE-10534
> URL: https://issues.apache.org/jira/browse/LUCENE-10534
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Kevin Risden
> Assignee: Kevin Risden
> Priority: Minor
> Attachments: flamegraph.png, flamegraph_getValueForDoc.png
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> MinFloatFunction (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java) and MaxFloatFunction (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java) both check if values exist twice. This change prevents the duplicate exists check.
> Tested with JMH here: https://github.com/risdenk/lucene-jmh
>
> | Benchmark | Mode | Cnt | Score and Error | Units |
> |---|---|---|---|---|
> | MyBenchmark.testMaxFloatFunction | thrpt | 25 | 64.159 ± 2.031 | ops/s |
> | MyBenchmark.testNewMaxFloatFunction | thrpt | 25 | 94.997 ± 2.365 | ops/s |
> | MyBenchmark.testMaxFloatFunctionRareField | thrpt | 25 | 244.921 ± 6.439 | ops/s |
> | MyBenchmark.testNewMaxFloatFunctionRareField | thrpt | 25 | 239.288 ± 5.136 | ops/s |

-- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (LUCENE-10548) Weird errors launching gradlew (Caused by: groovy.lang.MissingMethodException: No signature of method: java.lang.Object.clone() is applicable for argument types: () val
Dawid Weiss created LUCENE-10548:
------------------------------------
Summary: Weird errors launching gradlew (Caused by: groovy.lang.MissingMethodException: No signature of method: java.lang.Object.clone() is applicable for argument types: () values: [])
Key: LUCENE-10548
URL: https://issues.apache.org/jira/browse/LUCENE-10548
Project: Lucene - Core
Issue Type: Bug
Reporter: Dawid Weiss

https://bugs.openjdk.java.net/browse/JDK-8285835

I can't reproduce it anywhere, with the same JDK Tobias is using. Seems like clone() is the cause - let's see if we can just get rid of that code and if it helps.

-- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (LUCENE-10549) Upgrade to gradle 7.3.3
Dawid Weiss created LUCENE-10549:
------------------------------------
Summary: Upgrade to gradle 7.3.3
Key: LUCENE-10549
URL: https://issues.apache.org/jira/browse/LUCENE-10549
Project: Lucene - Core
Issue Type: Improvement
Reporter: Dawid Weiss

There are newer gradle versions but this is a low-hanging fruit that has official support for Java 17.

-- This message was sent by Atlassian Jira (v8.20.7#820007)
[GitHub] [lucene] dweiss merged pull request #856: LUCENE-10549: upgrade gradle to 7.3.3
dweiss merged PR #856: URL: https://github.com/apache/lucene/pull/856
[jira] [Resolved] (LUCENE-10549) Upgrade to gradle 7.3.3
[ https://issues.apache.org/jira/browse/LUCENE-10549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss resolved LUCENE-10549.
----------------------------------
Fix Version/s: 9.2
Resolution: Fixed

> Upgrade to gradle 7.3.3
> -----------------------
>
> Key: LUCENE-10549
> URL: https://issues.apache.org/jira/browse/LUCENE-10549
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Dawid Weiss
> Priority: Major
> Fix For: 9.2
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> There are newer gradle versions but this is a low-hanging fruit that has official support for Java 17.

-- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (LUCENE-10549) Upgrade to gradle 7.3.3
[ https://issues.apache.org/jira/browse/LUCENE-10549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss updated LUCENE-10549:
---------------------------------
Fix Version/s: 9.1.1

> Upgrade to gradle 7.3.3
> -----------------------
>
> Key: LUCENE-10549
> URL: https://issues.apache.org/jira/browse/LUCENE-10549
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Dawid Weiss
> Priority: Major
> Fix For: 10.0 (main), 9.2, 9.1.1
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> There are newer gradle versions but this is a low-hanging fruit that has official support for Java 17.

-- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (LUCENE-10549) Upgrade to gradle 7.3.3
[ https://issues.apache.org/jira/browse/LUCENE-10549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss updated LUCENE-10549:
---------------------------------
Fix Version/s: 10.0 (main)

> Upgrade to gradle 7.3.3
> -----------------------
>
> Key: LUCENE-10549
> URL: https://issues.apache.org/jira/browse/LUCENE-10549
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Dawid Weiss
> Priority: Major
> Fix For: 10.0 (main), 9.2
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> There are newer gradle versions but this is a low-hanging fruit that has official support for Java 17.

-- This message was sent by Atlassian Jira (v8.20.7#820007)
[GitHub] [lucene] dweiss commented on pull request #857: LUCENE-10548: Weird errors launching gradlew
dweiss commented on PR #857: URL: https://github.com/apache/lucene/pull/857#issuecomment-1113654041

I've moved this class to buildSrc so that it's precompiled before the scripts and removed the Cloneable interface.
[jira] [Commented] (LUCENE-10548) Weird errors launching gradlew (Caused by: groovy.lang.MissingMethodException: No signature of method: java.lang.Object.clone() is applicable for argument types: () v
[ https://issues.apache.org/jira/browse/LUCENE-10548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530204#comment-17530204 ]

Dawid Weiss commented on LUCENE-10548:
--------------------------------------
A PR is here (and on the source branch in my repo): https://github.com/apache/lucene/pull/857

> Weird errors launching gradlew (Caused by: groovy.lang.MissingMethodException: No signature of method: java.lang.Object.clone() is applicable for argument types: () values: [])
> --------------------------------------------------------------------------------
>
> Key: LUCENE-10548
> URL: https://issues.apache.org/jira/browse/LUCENE-10548
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Dawid Weiss
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> https://bugs.openjdk.java.net/browse/JDK-8285835
> I can't reproduce it anywhere, with the same JDK Tobias is using. Seems like clone() is the cause - let's see if we can just get rid of that code and if it helps.

-- This message was sent by Atlassian Jira (v8.20.7#820007)
[GitHub] [lucene] dweiss merged pull request #844: LUCENE-10539: Return a stream of completions from FSTCompletion.
dweiss merged PR #844: URL: https://github.com/apache/lucene/pull/844
[GitHub] [lucene] dweiss commented on pull request #844: LUCENE-10539: Return a stream of completions from FSTCompletion.
dweiss commented on PR #844: URL: https://github.com/apache/lucene/pull/844#issuecomment-1113657203

Thanks @msokolov !
[jira] [Commented] (LUCENE-10539) Return a stream of completions from FSTCompletion
[ https://issues.apache.org/jira/browse/LUCENE-10539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530209#comment-17530209 ] ASF subversion and git services commented on LUCENE-10539: -- Commit d0ce05888d62eadf4e1172339338a389b37dad41 in lucene's branch refs/heads/branch_9x from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d0ce05888d6 ] LUCENE-10539: Return a stream of completions from FSTCompletion. (#844) > Return a stream of completions from FSTCompletion > - > > Key: LUCENE-10539 > URL: https://issues.apache.org/jira/browse/LUCENE-10539 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Minor > Fix For: 9.2 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > FSTLookup currently has a "num" parameter which limits the number of > completions from the underlying automaton. But this has severe disadvantages > if you need to collect completions that need to fulfill a secondary condition > (for example, collect only verbs or terms that contain a certain infix). Then > you can't determine the 'num' parameter easily because the number of filtered > completions is unknown. > I also think implementation-wise it's also much nicer to provide a stream > that iterates over completions rather than a fixed-size list. This allows for > much more elegant code (stream.filter, stream.limit). > The provided patch adds a single {{Stream lookup(key)}} method > and modifies the existing lookup methods to use it. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10539) Return a stream of completions from FSTCompletion
[ https://issues.apache.org/jira/browse/LUCENE-10539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss resolved LUCENE-10539. -- Resolution: Fixed > Return a stream of completions from FSTCompletion > - > > Key: LUCENE-10539 > URL: https://issues.apache.org/jira/browse/LUCENE-10539 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Minor > Fix For: 9.2 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > FSTLookup currently has a "num" parameter which limits the number of > completions from the underlying automaton. But this has severe disadvantages > if you need to collect completions that need to fulfill a secondary condition > (for example, collect only verbs or terms that contain a certain infix). Then > you can't determine the 'num' parameter easily because the number of filtered > completions is unknown. > I also think implementation-wise it's also much nicer to provide a stream > that iterates over completions rather than a fixed-size list. This allows for > much more elegant code (stream.filter, stream.limit). > The provided patch adds a single {{Stream lookup(key)}} method > and modifies the existing lookup methods to use it. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10539) Return a stream of completions from FSTCompletion
[ https://issues.apache.org/jira/browse/LUCENE-10539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530211#comment-17530211 ] Dawid Weiss commented on LUCENE-10539: -- I applied to branch_9x and main. > Return a stream of completions from FSTCompletion > - > > Key: LUCENE-10539 > URL: https://issues.apache.org/jira/browse/LUCENE-10539 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Minor > Fix For: 9.2 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > FSTLookup currently has a "num" parameter which limits the number of > completions from the underlying automaton. But this has severe disadvantages > if you need to collect completions that need to fulfill a secondary condition > (for example, collect only verbs or terms that contain a certain infix). Then > you can't determine the 'num' parameter easily because the number of filtered > completions is unknown. > I also think implementation-wise it's also much nicer to provide a stream > that iterates over completions rather than a fixed-size list. This allows for > much more elegant code (stream.filter, stream.limit). > The provided patch adds a single {{Stream lookup(key)}} method > and modifies the existing lookup methods to use it. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10550) Add getAllChildren functionality to facets
Yuting Gan created LUCENE-10550: --- Summary: Add getAllChildren functionality to facets Key: LUCENE-10550 URL: https://issues.apache.org/jira/browse/LUCENE-10550 Project: Lucene - Core Issue Type: New Feature Reporter: Yuting Gan Currently Lucene does not support returning range counts sorted by label values, but there are use cases demanding this feature. For example, a user specifies ranges (e.g., [0, 10], [10, 20]) and wants to get range counts without changing the range order. Today we can only call getTopChildren to populate range counts, but it would return ranges sorted by counts (e.g., [10, 20] 100, [0, 10] 50) instead of range values. Lucene has a API, getAllChildrenSortByValue, that returns numeric values with counts sorted by label values, please see [LUCENE-7927|https://issues.apache.org/jira/browse/LUCENE-7927] for details. Therefore, it would be nice that we can also have a similar API to support range counts. The proposed getAllChildren API is to return value/range counts sorted by label values instead of counts. This proposal was inspired from the discussions with [~gsmiller] when I was working on the LUCENE-10538 [PR|https://github.com/apache/lucene/pull/843], and we believe users would benefit from adding this API to Facets. Hope I can get some feedback from the community since this proposal would require changes to the getTopChildren API in RangeFacetCounts. Thanks! -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
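The proposal contrasts two orderings: counts reordered by descending count (what getTopChildren does today for ranges) and counts kept in the user-specified range order (the proposed getAllChildren). A plain-collections sketch of that difference (hypothetical types; Lucene's Facets API actually returns FacetResult and LabelAndValue objects):

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class RangeCountOrdering {
  record RangeCount(String label, int count) {}

  /** getTopChildren-style: ranges reordered by descending count. */
  static List<RangeCount> sortedByCount(List<RangeCount> counts) {
    return counts.stream()
        .sorted(Comparator.comparingInt(RangeCount::count).reversed())
        .collect(Collectors.toList());
  }

  /** Proposed getAllChildren-style: the user-specified range order is preserved. */
  static List<RangeCount> inLabelOrder(List<RangeCount> counts) {
    return List.copyOf(counts); // ranges stay in the order they were defined
  }

  public static void main(String[] args) {
    List<RangeCount> counts =
        List.of(new RangeCount("[0, 10]", 50), new RangeCount("[10, 20]", 100));
    System.out.println(sortedByCount(counts)); // [10, 20] (count 100) first
    System.out.println(inLabelOrder(counts));  // [0, 10] first, as specified
  }
}
```

This mirrors the issue's example: a user who defines [0, 10] and [10, 20] gets back "[10, 20] 100, [0, 10] 50" from the count-sorted API, whereas the proposed API would keep the ranges in the order they were given.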
[jira] [Commented] (LUCENE-10538) TopN is not being used in getTopChildren()
[ https://issues.apache.org/jira/browse/LUCENE-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530229#comment-17530229 ] Yuting Gan commented on LUCENE-10538: - Created a related issue [LUCENE-10550|https://issues.apache.org/jira/browse/LUCENE-10550] and would appreciate any feedback from the community. Thanks! > TopN is not being used in getTopChildren() > -- > > Key: LUCENE-10538 > URL: https://issues.apache.org/jira/browse/LUCENE-10538 > Project: Lucene - Core > Issue Type: Bug >Reporter: Yuting Gan >Priority: Minor > Time Spent: 1.5h > Remaining Estimate: 0h > > When looking at the overridden implementation getTopChildren(int topN, String > dim, String... path) in RangeFacetCounts, I found that the topN parameter is > not being used in the code, and the unit tests did not test this function > properly. I will create a PR to fix this, and will look into other overridden > implementations and see if they have the same issue. Please let me know if > there is any question. Thanks! -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] vigyasharma commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
vigyasharma commented on code in PR #633: URL: https://github.com/apache/lucene/pull/633#discussion_r862140049

## lucene/core/src/test/org/apache/lucene/index/TestAddIndexes.java:

@@ -1014,8 +1304,6 @@ public void testAddIndexesWithRollback() throws Throwable {
     }
     c.closeDir();
-
-    assertTrue(c.failures.size() == 0);

Review Comment: We don't. This checks for failures when writer.close() was called. We are able to abort merges and don't have any failures, so this assertion is valid. I think I removed it while debugging some breaking tests and missed adding it back. Added it back now. Thanks!
[GitHub] [lucene] vigyasharma commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
vigyasharma commented on code in PR #633: URL: https://github.com/apache/lucene/pull/633#discussion_r862140381

## lucene/core/src/test/org/apache/lucene/index/TestIndexWriterOnDiskFull.java:

@@ -352,15 +352,15 @@ public void testAddIndexOnDiskFull() throws IOException {
           done = true;
         }
-      } catch (IllegalStateException | IOException e) {
+      } catch (IllegalStateException | IOException | MergePolicy.MergeException e) {

Review Comment: If the merge threads triggered for addIndexes readers fail, they throw a merge exception, which is silently recorded in that OneMerge object. After all threads join, we check whether the merge passed on all threads, and if not, we rethrow this MergeException to surface the actual cause of the API failure. This is [done here](https://github.com/apache/lucene/pull/633/files/826116d47268714685ad42988ae5f9b66bac6673#diff-edbb11369e2e565210ea9761ee167ac0c6a39521c0e71f8af24f25f4092c740aR3224-R3229). If all threads passed, we proceed to register the new segment info. `testAddIndexOnDiskFull()` simulates this by randomly filling up the disk and causing these merge threads to fail.
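The propagation flow described in this comment, where each merge thread records its exception in its OneMerge instead of throwing and the coordinator rethrows only after all threads have joined, can be sketched with plain threads. OneMerge and runMerges below are simplified stand-ins, not IndexWriter's actual internals:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class MergeFailureSketch {
  /** Stand-in for OneMerge: some merge work plus a slot for a recorded failure. */
  static class OneMerge {
    final Runnable work;
    final AtomicReference<Throwable> failure = new AtomicReference<>();
    OneMerge(Runnable work) { this.work = work; }
  }

  /** Runs all merges on their own threads; rethrows the first recorded failure. */
  static void runMerges(List<OneMerge> merges) throws Exception {
    List<Thread> threads = new ArrayList<>();
    for (OneMerge m : merges) {
      Thread t = new Thread(() -> {
        try {
          m.work.run();
        } catch (Throwable e) {
          m.failure.set(e); // silently recorded, not thrown from the merge thread
        }
      });
      threads.add(t);
      t.start();
    }
    for (Thread t : threads) t.join(); // wait for every merge thread to finish
    for (OneMerge m : merges) {        // then surface the actual cause of failure
      Throwable e = m.failure.get();
      if (e != null) throw new Exception("merge failed", e);
    }
    // all threads passed: safe to proceed (e.g. register the new segment info)
  }

  public static void main(String[] args) throws Exception {
    runMerges(List.of(new OneMerge(() -> {}), new OneMerge(() -> {})));
    System.out.println("all merges passed");
  }
}
```

Deferring the rethrow until after the join is what lets every in-flight merge run to completion (or abort cleanly) before the API reports the failure.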
[GitHub] [lucene] vigyasharma commented on pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
vigyasharma commented on PR #633: URL: https://github.com/apache/lucene/pull/633#issuecomment-1113715276 Updated the `MockRandomMergePolicy` to randomly use the more concurrent version of `findMerges(CodecReader[])` with a 50% probability. This uncovered some failing tests that were asserting directly on segments within a writer and expected a single segment out of the API. Fixed the broken tests. Kudos to Lucene's randomized testing for catching such edge cases!
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #792: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc
mayya-sharipova commented on code in PR #792: URL: https://github.com/apache/lucene/pull/792#discussion_r862127419 ## lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java: ## @@ -320,13 +323,19 @@ private static class FieldEntry { final int numLevels; final int dimension; private final int size; -final int[] ordToDoc; -private final IntUnaryOperator ordToDocOperator; final int[][] nodesByLevel; // for each level the start offsets in vectorIndex file from where to read neighbours final long[] graphOffsetsByLevel; - -FieldEntry(DataInput input, VectorSimilarityFunction similarityFunction) throws IOException { +final long docsWithFieldOffset; +final long docsWithFieldLength; +final short jumpTableEntryCount; +final byte denseRankPower; +long addressesOffset; Review Comment: Should all these new variables be final? ## lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java: ## @@ -258,14 +257,18 @@ public TopDocs search(String field, float[] target, int k, Bits acceptDocs, int } private OffHeapVectorValues getOffHeapVectorValues(FieldEntry fieldEntry) throws IOException { -IndexInput bytesSlice = -vectorData.slice("vector-data", fieldEntry.vectorDataOffset, fieldEntry.vectorDataLength); -return new OffHeapVectorValues( -fieldEntry.dimension, fieldEntry.size(), fieldEntry.ordToDoc, bytesSlice); +if (fieldEntry.docsWithFieldOffset == -2) { + return OffHeapVectorValues.emptyOffHeapVectorValues(fieldEntry.dimension); +} else { + IndexInput bytesSlice = + vectorData.slice("vector-data", fieldEntry.vectorDataOffset, fieldEntry.vectorDataLength); + return new OffHeapVectorValues( + fieldEntry.dimension, fieldEntry.size(), fieldEntry, vectorData, bytesSlice); Review Comment: Should `OffHeapVectorValues` just accept `fieldEntry` as an argument and `fieldEntry.dimension` and `fieldEntry.size()` can be removed? 
## lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java: ## @@ -400,27 +400,42 @@ static class OffHeapVectorValues extends VectorValues private final int dimension; private final int size; -private final int[] ordToDoc; -private final IntUnaryOperator ordToDocOperator; +// dataIn was used to init a new IndexedDISI for #randomAccess() private final IndexInput dataIn; +private final IndexInput slice; private final BytesRef binaryValue; private final ByteBuffer byteBuffer; private final int byteSize; private final float[] value; +private final IndexedDISI disi; +private final FieldEntry fieldEntry; +final DirectMonotonicReader ordToDoc; private int ord = -1; private int doc = -1; -OffHeapVectorValues(int dimension, int size, int[] ordToDoc, IndexInput dataIn) { +OffHeapVectorValues( +int dimension, int size, FieldEntry fieldEntry, IndexInput dataIn, IndexInput slice) +throws IOException { this.dimension = dimension; this.size = size; - this.ordToDoc = ordToDoc; - ordToDocOperator = ordToDoc == null ? IntUnaryOperator.identity() : (ord) -> ordToDoc[ord]; + this.fieldEntry = fieldEntry; this.dataIn = dataIn; + this.slice = slice; + this.disi = initDISI(dataIn); Review Comment: I was thinking about how to simplify the code around `disi`, as it involves a lot of conditions that need to be carefully checked (maybe using a general `DocIdSetIterator` instead, but that doesn't seem to work). Anyway, I could not think of anything smarter, and this LGTM in its current form.
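The `docsWithFieldOffset == -2` check in the earlier diff selects the empty case. A minimal sketch of the resulting three-way dispatch follows; the -2 sentinel is taken from the diff, while treating -1 as the dense every-doc-has-a-value case is an assumption for illustration.

```java
// Sketch of dispatching on the docsWithFieldOffset sentinel.
// -2 (empty) comes from the diff above; -1 as "dense" is an assumption here.
class VectorValuesDispatch {
  static String choose(long docsWithFieldOffset) {
    if (docsWithFieldOffset == -2) {
      return "empty";  // no document has a vector for this field
    } else if (docsWithFieldOffset == -1) {
      return "dense";  // every document has a value; ord maps 1:1 to docID
    } else {
      return "sparse"; // an IndexedDISI slice maps ords to docIDs
    }
  }
}
```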
[jira] [Created] (LUCENE-10551) LowercaseAsciiCompression should return false when it's unable to compress
Peixin Li created LUCENE-10551: -- Summary: LowercaseAsciiCompression should return false when it's unable to compress Key: LUCENE-10551 URL: https://issues.apache.org/jira/browse/LUCENE-10551 Project: Lucene - Core Issue Type: Bug Environment: Lucene version 8.11.1 Reporter: Peixin Li {code:java} Failed to commit.. java.lang.IllegalStateException: 10 <> 5 cion1cion_desarrollociones_oraclecionesnaturacionesnatura2tedppsa-integrationdemotiontion cloud gen2tion instance - dev1tion instance - testtion-devbtion-instancetion-prdtion-promerication-qation064533tion535217tion697401tion761348tion892818tion_matrationcauto_simmonsintgic_testtioncloudprodictioncloudservicetiongateway10tioninstance-jtsundatamartprd??o at org.apache.lucene.util.compress.LowercaseAsciiCompression.compress(LowercaseAsciiCompression.java:115) at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlock(BlockTreeTermsWriter.java:834) at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlocks(BlockTreeTermsWriter.java:628) at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.pushTerm(BlockTreeTermsWriter.java:947) at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:912) at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:318) at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write(PerFieldPostingsFormat.java:170) at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:120) at org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:267) at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350) at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:476) at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:656) at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3364) at 
org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770) at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728) {code} {code:java} key=och-live--WorkResource.renewAssignmentToken.ResourceTime[namespace=workflow, resourceGroup=workflow-service-overlay]{availabilityDomain=iad-ad-1, domainId=och-live, host=workflow-service-overlay-01341.node.ad1.us-ashburn-1}) java.lang.IllegalStateException: 29 <> 16 analytics-platform-test/koala/cluster-tool:1.0-20220310151438.492,mesh_istio_examples-bookinfo-details-v1:1.16.2mesh_istio_examples-bookinfo-reviews-v3:1.16.2oce-clamav:1.0.219oce-tesseract:1.0.7oce-traefik:2.5.1oci-opensearch:1.2.4.8.103oda-digital-assistant-control-plane-train-pool-workflow-v6:22.02.14oke-coresvcs-k8s-dns-dnsmasq-nanny-amd64@sha256:41aa9160ceeaf712369ddb660d02e5ec06d1679965e6930351967c8cf5ed62d4oke-coresvcs-k8s-dns-kube-dns-amd64@sha256:2cf34b04106974952996c6ef1313f165ce65b4ad68a3051f51b1b8f91ba5f838oke-coresvcs-k8s-dns-sidecar-amd64@sha256:8a82c7288725cb4de9c7cd8d5a78279208e379f35751539b406077f9a3163dcdoke-coresvcs-node-problem-detector@sha256:9d54df11804a862c54276648702a45a6a0027a9d930a86becd69c34cc84bf510oke-coresvcs-oke-fluentd-lumberjack@sha256:5f3f10b187eb804ce4e84bc3672de1cf318c0f793f00dac01cd7da8beea8f269oke-etcd-operator@sha256:4353a2e5ef02bb0f6b046a8d6219b1af359a2c1141c358ff110e395f29d0bfc8oke-oke-hyperkube-amd64@sha256:3c734f46099400507f938090eb9a874338fa25cde425ac9409df4c885759752foke-public-busybox@sha256:4cee1979ba0bf7db9fc5d28fb7b798ca69ae95a47c5fecf46327720df4ff352doke-public-coredns@sha256:86f8cfc74497f04e181ab2e1d26d2fd8bd46c4b33ce24b55620efcdfcb214670oke-public-coredns@sha256:8cd974302f1f6108f6f31312f8181ae723b514e2022089cdcc3db10666c49228oke-public-etcd@sha256:b751e459bc2a8f079f6730dd8462671b253c7c8b0d0eb47c67888d5091c6bb77oke-public-etcd@sha256:d6a76200a6e9103681bc2cf7fefbcada0dd9372d52cf8964178d846b89959d14oke-public-etcd@sha256:fa056479342b45479ac74c58176ddad43687d5fc295375d705808f9dfb48439aoke-publi
c-kube-proxy@sha256:93b2da69d03413671606e22294c59a69fe404088a5f6e74d6394a8641fdb899boke-public-tiller@sha256:c2eb6e580123622e1bc0ff3becae3a3a71ac36c98a2786d780590197839175e5osms/opcbuild-osms-agent-proxy-java:0.4.0-129rosms/opcbuild-osms-agent-proxy-nginx:0.4.0-129rosms/opcbuild-osms-ingestion-cert:0.4.0-129rscs-lcm/drift-detector:227scs-lcm/salt-state-sync:242streaming-alpine:30.10.183streaming-kafka:30.10.183vision-service-document-classification:1.1.55vision-service-image-classification:1.4.52 at org.apache.lucene.util.compress.LowercaseAsciiCompression.compress(LowercaseAsciiCompression.java:115) at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlock(BlockTreeTermsWriter.java:834)
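The fix proposed in the issue title — have compression report failure instead of tripping the `IllegalStateException` seen in the stack traces above — amounts to a caller-side fallback. The sketch below is a simplified stand-in, not the real `LowercaseAsciiCompression` logic or signature:

```java
import java.nio.charset.StandardCharsets;

// Simplified stand-in for the proposed behavior: return false for inputs that
// cannot be compressed, so the caller writes raw bytes instead of hitting an
// IllegalStateException during flush/commit.
class AsciiCompressionSketch {
  static boolean tryCompress(byte[] in, StringBuilder out) {
    for (byte b : in) {
      // Decline anything outside lowercase ASCII letters and digits up front.
      if (!((b >= 'a' && b <= 'z') || (b >= '0' && b <= '9'))) {
        return false; // caller falls back to the uncompressed encoding
      }
    }
    // Real code would pack characters into fewer bits; copying is enough here.
    out.append(new String(in, StandardCharsets.US_ASCII));
    return true;
  }
}
```

With this shape, term data containing non-ASCII bytes (like the `??o` sequence in the first stack trace) is simply stored uncompressed rather than aborting the commit.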
[GitHub] [lucene] mocobeta commented on a diff in pull request #854: LUCENE-10545: allow to link github pr from changes
mocobeta commented on code in PR #854: URL: https://github.com/apache/lucene/pull/854#discussion_r862272575 ## lucene/CHANGES.txt: ## @@ -53,7 +53,7 @@ Other * LUCENE-10253: The @BadApple annotation has been removed from the test framework. (Adrien Grand) -* LUCENE-10393: Unify binary dictionary and dictionary writer in Kuromoji and Nori. +* GITHUB-740: Unify binary dictionary and dictionary writer in Kuromoji and Nori. Review Comment: @janhoy thanks for your comment, I hadn't thought of it. It makes sense to me to distinguish GitHub-style numbers from JIRA-like issue keys. I updated the regex and example.
[GitHub] [lucene] mocobeta commented on a diff in pull request #854: LUCENE-10545: allow to link github pr from changes
mocobeta commented on code in PR #854: URL: https://github.com/apache/lucene/pull/854#discussion_r862272691 ## gradle/documentation/changes-to-html/changes2html.pl: ## @@ -572,7 +572,7 @@ $item =~ s{((LUCENE|SOLR|INFRA)\s+(\d{3,}))} {$1}gi; # Link "[ github | gh ] pull request [ # ] X+" to Github pull request - $item =~ s{((?:(?:(?:github|gh)\s+)?pull\s+request\s*(?:\#?\s*)?|gh-)(\d+))} + $item =~ s{((?:(?:(?:github|gh)\s+)?pull\s+request\s*(?:\#?\s*)?|gh-|github-)(\d+))} Review Comment: Yes, it's case insensitive; the `i` flag appears on the next line.
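The same case-insensitive linking rule can be expressed in Java; the pattern below is adapted from the Perl substitution above, and the URL template is an illustrative assumption rather than the script's actual output format.

```java
import java.util.regex.Pattern;

// Case-insensitive match for "pull request 854", "gh-740", or "GITHUB-740",
// mirroring the /i flag the review comment points out.
class PrRefLinker {
  static final Pattern PR_REF = Pattern.compile(
      "((?:(?:(?:github|gh)\\s+)?pull\\s+request\\s*(?:#?\\s*)?|gh-|github-)(\\d+))",
      Pattern.CASE_INSENSITIVE);

  // $1 = full matched reference, $2 = the numeric id.
  static String linkify(String item) {
    return PR_REF.matcher(item)
        .replaceAll("<a href=\"https://github.com/apache/lucene/pull/$2\">$1</a>");
  }
}
```

Note the `github-` alternative is tried only after `gh-` fails, so `GITHUB-740` is captured whole rather than partially.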
[GitHub] [lucene] risdenk merged pull request #847: LUCENE-10542: FieldSource exists implementations can avoid value retrieval
risdenk merged PR #847: URL: https://github.com/apache/lucene/pull/847
[jira] [Commented] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval
[ https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530311#comment-17530311 ] ASF subversion and git services commented on LUCENE-10542: -- Commit 3063109d8375b7bd94873e7ac70c52ad57037773 in lucene's branch refs/heads/main from Kevin Risden [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=3063109d837 ] LUCENE-10542: FieldSource exists implementations can avoid value retrieval (#847) > FieldSource exists implementation can avoid value retrieval > --- > > Key: LUCENE-10542 > URL: https://issues.apache.org/jira/browse/LUCENE-10542 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Kevin Risden >Assignee: Kevin Risden >Priority: Minor > Attachments: flamegraph_getValueForDoc.png > > Time Spent: 0.5h > Remaining Estimate: 0h > > While looking at LUCENE-10534, found that *FieldSource exists implementation > after LUCENE-7407 can avoid value lookup when just checking for exists. > Flamegraphs - x axis = time spent as a percentage of time being profiled, y > axis = stack trace bottom being first call top being last call > Looking only at the left most getValueForDoc highlight only (and it helps to > make it bigger or download the original) > !flamegraph_getValueForDoc.png|height=410,width=1000! > LongFieldSource#exists spends MOST of its time doing a > LongFieldSource#getValueForDoc. 
LongFieldSource#getValueForDoc spends its time doing two things primarily:
> * FilterNumericDocValues#longValue()
> * advance()
> This makes sense based on looking at the code (copied below to make it easier to see at once):
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72
> {code:java}
> private long getValueForDoc(int doc) throws IOException {
>   if (doc < lastDocID) {
>     throw new IllegalArgumentException(
>         "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc);
>   }
>   lastDocID = doc;
>   int curDocID = arr.docID();
>   if (doc > curDocID) {
>     curDocID = arr.advance(doc);
>   }
>   if (doc == curDocID) {
>     return arr.longValue();
>   } else {
>     return 0;
>   }
> }
> {code}
> LongFieldSource#exists doesn't care about the actual longValue, just that there was a value found when iterating through the doc values.
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95
> {code:java}
> @Override
> public boolean exists(int doc) throws IOException {
>   getValueForDoc(doc);
>   return arr.docID() == doc;
> }
> {code}
> So putting this all together for exists calling getValueForDoc, we spend ~50% of the time trying to get the long value when we don't need it in exists. We can save that 50% by making exists not care about the actual value and just return whether doc == curDocID, basically.
> This 50% extra is exaggerated in MaxFloatFunction (and other places) since exists() is being called a bunch. Eventually the value will be needed from longVal(), but if we call exists() say 3 times for every longVal(), we are spending a lot of time computing the value when we only need to check for existence.
> I found the same pattern in DoubleFieldSource, EnumFieldSource, FloatFieldSource, IntFieldSource, LongFieldSource. 
I put together a change showing what this would look like:
>
> Simple JMH performance tests comparing the original FloatFieldSource to the new ones from PR #847.
>
> | Benchmark | Mode | Cnt | Score and Error | Units |
> | MyBenchmark.testMaxFloatFunction | thrpt | 25 | 64.159 ± 2.031 | ops/s |
> | MyBenchmark.testNewMaxFloatFunction | thrpt | 25 | 94.997 ± 2.365 | ops/s |
> | MyBenchmark.testMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 123.191 ± 9.291 | ops/s |
> | MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 123.817 ± 6.191 | ops/s |
> | MyBenchmark.testMaxFloatFunctionRareField | thrpt | 25 | 244.921 ± 6.439 | ops/s |
> | MyBenchmark.testNewMaxFloatFunctionRareField | thrpt | 25 | 239.288 ± 5.136 | ops/s |
> | MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField | thrpt | 25 | 271.521 ± 3.870 | ops/s |
> | MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSourceRareField | thrpt | 25 | 279.334 ± 10.511 | ops/s |
>
> Source: https://github.com/risdenk/lucene-jmh
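The change the issue describes — `exists()` advancing the doc-values iterator without ever reading the value — can be sketched against a minimal iterator interface (a reduced stand-in for `FilterNumericDocValues`; names are illustrative):

```java
// Minimal stand-in for a doc-values iterator, reduced from the Lucene code
// quoted above. exists() advances but never calls longValue(), which is the
// ~50% saving the issue measures.
interface DocValuesLikeIterator {
  int docID();
  int advance(int target); // positions at the first doc >= target with a value
  long longValue();        // the expensive read that exists() can skip
}

class ExistsSketch {
  static boolean exists(DocValuesLikeIterator arr, int doc) {
    int curDocID = arr.docID();
    if (doc > curDocID) {
      curDocID = arr.advance(doc);
    }
    return doc == curDocID; // no longValue() call, unlike getValueForDoc()
  }
}
```

The shape is identical to the quoted `getValueForDoc` minus the value read, which is why the RareField benchmarks (where `advance()` dominates) show little change while the dense-field ones improve.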
[jira] [Updated] (LUCENE-10542) FieldSource exists implementations can avoid value retrieval
[ https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Risden updated LUCENE-10542: -- Summary: FieldSource exists implementations can avoid value retrieval (was: FieldSource exists implementation can avoid value retrieval) -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (LUCENE-10542) FieldSource exists implementations can avoid value retrieval
[ https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530312#comment-17530312 ] ASF subversion and git services commented on LUCENE-10542: -- Commit fba1a68b4518b799d6e4c8637448e032825549cb in lucene's branch refs/heads/branch_9x from Kevin Risden [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=fba1a68b451 ] LUCENE-10542: FieldSource exists implementations can avoid value retrieval (#847)
[jira] [Updated] (LUCENE-10542) FieldSource exists implementations can avoid value retrieval
[ https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Risden updated LUCENE-10542: -- Fix Version/s: 9.2 Resolution: Fixed Status: Resolved (was: Patch Available)
[jira] [Updated] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-10543: - Attachment: Screen_Shot_2022-04-30_at_01.15.00.png
> Achieve contribution workflow perfection (with progress)
> Key: LUCENE-10543
> URL: https://issues.apache.org/jira/browse/LUCENE-10543
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Robert Muir
> Priority: Major
> Attachments: Screen_Shot_2022-04-30_at_01.15.00.png
>
> Inspired by Dawid's build issue, which has worked out for us: LUCENE-9871. He hasn't even linked 10% of the issues/subtasks involved in that work either, but we know.
> I think we need a similar approach for the contribution workflow. There have been some major improvements recently; a couple come to mind:
> * Tomoko made a CONTRIBUTING.md file, which GitHub recognizes and is way better than the wiki stuff.
> * Some hazards/error messages/mazes in the build process and so on have gotten fixed.
> But there is more to do in my opinion. Here are 3 ideas:
> * Creating a PR still has a massive checklist template. But now this template links to CONTRIBUTING.md, so why include the other stuff/checklist? Isn't it enough to just link to CONTRIBUTING.md and fix that as needed?
> * Creating a PR still requires signing up for Apache JIRA and creating a JIRA issue. There is zero value to this additional process. We often end up with either JIRAs and/or PRs that have zero content, or maybe conflicting/outdated content. This is just an unnecessary dance; can we use GitHub issues instead?
> * Haven't dug into the GitHub actions or configs very deeply. Maybe there's simple stuff we can do, such as give useful notifications if checks fail. Try to guide the user to run ./gradlew check and fix it. It sucks to have to review, look at logs, and manually add comments to do this stuff.
> So let's have an issue to improve this area. 
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
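The third idea above (guiding contributors to run ./gradlew check when CI fails) could be sketched as a GitHub Actions workflow. This is a hypothetical illustration, not Lucene's actual CI configuration; the workflow name, trigger, and comment wording are all assumptions:

```yaml
# Hypothetical sketch: post a hint comment on the PR when the Gradle check fails.
# Assumes the repository's validation task is `./gradlew check`; everything else
# (names, trigger, comment text) is illustrative, not Lucene's real CI config.
name: pr-check-hint

on: [pull_request]

jobs:
  check:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v3
      - name: Run the same validation contributors run locally
        run: ./gradlew check
      - name: Guide the contributor on failure
        if: failure()
        uses: actions/github-script@v6
        with:
          script: |
            github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: "Checks failed. Please run `./gradlew check` locally and fix the reported problems before pushing again."
            });
```

Note that workflows triggered by `pull_request` from forks run with a restricted token, so commenting may require additional configuration; the sketch only shows the general shape of the idea.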
[jira] [Commented] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530339#comment-17530339 ]

Robert Muir commented on LUCENE-10543:

IMO it just creates friction. Having to sign up for an account on anything is a big hurdle. Signing up for JIRA is something I haven't done in forever, so I've attached a screenshot of what it looks like. You have to solve a CAPTCHA, etc.:

!Screen_Shot_2022-04-30_at_01.15.00.png!
[jira] [Commented] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530342#comment-17530342 ]

Tomoko Uchida commented on LUCENE-10543:

I forget the Jira sign-up procedure, but I generally agree that it's cumbersome, and I don't like CAPTCHAs either :)