[GitHub] [lucene] jpountz commented on pull request #964: LUCENE-10620: Pass the Weight to Collectors.
jpountz commented on PR #964:
URL: https://github.com/apache/lucene/pull/964#issuecomment-1160079281

Now when collectors need to count hits too (I changed IndexSearcher's `TOTAL_HITS_THRESHOLD` to `Integer.MAX_VALUE`):

```
Task                        QPS baseline   StdDev QPS my_modified_version   StdDev  Pct diff                 p-value
OrHighLow                          90.31   (6.7%)                   35.49   (1.8%)    -60.7% ( -64% - -55%)    0.000
OrHighHigh                         40.98   (5.6%)                   21.17   (2.4%)    -48.3% ( -53% - -42%)    0.000
OrHighNotLow                      143.61   (8.2%)                   76.04   (5.4%)    -47.1% ( -56% - -36%)    0.000
OrHighNotMed                       88.77   (7.7%)                   49.41   (5.6%)    -44.3% ( -53% - -33%)    0.000
OrHighNotHigh                      18.24   (7.4%)                   10.59   (5.9%)    -41.9% ( -51% - -30%)    0.000
OrHighMed                          80.82   (5.0%)                   48.18   (2.8%)    -40.4% ( -45% - -34%)    0.000
OrNotHighHigh                      51.35   (5.7%)                   39.11   (5.6%)    -23.8% ( -33% - -13%)    0.000
AndHighHigh                        53.49   (1.9%)                   41.97   (4.1%)    -21.5% ( -27% - -15%)    0.000
AndHighMed                        321.43   (2.4%)                  258.39   (4.5%)    -19.6% ( -25% - -13%)    0.000
AndHighLow                       1777.06   (2.7%)                 1474.52   (3.1%)    -17.0% ( -22% - -11%)    0.000
MedPhrase                         391.41   (5.9%)                  332.93   (5.1%)    -14.9% ( -24% -  -4%)    0.000
OrNotHighMed                      313.44   (6.7%)                  269.25   (5.3%)    -14.1% ( -24% -  -2%)    0.000
OrNotHighLow                     1977.65   (4.2%)                 1803.88   (4.7%)     -8.8% ( -16% -   0%)    0.000
AndHighHighDayTaxoFacets           25.28   (1.7%)                   23.30   (1.9%)     -7.8% ( -11% -  -4%)    0.000
MedTermDayTaxoFacets               79.97   (2.6%)                   74.42   (3.6%)     -6.9% ( -12% -   0%)    0.000
Prefix3                            27.72   (6.2%)                   25.83   (5.2%)     -6.8% ( -17% -   4%)    0.000
LowPhrase                         159.63   (5.0%)                  148.90   (3.3%)     -6.7% ( -14% -   1%)    0.000
OrHighMedDayTaxoFacets             19.30   (5.7%)                   18.11   (4.2%)     -6.2% ( -15% -   3%)    0.000
HighPhrase                         16.15   (5.7%)                   15.30   (4.6%)     -5.2% ( -14% -   5%)    0.001
Wildcard                           79.98   (2.3%)                   76.50   (3.0%)     -4.4% (  -9% -   1%)    0.000
AndHighMedDayTaxoFacets            72.60   (2.1%)                   69.79   (1.9%)     -3.9% (  -7% -   0%)    0.000
HighSpanNear                       44.71   (4.9%)                   43.00   (4.7%)     -3.8% ( -12% -   6%)    0.012
BrowseDayOfYearTaxoFacets          47.80   (2.0%)                   46.05  (12.3%)     -3.7% ( -17% -  10%)    0.189
Fuzzy2                            103.64   (2.1%)                  100.39   (1.7%)     -3.1% (  -6% -   0%)    0.000
BrowseDateTaxoFacets               46.13   (1.9%)                   44.69  (12.0%)     -3.1% ( -16% -  10%)    0.249
BrowseRandomLabelTaxoFacets        37.71   (2.1%)                   36.53  (10.5%)     -3.1% ( -15% -   9%)    0.195
MedSpanNear                        68.62   (3.0%)                   66.69   (3.0%)     -2.8% (  -8% -   3%)    0.003
LowSpanNear                        57.05   (3.0%)                   55.49   (2.8%)     -2.7% (  -8% -   3%)    0.003
BrowseMonthTaxoFacets              29.68   (7.3%)                   28.87  (12.8%)     -2.7% ( -21% -  18%)    0.410
Fuzzy1                            128.59   (2.2%)                  125.27   (1.8%)     -2.6% (  -6% -   1%)    0.000
LowIntervalsOrdered               219.66   (4.2%)                  216.18   (3.4%)     -1.6% (  -8% -   6%)    0.184
HighIntervalsOrdered               35.55   (5.7%)                   35.03   (4.3%)     -1.5% ( -10% -   9%)    0.361
HighSloppyPhrase                    8.33  (15.0%)                    8.22  (13.6%)     -1.3% ( -25% -  32%)    0.775
BrowseDayOfYearSSDVFacets          21.93   (9.9%)                   21.80   (9.3%)     -0.6% ( -17% -  20%)    0.841
BrowseMonthSSDVFacets              23.61   (8.5%)                   23.53   (8.2%)     -0.3% ( -15% -  17%)    0.904
Respell                            77.59   (2.0%)                   77.42   (2.3%)     -0.2% (  -4% -   4%)    0.740
BrowseRandomLabelSSDVFacets        15.20   (5.5%)                   15.19   (5.7%)     -0.1% ( -10% -  11%)    0.971
MedIntervalsOrdered                43.08   (5.0%)                   43.14   (4.6%)      0.1% (  -9% -  10%)    0.934
LowSloppyPhrase                    54.27  (10.7%)                   54.76  (10.1%)      0.9% ( -17% -  24%)    0.782
BrowseDateSSDVFacets                4.21  (12.4%)                    4.26  (11.8%)      1.0% ( -20% -  28%)    0.784
PKLooku
```
[GitHub] [lucene] jpountz commented on pull request #964: LUCENE-10620: Pass the Weight to Collectors.
jpountz commented on PR #964: URL: https://github.com/apache/lucene/pull/964#issuecomment-1160179472 Unfortunately this is challenging to do right at the moment since the API requires the collector to tell the `ScoreMode` it needs to be able to create the `Weight`. So either the collector says it needs to evaluate all hits (`ScoreMode.COMPLETE`) and then we cannot skip hits in the case when the weight can count its hits efficiently. Or it says it doesn't (`ScoreMode.TOP_SCORES`), like the PR does at the moment and then queries get slower when the weight cannot count hits. We could fix this by moving the score mode to `LeafCollector` instead of `Collector` but this would be a big change... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
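To make the trade-off above concrete, here is a minimal sketch of what a per-segment decision could look like if the score mode were chosen per leaf rather than once per collector. It is only an illustration under stated assumptions: `PerLeafScoreModeSketch`, its nested `ScoreMode` enum and `chooseForLeaf` are hypothetical names that mimic the terms used in the comment, not Lucene's real classes, and `cheapCount` stands for a count the weight might be able to supply without visiting hits (negative meaning "not available").

```java
// Hypothetical sketch; these types only mimic the names used in the comment
// above and are not Lucene's real classes.
final class PerLeafScoreModeSketch {
    enum ScoreMode { COMPLETE, TOP_SCORES }

    /**
     * cheapCount models a per-segment hit count that the weight may be able to
     * supply without visiting hits; a negative value means "not available".
     */
    static ScoreMode chooseForLeaf(int cheapCount) {
        // If the count is already known, non-competitive hits can be skipped.
        // Otherwise every hit must be visited so that it can be counted.
        return cheapCount >= 0 ? ScoreMode.TOP_SCORES : ScoreMode.COMPLETE;
    }

    public static void main(String[] args) {
        int[] perLeafCounts = {1200, -1, 0};
        for (int count : perLeafCounts) {
            System.out.println(count + " -> " + chooseForLeaf(count));
        }
    }
}
```

With a single collector-level score mode, one of the two branches has to be picked for all segments up front, which is the difficulty described in the comment.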
[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira
[ https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556288#comment-17556288 ]

Tomoko Uchida commented on LUCENE-10557:

> I've added a few bullet points that script could/should handle under
> LUCENE-10557, hope you don't mind. If you place these script(s) in the open
> then perhaps indeed we could try to collaborate and see what can be done.

Thanks for your suggestions, Dawid. I'd like to move the conversation from the mailing list to this issue. I think we'll be able to handle the requirements (cross-issue links, and so on) in some way.

I started work on LUCENE-10622 and added the link to the sandbox repository where the migration scripts (early draft) were pushed.

For what it's worth, LUCENE-1 will be migrated something like this. Although the formatting and look-and-feel could be improved a bit, it would not change drastically in its essentials. We cannot simulate Jira issues on GitHub; e.g., it is not allowed to tweak the issue reporter and timestamp (very basic metadata to me), so they have to be embedded in the issue description as free text. I'll continue to work on it, though. Does this really meet your expectations, Mike McCandless and other folks who argue for preserving all issue history in GitHub?

https://github.com/mocobeta/sandbox-lucene-10557/issues/19

> Migrate to GitHub issue from Jira
> ---------------------------------
>
>                 Key: LUCENE-10557
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10557
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Tomoko Uchida
>            Assignee: Tomoko Uchida
>            Priority: Major
>
> A few (not the majority) Apache projects already use the GitHub issue instead
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible that we move to GitHub issue. I have
> little knowledge of how to proceed with it, I'd like to discuss whether we
> should migrate to it, and if so, how to smoothly handle the migration.
> The major tasks would be:
> * (/) Get a consensus about the migration among committers
> * Choose issues that should be moved to GitHub
> ** Discussion thread
> [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12]
> ** -Conclusion for now: We don't migrate any issues. Only new issues should
> be opened on GitHub.-
> ** Write a prototype migration script - the decision could be made on that.
> Things to consider:
> *** version numbers - labels or milestones?
> *** add a comment/ prepend a link to the source Jira issue on github side,
> *** add a comment/ prepend a link on the jira side to the new issue on
> github side (for people who access jira from blogs, mailing list archives and
> other sources that will have stale links),
> *** convert cross-issue automatic links in comments/ descriptions (as
> suggested by Robert),
> *** strategy to deal with sub-issues (hierarchies),
> *** maybe prefix (or postfix) the issue title on github side with the
> original LUCENE-XYZ key so that it is easier to search for a particular issue
> there?
> *** how to deal with user IDs (author, reporter, commenters)? Do they have
> to be github users? Will information about people not registered on github be
> lost?
> *** create an extra mapping file of old-issue-new-issue URLs for any
> potential future uses.
> *** what to do with issue numbers in git/svn commits? These could be
> rewritten but it'd change the entire git history tree - I don't think this is
> practical, while doable.
> * Build the convention for issue label/milestone management
> ** Do some experiments on a sandbox repository
> [https://github.com/mocobeta/sandbox-lucene-10557]
> ** Make documentation for metadata (label/milestone) management
> * Enable Github issue on the lucene's repository
> ** Raise an issue on INFRA
> ** (Create an issue-only private repository for sensitive issues if it's
> needed and allowed)
> ** Set a mail hook to
> [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to
> the general mail group name)
> * Set a schedule for migration
> ** Give some time to committers to play around with issues/labels/milestones
> before the actual migration
> ** Make an announcement on the mail lists
> ** Show some text messages when opening a new Jira issue (in issue template?)

--
This message was sent by Atlassian Jira (v8.20.7#820007)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #967: LUCENE-10623: Error implementation of docValueCount for SortingSortedSetDocValues
jpountz commented on PR #967: URL: https://github.com/apache/lucene/pull/967#issuecomment-1160204024 Thanks for catching this bug. The fix is a bit wasteful in that it requires iterating over ords twice, once to count them and another time to iterate through them. Maybe we should change `DocOrds` to also record the number of ords for each doc (e.g. using a `GrowableWriter`), and stop recording zeroes to signal that all ords for a document have been consumed?
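As a rough illustration of the suggestion above, the sketch below stores an explicit per-document ord count, so that the count is available without a second pass and no zero sentinel is needed. It is a simplified stand-in under stated assumptions: `DocOrdsSketch` and its methods are hypothetical, and plain `int[]`/`long[]` arrays replace whatever packed structures (such as `GrowableWriter`) the real code would use.

```java
import java.util.Arrays;

// Hypothetical stand-in for the idea discussed above; plain arrays replace
// whatever packed structures the real DocOrds class uses.
final class DocOrdsSketch {
    private final long[] ords;   // all ords, concatenated doc by doc
    private final int[] starts;  // starts[doc] = index of the doc's first ord
    private final int[] counts;  // counts[doc] = number of ords for the doc

    DocOrdsSketch(long[][] ordsPerDoc) {
        int maxDoc = ordsPerDoc.length;
        starts = new int[maxDoc];
        counts = new int[maxDoc];
        int total = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            starts[doc] = total;
            counts[doc] = ordsPerDoc[doc].length;
            total += counts[doc];
        }
        ords = new long[total];
        for (int doc = 0; doc < maxDoc; doc++) {
            System.arraycopy(ordsPerDoc[doc], 0, ords, starts[doc], counts[doc]);
        }
    }

    /** Number of ords for the document, read directly instead of counted. */
    int docValueCount(int doc) {
        return counts[doc];
    }

    /** The document's ords; no terminator value is stored or expected. */
    long[] ordsOf(int doc) {
        return Arrays.copyOfRange(ords, starts[doc], starts[doc] + counts[doc]);
    }

    public static void main(String[] args) {
        DocOrdsSketch docOrds = new DocOrdsSketch(new long[][] {{0L, 3L}, {}, {5L}});
        for (int doc = 0; doc < 3; doc++) {
            System.out.println(doc + ": " + docOrds.docValueCount(doc) + " "
                + Arrays.toString(docOrds.ordsOf(doc)));
        }
    }
}
```

Because the end of each document's ords comes from the count, an ord value of 0 needs no special treatment in this layout.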
[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira
[ https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556306#comment-17556306 ] Dawid Weiss commented on LUCENE-10557: I've verified that searches for old issue numbers seem to work: https://github.com/mocobeta/sandbox-lucene-10557/search?q=%22LUCENE-1%22+in%3Atitle&type=issues I'm more familiar with the "hierarchical" tags like "affects/xyz" or "type/bug" but I can live with the comma version. Good to have some of the metadata transferred as well, even as a plain text content in the issue description.
[GitHub] [lucene] jpountz merged pull request #965: LUCENE-10618: Implement BooleanQuery rewrite rules based for minimumShouldMatch
jpountz merged PR #965: URL: https://github.com/apache/lucene/pull/965
[jira] [Commented] (LUCENE-10618) Implement BooleanQuery rewrite rules based for minimumShouldMatch
[ https://issues.apache.org/jira/browse/LUCENE-10618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556345#comment-17556345 ]

ASF subversion and git services commented on LUCENE-10618:

Commit bb1b3dce04c06e7533b2ff418b8b7c2544534e24 in lucene's branch refs/heads/branch_9x from JoeHF
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=bb1b3dce04c ]

LUCENE-10618: Implement BooleanQuery rewrite rules based for minimumShouldMatch (#965)

> Implement BooleanQuery rewrite rules based for minimumShouldMatch
> -----------------------------------------------------------------
>
>                 Key: LUCENE-10618
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10618
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> While looking into a test failure I noticed that we sometimes create weights
> for boolean queries with no SHOULD clauses and a non-zero
> minimumNumberShouldMatch.
> We could rewrite BooleanQuery to MatchNoDocsQuery when the number of SHOULD
> clauses is less than minimumNumberShouldMatch, and make SHOULD clauses
> required when the number of SHOULD clauses is equal to
> minimumNumberShouldMatch.
> This feels a bit like a degenerate case (why would the user create such a
> query in the first place?) but this case can also happen to non-degenerate
> queries if some SHOULD clauses rewrite to a MatchNoDocsQuery and get removed
> through rewrite.
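The two rewrite rules in the issue description can be sketched as a small decision function. This is only an illustration of the rules as stated, not the code merged in the pull request: `MinShouldMatchRewriteSketch`, `Outcome` and `rewrite` are hypothetical names, and the query itself is reduced to two integers.

```java
// Hypothetical model of the two rules from the issue description; the query is
// reduced to its SHOULD clause count and minimumNumberShouldMatch value.
final class MinShouldMatchRewriteSketch {
    enum Outcome { MATCH_NO_DOCS, SHOULD_BECOMES_REQUIRED, UNCHANGED }

    static Outcome rewrite(int shouldClauseCount, int minimumNumberShouldMatch) {
        if (shouldClauseCount < minimumNumberShouldMatch) {
            // Not enough SHOULD clauses to ever satisfy the minimum.
            return Outcome.MATCH_NO_DOCS;
        }
        if (minimumNumberShouldMatch > 0 && shouldClauseCount == minimumNumberShouldMatch) {
            // Every SHOULD clause has to match, so each one is effectively required.
            return Outcome.SHOULD_BECOMES_REQUIRED;
        }
        return Outcome.UNCHANGED;
    }

    public static void main(String[] args) {
        System.out.println(rewrite(0, 1));
        System.out.println(rewrite(2, 2));
        System.out.println(rewrite(3, 1));
    }
}
```

The `minimumNumberShouldMatch > 0` guard is a choice made for this sketch so that a query with no SHOULD clauses and no minimum is left alone.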
[jira] [Resolved] (LUCENE-10618) Implement BooleanQuery rewrite rules based for minimumShouldMatch
[ https://issues.apache.org/jira/browse/LUCENE-10618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-10618. Fix Version/s: 9.3 Resolution: Fixed Thanks [~joe hou]!
[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira
[ https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556362#comment-17556362 ] Tomoko Uchida commented on LUCENE-10557: As for User ID alignment, it'd be great if we can map the reporter/assignee/commenter to correct GitHub accounts. I just wanted to note that there is a trivial but very practical concern for me - we have to "mention" the accounts in the issue description/comment to create a hyperlink (we can't create resources on behalf of the original authors). I think we don't want to receive a huge volume of notifications from old issues. There could be a tip or workaround, otherwise we will not be able to create real links but just have markups like`@mocobeta`.
[GitHub] [lucene] LuXugang commented on pull request #967: LUCENE-10623: Error implementation of docValueCount for SortingSortedSetDocValues
LuXugang commented on PR #967:
URL: https://github.com/apache/lucene/pull/967#issuecomment-1160497729

> Thanks for catching this bug. The fix is a bit wasteful in that it requires iterating over ords twice, once to count them and another time to iterate through them. Maybe we should change `DocOrds` to also record the number of ords for each doc (e.g. using a `GrowableWriter`), and stop recording zeroes to signal that all ords for a document have been consumed?

Thanks for your suggestion, @jpountz. Removing the sentinel value zero and using `GrowableWriter` could make the code more readable!
[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira
[ https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556475#comment-17556475 ]

Tomoko Uchida commented on LUCENE-10557:

I browsed through several JSON dumps of Jira issues. These are some observations.

- It'd be easy to extract various metadata of issues (reporter id, status, created timestamp, etc.)
- It'd be easy to extract all linked issue ids and sub-task ids
- It'd be easy to extract all attached file URLs
-- Can't estimate how many hours it will take to download all of the files
- It'd be easy to extract all comments in an issue
-- Perhaps pagination is needed for issues with many comments
- We can apply parser/converter tools to convert the jira markups to markdown
-- I think this can be error-prone
- It'd be cumbersome to extract GitHub PR links. The links to PRs only appear in the github bot's comments in the Work Log.

On GitHub side, there are no difficulties in dealing with the APIs.

- It'd be a bit tedious to work with milestones via APIs. They can't be referred to by their text. Id - text mapping is needed
- It might need some trials and errors to properly place attached files in their right place

As for the cross-link conversion and account mapping script:

- To "embed" github issue links / accounts in their right place (maybe next to the Jira issue keys / user names), we need to modify the original text. This can be tricky and the riskiest part to me. Instead of modifying the original text, we could just add some footnotes for the issues/comments - but it could considerably damage the readability.

Yes it should be possible with a set of small scripts. Maybe one problem is that it'd be difficult to detect conversion errors/omissions and we can't correct them ourselves if we notice migration errors later (it seems we are not to be allowed to have the github tokens of the ASF repository).
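The "embed links next to the Jira issue keys" option mentioned above could be sketched roughly as follows. This is an assumption-laden illustration, not the migration script from the sandbox repository: `IssueLinkSketch` and `addGitHubRefs` are hypothetical names, the mapping from Jira keys to GitHub issue numbers is assumed to come from a separately built mapping file, and keys without a mapping are left untouched.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: appends a GitHub issue reference after each Jira key
// that has an entry in the supplied mapping.
final class IssueLinkSketch {
    private static final Pattern JIRA_KEY = Pattern.compile("\\bLUCENE-(\\d+)\\b");

    static String addGitHubRefs(String text, Map<String, Integer> jiraToGitHub) {
        Matcher m = JIRA_KEY.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Integer gitHubNumber = jiraToGitHub.get(m.group());
            // Keep the original key in place; only add the reference when known.
            String replacement =
                gitHubNumber == null ? m.group() : m.group() + " (#" + gitHubNumber + ")";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The mapping value here is illustrative only.
        Map<String, Integer> mapping = Map.of("LUCENE-1", 19);
        System.out.println(addGitHubRefs("See LUCENE-1 and LUCENE-99.", mapping));
    }
}
```

Keeping the Jira key and only appending to it limits how much of the original text is modified, which is the risk the comment points out; unmapped keys could also be logged to help detect omissions.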
[jira] [Comment Edited] (LUCENE-10557) Migrate to GitHub issue from Jira
[ https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556475#comment-17556475 ] Tomoko Uchida edited comment on LUCENE-10557 at 6/20/22 5:20 PM.
[GitHub] [lucene] jtibshirani commented on pull request #951: LUCENE-10606: Optimize Prefilter Hit Collection
jtibshirani commented on PR #951: URL: https://github.com/apache/lucene/pull/951#issuecomment-1160945431 @kaivalnp just wanted to check how this is going. I'm excited about this improvement. Let me know if I can help with anything, for example I could dig into the questions that Adrien and I raised earlier. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10624) Binary Search for Sparse IndexedDISI advanceWithinBlock & advanceExactWithinBlock
[ https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiming Wu updated LUCENE-10624: Status: Patch Available (was: Open) > Binary Search for Sparse IndexedDISI advanceWithinBlock & > advanceExactWithinBlock > - > > Key: LUCENE-10624 > URL: https://issues.apache.org/jira/browse/LUCENE-10624 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 9.0, 9.1, 9.2 >Reporter: Weiming Wu >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > h3. Problem Statement > We noticed DocValue read performance regression with the iterative API when > upgrading from Lucene 5 to Lucene 9. Our latency is increased by 50%. The > degradation is similar to what's described in > https://issues.apache.org/jira/browse/SOLR-9599 > By analyzing profiling data, we found method "advanceWithinBlock" and > "advanceExactWithinBlock" for Sparse IndexedDISI is slow in Lucene 9 due to > their O(N) doc lookup algorithm. > h3. Changes > Used binary search algorithm to replace current O(N) lookup algorithm in > Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because > docs are in ascending order. > h3. Test > {code:java} > ./gradlew tidy > ./gradlew check {code} > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
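The "Changes" section above replaces an O(N) scan with a binary search over ascending doc IDs. The following is only a minimal sketch of that idea, not the patch itself: the class and method names are hypothetical, and a plain in-memory `int[]` stands in for however the sparse block's doc IDs are actually stored and read.

```java
// Hypothetical illustration; not Lucene's IndexedDISI code.
final class SparseBlockSearch {

  /** O(N) scan: first index i >= from with docs[i] >= target, or docs.length if none. */
  static int linearAdvance(int[] docs, int from, int target) {
    for (int i = from; i < docs.length; i++) {
      if (docs[i] >= target) {
        return i;
      }
    }
    return docs.length;
  }

  /** O(log N) lower-bound search with the same contract; requires docs to be ascending. */
  static int binaryAdvance(int[] docs, int from, int target) {
    int lo = from;
    int hi = docs.length; // the answer lies in [lo, hi]
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (docs[mid] < target) {
        lo = mid + 1; // everything up to mid is too small
      } else {
        hi = mid; // docs[mid] is a candidate; keep looking to the left
      }
    }
    return lo;
  }

  public static void main(String[] args) {
    int[] docs = {3, 8, 15, 42, 97, 256, 1024};
    System.out.println(binaryAdvance(docs, 0, 42));   // prints 3 (exact match)
    System.out.println(binaryAdvance(docs, 0, 43));   // prints 4 (next doc is 97)
    System.out.println(binaryAdvance(docs, 0, 5000)); // prints 7 (no such doc)
  }
}
```

Both methods return the same index for every target; only the number of comparisons differs.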
[jira] [Updated] (LUCENE-10624) Binary Search for Sparse IndexedDISI advanceWithinBlock & advanceExactWithinBlock
[ https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiming Wu updated LUCENE-10624: Attachment: baseline_sparseTaxis_searchsparse-sorted.0.log > Binary Search for Sparse IndexedDISI advanceWithinBlock & > advanceExactWithinBlock > - > > Key: LUCENE-10624 > URL: https://issues.apache.org/jira/browse/LUCENE-10624 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 9.0, 9.1, 9.2 >Reporter: Weiming Wu >Priority: Major > Attachments: baseline_sparseTaxis_searchsparse-sorted.0.log, > candidate_sparseTaxis_searchsparse-sorted.0.log > > Time Spent: 10m > Remaining Estimate: 0h > > h3. Problem Statement > We noticed DocValue read performance regression with the iterative API when > upgrading from Lucene 5 to Lucene 9. Our latency is increased by 50%. The > degradation is similar to what's described in > https://issues.apache.org/jira/browse/SOLR-9599 > By analyzing profiling data, we found method "advanceWithinBlock" and > "advanceExactWithinBlock" for Sparse IndexedDISI is slow in Lucene 9 due to > their O(N) doc lookup algorithm. > h3. Changes > Used binary search algorithm to replace current O(N) lookup algorithm in > Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because > docs are in ascending order. > h3. Test > {code:java} > ./gradlew tidy > ./gradlew check {code} > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10624) Binary Search for Sparse IndexedDISI advanceWithinBlock & advanceExactWithinBlock
[ https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiming Wu updated LUCENE-10624: Attachment: candidate_sparseTaxis_searchsparse-sorted.0.log
[jira] [Updated] (LUCENE-10624) Binary Search for Sparse IndexedDISI advanceWithinBlock & advanceExactWithinBlock
[ https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiming Wu updated LUCENE-10624: Description: h3. Problem Statement We noticed DocValue read performance regression with the iterative API when upgrading from Lucene 5 to Lucene 9. Our latency is increased by 50%. The degradation is similar to what's described in https://issues.apache.org/jira/browse/SOLR-9599 By analyzing profiling data, we found method "advanceWithinBlock" and "advanceExactWithinBlock" for Sparse IndexedDISI is slow in Lucene 9 due to their O(N) doc lookup algorithm. h3. Changes Used binary search algorithm to replace current O(N) lookup algorithm in Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because docs are in ascending order. h3. Test {code:java} ./gradlew tidy ./gradlew check {code} h3. Benchmark Ran sparseTaxis from {color:#1d1c1d}luceneutil. Attached the reports of baseline and candidates.{color} {color:#1d1c1d}1. Most cases have ~15% latency reduction.{color} {color:#1d1c1d}2. Some highlights:{color} * {color:#1d1c1d}T0 green_pickup_latitude:[40.75 TO 40.9] yellow_pickup_latitude:[40.75 TO 40.9] sort=null{color} ** {color:#1d1c1d}Baseline: 10973978+ hits hits in 726.81967 msec{color} ** {color:#1d1c1d}Candidate: 10973978+ hits hits in 484.544594 msec{color} was: h3. Problem Statement We noticed DocValue read performance regression with the iterative API when upgrading from Lucene 5 to Lucene 9. Our latency is increased by 50%. The degradation is similar to what's described in https://issues.apache.org/jira/browse/SOLR-9599 By analyzing profiling data, we found method "advanceWithinBlock" and "advanceExactWithinBlock" for Sparse IndexedDISI is slow in Lucene 9 due to their O(N) doc lookup algorithm. h3. Changes Used binary search algorithm to replace current O(N) lookup algorithm in Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because docs are in ascending order. h3. 
Test {code:java} ./gradlew tidy ./gradlew check {code} > Binary Search for Sparse IndexedDISI advanceWithinBlock & > advanceExactWithinBlock > - > > Key: LUCENE-10624 > URL: https://issues.apache.org/jira/browse/LUCENE-10624 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 9.0, 9.1, 9.2 >Reporter: Weiming Wu >Priority: Major > Attachments: baseline_sparseTaxis_searchsparse-sorted.0.log, > candidate_sparseTaxis_searchsparse-sorted.0.log > > Time Spent: 10m > Remaining Estimate: 0h > > h3. Problem Statement > We noticed DocValue read performance regression with the iterative API when > upgrading from Lucene 5 to Lucene 9. Our latency is increased by 50%. The > degradation is similar to what's described in > https://issues.apache.org/jira/browse/SOLR-9599 > By analyzing profiling data, we found method "advanceWithinBlock" and > "advanceExactWithinBlock" for Sparse IndexedDISI is slow in Lucene 9 due to > their O(N) doc lookup algorithm. > h3. Changes > Used binary search algorithm to replace current O(N) lookup algorithm in > Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because > docs are in ascending order. > h3. Test > {code:java} > ./gradlew tidy > ./gradlew check {code} > h3. Benchmark > Ran sparseTaxis from {color:#1d1c1d}luceneutil. Attached the reports of > baseline and candidates.{color} > {color:#1d1c1d}1. Most cases have ~15% latency reduction.{color} > {color:#1d1c1d}2. Some highlights:{color} > * {color:#1d1c1d}T0 green_pickup_latitude:[40.75 TO 40.9] > yellow_pickup_latitude:[40.75 TO 40.9] sort=null{color} > ** {color:#1d1c1d}Baseline: 10973978+ hits hits in 726.81967 msec{color} > ** {color:#1d1c1d}Candidate: 10973978+ hits hits in 484.544594 msec{color} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10624) Binary Search for Sparse IndexedDISI advanceWithinBlock & advanceExactWithinBlock
[ https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiming Wu updated LUCENE-10624: Description: h3. Problem Statement We noticed DocValue read performance regression with the iterative API when upgrading from Lucene 5 to Lucene 9. Our latency is increased by 50%. The degradation is similar to what's described in https://issues.apache.org/jira/browse/SOLR-9599 By analyzing profiling data, we found method "advanceWithinBlock" and "advanceExactWithinBlock" for Sparse IndexedDISI is slow in Lucene 9 due to their O(N) doc lookup algorithm. h3. Changes Used binary search algorithm to replace current O(N) lookup algorithm in Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because docs are in ascending order. h3. Test {code:java} ./gradlew tidy ./gradlew check {code} h3. Benchmark Ran sparseTaxis from {color:#1d1c1d}luceneutil. Attached the reports of baseline and candidates.{color} {color:#1d1c1d}1. Most cases have ~15% latency reduction.{color} {color:#1d1c1d}2. Some highlights (>20%):{color} * *{color:#1d1c1d}T0 green_pickup_latitude:[40.75 TO 40.9] yellow_pickup_latitude:[40.75 TO 40.9] sort=null{color}* ** {color:#1d1c1d}*Baseline:* 10973978+ hits hits in 726.81967 msec{color} ** {color:#1d1c1d}*Candidate:* 10973978+ hits hits in 484.544594 msec{color} * {color:#1d1c1d}T0 cab_color:y cab_color:g sort=null{color} ** {color:#1d1c1d}*Baseline:* 2300174+ hits hits in 95.698324 msec{color} ** {color:#1d1c1d}*Candidate:* 2300174+ hits hits in 78.336193 msec{color} was: h3. Problem Statement We noticed DocValue read performance regression with the iterative API when upgrading from Lucene 5 to Lucene 9. Our latency is increased by 50%. 
The degradation is similar to what's described in https://issues.apache.org/jira/browse/SOLR-9599 By analyzing profiling data, we found method "advanceWithinBlock" and "advanceExactWithinBlock" for Sparse IndexedDISI is slow in Lucene 9 due to their O(N) doc lookup algorithm. h3. Changes Used binary search algorithm to replace current O(N) lookup algorithm in Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because docs are in ascending order. h3. Test {code:java} ./gradlew tidy ./gradlew check {code} h3. Benchmark Ran sparseTaxis from {color:#1d1c1d}luceneutil. Attached the reports of baseline and candidates.{color} {color:#1d1c1d}1. Most cases have ~15% latency reduction.{color} {color:#1d1c1d}2. Some highlights:{color} * {color:#1d1c1d}T0 green_pickup_latitude:[40.75 TO 40.9] yellow_pickup_latitude:[40.75 TO 40.9] sort=null{color} ** {color:#1d1c1d}Baseline: 10973978+ hits hits in 726.81967 msec{color} ** {color:#1d1c1d}Candidate: 10973978+ hits hits in 484.544594 msec{color} > Binary Search for Sparse IndexedDISI advanceWithinBlock & > advanceExactWithinBlock > - > > Key: LUCENE-10624 > URL: https://issues.apache.org/jira/browse/LUCENE-10624 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 9.0, 9.1, 9.2 >Reporter: Weiming Wu >Priority: Major > Attachments: baseline_sparseTaxis_searchsparse-sorted.0.log, > candidate_sparseTaxis_searchsparse-sorted.0.log > > Time Spent: 10m > Remaining Estimate: 0h > > h3. Problem Statement > We noticed DocValue read performance regression with the iterative API when > upgrading from Lucene 5 to Lucene 9. Our latency is increased by 50%. The > degradation is similar to what's described in > https://issues.apache.org/jira/browse/SOLR-9599 > By analyzing profiling data, we found method "advanceWithinBlock" and > "advanceExactWithinBlock" for Sparse IndexedDISI is slow in Lucene 9 due to > their O(N) doc lookup algorithm. > h3. 
Changes > Used binary search algorithm to replace current O(N) lookup algorithm in > Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because > docs are in ascending order. > h3. Test > {code:java} > ./gradlew tidy > ./gradlew check {code} > h3. Benchmark > Ran sparseTaxis from {color:#1d1c1d}luceneutil. Attached the reports of > baseline and candidates.{color} > {color:#1d1c1d}1. Most cases have ~15% latency reduction.{color} > {color:#1d1c1d}2. Some highlights (>20%):{color} > * *{color:#1d1c1d}T0 green_pickup_latitude:[40.75 TO 40.9] > yellow_pickup_latitude:[40.75 TO 40.9] sort=null{color}* > ** {color:#1d1c1d}*Baseline:* 10973978+ hits hits in 726.81967 msec{color} > ** {color:#1d1c1d}*Candidate:* 10973978+ hits hits in 484.544594 msec{color} > * {color:#1d1c1d}T0 cab_color:y cab_color:g sort=null{color} > ** {color:#1d1c1d}*Baseline:* 2300174+ hits hits in 95.698324 msec{color} > ** {color:#1d1c1d}*Candidate:* 2300174+ hits hits in 78.336193 msec{c
[jira] [Commented] (LUCENE-10624) Binary Search for Sparse IndexedDISI advanceWithinBlock & advanceExactWithinBlock
[ https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556662#comment-17556662 ] Weiming Wu commented on LUCENE-10624: - Added benchmark data to the content. > Binary Search for Sparse IndexedDISI advanceWithinBlock & > advanceExactWithinBlock > - > > Key: LUCENE-10624 > URL: https://issues.apache.org/jira/browse/LUCENE-10624 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 9.0, 9.1, 9.2 >Reporter: Weiming Wu >Priority: Major > Attachments: baseline_sparseTaxis_searchsparse-sorted.0.log, > candidate_sparseTaxis_searchsparse-sorted.0.log > > Time Spent: 10m > Remaining Estimate: 0h > > h3. Problem Statement > We noticed DocValue read performance regression with the iterative API when > upgrading from Lucene 5 to Lucene 9. Our latency is increased by 50%. The > degradation is similar to what's described in > https://issues.apache.org/jira/browse/SOLR-9599 > By analyzing profiling data, we found method "advanceWithinBlock" and > "advanceExactWithinBlock" for Sparse IndexedDISI is slow in Lucene 9 due to > their O(N) doc lookup algorithm. > h3. Changes > Used binary search algorithm to replace current O(N) lookup algorithm in > Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because > docs are in ascending order. > h3. Test > {code:java} > ./gradlew tidy > ./gradlew check {code} > h3. Benchmark > Ran sparseTaxis test cases from {color:#1d1c1d}luceneutil. Attached the > reports of baseline and candidates in attachments section. > {color} > {color:#1d1c1d}1. Most cases have ~10% search latency reduction.{color} > {color:#1d1c1d}2. 
Some highlights (>20%):{color} > * *{color:#1d1c1d}T0 green_pickup_latitude:[40.75 TO 40.9] > yellow_pickup_latitude:[40.75 TO 40.9] sort=null{color}* > ** {color:#1d1c1d}*Baseline:* 10973978+ hits hits in *726.81967 msec*{color} > ** {color:#1d1c1d}*Candidate:* 10973978+ hits hits in *484.544594 > msec*{color} > * *{color:#1d1c1d}T0 cab_color:y cab_color:g sort=null{color}* > ** {color:#1d1c1d}*Baseline:* 2300174+ hits hits in *95.698324 msec*{color} > ** {color:#1d1c1d}*Candidate:* 2300174+ hits hits in *78.336193 msec*{color}
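The description does not say whether the ">20%" threshold means latency reduction or throughput gain, and the two readings give different numbers for the same pair of timings. The sketch below computes both from the baseline/candidate timings quoted above; the class and method names are hypothetical. Under these definitions the first query comes out at roughly a 33% latency reduction (about a 50% speedup) and the second at roughly 18% (about 22%).

```java
// Hypothetical helper for reading the quoted timings; not part of luceneutil.
final class LatencyDelta {

  /** Percentage of the baseline latency that was removed. */
  static double reductionPercent(double baselineMsec, double candidateMsec) {
    return 100.0 * (baselineMsec - candidateMsec) / baselineMsec;
  }

  /** Percentage gain in throughput implied by the same two latencies. */
  static double speedupPercent(double baselineMsec, double candidateMsec) {
    return 100.0 * (baselineMsec / candidateMsec - 1.0);
  }

  public static void main(String[] args) {
    // {baseline msec, candidate msec} pairs copied from the issue description.
    double[][] runs = {{726.81967, 484.544594}, {95.698324, 78.336193}};
    for (double[] r : runs) {
      System.out.printf(
          "reduction=%.1f%% speedup=%.1f%%%n",
          reductionPercent(r[0], r[1]), speedupPercent(r[0], r[1]));
    }
  }
}
```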
[jira] [Updated] (LUCENE-10624) Binary Search for Sparse IndexedDISI advanceWithinBlock & advanceExactWithinBlock
[ https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiming Wu updated LUCENE-10624: Description: h3. Problem Statement We noticed DocValue read performance regression with the iterative API when upgrading from Lucene 5 to Lucene 9. Our latency is increased by 50%. The degradation is similar to what's described in https://issues.apache.org/jira/browse/SOLR-9599 By analyzing profiling data, we found method "advanceWithinBlock" and "advanceExactWithinBlock" for Sparse IndexedDISI is slow in Lucene 9 due to their O(N) doc lookup algorithm. h3. Changes Used binary search algorithm to replace current O(N) lookup algorithm in Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because docs are in ascending order. h3. Test {code:java} ./gradlew tidy ./gradlew check {code} h3. Benchmark Ran sparseTaxis test cases from {color:#1d1c1d}luceneutil. Attached the reports of baseline and candidates in attachments section. {color} {color:#1d1c1d}1. Most cases have ~10% search latency reduction.{color} {color:#1d1c1d}2. Some highlights (>20%):{color} * *{color:#1d1c1d}T0 green_pickup_latitude:[40.75 TO 40.9] yellow_pickup_latitude:[40.75 TO 40.9] sort=null{color}* ** {color:#1d1c1d}*Baseline:* 10973978+ hits hits in *726.81967 msec*{color} ** {color:#1d1c1d}*Candidate:* 10973978+ hits hits in *484.544594 msec*{color} * *{color:#1d1c1d}T0 cab_color:y cab_color:g sort=null{color}* ** {color:#1d1c1d}*Baseline:* 2300174+ hits hits in *95.698324 msec*{color} ** {color:#1d1c1d}*Candidate:* 2300174+ hits hits in *78.336193 msec*{color} was: h3. Problem Statement We noticed DocValue read performance regression with the iterative API when upgrading from Lucene 5 to Lucene 9. Our latency is increased by 50%. 
The degradation is similar to what's described in https://issues.apache.org/jira/browse/SOLR-9599 By analyzing profiling data, we found method "advanceWithinBlock" and "advanceExactWithinBlock" for Sparse IndexedDISI is slow in Lucene 9 due to their O(N) doc lookup algorithm. h3. Changes Used binary search algorithm to replace current O(N) lookup algorithm in Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because docs are in ascending order. h3. Test {code:java} ./gradlew tidy ./gradlew check {code} h3. Benchmark Ran sparseTaxis from {color:#1d1c1d}luceneutil. Attached the reports of baseline and candidates.{color} {color:#1d1c1d}1. Most cases have ~15% latency reduction.{color} {color:#1d1c1d}2. Some highlights (>20%):{color} * *{color:#1d1c1d}T0 green_pickup_latitude:[40.75 TO 40.9] yellow_pickup_latitude:[40.75 TO 40.9] sort=null{color}* ** {color:#1d1c1d}*Baseline:* 10973978+ hits hits in 726.81967 msec{color} ** {color:#1d1c1d}*Candidate:* 10973978+ hits hits in 484.544594 msec{color} * {color:#1d1c1d}T0 cab_color:y cab_color:g sort=null{color} ** {color:#1d1c1d}*Baseline:* 2300174+ hits hits in 95.698324 msec{color} ** {color:#1d1c1d}*Candidate:* 2300174+ hits hits in 78.336193 msec{color} > Binary Search for Sparse IndexedDISI advanceWithinBlock & > advanceExactWithinBlock > - > > Key: LUCENE-10624 > URL: https://issues.apache.org/jira/browse/LUCENE-10624 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 9.0, 9.1, 9.2 >Reporter: Weiming Wu >Priority: Major > Attachments: baseline_sparseTaxis_searchsparse-sorted.0.log, > candidate_sparseTaxis_searchsparse-sorted.0.log > > Time Spent: 10m > Remaining Estimate: 0h > > h3. Problem Statement > We noticed DocValue read performance regression with the iterative API when > upgrading from Lucene 5 to Lucene 9. Our latency is increased by 50%. 
The > degradation is similar to what's described in > https://issues.apache.org/jira/browse/SOLR-9599 > By analyzing profiling data, we found method "advanceWithinBlock" and > "advanceExactWithinBlock" for Sparse IndexedDISI is slow in Lucene 9 due to > their O(N) doc lookup algorithm. > h3. Changes > Used binary search algorithm to replace current O(N) lookup algorithm in > Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because > docs are in ascending order. > h3. Test > {code:java} > ./gradlew tidy > ./gradlew check {code} > h3. Benchmark > Ran sparseTaxis test cases from {color:#1d1c1d}luceneutil. Attached the > reports of baseline and candidates in attachments section. > {color} > {color:#1d1c1d}1. Most cases have ~10% search latency reduction.{color} > {color:#1d1c1d}2. Some highlights (>20%):{color} > * *{color:#1d1c1d}T0 green_pickup_latitude:[40.75 TO 40.9] > yellow_pickup_latitude:[40.75 TO 40.9] sort=null{color}* > ** {color:#1d1c1d}*Baseline:* 10973978+ hits h
[jira] [Updated] (LUCENE-10624) Binary Search for Sparse IndexedDISI advanceWithinBlock & advanceExactWithinBlock
[ https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiming Wu updated LUCENE-10624: Description: h3. Problem Statement We noticed DocValue read performance regression with the iterative API when upgrading from Lucene 5 to Lucene 9. Our latency is increased by 50%. The degradation is similar to what's described in https://issues.apache.org/jira/browse/SOLR-9599 By analyzing profiling data, we found method "advanceWithinBlock" and "advanceExactWithinBlock" for Sparse IndexedDISI is slow in Lucene 9 due to their O(N) doc lookup algorithm. h3. Changes Used binary search algorithm to replace current O(N) lookup algorithm in Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because docs are in ascending order. h3. Test {code:java} ./gradlew tidy ./gradlew check {code} h3. Benchmark Ran sparseTaxis test cases from {color:#1d1c1d}luceneutil. Attached the reports of baseline and candidates in attachments section.{color} {color:#1d1c1d}1. Most cases have 5-10% search latency reduction.{color} {color:#1d1c1d}2. Some highlights (>20%):{color} * *{color:#1d1c1d}T0 green_pickup_latitude:[40.75 TO 40.9] yellow_pickup_latitude:[40.75 TO 40.9] sort=null{color}* ** {color:#1d1c1d}*Baseline:* 10973978+ hits hits in *726.81967 msec*{color} ** {color:#1d1c1d}*Candidate:* 10973978+ hits hits in *484.544594 msec*{color} * *{color:#1d1c1d}T0 cab_color:y cab_color:g sort=null{color}* ** {color:#1d1c1d}*Baseline:* 2300174+ hits hits in *95.698324 msec*{color} ** {color:#1d1c1d}*Candidate:* 2300174+ hits hits in *78.336193 msec*{color} * {color:#1d1c1d}*T1 cab_color:y cab_color:g sort=null*{color} ** {color:#1d1c1d}*Baseline:* 2300174+ hits hits in *391.565239 msec*{color} ** {color:#1d1c1d}*Candidate:* 300174+ hits hits in *227.592885 msec*{color}{*}{*} * {color:#1d1c1d}*...*{color} was: h3. 
Problem Statement We noticed DocValue read performance regression with the iterative API when upgrading from Lucene 5 to Lucene 9. Our latency is increased by 50%. The degradation is similar to what's described in https://issues.apache.org/jira/browse/SOLR-9599 By analyzing profiling data, we found method "advanceWithinBlock" and "advanceExactWithinBlock" for Sparse IndexedDISI is slow in Lucene 9 due to their O(N) doc lookup algorithm. h3. Changes Used binary search algorithm to replace current O(N) lookup algorithm in Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because docs are in ascending order. h3. Test {code:java} ./gradlew tidy ./gradlew check {code} h3. Benchmark Ran sparseTaxis test cases from {color:#1d1c1d}luceneutil. Attached the reports of baseline and candidates in attachments section. {color} {color:#1d1c1d}1. Most cases have ~10% search latency reduction.{color} {color:#1d1c1d}2. Some highlights (>20%):{color} * *{color:#1d1c1d}T0 green_pickup_latitude:[40.75 TO 40.9] yellow_pickup_latitude:[40.75 TO 40.9] sort=null{color}* ** {color:#1d1c1d}*Baseline:* 10973978+ hits hits in *726.81967 msec*{color} ** {color:#1d1c1d}*Candidate:* 10973978+ hits hits in *484.544594 msec*{color} * *{color:#1d1c1d}T0 cab_color:y cab_color:g sort=null{color}* ** {color:#1d1c1d}*Baseline:* 2300174+ hits hits in *95.698324 msec*{color} ** {color:#1d1c1d}*Candidate:* 2300174+ hits hits in *78.336193 msec*{color} > Binary Search for Sparse IndexedDISI advanceWithinBlock & > advanceExactWithinBlock > - > > Key: LUCENE-10624 > URL: https://issues.apache.org/jira/browse/LUCENE-10624 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 9.0, 9.1, 9.2 >Reporter: Weiming Wu >Priority: Major > Attachments: baseline_sparseTaxis_searchsparse-sorted.0.log, > candidate_sparseTaxis_searchsparse-sorted.0.log > > Time Spent: 10m > Remaining Estimate: 0h > > h3. 
Problem Statement > We noticed DocValue read performance regression with the iterative API when > upgrading from Lucene 5 to Lucene 9. Our latency is increased by 50%. The > degradation is similar to what's described in > https://issues.apache.org/jira/browse/SOLR-9599 > By analyzing profiling data, we found method "advanceWithinBlock" and > "advanceExactWithinBlock" for Sparse IndexedDISI is slow in Lucene 9 due to > their O(N) doc lookup algorithm. > h3. Changes > Used binary search algorithm to replace current O(N) lookup algorithm in > Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because > docs are in ascending order. > h3. Test > {code:java} > ./gradlew tidy > ./gradlew check {code} > h3. Benchmark > Ran sparseTaxis test cases from {color:#1d1c1d}luceneutil. Attached the > reports of baseline and candidates in attachments sectio
[jira] [Comment Edited] (LUCENE-10624) Binary Search for Sparse IndexedDISI advanceWithinBlock & advanceExactWithinBlock
[ https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556662#comment-17556662 ] Weiming Wu edited comment on LUCENE-10624 at 6/21/22 6:16 AM: -- Added benchmark data to the description. was (Author: JIRAUSER290435): Added benchmark data to the content. > Binary Search for Sparse IndexedDISI advanceWithinBlock & > advanceExactWithinBlock > - > > Key: LUCENE-10624 > URL: https://issues.apache.org/jira/browse/LUCENE-10624 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 9.0, 9.1, 9.2 >Reporter: Weiming Wu >Priority: Major > Attachments: baseline_sparseTaxis_searchsparse-sorted.0.log, > candidate_sparseTaxis_searchsparse-sorted.0.log > > Time Spent: 10m > Remaining Estimate: 0h > > h3. Problem Statement > We noticed DocValue read performance regression with the iterative API when > upgrading from Lucene 5 to Lucene 9. Our latency is increased by 50%. The > degradation is similar to what's described in > https://issues.apache.org/jira/browse/SOLR-9599 > By analyzing profiling data, we found method "advanceWithinBlock" and > "advanceExactWithinBlock" for Sparse IndexedDISI is slow in Lucene 9 due to > their O(N) doc lookup algorithm. > h3. Changes > Used binary search algorithm to replace current O(N) lookup algorithm in > Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because > docs are in ascending order. > h3. Test > {code:java} > ./gradlew tidy > ./gradlew check {code} > h3. Benchmark > Ran sparseTaxis test cases from {color:#1d1c1d}luceneutil. Attached the > reports of baseline and candidates in attachments section.{color} > {color:#1d1c1d}1. Most cases have 5-10% search latency reduction.{color} > {color:#1d1c1d}2. 
Some highlights (>20%):{color} > * *{color:#1d1c1d}T0 green_pickup_latitude:[40.75 TO 40.9] > yellow_pickup_latitude:[40.75 TO 40.9] sort=null{color}* > ** {color:#1d1c1d}*Baseline:* 10973978+ hits hits in *726.81967 msec*{color} > ** {color:#1d1c1d}*Candidate:* 10973978+ hits hits in *484.544594 > msec*{color} > * *{color:#1d1c1d}T0 cab_color:y cab_color:g sort=null{color}* > ** {color:#1d1c1d}*Baseline:* 2300174+ hits hits in *95.698324 msec*{color} > ** {color:#1d1c1d}*Candidate:* 2300174+ hits hits in *78.336193 msec*{color} > * {color:#1d1c1d}*T1 cab_color:y cab_color:g sort=null*{color} > ** {color:#1d1c1d}*Baseline:* 2300174+ hits hits in *391.565239 msec*{color} > ** {color:#1d1c1d}*Candidate:* 300174+ hits hits in *227.592885 > msec*{color}{*}{*} > * {color:#1d1c1d}*...*{color}
[jira] [Commented] (LUCENE-10624) Binary Search for Sparse IndexedDISI advanceWithinBlock & advanceExactWithinBlock
[ https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556673#comment-17556673 ] Adrien Grand commented on LUCENE-10624: --- I find these speedups surprising since I was not expecting these queries to leverage doc values. The one query where I would expect a speedup is the term query sorted by field: http://people.apache.org/~mikemccand/lucenebench/sparseResults.html#search_sort_qps. Regarding the implementation, in the past we observed better performance for this sort of things with exponential search than with binary search, since exponential search would better optimize for the case when callers repeatedly call advance() on small increments.
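To make the exponential-search suggestion in Adrien's comment concrete, here is a minimal sketch under the same assumptions as a plain sorted `int[]` of doc IDs; it is not code from Lucene, and the class and method names are hypothetical. The search gallops forward from the current position in doubling steps until it overshoots the target, then binary-searches only the last window. When the target is a short distance d ahead, this costs O(log d) comparisons rather than O(log N) over the whole block.

```java
// Hypothetical illustration of exponential (galloping) search; not Lucene code.
final class ExponentialAdvance {

  /** First index i >= from with docs[i] >= target, or docs.length if none; docs must be ascending. */
  static int advance(int[] docs, int from, int target) {
    int n = docs.length;
    if (from >= n) {
      return n;
    }
    if (docs[from] >= target) {
      return from; // already positioned on a matching doc
    }
    // Invariant: docs[lo] < target.
    int lo = from;
    int step = 1;
    while (lo + step < n && docs[lo + step] < target) {
      lo += step;
      step <<= 1; // double the stride after each successful probe
    }
    // Either hi == n, or docs[hi] >= target; the answer lies in (lo, hi].
    int hi = Math.min(lo + step, n);
    lo = lo + 1;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (docs[mid] < target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  public static void main(String[] args) {
    int[] docs = {3, 8, 15, 42, 97, 256, 1024};
    int pos = 0;
    // Repeated small advances, the access pattern the comment describes.
    for (int target : new int[] {4, 9, 40, 100}) {
      pos = advance(docs, pos, target);
      System.out.println("target " + target + " -> index " + pos);
    }
  }
}
```

Which variant is faster in practice depends on how far callers typically advance, so it would have to be measured on the benchmark rather than assumed.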