[GitHub] [lucene] jpountz commented on pull request #1006: LUCENE-10480: Move scoring from advance to TwoPhaseIterator#matches to improve disjunction within conjunction
jpountz commented on PR #1006: URL: https://github.com/apache/lucene/pull/1006#issuecomment-1177173038 Ah, that makes sense to me now! Thanks for explaining.
[GitHub] [lucene] zacharymorn commented on pull request #1006: LUCENE-10480: Move scoring from advance to TwoPhaseIterator#matches to improve disjunction within conjunction
zacharymorn commented on PR #1006: URL: https://github.com/apache/lucene/pull/1006#issuecomment-1177230791 > Ah, that makes sense to me now! Thanks for explaining. No problem!
[jira] [Created] (LUCENE-10645) Wrong autocomplete suggestion
Emiliyan Sinigerov created LUCENE-10645: --- Summary: Wrong autocomplete suggestion Key: LUCENE-10645 URL: https://issues.apache.org/jira/browse/LUCENE-10645 Project: Lucene - Core Issue Type: Bug Reporter: Emiliyan Sinigerov

I have a problem with an autocomplete suggestion (I use your test to show where the bug is: https://github.com/apache/lucene/blob/698f40ad51af0c42b0a4a8321ab89968e8d0860b/lucene/suggest/src/test/org/apache/lucene/search/suggest/analyzing/TestAnalyzingInfixSuggester.java). This is your test, and everything works fine:

{code:java}
public void testBothExactAndPrefix() throws Exception {
  Analyzer a = new MockAnalyzer(random(), MockTokenizer.WHITESPACE, false);
  AnalyzingInfixSuggester suggester = new AnalyzingInfixSuggester(newDirectory(), a, a, 3, false);
  suggester.build(new InputArrayIterator(new Input[0]));
  suggester.add(new BytesRef("the pen is pretty"), null, 10, new BytesRef("foobaz"));
  suggester.refresh();
  List results = suggester.lookup(TestUtil.stringToCharSequence("pen p", random()), 10, true, true);
  assertEquals(1, results.size());
  assertEquals("the pen is pretty", results.get(0).key);
  assertEquals("the pen is pretty", results.get(0).highlightKey);
  assertEquals(10, results.get(0).value);
  assertEquals(new BytesRef("foobaz"), results.get(0).payload);
  suggester.close();
  a.close();
}
{code}

But if I add the line {*}suggester.add(new BytesRef("the pen is fretty"), null, 10, new BytesRef("foobaz")){*} to the test, it fails:

{code:java}
public void testBothExactAndPrefix() throws Exception {
  Analyzer a = new MockAnalyzer(random(), MockTokenizer.WHITESPACE, false);
  AnalyzingInfixSuggester suggester = new AnalyzingInfixSuggester(newDirectory(), a, a, 3, false);
  suggester.build(new InputArrayIterator(new Input[0]));
  suggester.add(new BytesRef("the pen is pretty"), null, 10, new BytesRef("foobaz"));
  suggester.add(new BytesRef("the pen is fretty"), null, 10, new BytesRef("foobaz"));
  suggester.refresh();
  List results = suggester.lookup(TestUtil.stringToCharSequence("pen p", random()), 10, true, true);
  assertEquals(1, results.size());
  assertEquals("the pen is pretty", results.get(0).key);
  assertEquals("the pen is pretty", results.get(0).highlightKey);
  assertEquals(10, results.get(0).value);
  assertEquals(new BytesRef("foobaz"), results.get(0).payload);
  suggester.close();
  a.close();
}
{code}

We want to find everything that contains "pen p", and there is only one match, "the pen is pretty", but the results contain two matches: "the pen is pretty" and "the pen is fretty". I think that when we search for a word ("pen" in this case) followed by a one-letter prefix that matches the first letter of that word ("p" in this case), the suggester first matches the word "pen" and then matches "p" inside "pen" again, which is incorrect. We want "p" to match in a word other than "pen". Thank you, Emiliyan.
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563627#comment-17563627 ] ASF subversion and git services commented on LUCENE-10480: -- Commit da8143bfa38cd5fadae4b4712b9e639e79016021 in lucene's branch refs/heads/main from zacharymorn [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=da8143bfa38 ] LUCENE-10480: Move scoring from advance to TwoPhaseIterator#matches to improve disjunction within conjunction (#1006) > Specialize 2-clauses disjunctions > - > > Key: LUCENE-10480 > URL: https://issues.apache.org/jira/browse/LUCENE-10480 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Time Spent: 7h > Remaining Estimate: 0h > > WANDScorer is nice, but it also has lots of overhead to maintain its > invariants: one linked list for the current candidates, one priority queue of > scorers that are behind, another one for scorers that are ahead. All this > could be simplified in the 2-clauses case, which feels worth specializing for > as it's very common that end users enter queries that only have two terms?
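For readers unfamiliar with the idea being discussed: below is a toy sketch of a specialized two-clause disjunction, under the assumption that each clause exposes sorted doc IDs with parallel per-doc scores. This is not Lucene's WANDScorer or the code from PR #1006; all names are hypothetical.

```java
/**
 * Toy sketch of a specialized two-clause disjunction, illustrating the idea
 * in the issue above: with exactly two clauses, WANDScorer's linked list and
 * two priority queues collapse into two cursors. Hypothetical names; this is
 * not Lucene's implementation.
 */
class TwoClauseDisjunctionSketch {
  private final int[] docsA, docsB;       // sorted doc IDs per clause
  private final float[] scoresA, scoresB; // parallel per-doc scores
  private final float maxScoreA, maxScoreB;
  private int posA = 0, posB = 0;

  TwoClauseDisjunctionSketch(int[] docsA, float[] scoresA, int[] docsB, float[] scoresB) {
    this.docsA = docsA;
    this.scoresA = scoresA;
    this.docsB = docsB;
    this.scoresB = scoresB;
    this.maxScoreA = max(scoresA);
    this.maxScoreB = max(scoresB);
  }

  private static float max(float[] a) {
    float m = 0;
    for (float f : a) m = Math.max(m, f);
    return m;
  }

  /** Next doc at or after target whose score can compete, or -1 when exhausted. */
  int nextCompetitive(int target, float minCompetitiveScore) {
    if (maxScoreA + maxScoreB < minCompetitiveScore) {
      return -1; // even a doc matching both clauses cannot compete
    }
    while (true) {
      while (posA < docsA.length && docsA[posA] < target) posA++;
      while (posB < docsB.length && docsB[posB] < target) posB++;
      int dA = posA < docsA.length ? docsA[posA] : Integer.MAX_VALUE;
      int dB = posB < docsB.length ? docsB[posB] : Integer.MAX_VALUE;
      int doc = Math.min(dA, dB);
      if (doc == Integer.MAX_VALUE) return -1;
      float score = (doc == dA ? scoresA[posA] : 0f) + (doc == dB ? scoresB[posB] : 0f);
      if (score >= minCompetitiveScore) return doc;
      target = doc + 1; // not competitive; keep scanning
    }
  }
}
```

A real implementation would additionally use the per-clause max scores to skip ahead in the weaker clause (the WAND idea); the sketch only shows how little bookkeeping two clauses require compared with the general N-clause case.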
[GitHub] [lucene] zacharymorn opened a new pull request, #1008: LUCENE-10480: (Backporting) Move scoring from advance to TwoPhaseIterator#matches to improve disjunction within conjunction (#1006)
zacharymorn opened a new pull request, #1008: URL: https://github.com/apache/lucene/pull/1008 This PR backports https://github.com/apache/lucene/pull/1006 into `branch_9x`
[GitHub] [lucene] zacharymorn merged pull request #1006: LUCENE-10480: Move scoring from advance to TwoPhaseIterator#matches to improve disjunction within conjunction
zacharymorn merged PR #1006: URL: https://github.com/apache/lucene/pull/1006
[GitHub] [lucene-jira-archive] mocobeta commented on issue #8: Set assignee field for issues if the account mapping is given
mocobeta commented on issue #8: URL: https://github.com/apache/lucene-jira-archive/issues/8#issuecomment-1177419371 Thank you @dweiss for noticing this. I invited you to a test repository. I think an email has been sent.
[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support
[ https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563713#comment-17563713 ] Nayana Thorat commented on LUCENE-10643: [~uschindler] Yes, Oracle does not offer JDK 19 for s390x yet; however, Eclipse Adoptium has a nightly (beta) release of Java 19. I have installed it on the s390x nodes in /home/jenkins/tools/java/adoptjdk19. Version installed: $ /home/jenkins/tools/java/adoptjdk19/bin/java --version openjdk 19-beta 2022-09-20 OpenJDK Runtime Environment Temurin-19+29-202207070331 (build 19-beta+29-202207070331) OpenJDK 64-Bit Server VM Temurin-19+29-202207070331 (build 19-beta+29-202207070331, mixed mode, sharing) > Lucene Jenkins CI - s390x support > -- > > Key: LUCENE-10643 > URL: https://issues.apache.org/jira/browse/LUCENE-10643 > Project: Lucene - Core > Issue Type: Wish >Reporter: Nayana Thorat >Assignee: Uwe Schindler >Priority: Major > Labels: jenkins > > This issue adds Lucene builds on ASF Jenkins with S390x architecture (big > endian).
[GitHub] [lucene-jira-archive] mikemccand commented on issue #8: Set assignee field for issues if the account mapping is given
mikemccand commented on issue #8: URL: https://github.com/apache/lucene-jira-archive/issues/8#issuecomment-1177455363 > @mikemccand I invited you to a test repository to test if we can set (migrate) issues' `Assignee` field. An email should have been sent - can you please accept it? > > I tested it with my account (API's caller and issue author), just wanted to confirm it also works for other accounts. Thanks @mocobeta! I just accepted the invitation.
[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support
[ https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563719#comment-17563719 ] Uwe Schindler commented on LUCENE-10643: Great, thanks; I will set up a job for that. It looks like it is recent enough.
[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support
[ https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563727#comment-17563727 ] Nayana Thorat commented on LUCENE-10643: [~uschindler] [https://ci-builds.apache.org/job/Lucene/job/Lucene-Check-main%20(s390x%20big%20endian)/2/console] The build is successful; however, I see the exception below when archiving artifacts. Does any configuration need to be done? Archiving artifacts hudson.FilePath$ValidateAntFileMask$1Cancel at hudson.FilePath$ValidateAntFileMask$1.isCaseSensitive(FilePath.java:3209)
[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support
[ https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563729#comment-17563729 ] Uwe Schindler commented on LUCENE-10643: [~Nayana]: This is not a problem. It appears on all builds and has to do with a bug in Jenkins. It cannot be prevented, sorry. As long as builds succeed, all is fine. [~dweiss] has some hints about the bug.
[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support
[ https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563737#comment-17563737 ] Nayana Thorat commented on LUCENE-10643: [~uschindler] Oh, OK. Thank you for the clarification.
[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support
[ https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563742#comment-17563742 ] Uwe Schindler commented on LUCENE-10643: See this: https://www.mail-archive.com/dev@lucene.apache.org/msg314005.html
[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support
[ https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563743#comment-17563743 ] Nayana Thorat commented on LUCENE-10643: One more thing I want to ask: how frequently will these jobs execute? (On any pull-request check, on merge, etc.?)
[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support
[ https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563763#comment-17563763 ] Uwe Schindler commented on LUCENE-10643: It is configured to be {{@daily}}. Normal Lucene builds run {{@hourly}} on our special "lucene"-tagged nodes, so that we do not occupy nodes used by other projects by constantly running builds. The reason for this is how Lucene's tests work: they check with random data, so whenever you see a failure, it is something new (often JVM bugs): - https://www.youtube.com/watch?v=-uVE_w8flIU - https://2019.berlinbuzzwords.de/sites/2019.berlinbuzzwords.de/files/media/documents/dawidweiss-randomizedtesting-pub.pdf - https://www.youtube.com/watch?v=PVRdLyQGUxE - https://2013.berlinbuzzwords.de/sites/2013.berlinbuzzwords.de/files/slides/Schindler-BugsBugsBugs.pdf
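A minimal sketch of the seed-based randomized testing the talks above describe. Lucene's real framework is the randomizedtesting library, which derives and reports a master seed per run; this toy version (plain Java, hypothetical property under test) only illustrates the principle that a printed seed makes a random failure reproducible.

```java
import java.util.Random;

/**
 * Toy illustration of seed-based randomized testing: random inputs each run,
 * but a seed printed on failure so the exact inputs can be replayed.
 * Not Lucene's test framework; the encode/decode property is made up.
 */
public class RandomizedRoundTripCheck {
  public static void main(String[] args) {
    long seed = args.length > 0 ? Long.parseLong(args[0]) : System.nanoTime();
    Random random = new Random(seed);
    for (int iter = 0; iter < 1000; iter++) {
      int value = random.nextInt();
      // Property under test: encoding then decoding must round-trip.
      if (decode(encode(value)) != value) {
        // Printing the seed makes the random failure reproducible.
        throw new AssertionError(
            "round-trip failed for " + value + "; reproduce with seed=" + seed);
      }
    }
  }

  static long encode(int v) { return ((long) v) ^ 0x5DEECE66DL; }

  static int decode(long v) { return (int) (v ^ 0x5DEECE66DL); }
}
```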
[GitHub] [lucene-jira-archive] dweiss commented on issue #8: Set assignee field for issues if the account mapping is given
dweiss commented on issue #8: URL: https://github.com/apache/lucene-jira-archive/issues/8#issuecomment-1177587814 Accepted the invitation just now.
[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support
[ https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563795#comment-17563795 ] Dawid Weiss commented on LUCENE-10643: -- The timeout is caused by a hard limit in Jenkins that should be configurable via system properties - [https://www.jenkins.io/doc/book/managing/system-properties/#hudson-filepath-validate_ant_file_mask_bound] - but we never got around to figuring out how this can be done.
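For reference, Jenkins system properties of this kind are normally passed on the controller's JVM command line; a sketch, assuming the property name suggested by the anchor of the page linked above and an illustrative value:

```bash
# Hypothetical example: raise the Ant file-mask validation bound when
# starting the Jenkins controller (property name taken from the linked docs).
java -Dhudson.FilePath.VALIDATE_ANT_FILE_MASK_BOUND=30000 -jar jenkins.war
```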
[jira] [Commented] (LUCENE-10627) Using CompositeByteBuf to Reduce Memory Copy
[ https://issues.apache.org/jira/browse/LUCENE-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563805#comment-17563805 ] Robert Muir commented on LUCENE-10627: -- Yes, we have to stop another PagedBytes/ByteBlockPool from entering our codebase. To me it doesn't matter if the performance improvement is 1000%. > Using CompositeByteBuf to Reduce Memory Copy > > > Key: LUCENE-10627 > URL: https://issues.apache.org/jira/browse/LUCENE-10627 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs, core/store >Reporter: LuYunCheng >Priority: Major > > Code: [https://github.com/apache/lucene/pull/987] > I see that when Lucene flushes and merges stored fields, it needs many memory copies: > {code:java} > Lucene Merge Thread #25940]" #906546 daemon prio=5 os_prio=0 cpu=20503.95ms > elapsed=68.76s tid=0x7ee990002c50 nid=0x3aac54 runnable > [0x7f17718db000] > java.lang.Thread.State: RUNNABLE > at > org.apache.lucene.store.ByteBuffersDataOutput.toArrayCopy(ByteBuffersDataOutput.java:271) > at > org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.flush(CompressingStoredFieldsWriter.java:239) > at > org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.finishDocument(CompressingStoredFieldsWriter.java:169) > at > org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:654) > at > org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:228) > at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105) > at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4760) > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4364) > at > org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5923) > at > org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624) > at > org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:100) > at > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:682) > {code} > When Lucene's *CompressingStoredFieldsWriter* flushes documents, it needs many > memory copies: > With Lucene90 using {*}LZ4WithPresetDictCompressionMode{*}: > # bufferedDocs.toArrayCopy copies blocks into one continuous buffer for chunk > compression > # the compressor copies dict and data into one block buffer > # does the compression > # copies the compressed data out > With Lucene90 using {*}DeflateWithPresetDictCompressionMode{*}: > # bufferedDocs.toArrayCopy copies blocks into one continuous buffer for chunk > compression > # does the compression > # copies the compressed data out > > I think we can use CompositeByteBuf to reduce temporary memory copies: > # we do not have to *bufferedDocs.toArrayCopy* when we just need continuous > content for chunk compression > > I wrote a simple mini benchmark in test code ([link > |https://github.com/apache/lucene/blob/5a406a5c483c7fadaf0e8a5f06732c79ad174d11/lucene/core/src/test/org/apache/lucene/codecs/lucene90/compressing/TestCompressingStoredFieldsFormat.java#L353]): > *LZ4WithPresetDict run* Capacity:41943040(bytes), iter 10 times: Origin > elapse:5391ms, New elapse:5297ms > *DeflateWithPresetDict run* Capacity:41943040(bytes), iter 10 times: Origin > elapse:{*}115ms{*}, New elapse:{*}12ms{*} > > And I ran runStoredFieldsBenchmark with doc_limit=-1, which shows: > ||Msec to index||BEST_SPEED ||BEST_COMPRESSION|| > |Baseline|318877.00|606288.00| > |Candidate|314442.00|604719.00|
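To illustrate the CompositeByteBuf idea from the quoted description, here is a minimal sketch assuming Netty's netty-buffer library on the classpath; this is not the code from the linked PR:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;

/**
 * Minimal sketch of the zero-copy idea discussed above, using Netty's
 * CompositeByteBuf: several independently filled buffers are exposed as one
 * logically continuous buffer without copying them into a single array first.
 * Illustrates the concept only; not the Lucene PR's code.
 */
public class CompositeBufferSketch {
  public static void main(String[] args) {
    ByteBuf block1 = Unpooled.wrappedBuffer(new byte[] {1, 2, 3, 4});
    ByteBuf block2 = Unpooled.wrappedBuffer(new byte[] {5, 6, 7, 8});

    // addComponents(true, ...) advances the writer index so the composite
    // immediately exposes both blocks as readable bytes.
    CompositeByteBuf composite = Unpooled.compositeBuffer();
    composite.addComponents(true, block1, block2);

    // One continuous view over both blocks, with no intermediate array copy
    // (compare with ByteBuffersDataOutput.toArrayCopy in the stack trace above).
    byte[] out = new byte[composite.readableBytes()];
    composite.readBytes(out); // a copy happens only here, at the final sink
    System.out.println(java.util.Arrays.toString(out));
  }
}
```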
[GitHub] [lucene-jira-archive] mocobeta commented on issue #8: Set assignee field for issues if the account mapping is given
mocobeta commented on issue #8: URL: https://github.com/apache/lucene-jira-archive/issues/8#issuecomment-1177747831 Thank you both, confirmed that the assignee can be ported. [screenshot: issue search result] [screenshot: issue detail]
[GitHub] [lucene-jira-archive] mikemccand commented on issue #8: Set assignee field for issues if the account mapping is given
mikemccand commented on issue #8: URL: https://github.com/apache/lucene-jira-archive/issues/8#issuecomment-1177759497 Woot!
[GitHub] [lucene-jira-archive] mocobeta merged pull request #18: Check if the assignee account can be assigned on the repo
mocobeta merged PR #18: URL: https://github.com/apache/lucene-jira-archive/pull/18
[jira] [Commented] (LUCENE-10619) Optimize the writeBytes in TermsHashPerField
[ https://issues.apache.org/jira/browse/LUCENE-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563829#comment-17563829 ] tangdh commented on LUCENE-10619: - [~jpountz], can this PR be merged? > Optimize the writeBytes in TermsHashPerField > > > Key: LUCENE-10619 > URL: https://issues.apache.org/jira/browse/LUCENE-10619 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Affects Versions: 9.2 >Reporter: tangdh >Priority: Major > Time Spent: 1h > Remaining Estimate: 0h > > Because we don't know the length of the slice, writeBytes always writes bytes > one at a time instead of writing a block of bytes. > Maybe we could return both offset and length in ByteBlockPool#allocSlice? > 1. BYTE_BLOCK_SIZE is 32768, so the offset fits in at most 15 bits. > 2. The slice size is at most 200, so it fits in 8 bits. > So we could pack them together into an int as offset | length. > There are only two places where this function is used, so the cost of changing > it is relatively small. > If allocSlice returned the offset and length of the new slice, we could > change writeBytes like below: > {code:java} > // write a block of bytes each time > while (remaining > 0) { >int offsetAndLength = allocSlice(bytes, offset); >length = min(remaining, (offsetAndLength & 0xff) - 1); >offset = offsetAndLength >> 8; >System.arraycopy(src, srcPos, bytePool.buffer, offset, length); >remaining -= length; >offset += (length + 1); > } > {code} > If it could work, I'd like to raise a PR.
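A small self-contained sketch of the offset/length packing proposed above: the offset goes in the high bits (BYTE_BLOCK_SIZE = 32768 fits in 15 bits) and the length in the low 8 bits (slice sizes are at most 200). Helper names are hypothetical, not part of the actual patch:

```java
/**
 * Sketch of packing a slice's offset and length into one int, as proposed
 * in the issue above: offset in the high bits, length in the low 8 bits.
 */
public class SlicePackingSketch {
  static int pack(int offset, int length) {
    assert offset >= 0 && offset < 32768; // BYTE_BLOCK_SIZE, fits in 15 bits
    assert length > 0 && length <= 200;   // max slice size, fits in 8 bits
    return (offset << 8) | length;
  }

  static int unpackOffset(int packed) { return packed >>> 8; }

  static int unpackLength(int packed) { return packed & 0xFF; }

  public static void main(String[] args) {
    int packed = pack(12345, 200);
    System.out.println(unpackOffset(packed)); // 12345
    System.out.println(unpackLength(packed)); // 200
  }
}
```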
[jira] [Created] (LUCENE-10646) Add some comment on LevenshteinAutomata
tangdh created LUCENE-10646: --- Summary: Add some comment on LevenshteinAutomata Key: LUCENE-10646 URL: https://issues.apache.org/jira/browse/LUCENE-10646 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Affects Versions: 9.2 Reporter: tangdh After having a hard time reading the code, I think I have now understood the relevant code of LevenshteinAutomata, except for the minErrors part. I think this part of the code is too difficult to understand and is full of magic numbers. I will sort it out and then raise a PR to add some necessary comments to this part of the code, so that others can understand it better.
[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support
[ https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563836#comment-17563836 ] Uwe Schindler commented on LUCENE-10643: Hi [~Nayana], the Java 19 (OpenJDK Project Panama) run to support Lucene's MMapDirectory v2 (see PR https://github.com/apache/lucene/pull/912) was working fine on this big-endian platform. I will also report this to the OpenJDK community, as it is an important thing for them to know! It looks like all byte-swap instructions in Java's MemorySegment API are inserted at the correct places when reading/writing Lucene's little-endian file format. The MMap v2 job is here: https://ci-builds.apache.org/job/Lucene/job/Lucene-MMAPv2-Linux%20(s390x%20big%20endian)/
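To illustrate the byte-order point, here is a sketch using the long-standing ByteBuffer API rather than the Java 19 MemorySegment preview API, since the principle is identical: Lucene's file formats are little-endian, so a reader must request LITTLE_ENDIAN explicitly instead of relying on native order, and on big-endian s390x the JVM transparently inserts the needed byte swaps.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/**
 * Sketch: reading a little-endian encoded int works on any hardware as long
 * as the byte order is requested explicitly. The Java code is identical on
 * x86 and s390x; only the generated swap instructions differ.
 */
public class EndiannessSketch {
  public static void main(String[] args) {
    byte[] fileBytes = {0x78, 0x56, 0x34, 0x12}; // little-endian 0x12345678

    int le = ByteBuffer.wrap(fileBytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
    System.out.printf("explicit LE: 0x%08X%n", le); // 0x12345678 on all platforms

    // Relying on native order instead would give different answers on
    // little-endian x86 vs big-endian s390x:
    System.out.println("native order here: " + ByteOrder.nativeOrder());
  }
}
```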
[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support
[ https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563840#comment-17563840 ] Uwe Schindler commented on LUCENE-10643: bq. The timeout is caused by a hard limit in Jenkins that should be configurable via system properties I raised this setting on Policeman Jenkins to 30,000.
[GitHub] [lucene] gsmiller commented on pull request #1004: LUCENE-10603: Stop using SortedSetDocValues.NO_MORE_ORDS in tests
gsmiller commented on PR #1004: URL: https://github.com/apache/lucene/pull/1004#issuecomment-1177927537 Looks good. Thanks @stefanvodita!
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563876#comment-17563876 ] ASF subversion and git services commented on LUCENE-10603: -- Commit dd4e8b82d711b8f665e91f0d74f159ef1e63939f in lucene's branch refs/heads/main from Stefan Vodita [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=dd4e8b82d71 ] LUCENE-10603: Stop using SortedSetDocValues.NO_MORE_ORDS in tests (#1004) > Improve iteration of ords for SortedSetDocValues > > > Key: LUCENE-10603 > URL: https://issues.apache.org/jira/browse/LUCENE-10603 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Lu Xugang >Assignee: Lu Xugang >Priority: Trivial > Time Spent: 5h 20m > Remaining Estimate: 0h > > Now that SortedSetDocValues#docValueCount has been added in Lucene 9.2, should we > refactor ord iteration to use docValueCount instead of NO_MORE_ORDS, > similar to how SortedNumericDocValues does it? > From > {code:java} > for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord > = values.nextOrd()) { > }{code} > to > {code:java} > for (int i = 0; i < values.docValueCount(); i++) { > long ord = values.nextOrd(); > }{code}
[GitHub] [lucene] gsmiller merged pull request #1004: LUCENE-10603: Stop using SortedSetDocValues.NO_MORE_ORDS in tests
gsmiller merged PR #1004: URL: https://github.com/apache/lucene/pull/1004
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563901#comment-17563901 ] ASF subversion and git services commented on LUCENE-10603: -- Commit c46e1f03901ebaac9e010862acbb0cf460d807ef in lucene's branch refs/heads/branch_9x from Stefan Vodita [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=c46e1f03901 ] LUCENE-10603: Stop using SortedSetDocValues.NO_MORE_ORDS in tests (#1004)
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563912#comment-17563912 ] Greg Miller commented on LUCENE-10603: -- It looks like the only remaining work is to: # Remove the NO_MORE_ORDS definition # Update all the SortedSetDocValues implementations to stop returning NO_MORE_ORDS in nextOrd() # Remove all the test assertions that validate that SSDV#nextOrd() returns NO_MORE_ORDS This should all be main-branch work, and not something we backport to 9.x. I think 9.x is now good.
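For context, the consumer-side pattern after this cleanup looks like the following sketch, adapted from the snippets in the issue description above (SortedSetDocValues is the real Lucene class; the surrounding method is illustrative):

```java
import java.io.IOException;
import org.apache.lucene.index.SortedSetDocValues;

/**
 * Illustrative consumer of SortedSetDocValues ords using docValueCount()
 * instead of the NO_MORE_ORDS sentinel, per the issue description above.
 */
public class OrdIterationSketch {
  static void consumeOrdsForCurrentDoc(SortedSetDocValues values) throws IOException {
    // The per-document ord count bounds the loop, so nextOrd() never needs
    // to return a sentinel value.
    int count = values.docValueCount();
    for (int i = 0; i < count; i++) {
      long ord = values.nextOrd();
      // ... use ord ...
    }
  }
}
```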
[GitHub] [lucene] jtibshirani commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
jtibshirani commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r916054211 ## lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldKnnVectorsFormat.java: ## @@ -102,9 +104,22 @@ private class FieldsWriter extends KnnVectorsWriter { } @Override -public void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader) +public KnnFieldVectorsWriter addField(FieldInfo fieldInfo) throws IOException { + KnnVectorsWriter writer = getInstance(fieldInfo); + return writer.addField(fieldInfo); +} + +@Override +public void flush(int maxDoc, Sorter.DocMap sortMap) throws IOException { + for (WriterAndSuffix was : formats.values()) { +was.writer.flush(maxDoc, sortMap); + } +} + +@Override +public void mergeOneField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader) throws IOException { - getInstance(fieldInfo).writeField(fieldInfo, knnVectorsReader); + getInstance(fieldInfo).mergeOneField(fieldInfo, knnVectorsReader); Review Comment: Small comment, maybe we can throw an `UnsupportedOperationException` here because we expect it never to be called? ## lucene/core/src/java/org/apache/lucene/codecs/lucene93/Lucene93HnswVectorsWriter.java: ## @@ -266,65 +470,128 @@ private void writeMeta( } } - private OnHeapHnswGraph writeGraph( - RandomAccessVectorValuesProducer vectorValues, VectorSimilarityFunction similarityFunction) + /** + * Writes the vector values to the output and returns a set of documents that contains vectors. + */ + private static DocsWithFieldSet writeVectorData(IndexOutput output, VectorValues vectors) throws IOException { +DocsWithFieldSet docsWithField = new DocsWithFieldSet(); +for (int docV = vectors.nextDoc(); docV != NO_MORE_DOCS; docV = vectors.nextDoc()) { + // write vector + BytesRef binaryValue = vectors.binaryValue(); + assert binaryValue.length == vectors.dimension() * Float.BYTES; + output.writeBytes(binaryValue.bytes, binaryValue.offset, binaryValue.length); + docsWithField.add(docV); +} +return docsWithField; + } -// build graph -HnswGraphBuilder hnswGraphBuilder = -new HnswGraphBuilder( -vectorValues, similarityFunction, M, beamWidth, HnswGraphBuilder.randSeed); -hnswGraphBuilder.setInfoStream(segmentWriteState.infoStream); -OnHeapHnswGraph graph = hnswGraphBuilder.build(vectorValues.randomAccess()); + @Override + public void close() throws IOException { +IOUtils.close(meta, vectorData, vectorIndex); + } -// write vectors' neighbours on each level into the vectorIndex file -int countOnLevel0 = graph.size(); -for (int level = 0; level < graph.numLevels(); level++) { - int maxConnOnLevel = level == 0 ? (M * 2) : M; - NodesIterator nodesOnLevel = graph.getNodesOnLevel(level); - while (nodesOnLevel.hasNext()) { -int node = nodesOnLevel.nextInt(); -NeighborArray neighbors = graph.getNeighbors(level, node); -int size = neighbors.size(); -vectorIndex.writeInt(size); -// Destructively modify; it's ok we are discarding it after this -int[] nnodes = neighbors.node(); -Arrays.sort(nnodes, 0, size); -for (int i = 0; i < size; i++) { - int nnode = nnodes[i]; - assert nnode < countOnLevel0 : "node too large: " + nnode + ">=" + countOnLevel0; - vectorIndex.writeInt(nnode); -} -// if number of connections < maxConn, add bogus values up to maxConn to have predictable -// offsets -for (int i = size; i < maxConnOnLevel; i++) { - vectorIndex.writeInt(0); -} + private static class FieldData extends KnnFieldVectorsWriter { Review Comment: Small comment, we could rename this to `FieldWriter` now since that's its purpose. 
## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java: ## @@ -24,28 +24,40 @@ import org.apache.lucene.index.DocIDMerger; import org.apache.lucene.index.FieldInfo; import org.apache.lucene.index.MergeState; +import org.apache.lucene.index.Sorter; import org.apache.lucene.index.VectorValues; import org.apache.lucene.search.TopDocs; +import org.apache.lucene.util.Accountable; import org.apache.lucene.util.Bits; import org.apache.lucene.util.BytesRef; /** Writes vectors to an index. */ -public abstract class KnnVectorsWriter implements Closeable { +public abstract class KnnVectorsWriter implements Accountable, Closeable { /** Sole constructor */ protected KnnVectorsWriter() {} - /** Write all values contained in the provided reader */ - public abstract void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader) + /** Add new field for indexing */ + public abstract void addField(FieldInfo fieldInfo) throws IOException; + + /** Add new docID with
[GitHub] [lucene] mayya-sharipova commented on pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on PR #992: URL: https://github.com/apache/lucene/pull/992#issuecomment-1178060346 @jtibshirani Thanks for another set of comments, I will work on addressing them. Meanwhile, I have run another set of benchmarks on a different dataset (sift-128-euclidean, M: 16, efConstruction: 100), and similar results were observed: - the whole indexing + flush takes approximately the same time (533 sec in baseline vs 538 sec in candidate) - baseline: indexing is fast, but flush takes 532 sec - candidate: indexing takes most of the time, and flush is very fast - 1.8 sec ### Baseline (main branch): ```bash
IW 0 [2022-07-07T18:27:08.982483Z; main]: MMapDirectory.UNMAP_SUPPORTED=true
Done indexing 100 documents; now flush
IW 0 [2022-07-07T18:27:09.935570Z; main]: now flush at close
IW 0 [2022-07-07T18:27:09.936155Z; main]: start flush: applyAllDeletes=true
IW 0 [2022-07-07T18:27:09.936850Z; main]: index before flush
DW 0 [2022-07-07T18:27:09.936917Z; main]: startFullFlush
DW 0 [2022-07-07T18:27:09.941606Z; main]: anyChanges? numDocsInRam=100 deletes=false hasTickets:false pendingChangesInFullFlush: false
DWPT 0 [2022-07-07T18:27:09.951278Z; main]: flush postings as segment _1 numDocs=100
IW 0 [2022-07-07T18:27:09.952530Z; main]: 0 msec to write norms
IW 0 [2022-07-07T18:27:09.952902Z; main]: 0 msec to write docValues
IW 0 [2022-07-07T18:27:09.953073Z; main]: 0 msec to write points
HNSW 0 [2022-07-07T18:27:11.094024Z; main]: build graph from 100 vectors
HNSW 0 [2022-07-07T18:35:55.150931Z; main]: built 99 in 6450/524148 ms
IW 0 [2022-07-07T18:36:01.320864Z; main]: 531459 msec to write vectors
IW 0 [2022-07-07T18:36:01.336914Z; main]: 15 msec to finish stored fields
IW 0 [2022-07-07T18:36:01.337204Z; main]: 0 msec to write postings and finish vectors
IW 0 [2022-07-07T18:36:01.337924Z; main]: 0 msec to write fieldInfos
DWPT 0 [2022-07-07T18:36:02.197589Z; main]: flush time 532338.523458 msec
Indexed 100 documents in 533s
``` ### Candidate (this PR with the changes so far): ```bash
IW 0 [2022-07-07T17:44:01.642762Z; main]: MMapDirectory.UNMAP_SUPPORTED=true
Done indexing 100 documents; now flush
IW 0 [2022-07-07T17:52:58.049830Z; main]: now flush at close
IW 0 [2022-07-07T17:52:58.050277Z; main]: start flush: applyAllDeletes=true
IW 0 [2022-07-07T17:52:58.050726Z; main]: index before flush
DW 0 [2022-07-07T17:52:58.050776Z; main]: startFullFlush
DW 0 [2022-07-07T17:52:58.056958Z; main]: anyChanges? numDocsInRam=100 deletes=false hasTickets:false pendingChangesInFullFlush: false
DWPT 0 [2022-07-07T17:52:58.066937Z; main]: flush postings as segment _0 numDocs=100
IW 0 [2022-07-07T17:52:58.068554Z; main]: 0 msec to write norms
IW 0 [2022-07-07T17:52:58.068864Z; main]: 0 msec to write docValues
IW 0 [2022-07-07T17:52:58.068958Z; main]: 0 msec to write points
IW 0 [2022-07-07T17:52:59.017719Z; main]: 947 msec to write vectors
IW 0 [2022-07-07T17:52:59.038544Z; main]: 19 msec to finish stored fields
IW 0 [2022-07-07T17:52:59.039281Z; main]: 0 msec to write postings and finish vectors
IW 0 [2022-07-07T17:52:59.043069Z; main]: 3 msec to write fieldInfos
DWPT 0 [2022-07-07T17:52:59.915562Z; main]: flush time 1848.19675 msec
Indexed 100 documents in 538s
```
[jira] [Commented] (LUCENE-10194) Should IndexWriter buffer KNN vectors on disk?
[ https://issues.apache.org/jira/browse/LUCENE-10194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563919#comment-17563919 ] Julie Tibshirani commented on LUCENE-10194: --- [~mayya] [~jpountz] can we close this since we've decided to go ahead with https://issues.apache.org/jira/browse/LUCENE-10592 ? > Should IndexWriter buffer KNN vectors on disk? > -- > > Key: LUCENE-10194 > URL: https://issues.apache.org/jira/browse/LUCENE-10194 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Assignee: Mayya Sharipova >Priority: Minor > Time Spent: 4h 10m > Remaining Estimate: 0h > > VectorValuesWriter buffers data in memory, like we do for all data structures > that are computed on flush. But I wonder if this is the right trade-off. > The use-case I have in mind is someone trying to load a dataset of vectors in > Lucene. Given that HNSW graphs are super expensive to create, we'd ideally > load that dataset into a single segment rather than many small segments that > then need to be merged together, which in-turn re-creates the HNSW graph. > Yet buffering vectors in memory is expensive. For instance assuming 256 > dimensions, each vector consumes 1kB of memory. Should we consider buffering > vectors on disk to reduce chances of having to create new segments only > because the RAM buffer is full?
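The 1kB figure in the quoted description is straightforward arithmetic (a float is 4 bytes); a throwaway check, with the one-million-document count being an illustrative assumption:

```java
/** Back-of-envelope check of the memory figure quoted above. */
public class VectorMemorySketch {
  public static void main(String[] args) {
    int dims = 256;
    long bytesPerVector = (long) dims * Float.BYTES; // 256 * 4 = 1024 bytes ≈ 1 kB
    long docs = 1_000_000; // illustrative dataset size
    System.out.println(bytesPerVector + " bytes/vector; "
        + (docs * bytesPerVector / (1024 * 1024)) + " MB to buffer 1M vectors in RAM");
  }
}
```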
[GitHub] [lucene] mayya-sharipova closed pull request #728: LUCENE-10194 Buffer KNN vectors on disk
mayya-sharipova closed pull request #728: LUCENE-10194 Buffer KNN vectors on disk URL: https://github.com/apache/lucene/pull/728
[GitHub] [lucene] mayya-sharipova commented on pull request #728: LUCENE-10194 Buffer KNN vectors on disk
mayya-sharipova commented on PR #728: URL: https://github.com/apache/lucene/pull/728#issuecomment-1178089438 Closing this PR in favour of the [alternative](https://github.com/apache/lucene/pull/992)
[jira] [Resolved] (LUCENE-10194) Should IndexWriter buffer KNN vectors on disk?
[ https://issues.apache.org/jira/browse/LUCENE-10194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova resolved LUCENE-10194. -- Resolution: Won't Fix
[jira] [Commented] (LUCENE-10194) Should IndexWriter buffer KNN vectors on disk?
[ https://issues.apache.org/jira/browse/LUCENE-10194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563924#comment-17563924 ] Mayya Sharipova commented on LUCENE-10194: -- +1 for closing. I've closed the corresponding PR as well.
[jira] [Closed] (LUCENE-10194) Should IndexWriter buffer KNN vectors on disk?
[ https://issues.apache.org/jira/browse/LUCENE-10194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova closed LUCENE-10194.
[GitHub] [lucene] gsmiller commented on a diff in pull request #974: LUCENE-10614: Properly support getTopChildren in RangeFacetCounts
gsmiller commented on code in PR #974: URL: https://github.com/apache/lucene/pull/974#discussion_r916216255 ## lucene/demo/src/java/org/apache/lucene/demo/facet/RangeFacetsExample.java: ## @@ -73,6 +76,35 @@ public void index() throws IOException { indexWriter.addDocument(doc); } +// Add documents with a fake timestamp, 3600 sec (1 hour) after "now", 7200 sec (2 +// hours) after "now", ...: +long startTime = 0; +// Index error messages since a week (24 * 7 = 168 hours) ago +for (int i = 0; i < 168; i++) { + long endTime = startTime + (i + 1) * 3600; + + // Choose a relatively larger number, e,g., "35", in order to create variation in count for + // the top-n children, so that getTopChildren(10) in the searchTopChildren functionality + // can return children with different counts + for (int j = 0; j < i % 35; j++) { +Document doc = new Document(); +// index document at a different timestamp by using endTime - i * j Review Comment: Sorry, I'm sure what you're doing is really obvious to you, but it's just confusing to me. I find myself really stuck on things like `endTime - i * j`, or `i % 35` as a way to generate different numbers of log events within an hour block. What's wrong with just using `Random`? Would that just make it impossible to test? Sorry to be a pain with this, but if I were a user just trying to understand range faceting and I looked at this code, I'd be spending all my time just trying to figure out what we're trying to simulate here instead of understanding faceting. There has to be a simpler way right? As a suggestion, maybe we create a separate Jira issue to add a top-n range faceting example and revert out this work for now? That would let us get the actual change merged in the meantime. ## lucene/demo/src/java/org/apache/lucene/demo/facet/RangeFacetsExample.java: ## @@ -73,6 +76,35 @@ public void index() throws IOException { indexWriter.addDocument(doc); } +// Add documents with a fake timestamp, 3600 sec (1 hour) after "now", 7200 sec (2 +// hours) after "now", ...: +long startTime = 0; +// Index error messages since a week (24 * 7 = 168 hours) ago +for (int i = 0; i < 168; i++) { + long endTime = startTime + (i + 1) * 3600; + + // Choose a relatively larger number, e,g., "35", in order to create variation in count for + // the top-n children, so that getTopChildren(10) in the searchTopChildren functionality + // can return children with different counts + for (int j = 0; j < i % 35; j++) { +Document doc = new Document(); +// index document at a different timestamp by using endTime - i * j +doc.add(new NumericDocValuesField("error log", endTime - i * j)); Review Comment: Maybe "error timestamp" would be a better name? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
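If it helps, a Random-based version of the demo loop might look like the sketch below. This is hypothetical code following the review's suggestion (including the "error timestamp" field name proposed in the second comment), not the committed example; it assumes the surrounding RangeFacetsExample context such as `indexWriter` and the usual imports:

```java
import java.util.Random;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericDocValuesField;

// Hypothetical simplification of the demo loop using Random, per the review.
Random random = new Random(42);      // fixed seed keeps the demo deterministic
long nowSec = 0;                     // the example's fake "now"
for (int hour = 0; hour < 24 * 7; hour++) {    // one week of hourly buckets
  int errorsThisHour = random.nextInt(35);     // 0..34 log events in this hour
  for (int j = 0; j < errorsThisHour; j++) {
    Document doc = new Document();
    // place each event at a random offset inside its hour bucket
    long timestamp = nowSec + hour * 3600L + random.nextInt(3600);
    doc.add(new NumericDocValuesField("error timestamp", timestamp));
    indexWriter.addDocument(doc);
  }
}
```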
[GitHub] [lucene] dnhatn opened a new pull request, #1009: LUCENE-10563: Fix CHANGES list
dnhatn opened a new pull request, #1009: URL: https://github.com/apache/lucene/pull/1009 The CHANGES of 10.0 were accidentally merged into 9x CHANGES in https://github.com/apache/lucene/commit/b7231bb54884f9ce0232430c4a60cdb5753c6b82. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dnhatn commented on pull request #1009: LUCENE-10563: Fix CHANGES list
dnhatn commented on PR #1009: URL: https://github.com/apache/lucene/pull/1009#issuecomment-1178274830 @gsmiller Thanks for the review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dnhatn merged pull request #1009: LUCENE-10563: Fix CHANGES list
dnhatn merged PR #1009: URL: https://github.com/apache/lucene/pull/1009 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10563) Unable to Tessellate polygon
[ https://issues.apache.org/jira/browse/LUCENE-10563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563996#comment-17563996 ] ASF subversion and git services commented on LUCENE-10563: -- Commit 8926732a32823be168267fe2ed39eb804d1030f1 in lucene's branch refs/heads/branch_9x from Nhat Nguyen [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8926732a328 ] LUCENE-10563: Fix CHANGES list (#1009) The CHANGES of 10.0 were accidentally merged into 9x CHANGES in b7231bb. > Unable to Tessellate polygon > > > Key: LUCENE-10563 > URL: https://issues.apache.org/jira/browse/LUCENE-10563 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 9.1 >Reporter: Yixun Xu >Assignee: Ignacio Vera >Priority: Major > Fix For: 9.3 > > Attachments: polygon-1.json, polygon-2.json, polygon-3.json > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Following up to LUCENE-10470, I found some more polygons that cause > {{Tessellator.tessellate}} to throw "Unable to Tessellate shape", which are > not covered by the fix to LUCENE-10470. I attached the geojson of 3 failing > shapes that I got, and this is the > [branch|https://github.com/apache/lucene/compare/main...yixunx:yx/reproduce-tessellator-error?expand=1#diff-5e8e8052af8b8618e7e4325b7d69def4d562a356acbfea3e983198327c7c8d18R17-R19] > I am testing on that demonstrates the tessellation failures. > > [^polygon-1.json] > [^polygon-2.json] > [^polygon-3.json] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] Yuti-G commented on a diff in pull request #974: LUCENE-10614: Properly support getTopChildren in RangeFacetCounts
Yuti-G commented on code in PR #974: URL: https://github.com/apache/lucene/pull/974#discussion_r916342857 ## lucene/demo/src/java/org/apache/lucene/demo/facet/RangeFacetsExample.java: ## @@ -73,6 +76,35 @@ public void index() throws IOException { indexWriter.addDocument(doc); } +// Add documents with a fake timestamp, 3600 sec (1 hour) after "now", 7200 sec (2 +// hours) after "now", ...: +long startTime = 0; +// Index error messages since a week (24 * 7 = 168 hours) ago +for (int i = 0; i < 168; i++) { + long endTime = startTime + (i + 1) * 3600; + + // Choose a relatively larger number, e,g., "35", in order to create variation in count for + // the top-n children, so that getTopChildren(10) in the searchTopChildren functionality + // can return children with different counts + for (int j = 0; j < i % 35; j++) { +Document doc = new Document(); +// index document at a different timestamp by using endTime - i * j Review Comment: Using `Random` does add some complexity for testing, and I was trying to keep it as simple as the current example, but sorry for the confusion that caused. I will create a separate issue to add a top-n range faceting example after this PR is merged, and will try to use `Random` and add clear comments to the example code. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-jira-archive] dweiss commented on issue #8: Set assignee field for issues if the account mapping is given
dweiss commented on issue #8: URL: https://github.com/apache/lucene-jira-archive/issues/8#issuecomment-1178379951 Excellent! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller opened a new pull request, #1010: Specialize ordinal encoding for SortedSetDocValues
gsmiller opened a new pull request, #1010: URL: https://github.com/apache/lucene/pull/1010 ### Description (or a Jira issue link if you have one) This follows up the work done in LUCENE-10067 by adding additional specialization for SORTED_SET doc values. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #1010: Specialize ordinal encoding for SortedSetDocValues
gsmiller commented on PR #1010: URL: https://github.com/apache/lucene/pull/1010#issuecomment-1178381324 Benchmarks look good on SSDV faceting (and no regressions elsewhere). I think some new bench tasks have recently been added as well that might be relevant here, so I'll update my luceneutil and run again soon. For now, here are results on `wikimediumall`: ``` TaskQPS baseline StdDevQPS candidate StdDevPct diff p-value Prefix3 58.57 (8.6%) 57.65 (9.9%) -1.6% ( -18% - 18%) 0.591 HighTermMonthSort 47.72 (28.0%) 47.26 (15.8%) -1.0% ( -34% - 59%) 0.893 BrowseRandomLabelSSDVFacets2.60 (6.8%)2.58 (5.6%) -0.7% ( -12% - 12%) 0.729 HighSpanNear 17.35 (2.6%) 17.23 (4.2%) -0.7% ( -7% -6%) 0.544 OrHighNotHigh 762.36 (4.2%) 757.77 (4.8%) -0.6% ( -9% -8%) 0.673 Wildcard 27.05 (5.4%) 26.94 (5.9%) -0.4% ( -11% - 11%) 0.820 LowPhrase 35.88 (2.7%) 35.80 (2.7%) -0.2% ( -5% -5%) 0.788 OrNotHighHigh 645.25 (3.1%) 644.30 (3.3%) -0.1% ( -6% -6%) 0.884 LowTerm 1793.47 (3.6%) 1792.45 (3.8%) -0.1% ( -7% -7%) 0.961 OrNotHighMed 653.99 (3.1%) 653.73 (3.2%) -0.0% ( -6% -6%) 0.968 AndHighMed 68.77 (5.3%) 68.75 (6.5%) -0.0% ( -11% - 12%) 0.986 LowIntervalsOrdered 51.08 (4.6%) 51.08 (4.5%)0.0% ( -8% -9%) 1.000 MedPhrase 70.46 (3.1%) 70.46 (3.3%)0.0% ( -6% -6%) 0.995 OrHighNotLow 1055.73 (3.3%) 1055.91 (4.4%)0.0% ( -7% -7%) 0.989 HighIntervalsOrdered8.03 (4.5%)8.03 (4.6%)0.0% ( -8% -9%) 0.984 MedSpanNear 11.88 (2.4%) 11.89 (3.1%)0.1% ( -5% -5%) 0.926 MedTermDayTaxoFacets 18.17 (3.6%) 18.20 (3.8%)0.2% ( -7% -7%) 0.891 OrHighNotMed 780.58 (3.3%) 781.92 (3.8%)0.2% ( -6% -7%) 0.877 OrHighMedDayTaxoFacets4.78 (4.4%)4.79 (5.0%)0.2% ( -8% -9%) 0.906 AndHighHighDayTaxoFacets6.91 (2.3%)6.92 (2.9%)0.2% ( -4% -5%) 0.828 MedIntervalsOrdered4.36 (3.5%)4.37 (3.7%)0.2% ( -6% -7%) 0.851 OrHighHigh 14.24 (2.8%) 14.27 (6.4%)0.3% ( -8% -9%) 0.872 IntNRQ 33.94 (1.1%) 34.05 (1.3%)0.3% ( -2% -2%) 0.381 Fuzzy2 71.29 (1.7%) 71.55 (1.8%)0.4% ( -3% -3%) 0.509 LowSpanNear8.79 (2.7%)8.83 (3.2%)0.4% ( -5% -6%) 0.673 Fuzzy1 76.55 (1.7%) 76.90 (1.6%)0.5% ( -2% -3%) 0.377 AndHighLow 1077.25 (4.1%) 1082.31 (3.8%)0.5% ( -7% -8%) 0.706 BrowseDayOfYearSSDVFacets3.45 (5.8%)3.47 (4.9%)0.6% ( -9% - 11%) 0.722 LowSloppyPhrase 16.00 (1.9%) 16.10 (2.9%)0.6% ( -4% -5%) 0.437 OrHighMed 47.78 (2.0%) 48.08 (3.9%)0.6% ( -5% -6%) 0.527 HighTerm 1147.74 (4.7%) 1155.05 (4.2%)0.6% ( -7% -9%) 0.650 PKLookup 147.38 (3.6%) 148.34 (3.1%)0.7% ( -5% -7%) 0.537 AndHighHigh 19.16 (5.1%) 19.29 (6.8%)0.7% ( -10% - 13%) 0.730 Respell 51.81 (1.5%) 52.15 (1.4%)0.7% ( -2% -3%) 0.147 MedTerm 1406.90 (4.4%) 1417.58 (4.2%)0.8% ( -7% -9%) 0.578 MedSloppyPhrase 28.98 (2.0%) 29.20 (2.8%)0.8% ( -3% -5%) 0.306 AndHighMedDayTaxoFacets 23.12 (2.0%) 23.31 (2.3%)0.8% ( -3% -5%) 0.232 TermDTSort 78.14 (20.6%) 78.83 (20.1%)0.9% ( -32% - 52%) 0.891 HighPhrase 180.25 (2.8%) 182.18 (2.6%)1.1% ( -4% -6%) 0.215
[GitHub] [lucene-jira-archive] mocobeta closed issue #8: Set assignee field for issues if the account mapping is given
mocobeta closed issue #8: Set assignee field for issues if the account mapping is given URL: https://github.com/apache/lucene-jira-archive/issues/8 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10600) SortedSetDocValues#docValueCount should be an int, not long
[ https://issues.apache.org/jira/browse/LUCENE-10600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Xugang resolved LUCENE-10600. Resolution: Fixed > SortedSetDocValues#docValueCount should be an int, not long > --- > > Key: LUCENE-10600 > URL: https://issues.apache.org/jira/browse/LUCENE-10600 > Project: Lucene - Core > Issue Type: Bug >Reporter: Adrien Grand >Assignee: Lu Xugang >Priority: Minor > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10627) Using CompositeByteBuf to Reduce Memory Copy
[ https://issues.apache.org/jira/browse/LUCENE-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563044#comment-17563044 ] LuYunCheng edited comment on LUCENE-10627 at 7/8/22 5:33 AM: - [~jpountz], [~uschindler] Hi, I tried to reuse ByteBuffersDataInput to reduce memory copies, because it can be obtained from ByteBuffersDataOutput.toDataInput, which reduces the complexity ([latest commit|https://github.com/luyuncheng/lucene/pull/1], [PR|https://github.com/apache/lucene/pull/987]). But I am not sure whether we can change the Compressor interface's compress input parameter from byte[] to ByteBuffersDataInput. Changing this interface [like this|https://github.com/apache/lucene/blob/382962f22df3ee3af3fb538b877c98d61a622ddb/lucene/core/src/java/org/apache/lucene/codecs/compressing/Compressor.java#L35] increases the backport code [like this|https://github.com/apache/lucene/blob/382962f22df3ee3af3fb538b877c98d61a622ddb/lucene/core/src/java/org/apache/lucene/codecs/compressing/CompressionMode.java#L274]; however, if we change the interface to ByteBuffersDataInput, we can optimize memory copies inside each compression algorithm's code. Also, I found we can reduce more memory copies in *{{CompressingStoredFieldsWriter.copyOneDoc}}* [like this|https://github.com/apache/lucene/blob/382962f22df3ee3af3fb538b877c98d61a622ddb/lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsWriter.java#L516] and *{{CompressingTermVectorsWriter.flush}}*. Since this commit only reduces memory copies, we should not look at the benchmark time metric alone but also at JVM GC time to see the improvement, so I added StatisticsHelper to StoredFieldsBenchmark ([code|https://github.com/luyuncheng/luceneutil/commit/e77c7c7bff01bb036b1826e7ec5d46ad7ed5666d]). The latest commit:
# uses ByteBuffersDataInput to reduce memory copies in {{CompressingStoredFieldsWriter}} during {{flush}}
# uses ByteBuffersDataInput to reduce memory copies in {{CompressingTermVectorsWriter}} during {{flush}}
# uses ByteBuffer to *reduce memory copies* in *{{CompressingStoredFieldsWriter}}* during {{copyOneDoc}}
# replaces the Compressor interface's input parameter from byte[] to ByteBuffersDataInput

Running runStoredFieldsBenchmark with the JVM StatisticsHelper shows:
||Msec to index||BEST_SPEED||BEST_SPEED YGC||BEST_COMPRESSION||BEST_COMPRESSION YGC||
|Baseline|317973|1176 ms (258 collections)|605492|1476 ms (264 collections)|
|Candidate|314765|1012 ms (238 collections)|601253|1175 ms (234 collections)|

was (Author: luyuncheng): [~jpountz], [~uschindler] Hi, I tried to use ByteBuffersDataInput to reduce memory copies, because it can be obtained from ByteBuffersDataOutput.toDataInput, which reduces the complexity ([latest commit|https://github.com/luyuncheng/lucene/pull/1], [PR|https://github.com/apache/lucene/pull/987]). But I am not sure whether we can change the Compressor interface's compress input parameter from byte[] to ByteBuffersDataInput. Changing this interface [like this|https://github.com/apache/lucene/blob/382962f22df3ee3af3fb538b877c98d61a622ddb/lucene/core/src/java/org/apache/lucene/codecs/compressing/Compressor.java#L35] increases the backport code [like this|https://github.com/apache/lucene/blob/382962f22df3ee3af3fb538b877c98d61a622ddb/lucene/core/src/java/org/apache/lucene/codecs/compressing/CompressionMode.java#L274]; however, if we change the interface to ByteBuffersDataInput, we can optimize memory copies inside each compression algorithm's code. Also, I found we can reduce more memory copies in *{{CompressingStoredFieldsWriter.copyOneDoc}}* [like this|https://github.com/apache/lucene/blob/382962f22df3ee3af3fb538b877c98d61a622ddb/lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsWriter.java#L516] and *{{CompressingTermVectorsWriter.flush}}*. Since this commit only reduces memory copies, we should not look at the benchmark time metric alone but also at JVM GC time to see the improvement, so I added StatisticsHelper to StoredFieldsBenchmark ([code|https://github.com/luyuncheng/luceneutil/commit/e77c7c7bff01bb036b1826e7ec5d46ad7ed5666d]). The latest commit:
# uses ByteBuffersDataInput to reduce memory copies in {{CompressingStoredFieldsWriter}} during {{flush}}
# uses ByteBuffersDataInput to reduce memory copies in {{CompressingTermVectorsWriter}} during {{flush}}
# uses ByteBuffer to *reduce memory copies* in *{{CompressingStoredFieldsWriter}}* during {{copyOneDoc}}
# replaces the Compressor interface's input parameter from byte[] to ByteBuffersDataInput

Running runStoredFieldsBenchmark with the JVM StatisticsHelper shows:
||Msec to index||BEST_SPEED||BEST_SPEED YGC||BEST_COMPRESSION||BEST_COMPRESSION YGC||
|Baseline|317973|1176 ms (258 collections)|605492|1476 ms (264 collections)|
|Candidate|314765|1012 ms (238 collections)|601253|1175 ms (234 collections)|
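To make the reuse idea above concrete, here is a minimal sketch using only the public ByteBuffersDataOutput/ByteBuffersDataInput API (an editorial illustration, not the patch itself):

{code:java}
import org.apache.lucene.store.ByteBuffersDataInput;
import org.apache.lucene.store.ByteBuffersDataOutput;

// Sketch of the reuse idea: instead of materializing the buffered chunk into
// one contiguous byte[], expose a read-side view over the same ByteBuffers.
public class ToDataInputSketch {
  public static void main(String[] args) throws Exception {
    ByteBuffersDataOutput bufferedDocs = new ByteBuffersDataOutput();
    bufferedDocs.writeBytes(new byte[4096], 4096); // stand-in for buffered documents

    // Before: one extra full copy of the chunk just to get contiguous bytes.
    byte[] contiguous = bufferedDocs.toArrayCopy();

    // After: a zero-copy view that a compressor could consume directly.
    ByteBuffersDataInput view = bufferedDocs.toDataInput();
    System.out.println(contiguous.length + " bytes copied vs. a view of "
        + view.size() + " bytes");
  }
}
{code}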
[jira] [Updated] (LUCENE-10627) Using CompositeByteBuf to Reduce Memory Copy
[ https://issues.apache.org/jira/browse/LUCENE-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LuYunCheng updated LUCENE-10627: Description: Code: [https://github.com/apache/lucene/pull/987] I see that when Lucene does flush and merge of stored fields, it needs many memory copies:
{code:java}
"Lucene Merge Thread #25940" #906546 daemon prio=5 os_prio=0 cpu=20503.95ms elapsed=68.76s tid=0x7ee990002c50 nid=0x3aac54 runnable [0x7f17718db000]
java.lang.Thread.State: RUNNABLE
at org.apache.lucene.store.ByteBuffersDataOutput.toArrayCopy(ByteBuffersDataOutput.java:271)
at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.flush(CompressingStoredFieldsWriter.java:239)
at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.finishDocument(CompressingStoredFieldsWriter.java:169)
at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:654)
at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:228)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105)
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4760)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4364)
at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5923)
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:100)
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:682)
{code}
When Lucene's *CompressingStoredFieldsWriter* flushes documents, it needs many memory copies:
With Lucene90 using {*}LZ4WithPresetDictCompressionMode{*}:
# bufferedDocs.toArrayCopy copies blocks into one contiguous buffer for chunk compression
# the compressor copies dict and data into one block buffer
# do the compression
# copy the compressed data out
With Lucene90 using {*}DeflateWithPresetDictCompressionMode{*}:
# bufferedDocs.toArrayCopy copies blocks into one contiguous buffer for chunk compression
# do the compression
# copy the compressed data out
I think we can use -CompositeByteBuf- to reduce temporary memory copies:
# we do not have to *bufferedDocs.toArrayCopy* when we just need contiguous content for chunk compression
I wrote a simple mini benchmark in test code ([link|https://github.com/apache/lucene/blob/5a406a5c483c7fadaf0e8a5f06732c79ad174d11/lucene/core/src/test/org/apache/lucene/codecs/lucene90/compressing/TestCompressingStoredFieldsFormat.java#L353]):
*LZ4WithPresetDict run*, capacity 41943040 bytes, 10 iterations: origin elapsed 5391ms, new elapsed 5297ms
*DeflateWithPresetDict run*, capacity 41943040 bytes, 10 iterations: origin elapsed {*}115ms{*}, new elapsed {*}12ms{*}
And I ran runStoredFieldsBenchmark with doc_limit=-1, which shows:
||Msec to index||BEST_SPEED||BEST_COMPRESSION||
|Baseline|318877.00|606288.00|
|Candidate|314442.00|604719.00|
---UPDATE---
I tried to *reuse ByteBuffersDataInput* to reduce memory copies, because it can be obtained from ByteBuffersDataOutput.toDataInput, which reduces this complexity ([PR|https://github.com/apache/lucene/pull/987]). But I am not sure whether we can change the Compressor interface's compress input parameter from byte[] to ByteBuffersDataInput.
Changing this interface [like this|https://github.com/apache/lucene/blob/382962f22df3ee3af3fb538b877c98d61a622ddb/lucene/core/src/java/org/apache/lucene/codecs/compressing/Compressor.java#L35] increases the backport code [like this|https://github.com/apache/lucene/blob/382962f22df3ee3af3fb538b877c98d61a622ddb/lucene/core/src/java/org/apache/lucene/codecs/compressing/CompressionMode.java#L274]; however, if we change the interface to ByteBuffersDataInput, we can optimize memory copies inside each compression algorithm's code. Also, I found we can reduce more memory copies in *{{CompressingStoredFieldsWriter.copyOneDoc}}* [like this|https://github.com/apache/lucene/blob/382962f22df3ee3af3fb538b877c98d61a622ddb/lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsWriter.java#L516] and *{{CompressingTermVectorsWriter.flush}}*. Since this commit only reduces memory copies, we should not look at the benchmark time metric alone but also at JVM GC time to see the improvement, so I added StatisticsHelper to StoredFieldsBenchmark ([code|https://github.com/luyuncheng/luceneutil/commit/e77c7c7bff01bb036b1826e7ec5d46ad7ed5666d]). The latest commit:
# uses ByteBuffersDataInput to reduce memory copies in {{CompressingStoredFieldsWriter}} during {{flush}}
# uses ByteBuffersDataInput to reduce memory copies in {{CompressingTermVectorsWriter}} during {{flush}}
# uses ByteBuffer to *reduce memory copies* in *{{CompressingStoredFieldsWriter}}* during {{copyOneDoc}}
# replaces the Compressor interface's input parameter from byte[] to ByteBuffersDataInput

Running runStoredFieldsBenchmark with the JVM StatisticsHelper shows:
||Msec to index||BEST_SPEED||BEST_SPEED YGC||BEST_COMPRESSION||BEST_COMPRESSION YGC||
|Baseline|317973|1176 ms (258 collections)|605492|1476 ms (264 collections)|
|Candidate|314765|1012 ms (238 collections)|601253|1175 ms (234 collections)|
[jira] [Commented] (LUCENE-10627) Using CompositeByteBuf to Reduce Memory Copy
[ https://issues.apache.org/jira/browse/LUCENE-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564086#comment-17564086 ] LuYunCheng commented on LUCENE-10627: - [~rcmuir] Hi, in the latest commit I *reuse ByteBuffersDataInput* to reduce memory copies, because it can be obtained from ByteBuffersDataOutput.toDataInput directly, and it reduces the code complexity. > Using CompositeByteBuf to Reduce Memory Copy > > > Key: LUCENE-10627 > URL: https://issues.apache.org/jira/browse/LUCENE-10627 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs, core/store >Reporter: LuYunCheng >Priority: Major > > Code: [https://github.com/apache/lucene/pull/987] > I see When Lucene Do flush and merge store fields, need many memory copies: > {code:java} > Lucene Merge Thread #25940]" #906546 daemon prio=5 os_prio=0 cpu=20503.95ms > elapsed=68.76s tid=0x7ee990002c50 nid=0x3aac54 runnable > [0x7f17718db000] > java.lang.Thread.State: RUNNABLE > at > org.apache.lucene.store.ByteBuffersDataOutput.toArrayCopy(ByteBuffersDataOutput.java:271) > at > org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.flush(CompressingStoredFieldsWriter.java:239) > at > org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.finishDocument(CompressingStoredFieldsWriter.java:169) > at > org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:654) > at > org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:228) > at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105) > at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4760) > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4364) > at > org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5923) > at > org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624) > at > org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:100) > at > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:682) > {code} > When Lucene *CompressingStoredFieldsWriter* do flush documents, it needs many > memory copies: > With Lucene90 using {*}LZ4WithPresetDictCompressionMode{*}: > # bufferedDocs.toArrayCopy copy blocks into one continue content for chunk > compress > # compressor copy dict and data into one block buffer > # do compress > # copy compressed data out > With Lucene90 using {*}DeflateWithPresetDictCompressionMode{*}: > # bufferedDocs.toArrayCopy copy blocks into one continue content for chunk > compress > # do compress > # copy compressed data out > > I think we can use -CompositeByteBuf- to reduce temp memory copies: > # we do not have to *bufferedDocs.toArrayCopy* when just need continues > content for chunk compress > > I write a simple mini benchamrk in test code ([link > |https://github.com/apache/lucene/blob/5a406a5c483c7fadaf0e8a5f06732c79ad174d11/lucene/core/src/test/org/apache/lucene/codecs/lucene90/compressing/TestCompressingStoredFieldsFormat.java#L353]): > *LZ4WithPresetDict run* Capacity:41943040(bytes) , iter 10times: Origin > elapse:5391ms , New elapse:5297ms > *DeflateWithPresetDict run* Capacity:41943040(bytes), iter 10times: Origin > elapse:{*}115ms{*}, New elapse:{*}12ms{*} > > And I run runStoredFieldsBenchmark with doc_limit=-1: > shows: > ||Msec to index||BEST_SPEED ||BEST_COMPRESSION|| > 
|Baseline|318877.00|606288.00| > |Candidate|314442.00|604719.00| > > ---UPDATE--- > > I try to *reuse ByteBuffersDataInput* to reduce memory copy because it can > get from ByteBuffersDataOutput.toDataInput. and it could reduce this > complexity ([PR|https://github.com/apache/lucene/pull/987]) > BUT i am not sure whether can change Compressor interface compress input > param from byte[] to ByteBuffersDataInput. If change this interface > [like|https://github.com/apache/lucene/blob/382962f22df3ee3af3fb538b877c98d61a622ddb/lucene/core/src/java/org/apache/lucene/codecs/compressing/Compressor.java#L35], > it increased the backport code > [like|https://github.com/apache/lucene/blob/382962f22df3ee3af3fb538b877c98d61a622ddb/lucene/core/src/java/org/apache/lucene/codecs/compressing/CompressionMode.java#L274], > however if we change the interface with ByteBuffersDataInput, we can > optimize memory copy into different compress algorithm code. > Also, i found we can do more memory copy reduce in > *{{{}CompressingStoredFieldsWriter.{}}}{{{}copyOneDoc > [like|https://github.com/apache/lucene/blob/382962f22df3ee3af3fb538b877c98d61a622ddb/lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsWriter.java#L516] and CompressingTermVectorsWriter.flush{}}}*
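For readers following the interface discussion above, the proposed change would look roughly like this (a hedged sketch of what the comment proposes, not necessarily the merged signature):

{code:java}
import java.io.Closeable;
import java.io.IOException;
import org.apache.lucene.store.ByteBuffersDataInput;
import org.apache.lucene.store.DataOutput;

// Hedged sketch of the Compressor change discussed above, not the merged code.
public abstract class Compressor implements Closeable {
  // Before: callers must first materialize one contiguous byte[] (an extra copy).
  // public abstract void compress(byte[] bytes, int off, int len, DataOutput out)
  //     throws IOException;

  // After: the writer's buffered blocks are consumed directly, letting each
  // compression algorithm avoid the up-front toArrayCopy().
  public abstract void compress(ByteBuffersDataInput buffersInput, DataOutput out)
      throws IOException;
}
{code}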
[jira] [Commented] (LUCENE-10647) Failure in TestMergeSchedulerExternal.testSubclassConcurrentMergeScheduler
[ https://issues.apache.org/jira/browse/LUCENE-10647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564099#comment-17564099 ] Vigya Sharma commented on LUCENE-10647: --- I think the cause of this failure is related to, but slightly different from, https://issues.apache.org/jira/browse/LUCENE-10617. However, I'm not able to repro it on my box despite running the tests on repeat. My hunch is that we are hitting an exception in the {{addDocument()}} API, which gets swallowed by the catch block. But, as a result, we end up calling {{writer.rollback()}} before (or rather without) calling getMergeScheduler().sync(). Once rollback is triggered, MergeThreads exit with an abort, which is swallowed (and not rethrown) by {{writer.handleMergeExceptions()}}. This leaves the excCalled flag unset, causing the assertion error. (Code ref: [https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/TestMergeSchedulerExternal.java#L133-L139]) I can raise a quick PR with a fix, but I don't have a good way to test and confirm, as this has not reproduced on my box so far. > Failure in TestMergeSchedulerExternal.testSubclassConcurrentMergeScheduler > -- > > Key: LUCENE-10647 > URL: https://issues.apache.org/jira/browse/LUCENE-10647 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Vigya Sharma >Priority: Major > > Recent builds are intermittently failing on > TestMergeSchedulerExternal.testSubclassConcurrentMergeScheduler. Example: > https://jenkins.thetaphi.de/job/Lucene-main-Linux/35576/testReport/junit/org.apache.lucene/TestMergeSchedulerExternal/testSubclassConcurrentMergeScheduler/ -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
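A sketch of the ordering the comment describes (hypothetical test-side code, assuming an {{IndexWriter writer}} and a {{Document doc}} in scope; not the actual TestMergeSchedulerExternal source):

{code:java}
// Hypothetical illustration of the suspected race, not the real test code.
try {
  writer.addDocument(doc);   // a background merge exception may surface here
} catch (Exception e) {
  // swallowed: the test only records merge exceptions via its scheduler subclass
}
// Waiting for merge threads first gives them a chance to record the exception
// (setting the test's excCalled flag) before they are torn down...
((ConcurrentMergeScheduler) writer.getConfig().getMergeScheduler()).sync();
// ...whereas rolling back immediately aborts them, and the abort is swallowed
// by writer.handleMergeExceptions(), leaving excCalled unset.
writer.rollback();
{code}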
[jira] [Created] (LUCENE-10647) Failure in TestMergeSchedulerExternal.testSubclassConcurrentMergeScheduler
Vigya Sharma created LUCENE-10647: - Summary: Failure in TestMergeSchedulerExternal.testSubclassConcurrentMergeScheduler Key: LUCENE-10647 URL: https://issues.apache.org/jira/browse/LUCENE-10647 Project: Lucene - Core Issue Type: Improvement Reporter: Vigya Sharma Recent builds are intermittently failing on TestMergeSchedulerExternal.testSubclassConcurrentMergeScheduler. Example: https://jenkins.thetaphi.de/job/Lucene-main-Linux/35576/testReport/junit/org.apache.lucene/TestMergeSchedulerExternal/testSubclassConcurrentMergeScheduler/ -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org