[GitHub] [lucene] jpountz commented on pull request #1006: LUCENE-10480: Move scoring from advance to TwoPhaseIterator#matches to improve disjunction within conjunction

2022-07-07 Thread GitBox


jpountz commented on PR #1006:
URL: https://github.com/apache/lucene/pull/1006#issuecomment-1177173038

   Ah, that makes sense to me now! Thanks for explaining.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zacharymorn commented on pull request #1006: LUCENE-10480: Move scoring from advance to TwoPhaseIterator#matches to improve disjunction within conjunction

2022-07-07 Thread GitBox


zacharymorn commented on PR #1006:
URL: https://github.com/apache/lucene/pull/1006#issuecomment-1177230791

   > Ah, that makes sense to me now! Thanks for explaining.
   
   No problem!





[jira] [Created] (LUCENE-10645) Wrong autocomplete suggestion

2022-07-07 Thread Emiliyan Sinigerov (Jira)
Emiliyan Sinigerov created LUCENE-10645:
---

 Summary: Wrong autocomplete suggestion
 Key: LUCENE-10645
 URL: https://issues.apache.org/jira/browse/LUCENE-10645
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Emiliyan Sinigerov


I have a problem with an autocomplete suggestion (I use your test to show where 
the bug is: 
https://github.com/apache/lucene/blob/698f40ad51af0c42b0a4a8321ab89968e8d0860b/lucene/suggest/src/test/org/apache/lucene/search/suggest/analyzing/TestAnalyzingInfixSuggester.java).
This is your test and everything works fine:

public void testBothExactAndPrefix() throws Exception {
    Analyzer a = new MockAnalyzer(random(), MockTokenizer.WHITESPACE, false);
    AnalyzingInfixSuggester suggester = new AnalyzingInfixSuggester(newDirectory(), a, a, 3, false);
    suggester.build(new InputArrayIterator(new Input[0]));
    suggester.add(new BytesRef("the pen is pretty"), null, 10, new BytesRef("foobaz"));
    suggester.refresh();

    List results =
        suggester.lookup(TestUtil.stringToCharSequence("pen p", random()), 10, true, true);
    assertEquals(1, results.size());
    assertEquals("the pen is pretty", results.get(0).key);
    assertEquals("the pen is pretty", results.get(0).highlightKey);
    assertEquals(10, results.get(0).value);
    assertEquals(new BytesRef("foobaz"), results.get(0).payload);
    suggester.close();
    a.close();
}

 

But if I add this row to the test, {*}suggester.add(new BytesRef("the pen is 
fretty"), null, 10, new BytesRef("foobaz")){*}, the test fails.

public void testBothExactAndPrefix() throws Exception {
  Analyzer a = new MockAnalyzer(random(), MockTokenizer.WHITESPACE, false);
  AnalyzingInfixSuggester suggester = new AnalyzingInfixSuggester(newDirectory(), a, a, 3, false);
  suggester.build(new InputArrayIterator(new Input[0]));
  suggester.add(new BytesRef("the pen is pretty"), null, 10, new BytesRef("foobaz"));
  *suggester.add(new BytesRef("the pen is fretty"), null, 10, new BytesRef("foobaz"));*

  suggester.refresh();

  List results =
      suggester.lookup(TestUtil.stringToCharSequence("pen p", random()), 10, true, true);
  assertEquals(1, results.size());
  assertEquals("the pen is pretty", results.get(0).key);
  assertEquals("the pen is pretty", results.get(0).highlightKey);
  assertEquals(10, results.get(0).value);
  assertEquals(new BytesRef("foobaz"), results.get(0).payload);
  suggester.close();
  a.close();
}

We want to find everything that contains "pen p", and there is just one match, 
"the pen is pretty", but in the results we have two matches: "the pen is pretty" 
and "the pen is fretty".

I think when we search for a word - in this case "pen" - followed by a second 
word of one letter that is the same as the first letter of that word - in this 
case "p" - the suggester first matches the word "pen" and then matches the 
prefix "p" against "pen" again, which is incorrect. We want "p" to match a word 
other than "pen".
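To make the intended rule concrete, here is a minimal, hypothetical sketch in plain Java (not the Lucene implementation; class and method names are made up): once a query token has been consumed by an exact match, a trailing prefix token should only match one of the remaining suggestion tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PrefixMatchSketch {
    // Returns true if all exact tokens appear in the suggestion AND the
    // prefix matches one of the tokens NOT already consumed by an exact match.
    public static boolean matches(String suggestion, List<String> exactTokens, String prefix) {
        List<String> tokens = new ArrayList<>(Arrays.asList(suggestion.split("\\s+")));
        for (String exact : exactTokens) {
            if (!tokens.remove(exact)) {
                return false; // a required exact token is missing
            }
        }
        for (String t : tokens) {
            if (t.startsWith(prefix)) {
                return true; // prefix matched a different, remaining token
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "p" matches "pretty", a token other than the already-matched "pen"
        System.out.println(matches("the pen is pretty", List.of("pen"), "p")); // true
        // "p" would only match "pen" itself, so this should NOT match
        System.out.println(matches("the pen is fretty", List.of("pen"), "p")); // false
    }
}
```

Under this rule, "the pen is fretty" would be rejected for the query "pen p", which is the behavior the test expects.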

 

Thank you,

 

Emiliyan.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563627#comment-17563627
 ] 

ASF subversion and git services commented on LUCENE-10480:
--

Commit da8143bfa38cd5fadae4b4712b9e639e79016021 in lucene's branch 
refs/heads/main from zacharymorn
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=da8143bfa38 ]

LUCENE-10480: Move scoring from advance to TwoPhaseIterator#matches to improve 
disjunction within conjunction (#1006)



> Specialize 2-clauses disjunctions
> -
>
> Key: LUCENE-10480
> URL: https://issues.apache.org/jira/browse/LUCENE-10480
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> WANDScorer is nice, but it also has lots of overhead to maintain its 
> invariants: one linked list for the current candidates, one priority queue of 
> scorers that are behind, another one for scorers that are ahead. All this 
> could be simplified in the 2-clauses case, which feels worth specializing for 
> as it's very common that end users enter queries that only have two terms?






[GitHub] [lucene] zacharymorn opened a new pull request, #1008: LUCENE-10480: (Backporting) Move scoring from advance to TwoPhaseIterator#matches to improve disjunction within conjunction (#1006)

2022-07-07 Thread GitBox


zacharymorn opened a new pull request, #1008:
URL: https://github.com/apache/lucene/pull/1008

   This PR backports https://github.com/apache/lucene/pull/1006 into `branch_9x`





[GitHub] [lucene] zacharymorn merged pull request #1006: LUCENE-10480: Move scoring from advance to TwoPhaseIterator#matches to improve disjunction within conjunction

2022-07-07 Thread GitBox


zacharymorn merged PR #1006:
URL: https://github.com/apache/lucene/pull/1006





[GitHub] [lucene-jira-archive] mocobeta commented on issue #8: Set assignee field for issues if the account mapping is given

2022-07-07 Thread GitBox


mocobeta commented on issue #8:
URL: 
https://github.com/apache/lucene-jira-archive/issues/8#issuecomment-1177419371

   Thank you @dweiss for noticing this.
   I invited you to a test repository. I think an email has been sent.





[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-07 Thread Nayana Thorat (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563713#comment-17563713
 ] 

Nayana Thorat commented on LUCENE-10643:


[~uschindler] Yes, Oracle does not offer JDK 19 for s390x yet; however, Eclipse 
Adoptium has a nightly (beta) release for Java 19. I have installed it on the 
s390x nodes in the directory /home/jenkins/tools/java/adoptjdk19

Version installed:

$ /home/jenkins/tools/java/adoptjdk19/bin/java --version
openjdk 19-beta 2022-09-20
OpenJDK Runtime Environment Temurin-19+29-202207070331 (build 19-beta+29-202207070331)
OpenJDK 64-Bit Server VM Temurin-19+29-202207070331 (build 19-beta+29-202207070331, mixed mode, sharing)

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[GitHub] [lucene-jira-archive] mikemccand commented on issue #8: Set assignee field for issues if the account mapping is given

2022-07-07 Thread GitBox


mikemccand commented on issue #8:
URL: 
https://github.com/apache/lucene-jira-archive/issues/8#issuecomment-1177455363

   > @mikemccand I invited you to a test repository to test if we can set 
(migrate) issues' `Assignee` field. An email should have been sent - can you 
please accept it?
   > 
   > I tested it with my account (API's caller and issue author), just wanted 
to confirm it also works for other accounts.
   
   Thanks @mocobeta!  I just accepted the invitation.





[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-07 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563719#comment-17563719
 ] 

Uwe Schindler commented on LUCENE-10643:


Great, thanks, I will set up a job for that. It looks like it is recent enough.

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-07 Thread Nayana Thorat (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563727#comment-17563727
 ] 

Nayana Thorat commented on LUCENE-10643:


[~uschindler] 
[https://ci-builds.apache.org/job/Lucene/job/Lucene-Check-main%20(s390x%20big%20endian)/2/console]
The build is successful; however, I can see the exception below when archiving 
artifacts. Does any configuration need to be done?
Archiving artifacts
hudson.FilePath$ValidateAntFileMask$1Cancel
at hudson.FilePath$ValidateAntFileMask$1.isCaseSensitive(FilePath.java:3209)

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[jira] [Comment Edited] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-07 Thread Nayana Thorat (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563727#comment-17563727
 ] 

Nayana Thorat edited comment on LUCENE-10643 at 7/7/22 11:39 AM:
-

[~uschindler] 
[https://ci-builds.apache.org/job/Lucene/job/Lucene-Check-main%20(s390x%20big%20endian)/2/console]
The build is successful; however, I can see the exception below when archiving 
artifacts. Does any configuration need to be done?
_Archiving artifacts_
_hudson.FilePath$ValidateAntFileMask$1Cancel_
_at hudson.FilePath$ValidateAntFileMask$1.isCaseSensitive(FilePath.java:3209)_


was (Author: nayana):
[~uschindler] 
[https://ci-builds.apache.org/job/Lucene/job/Lucene-Check-main%20(s390x%20big%20endian)/2/console]
 The Build is successful however I could see below exception for artifacts . 
Any conf needs to be done?
Archiving artifacts
hudson.FilePath$ValidateAntFileMask$1Cancel
at 
hudson.FilePath$ValidateAntFileMask$1.isCaseSensitive(FilePath.java:3209)

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-07 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563729#comment-17563729
 ] 

Uwe Schindler commented on LUCENE-10643:


[~Nayana]: This is not a problem. It appears on all builds and has to do with 
a bug in Jenkins. It cannot be prevented, sorry. As long as builds succeed, all 
is fine. [~dweiss] has some hints about the bug.

It happens on all our builds.

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[jira] [Comment Edited] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-07 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563729#comment-17563729
 ] 

Uwe Schindler edited comment on LUCENE-10643 at 7/7/22 11:41 AM:
-

[~Nayana]: This is not a problem. It appears on all builds and has to do with 
a bug in Jenkins. It cannot be prevented, sorry. As long as builds succeed, all 
is fine. [~dweiss] has some hints about the bug.


was (Author: thetaphi):
[~Nayana]: This is not a problem. It appears on all builds and has to do with 
some bug in Jenkins. Cannot be prevented, sorry. As long as builds succeed all 
is fine. [~dweiss] has some hints about the bug.

It happens on all our builds.

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-07 Thread Nayana Thorat (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563737#comment-17563737
 ] 

Nayana Thorat commented on LUCENE-10643:


[~uschindler] Oh, OK. Thank you for the clarification.

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-07 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563742#comment-17563742
 ] 

Uwe Schindler commented on LUCENE-10643:


See this: https://www.mail-archive.com/dev@lucene.apache.org/msg314005.html

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-07 Thread Nayana Thorat (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563743#comment-17563743
 ] 

Nayana Thorat commented on LUCENE-10643:


One more thing I want to ask: how frequently will these jobs execute? (on any 
pull request check, on merge, etc.)

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-07 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563763#comment-17563763
 ] 

Uwe Schindler commented on LUCENE-10643:


It is configured to be {{@daily}}. Normal Lucene builds run {{@hourly}} on our 
special "lucene"-tagged nodes, so that we do not occupy nodes used by other 
projects by constantly running builds.

The reason for this is how Lucene tests work: they check with random data, so 
whenever you see a failure, it is something new (often JVM bugs):
- https://www.youtube.com/watch?v=-uVE_w8flIU
- https://2019.berlinbuzzwords.de/sites/2019.berlinbuzzwords.de/files/media/documents/dawidweiss-randomizedtesting-pub.pdf
- https://www.youtube.com/watch?v=PVRdLyQGUxE
- https://2013.berlinbuzzwords.de/sites/2013.berlinbuzzwords.de/files/slides/Schindler-BugsBugsBugs.pdf
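The idea of seed-based randomized testing can be sketched as follows (a minimal illustration of the general technique, not Lucene's actual test framework or its API):

```java
import java.util.Arrays;
import java.util.Random;

public class RandomizedTestSketch {
    // Helper under test: reverse a copy of the array.
    static int[] reverse(int[] a) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[a.length - 1 - i] = a[i];
        }
        return r;
    }

    // Property: reversing twice yields the original array, for input
    // generated deterministically from the given seed.
    public static boolean checkReverseTwice(long seed) {
        Random random = new Random(seed);
        int[] data = random.ints(100).toArray();
        return Arrays.equals(data, reverse(reverse(data)));
    }

    public static void main(String[] args) {
        long seed = new Random().nextLong(); // a fresh seed on every run
        if (!checkReverseTwice(seed)) {
            // Reporting the seed makes the random failure reproducible:
            // rerunning with the same seed regenerates the same input.
            throw new AssertionError("reproduce with seed=" + seed);
        }
        System.out.println("passed with seed=" + seed);
    }
}
```

Because every run exercises new inputs, a failure usually means a genuinely new bug rather than a flaky repeat, which is why each failing build is worth a look.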

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[GitHub] [lucene-jira-archive] dweiss commented on issue #8: Set assignee field for issues if the account mapping is given

2022-07-07 Thread GitBox


dweiss commented on issue #8:
URL: 
https://github.com/apache/lucene-jira-archive/issues/8#issuecomment-1177587814

   Accepted the invitation just now.





[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-07 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563795#comment-17563795
 ] 

Dawid Weiss commented on LUCENE-10643:
--

The timeout is caused by a hard limit in Jenkins that should be configurable 
via system properties:

[https://www.jenkins.io/doc/book/managing/system-properties/#hudson-filepath-validate_ant_file_mask_bound]

We never got around to figuring out how this can be done, though.

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[jira] [Commented] (LUCENE-10627) Using CompositeByteBuf to Reduce Memory Copy

2022-07-07 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563805#comment-17563805
 ] 

Robert Muir commented on LUCENE-10627:
--

Yes, we have to stop another PagedBytes/ByteBlockPool from entering our 
codebase. To me it doesn't matter if the performance improvement is 1000%.

> Using CompositeByteBuf to Reduce Memory Copy
> 
>
> Key: LUCENE-10627
> URL: https://issues.apache.org/jira/browse/LUCENE-10627
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/store
>Reporter: LuYunCheng
>Priority: Major
>
> Code: [https://github.com/apache/lucene/pull/987]
> I see that when Lucene flushes and merges stored fields, it needs many memory copies:
> {code:java}
> Lucene Merge Thread #25940]" #906546 daemon prio=5 os_prio=0 cpu=20503.95ms 
> elapsed=68.76s tid=0x7ee990002c50 nid=0x3aac54 runnable  
> [0x7f17718db000]
>    java.lang.Thread.State: RUNNABLE
>     at 
> org.apache.lucene.store.ByteBuffersDataOutput.toArrayCopy(ByteBuffersDataOutput.java:271)
>     at 
> org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.flush(CompressingStoredFieldsWriter.java:239)
>     at 
> org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.finishDocument(CompressingStoredFieldsWriter.java:169)
>     at 
> org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:654)
>     at 
> org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:228)
>     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105)
>     at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4760)
>     at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4364)
>     at 
> org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5923)
>     at 
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
>     at 
> org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:100)
>     at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:682)
>  {code}
> When Lucene *CompressingStoredFieldsWriter* flushes documents, it needs many 
> memory copies:
> With Lucene90 using {*}LZ4WithPresetDictCompressionMode{*}:
>  # bufferedDocs.toArrayCopy copies blocks into one continuous buffer for chunk 
> compression
>  # the compressor copies dict and data into one block buffer
>  # do compress
>  # copy compressed data out
> With Lucene90 using {*}DeflateWithPresetDictCompressionMode{*}:
>  # bufferedDocs.toArrayCopy copies blocks into one continuous buffer for chunk 
> compression
>  # do compress
>  # copy compressed data out
>  
> I think we can use CompositeByteBuf to reduce temporary memory copies:
>  # we do not have to *bufferedDocs.toArrayCopy* when we just need continuous 
> content for chunk compression
>  
> I wrote a simple mini benchmark in test code ([link 
> |https://github.com/apache/lucene/blob/5a406a5c483c7fadaf0e8a5f06732c79ad174d11/lucene/core/src/test/org/apache/lucene/codecs/lucene90/compressing/TestCompressingStoredFieldsFormat.java#L353]):
> *LZ4WithPresetDict run* Capacity:41943040(bytes) , iter 10times: Origin 
> elapse:5391ms , New elapse:5297ms
> *DeflateWithPresetDict run* Capacity:41943040(bytes), iter 10times: Origin 
> elapse:{*}115ms{*}, New elapse:{*}12ms{*}
>  
> And I ran runStoredFieldsBenchmark with doc_limit=-1, which shows:
> ||Msec to index||BEST_SPEED ||BEST_COMPRESSION||
> |Baseline|318877.00|606288.00|
> |Candidate|314442.00|604719.00|
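The core idea in the proposal above - treating several underlying blocks as one logical byte sequence instead of first flattening them into a single contiguous array - can be sketched as follows (a simplified illustration only; this is neither Netty's CompositeByteBuf nor Lucene's ByteBuffersDataOutput):

```java
import java.util.List;

public class CompositeReadSketch {
    // Copies `len` bytes starting at logical position `pos` out of `blocks`,
    // walking across block boundaries. A real composite buffer would let the
    // consumer read in place; here the point is that no single large
    // contiguous copy of ALL blocks is ever materialized.
    public static byte[] read(List<byte[]> blocks, int pos, int len) {
        byte[] out = new byte[len];
        int copied = 0;
        for (byte[] block : blocks) {
            if (copied == len) break;
            if (pos >= block.length) { // skip whole blocks before `pos`
                pos -= block.length;
                continue;
            }
            int n = Math.min(block.length - pos, len - copied);
            System.arraycopy(block, pos, out, copied, n);
            copied += n;
            pos = 0; // subsequent blocks are read from their start
        }
        return out;
    }

    public static void main(String[] args) {
        List<byte[]> blocks = List.of(new byte[] {1, 2, 3}, new byte[] {4, 5});
        byte[] r = read(blocks, 2, 2); // spans the block boundary
        System.out.println(r[0] + "," + r[1]); // prints 3,4
    }
}
```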






[GitHub] [lucene-jira-archive] mocobeta commented on issue #8: Set assignee field for issues if the account mapping is given

2022-07-07 Thread GitBox


mocobeta commented on issue #8:
URL: 
https://github.com/apache/lucene-jira-archive/issues/8#issuecomment-1177747831

   Thank you both, confirmed that the assignee can be ported.
   
   Issue search result
   ![Screenshot from 2022-07-07 
23-52-51](https://user-images.githubusercontent.com/1825333/177804267-77c14495-120d-42b0-b056-b99bfa08b6cd.png)
   
   Issue detail
   ![Screenshot from 2022-07-07 
23-54-05](https://user-images.githubusercontent.com/1825333/177804732-50dd523f-77bd-4bb8-925f-a367340fb1ac.png)
   
   ![Screenshot from 2022-07-07 
23-54-43](https://user-images.githubusercontent.com/1825333/177804796-2ba2bdb8-bae5-4cde-a5c8-24b27675db1b.png)
   





[GitHub] [lucene-jira-archive] mikemccand commented on issue #8: Set assignee field for issues if the account mapping is given

2022-07-07 Thread GitBox


mikemccand commented on issue #8:
URL: 
https://github.com/apache/lucene-jira-archive/issues/8#issuecomment-1177759497

   Woot!





[GitHub] [lucene-jira-archive] mocobeta merged pull request #18: Check if the assignee account can be assigned on the repo

2022-07-07 Thread GitBox


mocobeta merged PR #18:
URL: https://github.com/apache/lucene-jira-archive/pull/18





[jira] [Commented] (LUCENE-10619) Optimize the writeBytes in TermsHashPerField

2022-07-07 Thread tangdh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563829#comment-17563829
 ] 

tangdh commented on LUCENE-10619:
-

[~jpountz], can this PR be merged?

> Optimize the writeBytes in TermsHashPerField
> 
>
> Key: LUCENE-10619
> URL: https://issues.apache.org/jira/browse/LUCENE-10619
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 9.2
>Reporter: tangdh
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Because we don't know the length of a slice, writeBytes will always write bytes 
> one after another instead of writing a block of bytes.
> Maybe we could return both offset and length in ByteBlockPool#allocSlice?
> 1. BYTE_BLOCK_SIZE is 32768, so the offset is at most 15 bits.
> 2. The slice size is at most 200, so it fits in 8 bits.
> So we could pack them together into an int: offset | length.
> There are only two places where this function is used, so the cost of changing 
> it is relatively small.
> When allocSlice can return the offset and length of the new slice, we could 
> change writeBytes like below:
> {code:java}
> // write a block of bytes each time
> while (remaining > 0) {
>    int offsetAndLength = allocSlice(bytes, offset);
>    length = Math.min(remaining, (offsetAndLength & 0xff) - 1);
>    offset = offsetAndLength >> 8;
>    System.arraycopy(src, srcPos, bytePool.buffer, offset, length);
>    srcPos += length; // also advance the read position in the source
>    remaining -= length;
>    offset += (length + 1);
> }
> {code}
> If it could work, I'd like to raise a PR.
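The bit-packing idea proposed above can be sketched as a small standalone example (the class and method names are illustrative, not Lucene's actual API):

```java
// Illustrative sketch of packing a slice offset (15 bits) and a slice length
// (8 bits) into one int, as proposed above. Not Lucene's real allocSlice API.
public class SlicePacking {
  // Precondition: 0 <= offset < 32768 (15 bits), 0 < length < 256 (8 bits)
  static int pack(int offset, int length) {
    return (offset << 8) | length;
  }

  static int unpackOffset(int packed) {
    return packed >>> 8;
  }

  static int unpackLength(int packed) {
    return packed & 0xff;
  }

  public static void main(String[] args) {
    int packed = pack(12345, 200);
    System.out.println(unpackOffset(packed)); // 12345
    System.out.println(unpackLength(packed)); // 200
  }
}
```

Decoding is just a shift and a mask, so returning the packed int would keep the hot writeBytes loop cheap.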



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10646) Add some comment on LevenshteinAutomata

2022-07-07 Thread tangdh (Jira)
tangdh created LUCENE-10646:
---

 Summary: Add some comment on LevenshteinAutomata
 Key: LUCENE-10646
 URL: https://issues.apache.org/jira/browse/LUCENE-10646
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Affects Versions: 9.2
Reporter: tangdh


After a hard time reading the code, I think I now understand the relevant 
LevenshteinAutomata code, except for the minErrors part.

I think this part of the code is too difficult to understand: it is full of 
magic numbers. I will sort it out and then raise a PR adding the necessary 
comments, so that others can understand this part of the code more easily.






[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-07 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563836#comment-17563836
 ] 

Uwe Schindler commented on LUCENE-10643:


Hi [~Nayana],
the Java 19 (JDK project Panama) run to support Lucene's MMapDirectory v2 (see 
PR https://github.com/apache/lucene/pull/912) was working fine on this big-endian 
platform. I will also report this to the OpenJDK community, as this is an 
important thing for them to know! It looks like all byte-swap instructions in 
Java's MemorySegment API are inserted at the correct places when reading and 
writing Lucene's little-endian file format.

The MMap v2 job is here: 
https://ci-builds.apache.org/job/Lucene/job/Lucene-MMAPv2-Linux%20(s390x%20big%20endian)/
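The byte-order handling discussed here can be illustrated with plain java.nio (a minimal sketch; Lucene's MMapDirectory v2 uses the Panama MemorySegment API, but the principle is the same):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LittleEndianRead {
  public static void main(String[] args) {
    // The int value 1 written little-endian: 0x01 0x00 0x00 0x00
    byte[] data = {0x01, 0x00, 0x00, 0x00};
    // Declaring the buffer's order LITTLE_ENDIAN makes the read correct on any
    // host, including big-endian platforms like s390x; the JVM inserts the
    // byte-swap instructions wherever the host order differs.
    ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
    System.out.println(buf.getInt(0)); // 1
  }
}
```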

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[jira] [Commented] (LUCENE-10643) Lucene Jenkins CI - s390x support

2022-07-07 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563840#comment-17563840
 ] 

Uwe Schindler commented on LUCENE-10643:


bq. The timeout is caused by a hard limit in jenkins that should be 
configurable via system properties

I raised this setting on Policeman Jenkins to 30,000.

> Lucene Jenkins CI - s390x support 
> --
>
> Key: LUCENE-10643
> URL: https://issues.apache.org/jira/browse/LUCENE-10643
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nayana Thorat
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: jenkins
>
> This issue adds Lucene builds on ASF Jenkins with S390x architecture (big 
> endian).
>  






[GitHub] [lucene] gsmiller commented on pull request #1004: LUCENE-10603: Stop using SortedSetDocValues.NO_MORE_ORDS in tests

2022-07-07 Thread GitBox


gsmiller commented on PR #1004:
URL: https://github.com/apache/lucene/pull/1004#issuecomment-1177927537

   Looks good. Thanks @stefanvodita!





[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues

2022-07-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563876#comment-17563876
 ] 

ASF subversion and git services commented on LUCENE-10603:
--

Commit dd4e8b82d711b8f665e91f0d74f159ef1e63939f in lucene's branch 
refs/heads/main from Stefan Vodita
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=dd4e8b82d71 ]

LUCENE-10603: Stop using SortedSetDocValues.NO_MORE_ORDS in tests (#1004)



> Improve iteration of ords for SortedSetDocValues
> 
>
> Key: LUCENE-10603
> URL: https://issues.apache.org/jira/browse/LUCENE-10603
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Assignee: Lu Xugang
>Priority: Trivial
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Now that SortedSetDocValues#docValueCount has been added in Lucene 9.2, should 
> we refactor the iteration over ords to use docValueCount instead of 
> NO_MORE_ORDS, similar to how SortedNumericDocValues does it?
> From 
> {code:java}
> for (long ord = values.nextOrd();ord != SortedSetDocValues.NO_MORE_ORDS; ord 
> = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}
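The two iteration styles quoted above can be compared with a tiny self-contained sketch (MockOrds is a stand-in for illustration, not Lucene's real SortedSetDocValues):

```java
// Mock ords source illustrating the old sentinel-based loop vs. the new
// count-based loop. NO_MORE_ORDS and MockOrds are illustrative only.
public class OrdIteration {
  static final long NO_MORE_ORDS = -1;

  static class MockOrds {
    private final long[] ords;
    private int pos = 0;
    MockOrds(long... ords) { this.ords = ords; }
    int docValueCount() { return ords.length; }
    long nextOrd() { return pos < ords.length ? ords[pos++] : NO_MORE_ORDS; }
  }

  public static void main(String[] args) {
    // Old style: iterate until the sentinel value is returned.
    MockOrds values = new MockOrds(3, 7, 42);
    long sum = 0;
    for (long ord = values.nextOrd(); ord != NO_MORE_ORDS; ord = values.nextOrd()) {
      sum += ord;
    }
    System.out.println(sum); // 52

    // New style: the count is known up front, so no sentinel is needed.
    values = new MockOrds(3, 7, 42);
    sum = 0;
    for (int i = 0; i < values.docValueCount(); i++) {
      sum += values.nextOrd();
    }
    System.out.println(sum); // 52
  }
}
```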






[GitHub] [lucene] gsmiller merged pull request #1004: LUCENE-10603: Stop using SortedSetDocValues.NO_MORE_ORDS in tests

2022-07-07 Thread GitBox


gsmiller merged PR #1004:
URL: https://github.com/apache/lucene/pull/1004





[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues

2022-07-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563901#comment-17563901
 ] 

ASF subversion and git services commented on LUCENE-10603:
--

Commit c46e1f03901ebaac9e010862acbb0cf460d807ef in lucene's branch 
refs/heads/branch_9x from Stefan Vodita
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=c46e1f03901 ]

LUCENE-10603: Stop using SortedSetDocValues.NO_MORE_ORDS in tests (#1004)



> Improve iteration of ords for SortedSetDocValues
> 
>
> Key: LUCENE-10603
> URL: https://issues.apache.org/jira/browse/LUCENE-10603
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Assignee: Lu Xugang
>Priority: Trivial
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Now that SortedSetDocValues#docValueCount has been added in Lucene 9.2, should 
> we refactor the iteration over ords to use docValueCount instead of 
> NO_MORE_ORDS, similar to how SortedNumericDocValues does it?
> From 
> {code:java}
> for (long ord = values.nextOrd();ord != SortedSetDocValues.NO_MORE_ORDS; ord 
> = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}






[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues

2022-07-07 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563912#comment-17563912
 ] 

Greg Miller commented on LUCENE-10603:
--

It looks like the only remaining work is to:
 # Remove the NO_MORE_ORDS definition
 # Update all the SortedSetDocValues implementations to stop returning 
NO_MORE_ORDS in nextOrd()
 # Remove all the test assertions that validate that SSDV#nextOrd() returns 
NO_MORE_ORDS

This should all be main branch work, and not something we backport to 9.x. I 
think 9.x is now good.

> Improve iteration of ords for SortedSetDocValues
> 
>
> Key: LUCENE-10603
> URL: https://issues.apache.org/jira/browse/LUCENE-10603
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Assignee: Lu Xugang
>Priority: Trivial
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Now that SortedSetDocValues#docValueCount has been added in Lucene 9.2, should 
> we refactor the iteration over ords to use docValueCount instead of 
> NO_MORE_ORDS, similar to how SortedNumericDocValues does it?
> From 
> {code:java}
> for (long ord = values.nextOrd();ord != SortedSetDocValues.NO_MORE_ORDS; ord 
> = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}






[GitHub] [lucene] jtibshirani commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing

2022-07-07 Thread GitBox


jtibshirani commented on code in PR #992:
URL: https://github.com/apache/lucene/pull/992#discussion_r916054211


##
lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldKnnVectorsFormat.java:
##
@@ -102,9 +104,22 @@ private class FieldsWriter extends KnnVectorsWriter {
 }
 
 @Override
-public void writeField(FieldInfo fieldInfo, KnnVectorsReader 
knnVectorsReader)
+public KnnFieldVectorsWriter addField(FieldInfo fieldInfo) throws 
IOException {
+  KnnVectorsWriter writer = getInstance(fieldInfo);
+  return writer.addField(fieldInfo);
+}
+
+@Override
+public void flush(int maxDoc, Sorter.DocMap sortMap) throws IOException {
+  for (WriterAndSuffix was : formats.values()) {
+was.writer.flush(maxDoc, sortMap);
+  }
+}
+
+@Override
+public void mergeOneField(FieldInfo fieldInfo, KnnVectorsReader 
knnVectorsReader)
 throws IOException {
-  getInstance(fieldInfo).writeField(fieldInfo, knnVectorsReader);
+  getInstance(fieldInfo).mergeOneField(fieldInfo, knnVectorsReader);

Review Comment:
   Small comment, maybe we can throw an `UnsupportedOperationException` here 
because we expect it never to be called?



##
lucene/core/src/java/org/apache/lucene/codecs/lucene93/Lucene93HnswVectorsWriter.java:
##
@@ -266,65 +470,128 @@ private void writeMeta(
 }
   }
 
-  private OnHeapHnswGraph writeGraph(
-  RandomAccessVectorValuesProducer vectorValues, VectorSimilarityFunction 
similarityFunction)
+  /**
+   * Writes the vector values to the output and returns a set of documents 
that contains vectors.
+   */
+  private static DocsWithFieldSet writeVectorData(IndexOutput output, 
VectorValues vectors)
   throws IOException {
+DocsWithFieldSet docsWithField = new DocsWithFieldSet();
+for (int docV = vectors.nextDoc(); docV != NO_MORE_DOCS; docV = 
vectors.nextDoc()) {
+  // write vector
+  BytesRef binaryValue = vectors.binaryValue();
+  assert binaryValue.length == vectors.dimension() * Float.BYTES;
+  output.writeBytes(binaryValue.bytes, binaryValue.offset, 
binaryValue.length);
+  docsWithField.add(docV);
+}
+return docsWithField;
+  }
 
-// build graph
-HnswGraphBuilder hnswGraphBuilder =
-new HnswGraphBuilder(
-vectorValues, similarityFunction, M, beamWidth, 
HnswGraphBuilder.randSeed);
-hnswGraphBuilder.setInfoStream(segmentWriteState.infoStream);
-OnHeapHnswGraph graph = 
hnswGraphBuilder.build(vectorValues.randomAccess());
+  @Override
+  public void close() throws IOException {
+IOUtils.close(meta, vectorData, vectorIndex);
+  }
 
-// write vectors' neighbours on each level into the vectorIndex file
-int countOnLevel0 = graph.size();
-for (int level = 0; level < graph.numLevels(); level++) {
-  int maxConnOnLevel = level == 0 ? (M * 2) : M;
-  NodesIterator nodesOnLevel = graph.getNodesOnLevel(level);
-  while (nodesOnLevel.hasNext()) {
-int node = nodesOnLevel.nextInt();
-NeighborArray neighbors = graph.getNeighbors(level, node);
-int size = neighbors.size();
-vectorIndex.writeInt(size);
-// Destructively modify; it's ok we are discarding it after this
-int[] nnodes = neighbors.node();
-Arrays.sort(nnodes, 0, size);
-for (int i = 0; i < size; i++) {
-  int nnode = nnodes[i];
-  assert nnode < countOnLevel0 : "node too large: " + nnode + ">=" + 
countOnLevel0;
-  vectorIndex.writeInt(nnode);
-}
-// if number of connections < maxConn, add bogus values up to maxConn 
to have predictable
-// offsets
-for (int i = size; i < maxConnOnLevel; i++) {
-  vectorIndex.writeInt(0);
-}
+  private static class FieldData extends KnnFieldVectorsWriter {

Review Comment:
   Small comment, we could rename this to `FieldWriter` now since that's its 
purpose.



##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java:
##
@@ -24,28 +24,40 @@
 import org.apache.lucene.index.DocIDMerger;
 import org.apache.lucene.index.FieldInfo;
 import org.apache.lucene.index.MergeState;
+import org.apache.lucene.index.Sorter;
 import org.apache.lucene.index.VectorValues;
 import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.util.Accountable;
 import org.apache.lucene.util.Bits;
 import org.apache.lucene.util.BytesRef;
 
 /** Writes vectors to an index. */
-public abstract class KnnVectorsWriter implements Closeable {
+public abstract class KnnVectorsWriter implements Accountable, Closeable {
 
   /** Sole constructor */
   protected KnnVectorsWriter() {}
 
-  /** Write all values contained in the provided reader */
-  public abstract void writeField(FieldInfo fieldInfo, KnnVectorsReader 
knnVectorsReader)
+  /** Add new field for indexing */
+  public abstract void addField(FieldInfo fieldInfo) throws IOException;
+
+  /** Add new docID with

[GitHub] [lucene] mayya-sharipova commented on pull request #992: LUCENE-10592 Build HNSW Graph on indexing

2022-07-07 Thread GitBox


mayya-sharipova commented on PR #992:
URL: https://github.com/apache/lucene/pull/992#issuecomment-1178060346

   @jtibshirani Thanks for another set of comments, I will work on addressing 
them.
   
   
   Meanwhile, I have run another set of benchmarks on a different dataset: 
sift-128-euclidean, M:16, efConstruction:100.
   Similar results were observed here:
   
   - whole indexing + flush takes approximately the same time (533 sec in 
baseline vs 538 sec in candidate)
   - baseline: indexing is fast, but flush takes 532 sec
   - candidate: indexing takes most of the time, and flush is very fast - 1.8 sec
   
   ### Baseline (main branch): 
   ```bash
   IW 0 [2022-07-07T18:27:08.982483Z; main]: MMapDirectory.UNMAP_SUPPORTED=true
   Done indexing 100 documents; now flush
   IW 0 [2022-07-07T18:27:09.935570Z; main]: now flush at close
   IW 0 [2022-07-07T18:27:09.936155Z; main]:   start flush: applyAllDeletes=true
   IW 0 [2022-07-07T18:27:09.936850Z; main]:   index before flush
   DW 0 [2022-07-07T18:27:09.936917Z; main]: startFullFlush
   DW 0 [2022-07-07T18:27:09.941606Z; main]: anyChanges? numDocsInRam=100 
deletes=false hasTickets:false pendingChangesInFullFlush: false
   DWPT 0 [2022-07-07T18:27:09.951278Z; main]: flush postings as segment _1 
numDocs=100
   IW 0 [2022-07-07T18:27:09.952530Z; main]: 0 msec to write norms
   IW 0 [2022-07-07T18:27:09.952902Z; main]: 0 msec to write docValues
   IW 0 [2022-07-07T18:27:09.953073Z; main]: 0 msec to write points
   HNSW 0 [2022-07-07T18:27:11.094024Z; main]: build graph from 100 vectors
   
   HNSW 0 [2022-07-07T18:35:55.150931Z; main]: built 99 in 6450/524148 ms
   IW 0 [2022-07-07T18:36:01.320864Z; main]: 531459 msec to write vectors
   IW 0 [2022-07-07T18:36:01.336914Z; main]: 15 msec to finish stored fields
   IW 0 [2022-07-07T18:36:01.337204Z; main]: 0 msec to write postings and 
finish vectors
   IW 0 [2022-07-07T18:36:01.337924Z; main]: 0 msec to write fieldInfos
   
   DWPT 0 [2022-07-07T18:36:02.197589Z; main]: flush time 532338.523458 msec
   Indexed 100 documents in 533s
   ```
   
   ### Candidate (this PR with the changes so far): 
   
   ```bash
   IW 0 [2022-07-07T17:44:01.642762Z; main]: MMapDirectory.UNMAP_SUPPORTED=true
   Done indexing 100 documents; now flush
   IW 0 [2022-07-07T17:52:58.049830Z; main]: now flush at close
   IW 0 [2022-07-07T17:52:58.050277Z; main]:   start flush: applyAllDeletes=true
   IW 0 [2022-07-07T17:52:58.050726Z; main]:   index before flush
   DW 0 [2022-07-07T17:52:58.050776Z; main]: startFullFlush
   DW 0 [2022-07-07T17:52:58.056958Z; main]: anyChanges? numDocsInRam=100 
deletes=false hasTickets:false pendingChangesInFullFlush: false
   DWPT 0 [2022-07-07T17:52:58.066937Z; main]: flush postings as segment _0 
numDocs=100
   IW 0 [2022-07-07T17:52:58.068554Z; main]: 0 msec to write norms
   IW 0 [2022-07-07T17:52:58.068864Z; main]: 0 msec to write docValues
   IW 0 [2022-07-07T17:52:58.068958Z; main]: 0 msec to write points
   IW 0 [2022-07-07T17:52:59.017719Z; main]: 947 msec to write vectors
   IW 0 [2022-07-07T17:52:59.038544Z; main]: 19 msec to finish stored fields
   IW 0 [2022-07-07T17:52:59.039281Z; main]: 0 msec to write postings and 
finish vectors
   IW 0 [2022-07-07T17:52:59.043069Z; main]: 3 msec to write fieldInfos
   
   DWPT 0 [2022-07-07T17:52:59.915562Z; main]: flush time 1848.19675 msec
   Indexed 100 documents in 538s
   ```





[jira] [Comment Edited] (LUCENE-10194) Should IndexWriter buffer KNN vectors on disk?

2022-07-07 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563919#comment-17563919
 ] 

Julie Tibshirani edited comment on LUCENE-10194 at 7/7/22 6:48 PM:
---

[~mayya] [~jpountz] can we close this since we've decided to go ahead with 
LUCENE-10592 ?


was (Author: julietibs):
[~mayya] [~jpountz] can we close this since we've decided to go ahead with 
https://issues.apache.org/jira/browse/LUCENE-10592 ?

> Should IndexWriter buffer KNN vectors on disk?
> --
>
> Key: LUCENE-10194
> URL: https://issues.apache.org/jira/browse/LUCENE-10194
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Mayya Sharipova
>Priority: Minor
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> VectorValuesWriter buffers data in memory, like we do for all data structures 
> that are computed on flush. But I wonder if this is the right trade-off.
> The use-case I have in mind is someone trying to load a dataset of vectors in 
> Lucene. Given that HNSW graphs are super expensive to create, we'd ideally 
> load that dataset into a single segment rather than many small segments that 
> then need to be merged together, which in-turn re-creates the HNSW graph.
> Yet buffering vectors in memory is expensive. For instance assuming 256 
> dimensions, each vector consumes 1kB of memory. Should we consider buffering 
> vectors on disk to reduce chances of having to create new segments only 
> because the RAM buffer is full?






[jira] [Commented] (LUCENE-10194) Should IndexWriter buffer KNN vectors on disk?

2022-07-07 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563919#comment-17563919
 ] 

Julie Tibshirani commented on LUCENE-10194:
---

[~mayya] [~jpountz] can we close this since we've decided to go ahead with 
https://issues.apache.org/jira/browse/LUCENE-10592 ?

> Should IndexWriter buffer KNN vectors on disk?
> --
>
> Key: LUCENE-10194
> URL: https://issues.apache.org/jira/browse/LUCENE-10194
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Mayya Sharipova
>Priority: Minor
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> VectorValuesWriter buffers data in memory, like we do for all data structures 
> that are computed on flush. But I wonder if this is the right trade-off.
> The use-case I have in mind is someone trying to load a dataset of vectors in 
> Lucene. Given that HNSW graphs are super expensive to create, we'd ideally 
> load that dataset into a single segment rather than many small segments that 
> then need to be merged together, which in-turn re-creates the HNSW graph.
> Yet buffering vectors in memory is expensive. For instance assuming 256 
> dimensions, each vector consumes 1kB of memory. Should we consider buffering 
> vectors on disk to reduce chances of having to create new segments only 
> because the RAM buffer is full?






[GitHub] [lucene] mayya-sharipova closed pull request #728: LUCENE-10194 Buffer KNN vectors on disk

2022-07-07 Thread GitBox


mayya-sharipova closed pull request #728: LUCENE-10194 Buffer KNN vectors on 
disk
URL: https://github.com/apache/lucene/pull/728





[GitHub] [lucene] mayya-sharipova commented on pull request #728: LUCENE-10194 Buffer KNN vectors on disk

2022-07-07 Thread GitBox


mayya-sharipova commented on PR #728:
URL: https://github.com/apache/lucene/pull/728#issuecomment-1178089438

   Closing this PR in favour of 
[alternative](https://github.com/apache/lucene/pull/992)





[jira] [Resolved] (LUCENE-10194) Should IndexWriter buffer KNN vectors on disk?

2022-07-07 Thread Mayya Sharipova (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayya Sharipova resolved LUCENE-10194.
--
Resolution: Won't Fix

> Should IndexWriter buffer KNN vectors on disk?
> --
>
> Key: LUCENE-10194
> URL: https://issues.apache.org/jira/browse/LUCENE-10194
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Mayya Sharipova
>Priority: Minor
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> VectorValuesWriter buffers data in memory, like we do for all data structures 
> that are computed on flush. But I wonder if this is the right trade-off.
> The use-case I have in mind is someone trying to load a dataset of vectors in 
> Lucene. Given that HNSW graphs are super expensive to create, we'd ideally 
> load that dataset into a single segment rather than many small segments that 
> then need to be merged together, which in-turn re-creates the HNSW graph.
> Yet buffering vectors in memory is expensive. For instance assuming 256 
> dimensions, each vector consumes 1kB of memory. Should we consider buffering 
> vectors on disk to reduce chances of having to create new segments only 
> because the RAM buffer is full?






[jira] [Commented] (LUCENE-10194) Should IndexWriter buffer KNN vectors on disk?

2022-07-07 Thread Mayya Sharipova (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563924#comment-17563924
 ] 

Mayya Sharipova commented on LUCENE-10194:
--

+ 1 for closing.

I've closed the corresponding PR as well.

> Should IndexWriter buffer KNN vectors on disk?
> --
>
> Key: LUCENE-10194
> URL: https://issues.apache.org/jira/browse/LUCENE-10194
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Mayya Sharipova
>Priority: Minor
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> VectorValuesWriter buffers data in memory, like we do for all data structures 
> that are computed on flush. But I wonder if this is the right trade-off.
> The use-case I have in mind is someone trying to load a dataset of vectors in 
> Lucene. Given that HNSW graphs are super expensive to create, we'd ideally 
> load that dataset into a single segment rather than many small segments that 
> then need to be merged together, which in-turn re-creates the HNSW graph.
> Yet buffering vectors in memory is expensive. For instance assuming 256 
> dimensions, each vector consumes 1kB of memory. Should we consider buffering 
> vectors on disk to reduce chances of having to create new segments only 
> because the RAM buffer is full?






[jira] [Closed] (LUCENE-10194) Should IndexWriter buffer KNN vectors on disk?

2022-07-07 Thread Mayya Sharipova (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayya Sharipova closed LUCENE-10194.


> Should IndexWriter buffer KNN vectors on disk?
> --
>
> Key: LUCENE-10194
> URL: https://issues.apache.org/jira/browse/LUCENE-10194
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Mayya Sharipova
>Priority: Minor
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> VectorValuesWriter buffers data in memory, like we do for all data structures 
> that are computed on flush. But I wonder if this is the right trade-off.
> The use-case I have in mind is someone trying to load a dataset of vectors in 
> Lucene. Given that HNSW graphs are super expensive to create, we'd ideally 
> load that dataset into a single segment rather than many small segments that 
> then need to be merged together, which in-turn re-creates the HNSW graph.
> Yet buffering vectors in memory is expensive. For instance assuming 256 
> dimensions, each vector consumes 1kB of memory. Should we consider buffering 
> vectors on disk to reduce chances of having to create new segments only 
> because the RAM buffer is full?






[GitHub] [lucene] gsmiller commented on a diff in pull request #974: LUCENE-10614: Properly support getTopChildren in RangeFacetCounts

2022-07-07 Thread GitBox


gsmiller commented on code in PR #974:
URL: https://github.com/apache/lucene/pull/974#discussion_r916216255


##
lucene/demo/src/java/org/apache/lucene/demo/facet/RangeFacetsExample.java:
##
@@ -73,6 +76,35 @@ public void index() throws IOException {
   indexWriter.addDocument(doc);
 }
 
+// Add documents with a fake timestamp, 3600 sec (1 hour) after "now", 
7200 sec (2
+// hours) after "now", ...:
+long startTime = 0;
+// Index error messages since a week (24 * 7 = 168 hours) ago
+for (int i = 0; i < 168; i++) {
+  long endTime = startTime + (i + 1) * 3600;
+
+  // Choose a relatively larger number, e,g., "35", in order to create 
variation in count for
+  // the top-n children, so that getTopChildren(10) in the 
searchTopChildren functionality
+  // can return children with different counts
+  for (int j = 0; j < i % 35; j++) {
+Document doc = new Document();
+// index document at a different timestamp by using endTime - i * j

Review Comment:
   Sorry, I'm sure what you're doing is really obvious to you, but it's just 
confusing to me. I find myself really stuck on things like `endTime - i * j`, 
or `i % 35` as a way to generate different numbers of log events within an hour 
block. What's wrong with just using `Random`? Would that just make it 
impossible to test? Sorry to be a pain with this, but if I were a user just 
trying to understand range faceting and I looked at this code, I'd be spending 
all my time just trying to figure out what we're trying to simulate here 
instead of understanding faceting. There has to be a simpler way, right?
   
   As a suggestion, maybe we create a separate Jira issue to add a top-n range 
faceting example and revert out this work for now? That would let us get the 
actual change merged in the meantime.



##
lucene/demo/src/java/org/apache/lucene/demo/facet/RangeFacetsExample.java:
##
@@ -73,6 +76,35 @@ public void index() throws IOException {
   indexWriter.addDocument(doc);
 }
 
+// Add documents with a fake timestamp, 3600 sec (1 hour) after "now", 
7200 sec (2
+// hours) after "now", ...:
+long startTime = 0;
+// Index error messages since a week (24 * 7 = 168 hours) ago
+for (int i = 0; i < 168; i++) {
+  long endTime = startTime + (i + 1) * 3600;
+
+  // Choose a relatively larger number, e,g., "35", in order to create 
variation in count for
+  // the top-n children, so that getTopChildren(10) in the 
searchTopChildren functionality
+  // can return children with different counts
+  for (int j = 0; j < i % 35; j++) {
+Document doc = new Document();
+// index document at a different timestamp by using endTime - i * j
+doc.add(new NumericDocValuesField("error log", endTime - i * j));

Review Comment:
   Maybe "error timestamp" would be a better name?






[GitHub] [lucene] dnhatn opened a new pull request, #1009: LUCENE-10563: Fix CHANGES list

2022-07-07 Thread GitBox


dnhatn opened a new pull request, #1009:
URL: https://github.com/apache/lucene/pull/1009

   The CHANGES of 10.0 were accidentally merged into 9x CHANGES in 
https://github.com/apache/lucene/commit/b7231bb54884f9ce0232430c4a60cdb5753c6b82.





[GitHub] [lucene] dnhatn commented on pull request #1009: LUCENE-10563: Fix CHANGES list

2022-07-07 Thread GitBox


dnhatn commented on PR #1009:
URL: https://github.com/apache/lucene/pull/1009#issuecomment-1178274830

   @gsmiller Thanks for review.





[GitHub] [lucene] dnhatn merged pull request #1009: LUCENE-10563: Fix CHANGES list

2022-07-07 Thread GitBox


dnhatn merged PR #1009:
URL: https://github.com/apache/lucene/pull/1009





[jira] [Commented] (LUCENE-10563) Unable to Tessellate polygon

2022-07-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563996#comment-17563996
 ] 

ASF subversion and git services commented on LUCENE-10563:
--

Commit 8926732a32823be168267fe2ed39eb804d1030f1 in lucene's branch 
refs/heads/branch_9x from Nhat Nguyen
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8926732a328 ]

LUCENE-10563: Fix CHANGES list (#1009)

The CHANGES of 10.0 were accidentally merged into 9x CHANGES in b7231bb.

> Unable to Tessellate polygon
> 
>
> Key: LUCENE-10563
> URL: https://issues.apache.org/jira/browse/LUCENE-10563
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 9.1
>Reporter: Yixun Xu
>Assignee: Ignacio Vera
>Priority: Major
> Fix For: 9.3
>
> Attachments: polygon-1.json, polygon-2.json, polygon-3.json
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Following up to LUCENE-10470, I found some more polygons that cause 
> {{Tessellator.tessellate}} to throw "Unable to Tessellate shape", which are 
> not covered by the fix to LUCENE-10470. I attached the geojson of 3 failing 
> shapes that I got, and this is the 
> [branch|https://github.com/apache/lucene/compare/main...yixunx:yx/reproduce-tessellator-error?expand=1#diff-5e8e8052af8b8618e7e4325b7d69def4d562a356acbfea3e983198327c7c8d18R17-R19]
>  I am testing on that demonstrates the tessellation failures. 
>  
> [^polygon-1.json]
> [^polygon-2.json]
> [^polygon-3.json]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] Yuti-G commented on a diff in pull request #974: LUCENE-10614: Properly support getTopChildren in RangeFacetCounts

2022-07-07 Thread GitBox


Yuti-G commented on code in PR #974:
URL: https://github.com/apache/lucene/pull/974#discussion_r916342857


##
lucene/demo/src/java/org/apache/lucene/demo/facet/RangeFacetsExample.java:
##
@@ -73,6 +76,35 @@ public void index() throws IOException {
   indexWriter.addDocument(doc);
 }
 
+// Add documents with a fake timestamp, 3600 sec (1 hour) after "now", 
7200 sec (2
+// hours) after "now", ...:
+long startTime = 0;
+// Index error messages since a week (24 * 7 = 168 hours) ago
+for (int i = 0; i < 168; i++) {
+  long endTime = startTime + (i + 1) * 3600;
+
+  // Choose a relatively larger number, e.g., "35", in order to create 
variation in count for
+  // the top-n children, so that getTopChildren(10) in the 
searchTopChildren functionality
+  // can return children with different counts
+  for (int j = 0; j < i % 35; j++) {
+Document doc = new Document();
+// index document at a different timestamp by using endTime - i * j

Review Comment:
   Using `Random` does add some complexity for testing, and I was trying to 
keep it as simple as the current example, but sorry that caused confusion. I 
will create a separate issue to add a top-n range faceting example after this 
PR is merged, and will try to use Random and add clear comments to the example 
code. Thanks! 
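As a stdlib-only sketch of the variation a `Random`-based example could produce (hypothetical class and method names, not the actual RangeFacetsExample code): timestamps drawn from a seeded `Random` spread documents unevenly across hourly buckets, so a top-n query over the resulting ranges sees children with distinct counts.

```java
import java.util.Random;
import java.util.TreeMap;

// Stdlib-only sketch (hypothetical names, not RangeFacetsExample itself):
// draw document timestamps from a seeded Random over the past week, then
// count how many land in each hourly bucket. The uneven per-bucket counts
// are what a getTopChildren(10)-style example needs to be interesting.
public class RandomTimestampSketch {
  static TreeMap<Long, Integer> bucketCounts(long nowSec, int docs, long seed) {
    Random random = new Random(seed); // fixed seed keeps the example reproducible
    TreeMap<Long, Integer> countsPerHour = new TreeMap<>();
    for (int i = 0; i < docs; i++) {
      // a timestamp up to 168 hours (one week) before "now", in seconds
      long ts = nowSec - (long) (random.nextDouble() * 168 * 3600);
      long hourBucket = (nowSec - ts) / 3600; // 0 = the most recent hour
      countsPerHour.merge(hourBucket, 1, Integer::sum);
    }
    return countsPerHour;
  }

  public static void main(String[] args) {
    TreeMap<Long, Integer> counts = bucketCounts(2_000_000_000L, 1_000, 42L);
    // Buckets carry varied counts, so the top-10 buckets are not all tied.
    System.out.println(counts.size() + " hourly buckets");
  }
}
```

A fixed seed keeps the demo deterministic while still giving the count variation the reviewer asked for.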






[GitHub] [lucene-jira-archive] dweiss commented on issue #8: Set assignee field for issues if the account mapping is given

2022-07-07 Thread GitBox


dweiss commented on issue #8:
URL: 
https://github.com/apache/lucene-jira-archive/issues/8#issuecomment-1178379951

   Excellent!





[GitHub] [lucene] gsmiller opened a new pull request, #1010: Specialize ordinal encoding for SortedSetDocValues

2022-07-07 Thread GitBox


gsmiller opened a new pull request, #1010:
URL: https://github.com/apache/lucene/pull/1010

   ### Description (or a Jira issue link if you have one)
   
   This follows up the work done in LUCENE-10067 by adding additional 
specialization for SORTED_SET doc values.





[GitHub] [lucene] gsmiller commented on pull request #1010: Specialize ordinal encoding for SortedSetDocValues

2022-07-07 Thread GitBox


gsmiller commented on PR #1010:
URL: https://github.com/apache/lucene/pull/1010#issuecomment-1178381324

   Benchmarks look good on SSDV faceting (and no regressions elsewhere). I 
think some new bench tasks have recently been added as well that might be 
relevant here, so I'll update my luceneutil and run again soon. For now, here 
are results on `wikimediumall`:
   
   ```
   TaskQPS baseline  StdDevQPS candidate  
StdDevPct diff p-value
Prefix3   58.57  (8.6%)   57.65  
(9.9%)   -1.6% ( -18% -   18%) 0.591
  HighTermMonthSort   47.72 (28.0%)   47.26 
(15.8%)   -1.0% ( -34% -   59%) 0.893
BrowseRandomLabelSSDVFacets2.60  (6.8%)2.58  
(5.6%)   -0.7% ( -12% -   12%) 0.729
   HighSpanNear   17.35  (2.6%)   17.23  
(4.2%)   -0.7% (  -7% -6%) 0.544
  OrHighNotHigh  762.36  (4.2%)  757.77  
(4.8%)   -0.6% (  -9% -8%) 0.673
   Wildcard   27.05  (5.4%)   26.94  
(5.9%)   -0.4% ( -11% -   11%) 0.820
  LowPhrase   35.88  (2.7%)   35.80  
(2.7%)   -0.2% (  -5% -5%) 0.788
  OrNotHighHigh  645.25  (3.1%)  644.30  
(3.3%)   -0.1% (  -6% -6%) 0.884
LowTerm 1793.47  (3.6%) 1792.45  
(3.8%)   -0.1% (  -7% -7%) 0.961
   OrNotHighMed  653.99  (3.1%)  653.73  
(3.2%)   -0.0% (  -6% -6%) 0.968
 AndHighMed   68.77  (5.3%)   68.75  
(6.5%)   -0.0% ( -11% -   12%) 0.986
LowIntervalsOrdered   51.08  (4.6%)   51.08  
(4.5%)0.0% (  -8% -9%) 1.000
  MedPhrase   70.46  (3.1%)   70.46  
(3.3%)0.0% (  -6% -6%) 0.995
   OrHighNotLow 1055.73  (3.3%) 1055.91  
(4.4%)0.0% (  -7% -7%) 0.989
   HighIntervalsOrdered8.03  (4.5%)8.03  
(4.6%)0.0% (  -8% -9%) 0.984
MedSpanNear   11.88  (2.4%)   11.89  
(3.1%)0.1% (  -5% -5%) 0.926
   MedTermDayTaxoFacets   18.17  (3.6%)   18.20  
(3.8%)0.2% (  -7% -7%) 0.891
   OrHighNotMed  780.58  (3.3%)  781.92  
(3.8%)0.2% (  -6% -7%) 0.877
 OrHighMedDayTaxoFacets4.78  (4.4%)4.79  
(5.0%)0.2% (  -8% -9%) 0.906
   AndHighHighDayTaxoFacets6.91  (2.3%)6.92  
(2.9%)0.2% (  -4% -5%) 0.828
MedIntervalsOrdered4.36  (3.5%)4.37  
(3.7%)0.2% (  -6% -7%) 0.851
 OrHighHigh   14.24  (2.8%)   14.27  
(6.4%)0.3% (  -8% -9%) 0.872
 IntNRQ   33.94  (1.1%)   34.05  
(1.3%)0.3% (  -2% -2%) 0.381
 Fuzzy2   71.29  (1.7%)   71.55  
(1.8%)0.4% (  -3% -3%) 0.509
LowSpanNear8.79  (2.7%)8.83  
(3.2%)0.4% (  -5% -6%) 0.673
 Fuzzy1   76.55  (1.7%)   76.90  
(1.6%)0.5% (  -2% -3%) 0.377
 AndHighLow 1077.25  (4.1%) 1082.31  
(3.8%)0.5% (  -7% -8%) 0.706
  BrowseDayOfYearSSDVFacets3.45  (5.8%)3.47  
(4.9%)0.6% (  -9% -   11%) 0.722
LowSloppyPhrase   16.00  (1.9%)   16.10  
(2.9%)0.6% (  -4% -5%) 0.437
  OrHighMed   47.78  (2.0%)   48.08  
(3.9%)0.6% (  -5% -6%) 0.527
   HighTerm 1147.74  (4.7%) 1155.05  
(4.2%)0.6% (  -7% -9%) 0.650
   PKLookup  147.38  (3.6%)  148.34  
(3.1%)0.7% (  -5% -7%) 0.537
AndHighHigh   19.16  (5.1%)   19.29  
(6.8%)0.7% ( -10% -   13%) 0.730
Respell   51.81  (1.5%)   52.15  
(1.4%)0.7% (  -2% -3%) 0.147
MedTerm 1406.90  (4.4%) 1417.58  
(4.2%)0.8% (  -7% -9%) 0.578
MedSloppyPhrase   28.98  (2.0%)   29.20  
(2.8%)0.8% (  -3% -5%) 0.306
AndHighMedDayTaxoFacets   23.12  (2.0%)   23.31  
(2.3%)0.8% (  -3% -5%) 0.232
 TermDTSort   78.14 (20.6%)   78.83 
(20.1%)0.9% ( -32% -   52%) 0.891
 HighPhrase  180.25  (2.8%)  182.18  
(2.6%)1.1% (  -4% -6%) 0.215

[GitHub] [lucene-jira-archive] mocobeta closed issue #8: Set assignee field for issues if the account mapping is given

2022-07-07 Thread GitBox


mocobeta closed issue #8: Set assignee field for issues if the account mapping 
is given
URL: https://github.com/apache/lucene-jira-archive/issues/8





[jira] [Resolved] (LUCENE-10600) SortedSetDocValues#docValueCount should be an int, not long

2022-07-07 Thread Lu Xugang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Xugang resolved LUCENE-10600.

Resolution: Fixed

> SortedSetDocValues#docValueCount should be an int, not long
> ---
>
> Key: LUCENE-10600
> URL: https://issues.apache.org/jira/browse/LUCENE-10600
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Assignee: Lu Xugang
>Priority: Minor
>  Time Spent: 40m
>  Remaining Estimate: 0h
>







[jira] [Comment Edited] (LUCENE-10627) Using CompositeByteBuf to Reduce Memory Copy

2022-07-07 Thread LuYunCheng (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563044#comment-17563044
 ] 

LuYunCheng edited comment on LUCENE-10627 at 7/8/22 5:33 AM:
-

[~jpountz], [~uschindler]  Hi, I am trying to reuse ByteBuffersDataInput to 
reduce memory copies, since it can be obtained from 
ByteBuffersDataOutput.toDataInput(), and it reduces this complexity ([latest 
commit|https://github.com/luyuncheng/lucene/pull/1], 
[PR|https://github.com/apache/lucene/pull/987])

But I am not sure whether we can change the Compressor interface's compress 
input parameter from byte[] to ByteBuffersDataInput. Changing this interface 
[like|https://github.com/apache/lucene/blob/382962f22df3ee3af3fb538b877c98d61a622ddb/lucene/core/src/java/org/apache/lucene/codecs/compressing/Compressor.java#L35]
 increases the backport code 
[like|https://github.com/apache/lucene/blob/382962f22df3ee3af3fb538b877c98d61a622ddb/lucene/core/src/java/org/apache/lucene/codecs/compressing/CompressionMode.java#L274];
 however, if we change the interface to ByteBuffersDataInput, we can optimize 
memory copies inside each compression algorithm's code.

Also, I found we can reduce more memory copies in 
*{{CompressingStoredFieldsWriter.copyOneDoc}} 
[like|https://github.com/apache/lucene/blob/382962f22df3ee3af3fb538b877c98d61a622ddb/lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsWriter.java#L516]
 and {{CompressingTermVectorsWriter.flush}}*

Since this commit just reduces memory copies, we should not only use the 
benchmark time metric but also look at JVM GC time to see the improvement, so 
I added StatisticsHelper to StoredFieldsBenchmark 
([code|https://github.com/luyuncheng/luceneutil/commit/e77c7c7bff01bb036b1826e7ec5d46ad7ed5666d])

So, as of the latest commit:
 # use ByteBuffersDataInput to reduce memory copies in 
{{CompressingStoredFieldsWriter}} doing {{flush}}
 # use ByteBuffersDataInput to reduce memory copies in 
{{CompressingTermVectorsWriter}} doing {{flush}}
 # use ByteBuffer to *reduce memory copies* in 
*{{CompressingStoredFieldsWriter}} doing {{copyOneDoc}}*
 # replace the Compressor interface parameter from byte[] to ByteBuffersDataInput

 

Running runStoredFieldsBenchmark with the JVM StatisticsHelper shows the 
following:
||Msec to index||BEST_SPEED||BEST_SPEED YGC||BEST_COMPRESSION||BEST_COMPRESSION YGC||
|Baseline|317973|1176 ms (258 collections)|605492|1476 ms (264 collections)|
|Candidate|314765|1012 ms (238 collections)|601253|1175 ms (234 collections)|


 



[jira] [Updated] (LUCENE-10627) Using CompositeByteBuf to Reduce Memory Copy

2022-07-07 Thread LuYunCheng (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LuYunCheng updated LUCENE-10627:

Description: 
Code: [https://github.com/apache/lucene/pull/987]

I see that when Lucene does flush and merge of stored fields, it needs many 
memory copies:
{code:java}
Lucene Merge Thread #25940]" #906546 daemon prio=5 os_prio=0 cpu=20503.95ms 
elapsed=68.76s tid=0x7ee990002c50 nid=0x3aac54 runnable  
[0x7f17718db000]
   java.lang.Thread.State: RUNNABLE
    at 
org.apache.lucene.store.ByteBuffersDataOutput.toArrayCopy(ByteBuffersDataOutput.java:271)
    at 
org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.flush(CompressingStoredFieldsWriter.java:239)
    at 
org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.finishDocument(CompressingStoredFieldsWriter.java:169)
    at 
org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:654)
    at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:228)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4760)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4364)
    at 
org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5923)
    at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
    at 
org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:100)
    at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:682)
 {code}
When Lucene's *CompressingStoredFieldsWriter* flushes documents, it needs many 
memory copies:

With Lucene90 using {*}LZ4WithPresetDictCompressionMode{*}:
 # bufferedDocs.toArrayCopy copies blocks into one continuous buffer for 
chunk compression
 # the compressor copies the dict and data into one block buffer
 # compress
 # copy the compressed data out

With Lucene90 using {*}DeflateWithPresetDictCompressionMode{*}:
 # bufferedDocs.toArrayCopy copies blocks into one continuous buffer for 
chunk compression
 # compress
 # copy the compressed data out

 

I think we can use -CompositeByteBuf- to reduce temp memory copies:
 # we do not have to call *bufferedDocs.toArrayCopy* when we just need a 
continuous buffer for chunk compression

 

I wrote a simple mini benchmark in test code ([link 
|https://github.com/apache/lucene/blob/5a406a5c483c7fadaf0e8a5f06732c79ad174d11/lucene/core/src/test/org/apache/lucene/codecs/lucene90/compressing/TestCompressingStoredFieldsFormat.java#L353]):
*LZ4WithPresetDict run* Capacity:41943040(bytes), iter 10 times: Origin 
elapse:5391ms, New elapse:5297ms
*DeflateWithPresetDict run* Capacity:41943040(bytes), iter 10 times: Origin 
elapse:{*}115ms{*}, New elapse:{*}12ms{*}
 
And I ran runStoredFieldsBenchmark with doc_limit=-1; it shows:
||Msec to index||BEST_SPEED ||BEST_COMPRESSION||
|Baseline|318877.00|606288.00|
|Candidate|314442.00|604719.00|

 

---UPDATE---

 

 I am trying to *reuse ByteBuffersDataInput* to reduce memory copies, since it 
can be obtained from ByteBuffersDataOutput.toDataInput(), and it reduces this 
complexity ([PR|https://github.com/apache/lucene/pull/987])

But I am not sure whether we can change the Compressor interface's compress 
input parameter from byte[] to ByteBuffersDataInput. Changing this interface 
[like|https://github.com/apache/lucene/blob/382962f22df3ee3af3fb538b877c98d61a622ddb/lucene/core/src/java/org/apache/lucene/codecs/compressing/Compressor.java#L35]
 increases the backport code 
[like|https://github.com/apache/lucene/blob/382962f22df3ee3af3fb538b877c98d61a622ddb/lucene/core/src/java/org/apache/lucene/codecs/compressing/CompressionMode.java#L274];
 however, if we change the interface to ByteBuffersDataInput, we can optimize 
memory copies inside each compression algorithm's code.

Also, I found we can reduce more memory copies in 
*{{CompressingStoredFieldsWriter.copyOneDoc}} 
[like|https://github.com/apache/lucene/blob/382962f22df3ee3af3fb538b877c98d61a622ddb/lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsWriter.java#L516]
 and {{CompressingTermVectorsWriter.flush}}*

Since this commit just reduces memory copies, we should not only use the 
benchmark time metric but also look at JVM GC time to see the improvement, so 
I added StatisticsHelper to StoredFieldsBenchmark 
([code|https://github.com/luyuncheng/luceneutil/commit/e77c7c7bff01bb036b1826e7ec5d46ad7ed5666d])

so at latest commit:
 # using ByteBuffersDataInput to reduce memory copy in 
{{CompressingStoredFieldsWriter}} doing {{flush}}
 # using ByteBuffersDataInput to reduce memory copy in 
{{CompressingTermVectorsWriter}} doing {{flush}}
 # using ByteBuffer to *reduce memory copy* in 
*{{CompressingStoredFieldsWriter}} doing {

[jira] [Commented] (LUCENE-10627) Using CompositeByteBuf to Reduce Memory Copy

2022-07-07 Thread LuYunCheng (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564086#comment-17564086
 ] 

LuYunCheng commented on LUCENE-10627:
-

[~rcmuir]  Hi, in the latest commit I *reuse ByteBuffersDataInput* to reduce 
memory copies, since it can be obtained from 
ByteBuffersDataOutput.toDataInput() directly, and it reduces the code 
complexity. 
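The copy-avoidance idea in this thread can be illustrated with a stdlib-only analogy (hypothetical class, not Lucene's actual ByteBuffersDataOutput/ByteBuffersDataInput API): instead of flattening buffered blocks into one contiguous byte[] before handing them to a consumer, the consumer can walk the block list directly.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Stdlib-only analogy (hypothetical names, not Lucene code): compare the
// "toArrayCopy" style, which materializes one contiguous byte[], with a
// reader that walks the buffer chain directly and never copies.
public class BufferChainDemo {
  // The copying path: allocate one contiguous array and fill it.
  static byte[] flatten(List<ByteBuffer> blocks) {
    int total = 0;
    for (ByteBuffer b : blocks) total += b.remaining();
    byte[] out = new byte[total];
    int pos = 0;
    for (ByteBuffer b : blocks) {
      int n = b.remaining();
      b.duplicate().get(out, pos, n); // duplicate() leaves the source position alone
      pos += n;
    }
    return out;
  }

  // The copy-free path: a cursor over the block list, analogous to reading
  // through a ByteBuffersDataInput instead of a flattened byte[].
  static long checksumWithoutCopy(List<ByteBuffer> blocks) {
    long sum = 0;
    for (ByteBuffer b : blocks) {
      ByteBuffer dup = b.duplicate();
      while (dup.hasRemaining()) sum += dup.get() & 0xFF;
    }
    return sum;
  }

  public static void main(String[] args) {
    List<ByteBuffer> blocks = new ArrayList<>();
    blocks.add(ByteBuffer.wrap(new byte[] {1, 2, 3}));
    blocks.add(ByteBuffer.wrap(new byte[] {4, 5}));
    byte[] flat = flatten(blocks);
    long sum = 0;
    for (byte x : flat) sum += x;
    // Both paths see the same bytes; only the first materializes flat[].
    System.out.println(flat.length + " " + sum + " " + checksumWithoutCopy(blocks)); // prints "5 15 15"
  }
}
```

Both paths observe identical bytes; the second skips the intermediate allocation, which is the saving the GC numbers above are measuring.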

> Using CompositeByteBuf to Reduce Memory Copy
> 
>
> Key: LUCENE-10627
> URL: https://issues.apache.org/jira/browse/LUCENE-10627
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/store
>Reporter: LuYunCheng
>Priority: Major
>
> Code: [https://github.com/apache/lucene/pull/987]
> I see When Lucene Do flush and merge store fields, need many memory copies:
> {code:java}
> Lucene Merge Thread #25940]" #906546 daemon prio=5 os_prio=0 cpu=20503.95ms 
> elapsed=68.76s tid=0x7ee990002c50 nid=0x3aac54 runnable  
> [0x7f17718db000]
>    java.lang.Thread.State: RUNNABLE
>     at 
> org.apache.lucene.store.ByteBuffersDataOutput.toArrayCopy(ByteBuffersDataOutput.java:271)
>     at 
> org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.flush(CompressingStoredFieldsWriter.java:239)
>     at 
> org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.finishDocument(CompressingStoredFieldsWriter.java:169)
>     at 
> org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:654)
>     at 
> org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:228)
>     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105)
>     at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4760)
>     at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4364)
>     at 
> org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5923)
>     at 
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
>     at 
> org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:100)
>     at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:682)
>  {code}
> When Lucene *CompressingStoredFieldsWriter* do flush documents, it needs many 
> memory copies:
> With Lucene90 using {*}LZ4WithPresetDictCompressionMode{*}:
>  # bufferedDocs.toArrayCopy copy blocks into one continue content for chunk 
> compress
>  # compressor copy dict and data into one block buffer
>  # do compress
>  # copy compressed data out
> With Lucene90 using {*}DeflateWithPresetDictCompressionMode{*}:
>  # bufferedDocs.toArrayCopy copy blocks into one continue content for chunk 
> compress
>  # do compress
>  # copy compressed data out
>  
> I think we can use -CompositeByteBuf- to reduce temp memory copies:
>  # we do not have to *bufferedDocs.toArrayCopy* when just need continues 
> content for chunk compress
>  
> I write a simple mini benchmark in test code ([link 
> |https://github.com/apache/lucene/blob/5a406a5c483c7fadaf0e8a5f06732c79ad174d11/lucene/core/src/test/org/apache/lucene/codecs/lucene90/compressing/TestCompressingStoredFieldsFormat.java#L353]):
> *LZ4WithPresetDict run* Capacity:41943040(bytes) , iter 10times: Origin 
> elapse:5391ms , New elapse:5297ms
> *DeflateWithPresetDict run* Capacity:41943040(bytes), iter 10times: Origin 
> elapse:{*}115ms{*}, New elapse:{*}12ms{*}
>  
> And I run runStoredFieldsBenchmark with doc_limit=-1:
> shows:
> ||Msec to index||BEST_SPEED ||BEST_COMPRESSION||
> |Baseline|318877.00|606288.00|
> |Candidate|314442.00|604719.00|
>  
> ---UPDATE---
>  
>  I try to *reuse ByteBuffersDataInput* to reduce memory copy because it can 
> get from ByteBuffersDataOutput.toDataInput.  and it could reduce this 
> complexity ([PR|https://github.com/apache/lucene/pull/987])
> BUT i am not sure whether can change Compressor interface compress input 
> param from byte[] to ByteBuffersDataInput. If change this interface 
> [like|https://github.com/apache/lucene/blob/382962f22df3ee3af3fb538b877c98d61a622ddb/lucene/core/src/java/org/apache/lucene/codecs/compressing/Compressor.java#L35],
>  it increased the backport code 
> [like|https://github.com/apache/lucene/blob/382962f22df3ee3af3fb538b877c98d61a622ddb/lucene/core/src/java/org/apache/lucene/codecs/compressing/CompressionMode.java#L274],
>  however if we change the interface with ByteBuffersDataInput, we can 
> optimize memory copy into different compress algorithm code.
> Also, i found we can do more memory copy reduce in 
> *{{{}CompressingStoredFieldsWriter.{}}}{{{}copyOneDoc 
> [like|https://github.com/apache/lucene/blob/382962f22df3ee3af3fb538b877c98d61a622ddb/lucene/core/src/java/o

[jira] [Commented] (LUCENE-10647) Failure in TestMergeSchedulerExternal.testSubclassConcurrentMergeScheduler

2022-07-07 Thread Vigya Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564099#comment-17564099
 ] 

Vigya Sharma commented on LUCENE-10647:
---

I think the cause of this failure is related to, but slightly different from, 
https://issues.apache.org/jira/browse/LUCENE-10617. However, I'm not able to 
reproduce it on my box despite running the tests on repeat.

My hunch is that we are hitting an exception in the {{addDocument()}} API, 
which gets swallowed by the catch block. As a result, we end up calling 
{{writer.rollback()}} before (or rather without) calling 
{{getMergeScheduler().sync()}}. 
Once rollback is triggered, MergeThreads exit with an abort, which is swallowed 
(and not rethrown) by {{writer.handleMergeExceptions()}}. This leaves the 
excCalled flag unset, causing the assertion error.

(Code Ref - 
[https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/TestMergeSchedulerExternal.java#L133-L139])

I can raise a quick PR with a fix, but I don't have a good way to test and 
confirm it, as this has not reproduced on my box so far.
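A minimal stdlib-only mock of the ordering described above (hypothetical names, not Lucene's MergeScheduler API): a flag set only on a real merge failure stays unset when a rollback-triggered abort is swallowed, which is exactly the assertion that then fails.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Stdlib-only mock of the race described above (hypothetical names, not
// Lucene code). excCalled is only set when a merge fails with a *real*
// exception; an abort caused by rollback() is swallowed, so rolling back
// before sync() can leave the flag unset and trip the test's assertion.
public class MergeSyncOrderDemo {
  static final AtomicBoolean excCalled = new AtomicBoolean(false);

  static void runMerge(boolean abortedByRollback) {
    try {
      if (abortedByRollback) {
        throw new IllegalStateException("merge aborted"); // rollback raced ahead
      }
      throw new RuntimeException("real merge failure");
    } catch (IllegalStateException aborted) {
      // rollback-triggered aborts are swallowed, not rethrown
    } catch (RuntimeException real) {
      excCalled.set(true); // only a real merge failure records the exception
    }
  }

  public static void main(String[] args) {
    runMerge(true);  // rollback() before sync(): abort swallowed
    System.out.println("after abort: excCalled=" + excCalled.get());   // prints "after abort: excCalled=false"
    runMerge(false); // sync() first lets the real failure surface
    System.out.println("after failure: excCalled=" + excCalled.get()); // prints "after failure: excCalled=true"
  }
}
```

Under this reading, making the test call sync() before any rollback path would close the window.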

> Failure in TestMergeSchedulerExternal.testSubclassConcurrentMergeScheduler
> --
>
> Key: LUCENE-10647
> URL: https://issues.apache.org/jira/browse/LUCENE-10647
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Vigya Sharma
>Priority: Major
>
> Recent builds are intermittently failing on 
> TestMergeSchedulerExternal.testSubclassConcurrentMergeScheduler. Example:
> https://jenkins.thetaphi.de/job/Lucene-main-Linux/35576/testReport/junit/org.apache.lucene/TestMergeSchedulerExternal/testSubclassConcurrentMergeScheduler/






[jira] [Created] (LUCENE-10647) Failure in TestMergeSchedulerExternal.testSubclassConcurrentMergeScheduler

2022-07-07 Thread Vigya Sharma (Jira)
Vigya Sharma created LUCENE-10647:
-

 Summary: Failure in 
TestMergeSchedulerExternal.testSubclassConcurrentMergeScheduler
 Key: LUCENE-10647
 URL: https://issues.apache.org/jira/browse/LUCENE-10647
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Vigya Sharma


Recent builds are intermittently failing on 
TestMergeSchedulerExternal.testSubclassConcurrentMergeScheduler. Example:

https://jenkins.thetaphi.de/job/Lucene-main-Linux/35576/testReport/junit/org.apache.lucene/TestMergeSchedulerExternal/testSubclassConcurrentMergeScheduler/


