[GitHub] [lucene-jira-archive] mocobeta commented on issue #29: Can/should we make Jira read-only on migration to GitHub issues?

2022-07-16 Thread GitBox


mocobeta commented on issue #29:
URL: 
https://github.com/apache/lucene-jira-archive/issues/29#issuecomment-1186138763

   I just wanted to let you know that I'm not able to edit the Jira 
configuration such as workflow or issue template (I don't have permission). So 
anyway, I have to pass it to you after GitHub issue is lifted. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] mikemccand commented on issue #29: Can/should we make Jira read-only on migration to GitHub issues?

2022-07-16 Thread GitBox


mikemccand commented on issue #29:
URL: 
https://github.com/apache/lucene-jira-archive/issues/29#issuecomment-1186145578

   Hmm OK let me see if I have permissions ;)





[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-07-16 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567465#comment-17567465
 ] 

Michael McCandless commented on LUCENE-10557:
-

bq. [TEST] This was moved to GitHub issue: 
https://github.com/mocobeta/migration-test-3/issues/196.

Oooh that looks promising!!

> Migrate to GitHub issue from Jira
> -
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: Screen Shot 2022-06-29 at 11.02.35 AM.png, 
> image-2022-06-29-13-36-57-365.png, screenshot-1.png
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> A few (not the majority) Apache projects already use the GitHub issue instead 
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible that we move to GitHub issue. I have 
> little knowledge of how to proceed with it, I'd like to discuss whether we 
> should migrate to it, and if so, how to smoothly handle the migration.
> The major tasks would be:
>  * (/) Get a consensus about the migration among committers
>  * (/) Choose issues that should be moved to GitHub - We'll migrate all 
> issues towards an atomic switch to GitHub if no major technical obstacles 
> show up.
>  ** Discussion thread 
> [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12]
>  ** -Conclusion for now: We don't migrate any issues. Only new issues should 
> be opened on GitHub.-
>  ** Write a prototype migration script - the decision could be made on that. 
> Things to consider:
>  *** version numbers - labels or milestones?
>  *** add a comment/ prepend a link to the source Jira issue on github side,
>  *** add a comment/ prepend a link on the jira side to the new issue on 
> github side (for people who access jira from blogs, mailing list archives and 
> other sources that will have stale links),
>  *** convert cross-issue automatic links in comments/ descriptions (as 
> suggested by Robert),
>  *** strategy to deal with sub-issues (hierarchies),
>  *** maybe prefix (or postfix) the issue title on github side with the 
> original LUCENE-XYZ key so that it is easier to search for a particular issue 
> there?
>  *** how to deal with user IDs (author, reporter, commenters)? Do they have 
> to be github users? Will information about people not registered on github be 
> lost?
>  *** create an extra mapping file of old-issue-new-issue URLs for any 
> potential future uses.
>  *** what to do with issue numbers in git/svn commits? These could be 
> rewritten but it'd change the entire git history tree - I don't think this is 
> practical, while doable.
> * Prepare a complete migration tool
> ** See https://github.com/apache/lucene-jira-archive/issues/5 
> * Build the convention for issue label/milestone management
>  ** See [https://github.com/apache/lucene-jira-archive/issues/6]
>  ** Do some experiments on a sandbox repository 
> [https://github.com/mocobeta/sandbox-lucene-10557]
>  ** Make documentation for metadata (label/milestone) management 
>  * (/) Enable Github issue on the lucene's repository
>  ** Raise an issue on INFRA
>  ** (Create an issue-only private repository for sensitive issues if it's 
> needed and allowed)
>  ** Set a mail hook to 
> [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to 
> the general mail group name)
>  * Set a schedule for migration
>  ** See [https://github.com/apache/lucene-jira-archive/issues/7]
>  ** Give some time to committers to play around with issues/labels/milestones 
> before the actual migration
>  ** Make an announcement on the mail lists
>  ** Show some text messages when opening a new Jira issue (in issue template?)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[GitHub] [lucene-jira-archive] mocobeta commented on issue #7: Make a detailed migration plan

2022-07-16 Thread GitBox


mocobeta commented on issue #7:
URL: 
https://github.com/apache/lucene-jira-archive/issues/7#issuecomment-1186146650

   Once the migration has started, any issues opened in Jira will have to be 
manually migrated to GitHub by their authors afterward, which would be a nuisance.
   
   I wanted to add some text along these lines:
   ```
   We are switching from Jira to GitHub issues, and data migration is now in 
progress.
   Although you can still open a Jira issue, you may want to wait until the 
migration is finished
   and open a GitHub issue after that, if you are not in a hurry.
   Migration will be completed within a few days.
   ```
to the Jira issue template (wording could be refined).
   
   But it looks like I don't have permission to browse/edit the issue 
templates... Could someone who is able to edit the issue template help me with 
it?
   
   Thank you.





[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-07-16 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567466#comment-17567466
 ] 

Michael McCandless commented on LUCENE-10557:
-

OK I am able to administer our Jira instance.

There are some wrinkles – apparently because some of our workflows are shared 
across two projects (Lucene and Solr), the workflows themselves are read-only!  
So we cannot change them unless we work the workflows.

But there is much discussion about this problem, e.g.: 
[https://community.atlassian.com/t5/Jira-questions/Fastest-way-to-make-JIRA-read-only/qaq-p/1261492]

I'll try to find the simplest way that works for us.




[GitHub] [lucene] mocobeta commented on pull request #1024: LUCENE-10557: Add GitHub issue templates

2022-07-16 Thread GitBox


mocobeta commented on PR #1024:
URL: https://github.com/apache/lucene/pull/1024#issuecomment-1186152291

   There are five predefined issue templates (forms) written in YAML; they 
look like this:
   
   - Bug Report
   ![Screenshot from 2022-07-16 
19-52-19](https://user-images.githubusercontent.com/1825333/179351932-599b9699-bf1c-4602-a348-d0773a14dcf3.png)
   
   - Test Improvement / Failure Report
   ![Screenshot from 2022-07-16 
19-52-52](https://user-images.githubusercontent.com/1825333/179351953-9d123001-eaf1-454e-ae14-63dc8aa7ecae.png)
   
   - Enhance Request/Suggestions
   - Task
   - Documentation Improvement
   ![Screenshot from 2022-07-16 
19-54-50](https://user-images.githubusercontent.com/1825333/179352003-049e7b61-3159-4ffd-8188-e152b15c219b.png)
   
   Other fields/components (checkboxes, dropdowns, and so on) can be added if 
we'd like.





[GitHub] [lucene] jpountz commented on a diff in pull request #987: LUCENE-10627: Using CompositeByteBuf to Reduce Memory Copy

2022-07-16 Thread GitBox


jpountz commented on code in PR #987:
URL: https://github.com/apache/lucene/pull/987#discussion_r922672203


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsWriter.java:
##
@@ -247,21 +249,18 @@ private void flush(boolean force) throws IOException {
 writeHeader(docBase, numBufferedDocs, numStoredFields, lengths, sliced, 
dirtyChunk);
 
 // compress stored fields to fieldsStream.
-//
-// TODO: do we need to slice it since we already have the slices in the 
buffer? Perhaps
-// we should use max-block-bits restriction on the buffer itself, then we 
won't have to check it
-// here.
-byte[] content = bufferedDocs.toArrayCopy();
-bufferedDocs.reset();
-
 if (sliced) {
-  // big chunk, slice it
-  for (int compressed = 0; compressed < content.length; compressed += 
chunkSize) {
-compressor.compress(
-content, compressed, Math.min(chunkSize, content.length - 
compressed), fieldsStream);
+  // big chunk, slice it, using ByteBuffersDataInput ignore memory copy
+  ByteBuffersDataInput bytebuffers = bufferedDocs.toDataInput();
+  final int capacity = (int) bytebuffers.size();
+  for (int compressed = 0; compressed < capacity; compressed += chunkSize) 
{
+int l = Math.min(chunkSize, capacity - compressed);
+ByteBuffersDataInput bbdi = bytebuffers.slice(compressed, l);
+compressor.compress(bbdi, fieldsStream);
   }
 } else {
-  compressor.compress(content, 0, content.length, fieldsStream);
+  ByteBuffersDataInput bytebuffers = bufferedDocs.toDataInput();

Review Comment:
   Maybe move this before the `if` statement, since we create `bytebuffers` the 
same way on both branches?



##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsWriter.java:
##
@@ -519,7 +518,13 @@ private void 
copyOneDoc(Lucene90CompressingStoredFieldsReader reader, int docID)
 assert reader.getVersion() == VERSION_CURRENT;
 SerializedDocument doc = reader.document(docID);
 startDocument();
-bufferedDocs.copyBytes(doc.in, doc.length);
+
+if (doc.in instanceof ByteArrayDataInput) {
+  // reuse ByteArrayDataInput to reduce memory copy
+  bufferedDocs.copyBytes((ByteArrayDataInput) doc.in, doc.length);
+} else {
+  bufferedDocs.copyBytes(doc.in, doc.length);
+}

Review Comment:
   I think that we could avoid this `instanceof` check by overriding 
`ByteBuffersDataOutput#copyBytes` to read directly into its internal buffers 
when they are not direct (i.e. backed by a `byte[]`)? (Maybe in a separate 
change?)
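
The idea in this review comment can be sketched outside of Lucene. The example below is only an illustration under stated assumptions: it uses `java.io.DataInput` and a single `java.nio.ByteBuffer` rather than Lucene's `DataInput` and `ByteBuffersDataOutput`, and `CopyBytesSketch` is a hypothetical name. It shows reading straight into the backing array when the buffer is heap-based, and falling back to a scratch array otherwise.

```java
import java.io.DataInput;
import java.io.IOException;
import java.nio.ByteBuffer;

/** Hypothetical sketch; not Lucene's ByteBuffersDataOutput. */
final class CopyBytesSketch {

  /** Copies len bytes from in into dst, skipping the scratch copy for heap buffers. */
  static void copyBytes(DataInput in, ByteBuffer dst, int len) throws IOException {
    if (len < 0 || len > dst.remaining()) {
      throw new IllegalArgumentException("len out of range: " + len);
    }
    if (dst.hasArray()) {
      // Heap buffer: read directly into the backing byte[].
      in.readFully(dst.array(), dst.arrayOffset() + dst.position(), len);
      dst.position(dst.position() + len);
    } else {
      // Direct buffer: no accessible byte[], so go through a bounded scratch array.
      byte[] scratch = new byte[Math.min(len, 8192)];
      while (len > 0) {
        int chunk = Math.min(len, scratch.length);
        in.readFully(scratch, 0, chunk);
        dst.put(scratch, 0, chunk);
        len -= chunk;
      }
    }
  }
}
```

With this shape the caller no longer needs to know the concrete input type; the destination decides whether a direct read is possible.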



##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/DeflateWithPresetDictCompressionMode.java:
##
@@ -163,12 +165,16 @@ private static class DeflateWithPresetDictCompressor 
extends Compressor {
 final Deflater compressor;
 final BugfixDeflater_JDK8252739 deflaterBugfix;
 byte[] compressed;
+byte[] bufferDict;
+byte[] bufferBlock;
 boolean closed;
 
 DeflateWithPresetDictCompressor(int level) {
   compressor = new Deflater(level, true);
   deflaterBugfix = BugfixDeflater_JDK8252739.createBugfix(compressor);
   compressed = new byte[64];
+  bufferDict = BytesRef.EMPTY_BYTES;
+  bufferBlock = BytesRef.EMPTY_BYTES;
 }
 
 private void doCompress(byte[] bytes, int off, int len, DataOutput out) 
throws IOException {

Review Comment:
   Can we remove this one and require callers to use the ByteBuffer variant 
instead?



##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/DeflateWithPresetDictCompressionMode.java:
##
@@ -199,23 +205,65 @@ private void doCompress(byte[] bytes, int off, int len, 
DataOutput out) throws I
   out.writeBytes(compressed, totalCount);
 }
 
+private void doCompress(ByteBuffer bytes, int len, DataOutput out) throws 
IOException {
+  if (len == 0) {
+out.writeVInt(0);
+return;
+  }
+  compressor.setInput(bytes);
+  compressor.finish();
+  if (compressor.needsInput()) {
+throw new IllegalStateException();
+  }
+
+  int totalCount = 0;
+  for (; ; ) {
+final int count =
+compressor.deflate(compressed, totalCount, compressed.length - 
totalCount);
+totalCount += count;
+assert totalCount <= compressed.length;
+if (compressor.finished()) {
+  break;
+} else {
+  compressed = ArrayUtil.grow(compressed);
+}
+  }
+
+  out.writeVInt(totalCount);
+  out.writeBytes(compressed, totalCount);
+}
+
 @Override
-public void compress(byte[] bytes, int off, int len, DataOutput out) 
throws IOException {
+public void compress(ByteBuffersDataInput buffersInput, DataOutput out) 
throws IOException {
+  final int len = (int) (buffersInput.size() - buffersInput.position());
+  final int en

[GitHub] [lucene] jpountz commented on a diff in pull request #1018: LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions

2022-07-16 Thread GitBox


jpountz commented on code in PR #1018:
URL: https://github.com/apache/lucene/pull/1018#discussion_r922674843


##
lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java:
##
@@ -191,6 +191,69 @@ public long cost() {
   // or null if it is not applicable
   // pkg-private for forcing use of BooleanScorer in tests
   BulkScorer optionalBulkScorer(LeafReaderContext context) throws IOException {
+if (scoreMode == ScoreMode.TOP_SCORES) {
+  if (query.getMinimumNumberShouldMatch() > 1 || weightedClauses.size() > 
2) {
+return null;
+  }
+
+  List<ScorerSupplier> optional = new ArrayList<>();
+  for (WeightedBooleanClause wc : weightedClauses) {
+Weight w = wc.weight;
+BooleanClause c = wc.clause;
+if (c.getOccur() != Occur.SHOULD) {
+  continue;
+}
+ScorerSupplier scorer = w.scorerSupplier(context);
+if (scorer != null) {
+  optional.add(scorer);
+}
+  }
+
+  if (optional.size() <= 1) {
+return null;
+  }
+
+  List<Scorer> optionalScorers = new ArrayList<>();
+  for (ScorerSupplier ss : optional) {
+optionalScorers.add(ss.get(Long.MAX_VALUE));
+  }
+
+  return new BulkScorer() {

Review Comment:
   It shouldn't be slower than the current code in `main`, since `main` is using 
`DefaultBulkScorer`, should it?






[GitHub] [lucene] jpountz commented on a diff in pull request #1018: LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions

2022-07-16 Thread GitBox


jpountz commented on code in PR #1018:
URL: https://github.com/apache/lucene/pull/1018#discussion_r922675268


##
lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java:
##
@@ -191,6 +191,69 @@ public long cost() {
   // or null if it is not applicable
   // pkg-private for forcing use of BooleanScorer in tests
   BulkScorer optionalBulkScorer(LeafReaderContext context) throws IOException {
+if (scoreMode == ScoreMode.TOP_SCORES) {
+  if (query.getMinimumNumberShouldMatch() > 1 || weightedClauses.size() > 
2) {
+return null;
+  }
+
+  List<ScorerSupplier> optional = new ArrayList<>();
+  for (WeightedBooleanClause wc : weightedClauses) {
+Weight w = wc.weight;
+BooleanClause c = wc.clause;
+if (c.getOccur() != Occur.SHOULD) {
+  continue;
+}
+ScorerSupplier scorer = w.scorerSupplier(context);
+if (scorer != null) {
+  optional.add(scorer);
+}
+  }
+
+  if (optional.size() <= 1) {
+return null;
+  }
+
+  List<Scorer> optionalScorers = new ArrayList<>();
+  for (ScorerSupplier ss : optional) {
+optionalScorers.add(ss.get(Long.MAX_VALUE));
+  }
+
+  return new BulkScorer() {
+final Scorer bmmScorer = new 
BlockMaxMaxscoreScorer(BooleanWeight.this, optionalScorers);
+final int maxDoc = context.reader().maxDoc();
+final DocIdSetIterator iterator = bmmScorer.iterator();
+
+@Override
+public int score(LeafCollector collector, Bits acceptDocs, int min, 
int max)
+throws IOException {
+  max = Math.min(max, maxDoc);

Review Comment:
   I don't think we need this, do tests fail without it?



##
lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java:
##
@@ -191,6 +191,69 @@ public long cost() {
   // or null if it is not applicable
   // pkg-private for forcing use of BooleanScorer in tests
   BulkScorer optionalBulkScorer(LeafReaderContext context) throws IOException {
+if (scoreMode == ScoreMode.TOP_SCORES) {
+  if (query.getMinimumNumberShouldMatch() > 1 || weightedClauses.size() > 
2) {
+return null;
+  }
+
+  List<ScorerSupplier> optional = new ArrayList<>();
+  for (WeightedBooleanClause wc : weightedClauses) {
+Weight w = wc.weight;
+BooleanClause c = wc.clause;
+if (c.getOccur() != Occur.SHOULD) {
+  continue;
+}
+ScorerSupplier scorer = w.scorerSupplier(context);
+if (scorer != null) {
+  optional.add(scorer);
+}
+  }
+
+  if (optional.size() <= 1) {
+return null;
+  }
+
+  List<Scorer> optionalScorers = new ArrayList<>();
+  for (ScorerSupplier ss : optional) {
+optionalScorers.add(ss.get(Long.MAX_VALUE));
+  }
+
+  return new BulkScorer() {
+final Scorer bmmScorer = new 
BlockMaxMaxscoreScorer(BooleanWeight.this, optionalScorers);
+final int maxDoc = context.reader().maxDoc();
+final DocIdSetIterator iterator = bmmScorer.iterator();
+
+@Override
+public int score(LeafCollector collector, Bits acceptDocs, int min, 
int max)
+throws IOException {
+  max = Math.min(max, maxDoc);
+  collector.setScorer(bmmScorer);
+
+  for (int doc = min; doc < max; ) {
+int advancedDoc = iterator.advance(doc);
+if (advancedDoc == DocIdSetIterator.NO_MORE_DOCS) {
+  return DocIdSetIterator.NO_MORE_DOCS;
+} else if (advancedDoc >= max) {
+  return max;
+}
+
+if (acceptDocs == null || acceptDocs.get(advancedDoc)) {
+  collector.collect(advancedDoc);
+}
+
+doc = advancedDoc + 1;
+  }
+
+  return max == maxDoc ? DocIdSetIterator.NO_MORE_DOCS : max;

Review Comment:
   Maybe we could remove the end condition from the for loop, so that we would 
hit the `if (advancedDoc >= max)` condition instead, and remove the above line?
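
A minimal sketch of the loop shape suggested in this comment, written without Lucene so that it runs on its own. `RangeScoringSketch`, `DocIterator` and `over` are hypothetical stand-ins (for the anonymous `BulkScorer`, `DocIdSetIterator` and a postings source), and the accept/collect callbacks are plain JDK functional interfaces. The loop has no end condition; it exits only through the two checks on `advancedDoc`, so no trailing `return` is needed.

```java
import java.util.function.IntConsumer;
import java.util.function.IntPredicate;

/** Hypothetical, Lucene-free sketch of the suggested loop shape. */
final class RangeScoringSketch {
  // Assumed sentinel, mirroring the role of DocIdSetIterator.NO_MORE_DOCS.
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Stand-in for an iterator that advances to the first doc >= target. */
  interface DocIterator {
    int advance(int target);
  }

  /** Collects accepted docs in [min, max) and returns max or NO_MORE_DOCS. */
  static int score(
      DocIterator iterator, IntPredicate acceptDocs, IntConsumer collector, int min, int max) {
    for (int doc = min; ; ) { // no end condition: the checks below terminate the loop
      int advancedDoc = iterator.advance(doc);
      if (advancedDoc == NO_MORE_DOCS) {
        return NO_MORE_DOCS;
      } else if (advancedDoc >= max) {
        return max;
      }
      if (acceptDocs == null || acceptDocs.test(advancedDoc)) {
        collector.accept(advancedDoc);
      }
      doc = advancedDoc + 1;
    }
  }

  /** Test helper: an iterator over a sorted array of doc ids. */
  static DocIterator over(int... sortedDocs) {
    return new DocIterator() {
      private int idx = 0;

      @Override
      public int advance(int target) {
        while (idx < sortedDocs.length && sortedDocs[idx] < target) {
          idx++;
        }
        return idx < sortedDocs.length ? sortedDocs[idx] : NO_MORE_DOCS;
      }
    };
  }
}
```

One consequence of this shape is that the iterator is advanced once even when `min >= max`; whether that matters for the real `BulkScorer` is not something this sketch can answer.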



##
lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java:
##
@@ -191,6 +191,69 @@ public long cost() {
   // or null if it is not applicable
   // pkg-private for forcing use of BooleanScorer in tests
   BulkScorer optionalBulkScorer(LeafReaderContext context) throws IOException {
+if (scoreMode == ScoreMode.TOP_SCORES) {
+  if (query.getMinimumNumberShouldMatch() > 1 || weightedClauses.size() > 
2) {
+return null;
+  }
+
+  List<ScorerSupplier> optional = new ArrayList<>();
+  for (WeightedBooleanClause wc : weightedClauses) {
+Weight w = wc.weight;
+BooleanClause c = wc.clause;
+if (c.getOccur() != Occur.SHOULD) {
+  continue;
+}
+ScorerSupplier scorer = w.scorerSupplier(context);
+if (scorer != null) {
+  optional.add(scorer);
+}
+  }
+
+  if (optional.size() <= 1) {
+return null;
+  }
+
+  List optionalScorers = new Ar

[jira] [Commented] (LUCENE-10655) can we optimize visited bitset usage in HNSW graph search/indexing?

2022-07-16 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567482#comment-17567482
 ] 

Adrien Grand commented on LUCENE-10655:
---

I've been wondering if using a simple int hash set would help. FixedBitSet is 
super efficient CPU-wise, but it also requires lots of memory on large segments 
while we typically only set a limited number of bits, so it can quickly become 
memory-bound for random access, like we do when building the graph. An int hash 
set should also be cheaper to clear.
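
For illustration, a minimal sketch of what such an int hash set could look like. This is an assumption-laden example, not code from Lucene or from this issue: `VisitedIntSet` is a hypothetical name, node ids are assumed to be non-negative, and linear probing with a load factor of at most 0.5 is just one possible design. Memory and the cost of `clear()` scale with the number of visited nodes rather than with the total number of nodes; whether that is actually faster than `FixedBitSet` would still need to be measured.

```java
import java.util.Arrays;

/** Hypothetical sketch: open-addressing set of non-negative ints (e.g. graph node ids). */
final class VisitedIntSet {
  private static final int EMPTY = -1; // assumes node ids are >= 0
  private int[] slots;
  private int size;

  VisitedIntSet(int expectedSize) {
    int capacity = 8;
    while (capacity < 2L * expectedSize) { // power of two, load factor <= 0.5
      capacity <<= 1;
    }
    slots = new int[capacity];
    Arrays.fill(slots, EMPTY);
  }

  /** Returns true if value was not yet in the set. */
  boolean add(int value) {
    if (value < 0) {
      throw new IllegalArgumentException("value must be >= 0: " + value);
    }
    if ((size + 1) * 2 > slots.length) {
      grow();
    }
    if (insert(slots, value)) {
      size++;
      return true;
    }
    return false;
  }

  boolean contains(int value) {
    int mask = slots.length - 1;
    for (int i = mix(value) & mask; ; i = (i + 1) & mask) {
      if (slots[i] == EMPTY) return false;
      if (slots[i] == value) return true;
    }
  }

  int size() {
    return size;
  }

  /** Cost is proportional to the table capacity, not to the number of graph nodes. */
  void clear() {
    Arrays.fill(slots, EMPTY);
    size = 0;
  }

  private static boolean insert(int[] table, int value) {
    int mask = table.length - 1;
    for (int i = mix(value) & mask; ; i = (i + 1) & mask) { // linear probing
      if (table[i] == EMPTY) {
        table[i] = value;
        return true;
      }
      if (table[i] == value) return false;
    }
  }

  private void grow() {
    int[] bigger = new int[slots.length * 2];
    Arrays.fill(bigger, EMPTY);
    for (int s : slots) {
      if (s != EMPTY) insert(bigger, s);
    }
    slots = bigger;
  }

  private static int mix(int x) {
    x *= 0x9E3779B9; // multiplicative hashing to spread consecutive ids
    return x ^ (x >>> 16);
  }
}
```

The `add` method doubles as the "already visited?" check, which is the access pattern a graph search needs: `if (visited.add(node)) { /* explore node */ }`.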

> can we optimize visited bitset usage in HNSW graph search/indexing?
> ---
>
> Key: LUCENE-10655
> URL: https://issues.apache.org/jira/browse/LUCENE-10655
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/hnsw
>Reporter: Michael Sokolov
>Priority: Major
>
> When running {{luceneutil}}  I noticed that {{FixedBitSet.clear()}} dominates 
> the CPU profiler output. I had a few ideas:
>  # In upper graph layers, the occupied nodes are very sparse - maybe 
> {{SparseFixedBitSet}} would be a better fit for those
>  # We are caching these bitsets, but they are only used for a single search 
> (single document insert, during indexing). Should we cache across searches? 
> We would need to pool them though, and they would vary by field since fields 
> can have different numbers of vector nodes. This starts to get complex
>  # Are we sure that clearing a bitset is more efficient than allocating a new 
> one? Maybe the JDK maintains a pool of already-zeroed memory for us
> I think we could try specializing the bitset type by graph level, and then I 
> think we ought to measure the performance of allocation vs the limited reuse 
> that we currently have.






[jira] [Commented] (LUCENE-10633) Dynamic pruning for queries sorted by SORTED(_SET) field

2022-07-16 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567485#comment-17567485
 ] 

Adrien Grand commented on LUCENE-10633:
---

Indeed the speedup is impressive. :) I should have noted that I had to tweak 
luceneutil to also index fields that were used for sorting so that the inverted 
index could be used to skip hits.

This change is very similar to LUCENE-9280, which led to annotation DD on 
[https://home.apache.org/~mikemccand/lucenebench/TermDayOfYearSort.html] and 
https://home.apache.org/~mikemccand/lucenebench/TermDTSort.html.

> Dynamic pruning for queries sorted by SORTED(_SET) field
> 
>
> Key: LUCENE-10633
> URL: https://issues.apache.org/jira/browse/LUCENE-10633
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> LUCENE-9280 introduced the ability to dynamically prune non-competitive hits 
> when sorting by a numeric field, by leveraging the points index to skip 
> documents that do not compare better than the top of the priority queue 
> maintained by the field comparator.
> However queries sorted by a SORTED(_SET) field still look at all hits, which 
> is disappointing. Could we leverage the terms index to skip hits?






[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-07-16 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567486#comment-17567486
 ] 

Michael McCandless commented on LUCENE-10557:
-

Hi [~tomoko] – I added you as a Jira Administrator so you can poke around if 
you want to.

But I'll still try to figure out how to make Jira read-only.



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-07-16 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567492#comment-17567492
 ] 

Tomoko Uchida commented on LUCENE-10557:


[~mikemccand] thank you, I'm now able to edit the configuration. However, I am 
struggling to figure out how to tweak the issue creation panel. I just 
wanted to set a placeholder or default value for the "Description" field...

> Migrate to GitHub issue from Jira
> -
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: Screen Shot 2022-06-29 at 11.02.35 AM.png, 
> image-2022-06-29-13-36-57-365.png, screenshot-1.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> A few (not the majority) Apache projects already use the GitHub issue instead 
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible that we move to GitHub issue. I have 
> little knowledge of how to proceed with it, I'd like to discuss whether we 
> should migrate to it, and if so, how to smoothly handle the migration.
> The major tasks would be:
>  * (/) Get a consensus about the migration among committers
>  * (/) Choose issues that should be moved to GitHub - We'll migrate all 
> issues towards an atomic switch to GitHub if no major technical obstacles 
> show up.
>  ** Discussion thread 
> [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12]
>  ** -Conclusion for now: We don't migrate any issues. Only new issues should 
> be opened on GitHub.-
>  ** Write a prototype migration script - the decision could be made on that. 
> Things to consider:
>  *** version numbers - labels or milestones?
>  *** add a comment/ prepend a link to the source Jira issue on github side,
>  *** add a comment/ prepend a link on the jira side to the new issue on 
> github side (for people who access jira from blogs, mailing list archives and 
> other sources that will have stale links),
>  *** convert cross-issue automatic links in comments/ descriptions (as 
> suggested by Robert),
>  *** strategy to deal with sub-issues (hierarchies),
>  *** maybe prefix (or postfix) the issue title on github side with the 
> original LUCENE-XYZ key so that it is easier to search for a particular issue 
> there?
>  *** how to deal with user IDs (author, reporter, commenters)? Do they have 
> to be github users? Will information about people not registered on github be 
> lost?
>  *** create an extra mapping file of old-issue-new-issue URLs for any 
> potential future uses.
>  *** what to do with issue numbers in git/svn commits? These could be 
> rewritten but it'd change the entire git history tree - I don't think this is 
> practical, while doable.
> * Prepare a complete migration tool
> ** See https://github.com/apache/lucene-jira-archive/issues/5 
> * Build the convention for issue label/milestone management
>  ** See [https://github.com/apache/lucene-jira-archive/issues/6]
>  ** Do some experiments on a sandbox repository 
> [https://github.com/mocobeta/sandbox-lucene-10557]
>  ** Make documentation for metadata (label/milestone) management 
>  * (/) Enable Github issue on the lucene's repository
>  ** Raise an issue on INFRA
>  ** (Create an issue-only private repository for sensitive issues if it's 
> needed and allowed)
>  ** Set a mail hook to 
> [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to 
> the general mail group name)
>  * Set a schedule for migration
>  ** See [https://github.com/apache/lucene-jira-archive/issues/7]
>  ** Give some time to committers to play around with issues/labels/milestones 
> before the actual migration
>  ** Make an announcement on the mail lists
>  ** Show some text messages when opening a new Jira issue (in issue template?)
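
Three of the migration-script considerations in the quoted task list (prefixing the title with the original Jira key, converting cross-issue references, and writing an old-to-new URL mapping file) can be sketched as below. This is only a rough illustration, not the actual migration tool: the class and method names are hypothetical, the "KEY: title" format and the CSV mapping line are assumptions, the target repository URL is assumed to be apache/lucene, and issue number 42 is made up.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not taken from the actual migration tool. */
final class JiraToGitHubSketch {
  private static final Pattern JIRA_KEY = Pattern.compile("\\bLUCENE-\\d+\\b");
  static final String JIRA_BASE = "https://issues.apache.org/jira/browse/";
  // Assumption: migrated issues land in the apache/lucene repository.
  static final String GITHUB_BASE = "https://github.com/apache/lucene/issues/";

  /** Prefixes the GitHub title with the original Jira key so the key stays searchable. */
  static String prefixTitle(String jiraKey, String title) {
    return jiraKey + ": " + title;
  }

  /** Appends "(#N)" to every Jira key that has a known GitHub issue number. */
  static String rewriteCrossLinks(String text, Map<String, Integer> keyToIssueNumber) {
    Matcher m = JIRA_KEY.matcher(text);
    StringBuilder out = new StringBuilder();
    while (m.find()) {
      Integer number = keyToIssueNumber.get(m.group());
      String replacement = number == null ? m.group() : m.group() + " (#" + number + ")";
      m.appendReplacement(out, Matcher.quoteReplacement(replacement));
    }
    m.appendTail(out);
    return out.toString();
  }

  /** One line of the old-URL,new-URL mapping file. */
  static String mappingLine(String jiraKey, int issueNumber) {
    return JIRA_BASE + jiraKey + "," + GITHUB_BASE + issueNumber;
  }

  public static void main(String[] args) {
    Map<String, Integer> mapping = Map.of("LUCENE-10557", 42); // 42 is a made-up number
    System.out.println(prefixTitle("LUCENE-10557", "Migrate to GitHub issue from Jira"));
    System.out.println(rewriteCrossLinks("Depends on LUCENE-10557.", mapping));
    System.out.println(mappingLine("LUCENE-10557", 42));
  }
}
```

Keys without a mapping entry are left untouched, so a partial mapping cannot corrupt references to issues that were not migrated.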






[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-07-16 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567495#comment-17567495
 ] 

Tomoko Uchida commented on LUCENE-10557:


It seems a "Project admin" is not really allowed to do meaningful things: almost 
all components are shared between projects, and only Jira administrators can 
change them.




[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-07-16 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567504#comment-17567504
 ] 

Michael McCandless commented on LUCENE-10557:
-

OK hmm we will likely need Infra's help for this then.




[GitHub] [lucene] hcqs33 opened a new pull request, #1026: Fix error in TieredMergePolicy

2022-07-16 Thread GitBox


hcqs33 opened a new pull request, #1026:
URL: https://github.com/apache/lucene/pull/1026

   Fix an error in comparing the bytes of the candidates against the bytes of the max merge.
   It's wrong to compare `candidateSize`, rather than `currentCandidateBytes`, 
with `maxMergeBytes`. This is a minor change to fix it.
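
   The kind of mismatch described here can be illustrated with a small, self-contained sketch. It is not the TieredMergePolicy source: the class and method names are hypothetical, and it assumes, purely for illustration, that `candidateSize` holds a number of segments while `currentCandidateBytes` holds their accumulated size in bytes.

```java
import java.util.List;

/** Hypothetical illustration only; this is not the TieredMergePolicy source. */
final class MergeBudgetSketch {

  /** Buggy variant: compares a segment count against a byte budget. */
  static boolean fitsBuggy(List<Long> segmentBytes, long maxMergeBytes) {
    int candidateSize = segmentBytes.size(); // assumed to be a count, not a size in bytes
    return candidateSize <= maxMergeBytes;
  }

  /** Fixed variant: compares accumulated bytes against the byte budget. */
  static boolean fitsFixed(List<Long> segmentBytes, long maxMergeBytes) {
    long currentCandidateBytes = 0L;
    for (long bytes : segmentBytes) {
      currentCandidateBytes += bytes;
    }
    return currentCandidateBytes <= maxMergeBytes;
  }

  public static void main(String[] args) {
    List<Long> candidate = List.of(600L, 600L); // two segments, 1200 bytes in total
    long maxMergeBytes = 1000L;
    System.out.println("buggy check: " + fitsBuggy(candidate, maxMergeBytes));
    System.out.println("fixed check: " + fitsFixed(candidate, maxMergeBytes));
  }
}
```

   Under that assumption the buggy check almost always passes, because a handful of segments is compared with a limit expressed in bytes.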





[GitHub] [lucene] zacharymorn commented on a diff in pull request #1018: LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions

2022-07-16 Thread GitBox


zacharymorn commented on code in PR #1018:
URL: https://github.com/apache/lucene/pull/1018#discussion_r922765220


##
lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java:
##
@@ -191,6 +191,69 @@ public long cost() {
   // or null if it is not applicable
   // pkg-private for forcing use of BooleanScorer in tests
   BulkScorer optionalBulkScorer(LeafReaderContext context) throws IOException {
+    if (scoreMode == ScoreMode.TOP_SCORES) {
+      if (query.getMinimumNumberShouldMatch() > 1 || weightedClauses.size() > 2) {
+        return null;
+      }
+
+      List<ScorerSupplier> optional = new ArrayList<>();
+      for (WeightedBooleanClause wc : weightedClauses) {
+        Weight w = wc.weight;
+        BooleanClause c = wc.clause;
+        if (c.getOccur() != Occur.SHOULD) {
+          continue;
+        }
+        ScorerSupplier scorer = w.scorerSupplier(context);
+        if (scorer != null) {
+          optional.add(scorer);
+        }
+      }
+
+      if (optional.size() <= 1) {
+        return null;
+      }
+
+      List<Scorer> optionalScorers = new ArrayList<>();
+      for (ScorerSupplier ss : optional) {
+        optionalScorers.add(ss.get(Long.MAX_VALUE));
+      }
+
+      return new BulkScorer() {

Review Comment:
   > It shouldn't be slower than the current code in main since main is using 
DefaultBulkScorer, is it?
   
   The baseline for all of the above benchmark results is still the head 
prior to all BMM changes. This approach (anonymous bulk scorer + BMM scorer) 
still gives a performance boost similar to the previous one (just BMM scorer) 
for top-level disjunctions, with no impact on nested boolean queries, so I 
would think so? I'm not sure I fully understand this question, though.






[GitHub] [lucene] zacharymorn commented on a diff in pull request #1018: LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions

2022-07-16 Thread GitBox


zacharymorn commented on code in PR #1018:
URL: https://github.com/apache/lucene/pull/1018#discussion_r922765998


##
lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java:
##
@@ -191,6 +191,69 @@ public long cost() {
   // or null if it is not applicable
   // pkg-private for forcing use of BooleanScorer in tests
   BulkScorer optionalBulkScorer(LeafReaderContext context) throws IOException {
+    if (scoreMode == ScoreMode.TOP_SCORES) {
+      if (query.getMinimumNumberShouldMatch() > 1 || weightedClauses.size() > 2) {
+        return null;
+      }
+
+      List<ScorerSupplier> optional = new ArrayList<>();
+      for (WeightedBooleanClause wc : weightedClauses) {
+        Weight w = wc.weight;
+        BooleanClause c = wc.clause;
+        if (c.getOccur() != Occur.SHOULD) {
+          continue;
+        }
+        ScorerSupplier scorer = w.scorerSupplier(context);
+        if (scorer != null) {
+          optional.add(scorer);
+        }
+      }
+
+      if (optional.size() <= 1) {
+        return null;
+      }
+
+      List<Scorer> optionalScorers = new ArrayList<>();
+      for (ScorerSupplier ss : optional) {
+        optionalScorers.add(ss.get(Long.MAX_VALUE));
+      }
+
+      return new BulkScorer() {
+        final Scorer bmmScorer = new BlockMaxMaxscoreScorer(BooleanWeight.this, optionalScorers);
+        final int maxDoc = context.reader().maxDoc();
+        final DocIdSetIterator iterator = bmmScorer.iterator();
+
+        @Override
+        public int score(LeafCollector collector, Bits acceptDocs, int min, int max)
+            throws IOException {
+          max = Math.min(max, maxDoc);

Review Comment:
   Yup this is indeed optional and tests didn't fail without it. I've removed 
it.



##
lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java:
##
@@ -191,6 +191,69 @@ public long cost() {
   // or null if it is not applicable
   // pkg-private for forcing use of BooleanScorer in tests
   BulkScorer optionalBulkScorer(LeafReaderContext context) throws IOException {
+    if (scoreMode == ScoreMode.TOP_SCORES) {
+      if (query.getMinimumNumberShouldMatch() > 1 || weightedClauses.size() > 2) {
+        return null;
+      }
+
+      List<ScorerSupplier> optional = new ArrayList<>();
+      for (WeightedBooleanClause wc : weightedClauses) {
+        Weight w = wc.weight;
+        BooleanClause c = wc.clause;
+        if (c.getOccur() != Occur.SHOULD) {
+          continue;
+        }
+        ScorerSupplier scorer = w.scorerSupplier(context);
+        if (scorer != null) {
+          optional.add(scorer);
+        }
+      }
+
+      if (optional.size() <= 1) {
+        return null;
+      }
+
+      List<Scorer> optionalScorers = new ArrayList<>();
+      for (ScorerSupplier ss : optional) {
+        optionalScorers.add(ss.get(Long.MAX_VALUE));
+      }
+
+      return new BulkScorer() {
+        final Scorer bmmScorer = new BlockMaxMaxscoreScorer(BooleanWeight.this, optionalScorers);
+        final int maxDoc = context.reader().maxDoc();
+        final DocIdSetIterator iterator = bmmScorer.iterator();
+
+        @Override
+        public int score(LeafCollector collector, Bits acceptDocs, int min, int max)
+            throws IOException {
+          max = Math.min(max, maxDoc);
+          collector.setScorer(bmmScorer);
+
+          for (int doc = min; doc < max; ) {
+            int advancedDoc = iterator.advance(doc);
+            if (advancedDoc == DocIdSetIterator.NO_MORE_DOCS) {
+              return DocIdSetIterator.NO_MORE_DOCS;
+            } else if (advancedDoc >= max) {
+              return max;
+            }

Review Comment:
   Updated.
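
   The windowed loop quoted in the hunk above (advance to the next candidate, stop at the end of the window, signal exhaustion with a sentinel) can be sketched without any Lucene dependency as follows. This is only an illustration: `SortedDocs` is a toy stand-in for a doc-id iterator, the `collected` list stands in for a collector, and returning `max` once the window is exhausted is an assumption, since the quoted hunk ends before that point.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration of a windowed scoring loop; it does not use Lucene classes. */
final class WindowedScoringSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Minimal sorted doc-id iterator with advance(target) semantics. */
  static final class SortedDocs {
    private final int[] docs;
    private int pos = 0;

    SortedDocs(int... docs) {
      this.docs = docs;
    }

    /** Returns the first doc id >= target, or NO_MORE_DOCS if none is left. */
    int advance(int target) {
      while (pos < docs.length && docs[pos] < target) {
        pos++;
      }
      return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }
  }

  /** Collects the matches in [min, max) and returns where a later window should resume. */
  static int score(SortedDocs iterator, List<Integer> collected, int min, int max) {
    for (int doc = min; doc < max; ) {
      int advancedDoc = iterator.advance(doc);
      if (advancedDoc == NO_MORE_DOCS) {
        return NO_MORE_DOCS; // the iterator is exhausted
      } else if (advancedDoc >= max) {
        return max; // the next match lies beyond this window
      }
      collected.add(advancedDoc);
      doc = advancedDoc + 1;
    }
    return max; // assumption: window fully consumed, resume at max
  }

  public static void main(String[] args) {
    SortedDocs docs = new SortedDocs(1, 5, 9, 20);
    List<Integer> collected = new ArrayList<>();
    int next = score(docs, collected, 0, 10);
    System.out.println(collected + " next=" + next);
    next = score(docs, collected, next, 30);
    System.out.println(collected + " exhausted=" + (next == NO_MORE_DOCS));
  }
}
```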



##
lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java:
##
@@ -191,6 +191,69 @@ public long cost() {
   // or null if it is not applicable
   // pkg-private for forcing use of BooleanScorer in tests
   BulkScorer optionalBulkScorer(LeafReaderContext context) throws IOException {
+    if (scoreMode == ScoreMode.TOP_SCORES) {
+      if (query.getMinimumNumberShouldMatch() > 1 || weightedClauses.size() > 2) {
+        return null;
+      }
+
+      List<ScorerSupplier> optional = new ArrayList<>();
+      for (WeightedBooleanClause wc : weightedClauses) {
+        Weight w = wc.weight;
+        BooleanClause c = wc.clause;
+        if (c.getOccur() != Occur.SHOULD) {
+          continue;
+        }
+        ScorerSupplier scorer = w.scorerSupplier(context);
+        if (scorer != null) {
+          optional.add(scorer);
+        }
+      }
+
+      if (optional.size() <= 1) {
+        return null;
+      }
+
+      List<Scorer> optionalScorers = new ArrayList<>();
+      for (ScorerSupplier ss : optional) {
+        optionalScorers.add(ss.get(Long.MAX_VALUE));
+      }
+
+      return new BulkScorer() {
+        final Scorer bmmScorer = new BlockMaxMaxscoreScorer(BooleanWeight.this, optionalScorers);
+        final int maxDoc = context.reader().maxDoc();
+        final DocIdSetIterator iterator = bmmScorer.iterator();

[GitHub] [lucene] zacharymorn commented on pull request #1018: LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions

2022-07-16 Thread GitBox


zacharymorn commented on PR #1018:
URL: https://github.com/apache/lucene/pull/1018#issuecomment-1186390917

   > Thanks for explaining the motivation for the dedicated bulk scorer, I left 
some suggestions.
   
   No problem and thanks for the suggestions! I have incorporated them and like 
how clean the bulk scorer looks now!





[GitHub] [lucene-jira-archive] mocobeta opened a new issue, #48: test issue with component

2022-07-16 Thread GitBox


mocobeta opened a new issue, #48:
URL: https://github.com/apache/lucene-jira-archive/issues/48

   ### Description
   
   test
   
   ### Version and Environments
   
   _No response_
   
   ### Lucene Component
   
   component:module/analysis





[GitHub] [lucene-jira-archive] mocobeta closed issue #48: test issue with component

2022-07-16 Thread GitBox


mocobeta closed issue #48: test issue with component
URL: https://github.com/apache/lucene-jira-archive/issues/48





[GitHub] [lucene] zacharymorn commented on pull request #1018: LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions

2022-07-16 Thread GitBox


zacharymorn commented on PR #1018:
URL: https://github.com/apache/lucene/pull/1018#issuecomment-1186398587

   Here are the latest `wikinightly` benchmark results:
   
   ```
   Task                         QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff                p-value
   BrowseDateSSDVFacets                 3.98  (34.1%)                     3.73  (29.8%)     -6.2% (-52% - 87%)   0.541
   OrHighMedDayTaxoFacets              24.64   (5.9%)                    23.96   (9.5%)     -2.7% (-17% - 13%)   0.271
   TermDTSort                         342.77   (7.8%)                   336.36   (4.7%)     -1.9% (-13% - 11%)   0.359
   BrowseRandomLabelSSDVFacets         20.43   (9.3%)                    20.06   (9.4%)     -1.8% (-18% - 18%)   0.539
   TermBGroup1M1P                      37.19   (7.0%)                    36.72   (5.2%)     -1.3% (-12% - 11%)   0.521
   AndHighHighDayTaxoFacets            12.29   (3.1%)                    12.13   (2.9%)     -1.3% (-7% - 4%)     0.191
   MedTermDayTaxoFacets                75.53   (5.2%)                    75.06   (5.3%)     -0.6% (-10% - 10%)   0.706
   TermMonthSort                      351.78   (6.0%)                   349.61   (2.6%)     -0.6% (-8% - 8%)     0.675
   Fuzzy1                              79.12   (2.5%)                    78.71   (2.4%)     -0.5% (-5% - 4%)     0.509
   IntervalsOrdered                    13.21   (3.1%)                    13.14   (3.4%)     -0.5% (-6% - 6%)     0.625
   TermDateFacets                      72.10   (5.6%)                    71.78   (5.5%)     -0.4% (-10% - 11%)   0.797
   TermTitleSort                      350.94   (6.0%)                   349.80   (2.8%)     -0.3% (-8% - 8%)     0.826
   PKLookup                           322.25   (5.8%)                   321.46   (4.3%)     -0.2% (-9% - 10%)    0.879
   SpanNear                           166.41   (3.5%)                   166.06   (2.1%)     -0.2% (-5% - 5%)     0.821
   SloppyPhrase                         4.74   (4.4%)                     4.75   (3.7%)      0.1% (-7% - 8%)     0.942
   Term                              3394.26   (5.0%)                  3398.22   (5.5%)      0.1% (-9% - 11%)    0.944
   AndMedOrHighHigh                    70.98   (5.5%)                    71.07   (5.5%)      0.1% (-10% - 11%)   0.945
   AndHighMedDayTaxoFacets            121.81   (2.5%)                   122.12   (2.3%)      0.3% (-4% - 5%)     0.737
   Phrase                              38.19   (2.5%)                    38.29   (2.2%)      0.3% (-4% - 5%)     0.724
   AndHighOrMedMed                    120.53   (5.4%)                   120.92   (5.4%)      0.3% (-9% - 11%)    0.849
   Respell                             91.05   (2.9%)                    91.55   (2.5%)      0.5% (-4% - 6%)     0.522
   Fuzzy2                             120.74   (2.5%)                   121.46   (2.5%)      0.6% (-4% - 5%)     0.453
   AndHighHigh                         99.32   (3.3%)                   100.24   (3.7%)      0.9% (-5% - 8%)     0.403
   IntNRQ                            1188.88   (3.2%)                  1200.31   (3.4%)      1.0% (-5% - 7%)     0.361
   Wildcard                           163.38   (7.0%)                   165.12   (4.5%)      1.1% (-9% - 13%)    0.566
   AndHighMed                         156.13   (5.2%)                   158.09   (5.1%)      1.3% (-8% - 12%)    0.439
   TermDayOfYearSort                  140.35   (3.1%)                   142.36   (4.7%)      1.4% (-6% - 9%)     0.255
   BrowseDayOfYearSSDVFacets           26.19  (12.8%)                    26.60  (11.7%)      1.6% (-20% - 29%)   0.686
   TermGroup100                        65.78   (2.5%)                    66.85   (3.7%)      1.6% (-4% - 8%)     0.109
   BrowseMonthTaxoFacets               28.68  (34.4%)                    29.16  (37.1%)      1.7% (-52% - 111%)  0.883
   Prefix3                             85.54   (6.6%)                    87.24   (5.6%)      2.0% (-9% - 15%)    0.301
   BrowseDayOfYearTaxoFacets           28.90  (30.4%)                    29.64  (33.9%)      2.6% (-47% - 96%)   0.800
   TermGroup10K                        40.11   (3.8%)                    41.31   (4.0%)      3.0% (-4% - 11%)    0.017
   TermGroup1M                         38.63   (3.8%)                    39.82   (3.7%)      3.1% (-4% - 10%)    0.009
   TermBGroup1M                        46.33   (3.8%)                    47.77   (4.5%)      3.1% (-5% - 11%)    0.019
   BrowseDateTaxoFacets                28.50  (30.4%)                    29.46  (34.6%)      3.4% (-47% - 98%)   0.745
   BrowseMonthSSDVFacets               28.27  (14.7%)                    29.70  (15.4%)      5.0% (-21% - 41%)   0.292
   BrowseRandomLabelTaxoFacets         28.78  (50.1%)                    30.70  (52.7%)      6.7% (-64% - 219%)  0.680
   OrHighHigh                          25.55   (5.9%)                    37.99   (6.8%)     48.7% (34% - 65%)    0.000
   OrHighMed                           92.43   (6.4%)                   210.19  (11.3%)    127.4% (103% - 155%)  0.000
   ```
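
   The "Pct diff" figures in the table are consistent with a plain relative change of the mean QPS against the baseline, which can be checked for the two disjunction tasks with the short sketch below. The formula is an assumption about how the column is derived, not something stated in this thread, and the class and method names are hypothetical.

```java
/** Hypothetical check of the "Pct diff" column; the formula is an assumption. */
final class PctDiffSketch {

  /** Relative change of the candidate QPS against the baseline QPS, in percent. */
  static double pctDiff(double baselineQps, double candidateQps) {
    return 100.0 * (candidateQps - baselineQps) / baselineQps;
  }

  public static void main(String[] args) {
    // QPS values copied from the OrHighHigh and OrHighMed rows of the table.
    System.out.printf("OrHighHigh: %.1f%%%n", pctDiff(25.55, 37.99));
    System.out.printf("OrHighMed: %.1f%%%n", pctDiff(92.43, 210.19));
  }
}
```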
   ```
   TaskQPS baseline  StdDevQPS 
my_modified_vers

[GitHub] [lucene] zacharymorn commented on pull request #1018: LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions

2022-07-16 Thread GitBox


zacharymorn commented on PR #1018:
URL: https://github.com/apache/lucene/pull/1018#issuecomment-1186399338

   @jpountz If this approach to limiting the BMM scorer to top-level disjunctions 
looks good to you, I can go ahead and update the corresponding tests to make 
this PR ready?





[GitHub] [lucene-jira-archive] mocobeta opened a new pull request, #49: Enable mention to comment authors

2022-07-16 Thread GitBox


mocobeta opened a new pull request, #49:
URL: https://github.com/apache/lucene-jira-archive/pull/49

   #27 
   
   ![Screenshot from 2022-07-17 
14-03-02](https://user-images.githubusercontent.com/1825333/179384647-3384b750-83a2-447d-aa89-c44150cb8cc3.png)
   
   should be
   
   ![Screenshot from 2022-07-17 
14-03-13](https://user-images.githubusercontent.com/1825333/179384652-220b7d8d-ba2c-4ce5-b90b-ed517e37f3eb.png)
   

