[GitHub] [lucene] jpountz commented on a diff in pull request #1018: LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions
jpountz commented on code in PR #1018:
URL: https://github.com/apache/lucene/pull/1018#discussion_r923041831

## lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java:
@@ -191,6 +191,66 @@ public long cost() {
   // or null if it is not applicable
   // pkg-private for forcing use of BooleanScorer in tests
   BulkScorer optionalBulkScorer(LeafReaderContext context) throws IOException {
+    if (scoreMode == ScoreMode.TOP_SCORES) {
+      if (query.getMinimumNumberShouldMatch() > 1 || weightedClauses.size() > 2) {
+        return null;
+      }
+
+      List<ScorerSupplier> optional = new ArrayList<>();
+      for (WeightedBooleanClause wc : weightedClauses) {
+        Weight w = wc.weight;
+        BooleanClause c = wc.clause;
+        if (c.getOccur() != Occur.SHOULD) {
+          continue;
+        }
+        ScorerSupplier scorer = w.scorerSupplier(context);
+        if (scorer != null) {
+          optional.add(scorer);
+        }
+      }
+
+      if (optional.size() <= 1) {
+        return null;
+      }
+
+      List<Scorer> optionalScorers = new ArrayList<>();
+      for (ScorerSupplier ss : optional) {
+        optionalScorers.add(ss.get(Long.MAX_VALUE));
+      }
+
+      return new BulkScorer() {
+        final Scorer bmmScorer = new BlockMaxMaxscoreScorer(BooleanWeight.this, optionalScorers);
+        final int maxDoc = context.reader().maxDoc();
+        final DocIdSetIterator iterator = bmmScorer.iterator();
+
+        @Override
+        public int score(LeafCollector collector, Bits acceptDocs, int min, int max)
+            throws IOException {
+          collector.setScorer(bmmScorer);
+
+          int doc = min;
+          while (true) {
+            doc = iterator.advance(doc);

Review Comment:
Yes, indeed, though we might be able to simplify it to look like below:

```java
new BulkScorer() {
  final Scorer bmmScorer = new BlockMaxMaxscoreScorer(BooleanWeight.this, optionalScorers);
  final DocIdSetIterator iterator = bmmScorer.iterator();

  @Override
  public int score(LeafCollector collector, Bits acceptDocs, int min, int max)
      throws IOException {
    collector.setScorer(bmmScorer);
    int doc = iterator.docID();
    if (doc < min) {
      doc = iterator.advance(min);
    }
    while (doc < max) {
      if (acceptDocs == null || acceptDocs.get(doc)) {
        collector.collect(doc);
      }
      doc = iterator.nextDoc();
    }
    return doc;
  }
}
```

The reason is that a consumer of the bulk scorer could do something like:

```java
bulkScorer.score(collector, null, 0, 1000);
bulkScorer.score(collector, null, 1000, 2000);
```

If the last match of the first window is, say, 998 and the first match after the first window is 1005, then we should make sure to score 1005 when scoring the second window, before starting to advance again.
[jira] [Resolved] (LUCENE-10497) Unify "Token" interface in Kuromoji and Nori
[ https://issues.apache.org/jira/browse/LUCENE-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tomoko Uchida resolved LUCENE-10497.
------------------------------------
    Fix Version/s: 10.0 (main)
         Assignee: Tomoko Uchida
       Resolution: Fixed

> Unify "Token" interface in Kuromoji and Nori
> --------------------------------------------
>
>                 Key: LUCENE-10497
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10497
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Tomoko Uchida
>            Assignee: Tomoko Uchida
>            Priority: Minor
>             Fix For: 10.0 (main)
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> "Token" classes are the output of the Viterbi algorithm in Kuromoji and Nori, and their implementations differ slightly (Nori's Token class looks more expressive or extendable).
> To proceed with LUCENE-10493, I think we need a unified "Token" class, or at least a common interface for it.
[jira] [Resolved] (LUCENE-10493) Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?
[ https://issues.apache.org/jira/browse/LUCENE-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tomoko Uchida resolved LUCENE-10493.
------------------------------------
    Fix Version/s: 10.0 (main)
         Assignee: Tomoko Uchida
       Resolution: Fixed

> Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-10493
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10493
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Tomoko Uchida
>            Assignee: Tomoko Uchida
>            Priority: Major
>             Fix For: 10.0 (main)
>
>          Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> We now have common dictionary interfaces for kuromoji and nori (LUCENE-10393). A natural question would be: is it possible to unify the Japanese/Korean tokenizers?
> The core methods of the two tokenizers are `parse()` and `backtrace()`, which calculate the minimum cost path by Viterbi search. I'd set the goal of this issue to factoring them out into a separate class (in analysis-common) that is shared between JapaneseTokenizer and KoreanTokenizer.
> The algorithm to solve the minimum cost path itself is of course language-agnostic, so I think it should be theoretically possible; the most difficult part here might be the N-best path calculation, which is supported only by JapaneseTokenizer and not by KoreanTokenizer.
[GitHub] [lucene] luyuncheng commented on a diff in pull request #987: LUCENE-10627: Using CompositeByteBuf to Reduce Memory Copy
luyuncheng commented on code in PR #987:
URL: https://github.com/apache/lucene/pull/987#discussion_r923100057

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/DeflateWithPresetDictCompressionMode.java:
@@ -163,12 +165,16 @@ private static class DeflateWithPresetDictCompressor extends Compressor {
     final Deflater compressor;
     final BugfixDeflater_JDK8252739 deflaterBugfix;
     byte[] compressed;
+    byte[] bufferDict;
+    byte[] bufferBlock;
     boolean closed;

     DeflateWithPresetDictCompressor(int level) {
       compressor = new Deflater(level, true);
       deflaterBugfix = BugfixDeflater_JDK8252739.createBugfix(compressor);
       compressed = new byte[64];
+      bufferDict = BytesRef.EMPTY_BYTES;
+      bufferBlock = BytesRef.EMPTY_BYTES;
     }

     private void doCompress(byte[] bytes, int off, int len, DataOutput out) throws IOException {

Review Comment:
++, I removed it at [commits(ebcbd)](https://github.com/apache/lucene/blob/ebcbdd05fbf114ca15d04cd4a9172eb51fab9a1a/lucene/core/src/java/org/apache/lucene/codecs/lucene90/DeflateWithPresetDictCompressionMode.java).

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/DeflateWithPresetDictCompressionMode.java:
@@ -199,23 +205,65 @@ private void doCompress(byte[] bytes, int off, int len, DataOutput out) throws I
       out.writeBytes(compressed, totalCount);
     }

+    private void doCompress(ByteBuffer bytes, int len, DataOutput out) throws IOException {
+      if (len == 0) {
+        out.writeVInt(0);
+        return;
+      }
+      compressor.setInput(bytes);
+      compressor.finish();
+      if (compressor.needsInput()) {
+        throw new IllegalStateException();
+      }
+
+      int totalCount = 0;
+      for (; ; ) {
+        final int count =
+            compressor.deflate(compressed, totalCount, compressed.length - totalCount);
+        totalCount += count;
+        assert totalCount <= compressed.length;
+        if (compressor.finished()) {
+          break;
+        } else {
+          compressed = ArrayUtil.grow(compressed);
+        }
+      }
+
+      out.writeVInt(totalCount);
+      out.writeBytes(compressed, totalCount);
+    }
+
     @Override
-    public void compress(byte[] bytes, int off, int len, DataOutput out) throws IOException {
+    public void compress(ByteBuffersDataInput buffersInput, DataOutput out) throws IOException {
+      final int len = (int) (buffersInput.size() - buffersInput.position());
+      final int end = (int) buffersInput.size();
       final int dictLength = len / (NUM_SUB_BLOCKS * DICT_SIZE_FACTOR);
       final int blockLength = (len - dictLength + NUM_SUB_BLOCKS - 1) / NUM_SUB_BLOCKS;
       out.writeVInt(dictLength);
       out.writeVInt(blockLength);
-      final int end = off + len;
       // Compress the dictionary first
       compressor.reset();
-      doCompress(bytes, off, dictLength, out);
+      bufferDict = ArrayUtil.growNoCopy(bufferDict, dictLength);
+      buffersInput.readBytes(bufferDict, 0, dictLength);
+      doCompress(bufferDict, 0, dictLength, out);
       // And then sub blocks
-      for (int start = off + dictLength; start < end; start += blockLength) {
+      for (int start = dictLength; start < end; start += blockLength) {
         compressor.reset();
-        deflaterBugfix.setDictionary(bytes, off, dictLength);
-        doCompress(bytes, start, Math.min(blockLength, off + len - start), out);
+        deflaterBugfix.setDictionary(bufferDict, 0, dictLength);
+        int l = Math.min(blockLength, len - start);
+        // if [start, start + len] stays in one ByteBuffer, we can skip the memory copy;
+        // otherwise we need to copy the bytes into one contiguous byte array
+        ByteBuffer bb = buffersInput.sliceOne(start, l);

Review Comment:
Nice suggestion! At [commits(ebcbd...)](https://github.com/apache/lucene/blob/ebcbdd05fbf114ca15d04cd4a9172eb51fab9a1a/lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataInput.java#L175) I moved this logic into `ByteBuffersDataInput#readBytes`, which determines whether to use the internal buffers or create a new buffer.

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsWriter.java:
@@ -519,7 +518,13 @@ private void copyOneDoc(Lucene90CompressingStoredFieldsReader reader, int docID)
     assert reader.getVersion() == VERSION_CURRENT;
     SerializedDocument doc = reader.document(docID);
     startDocument();
-    bufferedDocs.copyBytes(doc.in, doc.length);
+
+    if (doc.in instanceof ByteArrayDataInput) {
+      // reuse ByteArrayDataInput to reduce memory copy
+      bufferedDocs.copyBytes((ByteArrayDataInput) doc.in, doc.length);
+    } else {
+      bufferedDocs.copyBytes(doc.in, doc.length);
+    }

Review Comment:
LGTM. I deleted this code in this PR, and I will open a new issue to reduce memory copy in copyOneDoc.
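A minimal sketch of that slice-or-copy behavior, assuming a list of equally sized backing buffers (`blocks` and `blockSize` are illustrative names, not the exact `ByteBuffersDataInput` internals):

```java
import java.nio.ByteBuffer;

// Illustrative slice-or-copy helper: return a zero-copy slice when the
// requested range fits in a single backing buffer, otherwise copy the
// range into one fresh contiguous buffer. Field layout is assumed here,
// not taken from Lucene's actual implementation.
final class SliceOrCopy {
  static ByteBuffer readBytes(ByteBuffer[] blocks, int blockSize, long pos, int len) {
    int blockIndex = (int) (pos / blockSize);
    int blockOffset = (int) (pos % blockSize);
    ByteBuffer block = blocks[blockIndex].duplicate();
    if (blockOffset + len <= block.limit()) {
      // Single-buffer case: no memory copy, just a view over the block.
      block.position(blockOffset);
      block.limit(blockOffset + len);
      return block.slice();
    }
    // The range crosses a block boundary: copy into one contiguous array.
    byte[] copy = new byte[len];
    for (int copied = 0; copied < len; ) {
      ByteBuffer b = blocks[(int) ((pos + copied) / blockSize)].duplicate();
      b.position((int) ((pos + copied) % blockSize));
      int chunk = Math.min(b.remaining(), len - copied);
      b.get(copy, copied, chunk);
      copied += chunk;
    }
    return ByteBuffer.wrap(copy);
  }
}
```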
[GitHub] [lucene] luyuncheng commented on pull request #987: LUCENE-10627: Using CompositeByteBuf to Reduce Memory Copy
luyuncheng commented on PR #987:
URL: https://github.com/apache/lucene/pull/987#issuecomment-1186947659

> Thanks for the iteration, the high-level idea looks good to me, I left some suggestions.

@jpountz Thanks for your nice suggestions. In the latest commits [ebcbdd](https://github.com/apache/lucene/commit/457bab5abfbaa3aabebe99a97e645b4f623be15f):
1. Changed the byteBuffers declaration in Lucene90CompressingStoredFieldsWriter.
2. Moved the sliceOne logic into the ByteBuffersDataInput level, where it determines whether to use the internal buffers or create a new buffer.
3. Deleted the DeflateWithPresetDictCompressionMode#doCompress method whose parameter is byte[].
4. Deleted the copyOneDoc -> readBytes optimization that instantiated ByteArrayDataInput; I will create a new issue to reduce memory copy in copyOneDoc.
[GitHub] [lucene-jira-archive] mikemccand commented on issue #37: Why are some Jira issues completely missing?
mikemccand commented on issue #37:
URL: https://github.com/apache/lucene-jira-archive/issues/37#issuecomment-1187013700

> Some issues were probably moved to SOLR issues?

Oooh I love that theory! But then, wouldn't they have left some digital breadcrumbs? Initial email sent to the list as `LUCENE-`, to my email, etc.?
[GitHub] [lucene-jira-archive] mikemccand commented on issue #37: Why are some Jira issues completely missing?
mikemccand commented on issue #37:
URL: https://github.com/apache/lucene-jira-archive/issues/37#issuecomment-1187016028

Whoa! I found two of the issues! Check it out: https://markmail.org/message/ntz5hxrm4lli57oe and https://markmail.org/message/d5t346iwopaqj45x

Apparently it is possible to delete an issue entirely :) That's a scary power, but maybe it is limited to PMC members or committers?

Phew, mystery solved.
[GitHub] [lucene-jira-archive] mikemccand closed issue #37: Why are some Jira issues completely missing?
mikemccand closed issue #37: Why are some Jira issues completely missing?
URL: https://github.com/apache/lucene-jira-archive/issues/37
[jira] [Commented] (LUCENE-10633) Dynamic pruning for queries sorted by SORTED(_SET) field
[ https://issues.apache.org/jira/browse/LUCENE-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567913#comment-17567913 ]

Michael McCandless commented on LUCENE-10633:
---------------------------------------------

{quote}I plan on opening a PR against luceneutil and I already opened LUCENE-10162 a while back about making this sort of thing a more obvious choice. It also relates to [~gsmiller]'s work about running term-in-set queries using doc values, which would only help if doc values are enabled on the field.
{quote}
Awesome, thanks [~jpountz]!

> Dynamic pruning for queries sorted by SORTED(_SET) field
> --------------------------------------------------------
>
>                 Key: LUCENE-10633
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10633
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> LUCENE-9280 introduced the ability to dynamically prune non-competitive hits when sorting by a numeric field, by leveraging the points index to skip documents that do not compare better than the top of the priority queue maintained by the field comparator.
> However, queries sorted by a SORTED(_SET) field still look at all hits, which is disappointing. Could we leverage the terms index to skip hits?
[GitHub] [lucene-jira-archive] mikemccand opened a new issue, #53: Remove "module" for core components?
mikemccand opened a new issue, #53:
URL: https://github.com/apache/lucene-jira-archive/issues/53

### Description

I noticed we have components like `component:module/core/hnsw`. Could we remove the `module/` prefix for `core`? It would make the core component strings shorter, which is meaningful/helpful since they are so frequently used?

The other real modules (e.g. `component:module/facet`) I think should keep their `module/` prefix?

Also, I like the issue template testing here!

### Version and Environments

_No response_

### Lucene Component

_No response_
[GitHub] [lucene-jira-archive] mikemccand commented on issue #53: Remove "module" for core components?
mikemccand commented on issue #53:
URL: https://github.com/apache/lucene-jira-archive/issues/53#issuecomment-1187165708

Also, could we maybe use `fix-version` and `affects-version` instead of `fixVersion` and `affectsVersion`? I think it is more readable on quick glance with a `-`?

I noticed there is also a `type:new_feature` -- maybe we can make that `type:new-feature` for consistency?
[GitHub] [lucene-jira-archive] mikemccand opened a new issue, #54: Hyperlinks are sometimes not actual links on import
mikemccand opened a new issue, #54:
URL: https://github.com/apache/lucene-jira-archive/issues/54

### Description

I found this example comment, where commitbot had embedded a hyperlink, and in Jira it is a hyperlink, but in the imported GitHub issue (in my test repo), it is just plain text?

https://github.com/mikemccand/stargazers-migration-test/issues/62#issuecomment-1187146685

### Version and Environments

_No response_

### Lucene Component

_No response_
[jira] [Created] (LUCENE-10657) Reduce memory copy in DataOutput copyBytes
LuYunCheng created LUCENE-10657:
-----------------------------------

             Summary: Reduce memory copy in DataOutput copyBytes
                 Key: LUCENE-10657
                 URL: https://issues.apache.org/jira/browse/LUCENE-10657
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/store
            Reporter: LuYunCheng

This is derived from [LUCENE-10627|https://github.com/apache/lucene/pull/987].

The abstract method `copyBytes` in DataOutput has to copy from the input to a copyBuffer and then write into ByteBuffersDataOutput.blocks. I think this is unnecessary: we can override it to copy directly from the input into the output.

With this override, we can:
# Reduce memory copy in `Lucene90CompressingStoredFieldsWriter#copyOneDoc` -> `bufferedDocs.copyBytes(DataInput input)`
# Reduce memory copy in `Lucene90CompoundFormat.writeCompoundFile` -> `data.copyBytes` when the input is `BufferedChecksumIndexInput` and the output is `ByteBuffersDataOutput`
# Reduce memory copy in `IndexWriter#copySegmentAsIs` -> copyFrom -> copyBytes
[GitHub] [lucene] luyuncheng opened a new pull request, #1034: LUCENE-10657: Reduce memory copy in DataOutput copyBytes
luyuncheng opened a new pull request, #1034:
URL: https://github.com/apache/lucene/pull/1034

This is derived from [LUCENE-10627](https://issues.apache.org/jira/browse/LUCENE-10627) at #987
Jira: https://issues.apache.org/jira/browse/LUCENE-10657

The abstract method `copyBytes` in DataOutput has to copy from the input to a copyBuffer and then write into ByteBuffersDataOutput.blocks. I think this is unnecessary: we can override it to copy directly from the input into the output.

With this override, we can:
1. Reduce memory copy in `Lucene90CompressingStoredFieldsWriter#copyOneDoc` -> `bufferedDocs.copyBytes(DataInput input)`
2. Reduce memory copy in `Lucene90CompoundFormat.writeCompoundFile` -> `data.copyBytes` when the input is `BufferedChecksumIndexInput` and the output is `ByteBuffersDataOutput`
3. Reduce memory copy in `IndexWriter#copySegmentAsIs` -> copyFrom -> copyBytes
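In sketch form, the override reads straight into the output's current block, assuming a heap-backed `currentBlock` and an `appendBlock()` helper as in the block-based output (an illustration of the idea, not the exact patch):

```java
// Sketch of copying directly into the output's backing arrays, skipping the
// intermediate copyBuffer used by the default DataOutput#copyBytes.
// `currentBlock` and `appendBlock()` are assumed internals of the output.
@Override
public void copyBytes(DataInput input, long numBytes) throws IOException {
  assert numBytes >= 0 : "numBytes=" + numBytes;
  long remaining = numBytes;
  while (remaining > 0) {
    if (!currentBlock.hasRemaining()) {
      appendBlock(); // allocate / switch to the next block
    }
    int chunk = (int) Math.min(currentBlock.remaining(), remaining);
    // Read directly into the block's backing byte[]; this only works for
    // heap (non-direct) buffers.
    input.readBytes(
        currentBlock.array(), currentBlock.arrayOffset() + currentBlock.position(), chunk);
    currentBlock.position(currentBlock.position() + chunk);
    remaining -= chunk;
  }
}
```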
[jira] [Updated] (LUCENE-10657) Reduce memory copy in DataOutput copyBytes
[ https://issues.apache.org/jira/browse/LUCENE-10657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LuYunCheng updated LUCENE-10657:
--------------------------------
    Description:
This is derived from [LUCENE-10627|https://github.com/apache/lucene/pull/987]
Code: https://github.com/apache/lucene/pull/1034

The abstract method `copyBytes` in DataOutput has to copy from the input to a copyBuffer and then write into ByteBuffersDataOutput.blocks. I think this is unnecessary: we can override it to copy directly from the input into the output.

With this override, we can:
# Reduce memory copy in `Lucene90CompressingStoredFieldsWriter#copyOneDoc` -> `bufferedDocs.copyBytes(DataInput input)`
# Reduce memory copy in `Lucene90CompoundFormat.writeCompoundFile` -> `data.copyBytes` when the input is `BufferedChecksumIndexInput` and the output is `ByteBuffersDataOutput`
# Reduce memory copy in `IndexWriter#copySegmentAsIs` -> copyFrom -> copyBytes

  was:
This is derived from [LUCENE-10627|https://github.com/apache/lucene/pull/987]

The abstract method `copyBytes` in DataOutput has to copy from the input to a copyBuffer and then write into ByteBuffersDataOutput.blocks. I think this is unnecessary: we can override it to copy directly from the input into the output.

With this override, we can:
# Reduce memory copy in `Lucene90CompressingStoredFieldsWriter#copyOneDoc` -> `bufferedDocs.copyBytes(DataInput input)`
# Reduce memory copy in `Lucene90CompoundFormat.writeCompoundFile` -> `data.copyBytes` when the input is `BufferedChecksumIndexInput` and the output is `ByteBuffersDataOutput`
# Reduce memory copy in `IndexWriter#copySegmentAsIs` -> copyFrom -> copyBytes

> Reduce memory copy in DataOutput copyBytes
> ------------------------------------------
>
>                 Key: LUCENE-10657
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10657
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/store
>            Reporter: LuYunCheng
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is derived from [LUCENE-10627|https://github.com/apache/lucene/pull/987]
> Code: https://github.com/apache/lucene/pull/1034
> The abstract method `copyBytes` in DataOutput has to copy from the input to a copyBuffer and then write into ByteBuffersDataOutput.blocks. I think this is unnecessary: we can override it to copy directly from the input into the output.
> With this override, we can:
> # Reduce memory copy in `Lucene90CompressingStoredFieldsWriter#copyOneDoc` -> `bufferedDocs.copyBytes(DataInput input)`
> # Reduce memory copy in `Lucene90CompoundFormat.writeCompoundFile` -> `data.copyBytes` when the input is `BufferedChecksumIndexInput` and the output is `ByteBuffersDataOutput`
> # Reduce memory copy in `IndexWriter#copySegmentAsIs` -> copyFrom -> copyBytes
[GitHub] [lucene] luyuncheng commented on a diff in pull request #987: LUCENE-10627: Using CompositeByteBuf to Reduce Memory Copy
luyuncheng commented on code in PR #987:
URL: https://github.com/apache/lucene/pull/987#discussion_r923284236

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsWriter.java:
@@ -519,7 +518,13 @@ private void copyOneDoc(Lucene90CompressingStoredFieldsReader reader, int docID)
     assert reader.getVersion() == VERSION_CURRENT;
     SerializedDocument doc = reader.document(docID);
     startDocument();
-    bufferedDocs.copyBytes(doc.in, doc.length);
+
+    if (doc.in instanceof ByteArrayDataInput) {
+      // reuse ByteArrayDataInput to reduce memory copy
+      bufferedDocs.copyBytes((ByteArrayDataInput) doc.in, doc.length);
+    } else {
+      bufferedDocs.copyBytes(doc.in, doc.length);
+    }

Review Comment:
> I think that we could avoid this `instanceof` check by overriding `ByteBuffersDataOutput#copyBytes` to read directly into its internal buffers when they are not direct (ie. backed by a `byte[]`)? (Maybe in a separate change?)

@jpountz thanks for your advice! I think overriding `ByteBuffersDataOutput#copyBytes` is a great idea; it can remove the memory copy in many methods.

I opened a new issue: LUCENE-10657 #1034. Moving this logic into `ByteBuffersDataOutput#copyBytes` reduces memory copies in more methods:
1. Reduce memory copy in Lucene90CompressingStoredFieldsWriter#copyOneDoc -> bufferedDocs.copyBytes(DataInput input)
2. Reduce memory copy in Lucene90CompoundFormat.writeCompoundFile -> data.copyBytes when the input is BufferedChecksumIndexInput and the output is ByteBuffersDataOutput
3. Reduce memory copy in IndexWriter#copySegmentAsIs -> copyFrom -> copyBytes
[GitHub] [lucene] jpountz merged pull request #1025: LUCENE-10649: Fix failures in TestDemoParallelLeafReader
jpountz merged PR #1025:
URL: https://github.com/apache/lucene/pull/1025
[jira] [Commented] (LUCENE-10649) Failure in TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField
[ https://issues.apache.org/jira/browse/LUCENE-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567968#comment-17567968 ]

ASF subversion and git services commented on LUCENE-10649:
----------------------------------------------------------

Commit 30a7c52e6c7ed4ffd67a64a13c3f3b25b34853d5 in lucene's branch refs/heads/main from Vigya Sharma
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=30a7c52e6c7 ]

LUCENE-10649: Fix failures in TestDemoParallelLeafReader (#1025)

> Failure in TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-10649
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10649
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Vigya Sharma
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Failing Build Link:
> https://jenkins.thetaphi.de/job/Lucene-main-Linux/35617/testReport/junit/org.apache.lucene.index/TestDemoParallelLeafReader/testRandomMultipleSchemaGensSameField/
> Repro:
> {code:java}
> gradlew test --tests TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField -Dtests.seed=A7496D7D3957981A -Dtests.multiplier=3 -Dtests.locale=sr-Latn-BA -Dtests.timezone=Etc/GMT-7 -Dtests.asserts=true -Dtests.file.encoding=UTF-8
> {code}
> Error:
> {code:java}
> java.lang.AssertionError: expected:<103> but was:<2147483647>
>     at __randomizedtesting.SeedInfo.seed([A7496D7D3957981A:F71866BCCEA1C903]:0)
>     at org.junit.Assert.fail(Assert.java:89)
>     at org.junit.Assert.failNotEquals(Assert.java:835)
>     at org.junit.Assert.assertEquals(Assert.java:647)
>     at org.junit.Assert.assertEquals(Assert.java:633)
>     at org.apache.lucene.index.TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField(TestDemoParallelLeafReader.java:1347)
>     at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
> {code}
[GitHub] [lucene-jira-archive] mocobeta merged pull request #52: Show issue title for linked issues
mocobeta merged PR #52:
URL: https://github.com/apache/lucene-jira-archive/pull/52
[jira] [Commented] (LUCENE-10649) Failure in TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField
[ https://issues.apache.org/jira/browse/LUCENE-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567972#comment-17567972 ]

ASF subversion and git services commented on LUCENE-10649:
----------------------------------------------------------

Commit 41f7618535448e14365f26aebdf6db443a2d0cea in lucene's branch refs/heads/branch_9x from Vigya Sharma
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=41f76185354 ]

LUCENE-10649: Fix failures in TestDemoParallelLeafReader (#1025)

> Failure in TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-10649
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10649
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Vigya Sharma
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Failing Build Link:
> https://jenkins.thetaphi.de/job/Lucene-main-Linux/35617/testReport/junit/org.apache.lucene.index/TestDemoParallelLeafReader/testRandomMultipleSchemaGensSameField/
> Repro:
> {code:java}
> gradlew test --tests TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField -Dtests.seed=A7496D7D3957981A -Dtests.multiplier=3 -Dtests.locale=sr-Latn-BA -Dtests.timezone=Etc/GMT-7 -Dtests.asserts=true -Dtests.file.encoding=UTF-8
> {code}
> Error:
> {code:java}
> java.lang.AssertionError: expected:<103> but was:<2147483647>
>     at __randomizedtesting.SeedInfo.seed([A7496D7D3957981A:F71866BCCEA1C903]:0)
>     at org.junit.Assert.fail(Assert.java:89)
>     at org.junit.Assert.failNotEquals(Assert.java:835)
>     at org.junit.Assert.assertEquals(Assert.java:647)
>     at org.junit.Assert.assertEquals(Assert.java:633)
>     at org.apache.lucene.index.TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField(TestDemoParallelLeafReader.java:1347)
>     at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
> {code}
[GitHub] [lucene] mayya-sharipova commented on pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on PR #992:
URL: https://github.com/apache/lucene/pull/992#issuecomment-1187314554

@jpountz I am wondering if you have further comments for this PR?
[GitHub] [lucene-jira-archive] mocobeta commented on issue #53: Remove "module" for core components?
mocobeta commented on issue #53:
URL: https://github.com/apache/lucene-jira-archive/issues/53#issuecomment-1187457057

> Could we remove the `module/` prefix for `core`? It would make the core component strings shorter, which is meaningful/helpful since they are so frequently used?

We could omit `module` from it, though I would slightly prefer to keep `module` for all components for consistency. This is a bit longer, but we'll use the auto-complete feature when adding/editing labels anyway.

I agree that it looks redundant; perhaps it would be better to change the common label prefix from `component` to `module`, like `module:core/index`, `module:analysis`, `module:sandbox`? `component` comes from Jira's `Components` field, so it does not really mean that much on GitHub?
[GitHub] [lucene-jira-archive] mocobeta commented on issue #53: Remove "module" for core components?
mocobeta commented on issue #53:
URL: https://github.com/apache/lucene-jira-archive/issues/53#issuecomment-1187483046

> Also, could we maybe use `fix-version` and `affects-version` instead of `fixVersion` and `affectsVersion`? I think it is more readable on quick glance with a `-`?
>
> I noticed there is also a `type:new_feature` -- maybe we can make that `type:new-feature` for consistency?

Sure, I have no opinion about the label format and named them without much consideration. I will update these lines.

https://github.com/apache/lucene-jira-archive/blob/e3ec8c748bc6e2887190f4b7c93288da0d31b0af/migration/src/jira2github_import.py#L158-L168
[GitHub] [lucene-jira-archive] mikemccand commented on issue #53: Remove "module" for core components?
mikemccand commented on issue #53:
URL: https://github.com/apache/lucene-jira-archive/issues/53#issuecomment-1187491522

> I agree that it looks redundant, perhaps it would be better to change the common label prefix from `component` to `module` like `module:core/index`, `module:analysis`, `module:sandbox`? `component` comes from Jira's `Components` field, then it does not really mean that much on GitHub?

+1, I like that solution!
[GitHub] [lucene] jpountz commented on a diff in pull request #1034: LUCENE-10657: Reduce memory copy in DataOutput copyBytes
jpountz commented on code in PR #1034:
URL: https://github.com/apache/lucene/pull/1034#discussion_r923411342

## lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataOutput.java:
@@ -309,6 +309,24 @@ public byte[] toArrayCopy() {
     return arr;
   }

+  @Override
+  public void copyBytes(DataInput input, long numBytes) throws IOException {
+    assert numBytes >= 0 : "numBytes=" + numBytes;
+    int length = (int) numBytes;
+    while (length > 0) {
+      if (!currentBlock.hasRemaining()) {
+        appendBlock();
+      }
+
+      int chunk = Math.min(currentBlock.remaining(), length);
+      final int pos = currentBlock.position();
+      byte[] blockArray = currentBlock.array();

Review Comment:
This only works if the ByteBuffer is not direct; maybe we can break the loop when a direct buffer is encountered and fall back to `super.copyBytes` for the remaining bytes?
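In sketch form, the suggested guard would sit at the top of the loop, along these lines (an illustration of the suggestion, not the final patch; `currentBlock` and `length` are the fields from the diff above):

```java
// Sketch: direct ByteBuffers have no accessible backing array, so stop the
// array-based fast path and let the generic copyBuffer implementation finish.
if (currentBlock.isDirect() || currentBlock.hasArray() == false) {
  super.copyBytes(input, length); // copy the remaining `length` bytes
  return;
}
```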
[GitHub] [lucene] jpountz commented on a diff in pull request #1022: LUCENE-10653: Heapify in BMMScorer
jpountz commented on code in PR #1022:
URL: https://github.com/apache/lucene/pull/1022#discussion_r923424214

## lucene/core/src/test/org/apache/lucene/search/TestDisiPriorityQueue.java:
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.PrimitiveIterator.OfInt;
+import java.util.Random;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.tests.util.LuceneTestCase;
+
+public class TestDisiPriorityQueue extends LuceneTestCase {
+
+  public void testRandom() throws Exception {
+    Random r = random();
+
+    int size = r.nextInt(TEST_NIGHTLY ? 1000 : 10);
+    DisiWrapper[] all = new DisiWrapper[size];
+    for (int i = 0; i < size; i++) {
+      DocIdSetIterator it = randomDisi(r);
+      DisiWrapper w = wrapper(it);
+      all[i] = w;
+    }
+
+    DisiPriorityQueue pq = new DisiPriorityQueue(size);
+    if (r.nextBoolean()) {
+      for (DisiWrapper w : all) {
+        pq.add(w);
+      }
+    } else {
+      if (r.nextInt(10) < 2) {
+        int len = random().nextInt(1, size);
+        for (int i = 0; i < len; i++) {
+          pq.add(all[i]);
+        }
+        pq.addAll(all, len, size - len);
+      } else {
+        pq.addAll(all, 0, size);
+      }
+    }
+
+    while (pq.size() > 0) {
+      Arrays.sort(all, Comparator.comparingInt(w -> w.doc));
+      DisiWrapper top = pq.top();
+      assertEquals(all[0].doc, top.doc);
+      if (top.iterator.nextDoc() == DocIdSetIterator.NO_MORE_DOCS) {

Review Comment:
you need to set `top.doc = top.iterator.nextDoc()` otherwise the PQ never gets reordered
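In other words, the tail of the loop would look roughly like this (a sketch against the test above; `pop()`/`updateTop()` are the usual `DisiPriorityQueue` reordering calls):

```java
// Record the iterator's new doc on the wrapper first; the queue orders
// wrappers by `doc`, so a stale value would never trigger a reorder.
top.doc = top.iterator.nextDoc();
if (top.doc == DocIdSetIterator.NO_MORE_DOCS) {
  pq.pop();
} else {
  pq.updateTop();
}
```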
[GitHub] [lucene] jpountz merged pull request #1019: Synchronize FieldInfos#verifyFieldInfos.
jpountz merged PR #1019:
URL: https://github.com/apache/lucene/pull/1019
[GitHub] [lucene] iverase commented on pull request #1017: LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape
iverase commented on PR #1017:
URL: https://github.com/apache/lucene/pull/1017#issuecomment-1187570816

Hey @nknize! I went through the PR and my feeling is that it is still half-cooked, and I worry about adding a new encoding into core that has not been properly worked through. If we need to change something later, we will have a hard time doing it once it is released. My suggestion is to either add it to the sandbox, where there are no back-compat requirements, or wait for another release so you have more time to work on it.

More focused feedback:

- Early versions of XYShape were using encoded rectangles to perform bounding box queries. That proved to be wrong (https://github.com/apache/lucene-solr/pull/865) because the encoding is not linear, so spatial relationships between geometries are not preserved in the encoded space. This PR seems to go back to that approach for XYShape doc values, which is incorrect.
- Lucene does not support any geo faceting, so for Lucene users the real value of the doc values is the ability to perform IndexOrDocValues queries to speed up their query workloads. Still, the current PR seems to focus on supporting faceting, e.g. by adding centroid values. IMHO, if we want to add those facets we should implement geo faceting in Lucene alongside, or we are not providing value to Lucene users and testing would be very hard.
- While the Codec API makes it easy to make changes across minors, we don't have the same flexibility for things that are done on top of the codec API, like this binary encoding. Maybe we should take some more time to review the encoding instead of rushing it into 9.3. For instance, we used the index creation major version to change the way facets store the taxonomy array, but this only allows changes on new major versions.
[GitHub] [lucene-jira-archive] mocobeta opened a new pull request, #55: Refine label texts
mocobeta opened a new pull request, #55:
URL: https://github.com/apache/lucene-jira-archive/pull/55

This refines label texts as suggested in #53.
1. Rename `component:module/xyz` to `module:xyz`
2. Rename `fixVersion:x.x.x` to `fix-version:x.x.x`
3. Rename `affectsVersion:x.x.x` to `affects-version:x.x.x`
4. Remove `type:new_feature`, then map the `New Feature` issue type to `type:enhancement` - I would reduce the supported issue types instead of having `type:new-feature`.
[GitHub] [lucene] hcqs33 commented on pull request #1028: Fix error in TieredMergePolicy
hcqs33 commented on PR #1028:
URL: https://github.com/apache/lucene/pull/1028#issuecomment-1187639318

> This looks good to me, can you add a CHANGES entry under version `9.3` so that users know about this bug fix?

Sure. I have added a CHANGES entry for this PR.
[jira] [Commented] (LUCENE-10651) SimpleQueryParser stack overflow for large nested queries.
[ https://issues.apache.org/jira/browse/LUCENE-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17568068#comment-17568068 ]

Michael McCandless commented on LUCENE-10651:
---------------------------------------------

It's a bit depressing that {{SimpleQueryParser}} consumes Java's stack in proportion to the number of clauses. Though, it is "simple", and just recursing to parse another clause is indeed simple. Too bad we seem to have lost tail recursion from the Lisp days ...

This is really a bug/limitation in {{SimpleQueryParser}}. E.g. a user may increase that clause limit and then try to parse a {{BooleanQuery}} that should work (doesn't have too many clauses) yet hits {{StackOverflowError}}. I wonder if our other query parsers have this problem too?

But I like your fix – it prevents a {{StackOverflowError}} when the returned Query would have failed with {{TooManyClauses}} anyway. It's at least progress, not perfection?

> SimpleQueryParser stack overflow for large nested queries.
> ----------------------------------------------------------
>
>                 Key: LUCENE-10651
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10651
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 9.1, 8.10, 9.2, 9.3
>            Reporter: Marc Handalian
>            Priority: Major
>
> The OpenSearch project received an issue [1] where stack overflow can occur for large nested boolean queries during rewrite. In trying to reproduce this error, I've also encountered SO during parsing when queries expand beyond the default 1024 clause limit. This unit test will fail with SO:
> {code:java}
> public void testSimpleQueryParserWithTooManyClauses() {
>   StringBuilder queryString = new StringBuilder("foo");
>   for (int i = 0; i < 1024; i++) {
>     queryString.append(" | bar").append(i).append(" + baz");
>   }
>   expectThrows(IndexSearcher.TooManyClauses.class, () -> parse(queryString.toString()));
> }
> {code}
> I would expect this case to also fail with TooManyClauses; is my understanding correct? If so, I've attempted a fix [2] that increments a counter during parsing whenever a clause is added.
> [1] https://github.com/opensearch-project/OpenSearch/issues/3760
> [2] https://github.com/mch2/lucene/commit/6a558f17f448b92ae4cf8c43e0b759ff7425acdf
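For reference, the linked fix [2] boils down to counting clauses while parsing instead of waiting for rewrite — roughly like this sketch (field and method names are illustrative, not the actual SimpleQueryParser internals):

```java
// Sketch: bump a counter each time the parser adds a clause, and fail fast
// with the same exception that BooleanQuery rewriting would eventually throw.
private int clauseCount = 0;

private void checkClauseLimit() {
  if (++clauseCount > IndexSearcher.getMaxClauseCount()) {
    throw new IndexSearcher.TooManyClauses();
  }
}
```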
[GitHub] [lucene] nknize commented on pull request #1017: LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape
nknize commented on PR #1017:
URL: https://github.com/apache/lucene/pull/1017#issuecomment-1187657941

> Early versions of XYShape were using encoded rectangles to perform bounding box queries. That proved to be wrong

That has to do with the query, not the field. It does not affect the doc value format and thus shouldn't block the PR. Also, the doc value queries use the same relate logic as the index queries, so they are consistent.

> Lucene does not support any geo faceting

Correct. Solr implements faceting, and since Solr is built on Lucene we need a doc value format to support Solr faceting over the LatLonShape format.

> still the current PR seems to be focusing on supporting faceting,

The PR has nothing to do with faceting, since Lucene doesn't implement faceting. The PR is 100% focused on the doc value format and BoundingBox relations only.

> Maybe we should take some more time to review the encoding instead of rushing it into 9.3. My suggestion is to either add it to the sandbox where there are no back-compat requirements

The encoding is 100% a new field and marked as experimental. It is in core because, since the sandbox module is now in an explicit `.sandbox` package, refactoring to sandbox would require changing core class modifiers and methods from package-private to public so sandbox classes can extend them. This breaks the API. The key takeaway here is that the coordinate encoding is the same as the index format, which won't change unless we decide to change `XYEncodingUtils` or `GeoEncodingUtils`, at which time we will have to support BWC for much more than the doc value format. Furthermore, the field is marked as experimental (the proper mechanism to communicate that formats might change); there are currently far more critical classes marked as experimental (e.g., [Lucene80DocValuesFormat](https://github.com/apache/lucene/blob/216e38a1596cbc6a036d92e68bc26ee483cd4d2a/lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene80/Lucene80DocValuesFormat.java#L138)) that make this less risky for 9.3.
[GitHub] [lucene] luyuncheng commented on a diff in pull request #1034: LUCENE-10657: Reduce memory copy in DataOutput copyBytes
luyuncheng commented on code in PR #1034:
URL: https://github.com/apache/lucene/pull/1034#discussion_r923536318

## lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataOutput.java:
@@ -309,6 +309,24 @@ public byte[] toArrayCopy() {
     return arr;
   }

+  @Override
+  public void copyBytes(DataInput input, long numBytes) throws IOException {
+    assert numBytes >= 0 : "numBytes=" + numBytes;
+    int length = (int) numBytes;
+    while (length > 0) {
+      if (!currentBlock.hasRemaining()) {
+        appendBlock();
+      }
+
+      int chunk = Math.min(currentBlock.remaining(), length);
+      final int pos = currentBlock.position();
+      byte[] blockArray = currentBlock.array();

Review Comment:
++ Thanks for your advice! I added a check: if currentBlock is direct, we break the loop and fall back to `super.copyBytes` for the remaining bytes.
[GitHub] [lucene] luyuncheng commented on pull request #1034: LUCENE-10657: Reduce memory copy in DataOutput copyBytes
luyuncheng commented on PR #1034:
URL: https://github.com/apache/lucene/pull/1034#issuecomment-1187675144

> The logic makes sense to me, can you add unit tests?

Thanks for your review! In commit [926dd](https://github.com/luyuncheng/lucene/commit/926dd0c159c314430a6ace3ec55e50f7ee155b56) I added the check for direct memory, and I added unit tests for the different allocators in copyBytes.
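Such a test can be as simple as a round-trip check (an illustrative sketch in a LuceneTestCase subclass, not the exact tests added in the commit):

```java
// Illustrative round-trip test; the real tests additionally vary the
// ByteBuffersDataOutput block allocator (heap vs. direct) to exercise
// both the array fast path and the super.copyBytes fallback.
public void testCopyBytesRoundTrip() throws IOException {
  byte[] expected = new byte[1 << 20];
  random().nextBytes(expected);
  ByteBuffersDataOutput out = new ByteBuffersDataOutput();
  out.copyBytes(new ByteArrayDataInput(expected), expected.length);
  assertArrayEquals(expected, out.toArrayCopy());
}
```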
[jira] [Commented] (LUCENE-10633) Dynamic pruning for queries sorted by SORTED(_SET) field
[ https://issues.apache.org/jira/browse/LUCENE-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17568109#comment-17568109 ]

Adrien Grand commented on LUCENE-10633:
---------------------------------------

I opened https://github.com/mikemccand/luceneutil/pull/185.

> Dynamic pruning for queries sorted by SORTED(_SET) field
> --------------------------------------------------------
>
>                 Key: LUCENE-10633
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10633
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> LUCENE-9280 introduced the ability to dynamically prune non-competitive hits when sorting by a numeric field, by leveraging the points index to skip documents that do not compare better than the top of the priority queue maintained by the field comparator.
> However, queries sorted by a SORTED(_SET) field still look at all hits, which is disappointing. Could we leverage the terms index to skip hits?
[jira] [Commented] (LUCENE-10651) SimpleQueryParser stack overflow for large nested queries.
[ https://issues.apache.org/jira/browse/LUCENE-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17568114#comment-17568114 ]

Adrien Grand commented on LUCENE-10651:
---------------------------------------

bq. But I like your fix – it prevents a StackOverflowError when the returned Query would have failed with TooManyClauses anyway.

I like this approach too.

> SimpleQueryParser stack overflow for large nested queries.
> ----------------------------------------------------------
>
>                 Key: LUCENE-10651
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10651
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 9.1, 8.10, 9.2, 9.3
>            Reporter: Marc Handalian
>            Priority: Major
>
> The OpenSearch project received an issue [1] where stack overflow can occur for large nested boolean queries during rewrite. In trying to reproduce this error, I've also encountered SO during parsing when queries expand beyond the default 1024 clause limit. This unit test will fail with SO:
> {code:java}
> public void testSimpleQueryParserWithTooManyClauses() {
>   StringBuilder queryString = new StringBuilder("foo");
>   for (int i = 0; i < 1024; i++) {
>     queryString.append(" | bar").append(i).append(" + baz");
>   }
>   expectThrows(IndexSearcher.TooManyClauses.class, () -> parse(queryString.toString()));
> }
> {code}
> I would expect this case to also fail with TooManyClauses; is my understanding correct? If so, I've attempted a fix [2] that increments a counter during parsing whenever a clause is added.
> [1] https://github.com/opensearch-project/OpenSearch/issues/3760
> [2] https://github.com/mch2/lucene/commit/6a558f17f448b92ae4cf8c43e0b759ff7425acdf
[jira] [Updated] (LUCENE-10657) CopyBytes now saves one memory copy on ByteBuffersDataOutput
[ https://issues.apache.org/jira/browse/LUCENE-10657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LuYunCheng updated LUCENE-10657: Summary: CopyBytes now saves one memory copy on ByteBuffersDataOutput (was: Reduce memory copy in DataOutput copyBytes) > CopyBytes now saves one memory copy on ByteBuffersDataOutput > > > Key: LUCENE-10657 > URL: https://issues.apache.org/jira/browse/LUCENE-10657 > Project: Lucene - Core > Issue Type: Improvement > Components: core/store >Reporter: LuYunCheng >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > > This is derived from > [LUCENE-10627|[https://github.com/apache/lucene/pull/987]] > Code: [https://github.com/apache/lucene/pull/1034] > The `copyBytes` method in DataOutput has to copy from the input into a copyBuffer and then write into ByteBuffersDataOutput.blocks. I think this intermediate copy is unnecessary: we can override the method to copy directly from input into output. > With this override: > # Reduce memory copy in `Lucene90CompressingStoredFieldsWriter#copyOneDoc` > -> `bufferedDocs.copyBytes(DataInput input)` > # Reduce memory copy in `Lucene90CompoundFormat.writeCompoundFile` -> > `data.copyBytes` when input is `BufferedChecksumIndexInput` and output is > `ByteBuffersDataOutput` > # Reduce memory copy in `IndexWriter#copySegmentAsIs` -> `copyFrom` -> `copyBytes` > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
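For contrast, the buffered base-class behavior that the override avoids looks roughly like this (a paraphrase of the default DataOutput#copyBytes, not the literal source):

```java
// Paraphrased sketch: every byte is touched twice, once going into the
// intermediate buffer and once going into the output.
byte[] copyBuffer = new byte[16384]; // buffer size is illustrative
while (numBytes > 0) {
  int toCopy = (int) Math.min(numBytes, copyBuffer.length);
  input.readBytes(copyBuffer, 0, toCopy); // copy #1: input -> copyBuffer
  writeBytes(copyBuffer, 0, toCopy);      // copy #2: copyBuffer -> output blocks
  numBytes -= toCopy;
}
```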
[jira] [Commented] (LUCENE-10651) SimpleQueryParser stack overflow for large nested queries.
[ https://issues.apache.org/jira/browse/LUCENE-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17568116#comment-17568116 ] Adrien Grand commented on LUCENE-10651: --- I'm not familiar with the simple query parser, but we seem to create sub state objects in {{consumeSubQuery}}, do we need to add the number of nested clauses of these sub states to the top-level state to properly count the overall number of clauses? > SimpleQueryParser stack overflow for large nested queries. > -- > > Key: LUCENE-10651 > URL: https://issues.apache.org/jira/browse/LUCENE-10651 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 9.1, 8.10, 9.2, 9.3 >Reporter: Marc Handalian >Priority: Major > > The OpenSearch project received an issue [1] where stack overflow can occur > for large nested boolean queries during rewrite. In trying to reproduce this > error I've also encountered SO during parsing where queries expand beyond the > default 1024 clause limit. This unit test will fail with SO: > {code:java} > public void testSimpleQueryParserWithTooManyClauses() { > StringBuilder queryString = new StringBuilder("foo"); > for (int i = 0; i < 1024; i++) { > queryString.append(" | bar").append(i).append(" + baz"); > } > expectThrows(IndexSearcher.TooManyClauses.class, () -> > parse(queryString.toString())); > } > {code} > I would expect this case to also fail with TooManyClauses, is my > understanding correct? If so, I've attempted a fix [2] that during parsing > increments a counter whenever a clause is added. > [1] [https://github.com/opensearch-project/OpenSearch/issues/3760] > [2] > [https://github.com/mch2/lucene/commit/6a558f17f448b92ae4cf8c43e0b759ff7425acdf] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] luyuncheng commented on pull request #1034: LUCENE-10657: CopyBytes now saves one memory copy on ByteBuffersDataOutput
luyuncheng commented on PR #1034: URL: https://github.com/apache/lucene/pull/1034#issuecomment-1187753914 > This looks good to me, can you add a CHANGES entry that copyBytes now saves one memory copy on ByteBuffersDataOutput? Thanks for your review and suggestions. I have updated the CHANGES entry -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] Yuti-G opened a new pull request, #1035: LUCENE-10652: Add a top-n range faceting example to RangeFacetsExample
Yuti-G opened a new pull request, #1035: URL: https://github.com/apache/lucene/pull/1035 ### Description In [LUCENE-10614](https://issues.apache.org/jira/browse/LUCENE-10614), we modified the behavior of getTopChildren to actually return top-n ranges ordered by count. The original behavior of getTopChildren in RangeFacetsCounts was to return all ranges ordered by constructor-specified range order, and this behavior is now retained in the getAllChildren API ([LUCENE-10550](https://issues.apache.org/jira/browse/LUCENE-10550)). Therefore, it would be helpful to add an example in RangeFacetsExample to demo this change. I replaced the original example of getTopChildren with getAllChildren, and added a new example of getTopChildren. https://issues.apache.org/jira/browse/LUCENE-10652 ### Tests Added new unit test -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
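The contrast between the two APIs can be sketched as follows; `fc` (a populated FacetsCollector) and `ranges` are assumed setup, and the authoritative code is the example in the PR itself:

```java
import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.range.LongRangeFacetCounts;

Facets facets = new LongRangeFacetCounts("timestamp", fc, ranges);
// all ranges, in the order given to the constructor (the old getTopChildren
// behavior, now available via LUCENE-10550's getAllChildren)
FacetResult all = facets.getAllChildren("timestamp");
// the n ranges with the highest counts, ordered by count (LUCENE-10614)
FacetResult top = facets.getTopChildren(2, "timestamp");
```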
[GitHub] [lucene] uschindler commented on a diff in pull request #1034: LUCENE-10657: CopyBytes now saves one memory copy on ByteBuffersDataOutput
uschindler commented on code in PR #1034: URL: https://github.com/apache/lucene/pull/1034#discussion_r923639294 ## lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataOutput.java: ## @@ -309,6 +309,24 @@ public byte[] toArrayCopy() { return arr; } + @Override + public void copyBytes(DataInput input, long numBytes) throws IOException { +assert numBytes >= 0 : "numBytes=" + numBytes; +int length = (int) numBytes; +while (length > 0) { + if (!currentBlock.hasRemaining()) { +appendBlock(); + } + + int chunk = Math.min(currentBlock.remaining(), length); + final int pos = currentBlock.position(); + byte[] blockArray = currentBlock.array(); Review Comment: There's another problem: if the block has an array, it may not always start at 0. If the ByteBuffer is a slice of another one, the "live" data may start at a later offset. position != arrayOffset; those are two different coordinates. This may be important if the input is a memory-mapped index input or similar. To get this right, use `currentBlock.array()` and `currentBlock.arrayOffset()` to get the correct slice; the position must be added afterwards. So the following needs to be corrected to: ```java input.readBytes(currentBlock.array(), Math.addExact(currentBlock.arrayOffset(), currentBlock.position()), chunk); ``` (the previous lines are obsolete) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
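The position/arrayOffset distinction is easy to see with plain JDK buffers; a small standalone demonstration (nothing Lucene-specific):

```java
import java.nio.ByteBuffer;

public class SliceOffsetDemo {
  public static void main(String[] args) {
    ByteBuffer parent = ByteBuffer.allocate(16);
    parent.position(4);
    ByteBuffer slice = parent.slice();
    // position() is relative to the slice; arrayOffset() locates the slice
    // inside the shared backing array:
    System.out.println(slice.position());    // 0
    System.out.println(slice.arrayOffset()); // 4
  }
}
```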
[GitHub] [lucene] uschindler commented on a diff in pull request #1034: LUCENE-10657: CopyBytes now saves one memory copy on ByteBuffersDataOutput
uschindler commented on code in PR #1034: URL: https://github.com/apache/lucene/pull/1034#discussion_r923642311 ## lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataOutput.java: ## @@ -309,6 +309,30 @@ public byte[] toArrayCopy() { return arr; } + @Override + public void copyBytes(DataInput input, long numBytes) throws IOException { +assert numBytes >= 0 : "numBytes=" + numBytes; +int length = (int) numBytes; +while (length > 0) { + if (!currentBlock.hasRemaining()) { +appendBlock(); + } + if (currentBlock.isDirect()) { Review Comment: this should be `false == currentBlock.hasArray()`. Direct and array-backed are roughly opposites, but that's just an implementation detail. So whenever a buffer does *not* have an array, we need to fall back to the super method; it does not matter whether it is direct or not. There could be a third kind of buffer :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] luyuncheng commented on a diff in pull request #1034: LUCENE-10657: CopyBytes now saves one memory copy on ByteBuffersDataOutput
luyuncheng commented on code in PR #1034: URL: https://github.com/apache/lucene/pull/1034#discussion_r923662970 ## lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataOutput.java: ## @@ -309,6 +309,24 @@ public byte[] toArrayCopy() { return arr; } + @Override + public void copyBytes(DataInput input, long numBytes) throws IOException { +assert numBytes >= 0 : "numBytes=" + numBytes; +int length = (int) numBytes; +while (length > 0) { + if (!currentBlock.hasRemaining()) { +appendBlock(); + } + + int chunk = Math.min(currentBlock.remaining(), length); + final int pos = currentBlock.position(); + byte[] blockArray = currentBlock.array(); Review Comment: > There's another problem: if the block has an array, it may not always start at 0. If the ByteBuffer is a slice of another one, the "live" data may start at a later offset. position != arrayOffset; those are two different coordinates. This may be important if the input is a memory-mapped index input or similar. > > To get this right, use `currentBlock.array()` and `currentBlock.arrayOffset()` to get the correct slice; the position must be added afterwards. So the following needs to be corrected to: > > ```java > input.readBytes(currentBlock.array(), Math.addExact(currentBlock.arrayOffset(), currentBlock.position()), chunk); > ``` > Thanks for this advice!! I fixed it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] luyuncheng commented on a diff in pull request #1034: LUCENE-10657: CopyBytes now saves one memory copy on ByteBuffersDataOutput
luyuncheng commented on code in PR #1034: URL: https://github.com/apache/lucene/pull/1034#discussion_r923663218 ## lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataOutput.java: ## @@ -309,6 +309,30 @@ public byte[] toArrayCopy() { return arr; } + @Override + public void copyBytes(DataInput input, long numBytes) throws IOException { +assert numBytes >= 0 : "numBytes=" + numBytes; +int length = (int) numBytes; +while (length > 0) { + if (!currentBlock.hasRemaining()) { +appendBlock(); + } + if (currentBlock.isDirect()) { Review Comment: ++, thanks for it. I fixed it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
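Putting both review fixes together (the hasArray fallback and the arrayOffset-aware read), the override might end up looking roughly like this; a sketch only, the merged patch is authoritative:

```java
@Override
public void copyBytes(DataInput input, long numBytes) throws IOException {
  assert numBytes >= 0 : "numBytes=" + numBytes;
  long length = numBytes;
  while (length > 0) {
    if (!currentBlock.hasRemaining()) {
      appendBlock();
    }
    if (!currentBlock.hasArray()) {
      // no backing array to read into directly: fall back to the buffered copy
      super.copyBytes(input, length);
      return;
    }
    int chunk = (int) Math.min(currentBlock.remaining(), length);
    // arrayOffset() accounts for sliced buffers whose live data does not
    // start at index 0 of the backing array
    input.readBytes(
        currentBlock.array(),
        Math.addExact(currentBlock.arrayOffset(), currentBlock.position()),
        chunk);
    currentBlock.position(currentBlock.position() + chunk);
    length -= chunk;
  }
}
```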
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits
mayya-sharipova commented on code in PR #947: URL: https://github.com/apache/lucene/pull/947#discussion_r923672692 ## lucene/core/src/java/org/apache/lucene/codecs/lucene93/Lucene93HnswVectorsReader.java: ## @@ -169,7 +173,12 @@ private void validateFieldEntry(FieldInfo info, FieldEntry fieldEntry) { + fieldEntry.dimension); } -long numBytes = (long) fieldEntry.size() * dimension * Float.BYTES; +long numBytes; +switch (fieldEntry.vectorEncoding) { + case BYTE -> numBytes = (long) fieldEntry.size() * dimension; + case FLOAT32 -> numBytes = (long) fieldEntry.size() * dimension * Float.BYTES; + default -> throw new AssertionError("unknown vector encoding " + fieldEntry.vectorEncoding); Review Comment: Should we also update the error message that follows for the `BYTE` case? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits
mayya-sharipova commented on code in PR #947: URL: https://github.com/apache/lucene/pull/947#discussion_r923800118 ## lucene/core/src/java/org/apache/lucene/util/VectorUtil.java: ## @@ -213,4 +213,38 @@ public static void add(float[] u, float[] v) { u[i] += v[i]; } } + + /** + * Dot product score computed over signed bytes, scaled to be in [0, 1]. + * + * @param a bytes containing a vector + * @param aOffset offset of the vector in a + * @param b bytes containing another vector, of the same dimension + * @param len the length (aka dimension) of the vectors + * @param bOffset offset of the vector in b + * @return the value of the similarity function applied to the two vectors + */ + public static float dotProductScore(BytesRef a, int aOffset, BytesRef b, int bOffset, int len) { +int total = 0; +for (int i = 0; i < len; i++) { + total += a.bytes[aOffset++] * b.bytes[bOffset++]; +} +// divide by 2 * 2^14 (maximum absolute value of product of 2 signed bytes) * len +return (1 + total) / (float) (len * (1 << 15)); Review Comment: To make scores non-negative should we do instead: `total / (float) (len * (1 << 15)) + 1`? I am also wondering why in the comments we say `// divide by 2 * 2^14`, but in the calculation we use `1 << 15`? Should it be `1 << 14` instead? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
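For reference, the two constants in the second question are the same quantity; a quick check of the arithmetic in plain Java (not code from the patch):

```java
// The maximum absolute value of a product of two signed bytes is
// (-128) * (-128) = 16384 = 2^14, so "2 * 2^14" and "1 << 15" agree.
int maxAbsProduct = 128 * 128;
assert maxAbsProduct == 1 << 14;
assert 2 * maxAbsProduct == 1 << 15;
// With len terms, total lies in [-len * 2^14, len * 2^14], so
// (1 + total) / (len * (1 << 15)) spans roughly [-0.5, 0.5]; it can indeed
// be negative, which is what the first question is getting at.
```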
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits
mayya-sharipova commented on code in PR #947: URL: https://github.com/apache/lucene/pull/947#discussion_r923807979 ## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java: ## @@ -89,4 +123,67 @@ public abstract TopDocs search( public KnnVectorsReader getMergeInstance() { return this; } + + /** {@link #searchExhaustively} */ + protected static TopDocs exhaustiveSearch( + VectorValues vectorValues, + DocIdSetIterator acceptDocs, + VectorSimilarityFunction similarityFunction, + float[] target, + int k) + throws IOException { +HitQueue queue = new HitQueue(k, true); +ScoreDoc topDoc = queue.top(); +int doc; +while ((doc = acceptDocs.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) { + int vectorDoc = vectorValues.advance(doc); + assert vectorDoc == doc; + float score = similarityFunction.compare(vectorValues.vectorValue(), target); + if (score >= topDoc.score) { +topDoc.score = score; +topDoc.doc = doc; +topDoc = queue.updateTop(); + } +} +return topDocsFromHitQueue(queue, acceptDocs.cost()); + } + + /** {@link #searchExhaustively} */ + protected static TopDocs exhaustiveSearch( + VectorValues vectorValues, + DocIdSetIterator acceptDocs, + VectorSimilarityFunction similarityFunction, Review Comment: looks like `similarityFunction` is not necessary here, as we always use `dotProductScore` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta opened a new pull request, #1036: LUCENE-10557: Refine issue label texts
mocobeta opened a new pull request, #1036: URL: https://github.com/apache/lucene/pull/1036 Follow-up of #1024. This refines issue label texts as suggested in https://github.com/apache/lucene-jira-archive/issues/53 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta merged pull request #1036: LUCENE-10557: Refine issue label texts
mocobeta merged PR #1036: URL: https://github.com/apache/lucene/pull/1036 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira
[ https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17568321#comment-17568321 ] ASF subversion and git services commented on LUCENE-10557: -- Commit 781edf442b200145e4fc3fc59de554b7a0c0b57b in lucene's branch refs/heads/main from Tomoko Uchida [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=781edf442b2 ] LUCENE-10557: Refine issue label texts (#1036) > Migrate to GitHub issue from Jira > - > > Key: LUCENE-10557 > URL: https://issues.apache.org/jira/browse/LUCENE-10557 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Attachments: Screen Shot 2022-06-29 at 11.02.35 AM.png, > image-2022-06-29-13-36-57-365.png, screenshot-1.png > > Time Spent: 1h 40m > Remaining Estimate: 0h > > A few (not the majority) Apache projects already use the GitHub issue instead > of Jira. For example, > Airflow: [https://github.com/apache/airflow/issues] > BookKeeper: [https://github.com/apache/bookkeeper/issues] > So I think it'd be technically possible that we move to GitHub issue. I have > little knowledge of how to proceed with it, I'd like to discuss whether we > should migrate to it, and if so, how to smoothly handle the migration. > The major tasks would be: > * (/) Get a consensus about the migration among committers > * (/) Choose issues that should be moved to GitHub - We'll migrate all > issues towards an atomic switch to GitHub if no major technical obstacles > show up. > ** Discussion thread > [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12] > ** -Conclusion for now: We don't migrate any issues. Only new issues should > be opened on GitHub.- > ** Write a prototype migration script - the decision could be made on that. > Things to consider: > *** version numbers - labels or milestones? > *** add a comment/ prepend a link to the source Jira issue on github side, > *** add a comment/ prepend a link on the jira side to the new issue on > github side (for people who access jira from blogs, mailing list archives and > other sources that will have stale links), > *** convert cross-issue automatic links in comments/ descriptions (as > suggested by Robert), > *** strategy to deal with sub-issues (hierarchies), > *** maybe prefix (or postfix) the issue title on github side with the > original LUCENE-XYZ key so that it is easier to search for a particular issue > there? > *** how to deal with user IDs (author, reporter, commenters)? Do they have > to be github users? Will information about people not registered on github be > lost? > *** create an extra mapping file of old-issue-new-issue URLs for any > potential future uses. > *** what to do with issue numbers in git/svn commits? These could be > rewritten but it'd change the entire git history tree - I don't think this is > practical, while doable. 
> * Prepare a complete migration tool > ** See https://github.com/apache/lucene-jira-archive/issues/5 > * Build the convention for issue label/milestone management > ** See [https://github.com/apache/lucene-jira-archive/issues/6] > ** Do some experiments on a sandbox repository > [https://github.com/mocobeta/sandbox-lucene-10557] > ** Make documentation for metadata (label/milestone) management > * (/) Enable Github issue on the lucene's repository > ** Raise an issue on INFRA > ** (Create an issue-only private repository for sensitive issues if it's > needed and allowed) > ** Set a mail hook to > [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to > the general mail group name) > * Set a schedule for migration > ** See [https://github.com/apache/lucene-jira-archive/issues/7] > ** Give some time to committers to play around with issues/labels/milestones > before the actual migration > ** Make an announcement on the mail lists > ** Show some text messages when opening a new Jira issue (in issue template?) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zacharymorn commented on a diff in pull request #1018: LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions
zacharymorn commented on code in PR #1018: URL: https://github.com/apache/lucene/pull/1018#discussion_r924065392 ## lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java: ## @@ -191,6 +191,66 @@ public long cost() { // or null if it is not applicable // pkg-private for forcing use of BooleanScorer in tests BulkScorer optionalBulkScorer(LeafReaderContext context) throws IOException { +if (scoreMode == ScoreMode.TOP_SCORES) { + if (query.getMinimumNumberShouldMatch() > 1 || weightedClauses.size() > 2) { +return null; + } + + List optional = new ArrayList<>(); + for (WeightedBooleanClause wc : weightedClauses) { +Weight w = wc.weight; +BooleanClause c = wc.clause; +if (c.getOccur() != Occur.SHOULD) { + continue; +} +ScorerSupplier scorer = w.scorerSupplier(context); +if (scorer != null) { + optional.add(scorer); +} + } + + if (optional.size() <= 1) { +return null; + } + + List optionalScorers = new ArrayList<>(); + for (ScorerSupplier ss : optional) { +optionalScorers.add(ss.get(Long.MAX_VALUE)); + } + + return new BulkScorer() { +final Scorer bmmScorer = new BlockMaxMaxscoreScorer(BooleanWeight.this, optionalScorers); +final int maxDoc = context.reader().maxDoc(); +final DocIdSetIterator iterator = bmmScorer.iterator(); + +@Override +public int score(LeafCollector collector, Bits acceptDocs, int min, int max) +throws IOException { + collector.setScorer(bmmScorer); + + int doc = min; + while (true) { +doc = iterator.advance(doc); Review Comment: > If the last match of the first window is say 998 and the first match after the first window is 1005. Then we should make sure to score 1005 when scoring the second window before starting to advance. Thanks for the explanation, it makes sense! I thought the old implementation would take care of this boundary case as it initialized `doc` to `min` before advancing, but the behavior is indeed undefined if the previous call to `bulkScorer#score` already advanced its internal scorer past the next `min` used. I've updated the code with the above solution.
## lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java: ## @@ -191,6 +191,66 @@ public long cost() { // or null if it is not applicable // pkg-private for forcing use of BooleanScorer in tests BulkScorer optionalBulkScorer(LeafReaderContext context) throws IOException { +if (scoreMode == ScoreMode.TOP_SCORES) { + if (query.getMinimumNumberShouldMatch() > 1 || weightedClauses.size() > 2) { +return null; + } + + List optional = new ArrayList<>(); + for (WeightedBooleanClause wc : weightedClauses) { +Weight w = wc.weight; +BooleanClause c = wc.clause; +if (c.getOccur() != Occur.SHOULD) { + continue; +} +ScorerSupplier scorer = w.scorerSupplier(context); +if (scorer != null) { + optional.add(scorer); +} + } + + if (optional.size() <= 1) { +return null; + } + + List optionalScorers = new ArrayList<>(); + for (ScorerSupplier ss : optional) { +optionalScorers.add(ss.get(Long.MAX_VALUE)); + } + + return new BulkScorer() { +final Scorer bmmScorer = new BlockMaxMaxscoreScorer(BooleanWeight.this, optionalScorers); +final int maxDoc = context.reader().maxDoc(); +final DocIdSetIterator iterator = bmmScorer.iterator(); + +@Override +public int score(LeafCollector collector, Bits acceptDocs, int min, int max) +throws IOException { + collector.setScorer(bmmScorer); + + int doc = min; + while (true) { +doc = iterator.advance(doc); + +if (doc >= max) { + return doc; +} + +if (acceptDocs == null || acceptDocs.get(doc)) { + collector.collect(doc); +} + +doc++; + } +} + +@Override +public long cost() { + return maxDoc; Review Comment: Updated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zacharymorn commented on pull request #1018: LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions
zacharymorn commented on PR #1018: URL: https://github.com/apache/lucene/pull/1018#issuecomment-1188612201 Here are the latest benchmark results: `wikinightly`
```
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
TermBGroup1M1P 55.21 (7.3%) 53.75 (5.5%) -2.6% ( -14% - 11%) 0.200
PKLookup 330.25 (4.3%) 326.53 (3.9%) -1.1% ( -8% - 7%) 0.387
TermDateFacets 71.88 (4.7%) 71.13 (4.8%) -1.0% ( -10% - 8%) 0.487
MedTermDayTaxoFacets 77.44 (4.5%) 76.67 (4.8%) -1.0% ( -9% - 8%) 0.502
IntNRQ 1216.26 (2.5%) 1207.48 (2.8%) -0.7% ( -5% - 4%) 0.387
AndHighHighDayTaxoFacets 13.63 (2.2%) 13.55 (2.6%) -0.5% ( -5% - 4%) 0.471
SpanNear 28.99 (3.4%) 28.86 (2.6%) -0.5% ( -6% - 5%) 0.621
Fuzzy1 142.65 (1.4%) 142.01 (1.5%) -0.4% ( -3% - 2%) 0.333
TermDTSort 142.65 (3.1%) 142.02 (2.9%) -0.4% ( -6% - 5%) 0.644
Respell 96.15 (1.5%) 95.76 (2.0%) -0.4% ( -3% - 3%) 0.483
TermBGroup1M 52.81 (3.0%) 52.60 (4.1%) -0.4% ( -7% - 6%) 0.731
TermDayOfYearSort 259.79 (3.1%) 258.78 (2.8%) -0.4% ( -6% - 5%) 0.681
TermMonthSort 275.05 (6.2%) 274.05 (7.0%) -0.4% ( -12% - 13%) 0.861
TermGroup10K 43.13 (2.8%) 42.98 (3.8%) -0.4% ( -6% - 6%) 0.731
Term 2665.46 (4.2%) 2656.78 (4.1%) -0.3% ( -8% - 8%) 0.803
TermGroup100 66.52 (3.4%) 66.35 (4.2%) -0.3% ( -7% - 7%) 0.834
TermGroup1M 30.69 (2.5%) 30.62 (3.5%) -0.3% ( -6% - 5%) 0.792
TermTitleSort 190.21 (6.2%) 189.75 (7.0%) -0.2% ( -12% - 13%) 0.908
Fuzzy2 83.28 (1.7%) 83.17 (1.9%) -0.1% ( -3% - 3%) 0.809
AndHighOrMedMed 73.32 (5.9%) 73.41 (5.7%) 0.1% ( -10% - 12%) 0.946
IntervalsOrdered 4.63 (4.0%) 4.64 (4.0%) 0.2% ( -7% - 8%) 0.877
AndHighMedDayTaxoFacets 48.94 (2.1%) 49.06 (2.6%) 0.2% ( -4% - 5%) 0.743
AndMedOrHighHigh 72.55 (5.1%) 72.74 (4.7%) 0.3% ( -9% - 10%) 0.863
BrowseDayOfYearSSDVFacets 26.43 (5.6%) 26.51 (4.2%) 0.3% ( -9% - 10%) 0.837
Phrase 53.31 (4.7%) 53.49 (4.9%) 0.3% ( -8% - 10%) 0.820
Prefix3 416.08 (9.2%) 417.57 (9.1%) 0.4% ( -16% - 20%) 0.901
BrowseMonthSSDVFacets 30.34 (11.6%) 30.58 (12.5%) 0.8% ( -20% - 28%) 0.836
SloppyPhrase 4.71 (3.7%) 4.75 (3.6%) 0.9% ( -6% - 8%) 0.427
BrowseRandomLabelSSDVFacets 20.64 (6.6%) 20.83 (5.7%) 0.9% ( -10% - 14%) 0.630
AndHighMed 111.69 (3.5%) 113.66 (3.3%) 1.8% ( -4% - 8%) 0.101
AndHighHigh 102.15 (3.4%) 104.05 (3.3%) 1.9% ( -4% - 8%) 0.079
Wildcard 162.58 (7.6%) 165.88 (5.6%) 2.0% ( -10% - 16%) 0.337
OrHighMedDayTaxoFacets 22.45 (5.7%) 22.96 (4.7%) 2.3% ( -7% - 13%) 0.174
BrowseDateSSDVFacets 4.24 (32.2%) 4.44 (30.1%) 4.7% ( -43% - 98%) 0.630
BrowseDayOfYearTaxoFacets 27.90 (31.8%) 29.47 (33.8%) 5.6% ( -45% - 104%) 0.586
BrowseDateTaxoFacets 27.82 (31.8%) 29.39 (33.9%) 5.7% ( -45% - 104%) 0.587
OrHighHigh 21.30 (5.9%) 22.64 (5.3%) 6.3% ( -4% - 18%) 0.000
BrowseRandomLabelTaxoFacets 28.61 (48.5%) 30.61 (50.4%) 7.0% ( -61% - 205%) 0.656
BrowseMonthTaxoFacets 27.81 (33.3%) 30.05 (38.0%) 8.1% ( -47% - 118%) 0.475
OrHighMed 122.60 (6.1%) 234.50 (7.2%) 91.3% ( 73% - 111%) 0.000
```
```
Task  QPS baseline  StdDev  QPS my_modified
[GitHub] [lucene] zacharymorn commented on pull request #1018: LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions
zacharymorn commented on PR #1018: URL: https://github.com/apache/lucene/pull/1018#issuecomment-1188668774 > It appears this implementation would bring a -4% performance impact to OrXNotY tasks in general. I limited the implementation to pure disjunctions here https://github.com/apache/lucene/pull/1018/commits/f4fdfea5d4f7df5df42aea474f71738eefbb68c6 as well and it does seem to help: `wikimedium.10M.nostopwords.tasks` results:
```
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
OrNotHighMed 1581.93 (2.4%) 1549.94 (3.3%) -2.0% ( -7% - 3%) 0.027
OrHighNotMed 1991.40 (3.6%) 1957.40 (4.7%) -1.7% ( -9% - 6%) 0.199
OrHighNotHigh 1785.23 (3.3%) 1756.53 (3.5%) -1.6% ( -8% - 5%) 0.134
OrNotHighHigh 2021.35 (3.8%) 1997.89 (4.3%) -1.2% ( -8% - 7%) 0.362
BrowseDateSSDVFacets 4.64 (31.5%) 4.59 (30.3%) -1.0% ( -47% - 88%) 0.922
IntNRQ 134.80 (0.3%) 133.52 (3.8%) -0.9% ( -5% - 3%) 0.274
OrNotHighLow 2953.48 (5.5%) 2925.66 (4.9%) -0.9% ( -10% - 9%) 0.566
MedTerm 3755.74 (4.3%) 3728.74 (4.9%) -0.7% ( -9% - 8%) 0.623
AndHighLow 2170.32 (4.1%) 2154.98 (3.5%) -0.7% ( -7% - 7%) 0.556
LowPhrase 213.41 (2.1%) 212.05 (1.8%) -0.6% ( -4% - 3%) 0.312
AndHighMed 405.12 (4.3%) 402.86 (4.8%) -0.6% ( -9% - 9%) 0.701
HighTerm 2262.86 (4.6%) 2252.96 (4.7%) -0.4% ( -9% - 9%) 0.765
OrHighNotLow 1674.08 (3.8%) 1666.78 (5.0%) -0.4% ( -8% - 8%) 0.755
MedSpanNear 185.05 (3.0%) 184.50 (3.1%) -0.3% ( -6% - 6%) 0.762
BrowseDayOfYearSSDVFacets 26.80 (12.2%) 26.73 (11.9%) -0.3% ( -21% - 27%) 0.940
MedPhrase 250.33 (2.1%) 249.81 (2.3%) -0.2% ( -4% - 4%) 0.767
HighSpanNear 58.57 (3.8%) 58.53 (4.1%) -0.1% ( -7% - 8%) 0.958
AndHighMedDayTaxoFacets 93.86 (1.9%) 93.82 (1.7%) -0.1% ( -3% - 3%) 0.927
HighIntervalsOrdered 14.41 (5.3%) 14.41 (5.8%) -0.0% ( -10% - 11%) 0.984
LowSpanNear 56.31 (2.7%) 56.38 (2.4%) 0.1% ( -4% - 5%) 0.880
HighPhrase 52.42 (3.2%) 52.54 (2.8%) 0.2% ( -5% - 6%) 0.815
AndHighHigh 107.35 (3.7%) 107.65 (3.7%) 0.3% ( -6% - 7%) 0.812
MedIntervalsOrdered 13.19 (2.3%) 13.24 (2.0%) 0.4% ( -3% - 4%) 0.584
LowTerm 3849.77 (3.4%) 3865.43 (3.4%) 0.4% ( -6% - 7%) 0.706
LowIntervalsOrdered 12.28 (4.9%) 12.33 (5.4%) 0.4% ( -9% - 11%) 0.797
AndHighHighDayTaxoFacets 22.30 (2.4%) 22.39 (2.3%) 0.4% ( -4% - 5%) 0.565
Prefix3 129.48 (6.0%) 130.23 (5.1%) 0.6% ( -9% - 12%) 0.745
HighTermMonthSort 198.37 (6.9%) 199.63 (6.8%) 0.6% ( -12% - 15%) 0.768
PKLookup 321.51 (3.9%) 323.87 (4.8%) 0.7% ( -7% - 9%) 0.595
MedSloppyPhrase 91.15 (2.2%) 91.95 (2.2%) 0.9% ( -3% - 5%) 0.214
Respell 86.99 (2.1%) 87.84 (2.5%) 1.0% ( -3% - 5%) 0.174
Fuzzy1 118.73 (2.0%) 120.09 (2.1%) 1.1% ( -2% - 5%) 0.084
Fuzzy2 46.39 (2.2%) 46.92 (2.8%) 1.1% ( -3% - 6%) 0.153
LowSloppyPhrase 23.79 (2.4%) 24.08 (2.6%) 1.2% ( -3% - 6%) 0.122
TermDTSort 262.23 (6.6%) 265.58 (7.0%) 1.3% ( -11% - 15%) 0.552
BrowseMonthSSDVFacets 29.37 (12.5%) 29.75 (13.7%) 1.3% ( -22% - 31%) 0.753
HighTermDayOfYearSort 226.43 (9.4%) 229.75 (9.4%) 1.5% ( -15% - 22%) 0.622
HighSloppyPhrase 7.13 (4.1%) 7.24 (3.7%) 1.6% ( -6% - 9%) 0.206
Wildcar