[GitHub] [lucene] jpountz commented on a diff in pull request #1018: LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions
jpountz commented on code in PR #1018:
URL: https://github.com/apache/lucene/pull/1018#discussion_r923041831

## lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java:
@@ -191,6 +191,66 @@ public long cost() {
   // or null if it is not applicable
   // pkg-private for forcing use of BooleanScorer in tests
   BulkScorer optionalBulkScorer(LeafReaderContext context) throws IOException {
+    if (scoreMode == ScoreMode.TOP_SCORES) {
+      if (query.getMinimumNumberShouldMatch() > 1 || weightedClauses.size() > 2) {
+        return null;
+      }
+
+      List<ScorerSupplier> optional = new ArrayList<>();
+      for (WeightedBooleanClause wc : weightedClauses) {
+        Weight w = wc.weight;
+        BooleanClause c = wc.clause;
+        if (c.getOccur() != Occur.SHOULD) {
+          continue;
+        }
+        ScorerSupplier scorer = w.scorerSupplier(context);
+        if (scorer != null) {
+          optional.add(scorer);
+        }
+      }
+
+      if (optional.size() <= 1) {
+        return null;
+      }
+
+      List<Scorer> optionalScorers = new ArrayList<>();
+      for (ScorerSupplier ss : optional) {
+        optionalScorers.add(ss.get(Long.MAX_VALUE));
+      }
+
+      return new BulkScorer() {
+        final Scorer bmmScorer = new BlockMaxMaxscoreScorer(BooleanWeight.this, optionalScorers);
+        final int maxDoc = context.reader().maxDoc();
+        final DocIdSetIterator iterator = bmmScorer.iterator();
+
+        @Override
+        public int score(LeafCollector collector, Bits acceptDocs, int min, int max)
+            throws IOException {
+          collector.setScorer(bmmScorer);
+
+          int doc = min;
+          while (true) {
+            doc = iterator.advance(doc);

Review Comment:
Yes, indeed, though we might be able to simplify it to look like below:

```java
new BulkScorer() {
  final Scorer bmmScorer = new BlockMaxMaxscoreScorer(BooleanWeight.this, optionalScorers);
  final DocIdSetIterator iterator = bmmScorer.iterator();

  @Override
  public int score(LeafCollector collector, Bits acceptDocs, int min, int max)
      throws IOException {
    collector.setScorer(bmmScorer);
    int doc = iterator.docID();
    if (doc < min) {
      doc = iterator.advance(min);
    }
    while (doc < max) {
      if (acceptDocs == null || acceptDocs.get(doc)) {
        collector.collect(doc);
      }
      doc = iterator.nextDoc();
    }
    return doc;
  }
}
```

The reason is that a consumer of the bulk scorer could do something like:

```java
bulkScorer.score(collector, null, 0, 1000);
bulkScorer.score(collector, null, 1000, 2000);
```

If the last match of the first window is, say, 998 and the first match after the first window is 1005, then we should make sure to score 1005 when scoring the second window, before starting to advance again.
[jira] [Resolved] (LUCENE-10497) Unify "Token" interface in Kuromoji and Nori
[ https://issues.apache.org/jira/browse/LUCENE-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tomoko Uchida resolved LUCENE-10497.
------------------------------------
    Fix Version/s: 10.0 (main)
         Assignee: Tomoko Uchida
       Resolution: Fixed

> Unify "Token" interface in Kuromoji and Nori
> --------------------------------------------
>
>                 Key: LUCENE-10497
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10497
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Tomoko Uchida
>            Assignee: Tomoko Uchida
>            Priority: Minor
>             Fix For: 10.0 (main)
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> "Token" classes are the output of the Viterbi algorithm in Kuromoji and Nori, and their implementations differ slightly (Nori's Token class looks more expressive or extendable).
> To proceed with LUCENE-10493, I think we need a unified "Token" class, or at least a common interface for it.
[jira] [Resolved] (LUCENE-10493) Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?
[ https://issues.apache.org/jira/browse/LUCENE-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tomoko Uchida resolved LUCENE-10493.
------------------------------------
    Fix Version/s: 10.0 (main)
         Assignee: Tomoko Uchida
       Resolution: Fixed

> Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-10493
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10493
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Tomoko Uchida
>            Assignee: Tomoko Uchida
>            Priority: Major
>             Fix For: 10.0 (main)
>
>          Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> We now have common dictionary interfaces for kuromoji and nori (LUCENE-10393). A natural question would be: is it possible to unify the Japanese/Korean tokenizers?
> The core methods of the two tokenizers are `parse()` and `backtrace()`, which calculate the minimum cost path by Viterbi search. I'd set the goal of this issue to factoring them out into a separate class (in analysis-common) that is shared between JapaneseTokenizer and KoreanTokenizer.
> The algorithm to solve the minimum cost path itself is of course language-agnostic, so I think it should be theoretically possible; the most difficult part here might be the N-best path calculation, which is supported only by JapaneseTokenizer and not by KoreanTokenizer.
[GitHub] [lucene] luyuncheng commented on a diff in pull request #987: LUCENE-10627: Using CompositeByteBuf to Reduce Memory Copy
luyuncheng commented on code in PR #987:
URL: https://github.com/apache/lucene/pull/987#discussion_r923100057

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/DeflateWithPresetDictCompressionMode.java:
@@ -163,12 +165,16 @@ private static class DeflateWithPresetDictCompressor extends Compressor {
     final Deflater compressor;
     final BugfixDeflater_JDK8252739 deflaterBugfix;
     byte[] compressed;
+    byte[] bufferDict;
+    byte[] bufferBlock;
     boolean closed;

     DeflateWithPresetDictCompressor(int level) {
       compressor = new Deflater(level, true);
       deflaterBugfix = BugfixDeflater_JDK8252739.createBugfix(compressor);
       compressed = new byte[64];
+      bufferDict = BytesRef.EMPTY_BYTES;
+      bufferBlock = BytesRef.EMPTY_BYTES;
     }

     private void doCompress(byte[] bytes, int off, int len, DataOutput out) throws IOException {

Review Comment:
++, I removed it at [commits(ebcbd)](https://github.com/apache/lucene/blob/ebcbdd05fbf114ca15d04cd4a9172eb51fab9a1a/lucene/core/src/java/org/apache/lucene/codecs/lucene90/DeflateWithPresetDictCompressionMode.java).

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/DeflateWithPresetDictCompressionMode.java:
@@ -199,23 +205,65 @@ private void doCompress(byte[] bytes, int off, int len, DataOutput out) throws I
       out.writeBytes(compressed, totalCount);
     }

+    private void doCompress(ByteBuffer bytes, int len, DataOutput out) throws IOException {
+      if (len == 0) {
+        out.writeVInt(0);
+        return;
+      }
+      compressor.setInput(bytes);
+      compressor.finish();
+      if (compressor.needsInput()) {
+        throw new IllegalStateException();
+      }
+
+      int totalCount = 0;
+      for (; ; ) {
+        final int count =
+            compressor.deflate(compressed, totalCount, compressed.length - totalCount);
+        totalCount += count;
+        assert totalCount <= compressed.length;
+        if (compressor.finished()) {
+          break;
+        } else {
+          compressed = ArrayUtil.grow(compressed);
+        }
+      }
+
+      out.writeVInt(totalCount);
+      out.writeBytes(compressed, totalCount);
+    }
+
     @Override
-    public void compress(byte[] bytes, int off, int len, DataOutput out) throws IOException {
+    public void compress(ByteBuffersDataInput buffersInput, DataOutput out) throws IOException {
+      final int len = (int) (buffersInput.size() - buffersInput.position());
+      final int end = (int) buffersInput.size();
       final int dictLength = len / (NUM_SUB_BLOCKS * DICT_SIZE_FACTOR);
       final int blockLength = (len - dictLength + NUM_SUB_BLOCKS - 1) / NUM_SUB_BLOCKS;
       out.writeVInt(dictLength);
       out.writeVInt(blockLength);
-      final int end = off + len;
       // Compress the dictionary first
       compressor.reset();
-      doCompress(bytes, off, dictLength, out);
+      bufferDict = ArrayUtil.growNoCopy(bufferDict, dictLength);
+      buffersInput.readBytes(bufferDict, 0, dictLength);
+      doCompress(bufferDict, 0, dictLength, out);
       // And then sub blocks
-      for (int start = off + dictLength; start < end; start += blockLength) {
+      for (int start = dictLength; start < end; start += blockLength) {
         compressor.reset();
-        deflaterBugfix.setDictionary(bytes, off, dictLength);
-        doCompress(bytes, start, Math.min(blockLength, off + len - start), out);
+        deflaterBugfix.setDictionary(bufferDict, 0, dictLength);
+        int l = Math.min(blockLength, len - start);
+        // if [start, start + len] stays in one ByteBuffer, we can skip the memory copy;
+        // otherwise we need to copy the bytes into one contiguous byte array
+        ByteBuffer bb = buffersInput.sliceOne(start, l);

Review Comment:
Nice suggestion! At [commits(ebcbd...)](https://github.com/apache/lucene/blob/ebcbdd05fbf114ca15d04cd4a9172eb51fab9a1a/lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataInput.java#L175) I moved this logic into `ByteBuffersDataInput#readBytes`, which determines whether to use the internal buffers or create a new buffer.

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsWriter.java:
@@ -519,7 +518,13 @@ private void copyOneDoc(Lucene90CompressingStoredFieldsReader reader, int docID)
     assert reader.getVersion() == VERSION_CURRENT;
     SerializedDocument doc = reader.document(docID);
     startDocument();
-    bufferedDocs.copyBytes(doc.in, doc.length);
+
+    if (doc.in instanceof ByteArrayDataInput) {
+      // reuse ByteArrayDataInput to reduce memory copy
+      bufferedDocs.copyBytes((ByteArrayDataInput) doc.in, doc.length);
+    } else {
+      bufferedDocs.copyBytes(doc.in, doc.length);
+    }

Review Comment:
LGTM. I deleted this code in this PR, and I will open a new issue to reduce memory copy in copyOneDoc.
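A minimal sketch of that slice-or-copy behavior, assuming a list of equally sized backing buffers (`blocks` and `blockSize` are illustrative names, not the exact `ByteBuffersDataInput` internals):

```java
import java.nio.ByteBuffer;

// Illustrative slice-or-copy helper: return a zero-copy slice when the
// requested range fits in a single backing buffer, otherwise copy the
// range into one fresh contiguous buffer. Field layout is assumed here,
// not taken from Lucene's actual implementation.
final class SliceOrCopy {
  static ByteBuffer readBytes(ByteBuffer[] blocks, int blockSize, long pos, int len) {
    int blockIndex = (int) (pos / blockSize);
    int blockOffset = (int) (pos % blockSize);
    ByteBuffer block = blocks[blockIndex].duplicate();
    if (blockOffset + len <= block.limit()) {
      // Single-buffer case: no memory copy, just a view over the block.
      block.position(blockOffset);
      block.limit(blockOffset + len);
      return block.slice();
    }
    // The range crosses a block boundary: copy into one contiguous array.
    byte[] copy = new byte[len];
    for (int copied = 0; copied < len; ) {
      ByteBuffer b = blocks[(int) ((pos + copied) / blockSize)].duplicate();
      b.position((int) ((pos + copied) % blockSize));
      int chunk = Math.min(b.remaining(), len - copied);
      b.get(copy, copied, chunk);
      copied += chunk;
    }
    return ByteBuffer.wrap(copy);
  }
}
```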
[GitHub] [lucene] luyuncheng commented on pull request #987: LUCENE-10627: Using CompositeByteBuf to Reduce Memory Copy
luyuncheng commented on PR #987:
URL: https://github.com/apache/lucene/pull/987#issuecomment-1186947659

> Thanks for the iteration, the high-level idea looks good to me, I left some suggestions.

@jpountz Thanks for your nice suggestions. In the latest commits [ebcbdd](https://github.com/apache/lucene/commit/457bab5abfbaa3aabebe99a97e645b4f623be15f):
1. Changed the byteBuffers declaration in Lucene90CompressingStoredFieldsWriter.
2. Moved the sliceOne logic into the ByteBuffersDataInput level, where it determines whether to use the internal buffers or create a new buffer.
3. Deleted the DeflateWithPresetDictCompressionMode#doCompress method whose parameter is byte[].
4. Deleted the copyOneDoc -> readBytes optimization that instantiated ByteArrayDataInput; I will create a new issue to reduce memory copy in copyOneDoc.
[GitHub] [lucene-jira-archive] mikemccand commented on issue #37: Why are some Jira issues completely missing?
mikemccand commented on issue #37:
URL: https://github.com/apache/lucene-jira-archive/issues/37#issuecomment-1187013700

> Some issues were probably moved to SOLR issues?

Oooh I love that theory! But then, wouldn't they have left some digital breadcrumbs? Initial email sent to the list as `LUCENE-`, to my email, etc.?
[GitHub] [lucene-jira-archive] mikemccand commented on issue #37: Why are some Jira issues completely missing?
mikemccand commented on issue #37:
URL: https://github.com/apache/lucene-jira-archive/issues/37#issuecomment-1187016028

Whoa! I found two of the issues! Check it out: https://markmail.org/message/ntz5hxrm4lli57oe and https://markmail.org/message/d5t346iwopaqj45x

Apparently it is possible to delete an issue entirely :) That's a scary power, but maybe it is limited to PMC members or committers?

Phew, mystery solved.
[GitHub] [lucene-jira-archive] mikemccand closed issue #37: Why are some Jira issues completely missing?
mikemccand closed issue #37: Why are some Jira issues completely missing?
URL: https://github.com/apache/lucene-jira-archive/issues/37
[jira] [Commented] (LUCENE-10633) Dynamic pruning for queries sorted by SORTED(_SET) field
[ https://issues.apache.org/jira/browse/LUCENE-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567913#comment-17567913 ]

Michael McCandless commented on LUCENE-10633:
---------------------------------------------

{quote}I plan on opening a PR against luceneutil and I already opened LUCENE-10162 a while back about making this sort of thing a more obvious choice. It also relates to [~gsmiller]'s work about running term-in-set queries using doc values, which would only help if doc values are enabled on the field.
{quote}
Awesome, thanks [~jpountz]!

> Dynamic pruning for queries sorted by SORTED(_SET) field
> --------------------------------------------------------
>
>                 Key: LUCENE-10633
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10633
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> LUCENE-9280 introduced the ability to dynamically prune non-competitive hits when sorting by a numeric field, by leveraging the points index to skip documents that do not compare better than the top of the priority queue maintained by the field comparator.
> However, queries sorted by a SORTED(_SET) field still look at all hits, which is disappointing. Could we leverage the terms index to skip hits?
[GitHub] [lucene-jira-archive] mikemccand opened a new issue, #53: Remove "module" for core components?
mikemccand opened a new issue, #53:
URL: https://github.com/apache/lucene-jira-archive/issues/53

### Description

I noticed we have components like `component:module/core/hnsw`. Could we remove the `module/` prefix for `core`? It would make the core component strings shorter, which is meaningful/helpful since they are so frequently used?

The other real modules (e.g. `component:module/facet`) I think should keep their `module/` prefix?

Also, I like the issue template testing here!

### Version and Environments

_No response_

### Lucene Component

_No response_
[GitHub] [lucene-jira-archive] mikemccand commented on issue #53: Remove "module" for core components?
mikemccand commented on issue #53:
URL: https://github.com/apache/lucene-jira-archive/issues/53#issuecomment-1187165708

Also, could we maybe use `fix-version` and `affects-version` instead of `fixVersion` and `affectsVersion`? I think it is more readable on quick glance with a `-`?

I noticed there is also a `type:new_feature` -- maybe we can make that `type:new-feature` for consistency?
[GitHub] [lucene-jira-archive] mikemccand opened a new issue, #54: Hyperlinks are sometimes not actual links on import
mikemccand opened a new issue, #54:
URL: https://github.com/apache/lucene-jira-archive/issues/54

### Description

I found this example comment, where commitbot had embedded a hyperlink, and in Jira it is a hyperlink, but in the imported GitHub issue (in my test repo), it is just plain text?

https://github.com/mikemccand/stargazers-migration-test/issues/62#issuecomment-1187146685

### Version and Environments

_No response_

### Lucene Component

_No response_
[jira] [Created] (LUCENE-10657) Reduce memory copy in DataOutput copyBytes
LuYunCheng created LUCENE-10657:
-----------------------------------

             Summary: Reduce memory copy in DataOutput copyBytes
                 Key: LUCENE-10657
                 URL: https://issues.apache.org/jira/browse/LUCENE-10657
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/store
            Reporter: LuYunCheng

This is derived from [LUCENE-10627|https://github.com/apache/lucene/pull/987].

The abstract method `copyBytes` in DataOutput has to copy from the input to a copyBuffer and then write into ByteBuffersDataOutput.blocks. I think this is unnecessary: we can override it to copy directly from the input into the output.

With this override, we can:
# Reduce memory copy in `Lucene90CompressingStoredFieldsWriter#copyOneDoc` -> `bufferedDocs.copyBytes(DataInput input)`
# Reduce memory copy in `Lucene90CompoundFormat.writeCompoundFile` -> `data.copyBytes` when the input is `BufferedChecksumIndexInput` and the output is `ByteBuffersDataOutput`
# Reduce memory copy in `IndexWriter#copySegmentAsIs` -> copyFrom -> copyBytes
[GitHub] [lucene] luyuncheng opened a new pull request, #1034: LUCENE-10657: Reduce memory copy in DataOutput copyBytes
luyuncheng opened a new pull request, #1034:
URL: https://github.com/apache/lucene/pull/1034

This is derived from [LUCENE-10627](https://issues.apache.org/jira/browse/LUCENE-10627) at #987
Jira: https://issues.apache.org/jira/browse/LUCENE-10657

The abstract method `copyBytes` in DataOutput has to copy from the input to a copyBuffer and then write into ByteBuffersDataOutput.blocks. I think this is unnecessary: we can override it to copy directly from the input into the output.

With this override, we can:
1. Reduce memory copy in `Lucene90CompressingStoredFieldsWriter#copyOneDoc` -> `bufferedDocs.copyBytes(DataInput input)`
2. Reduce memory copy in `Lucene90CompoundFormat.writeCompoundFile` -> `data.copyBytes` when the input is `BufferedChecksumIndexInput` and the output is `ByteBuffersDataOutput`
3. Reduce memory copy in `IndexWriter#copySegmentAsIs` -> copyFrom -> copyBytes
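In sketch form, the override reads straight into the output's current block, assuming a heap-backed `currentBlock` and an `appendBlock()` helper as in the block-based output (an illustration of the idea, not the exact patch):

```java
// Sketch of copying directly into the output's backing arrays, skipping the
// intermediate copyBuffer used by the default DataOutput#copyBytes.
// `currentBlock` and `appendBlock()` are assumed internals of the output.
@Override
public void copyBytes(DataInput input, long numBytes) throws IOException {
  assert numBytes >= 0 : "numBytes=" + numBytes;
  long remaining = numBytes;
  while (remaining > 0) {
    if (!currentBlock.hasRemaining()) {
      appendBlock(); // allocate / switch to the next block
    }
    int chunk = (int) Math.min(currentBlock.remaining(), remaining);
    // Read directly into the block's backing byte[]; this only works for
    // heap (non-direct) buffers.
    input.readBytes(
        currentBlock.array(), currentBlock.arrayOffset() + currentBlock.position(), chunk);
    currentBlock.position(currentBlock.position() + chunk);
    remaining -= chunk;
  }
}
```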
[jira] [Updated] (LUCENE-10657) Reduce memory copy in DataOutput copyBytes
[ https://issues.apache.org/jira/browse/LUCENE-10657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LuYunCheng updated LUCENE-10657:
--------------------------------
    Description:
This is derived from [LUCENE-10627|https://github.com/apache/lucene/pull/987]
Code: https://github.com/apache/lucene/pull/1034

The abstract method `copyBytes` in DataOutput has to copy from the input to a copyBuffer and then write into ByteBuffersDataOutput.blocks. I think this is unnecessary: we can override it to copy directly from the input into the output.

With this override, we can:
# Reduce memory copy in `Lucene90CompressingStoredFieldsWriter#copyOneDoc` -> `bufferedDocs.copyBytes(DataInput input)`
# Reduce memory copy in `Lucene90CompoundFormat.writeCompoundFile` -> `data.copyBytes` when the input is `BufferedChecksumIndexInput` and the output is `ByteBuffersDataOutput`
# Reduce memory copy in `IndexWriter#copySegmentAsIs` -> copyFrom -> copyBytes

  was:
This is derived from [LUCENE-10627|https://github.com/apache/lucene/pull/987]

The abstract method `copyBytes` in DataOutput has to copy from the input to a copyBuffer and then write into ByteBuffersDataOutput.blocks. I think this is unnecessary: we can override it to copy directly from the input into the output.

With this override, we can:
# Reduce memory copy in `Lucene90CompressingStoredFieldsWriter#copyOneDoc` -> `bufferedDocs.copyBytes(DataInput input)`
# Reduce memory copy in `Lucene90CompoundFormat.writeCompoundFile` -> `data.copyBytes` when the input is `BufferedChecksumIndexInput` and the output is `ByteBuffersDataOutput`
# Reduce memory copy in `IndexWriter#copySegmentAsIs` -> copyFrom -> copyBytes

> Reduce memory copy in DataOutput copyBytes
> ------------------------------------------
>
>                 Key: LUCENE-10657
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10657
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/store
>            Reporter: LuYunCheng
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is derived from [LUCENE-10627|https://github.com/apache/lucene/pull/987]
> Code: https://github.com/apache/lucene/pull/1034
> The abstract method `copyBytes` in DataOutput has to copy from the input to a copyBuffer and then write into ByteBuffersDataOutput.blocks. I think this is unnecessary: we can override it to copy directly from the input into the output.
> With this override, we can:
> # Reduce memory copy in `Lucene90CompressingStoredFieldsWriter#copyOneDoc` -> `bufferedDocs.copyBytes(DataInput input)`
> # Reduce memory copy in `Lucene90CompoundFormat.writeCompoundFile` -> `data.copyBytes` when the input is `BufferedChecksumIndexInput` and the output is `ByteBuffersDataOutput`
> # Reduce memory copy in `IndexWriter#copySegmentAsIs` -> copyFrom -> copyBytes
[GitHub] [lucene] luyuncheng commented on a diff in pull request #987: LUCENE-10627: Using CompositeByteBuf to Reduce Memory Copy
luyuncheng commented on code in PR #987:
URL: https://github.com/apache/lucene/pull/987#discussion_r923284236

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsWriter.java:
@@ -519,7 +518,13 @@ private void copyOneDoc(Lucene90CompressingStoredFieldsReader reader, int docID)
     assert reader.getVersion() == VERSION_CURRENT;
     SerializedDocument doc = reader.document(docID);
     startDocument();
-    bufferedDocs.copyBytes(doc.in, doc.length);
+
+    if (doc.in instanceof ByteArrayDataInput) {
+      // reuse ByteArrayDataInput to reduce memory copy
+      bufferedDocs.copyBytes((ByteArrayDataInput) doc.in, doc.length);
+    } else {
+      bufferedDocs.copyBytes(doc.in, doc.length);
+    }

Review Comment:
> I think that we could avoid this `instanceof` check by overriding `ByteBuffersDataOutput#copyBytes` to read directly into its internal buffers when they are not direct (ie. backed by a `byte[]`)? (Maybe in a separate change?)

@jpountz thanks for your advice! I think overriding `ByteBuffersDataOutput#copyBytes` is a great idea; it can remove the memory copy in many methods.

I opened a new issue: LUCENE-10657 #1034. Moving this logic into `ByteBuffersDataOutput#copyBytes` reduces memory copies in more methods:
1. Reduce memory copy in Lucene90CompressingStoredFieldsWriter#copyOneDoc -> bufferedDocs.copyBytes(DataInput input)
2. Reduce memory copy in Lucene90CompoundFormat.writeCompoundFile -> data.copyBytes when the input is BufferedChecksumIndexInput and the output is ByteBuffersDataOutput
3. Reduce memory copy in IndexWriter#copySegmentAsIs -> copyFrom -> copyBytes
[GitHub] [lucene] jpountz merged pull request #1025: LUCENE-10649: Fix failures in TestDemoParallelLeafReader
jpountz merged PR #1025:
URL: https://github.com/apache/lucene/pull/1025
[jira] [Commented] (LUCENE-10649) Failure in TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField
[ https://issues.apache.org/jira/browse/LUCENE-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567968#comment-17567968 ]

ASF subversion and git services commented on LUCENE-10649:
----------------------------------------------------------

Commit 30a7c52e6c7ed4ffd67a64a13c3f3b25b34853d5 in lucene's branch refs/heads/main from Vigya Sharma
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=30a7c52e6c7 ]

LUCENE-10649: Fix failures in TestDemoParallelLeafReader (#1025)

> Failure in TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-10649
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10649
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Vigya Sharma
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Failing Build Link:
> https://jenkins.thetaphi.de/job/Lucene-main-Linux/35617/testReport/junit/org.apache.lucene.index/TestDemoParallelLeafReader/testRandomMultipleSchemaGensSameField/
> Repro:
> {code:java}
> gradlew test --tests TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField -Dtests.seed=A7496D7D3957981A -Dtests.multiplier=3 -Dtests.locale=sr-Latn-BA -Dtests.timezone=Etc/GMT-7 -Dtests.asserts=true -Dtests.file.encoding=UTF-8
> {code}
> Error:
> {code:java}
> java.lang.AssertionError: expected:<103> but was:<2147483647>
>     at __randomizedtesting.SeedInfo.seed([A7496D7D3957981A:F71866BCCEA1C903]:0)
>     at org.junit.Assert.fail(Assert.java:89)
>     at org.junit.Assert.failNotEquals(Assert.java:835)
>     at org.junit.Assert.assertEquals(Assert.java:647)
>     at org.junit.Assert.assertEquals(Assert.java:633)
>     at org.apache.lucene.index.TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField(TestDemoParallelLeafReader.java:1347)
>     at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
> {code}
[GitHub] [lucene-jira-archive] mocobeta merged pull request #52: Show issue title for linked issues
mocobeta merged PR #52:
URL: https://github.com/apache/lucene-jira-archive/pull/52
[jira] [Commented] (LUCENE-10649) Failure in TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField
[ https://issues.apache.org/jira/browse/LUCENE-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567972#comment-17567972 ]

ASF subversion and git services commented on LUCENE-10649:
----------------------------------------------------------

Commit 41f7618535448e14365f26aebdf6db443a2d0cea in lucene's branch refs/heads/branch_9x from Vigya Sharma
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=41f76185354 ]

LUCENE-10649: Fix failures in TestDemoParallelLeafReader (#1025)

> Failure in TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-10649
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10649
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Vigya Sharma
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Failing Build Link:
> https://jenkins.thetaphi.de/job/Lucene-main-Linux/35617/testReport/junit/org.apache.lucene.index/TestDemoParallelLeafReader/testRandomMultipleSchemaGensSameField/
> Repro:
> {code:java}
> gradlew test --tests TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField -Dtests.seed=A7496D7D3957981A -Dtests.multiplier=3 -Dtests.locale=sr-Latn-BA -Dtests.timezone=Etc/GMT-7 -Dtests.asserts=true -Dtests.file.encoding=UTF-8
> {code}
> Error:
> {code:java}
> java.lang.AssertionError: expected:<103> but was:<2147483647>
>     at __randomizedtesting.SeedInfo.seed([A7496D7D3957981A:F71866BCCEA1C903]:0)
>     at org.junit.Assert.fail(Assert.java:89)
>     at org.junit.Assert.failNotEquals(Assert.java:835)
>     at org.junit.Assert.assertEquals(Assert.java:647)
>     at org.junit.Assert.assertEquals(Assert.java:633)
>     at org.apache.lucene.index.TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField(TestDemoParallelLeafReader.java:1347)
>     at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
> {code}
[GitHub] [lucene] mayya-sharipova commented on pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on PR #992:
URL: https://github.com/apache/lucene/pull/992#issuecomment-1187314554

@jpountz I am wondering if you have further comments for this PR?
[GitHub] [lucene-jira-archive] mocobeta commented on issue #53: Remove "module" for core components?
mocobeta commented on issue #53:
URL: https://github.com/apache/lucene-jira-archive/issues/53#issuecomment-1187457057

> Could we remove the `module/` prefix for `core`? It would make the core component strings shorter, which is meaningful/helpful since they are so frequently used?

We could omit `module` from it, though I would slightly prefer to keep `module` for all components for consistency. This is a bit longer, but we'll use the auto-complete feature when adding/editing labels anyway.

I agree that it looks redundant; perhaps it would be better to change the common label prefix from `component` to `module`, like `module:core/index`, `module:analysis`, `module:sandbox`? `component` comes from Jira's `Components` field, so it does not really mean that much on GitHub?
[GitHub] [lucene-jira-archive] mocobeta commented on issue #53: Remove "module" for core components?
mocobeta commented on issue #53:
URL: https://github.com/apache/lucene-jira-archive/issues/53#issuecomment-1187483046

> Also, could we maybe use `fix-version` and `affects-version` instead of `fixVersion` and `affectsVersion`? I think it is more readable on quick glance with a `-`?
>
> I noticed there is also a `type:new_feature` -- maybe we can make that `type:new-feature` for consistency?

Sure, I have no opinion about the label format and named them without much consideration. I will update these lines.

https://github.com/apache/lucene-jira-archive/blob/e3ec8c748bc6e2887190f4b7c93288da0d31b0af/migration/src/jira2github_import.py#L158-L168
[GitHub] [lucene-jira-archive] mikemccand commented on issue #53: Remove "module" for core components?
mikemccand commented on issue #53:
URL: https://github.com/apache/lucene-jira-archive/issues/53#issuecomment-1187491522

> I agree that it looks redundant, perhaps it would be better to change the common label prefix from `component` to `module` like `module:core/index`, `module:analysis`, `module:sandbox`? `component` comes from Jira's `Components` field, then it does not really mean that much on GitHub?

+1, I like that solution!
[GitHub] [lucene] jpountz commented on a diff in pull request #1034: LUCENE-10657: Reduce memory copy in DataOutput copyBytes
jpountz commented on code in PR #1034:
URL: https://github.com/apache/lucene/pull/1034#discussion_r923411342

## lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataOutput.java:
@@ -309,6 +309,24 @@ public byte[] toArrayCopy() {
     return arr;
   }

+  @Override
+  public void copyBytes(DataInput input, long numBytes) throws IOException {
+    assert numBytes >= 0 : "numBytes=" + numBytes;
+    int length = (int) numBytes;
+    while (length > 0) {
+      if (!currentBlock.hasRemaining()) {
+        appendBlock();
+      }
+
+      int chunk = Math.min(currentBlock.remaining(), length);
+      final int pos = currentBlock.position();
+      byte[] blockArray = currentBlock.array();

Review Comment:
This only works if the ByteBuffer is not direct; maybe we can break the loop when a direct buffer is encountered and fall back to `super.copyBytes` for the remaining bytes?
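In sketch form, the suggested guard would sit at the top of the loop, along these lines (an illustration of the suggestion, not the final patch; `currentBlock` and `length` are the fields from the diff above):

```java
// Sketch: direct ByteBuffers have no accessible backing array, so stop the
// array-based fast path and let the generic copyBuffer implementation finish.
if (currentBlock.isDirect() || currentBlock.hasArray() == false) {
  super.copyBytes(input, length); // copy the remaining `length` bytes
  return;
}
```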
[GitHub] [lucene] jpountz commented on a diff in pull request #1022: LUCENE-10653: Heapify in BMMScorer
jpountz commented on code in PR #1022:
URL: https://github.com/apache/lucene/pull/1022#discussion_r923424214

## lucene/core/src/test/org/apache/lucene/search/TestDisiPriorityQueue.java:
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.PrimitiveIterator.OfInt;
+import java.util.Random;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.tests.util.LuceneTestCase;
+
+public class TestDisiPriorityQueue extends LuceneTestCase {
+
+  public void testRandom() throws Exception {
+    Random r = random();
+
+    int size = r.nextInt(TEST_NIGHTLY ? 1000 : 10);
+    DisiWrapper[] all = new DisiWrapper[size];
+    for (int i = 0; i < size; i++) {
+      DocIdSetIterator it = randomDisi(r);
+      DisiWrapper w = wrapper(it);
+      all[i] = w;
+    }
+
+    DisiPriorityQueue pq = new DisiPriorityQueue(size);
+    if (r.nextBoolean()) {
+      for (DisiWrapper w : all) {
+        pq.add(w);
+      }
+    } else {
+      if (r.nextInt(10) < 2) {
+        int len = random().nextInt(1, size);
+        for (int i = 0; i < len; i++) {
+          pq.add(all[i]);
+        }
+        pq.addAll(all, len, size - len);
+      } else {
+        pq.addAll(all, 0, size);
+      }
+    }
+
+    while (pq.size() > 0) {
+      Arrays.sort(all, Comparator.comparingInt(w -> w.doc));
+      DisiWrapper top = pq.top();
+      assertEquals(all[0].doc, top.doc);
+      if (top.iterator.nextDoc() == DocIdSetIterator.NO_MORE_DOCS) {

Review Comment:
you need to set `top.doc = top.iterator.nextDoc()` otherwise the PQ never gets reordered
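In other words, the tail of the loop would look roughly like this (a sketch against the test above; `pop()`/`updateTop()` are the usual `DisiPriorityQueue` reordering calls):

```java
// Record the iterator's new doc on the wrapper first; the queue orders
// wrappers by `doc`, so a stale value would never trigger a reorder.
top.doc = top.iterator.nextDoc();
if (top.doc == DocIdSetIterator.NO_MORE_DOCS) {
  pq.pop();
} else {
  pq.updateTop();
}
```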
[GitHub] [lucene] jpountz merged pull request #1019: Synchronize FieldInfos#verifyFieldInfos.
jpountz merged PR #1019:
URL: https://github.com/apache/lucene/pull/1019
[GitHub] [lucene] iverase commented on pull request #1017: LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape
iverase commented on PR #1017:
URL: https://github.com/apache/lucene/pull/1017#issuecomment-1187570816

Hey @nknize! I went through the PR and my feeling is that it is still half-cooked, and I worry about adding a new encoding into core that has not been properly worked through. If we need to change something later, we will have a hard time doing it once it is released. My suggestion is to either add it to the sandbox, where there are no back-compat requirements, or wait for another release so you have more time to work on it.

More focused feedback:

- Early versions of XYShape were using encoded rectangles to perform bounding box queries. That proved to be wrong (https://github.com/apache/lucene-solr/pull/865) because the encoding is not linear, so spatial relationships between geometries are not preserved in the encoded space. This PR seems to go back to that approach for XYShape doc values, which is incorrect.
- Lucene does not support any geo faceting, so for Lucene users the real value of the doc values is the ability to perform IndexOrDocValues queries to speed up their query workloads. Still, the current PR seems to focus on supporting faceting, e.g. by adding centroid values. IMHO, if we want to add those facets we should implement geo faceting in Lucene alongside, or we are not providing value to Lucene users and testing would be very hard.
- While the Codec API makes it easy to make changes across minors, we don't have the same flexibility for things that are done on top of the codec API, like this binary encoding. Maybe we should take some more time to review the encoding instead of rushing it into 9.3. For instance, we used the index creation major version to change the way facets store the taxonomy array, but this only allows changes on new major versions.
[GitHub] [lucene-jira-archive] mocobeta opened a new pull request, #55: Refine label texts
mocobeta opened a new pull request, #55:
URL: https://github.com/apache/lucene-jira-archive/pull/55

This refines label texts as suggested in #53.
1. Rename `component:module/xyz` to `module:xyz`
2. Rename `fixVersion:x.x.x` to `fix-version:x.x.x`
3. Rename `affectsVersion:x.x.x` to `affects-version:x.x.x`
4. Remove `type:new_feature`, then map the `New Feature` issue type to `type:enhancement` - I would reduce the supported issue types instead of having `type:new-feature`.
[GitHub] [lucene] hcqs33 commented on pull request #1028: Fix error in TieredMergePolicy
hcqs33 commented on PR #1028:
URL: https://github.com/apache/lucene/pull/1028#issuecomment-1187639318

> This looks good to me, can you add a CHANGES entry under version `9.3` so that users know about this bug fix?

Sure. I have added a CHANGES entry for this PR.
[jira] [Commented] (LUCENE-10651) SimpleQueryParser stack overflow for large nested queries.
[ https://issues.apache.org/jira/browse/LUCENE-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17568068#comment-17568068 ]

Michael McCandless commented on LUCENE-10651:
---------------------------------------------

It's a bit depressing that {{SimpleQueryParser}} consumes Java's stack in proportion to the number of clauses. Though, it is "simple", and just recursing to parse another clause is indeed simple. Too bad we seem to have lost tail recursion from the Lisp days ...

This is really a bug/limitation in {{SimpleQueryParser}}. E.g. a user may increase that clause limit and then try to parse a {{BooleanQuery}} that should work (doesn't have too many clauses) yet hits {{StackOverflowError}}. I wonder if our other query parsers have this problem too?

But I like your fix – it prevents a {{StackOverflowError}} when the returned Query would have failed with {{TooManyClauses}} anyway. It's at least progress, not perfection?

> SimpleQueryParser stack overflow for large nested queries.
> ----------------------------------------------------------
>
>                 Key: LUCENE-10651
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10651
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 9.1, 8.10, 9.2, 9.3
>            Reporter: Marc Handalian
>            Priority: Major
>
> The OpenSearch project received an issue [1] where stack overflow can occur for large nested boolean queries during rewrite. In trying to reproduce this error, I've also encountered SO during parsing when queries expand beyond the default 1024 clause limit. This unit test will fail with SO:
> {code:java}
> public void testSimpleQueryParserWithTooManyClauses() {
>   StringBuilder queryString = new StringBuilder("foo");
>   for (int i = 0; i < 1024; i++) {
>     queryString.append(" | bar").append(i).append(" + baz");
>   }
>   expectThrows(IndexSearcher.TooManyClauses.class, () -> parse(queryString.toString()));
> }
> {code}
> I would expect this case to also fail with TooManyClauses; is my understanding correct? If so, I've attempted a fix [2] that increments a counter during parsing whenever a clause is added.
> [1] https://github.com/opensearch-project/OpenSearch/issues/3760
> [2] https://github.com/mch2/lucene/commit/6a558f17f448b92ae4cf8c43e0b759ff7425acdf
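For reference, the linked fix [2] boils down to counting clauses while parsing instead of waiting for rewrite — roughly like this sketch (field and method names are illustrative, not the actual SimpleQueryParser internals):

```java
// Sketch: bump a counter each time the parser adds a clause, and fail fast
// with the same exception that BooleanQuery rewriting would eventually throw.
private int clauseCount = 0;

private void checkClauseLimit() {
  if (++clauseCount > IndexSearcher.getMaxClauseCount()) {
    throw new IndexSearcher.TooManyClauses();
  }
}
```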
[GitHub] [lucene] nknize commented on pull request #1017: LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape
nknize commented on PR #1017:
URL: https://github.com/apache/lucene/pull/1017#issuecomment-1187657941

> Early versions of XYShape were using encoded rectangles to perform bounding box queries. That proved to be wrong

That has to do with the query, not the field. It does not affect the doc value format and thus shouldn't block the PR. Also, the doc value queries use the same relate logic as the index queries, so they are consistent.

> Lucene does not support any geo faceting

Correct. Solr implements faceting, and since Solr is built on Lucene we need a doc value format to support Solr faceting over the LatLonShape format.

> still the current PR seems to be focusing on supporting faceting,

The PR has nothing to do with faceting, since Lucene doesn't implement faceting. The PR is 100% focused on the doc value format and BoundingBox relations only.

> Maybe we should take some more time to review the encoding instead of rushing it into 9.3. My suggestion is to either add it to the sandbox where there are no back-compat requirements

The encoding is 100% a new field and marked as experimental. It is in core because, since the sandbox module is now in an explicit `.sandbox` package, refactoring to sandbox would require changing core class modifiers and methods from package-private to public so sandbox classes can extend them. This breaks the API. The key takeaway here is that the coordinate encoding is the same as the index format, which won't change unless we decide to change `XYEncodingUtils` or `GeoEncodingUtils`, at which time we will have to support BWC for much more than the doc value format. Furthermore, the field is marked as experimental (the proper mechanism to communicate that formats might change); there are currently far more critical classes marked as experimental (e.g., [Lucene80DocValuesFormat](https://github.com/apache/lucene/blob/216e38a1596cbc6a036d92e68bc26ee483cd4d2a/lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene80/Lucene80DocValuesFormat.java#L138)) that make this less risky for 9.3.
[GitHub] [lucene] luyuncheng commented on a diff in pull request #1034: LUCENE-10657: Reduce memory copy in DataOutput copyBytes
luyuncheng commented on code in PR #1034:
URL: https://github.com/apache/lucene/pull/1034#discussion_r923536318

## lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataOutput.java:
@@ -309,6 +309,24 @@ public byte[] toArrayCopy() {
     return arr;
   }

+  @Override
+  public void copyBytes(DataInput input, long numBytes) throws IOException {
+    assert numBytes >= 0 : "numBytes=" + numBytes;
+    int length = (int) numBytes;
+    while (length > 0) {
+      if (!currentBlock.hasRemaining()) {
+        appendBlock();
+      }
+
+      int chunk = Math.min(currentBlock.remaining(), length);
+      final int pos = currentBlock.position();
+      byte[] blockArray = currentBlock.array();

Review Comment:
++ Thanks for your advice! I added a check: if currentBlock is direct, we break the loop and fall back to `super.copyBytes` for the remaining bytes.
[GitHub] [lucene] luyuncheng commented on pull request #1034: LUCENE-10657: Reduce memory copy in DataOutput copyBytes
luyuncheng commented on PR #1034:
URL: https://github.com/apache/lucene/pull/1034#issuecomment-1187675144

> The logic makes sense to me, can you add unit tests?

Thanks for your review! In commit [926dd](https://github.com/luyuncheng/lucene/commit/926dd0c159c314430a6ace3ec55e50f7ee155b56) I added the check for direct memory, and I added unit tests for the different allocators in copyBytes.
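Such a test can be as simple as a round-trip check (an illustrative sketch in a LuceneTestCase subclass, not the exact tests added in the commit):

```java
// Illustrative round-trip test; the real tests additionally vary the
// ByteBuffersDataOutput block allocator (heap vs. direct) to exercise
// both the array fast path and the super.copyBytes fallback.
public void testCopyBytesRoundTrip() throws IOException {
  byte[] expected = new byte[1 << 20];
  random().nextBytes(expected);
  ByteBuffersDataOutput out = new ByteBuffersDataOutput();
  out.copyBytes(new ByteArrayDataInput(expected), expected.length);
  assertArrayEquals(expected, out.toArrayCopy());
}
```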
[jira] [Commented] (LUCENE-10633) Dynamic pruning for queries sorted by SORTED(_SET) field
[ https://issues.apache.org/jira/browse/LUCENE-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17568109#comment-17568109 ]

Adrien Grand commented on LUCENE-10633:
---------------------------------------

I opened https://github.com/mikemccand/luceneutil/pull/185.

> Dynamic pruning for queries sorted by SORTED(_SET) field
> --------------------------------------------------------
>
>                 Key: LUCENE-10633
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10633
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> LUCENE-9280 introduced the ability to dynamically prune non-competitive hits when sorting by a numeric field, by leveraging the points index to skip documents that do not compare better than the top of the priority queue maintained by the field comparator.
> However, queries sorted by a SORTED(_SET) field still look at all hits, which is disappointing. Could we leverage the terms index to skip hits?
[jira] [Commented] (LUCENE-10651) SimpleQueryParser stack overflow for large nested queries.
[ https://issues.apache.org/jira/browse/LUCENE-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17568114#comment-17568114 ]

Adrien Grand commented on LUCENE-10651:
---------------------------------------

bq. But I like your fix – it prevents a StackOverflowError when the returned Query would have failed with TooManyClauses anyway.

I like this approach too.

> SimpleQueryParser stack overflow for large nested queries.
> ----------------------------------------------------------
>
>                 Key: LUCENE-10651
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10651
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 9.1, 8.10, 9.2, 9.3
>            Reporter: Marc Handalian
>            Priority: Major
>
> The OpenSearch project received an issue [1] where stack overflow can occur for large nested boolean queries during rewrite. In trying to reproduce this error, I've also encountered SO during parsing when queries expand beyond the default 1024 clause limit. This unit test will fail with SO:
> {code:java}
> public void testSimpleQueryParserWithTooManyClauses() {
>   StringBuilder queryString = new StringBuilder("foo");
>   for (int i = 0; i < 1024; i++) {
>     queryString.append(" | bar").append(i).append(" + baz");
>   }
>   expectThrows(IndexSearcher.TooManyClauses.class, () -> parse(queryString.toString()));
> }
> {code}
> I would expect this case to also fail with TooManyClauses; is my understanding correct? If so, I've attempted a fix [2] that increments a counter during parsing whenever a clause is added.
> [1] https://github.com/opensearch-project/OpenSearch/issues/3760
> [2] https://github.com/mch2/lucene/commit/6a558f17f448b92ae4cf8c43e0b759ff7425acdf
[jira] [Updated] (LUCENE-10657) CopyBytes now saves one memory copy on ByteBuffersDataOutput
[ https://issues.apache.org/jira/browse/LUCENE-10657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LuYunCheng updated LUCENE-10657: Summary: CopyBytes now saves one memory copy on ByteBuffersDataOutput (was: Reduce memory copy in DataOutput copyBytes) > CopyBytes now saves one memory copy on ByteBuffersDataOutput > > > Key: LUCENE-10657 > URL: https://issues.apache.org/jira/browse/LUCENE-10657 > Project: Lucene - Core > Issue Type: Improvement > Components: core/store >Reporter: LuYunCheng >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > > This is derived from > [LUCENE-10627|[https://github.com/apache/lucene/pull/987]] > Code: [https://github.com/apache/lucene/pull/1034] > The `copyBytes` method in DataOutput has to copy from the input into a copyBuffer and then write into ByteBuffersDataOutput.blocks. I think this intermediate copy is unnecessary: we can override the method to copy directly from input into output. > With this override: > # Reduce memory copy in `Lucene90CompressingStoredFieldsWriter#copyOneDoc` > -> `bufferedDocs.copyBytes(DataInput input)` > # Reduce memory copy in `Lucene90CompoundFormat.writeCompoundFile` -> > `data.copyBytes` when input is `BufferedChecksumIndexInput` and output is > `ByteBuffersDataOutput` > # Reduce memory copy in `IndexWriter#copySegmentAsIs` -> `copyFrom` -> `copyBytes` > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
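For contrast, the buffered base-class behavior that the override avoids looks roughly like this (a paraphrase of the default DataOutput#copyBytes, not the literal source):

```java
// Paraphrased sketch: every byte is touched twice, once going into the
// intermediate buffer and once going into the output.
byte[] copyBuffer = new byte[16384]; // buffer size is illustrative
while (numBytes > 0) {
  int toCopy = (int) Math.min(numBytes, copyBuffer.length);
  input.readBytes(copyBuffer, 0, toCopy); // copy #1: input -> copyBuffer
  writeBytes(copyBuffer, 0, toCopy);      // copy #2: copyBuffer -> output blocks
  numBytes -= toCopy;
}
```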
[jira] [Commented] (LUCENE-10651) SimpleQueryParser stack overflow for large nested queries.
[ https://issues.apache.org/jira/browse/LUCENE-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17568116#comment-17568116 ] Adrien Grand commented on LUCENE-10651: --- I'm not familiar with the simple query parser, but we seem to create sub state objects in {{consumeSubQuery}}, do we need to add the number of nested clauses of these sub states to the top-level state to properly count the overall number of clauses? > SimpleQueryParser stack overflow for large nested queries. > -- > > Key: LUCENE-10651 > URL: https://issues.apache.org/jira/browse/LUCENE-10651 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 9.1, 8.10, 9.2, 9.3 >Reporter: Marc Handalian >Priority: Major > > The OpenSearch project received an issue [1] where stack overflow can occur > for large nested boolean queries during rewrite. In trying to reproduce this > error I've also encountered SO during parsing where queries expand beyond the > default 1024 clause limit. This unit test will fail with SO: > {code:java} > public void testSimpleQueryParserWithTooManyClauses() { > StringBuilder queryString = new StringBuilder("foo"); > for (int i = 0; i < 1024; i++) { > queryString.append(" | bar").append(i).append(" + baz"); > } > expectThrows(IndexSearcher.TooManyClauses.class, () -> > parse(queryString.toString())); > } > {code} > I would expect this case to also fail with TooManyClauses, is my > understanding correct? If so, I've attempted a fix [2] that during parsing > increments a counter whenever a clause is added. > [1] [https://github.com/opensearch-project/OpenSearch/issues/3760] > [2] > [https://github.com/mch2/lucene/commit/6a558f17f448b92ae4cf8c43e0b759ff7425acdf] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] luyuncheng commented on pull request #1034: LUCENE-10657: CopyBytes now saves one memory copy on ByteBuffersDataOutput
luyuncheng commented on PR #1034: URL: https://github.com/apache/lucene/pull/1034#issuecomment-1187753914 > This looks good to me, can you add a CHANGES entry that copyBytes now saves one memory copy on ByteBuffersDataOutput? Thanks for your review and suggestions. I have updated the CHANGES entry -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] Yuti-G opened a new pull request, #1035: LUCENE-10652: Add a top-n range faceting example to RangeFacetsExample
Yuti-G opened a new pull request, #1035: URL: https://github.com/apache/lucene/pull/1035 ### Description In [LUCENE-10614](https://issues.apache.org/jira/browse/LUCENE-10614), we modified the behavior of getTopChildren to actually return top-n ranges ordered by count. The original behavior of getTopChildren in RangeFacetsCounts was to return all ranges ordered by constructor-specified range order, and this behavior is now retained in the getAllChildren API ([LUCENE-10550](https://issues.apache.org/jira/browse/LUCENE-10550)). Therefore, it would be helpful to add an example in RangeFacetsExample to demo this change. I replaced the original example of getTopChildren with getAllChildren, and added a new example of getTopChildren. https://issues.apache.org/jira/browse/LUCENE-10652 ### Tests Added new unit test -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
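The contrast between the two APIs can be sketched as follows; `fc` (a populated FacetsCollector) and `ranges` are assumed setup, and the authoritative code is the example in the PR itself:

```java
import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.range.LongRangeFacetCounts;

Facets facets = new LongRangeFacetCounts("timestamp", fc, ranges);
// all ranges, in the order given to the constructor (the old getTopChildren
// behavior, now available via LUCENE-10550's getAllChildren)
FacetResult all = facets.getAllChildren("timestamp");
// the n ranges with the highest counts, ordered by count (LUCENE-10614)
FacetResult top = facets.getTopChildren(2, "timestamp");
```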
[GitHub] [lucene] uschindler commented on a diff in pull request #1034: LUCENE-10657: CopyBytes now saves one memory copy on ByteBuffersDataOutput
uschindler commented on code in PR #1034: URL: https://github.com/apache/lucene/pull/1034#discussion_r923639294 ## lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataOutput.java: ## @@ -309,6 +309,24 @@ public byte[] toArrayCopy() { return arr; } + @Override + public void copyBytes(DataInput input, long numBytes) throws IOException { +assert numBytes >= 0 : "numBytes=" + numBytes; +int length = (int) numBytes; +while (length > 0) { + if (!currentBlock.hasRemaining()) { +appendBlock(); + } + + int chunk = Math.min(currentBlock.remaining(), length); + final int pos = currentBlock.position(); + byte[] blockArray = currentBlock.array(); Review Comment: There's another problem: if the block has an array, it may not always start at 0. If the ByteBuffer is a slice of another one, the "live" data may start at a later offset. position != arrayOffset; those are two different coordinates. This may be important if the input is a memory-mapped index input or similar. To get this right, use `currentBlock.array()` and `currentBlock.arrayOffset()` to get the correct slice; the position must be added afterwards. So the following needs to be corrected to: ```java input.readBytes(currentBlock.array(), Math.addExact(currentBlock.arrayOffset(), currentBlock.position()), chunk); ``` (the previous lines are obsolete) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
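The position/arrayOffset distinction is easy to see with plain JDK buffers; a small standalone demonstration (nothing Lucene-specific):

```java
import java.nio.ByteBuffer;

public class SliceOffsetDemo {
  public static void main(String[] args) {
    ByteBuffer parent = ByteBuffer.allocate(16);
    parent.position(4);
    ByteBuffer slice = parent.slice();
    // position() is relative to the slice; arrayOffset() locates the slice
    // inside the shared backing array:
    System.out.println(slice.position());    // 0
    System.out.println(slice.arrayOffset()); // 4
  }
}
```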
[GitHub] [lucene] uschindler commented on a diff in pull request #1034: LUCENE-10657: CopyBytes now saves one memory copy on ByteBuffersDataOutput
uschindler commented on code in PR #1034: URL: https://github.com/apache/lucene/pull/1034#discussion_r923642311 ## lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataOutput.java: ## @@ -309,6 +309,30 @@ public byte[] toArrayCopy() { return arr; } + @Override + public void copyBytes(DataInput input, long numBytes) throws IOException { +assert numBytes >= 0 : "numBytes=" + numBytes; +int length = (int) numBytes; +while (length > 0) { + if (!currentBlock.hasRemaining()) { +appendBlock(); + } + if (currentBlock.isDirect()) { Review Comment: this should be `false == currentBlock.hasArray()`. Direct and array-backed are roughly opposites, but that's just an implementation detail. So whenever a buffer does *not* have an array, we need to fall back to the super method; it does not matter whether it is direct or not. There could be a third kind of buffer :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] luyuncheng commented on a diff in pull request #1034: LUCENE-10657: CopyBytes now saves one memory copy on ByteBuffersDataOutput
luyuncheng commented on code in PR #1034: URL: https://github.com/apache/lucene/pull/1034#discussion_r923662970 ## lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataOutput.java: ## @@ -309,6 +309,24 @@ public byte[] toArrayCopy() { return arr; } + @Override + public void copyBytes(DataInput input, long numBytes) throws IOException { +assert numBytes >= 0 : "numBytes=" + numBytes; +int length = (int) numBytes; +while (length > 0) { + if (!currentBlock.hasRemaining()) { +appendBlock(); + } + + int chunk = Math.min(currentBlock.remaining(), length); + final int pos = currentBlock.position(); + byte[] blockArray = currentBlock.array(); Review Comment: > There's another problem: if the block has an array, it may not always start at 0. If the ByteBuffer is a slice of another one, the "live" data may start at a later offset. position != arrayOffset; those are two different coordinates. This may be important if the input is a memory-mapped index input or similar. > > To get this right, use `currentBlock.array()` and `currentBlock.arrayOffset()` to get the correct slice; the position must be added afterwards. So the following needs to be corrected to: > > ```java > input.readBytes(currentBlock.array(), Math.addExact(currentBlock.arrayOffset(), currentBlock.position()), chunk); > ``` > Thanks for this advice!! I fixed it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] luyuncheng commented on a diff in pull request #1034: LUCENE-10657: CopyBytes now saves one memory copy on ByteBuffersDataOutput
luyuncheng commented on code in PR #1034: URL: https://github.com/apache/lucene/pull/1034#discussion_r923663218 ## lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataOutput.java: ## @@ -309,6 +309,30 @@ public byte[] toArrayCopy() { return arr; } + @Override + public void copyBytes(DataInput input, long numBytes) throws IOException { +assert numBytes >= 0 : "numBytes=" + numBytes; +int length = (int) numBytes; +while (length > 0) { + if (!currentBlock.hasRemaining()) { +appendBlock(); + } + if (currentBlock.isDirect()) { Review Comment: ++, thanks for it. I fixed it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
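Putting both review fixes together (the hasArray fallback and the arrayOffset-aware read), the override might end up looking roughly like this; a sketch only, the merged patch is authoritative:

```java
@Override
public void copyBytes(DataInput input, long numBytes) throws IOException {
  assert numBytes >= 0 : "numBytes=" + numBytes;
  long length = numBytes;
  while (length > 0) {
    if (!currentBlock.hasRemaining()) {
      appendBlock();
    }
    if (!currentBlock.hasArray()) {
      // no backing array to read into directly: fall back to the buffered copy
      super.copyBytes(input, length);
      return;
    }
    int chunk = (int) Math.min(currentBlock.remaining(), length);
    // arrayOffset() accounts for sliced buffers whose live data does not
    // start at index 0 of the backing array
    input.readBytes(
        currentBlock.array(),
        Math.addExact(currentBlock.arrayOffset(), currentBlock.position()),
        chunk);
    currentBlock.position(currentBlock.position() + chunk);
    length -= chunk;
  }
}
```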
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits
mayya-sharipova commented on code in PR #947: URL: https://github.com/apache/lucene/pull/947#discussion_r923672692 ## lucene/core/src/java/org/apache/lucene/codecs/lucene93/Lucene93HnswVectorsReader.java: ## @@ -169,7 +173,12 @@ private void validateFieldEntry(FieldInfo info, FieldEntry fieldEntry) { + fieldEntry.dimension); } -long numBytes = (long) fieldEntry.size() * dimension * Float.BYTES; +long numBytes; +switch (fieldEntry.vectorEncoding) { + case BYTE -> numBytes = (long) fieldEntry.size() * dimension; + case FLOAT32 -> numBytes = (long) fieldEntry.size() * dimension * Float.BYTES; + default -> throw new AssertionError("unknown vector encoding " + fieldEntry.vectorEncoding); Review Comment: Should we also update the error message that follows for the `BYTE` case? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits
mayya-sharipova commented on code in PR #947: URL: https://github.com/apache/lucene/pull/947#discussion_r923800118 ## lucene/core/src/java/org/apache/lucene/util/VectorUtil.java: ## @@ -213,4 +213,38 @@ public static void add(float[] u, float[] v) { u[i] += v[i]; } } + + /** + * Dot product score computed over signed bytes, scaled to be in [0, 1]. + * + * @param a bytes containing a vector + * @param aOffset offset of the vector in a + * @param b bytes containing another vector, of the same dimension + * @param len the length (aka dimension) of the vectors + * @param bOffset offset of the vector in b + * @return the value of the similarity function applied to the two vectors + */ + public static float dotProductScore(BytesRef a, int aOffset, BytesRef b, int bOffset, int len) { +int total = 0; +for (int i = 0; i < len; i++) { + total += a.bytes[aOffset++] * b.bytes[bOffset++]; +} +// divide by 2 * 2^14 (maximum absolute value of product of 2 signed bytes) * len +return (1 + total) / (float) (len * (1 << 15)); Review Comment: To make scores non-negative should we do instead: `total / (float) (len * (1 << 15)) + 1`? I am also wondering why in the comments we say `// divide by 2 * 2^14`, but in the calculation we use `1 << 15`? Should it be `1 << 14` instead? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
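For reference, the two constants in the second question are the same quantity; a quick check of the arithmetic in plain Java (not code from the patch):

```java
// The maximum absolute value of a product of two signed bytes is
// (-128) * (-128) = 16384 = 2^14, so "2 * 2^14" and "1 << 15" agree.
int maxAbsProduct = 128 * 128;
assert maxAbsProduct == 1 << 14;
assert 2 * maxAbsProduct == 1 << 15;
// With len terms, total lies in [-len * 2^14, len * 2^14], so
// (1 + total) / (len * (1 << 15)) spans roughly [-0.5, 0.5]; it can indeed
// be negative, which is what the first question is getting at.
```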
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits
mayya-sharipova commented on code in PR #947: URL: https://github.com/apache/lucene/pull/947#discussion_r923807979 ## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java: ## @@ -89,4 +123,67 @@ public abstract TopDocs search( public KnnVectorsReader getMergeInstance() { return this; } + + /** {@link #searchExhaustively} */ + protected static TopDocs exhaustiveSearch( + VectorValues vectorValues, + DocIdSetIterator acceptDocs, + VectorSimilarityFunction similarityFunction, + float[] target, + int k) + throws IOException { +HitQueue queue = new HitQueue(k, true); +ScoreDoc topDoc = queue.top(); +int doc; +while ((doc = acceptDocs.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) { + int vectorDoc = vectorValues.advance(doc); + assert vectorDoc == doc; + float score = similarityFunction.compare(vectorValues.vectorValue(), target); + if (score >= topDoc.score) { +topDoc.score = score; +topDoc.doc = doc; +topDoc = queue.updateTop(); + } +} +return topDocsFromHitQueue(queue, acceptDocs.cost()); + } + + /** {@link #searchExhaustively} */ + protected static TopDocs exhaustiveSearch( + VectorValues vectorValues, + DocIdSetIterator acceptDocs, + VectorSimilarityFunction similarityFunction, Review Comment: looks like `similarityFunction` is not necessary here, as we always use `dotProductScore` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta opened a new pull request, #1036: LUCENE-10557: Refine issue label texts
mocobeta opened a new pull request, #1036: URL: https://github.com/apache/lucene/pull/1036 Follow-up of #1024. This refines issue label texts as suggested in https://github.com/apache/lucene-jira-archive/issues/53 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta merged pull request #1036: LUCENE-10557: Refine issue label texts
mocobeta merged PR #1036: URL: https://github.com/apache/lucene/pull/1036 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira
[ https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17568321#comment-17568321 ] ASF subversion and git services commented on LUCENE-10557: -- Commit 781edf442b200145e4fc3fc59de554b7a0c0b57b in lucene's branch refs/heads/main from Tomoko Uchida [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=781edf442b2 ] LUCENE-10557: Refine issue label texts (#1036) > Migrate to GitHub issue from Jira > - > > Key: LUCENE-10557 > URL: https://issues.apache.org/jira/browse/LUCENE-10557 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Attachments: Screen Shot 2022-06-29 at 11.02.35 AM.png, > image-2022-06-29-13-36-57-365.png, screenshot-1.png > > Time Spent: 1h 40m > Remaining Estimate: 0h > > A few (not the majority) Apache projects already use the GitHub issue instead > of Jira. For example, > Airflow: [https://github.com/apache/airflow/issues] > BookKeeper: [https://github.com/apache/bookkeeper/issues] > So I think it'd be technically possible that we move to GitHub issue. I have > little knowledge of how to proceed with it, I'd like to discuss whether we > should migrate to it, and if so, how to smoothly handle the migration. > The major tasks would be: > * (/) Get a consensus about the migration among committers > * (/) Choose issues that should be moved to GitHub - We'll migrate all > issues towards an atomic switch to GitHub if no major technical obstacles > show up. > ** Discussion thread > [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12] > ** -Conclusion for now: We don't migrate any issues. Only new issues should > be opened on GitHub.- > ** Write a prototype migration script - the decision could be made on that. > Things to consider: > *** version numbers - labels or milestones? > *** add a comment/ prepend a link to the source Jira issue on github side, > *** add a comment/ prepend a link on the jira side to the new issue on > github side (for people who access jira from blogs, mailing list archives and > other sources that will have stale links), > *** convert cross-issue automatic links in comments/ descriptions (as > suggested by Robert), > *** strategy to deal with sub-issues (hierarchies), > *** maybe prefix (or postfix) the issue title on github side with the > original LUCENE-XYZ key so that it is easier to search for a particular issue > there? > *** how to deal with user IDs (author, reporter, commenters)? Do they have > to be github users? Will information about people not registered on github be > lost? > *** create an extra mapping file of old-issue-new-issue URLs for any > potential future uses. > *** what to do with issue numbers in git/svn commits? These could be > rewritten but it'd change the entire git history tree - I don't think this is > practical, while doable. 
> * Prepare a complete migration tool > ** See https://github.com/apache/lucene-jira-archive/issues/5 > * Build the convention for issue label/milestone management > ** See [https://github.com/apache/lucene-jira-archive/issues/6] > ** Do some experiments on a sandbox repository > [https://github.com/mocobeta/sandbox-lucene-10557] > ** Make documentation for metadata (label/milestone) management > * (/) Enable Github issue on the lucene's repository > ** Raise an issue on INFRA > ** (Create an issue-only private repository for sensitive issues if it's > needed and allowed) > ** Set a mail hook to > [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to > the general mail group name) > * Set a schedule for migration > ** See [https://github.com/apache/lucene-jira-archive/issues/7] > ** Give some time to committers to play around with issues/labels/milestones > before the actual migration > ** Make an announcement on the mail lists > ** Show some text messages when opening a new Jira issue (in issue template?) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zacharymorn commented on a diff in pull request #1018: LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions
zacharymorn commented on code in PR #1018: URL: https://github.com/apache/lucene/pull/1018#discussion_r924065392 ## lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java: ## @@ -191,6 +191,66 @@ public long cost() { // or null if it is not applicable // pkg-private for forcing use of BooleanScorer in tests BulkScorer optionalBulkScorer(LeafReaderContext context) throws IOException { +if (scoreMode == ScoreMode.TOP_SCORES) { + if (query.getMinimumNumberShouldMatch() > 1 || weightedClauses.size() > 2) { +return null; + } + + List optional = new ArrayList<>(); + for (WeightedBooleanClause wc : weightedClauses) { +Weight w = wc.weight; +BooleanClause c = wc.clause; +if (c.getOccur() != Occur.SHOULD) { + continue; +} +ScorerSupplier scorer = w.scorerSupplier(context); +if (scorer != null) { + optional.add(scorer); +} + } + + if (optional.size() <= 1) { +return null; + } + + List optionalScorers = new ArrayList<>(); + for (ScorerSupplier ss : optional) { +optionalScorers.add(ss.get(Long.MAX_VALUE)); + } + + return new BulkScorer() { +final Scorer bmmScorer = new BlockMaxMaxscoreScorer(BooleanWeight.this, optionalScorers); +final int maxDoc = context.reader().maxDoc(); +final DocIdSetIterator iterator = bmmScorer.iterator(); + +@Override +public int score(LeafCollector collector, Bits acceptDocs, int min, int max) +throws IOException { + collector.setScorer(bmmScorer); + + int doc = min; + while (true) { +doc = iterator.advance(doc); Review Comment: > If the last match of the first window is say 998 and the first match after the first window is 1005. Then we should make sure to score 1005 when scoring the second window before starting to advance. Thanks for the explanation, it makes sense! I thought the old implementation would take care of this boundary case as it initialized `doc` to `min` before advancing, but the behavior is indeed undefined if the previous call to `bulkScorer#score` already advanced its internal scorer past the next `min` used. I've updated the code with the above solution.
## lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java: ## @@ -191,6 +191,66 @@ public long cost() { // or null if it is not applicable // pkg-private for forcing use of BooleanScorer in tests BulkScorer optionalBulkScorer(LeafReaderContext context) throws IOException { +if (scoreMode == ScoreMode.TOP_SCORES) { + if (query.getMinimumNumberShouldMatch() > 1 || weightedClauses.size() > 2) { +return null; + } + + List optional = new ArrayList<>(); + for (WeightedBooleanClause wc : weightedClauses) { +Weight w = wc.weight; +BooleanClause c = wc.clause; +if (c.getOccur() != Occur.SHOULD) { + continue; +} +ScorerSupplier scorer = w.scorerSupplier(context); +if (scorer != null) { + optional.add(scorer); +} + } + + if (optional.size() <= 1) { +return null; + } + + List optionalScorers = new ArrayList<>(); + for (ScorerSupplier ss : optional) { +optionalScorers.add(ss.get(Long.MAX_VALUE)); + } + + return new BulkScorer() { +final Scorer bmmScorer = new BlockMaxMaxscoreScorer(BooleanWeight.this, optionalScorers); +final int maxDoc = context.reader().maxDoc(); +final DocIdSetIterator iterator = bmmScorer.iterator(); + +@Override +public int score(LeafCollector collector, Bits acceptDocs, int min, int max) +throws IOException { + collector.setScorer(bmmScorer); + + int doc = min; + while (true) { +doc = iterator.advance(doc); + +if (doc >= max) { + return doc; +} + +if (acceptDocs == null || acceptDocs.get(doc)) { + collector.collect(doc); +} + +doc++; + } +} + +@Override +public long cost() { + return maxDoc; Review Comment: Updated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zacharymorn commented on pull request #1018: LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions
zacharymorn commented on PR #1018: URL: https://github.com/apache/lucene/pull/1018#issuecomment-1188612201 Here are the latest benchmark results: `wikinightly`
```
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
TermBGroup1M1P 55.21 (7.3%) 53.75 (5.5%) -2.6% ( -14% - 11%) 0.200
PKLookup 330.25 (4.3%) 326.53 (3.9%) -1.1% ( -8% - 7%) 0.387
TermDateFacets 71.88 (4.7%) 71.13 (4.8%) -1.0% ( -10% - 8%) 0.487
MedTermDayTaxoFacets 77.44 (4.5%) 76.67 (4.8%) -1.0% ( -9% - 8%) 0.502
IntNRQ 1216.26 (2.5%) 1207.48 (2.8%) -0.7% ( -5% - 4%) 0.387
AndHighHighDayTaxoFacets 13.63 (2.2%) 13.55 (2.6%) -0.5% ( -5% - 4%) 0.471
SpanNear 28.99 (3.4%) 28.86 (2.6%) -0.5% ( -6% - 5%) 0.621
Fuzzy1 142.65 (1.4%) 142.01 (1.5%) -0.4% ( -3% - 2%) 0.333
TermDTSort 142.65 (3.1%) 142.02 (2.9%) -0.4% ( -6% - 5%) 0.644
Respell 96.15 (1.5%) 95.76 (2.0%) -0.4% ( -3% - 3%) 0.483
TermBGroup1M 52.81 (3.0%) 52.60 (4.1%) -0.4% ( -7% - 6%) 0.731
TermDayOfYearSort 259.79 (3.1%) 258.78 (2.8%) -0.4% ( -6% - 5%) 0.681
TermMonthSort 275.05 (6.2%) 274.05 (7.0%) -0.4% ( -12% - 13%) 0.861
TermGroup10K 43.13 (2.8%) 42.98 (3.8%) -0.4% ( -6% - 6%) 0.731
Term 2665.46 (4.2%) 2656.78 (4.1%) -0.3% ( -8% - 8%) 0.803
TermGroup100 66.52 (3.4%) 66.35 (4.2%) -0.3% ( -7% - 7%) 0.834
TermGroup1M 30.69 (2.5%) 30.62 (3.5%) -0.3% ( -6% - 5%) 0.792
TermTitleSort 190.21 (6.2%) 189.75 (7.0%) -0.2% ( -12% - 13%) 0.908
Fuzzy2 83.28 (1.7%) 83.17 (1.9%) -0.1% ( -3% - 3%) 0.809
AndHighOrMedMed 73.32 (5.9%) 73.41 (5.7%) 0.1% ( -10% - 12%) 0.946
IntervalsOrdered 4.63 (4.0%) 4.64 (4.0%) 0.2% ( -7% - 8%) 0.877
AndHighMedDayTaxoFacets 48.94 (2.1%) 49.06 (2.6%) 0.2% ( -4% - 5%) 0.743
AndMedOrHighHigh 72.55 (5.1%) 72.74 (4.7%) 0.3% ( -9% - 10%) 0.863
BrowseDayOfYearSSDVFacets 26.43 (5.6%) 26.51 (4.2%) 0.3% ( -9% - 10%) 0.837
Phrase 53.31 (4.7%) 53.49 (4.9%) 0.3% ( -8% - 10%) 0.820
Prefix3 416.08 (9.2%) 417.57 (9.1%) 0.4% ( -16% - 20%) 0.901
BrowseMonthSSDVFacets 30.34 (11.6%) 30.58 (12.5%) 0.8% ( -20% - 28%) 0.836
SloppyPhrase 4.71 (3.7%) 4.75 (3.6%) 0.9% ( -6% - 8%) 0.427
BrowseRandomLabelSSDVFacets 20.64 (6.6%) 20.83 (5.7%) 0.9% ( -10% - 14%) 0.630
AndHighMed 111.69 (3.5%) 113.66 (3.3%) 1.8% ( -4% - 8%) 0.101
AndHighHigh 102.15 (3.4%) 104.05 (3.3%) 1.9% ( -4% - 8%) 0.079
Wildcard 162.58 (7.6%) 165.88 (5.6%) 2.0% ( -10% - 16%) 0.337
OrHighMedDayTaxoFacets 22.45 (5.7%) 22.96 (4.7%) 2.3% ( -7% - 13%) 0.174
BrowseDateSSDVFacets 4.24 (32.2%) 4.44 (30.1%) 4.7% ( -43% - 98%) 0.630
BrowseDayOfYearTaxoFacets 27.90 (31.8%) 29.47 (33.8%) 5.6% ( -45% - 104%) 0.586
BrowseDateTaxoFacets 27.82 (31.8%) 29.39 (33.9%) 5.7% ( -45% - 104%) 0.587
OrHighHigh 21.30 (5.9%) 22.64 (5.3%) 6.3% ( -4% - 18%) 0.000
BrowseRandomLabelTaxoFacets 28.61 (48.5%) 30.61 (50.4%) 7.0% ( -61% - 205%) 0.656
BrowseMonthTaxoFacets 27.81 (33.3%) 30.05 (38.0%) 8.1% ( -47% - 118%) 0.475
OrHighMed 122.60 (6.1%) 234.50 (7.2%) 91.3% ( 73% - 111%) 0.000
```
```
Task  QPS baseline  StdDev  QPS my_modified
[GitHub] [lucene] zacharymorn commented on pull request #1018: LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions
zacharymorn commented on PR #1018: URL: https://github.com/apache/lucene/pull/1018#issuecomment-1188668774 > It appears this implementation would bring a -4% performance impact to OrXNotY tasks in general. I limited the implementation to pure disjunctions here https://github.com/apache/lucene/pull/1018/commits/f4fdfea5d4f7df5df42aea474f71738eefbb68c6 as well and it does seem to help: `wikimedium.10M.nostopwords.tasks` results:
```
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
OrNotHighMed 1581.93 (2.4%) 1549.94 (3.3%) -2.0% ( -7% - 3%) 0.027
OrHighNotMed 1991.40 (3.6%) 1957.40 (4.7%) -1.7% ( -9% - 6%) 0.199
OrHighNotHigh 1785.23 (3.3%) 1756.53 (3.5%) -1.6% ( -8% - 5%) 0.134
OrNotHighHigh 2021.35 (3.8%) 1997.89 (4.3%) -1.2% ( -8% - 7%) 0.362
BrowseDateSSDVFacets 4.64 (31.5%) 4.59 (30.3%) -1.0% ( -47% - 88%) 0.922
IntNRQ 134.80 (0.3%) 133.52 (3.8%) -0.9% ( -5% - 3%) 0.274
OrNotHighLow 2953.48 (5.5%) 2925.66 (4.9%) -0.9% ( -10% - 9%) 0.566
MedTerm 3755.74 (4.3%) 3728.74 (4.9%) -0.7% ( -9% - 8%) 0.623
AndHighLow 2170.32 (4.1%) 2154.98 (3.5%) -0.7% ( -7% - 7%) 0.556
LowPhrase 213.41 (2.1%) 212.05 (1.8%) -0.6% ( -4% - 3%) 0.312
AndHighMed 405.12 (4.3%) 402.86 (4.8%) -0.6% ( -9% - 9%) 0.701
HighTerm 2262.86 (4.6%) 2252.96 (4.7%) -0.4% ( -9% - 9%) 0.765
OrHighNotLow 1674.08 (3.8%) 1666.78 (5.0%) -0.4% ( -8% - 8%) 0.755
MedSpanNear 185.05 (3.0%) 184.50 (3.1%) -0.3% ( -6% - 6%) 0.762
BrowseDayOfYearSSDVFacets 26.80 (12.2%) 26.73 (11.9%) -0.3% ( -21% - 27%) 0.940
MedPhrase 250.33 (2.1%) 249.81 (2.3%) -0.2% ( -4% - 4%) 0.767
HighSpanNear 58.57 (3.8%) 58.53 (4.1%) -0.1% ( -7% - 8%) 0.958
AndHighMedDayTaxoFacets 93.86 (1.9%) 93.82 (1.7%) -0.1% ( -3% - 3%) 0.927
HighIntervalsOrdered 14.41 (5.3%) 14.41 (5.8%) -0.0% ( -10% - 11%) 0.984
LowSpanNear 56.31 (2.7%) 56.38 (2.4%) 0.1% ( -4% - 5%) 0.880
HighPhrase 52.42 (3.2%) 52.54 (2.8%) 0.2% ( -5% - 6%) 0.815
AndHighHigh 107.35 (3.7%) 107.65 (3.7%) 0.3% ( -6% - 7%) 0.812
MedIntervalsOrdered 13.19 (2.3%) 13.24 (2.0%) 0.4% ( -3% - 4%) 0.584
LowTerm 3849.77 (3.4%) 3865.43 (3.4%) 0.4% ( -6% - 7%) 0.706
LowIntervalsOrdered 12.28 (4.9%) 12.33 (5.4%) 0.4% ( -9% - 11%) 0.797
AndHighHighDayTaxoFacets 22.30 (2.4%) 22.39 (2.3%) 0.4% ( -4% - 5%) 0.565
Prefix3 129.48 (6.0%) 130.23 (5.1%) 0.6% ( -9% - 12%) 0.745
HighTermMonthSort 198.37 (6.9%) 199.63 (6.8%) 0.6% ( -12% - 15%) 0.768
PKLookup 321.51 (3.9%) 323.87 (4.8%) 0.7% ( -7% - 9%) 0.595
MedSloppyPhrase 91.15 (2.2%) 91.95 (2.2%) 0.9% ( -3% - 5%) 0.214
Respell 86.99 (2.1%) 87.84 (2.5%) 1.0% ( -3% - 5%) 0.174
Fuzzy1 118.73 (2.0%) 120.09 (2.1%) 1.1% ( -2% - 5%) 0.084
Fuzzy2 46.39 (2.2%) 46.92 (2.8%) 1.1% ( -3% - 6%) 0.153
LowSloppyPhrase 23.79 (2.4%) 24.08 (2.6%) 1.2% ( -3% - 6%) 0.122
TermDTSort 262.23 (6.6%) 265.58 (7.0%) 1.3% ( -11% - 15%) 0.552
BrowseMonthSSDVFacets 29.37 (12.5%) 29.75 (13.7%) 1.3% ( -22% - 31%) 0.753
HighTermDayOfYearSort 226.43 (9.4%) 229.75 (9.4%) 1.5% ( -15% - 22%) 0.622
HighSloppyPhrase 7.13 (4.1%) 7.24 (3.7%) 1.6% ( -6% - 9%) 0.206
Wildcar