Re: [I] Move group-varint encoding/decoding logic to DataOutput/DataInput? [lucene]
easyice commented on issue #12826: URL: https://github.com/apache/lucene/issues/12826#issuecomment-1823961753

The wikimediumall benchmark result looks fine on Java 21:

```
Task                    QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff        p-value
IntNRQ                      7.32  (10.5%)         7.17  (10.5%)   -2.1%  (-20% - 21%)  0.528
HighTermTitleBDVSort        4.24   (6.7%)         4.18   (7.4%)   -1.4%  (-14% - 13%)  0.521
PKLookup                   97.77   (3.0%)        96.77   (4.1%)   -1.0%  ( -7% -  6%)  0.372
Respell                    31.39   (2.0%)        31.25   (2.4%)   -0.5%  ( -4% -  3%)  0.504
HighSpanNear                2.06   (3.3%)         2.06   (2.9%)    0.1%  ( -5% -  6%)  0.946
AndHighHigh                12.99   (3.0%)        13.01   (2.5%)    0.2%  ( -5% -  5%)  0.835
Fuzzy1                     37.99   (2.1%)        38.09   (2.1%)    0.3%  ( -3% -  4%)  0.674
HighSloppyPhrase            8.97   (3.3%)         9.02   (3.4%)    0.6%  ( -5% -  7%)  0.573
HighIntervalsOrdered       13.16   (3.4%)        13.24   (3.3%)    0.6%  ( -5% -  7%)  0.544
MedSpanNear                 8.66   (2.1%)         8.72   (1.9%)    0.7%  ( -3% -  4%)  0.300
LowIntervalsOrdered        63.85   (2.2%)        64.40   (2.7%)    0.9%  ( -3% -  5%)  0.264
MedIntervalsOrdered         1.00   (2.9%)         1.01   (2.8%)    0.9%  ( -4% -  6%)  0.301
LowSpanNear                12.79   (3.0%)        12.92   (2.2%)    1.0%  ( -4% -  6%)  0.235
AndHighMed                 26.44   (3.2%)        26.72   (3.1%)    1.1%  ( -5% -  7%)  0.283
LowSloppyPhrase             5.84   (2.4%)         5.90   (3.1%)    1.1%  ( -4% -  6%)  0.214
HighPhrase                  7.72   (3.9%)         7.81   (4.7%)    1.1%  ( -7% - 10%)  0.421
OrHighHigh                 16.44   (4.0%)        16.63   (4.2%)    1.1%  ( -6% -  9%)  0.388
AndHighLow                412.43   (4.0%)       417.26   (3.2%)    1.2%  ( -5% -  8%)  0.307
HighTermDayOfYearSort     142.91   (3.9%)       144.72   (3.1%)    1.3%  ( -5% -  8%)  0.256
OrHighNotMed              309.12   (4.9%)       313.44   (3.6%)    1.4%  ( -6% - 10%)  0.307
MedSloppyPhrase            11.26   (3.1%)        11.43   (3.6%)    1.5%  ( -5% -  8%)  0.169
OrHighLow                 177.96   (3.6%)       181.05   (3.1%)    1.7%  ( -4% -  8%)  0.102
OrHighNotLow              214.08   (4.5%)       217.91   (4.2%)    1.8%  ( -6% - 10%)  0.194
OrNotHighHigh             179.87   (4.5%)       183.13   (3.8%)    1.8%  ( -6% - 10%)  0.171
MedPhrase                  96.44   (4.8%)        98.24   (5.0%)    1.9%  ( -7% - 12%)  0.231
Fuzzy2                     25.19   (2.6%)        25.67   (3.1%)    1.9%  ( -3% -  7%)  0.041
HighTermTitleSort          82.34   (3.9%)        84.06   (3.6%)    2.1%  ( -5% -  9%)  0.076
LowPhrase                  44.99   (4.5%)        45.94   (4.7%)    2.1%  ( -6% - 11%)  0.144
OrHighNotHigh             139.88   (4.8%)       142.89   (4.6%)    2.2%  ( -6% - 12%)  0.144
OrNotHighLow              269.48   (2.8%)       275.65   (2.7%)    2.3%  ( -3% -  7%)  0.008
OrHighMed                  48.08   (9.1%)        49.22   (7.5%)    2.4%  (-13% - 20%)  0.369
OrNotHighMed              140.40   (6.3%)       143.99   (6.4%)    2.6%  ( -9% - 16%)  0.202
TermDTSort                 78.01   (5.5%)        80.15   (5.7%)    2.7%  ( -8% - 14%)  0.125
Wildcard                   40.16   (2.6%)        41.47   (3.0%)    3.3%  ( -2% -  9%)  0.000
HighTerm                  302.83   (6.1%)       313.49   (4.7%)    3.5%  ( -6% - 15%)  0.041
MedTerm                   235.66   (6.7%)       245.00   (7.5%)    4.0%  ( -9% - 19%)  0.078
LowTerm                   245.23   (8.6%)       255.28  (10.4%)    4.1%  (-13% - 25%)  0.176
HighTermMonthSort        1037.99   (6.6%)      1089.92   (6.2%)    5.0%  ( -7% - 19%)  0.013
Prefix3                    56.06   (3.7%)        60.53   (3.6%)    8.0%  (  0% - 15%)  0.000
```

but a bit of regression on Java 17, round 1:

```
Task    QPS baseline    StdDev    QPS
Re: [PR] Simplify advancing on postings/impacts enums [lucene]
jpountz commented on code in PR #12810: URL: https://github.com/apache/lucene/pull/12810#discussion_r1403031948

## lucene/core/src/java/org/apache/lucene/codecs/MultiLevelSkipListReader.java:

```
@@ -63,7 +63,7 @@ public abstract class MultiLevelSkipListReader implements Closeable {
   protected int[] skipDoc;

   /** Doc id of last read skip entry with docId <= target. */
-  private int lastDoc;
+  private int lastDoc = -1;
```

Review Comment: Another good point. This one felt a bit less obvious to me because of how it's used internally by unversioned codecs, e.g. SimpleText, but your point about external codecs makes sense; I'll rename.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Simplify advancing on postings/impacts enums [lucene]
jpountz commented on code in PR #12810: URL: https://github.com/apache/lucene/pull/12810#discussion_r1403032682

## lucene/core/src/java/org/apache/lucene/codecs/MultiLevelSkipListReader.java:

```
@@ -63,7 +63,7 @@ public abstract class MultiLevelSkipListReader implements Closeable {
   protected int[] skipDoc;

   /** Doc id of last read skip entry with docId <= target. */
-  private int lastDoc;
+  private int lastDoc = -1;
```

Review Comment: I pushed a rename.
Re: [I] Move group-varint encoding/decoding logic to DataOutput/DataInput? [lucene]
jpountz commented on issue #12826: URL: https://github.com/apache/lucene/issues/12826#issuecomment-1823968702

Thanks for testing, I was wondering about the virtual call too. In theory, group-varint works well because you can start decoding the next group before being fully done with the current one, but this only works if reading a group gets inlined so that the compiler can figure it out. The annoying part is that reading a sequence of vints would ideally go into an int[], but in our case we need a long[] because this is what postings are using (for different reasons, it helps them better take advantage of auto-vectorization).
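The group layout described above can be made concrete with a small, self-contained sketch. This is not Lucene's actual `GroupVIntReader`/`DataInput` code; the class and method names below are invented for illustration. One control byte packs four 2-bit length codes (bytes-per-value minus one), followed by the four values' payload bytes, and decoding goes into a `long[]` as postings require:

```java
// Hedged sketch of group-varint coding; not Lucene's implementation.
public class GroupVIntSketch {

  // Encode exactly four ints: one control byte holding four 2-bit length
  // codes (numBytes - 1), then each value's bytes in little-endian order.
  static byte[] encode(int[] values) {
    int control = 0, total = 1;
    int[] sizes = new int[4];
    for (int i = 0; i < 4; i++) {
      int bits = 32 - Integer.numberOfLeadingZeros(values[i]);
      sizes[i] = Math.max(1, (bits + 7) / 8);
      control |= (sizes[i] - 1) << (i * 2);
      total += sizes[i];
    }
    byte[] buf = new byte[total];
    buf[0] = (byte) control;
    int pos = 1;
    for (int i = 0; i < 4; i++) {
      for (int b = 0; b < sizes[i]; b++) {
        buf[pos++] = (byte) (values[i] >>> (8 * b));
      }
    }
    return buf;
  }

  // Decode one group into a long[]. The length of value i+1 is known from the
  // control byte before value i is fully assembled, which is what lets an
  // inlined decoder overlap work across values.
  static void decode(byte[] buf, long[] dst) {
    int control = buf[0] & 0xFF;
    int pos = 1;
    for (int i = 0; i < 4; i++) {
      int numBytes = ((control >>> (i * 2)) & 0x3) + 1; // 1..4 bytes
      long v = 0;
      for (int b = 0; b < numBytes; b++) {
        v |= (buf[pos++] & 0xFFL) << (8 * b);
      }
      dst[i] = v;
    }
  }

  public static void main(String[] args) {
    int[] src = {3, 300, 70_000, 1 << 25};
    long[] dst = new long[4];
    decode(encode(src), dst);
    for (int i = 0; i < 4; i++) {
      if (dst[i] != src[i]) throw new AssertionError("mismatch at " + i);
    }
    System.out.println("round-trip ok");
  }
}
```

Whether the per-group decode loop gets inlined into the caller is exactly the JIT question discussed in this thread.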
Re: [PR] Improve DirectReader java doc [lucene]
gf2121 merged PR #12835: URL: https://github.com/apache/lucene/pull/12835
Re: [I] Move group-varint encoding/decoding logic to DataOutput/DataInput? [lucene]
easyice commented on issue #12826: URL: https://github.com/apache/lucene/issues/12826#issuecomment-1824101886

Thanks for explaining! You are right: if we call `DataInput.readVIntGroup`, the function `DataInput.readVIntGroupLong` (the same as `GroupVIntReader#readLong`) is not inlined, but if we switch to `readVIntGroupFully`, then `DataInput.readVIntGroupLong` can be inlined.

**baseline**, using `GroupVIntReader#readValues`: `GroupVIntReader#readLong` is inlined.

**approach_1**, switching to `DataInput.readVIntGroup`: `DataInput.readVIntGroupLong` is NOT inlined.

**approach_2**, switching to `DataInput.readVIntGroupFully`: `DataInput.readVIntGroupLong` is inlined.
Re: [PR] Add support for index sorting with document blocks [lucene]
s1monw commented on code in PR #12829: URL: https://github.com/apache/lucene/pull/12829#discussion_r1403174609

## lucene/core/src/test/org/apache/lucene/index/TestIndexSorting.java:

```
@@ -3173,4 +3173,184 @@ public void testSortDocsAndFreqsAndPositionsAndOffsets() throws IOException {
     reader.close();
     dir.close();
   }
+
+  public void testBlockIsMissingParentField() throws IOException {
+    try (Directory dir = newDirectory()) {
+      IndexWriterConfig iwc = new IndexWriterConfig(new MockAnalyzer(random()));
+      String parentField = "parent";
+      Sort indexSort = new Sort(parentField, new SortField("foo", SortField.Type.INT));
+      iwc.setIndexSort(indexSort);
+      try (IndexWriter writer = new IndexWriter(dir, iwc)) {
+        List<Runnable> runnabels =
+            Arrays.asList(
+                () -> {
+                  IllegalArgumentException ex =
+                      expectThrows(
+                          IllegalArgumentException.class,
+                          () -> {
+                            writer.addDocuments(Arrays.asList(new Document(), new Document()));
+                          });
+                  assertEquals(
+                      "the last document in the block must contain a numeric doc values field named: parent",
+                      ex.getMessage());
+                },
+                () -> {
+                  IllegalArgumentException ex =
+                      expectThrows(
+                          IllegalArgumentException.class,
+                          () -> {
+                            Document doc = new Document();
+                            doc.add(new NumericDocValuesField("parent", 0));
+                            writer.addDocuments(Arrays.asList(doc, new Document()));
+                          });
+                  assertEquals(
+                      "only the last document in the block must contain a numeric doc values field named: parent",
```

Review Comment: I am not sure what the semantics of this would be. I think we should not violate the semantics of the `addDocuments` API, which guarantees that the insert order is preserved across merges etc.; from that perspective I think it's actually good that we don't do that. Also, if you rely on a certain order, you should sort ahead of time or at retrieval time.
Re: [PR] Add support for index sorting with document blocks [lucene]
s1monw commented on code in PR #12829: URL: https://github.com/apache/lucene/pull/12829#discussion_r1403178782

## lucene/core/src/test/org/apache/lucene/index/TestIndexSorting.java:

```
@@ -3173,4 +3173,184 @@ public void testSortDocsAndFreqsAndPositionsAndOffsets() throws IOException {
     reader.close();
     dir.close();
   }
+
+  public void testBlockIsMissingParentField() throws IOException {
+    try (Directory dir = newDirectory()) {
+      IndexWriterConfig iwc = new IndexWriterConfig(new MockAnalyzer(random()));
+      String parentField = "parent";
+      Sort indexSort = new Sort(parentField, new SortField("foo", SortField.Type.INT));
+      iwc.setIndexSort(indexSort);
+      try (IndexWriter writer = new IndexWriter(dir, iwc)) {
+        List<Runnable> runnabels =
+            Arrays.asList(
+                () -> {
+                  IllegalArgumentException ex =
+                      expectThrows(
+                          IllegalArgumentException.class,
+                          () -> {
+                            writer.addDocuments(Arrays.asList(new Document(), new Document()));
+                          });
+                  assertEquals(
+                      "the last document in the block must contain a numeric doc values field named: parent",
+                      ex.getMessage());
+                },
+                () -> {
+                  IllegalArgumentException ex =
+                      expectThrows(
+                          IllegalArgumentException.class,
+                          () -> {
+                            Document doc = new Document();
+                            doc.add(new NumericDocValuesField("parent", 0));
+                            writer.addDocuments(Arrays.asList(doc, new Document()));
+                          });
+                  assertEquals(
+                      "only the last document in the block must contain a numeric doc values field named: parent",
+                      ex.getMessage());
+                },
+                () -> {
+                  IllegalArgumentException ex =
+                      expectThrows(
+                          IllegalArgumentException.class,
+                          () -> {
+                            writer.addDocuments(Arrays.asList(new Document()));
+                          });
+                  assertEquals(
+                      "the last document in the block must contain a numeric doc values field named: parent",
+                      ex.getMessage());
+                });
+        Collections.shuffle(runnabels, random());
+        for (Runnable runnable : runnabels) {
+          runnable.run();
+        }
+      }
+    }
+  }
+
+  public void testIndexWithSortIsCongruent() throws IOException {
+    try (Directory dir = newDirectory()) {
+      IndexWriterConfig iwc = new IndexWriterConfig(new MockAnalyzer(random()));
+      String parentField = "parent";
+      Sort indexSort = new Sort(parentField, new SortField("foo", SortField.Type.INT));
+      iwc.setIndexSort(indexSort);
+      try (IndexWriter writer = new IndexWriter(dir, iwc)) {
+        Document child1 = new Document();
+        child1.add(new StringField("id", Integer.toString(1), Store.YES));
+        Document child2 = new Document();
+        child2.add(new StringField("id", Integer.toString(1), Store.YES));
+        Document parent = new Document();
+        parent.add(new StringField("id", Integer.toString(1), Store.YES));
+        parent.add(new NumericDocValuesField(parentField, 0));
```

Review Comment: No, it has no meaning; we could add this field automatically at index time if a parent field is set on the IWC or the sort, and record the number of children. If you want to model multiple levels of nesting, you can do that on top of the API. All we guarantee is that this block of docs is unchanged across merges; I don't think we should do more. You are free to use whatever marker you want to identify sub-roots. Yet, you can't use the parent field; I think that is reserved for the top level. One upside of having a single value for all parents is that it's very storage-efficient.
Re: [PR] Add support for index sorting with document blocks [lucene]
s1monw commented on PR #12829: URL: https://github.com/apache/lucene/pull/12829#issuecomment-1824135498

> I'm a little worried about giving up functionality, but I think if we had a list of parent-fields rather than a single parent-field that would cover what we can do today? Maybe one is enough - you can still do hierarchies with one field, but today's API actually lets you define multiple overlapping relations.

I don't think we give up any functionality. Can you elaborate on what functionality you are referring to? I don't think we should have a list of parent fields that IW requires; what would that list be used for? I also don't understand how the API lets you define overlapping relations that this change would prevent.
Re: [PR] Add a merge policy wrapper that performs recursive graph bisection on merge. [lucene]
jpountz merged PR #12622: URL: https://github.com/apache/lucene/pull/12622
Re: [PR] Simplify advancing on postings/impacts enums [lucene]
jpountz commented on PR #12810: URL: https://github.com/apache/lucene/pull/12810#issuecomment-1824353933

I'll merge to make sure it gets some time in CI before we cut a release. Feel free to raise concerns after merging if you have any, I'll happily address them and revert if necessary.
Re: [PR] Simplify advancing on postings/impacts enums [lucene]
jpountz merged PR #12810: URL: https://github.com/apache/lucene/pull/12810
Re: [PR] Allow FST builder to use different writer (#12543) [lucene]
dungba88 commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1402058230

## lucene/core/src/java/org/apache/lucene/util/fst/FST.java:

```
@@ -435,6 +433,13 @@ public FST(FSTMetadata metadata, DataInput in, Outputs outputs, FSTStore f
     this.fstReader = fstReader;
   }

+  /**
+   * @return true if and only if this FST is readable (e.g has a reverse BytesReader)
+   */
+  public boolean isReadable() {
```

Review Comment: Yeah, I was thinking that since we will remove it anyway, maybe we don't need this, along with the null check (as they seem to be throw-away code). If users try to use a non-readable FST now, an NPE will be thrown instead (of IllegalStateException). And then when we change FSTCompiler to return FSTMetadata, they can't even create a non-readable FST. What do you think?

Alternatively, we can double back to `NullFSTReader`, but let it throw an exception like this:

```
private static final class NullFSTReader implements FSTReader {

  @Override
  public long ramBytesUsed() {
    return 0;
  }

  @Override
  public FST.BytesReader getReverseBytesReader() {
    throw new UnsupportedOperationException(
        "NullFSTReader does not support getReverseBytesReader()");
  }

  @Override
  public void writeTo(DataOutput out) {
    throw new UnsupportedOperationException(
        "NullFSTReader does not support writeTo(DataOutput)");
  }
}
```

The benefits are:
- We can enforce `FSTReader` to be non-null.
- We can remove all null checks.

In the next PR, `NullFSTReader` can be used for the temporary FST created during building.
Re: [I] Virtual threads and Lucene (support async tasks) [lucene]
Jeevananthan-23 commented on issue #12531: URL: https://github.com/apache/lucene/issues/12531#issuecomment-1824474577

@uschindler, I understand the complexity of completely rewriting all Lucene internals. However, IMO it is necessary to do so in parallel. Relying entirely on MMAP is a bad idea; it is only good for warmed-up queries. Ref: https://db.cs.cmu.edu/mmap-cidr2022/
Re: [I] Explore a single scoring implementation in DrillSidewaysScorer [LUCENE-10037] [lucene]
slow-J commented on issue #11076: URL: https://github.com/apache/lucene/issues/11076#issuecomment-1824516300

Hi @gsmiller, I found a drill-sideways tasks file in luceneutil ([ds.tasks](https://github.com/mikemccand/luceneutil/blob/5fae60800edd84c70492ac8765824ca2e4a6a991/tasks/ds.tasks)). I tried locally creating a wikimediumds (`WIKI_MEDIUM_DS = Data('wikimediumds', constants.WIKI_MEDIUM_DOCS_LINE_FILE, constants.WIKI_MEDIUM_DOCS_COUNT, constants.WIKI_DS)`) and ran `python3 src/python/localrun.py -source wikimediumds`.

I commented out the calls to `doUnionScoring` and `doDrillDownAdvanceScoring` and left only `doQueryFirstScoring` (https://github.com/slow-J/lucene/commit/7423ddda30a2ca5e3fd90ee9743b3c40ad6b94e0); it passes all unit tests.

```
Task               QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff        p-value
HighTermEasyDD2     2718.56   (6.0%)   2656.82   (4.8%)   -2.3%  (-12% -  9%)  0.189
MedTermEasyOrDD1    2454.91   (5.6%)   2417.95   (4.4%)   -1.5%  (-10% -  8%)  0.343
MedTermMixedDD2     3184.16   (5.5%)   3137.70   (7.0%)   -1.5%  (-13% - 11%)  0.465
LowTermEasyDD2      2602.71   (6.0%)   2570.49   (5.6%)   -1.2%  (-12% - 11%)  0.503
LowTermEasyDD1      2583.23   (4.7%)   2553.44   (5.2%)   -1.2%  (-10% -  9%)  0.465
HighTermMixedDD2    2673.50   (4.3%)   2643.56   (6.8%)   -1.1%  (-11% - 10%)  0.536
LowTermHardDD1      2668.88   (5.9%)   2639.66   (4.0%)   -1.1%  (-10% -  9%)  0.496
MedTermHardDD1      2320.13   (5.1%)   2298.27   (5.7%)   -0.9%  (-11% - 10%)  0.580
HighTermHardOrDD2   2477.61   (4.8%)   2454.33   (3.6%)   -0.9%  ( -8% -  7%)  0.481
PKLookup             110.75  (12.0%)    109.98  (11.3%)   -0.7%  (-21% - 25%)  0.849
HighTermEasyDD1     2777.91   (6.0%)   2762.76   (6.7%)   -0.5%  (-12% - 12%)  0.787
MedTermHardOrDD1    2998.49   (5.2%)   2983.15   (5.8%)   -0.5%  (-10% - 11%)  0.769
HighTermEasyOrDD1   2513.54   (4.9%)   2503.33   (4.4%)   -0.4%  ( -9% -  9%)  0.782
LowTermHardOrDD2    2349.94   (6.2%)   2346.39   (4.9%)   -0.2%  (-10% - 11%)  0.932
LowTermHardDD2      2566.37   (4.5%)   2570.74   (5.5%)    0.2%  ( -9% - 10%)  0.914
MedTermEasyDD2      3231.73   (6.8%)   3239.67   (6.4%)    0.2%  (-12% - 14%)  0.906
MedTermEasyDD1      3323.66   (7.4%)   3344.85   (5.4%)    0.6%  (-11% - 14%)  0.755
LowTermEasyOrDD1    2385.39   (5.0%)   2401.56   (5.6%)    0.7%  ( -9% - 11%)  0.686
LowTermEasyOrDD2    2430.31   (4.7%)   2454.95   (5.6%)    1.0%  ( -8% - 11%)  0.534
HighTermEasyOrDD2   2394.51   (5.0%)   2424.64   (6.2%)    1.3%  ( -9% - 13%)  0.478
MedTermEasyOrDD2    2856.22   (7.0%)   2892.76   (5.1%)    1.3%  (-10% - 14%)  0.508
LowTermMixedDD2     2672.52   (5.9%)   2716.97   (5.9%)    1.7%  ( -9% - 14%)  0.372
MedTermHardDD2      2319.03   (6.4%)   2361.41   (4.6%)    1.8%  ( -8% - 13%)  0.299
HighTermHardOrDD1   2550.89   (5.4%)   2598.09   (2.7%)    1.9%  ( -5% - 10%)  0.172
LowTermHardOrDD1    2429.07   (7.3%)   2484.87   (4.6%)    2.3%  ( -8% - 15%)  0.234
HighTermHardDD2     2627.08   (5.8%)   2697.49   (5.7%)    2.7%  ( -8% - 14%)  0.139
MedTermHardOrDD2    2846.21   (6.0%)   2956.36   (6.2%)    3.9%  ( -7% - 17%)  0.045
HighTermHardDD1     2684.90   (4.4%)   2800.76   (3.3%)    4.3%  ( -3% - 12%)  0.000
```

I think from the results we can see that there is no performance loss from removing `doUnionScoring` and `doDrillDownAdvanceScoring`. I am wondering: is the setup correct enough to trust the results? Does [ds.tasks](https://github.com/mikemccand/luceneutil/blob/5fae60800edd84c70492ac8765824ca2e4a6a991/tasks/ds.tasks) contain enough tasks?
[PR] Move MergeState.DocMap to a FunctionalInterface [lucene]
s1monw opened a new pull request, #12836: URL: https://github.com/apache/lucene/pull/12836

This change converts MergeState.DocMap to a functional interface to make use of lambda expressions.
Re: [PR] Move MergeState.DocMap to a FunctionalInterface [lucene]
jpountz commented on code in PR #12836: URL: https://github.com/apache/lucene/pull/12836#discussion_r1403533598

## lucene/core/src/java/org/apache/lucene/index/IndexWriter.java:

```
@@ -5237,13 +5234,10 @@ public int get(int docID) {
     for (int i = 0; i < docMaps.length; ++i) {
       MergeState.DocMap reorderDocMap = reorderDocMaps[i];
       docMaps[i] =
-          new MergeState.DocMap() {
-            @Override
-            public int get(int docID) {
-              int reorderedDocId = reorderDocMap.get(docID);
-              int compactedDocId = compactionDocMap.get(reorderedDocId);
-              return compactedDocId;
-            }
+          docID -> {
+            int reorderedDocId = reorderDocMap.get(docID);
+            int compactedDocId = compactionDocMap.get(reorderedDocId);
+            return compactedDocId;
```

Review Comment: Seeing the above makes me want to write something along the lines of `(reorderDocMap::get).andThen(compactionDocMap::get)`. (Just food for thought, it might not actually be practical.)
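As an aside, the JDK's primitive functional interfaces already support this kind of composition without boxing. The sketch below is hypothetical: it uses plain `IntUnaryOperator` lambdas with made-up mappings as stand-ins for the two doc maps, not real `MergeState.DocMap` instances:

```java
import java.util.function.IntUnaryOperator;

public class DocMapComposeSketch {
  // Hypothetical stand-ins for the reorder and compaction doc maps in the
  // quoted diff; the mappings themselves are invented for illustration.
  static final IntUnaryOperator REORDER = docID -> docID * 2;
  static final IntUnaryOperator COMPACTION = docID -> docID + 1;

  public static void main(String[] args) {
    // andThen stays on primitive ints end to end, so no Integer boxing occurs,
    // unlike composing Function<Integer, Integer> instances.
    IntUnaryOperator composed = REORDER.andThen(COMPACTION);
    if (composed.applyAsInt(3) != 7) throw new AssertionError();
    System.out.println(composed.applyAsInt(3)); // 7
  }
}
```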
Re: [PR] Speedup concurrent multi-segment HNWS graph search [lucene]
mayya-sharipova commented on code in PR #12794: URL: https://github.com/apache/lucene/pull/12794#discussion_r1403551939

## lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:

```
@@ -26,26 +26,71 @@
  * @lucene.experimental
  */
 public final class TopKnnCollector extends AbstractKnnCollector {
+  private static final float DEFAULT_GREEDINESS = 0.9f;
```

Review Comment: @vigyasharma Thanks for your feedback. Indeed, more documentation is needed; I will add it after we finalize the experiments. The general idea behind introducing a second, shorter local queue is that searches of different graphs can progress differently. We don't want to stop searching a graph if we are just starting and still in a bad neighbourhood where similarity can be worse than the globally collected results; we still want to make some progress. As you correctly noticed, `greediness` is meant to express how greedy our local, segment-based search is when we are not globally competitive. A good approach could be to be greedy and not do much exploration in this case, keeping the size of the second queue small.
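The queue sizing this greediness drives can be checked in isolation. A trivial sketch (the class name is invented; the formula and the 0.9 default are taken from the diff quoted elsewhere in this PR):

```java
public class GreedinessSketch {
  // Mirrors the sizing in the TopKnnCollector patch: the secondary
  // "exploration" queue holds the (1 - greediness) fraction of k, at least 1.
  static int queuegSize(float greediness, int k) {
    return Math.max(1, Math.round((1 - greediness) * k));
  }

  public static void main(String[] args) {
    // With the default greediness of 0.9 and k = 100, the secondary queue
    // keeps 10 entries; a fully greedy search (greediness = 1) keeps just 1.
    System.out.println(queuegSize(0.9f, 100)); // 10
    System.out.println(queuegSize(1.0f, 100)); // 1
  }
}
```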
Re: [PR] Speedup concurrent multi-segment HNWS graph search [lucene]
mayya-sharipova commented on code in PR #12794: URL: https://github.com/apache/lucene/pull/12794#discussion_r1403560016

## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:

```
@@ -79,24 +81,30 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException {
       filterWeight = null;
     }

+    MaxScoreAccumulator globalMinSimAcc = new MaxScoreAccumulator();
     TaskExecutor taskExecutor = indexSearcher.getTaskExecutor();
     List<LeafReaderContext> leafReaderContexts = reader.leaves();
     List<Callable<TopDocs>> tasks = new ArrayList<>(leafReaderContexts.size());
     for (LeafReaderContext context : leafReaderContexts) {
-      tasks.add(() -> searchLeaf(context, filterWeight));
+      tasks.add(() -> searchLeaf(context, filterWeight, globalMinSimAcc));
```

Review Comment: `MaxScoreAccumulator` is thread-safe; it uses `LongAccumulator`, which is enough for our purpose of collecting the max global similarity. The same instance of `MaxScoreAccumulator` is shared among all slices. This approach is similar to `TopScoreDocCollector`.
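Why `LongAccumulator` suffices here can be shown with a minimal, self-contained sketch. This is a simplification under stated assumptions: Lucene's `MaxScoreAccumulator` also packs a doc id into the accumulated long (omitted here), and similarities are assumed non-negative so their raw IEEE-754 bits are order-preserving:

```java
import java.util.concurrent.atomic.LongAccumulator;
import java.util.stream.IntStream;

public class GlobalMaxSimSketch {
  // Lock-free global max across threads: LongAccumulator only needs the
  // accumulator function (Long::max) to be commutative and associative.
  static float maxSimilarity(float[] sims) {
    LongAccumulator acc = new LongAccumulator(Long::max, Long.MIN_VALUE);
    // accumulate() is safe to call concurrently from many threads/slices.
    IntStream.range(0, sims.length)
        .parallel()
        .forEach(i -> acc.accumulate(Float.floatToIntBits(sims[i])));
    return Float.intBitsToFloat((int) acc.get());
  }

  public static void main(String[] args) {
    System.out.println(maxSimilarity(new float[] {0.1f, 0.9f, 0.4f, 0.7f})); // 0.9
  }
}
```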
Re: [PR] Speedup concurrent multi-segment HNWS graph search [lucene]
mayya-sharipova commented on code in PR #12794: URL: https://github.com/apache/lucene/pull/12794#discussion_r1403563168

## lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:

```
@@ -26,26 +26,71 @@
  * @lucene.experimental
  */
 public final class TopKnnCollector extends AbstractKnnCollector {
+  private static final float DEFAULT_GREEDINESS = 0.9f;
   private final NeighborQueue queue;
+  private final float greediness;
+  private final NeighborQueue queueg;
+  private final MaxScoreAccumulator globalMinSimAcc;
+  private boolean kResultsCollected = false;
+  private float cachedGlobalMinSim = Float.NEGATIVE_INFINITY;
+
+  // greediness of globally non-competitive search: [0,1]

   /**
    * @param k the number of neighbors to collect
    * @param visitLimit how many vector nodes the results are allowed to visit
+   * @param globalMinSimAcc the global minimum competitive similarity tracked across all segments
    */
-  public TopKnnCollector(int k, int visitLimit) {
+  public TopKnnCollector(int k, int visitLimit, MaxScoreAccumulator globalMinSimAcc) {
+    super(k, visitLimit);
+    this.greediness = DEFAULT_GREEDINESS;
+    this.queue = new NeighborQueue(k, false);
+    int queuegSize = Math.max(1, Math.round((1 - greediness) * k));
+    this.queueg = new NeighborQueue(queuegSize, false);
```

Review Comment: Indeed, it is local.
Re: [PR] Move MergeState.DocMap to a FunctionalInterface [lucene]
s1monw commented on code in PR #12836: URL: https://github.com/apache/lucene/pull/12836#discussion_r1403570681

## lucene/core/src/java/org/apache/lucene/index/IndexWriter.java:

```
@@ -5237,13 +5234,10 @@
     for (int i = 0; i < docMaps.length; ++i) {
       MergeState.DocMap reorderDocMap = reorderDocMaps[i];
       docMaps[i] =
-          new MergeState.DocMap() {
-            @Override
-            public int get(int docID) {
-              int reorderedDocId = reorderDocMap.get(docID);
-              int compactedDocId = compactionDocMap.get(reorderedDocId);
-              return compactedDocId;
-            }
+          docID -> {
+            int reorderedDocId = reorderDocMap.get(docID);
+            int compactedDocId = compactionDocMap.get(reorderedDocId);
+            return compactedDocId;
```

Review Comment: Yeah, I am a bit afraid of the boxing that would happen here, but we can for sure remove the variables.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
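On the boxing worry: a lambda targeting an int-specialized `@FunctionalInterface` like `MergeState.DocMap` stays primitive end to end, unlike a `Function<Integer, Integer>`, which would box every docID. The `DocMap` below is a stand-in sketch of that shape, not Lucene's actual interface:

```java
// Illustrative stand-in for MergeState.DocMap: primitive in, primitive out,
// so lambdas targeting it allocate no Integer objects per call.
@FunctionalInterface
interface DocMap {
  int get(int docID);
}

public class DocMapDemo {
  public static void main(String[] args) {
    DocMap reorderDocMap = docID -> docID + 1;    // pretend reorder map
    DocMap compactionDocMap = docID -> docID * 2; // pretend compaction map

    // Composition as in the IndexWriter snippet: reorder, then compact.
    DocMap combined = docID -> compactionDocMap.get(reorderDocMap.get(docID));

    System.out.println(combined.get(3)); // (3 + 1) * 2 -> prints 8
  }
}
```

Boxing would only appear if callers went through a generic functional type such as `java.util.function.Function<Integer, Integer>`; keeping the dedicated interface avoids that while still allowing the lambda syntax the PR introduces.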
[PR] Removing TermInSetQuery array ctor [lucene]
slow-J opened a new pull request, #12837: URL: https://github.com/apache/lucene/pull/12837

Creating a PR for the items discussed in https://github.com/apache/lucene/issues/12243. Currently, calling [`KeywordField#newSetQuery`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/document/KeywordField.java#L174-L180), which takes an array, creates two `TermInSetQuery` instances (ctors used: [1](https://github.com/apache/lucene/blob/569533bd76a115e239d8c621d2540462bc12d891/lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java#L89), [2](https://github.com/apache/lucene/blob/569533bd76a115e239d8c621d2540462bc12d891/lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java#L102)) with arrays; the TiSQ then immediately converts the array to a list with `Arrays.asList(terms)`.

I initially wanted to add another `KeywordField#newSetQuery` overload which takes a `Collection` as a param, to remove the needless conversion of the input to an array (currently required to call `KeywordField#newSetQuery`), which then gets immediately changed back to a collection (when creating the TiSQ). The discussion in the issue mentioned that since `TermInSetQuery` has constructors which take both array and collection, we should consolidate them (unless I am misunderstanding). So in this PR, I am removing the TiSQ array ctor (since order does not matter in this case) and changing the `KeywordField#newSetQuery` param to a collection (previously an array).

Closes https://github.com/apache/lucene/issues/12243.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
Re: [I] Add a new static method for KeywordField#newSetQuery to support collections parameter [lucene]
slow-J commented on issue #12243: URL: https://github.com/apache/lucene/issues/12243#issuecomment-1824681333

> > I don't think the fact that TermInSetQuery has a Collection ctor should impact this at all. Maybe that should be removed?
>
> +1
>
> It's weird to take both array and Collection when the API is just going to do the obvious conversion that a caller would do? We should somehow be consistent here about such "sugar" APIs.
>
> But if we are going to pick one, shouldn't it be `Collection` for these two cases, since order does not matter? Taking an array implies that order might be significant, but (I think?) order does not matter for `KeywordField.newSetQuery` nor for the `TermInSetQuery` ctor?
>
> I don't feel strongly though (Collection vs array) ... but I do agree we should be consistent.

Thanks Mike! To progress the discussion, I opened a PR that removes the `TermInSetQuery` array ctor and changes the `KeywordField.newSetQuery` values param to a `Collection`.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
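The consolidation being discussed follows a common API pattern: keep a single Collection-based entry point (order is irrelevant for a set query), and if array convenience is ever wanted, layer it on as varargs sugar rather than maintaining two parallel constructors. The class below is a self-contained illustration of that pattern; it is not Lucene's `TermInSetQuery`:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Illustrative set-membership query: one "real" Collection ctor, with an
// optional varargs factory as sugar on top, so array and collection callers
// converge on the same code path.
public class SetQuery {
  private final Set<String> terms;

  public SetQuery(Collection<String> terms) { // the single canonical ctor
    this.terms = new HashSet<>(terms);        // order discarded: it's a set
  }

  public static SetQuery of(String... terms) { // sugar for array callers
    return new SetQuery(Arrays.asList(terms));
  }

  public boolean matches(String term) {
    return terms.contains(term);
  }

  public static void main(String[] args) {
    SetQuery q = SetQuery.of("lucene", "search");
    System.out.println(q.matches("lucene")); // prints true
    System.out.println(q.matches("solr"));   // prints false
  }
}
```

The PR goes one step further and drops the array form entirely, which is the stricter version of the same consistency argument.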
Re: [PR] Speedup concurrent multi-segment HNWS graph search [lucene]
mayya-sharipova commented on code in PR #12794: URL: https://github.com/apache/lucene/pull/12794#discussion_r1403584908

## lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:

```
@@ -26,26 +26,71 @@
  * @lucene.experimental
  */
 public final class TopKnnCollector extends AbstractKnnCollector {
+  private static final float DEFAULT_GREEDINESS = 0.9f;
   private final NeighborQueue queue;
+  private final float greediness;
+  private final NeighborQueue queueg;
+  private final MaxScoreAccumulator globalMinSimAcc;
+  private boolean kResultsCollected = false;
+  private float cachedGlobalMinSim = Float.NEGATIVE_INFINITY;
+
+  // greediness of globally non-competitive search: [0,1]

   /**
    * @param k the number of neighbors to collect
    * @param visitLimit how many vector nodes the results are allowed to visit
+   * @param globalMinSimAcc the global minimum competitive similarity tracked across all segments
    */
-  public TopKnnCollector(int k, int visitLimit) {
+  public TopKnnCollector(int k, int visitLimit, MaxScoreAccumulator globalMinSimAcc) {
+    super(k, visitLimit);
+    this.greediness = DEFAULT_GREEDINESS;
+    this.queue = new NeighborQueue(k, false);
+    int queuegSize = Math.max(1, Math.round((1 - greediness) * k));
+    this.queueg = new NeighborQueue(queuegSize, false);
+    this.globalMinSimAcc = globalMinSimAcc;
+  }
+
+  public TopKnnCollector(
+      int k, int visitLimit, MaxScoreAccumulator globalMinSimAcc, float greediness) {
     super(k, visitLimit);
+    this.greediness = greediness;
     this.queue = new NeighborQueue(k, false);
+    this.queueg = new NeighborQueue(Math.round((1 - greediness) * k), false);
+    this.globalMinSimAcc = globalMinSimAcc;
   }

   @Override
   public boolean collect(int docId, float similarity) {
-    return queue.insertWithOverflow(docId, similarity);
+    boolean result = queue.insertWithOverflow(docId, similarity);
+    queueg.insertWithOverflow(docId, similarity);
+
+    boolean reachedKResults = (kResultsCollected == false && queue.size() == k());
+    if (reachedKResults) {
+      kResultsCollected = true;
+    }
+    if (globalMinSimAcc != null && kResultsCollected) {
+      // as we've collected k results, we can start exchanging globally
+      globalMinSimAcc.accumulate(queue.topNode(), queue.topScore());
+
+      // periodically update the local copy of global similarity
+      if (reachedKResults || (visitedCount & globalMinSimAcc.modInterval) == 0) {
+        MaxScoreAccumulator.DocAndScore docAndScore = globalMinSimAcc.get();
+        cachedGlobalMinSim = docAndScore.score;
```

Review Comment: Thanks for your idea; it looks like it should provide better speedups than the current approach. We can check `globalMinSim` before starting the search and see if it makes a difference.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
Re: [I] Virtual threads and Lucene (support async tasks) [lucene]
uschindler commented on issue #12531: URL: https://github.com/apache/lucene/issues/12531#issuecomment-1824732748

> @uschindler, I understand the complexity of completely rewriting all Lucene internals. However, IMO it is necessary to do so in parallel. Relying entirely on MMAP is a bad idea; it is good for warmup queries.
>
> Ref: https://db.cs.cmu.edu/mmap-cidr2022/

Lucene has different access patterns that are not database-like; MMAP works perfectly here. Lucene uses WORM (write once, read many) when writing files. This paper is known to us, and our long-time testing figured out that it does not apply to most Lucene workloads.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
Re: [PR] Speedup concurrent multi-segment HNWS graph search [lucene]
mayya-sharipova commented on PR #12794: URL: https://github.com/apache/lucene/pull/12794#issuecomment-1824736252

@vigyasharma Answering other questions:

> We seem to consistently see an improvement in recall between single segment, and multi-segment runs (both seq and conc.) on baseline. Is this because with multiple segments, we get multiple entry points into the overall graph? Whereas in a single merged segment, we only have access to a sparser set of nodes in layer-1 while finding the single best entry point?

Indeed, this is the correct observation. For multiple segments, we retrieve k results from each segment, and then merge k * num_of_segments results to get the best `k` results. As we are retrieving more results, we also get better recall than from a single segment.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
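The recall effect described here can be sketched with plain scores: each segment contributes its own top-k, and the merged pool of k * num_of_segments candidates is re-sorted to select the global top-k. Everything below (class name, scores, segment contents) is illustrative, not Lucene code:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of multi-segment kNN merging: each segment returns k candidates,
// and the global top-k is chosen from the combined pool. Because the pool
// is larger than k, weaker per-segment winners get displaced, which is the
// source of the recall improvement over a single merged segment.
public class MergeTopK {
  static List<Float> mergeTopK(List<List<Float>> perSegmentTopK, int k) {
    List<Float> pool = new ArrayList<>();
    for (List<Float> seg : perSegmentTopK) {
      pool.addAll(seg); // k candidates contributed by each segment
    }
    pool.sort(Comparator.reverseOrder()); // higher similarity first
    return pool.subList(0, Math.min(k, pool.size()));
  }

  public static void main(String[] args) {
    // Two segments, k = 2: four candidates compete for the final two slots
    List<List<Float>> segments =
        List.of(List.of(0.9f, 0.5f), List.of(0.8f, 0.7f));
    System.out.println(mergeTopK(segments, 2)); // prints [0.9, 0.8]
  }
}
```

The PR's global min-sim exchange prunes exactly the candidates this merge would discard anyway, which is why it can speed up search with little recall loss.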
[PR] Simplify advancing on postings/impacts enums [lucene]
jpountz opened a new pull request, #12838: URL: https://github.com/apache/lucene/pull/12838

This is a new iteration on #12810, which I had to revert because of test failures when docFreq is a multiple of 128. We currently have a hack when the doc freq is not a multiple of 128 in order to not write skip data for the last block. I suspect it made sense in the past when postings were only about advancing postings, but storing this last skip data point also helps simplify advancing because postings enums can then always assume that the last doc ID of a full block is equal to the next skip doc. This also helps dynamic pruning by also storing impacts for this last block. See the second commit of this PR for more details.
Re: [PR] Move MergeState.DocMap to a FunctionalInterface [lucene]
s1monw merged PR #12836: URL: https://github.com/apache/lucene/pull/12836
[I] Grow arrays up to a given limit to avoid overallocation where possible [lucene]
stefanvodita opened a new issue, #12839: URL: https://github.com/apache/lucene/issues/12839

### Description

`ArrayUtil` provides methods to grow arrays, overallocating exponentially, with the possibility of requesting a minimum size. Sometimes we have an upper limit on the number of elements that will go into the array. In those situations, it would be nice to have a method that grows up to a limit.

For example, `DirectoryTaxonomyReader.getBulkOrdinals` dynamically grows an array pointing to ordinals missing from the cache ([code](https://github.com/apache/lucene/blob/981339be0413730598631c1f578ba8a6493061b9/lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java#L354)). But we know ahead of time there can't be more missing ordinals than there are ordinals in the taxonomy. We can limit the array growth to avoid overallocating. This pattern might be applicable in multiple places. `TermAutomatonScorer` seems to use a [similar pattern](https://github.com/apache/lucene/blob/981339be0413730598631c1f578ba8a6493061b9/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/TermAutomatonScorer.java#L294).

The API could look like this:

```
public static int[] growInRange(int[] array, int minLength, int maxLength)
```
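A possible shape for the proposed helper, sketched with a plain doubling heuristic; Lucene's real `ArrayUtil.oversize` is more sophisticated (it accounts for pointer size and object alignment), so this is only an illustration of the issue's idea, not the final API:

```java
import java.util.Arrays;

// Sketch of growInRange: grow exponentially as usual, but clamp the new
// capacity at maxLength so we never overallocate past a known upper bound.
public class GrowInRange {
  public static int[] growInRange(int[] array, int minLength, int maxLength) {
    if (minLength > maxLength) {
      throw new IllegalArgumentException("minLength must be <= maxLength");
    }
    if (array.length >= minLength) {
      return array; // already big enough, nothing to do
    }
    // simplified oversize heuristic: at least minLength, at most doubling
    int newLength = Math.max(minLength, array.length * 2);
    return Arrays.copyOf(array, Math.min(newLength, maxLength));
  }

  public static void main(String[] args) {
    int[] a = new int[10];
    // Doubling alone would give 20 slots, but the limit clamps it to 16.
    a = growInRange(a, 11, 16);
    System.out.println(a.length); // prints 16
  }
}
```

In the `getBulkOrdinals` example from the issue, `maxLength` would be the taxonomy size, since there can never be more missing ordinals than ordinals in the taxonomy.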
Re: [PR] Allow FST builder to use different writer (#12543) [lucene]
dungba88 commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1402058230

## lucene/core/src/java/org/apache/lucene/util/fst/FST.java:

```
@@ -435,6 +433,13 @@
 public FST(FSTMetadata metadata, DataInput in, Outputs outputs, FSTStore fstReader) {
     this.fstReader = fstReader;
   }

+  /**
+   * @return true if and only if this FST is readable (e.g. has a reverse BytesReader)
+   */
+  public boolean isReadable() {
```

Review Comment: Yeah, I was thinking that as we will remove it anyway, maybe we don't need this, along with the null check (as they seem to be throw-away code). If users try to use a non-readable FST now, an NPE will be thrown instead (of an IllegalStateException). And then, when we change FSTCompiler to return FSTMetadata, they can't even create a non-readable FST. What do you think?

Alternatively, we can double back to `NullFSTReader`, but let it throw exceptions like this:

```
private static final class NullFSTReader implements FSTReader {

  @Override
  public long ramBytesUsed() {
    return 0;
  }

  @Override
  public FST.BytesReader getReverseBytesReader() {
    throw new UnsupportedOperationException(
        "NullFSTReader does not support getReverseBytesReader()");
  }

  @Override
  public void writeTo(DataOutput out) {
    throw new UnsupportedOperationException(
        "NullFSTReader does not support writeTo(DataOutput)");
  }
}
```

The benefits are:
- We can enforce `FSTReader` to be non-null and remove all null checks, while still throwing a meaningful exception when users try to use a non-readable FST.
- In the next PR, `NullFSTReader` can be used for the temporary FST created during building, so it won't be throw-away code.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
Re: [PR] Random access term dictionary [lucene]
Tony-X commented on PR #12688: URL: https://github.com/apache/lucene/pull/12688#issuecomment-1825228925

After some tweaking and tinkering I was finally able to get the bench running the way I wanted in luceneutil! @mikemccand Unfortunately, luceneutil out of the box doesn't work for my case...

```
Task                        QPS baseline  StdDev  QPS my_modified_version  StdDev   Pct diff      p-value
Wildcard                      56.20 (2.0%)            7.30 (0.2%)   -87.0% ( -87% - -86%)  0.000
Respell                       44.00 (1.7%)           14.46 (0.6%)   -67.1% ( -68% - -65%)  0.000
Fuzzy1                        54.11 (1.2%)           20.86 (0.8%)   -61.4% ( -62% - -60%)  0.000
Prefix3                      103.61 (0.8%)           41.37 (0.5%)   -60.1% ( -60% - -59%)  0.000
Fuzzy2                        42.43 (1.2%)           20.26 (0.9%)   -52.3% ( -53% - -50%)  0.000
HighTermTitleSort            127.60 (1.7%)          114.65 (1.6%)   -10.1% ( -13% -  -6%)  0.000
HighTermMonthSort           2549.22 (2.8%)         2435.74 (3.8%)    -4.5% ( -10% -   2%)  0.037
AndHighLow                   708.99 (2.3%)          678.15 (2.1%)    -4.4% (  -8% -   0%)  0.001
LowTerm                      369.16 (5.5%)          358.15 (3.1%)    -3.0% ( -10% -   5%)  0.287
OrNotHighLow                 258.51 (1.8%)          252.11 (2.2%)    -2.5% (  -6% -   1%)  0.054
OrHighLow                    348.58 (1.1%)          340.11 (2.5%)    -2.4% (  -5% -   1%)  0.046
OrNotHighHigh                141.31 (6.7%)          138.54 (3.2%)    -2.0% ( -11% -   8%)  0.554
IntNRQ                        17.93 (1.8%)           17.62 (3.7%)    -1.7% (  -7% -   3%)  0.349
MedSloppyPhrase               26.73 (0.8%)           26.28 (1.3%)    -1.7% (  -3% -   0%)  0.017
HighIntervalsOrdered           3.88 (2.6%)            3.82 (2.3%)    -1.6% (  -6% -   3%)  0.314
HighTerm                     290.77 (7.3%)          286.20 (6.9%)    -1.6% ( -14% -  13%)  0.727
OrHighNotHigh                296.51 (6.9%)          291.90 (4.9%)    -1.6% ( -12% -  11%)  0.682
LowIntervalsOrdered           15.48 (1.1%)           15.26 (1.3%)    -1.4% (  -3% -   0%)  0.057
MedTerm                      405.54 (6.3%)          399.77 (5.9%)    -1.4% ( -12% -  11%)  0.713
LowSloppyPhrase               44.64 (0.5%)           44.04 (2.1%)    -1.3% (  -3% -   1%)  0.171
HighTermDayOfYearSort        177.36 (1.6%)          175.08 (1.8%)    -1.3% (  -4% -   2%)  0.234
OrHighMed                     79.44 (3.7%)           78.48 (3.0%)    -1.2% (  -7% -   5%)  0.572
OrNotHighMed                 268.82 (4.3%)          265.74 (3.1%)    -1.1% (  -8% -   6%)  0.632
BrowseMonthTaxoFacets          3.89 (0.4%)            3.85 (1.2%)    -0.9% (  -2% -   0%)  0.134
AndHighMedDayTaxoFacets       44.33 (0.7%)           43.96 (1.1%)    -0.8% (  -2% -   0%)  0.145
MedSpanNear                   30.67 (1.1%)           30.42 (2.3%)    -0.8% (  -4% -   2%)  0.481
LowSpanNear                    4.56 (0.8%)            4.53 (2.1%)    -0.6% (  -3% -   2%)  0.576
HighSpanNear                   8.52 (1.4%)            8.47 (2.1%)    -0.5% (  -3% -   2%)  0.636
OrHighNotLow                 236.32 (6.6%)          235.28 (4.4%)    -0.4% ( -10% -  11%)  0.901
AndHighHighDayTaxoFacets       4.31 (0.6%)            4.29 (0.8%)    -0.4% (  -1% -   1%)  0.446
LowPhrase                     69.90 (1.2%)           69.67 (2.7%)    -0.3% (  -4% -   3%)  0.807
AndHighMed                    41.87 (0.6%)           41.75 (2.7%)    -0.3% (  -3% -   3%)  0.828
OrHighHigh                    45.86 (7.4%)           45.77 (7.9%)    -0.2% ( -14% -  16%)  0.969
TermDTSort                   101.90 (2.3%)          101.84 (2.0%)    -0.1% (  -4% -   4%)  0.966
HighSloppyPhrase               0.48 (2.2%)            0.48 (2.6%)    -0.0% (  -4% -   4%)  0.995
BrowseDateSSDVFacets           1.00 (3.7%)            1.00 (5.3%)    -0.0% (  -8% -   9%)  1.000
HighTermTitleBDVSort           4.48 (5.4%)            4.49 (4.4%)     0.0% (  -9% -  10%)  0.990
BrowseDayOfYearSSDVFacets      3.51 (13.4%)           3.52 (13.3%)    0.1% ( -23% -  30%)  0.990
BrowseMonthSSDVFacets          3.38 (0.7%)            3.39 (0.6%)     0.1% (  -1% -   1%)  0.740
Browse
```
Re: [PR] hunspell: allow in-memory entry sorting for faster dictionary loading [lucene]
donnerpeter merged PR #12834: URL: https://github.com/apache/lucene/pull/12834
Re: [PR] Hide the internal data structure of HeapPointWriter [lucene]
iverase commented on PR #12762: URL: https://github.com/apache/lucene/pull/12762#issuecomment-1825276830

I didn't notice any performance change. Added javadocs.