[jira] [Commented] (LUCENE-10030) [DrillSidewaysScorer] redundant score() calculations in doQueryFirstScoring
[ https://issues.apache.org/jira/browse/LUCENE-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17387972#comment-17387972 ] Grigoriy Troitskiy commented on LUCENE-10030: - Yes, It would be great to backport into 8.x. Didn't find any hints in the wiki, could you pls describe the steps? Btw, should we have added changes to [https://github.com/apache/lucene/blob/main/lucene/CHANGES.txt] ? > [DrillSidewaysScorer] redundant score() calculations in doQueryFirstScoring > --- > > Key: LUCENE-10030 > URL: https://issues.apache.org/jira/browse/LUCENE-10030 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Grigoriy Troitskiy >Priority: Major > Time Spent: 2h 20m > Remaining Estimate: 0h > > *Diff* > {code:java} > @@ -195,11 +195,8 @@ class DrillSidewaysScorer extends BulkScorer { > > collectDocID = docID; > > - // TODO: we could score on demand instead since we are > - // daat here: > - collectScore = baseScorer.score(); > - > if (failedCollector == null) { > + collectScore = baseScorer.score(); > // Hit passed all filters, so it's "real": > collectHit(collector, dims); > } else { > {code} > > *Motivation* > 1. Performance degradation: we have quite heavy custom implementation of > score(). So when we started using DrillSideways, this call became top-1 in a > profiler snapshot (top-3 with default scoring). We tried doUnionScoring and > doDrillDownAdvanceScoring, but no luck: > doUnionScoring scores all baseQuery docIds > doDrillDownAdvanceScoring avoids some redundant docIds scorings, considering > symmetric difference of top two iterator's docIds, but still scores some > docIds, that will be filtered out by 3rd, 4th, ... dimension iterators > doQueryFirstScoring scores near-miss docIds > Best way is to score only true hits (where baseQuery and all N drill-down > iterators match). So we suggest a small modification of doQueryFirstScoring. > > 2. 
Speaking of doQueryFirstScoring, it doesn't look like we need to > calculate a score for a near-miss hit, because it won't be used anywhere. > FacetsCollectorManager creates FacetsCollector with default constructor > > [https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/FacetsCollectorManager.java#L35] > so FacetsCollector has false for keepScores > > [https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/FacetsCollector.java#L119] > and collectScore is not being used > > [https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysScorer.java#L200] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
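The one-line move in the diff above has a simple shape: defer the expensive score() call until a doc has passed every drill-down dimension, so near-misses are never scored. A stand-alone sketch of that idea — the `Scorer` interface and `LazyScoring` class here are hypothetical stand-ins, not Lucene's DrillSidewaysScorer API:

```java
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical stand-in for the expensive baseScorer.score() call.
interface Scorer {
  float score(int docID);
}

class LazyScoring {
  static int scoreCalls = 0;

  // Score a doc only after it has passed all drill-down filters, mirroring
  // the patch that moves score() inside the failedCollector == null branch.
  static float collect(int docID, Scorer scorer, List<IntPredicate> dims) {
    for (IntPredicate dim : dims) {
      if (!dim.test(docID)) {
        return Float.NaN; // near-miss: no score needed
      }
    }
    scoreCalls++;
    return scorer.score(docID); // true hit: score exactly once
  }
}
```

With a heavy custom score() implementation, the number of score() calls drops from "every candidate that reaches this point" to "true hits only", which is what produced the profiler win described above.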
[GitHub] [lucene] jpountz commented on a change in pull request #226: LUCENE-10036: Add factory method to ScoreCachingWrappingScorer that ensures unnecessary wrapping doesn't occur
jpountz commented on a change in pull request #226: URL: https://github.com/apache/lucene/pull/226#discussion_r677348827 ## File path: lucene/core/src/test/org/apache/lucene/search/TestScoreCachingWrappingScorer.java ## @@ -156,4 +156,28 @@ public void testGetScores() throws Exception { ir.close(); directory.close(); } + + public void testNoUnnecessaryWrap() throws Exception { +Scorable base = +new Scorable() { + @Override + public float score() throws IOException { +return -1; + } + + @Override + public int docID() { +return -1; + } +}; + +// Wrapping the first time should produce a different instance: +Scorable wrapped = ScoreCachingWrappingScorer.wrap(base); +assertNotEquals(base, wrapped); + +// But if we try to wrap an instance of ScoreCachingWrappingScorer, it shouldn't unnecessarily +// wrap again: +Scorable doubleWrapped = ScoreCachingWrappingScorer.wrap(wrapped); +assertEquals(wrapped, doubleWrapped); Review comment: Use `assertSame`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
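The factory method under review can be sketched in isolation: `wrap()` returns its argument unchanged when it is already a caching wrapper, so callers never double-wrap. The classes below are minimal stand-ins, not the real `Scorable`/`ScoreCachingWrappingScorer`:

```java
import java.io.IOException;

// Minimal stand-in for Lucene's Scorable.
abstract class Scorable {
  public abstract float score() throws IOException;
}

// Stand-in sketching the ScoreCachingWrappingScorer.wrap() factory.
final class CachingScorable extends Scorable {
  private final Scorable in;
  private float cached;
  private boolean valid;

  private CachingScorable(Scorable in) {
    this.in = in;
  }

  // The factory: never wrap an already-caching instance a second time.
  static Scorable wrap(Scorable in) {
    if (in instanceof CachingScorable) {
      return in;
    }
    return new CachingScorable(in);
  }

  @Override
  public float score() throws IOException {
    if (!valid) {
      cached = in.score(); // delegate once, then serve the cached value
      valid = true;
    }
    return cached;
  }
}
```

This is also why jpountz's `assertSame` suggestion is the right assertion: the contract is reference identity (`wrap(wrapped) == wrapped`), not mere equality.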
[GitHub] [lucene] mikemccand commented on pull request #220: LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version
mikemccand commented on pull request #220: URL: https://github.com/apache/lucene/pull/220#issuecomment-887472869 Also, to be clear, even though the opening comment says the PR implemented option 1, it has now iterated onto option 2 (switching based on the index created version metadata).
[GitHub] [lucene] mikemccand commented on a change in pull request #220: LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version
mikemccand commented on a change in pull request #220: URL: https://github.com/apache/lucene/pull/220#discussion_r677406777 ## File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestBackwardsCompatibility.java ## @@ -49,13 +49,8 @@ // // Then move the zip file to your trunk checkout and use it in your test cases - public static final String oldTaxonomyIndexName = "taxonomy.8.6.3-cfs"; + public static final String oldTaxonomyIndexName = "taxonomy.8.10.0-cfs"; - // LUCENE-9334 requires consistency of field data structures between documents. - // Old taxonomy index had $full_path$ field indexed only with postings, - // It is not allowed to add the same field $full_path$ indexed with BinaryDocValues - // for a new segment, that this test is trying to do. - @AwaitsFix(bugUrl = "https://issues.apache.org/jira/browse/LUCENE-9334";) Review comment: Woot! ## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java ## @@ -475,8 +477,15 @@ private int addCategoryDocument(FacetLabel categoryPath, int parent) throws IOEx String fieldPath = FacetsConfig.pathToString(categoryPath.components, categoryPath.length); fullPathField.setStringValue(fieldPath); + +if (useOlderStoredFieldIndex) { + fullPathField = new StringField(Consts.FULL, fieldPath, Field.Store.YES); Review comment: Hmm this is odd -- up above this `if` we are already calling `setStringValue` on the reused `fullPathField`, but then inside this `if` we override `fullPathField` with a new `StringField` instance? Maybe just move the above line inside the `if`? Also, why don't we also re-use the `BinaryDocValues` field? 
## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java ## @@ -164,9 +159,16 @@ public DirectoryTaxonomyWriter(Directory directory, OpenMode openMode, TaxonomyW openMode = config.getOpenMode(); if (!DirectoryReader.indexExists(directory)) { Review comment: Maybe take the opportunity to repair this horrible `!` by switching it to `== false`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
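The review comments above circle the same mechanism: DirectoryTaxonomyWriter picks its field layout for `$full_path$` from the created-version metadata of the existing index. A tiny stand-alone sketch of that switch — the class name, method, and version cutover constant are illustrative, not Lucene's actual code:

```java
// Illustrative stand-in for the version switch discussed in the PR
// (LUCENE-9450): older indexes keep the postings/stored StringField
// layout; newer ones use BinaryDocValues. The cutover version is a
// placeholder, not a real Lucene constant.
class TaxonomyFieldChooser {
  static final int FIRST_BDV_VERSION = 9; // hypothetical cutover major version

  static String fullPathFieldType(int indexCreatedVersionMajor) {
    if (indexCreatedVersionMajor < FIRST_BDV_VERSION) {
      // Stay consistent with existing segments (see LUCENE-9334).
      return "StringField";
    }
    return "BinaryDocValuesField";
  }
}
```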
[jira] [Commented] (LUCENE-10034) Vectors NeighborQueue MIN/MAX heap reversed?
[ https://issues.apache.org/jira/browse/LUCENE-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388054#comment-17388054 ] Mayya Sharipova commented on LUCENE-10034: -- Tomoko and Michael, thank you for the discussion. From my side, this part of the code is very clear now. About naming "similarity" vs. "distance": IMHO, the current "Similarity" naming for a vector metric function suits Lucene better, as top docs are arranged in decreasing order of scores, the same way a "similarity" metric arranges them. > Vectors NeighborQueue MIN/MAX heap reversed? > > > Key: LUCENE-10034 > URL: https://issues.apache.org/jira/browse/LUCENE-10034 > Project: Lucene - Core > Issue Type: Bug >Reporter: Mayya Sharipova >Priority: Trivial > > NeighborQueue is defined as follows: > {code:java} > NeighborQueue(int initialSize, boolean reversed) { > if (reversed) { > heap = LongHeap.create(LongHeap.Order.MAX, initialSize); > } else { > heap = LongHeap.create(LongHeap.Order.MIN, initialSize); > } > } > {code} > should it be reversed? should it instead be using a MIN heap for reversed > functions such as EUCLIDEAN distance, as we are interested in neighbors with > min euclidean distances? > I apologize if I missed some broader context where this definition makes > sense. >
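The MIN/MAX question resolves once you look at how a bounded top-k works: to keep the k *most similar* neighbors, you use a MIN heap, so the least-similar kept element sits on top and is evicted first; for a distance metric (lower is better) the orders flip, which is what the `reversed` flag selects. A generic sketch of the similarity case — not NeighborQueue itself:

```java
import java.util.PriorityQueue;

// Bounded top-k by similarity (higher = better) using a MIN heap:
// the weakest kept candidate is always at the top, ready to be evicted.
class TopK {
  private final int k;
  private final PriorityQueue<Float> heap = new PriorityQueue<>(); // natural order = MIN heap

  TopK(int k) {
    this.k = k;
  }

  void insert(float similarity) {
    if (heap.size() < k) {
      heap.add(similarity);
    } else if (similarity > heap.peek()) {
      heap.poll(); // evict the least-similar kept neighbor
      heap.add(similarity);
    }
  }

  float worstKept() {
    return heap.peek();
  }
}
```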
[jira] [Commented] (LUCENE-10030) [DrillSidewaysScorer] redundant score() calculations in doQueryFirstScoring
[ https://issues.apache.org/jira/browse/LUCENE-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388092#comment-17388092 ] Michael McCandless commented on LUCENE-10030: - bq. Btw, should we have added changes to https://github.com/apache/lucene/blob/main/lucene/CHANGES.txt ? Looks like [~gsmiller] just added the CHANGES entry. What an awesome performance bug fix! Thanks [~gtroitskiy]. +1 to backport to 8.x, but that is a bit tricky because it is a different git repository: https://github.com/apache/lucene-solr You could use diff/patch to carry the change over, or, I think it is possible to add two remotes to your local git repository and then `git cherry-pick` across them, to pull this change from 9.0 back to 8.x. > [DrillSidewaysScorer] redundant score() calculations in doQueryFirstScoring > --- > > Key: LUCENE-10030 > URL: https://issues.apache.org/jira/browse/LUCENE-10030 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Grigoriy Troitskiy >Priority: Major > Time Spent: 2h 20m > Remaining Estimate: 0h > > *Diff* > {code:java} > @@ -195,11 +195,8 @@ class DrillSidewaysScorer extends BulkScorer { > > collectDocID = docID; > > - // TODO: we could score on demand instead since we are > - // daat here: > - collectScore = baseScorer.score(); > - > if (failedCollector == null) { > + collectScore = baseScorer.score(); > // Hit passed all filters, so it's "real": > collectHit(collector, dims); > } else { > {code} > > *Motivation* > 1. Performance degradation: we have quite heavy custom implementation of > score(). So when we started using DrillSideways, this call became top-1 in a > profiler snapshot (top-3 with default scoring). 
We tried doUnionScoring and > doDrillDownAdvanceScoring, but no luck: > doUnionScoring scores all baseQuery docIds > doDrillDownAdvanceScoring avoids some redundant docIds scorings, considering > symmetric difference of top two iterator's docIds, but still scores some > docIds, that will be filtered out by 3rd, 4th, ... dimension iterators > doQueryFirstScoring scores near-miss docIds > Best way is to score only true hits (where baseQuery and all N drill-down > iterators match). So we suggest a small modification of doQueryFirstScoring. > > 2. Speaking of doQueryFirstScoring, it doesn't look like we need to > calculate a score for near-miss hit, because it won't be used anywhere. > FacetsCollectorManager creates FacetsCollector with default constructor > > [https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/FacetsCollectorManager.java#L35] > so FacetCollector has false for keepScores > > [https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/FacetsCollector.java#L119] > and collectScore is not being used > > [https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysScorer.java#L200] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
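The two-remotes-plus-`git cherry-pick` workflow Mike describes can be demonstrated end to end with throwaway repositories. Everything under `mktemp` below is scratch setup for the demo; in real life the two remotes would be apache/lucene (main) and apache/lucene-solr (branch_8x), and the formatting differences between the repos may force you to resolve conflicts by hand:

```shell
set -e
tmp=$(mktemp -d)

# "main" repo holding the fix to backport
git init -q "$tmp/lucene"
cd "$tmp/lucene"
git -c user.name=dev -c user.email=dev@example.com commit -q --allow-empty -m "base"
echo "avoid redundant score()" > fix.txt
git add fix.txt
git -c user.name=dev -c user.email=dev@example.com commit -q -m "LUCENE-10030: avoid redundant score() calls"
sha=$(git rev-parse HEAD)

# "8.x" repo with unrelated history: add the other repo as a remote,
# fetch its objects, and cherry-pick the commit across.
git init -q "$tmp/lucene-solr"
cd "$tmp/lucene-solr"
git -c user.name=dev -c user.email=dev@example.com commit -q --allow-empty -m "8.x base"
git remote add lucene "$tmp/lucene"
git fetch -q lucene
git -c user.name=dev -c user.email=dev@example.com cherry-pick "$sha"
git log --oneline -1
```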
[jira] [Comment Edited] (LUCENE-10034) Vectors NeighborQueue MIN/MAX heap reversed?
[ https://issues.apache.org/jira/browse/LUCENE-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388054#comment-17388054 ] Mayya Sharipova edited comment on LUCENE-10034 at 7/27/21, 2:35 PM: Tomoko and Michael, thank you for the discussion. From my side, this part of the code is very clear now. About naming "similarity" vs. "distance": while "distance" is indeed easier to understand, IMHO, the current "Similarity" naming for a vector metric function suits Lucene better, as top docs are arranged in decreasing order of scores, the same way a "similarity" metric arranges them. was (Author: mayya): Tomoko and Michael, thank you for the discussion. From my side, this part of the code is very clear now. About naming "similarity" vs. "distance": IMHO, the current "Similarity" naming for a vector metric function suits Lucene better, as top docs are arranged in decreasing order of scores, the same way a "similarity" metric arranges them. > Vectors NeighborQueue MIN/MAX heap reversed? > > > Key: LUCENE-10034 > URL: https://issues.apache.org/jira/browse/LUCENE-10034 > Project: Lucene - Core > Issue Type: Bug >Reporter: Mayya Sharipova >Priority: Trivial > > NeighborQueue is defined as follows: > {code:java} > NeighborQueue(int initialSize, boolean reversed) { > if (reversed) { > heap = LongHeap.create(LongHeap.Order.MAX, initialSize); > } else { > heap = LongHeap.create(LongHeap.Order.MIN, initialSize); > } > } > {code} > should it be reversed? should it instead be using a MIN heap for reversed > functions such as EUCLIDEAN distance, as we are interested in neighbors with > min euclidean distances? > I apologize if I missed some broader context where this definition makes > sense. >
[GitHub] [lucene] gsmiller commented on a change in pull request #226: LUCENE-10036: Add factory method to ScoreCachingWrappingScorer that ensures unnecessary wrapping doesn't occur
gsmiller commented on a change in pull request #226: URL: https://github.com/apache/lucene/pull/226#discussion_r677530466 ## File path: lucene/core/src/test/org/apache/lucene/search/TestScoreCachingWrappingScorer.java ## @@ -156,4 +156,28 @@ public void testGetScores() throws Exception { ir.close(); directory.close(); } + + public void testNoUnnecessaryWrap() throws Exception { +Scorable base = +new Scorable() { + @Override + public float score() throws IOException { +return -1; + } + + @Override + public int docID() { +return -1; + } +}; + +// Wrapping the first time should produce a different instance: +Scorable wrapped = ScoreCachingWrappingScorer.wrap(base); +assertNotEquals(base, wrapped); + +// But if we try to wrap an instance of ScoreCachingWrappingScorer, it shouldn't unnecessarily +// wrap again: +Scorable doubleWrapped = ScoreCachingWrappingScorer.wrap(wrapped); +assertEquals(wrapped, doubleWrapped); Review comment: Ooop, yep. Changed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller merged pull request #226: LUCENE-10036: Add factory method to ScoreCachingWrappingScorer that ensures unnecessary wrapping doesn't occur
gsmiller merged pull request #226: URL: https://github.com/apache/lucene/pull/226
[jira] [Commented] (LUCENE-10036) Ensure ScoreCachingWrappingScorer doesn't unnecessarily wrap another ScoreCachingWrappingScorer
[ https://issues.apache.org/jira/browse/LUCENE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388119#comment-17388119 ] ASF subversion and git services commented on LUCENE-10036: -- Commit e44636c28080055502bd30a42c8b1c12801d6b75 in lucene's branch refs/heads/main from Greg Miller [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e44636c ] LUCENE-10036: Add factory method to ScoreCachingWrappingScorer that ensures unnecessary wrapping doesn't occur (#226) > Ensure ScoreCachingWrappingScorer doesn't unnecessarily wrap another > ScoreCachingWrappingScorer > --- > > Key: LUCENE-10036 > URL: https://issues.apache.org/jira/browse/LUCENE-10036 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Affects Versions: main (9.0), 8.10 >Reporter: Greg Miller >Assignee: Greg Miller >Priority: Trivial > Time Spent: 40m > Remaining Estimate: 0h > > This is a trivial issue, but it's easy to mistakenly "double wrap" an > instance of {{ScoreCachingWrappingScorer}}, which is a bit wasteful. The > calling code currently needs to check the instance type of the {{Scorable}} > they intend to wrap to avoid this. {{FieldComparator}} is actually the only > calling code that does this check. > It would be nice to add a factory method that encapsulates this check in > {{ScoreCachingWrappingScorer}} so that calling code doesn't need to worry > about it. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] gsmiller merged pull request #2534: LUCENE-10036: Add factory method to ScoreCachingWrappingScorer that ensures unnecessary wrapping doesn't occur
gsmiller merged pull request #2534: URL: https://github.com/apache/lucene-solr/pull/2534
[jira] [Commented] (LUCENE-10036) Ensure ScoreCachingWrappingScorer doesn't unnecessarily wrap another ScoreCachingWrappingScorer
[ https://issues.apache.org/jira/browse/LUCENE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388120#comment-17388120 ] ASF subversion and git services commented on LUCENE-10036: -- Commit f444a69f703fdd4b9617251411b09a648f955732 in lucene-solr's branch refs/heads/branch_8x from Greg Miller [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f444a69 ] LUCENE-10036: Add factory method to ScoreCachingWrappingScorer that ensures unnecessary wrapping doesn't occur (#2534) > Ensure ScoreCachingWrappingScorer doesn't unnecessarily wrap another > ScoreCachingWrappingScorer > --- > > Key: LUCENE-10036 > URL: https://issues.apache.org/jira/browse/LUCENE-10036 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Affects Versions: main (9.0), 8.10 >Reporter: Greg Miller >Assignee: Greg Miller >Priority: Trivial > Time Spent: 1h > Remaining Estimate: 0h > > This is a trivial issue, but it's easy to mistakenly "double wrap" an > instance of {{ScoreCachingWrappingScorer}}, which is a bit wasteful. The > calling code currently needs to check the instance type of the {{Scorable}} > they intend to wrap to avoid this. {{FieldComparator}} is actually the only > calling code that does this check. > It would be nice to add a factory method that encapsulates this check in > {{ScoreCachingWrappingScorer}} so that calling code doesn't need to worry > about it. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mikemccand commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly
mikemccand commented on a change in pull request #225: URL: https://github.com/apache/lucene/pull/225#discussion_r677506600 ## File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java ## @@ -0,0 +1,225 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.util.automaton; + +import java.util.Arrays; +import java.util.HashMap; +import java.util.Map; +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.hppc.BitMixer; + +/** + * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA + * state along with the run + * + * implemented based on: https://swtch.com/~rsc/regexp/regexp1.html + */ +public class NFARunAutomaton { + + /** state ordinal of "no such state" */ + public static final int MISSING = -1; + + private static final int NOT_COMPUTED = -2; + + private final Automaton automaton; + private final int[] points; + private final Map dStateToOrd = new HashMap<>(); // could init lazily? + private DState[] dStates; + private final int alphabetSize; + + /** + * Constructor, assuming alphabet size is the whole codepoint space Review comment: Maybe say `whole Unicode code point space`? 
## File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java ## @@ -0,0 +1,225 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.util.automaton; + +import java.util.Arrays; +import java.util.HashMap; +import java.util.Map; +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.hppc.BitMixer; + +/** + * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA + * state along with the run + * + * implemented based on: https://swtch.com/~rsc/regexp/regexp1.html + */ +public class NFARunAutomaton { + + /** state ordinal of "no such state" */ + public static final int MISSING = -1; + + private static final int NOT_COMPUTED = -2; + + private final Automaton automaton; + private final int[] points; + private final Map dStateToOrd = new HashMap<>(); // could init lazily? 
+ private DState[] dStates; + private final int alphabetSize; + + /** + * Constructor, assuming alphabet size is the whole codepoint space + * + * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for + * better efficiency + */ + public NFARunAutomaton(Automaton automaton) { +this(automaton, Character.MAX_CODE_POINT); + } + + /** + * Constructor + * + * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} * + * for better efficiency + * @param alphabetSize alphabet size + */ + public NFARunAutomaton(Automaton automaton, int alphabetSize) { +this.automaton = automaton; +points = automaton.getStartPoints(); +this.alphabetSize = alphabetSize; +dStates = new DState[10]; +findDState(new DState(new int[] {0})); + } + + /** + * For a given state and an incoming character (codepoint), return the next state + * + * @param state incoming state, should either be 0 or some state that is returned previously by + * this function + * @param c codepoint + * @return the next state or {@link #MISSING} if the transition doesn't exist + */ + public int step(int state, int c) { +assert dStates[state] != null; +return step(d
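The approach NFARunAutomaton takes (after Russ Cox's regexp articles linked in the Javadoc) reduces to a small idea: run the subset construction lazily, memoizing each discovered set of NFA states as a DFA state the first time the input forces you through it. A hypothetical stand-alone sketch over `char` transitions — not the real NFARunAutomaton, which works over code-point ranges and packed state tables:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Lazy subset construction: DFA states (= sets of NFA states) are
// computed and memoized only as the input is consumed.
class LazyDfa {
  // NFA transitions: state -> (symbol -> next states)
  private final Map<Integer, Map<Character, Set<Integer>>> nfa;
  private final Set<Integer> accept;
  private final Map<Set<Integer>, Map<Character, Set<Integer>>> memo = new HashMap<>();

  LazyDfa(Map<Integer, Map<Character, Set<Integer>>> nfa, Set<Integer> accept) {
    this.nfa = nfa;
    this.accept = accept;
  }

  boolean run(String s) {
    Set<Integer> cur = new TreeSet<>(Set.of(0)); // start at NFA state 0
    for (char c : s.toCharArray()) {
      final Set<Integer> from = cur;
      Set<Integer> next =
          memo.computeIfAbsent(from, k -> new HashMap<>())
              .computeIfAbsent(c, ch -> step(from, ch)); // determinize on demand
      if (next.isEmpty()) {
        return false; // dead state
      }
      cur = next;
    }
    for (int st : cur) {
      if (accept.contains(st)) {
        return true;
      }
    }
    return false;
  }

  private Set<Integer> step(Set<Integer> states, char c) {
    Set<Integer> next = new TreeSet<>();
    for (int st : states) {
      next.addAll(nfa.getOrDefault(st, Map.of()).getOrDefault(c, Set.of()));
    }
    return next;
  }
}
```

Repeated runs reuse the memoized DFA states, so the cost of determinization is paid at most once per (state set, symbol) pair actually encountered.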
[GitHub] [lucene] jpountz opened a new pull request #227: LUCENE-10033: Encode numeric doc values and ordinals of SORTED(_SET) doc values in blocks.
jpountz opened a new pull request #227: URL: https://github.com/apache/lucene/pull/227 This moves doc values to an approach that is more similar to postings, where values are grouped in blocks of 128 values that are compressed together. Decoding a single value requires decoding the entire block that contains the value.
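The tradeoff in the PR description — decoding one value means decoding its whole 128-value block — can be sketched with a toy codec. Real Lucene uses bit-packing/FOR-style compression; here each block just stores a per-block minimum plus deltas, which is enough to show why sequential consumers (faceting) win and sparse random access (some sorts) loses:

```java
// Toy block codec illustrating the PR's layout: values are grouped in
// blocks of 128 and decoded a whole block at a time. Assumes per-block
// deltas fit in an int; real codecs bit-pack instead.
class BlockedLongs {
  static final int BLOCK_SIZE = 128;
  private final long[] bases; // per-block minimum
  private final int[] deltas; // per-value delta from the block minimum
  private long[] cachedBlock; // most recently decoded block
  private int cachedBlockIndex = -1;

  BlockedLongs(long[] values) {
    int numBlocks = (values.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
    bases = new long[numBlocks];
    deltas = new int[values.length];
    for (int b = 0; b < numBlocks; b++) {
      int start = b * BLOCK_SIZE, end = Math.min(values.length, start + BLOCK_SIZE);
      long min = Long.MAX_VALUE;
      for (int i = start; i < end; i++) {
        min = Math.min(min, values[i]);
      }
      bases[b] = min;
      for (int i = start; i < end; i++) {
        deltas[i] = (int) (values[i] - min);
      }
    }
  }

  long get(int index) {
    int block = index / BLOCK_SIZE;
    if (block != cachedBlockIndex) {
      // Decode the entire block even though only one value was asked for.
      int start = block * BLOCK_SIZE, end = Math.min(deltas.length, start + BLOCK_SIZE);
      cachedBlock = new long[end - start];
      for (int i = start; i < end; i++) {
        cachedBlock[i - start] = bases[block] + deltas[i];
      }
      cachedBlockIndex = block;
    }
    return cachedBlock[index % BLOCK_SIZE];
  }
}
```

A consumer that reads most values amortizes each block decode over up to 128 reads; a consumer that touches one value per block pays the full decode every time, matching the benchmark pattern in the next message.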
[jira] [Commented] (LUCENE-10033) Encode doc values in smaller blocks of values, like postings
[ https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388178#comment-17388178 ] Adrien Grand commented on LUCENE-10033: --- I opened a PR with this idea. Queries that consume most values, like the Browse* faceting tasks, become faster, but queries that only consume a small subset of values, like some of the sorting tasks (not all of them; one is faster), become slower.
{noformat}
Task  QPS baseline StdDev  QPS patch StdDev  Pct diff  p-value
HighTermMonthSort  101.33 (9.7%)  51.93 (2.8%)  -48.7% ( -55% - -40%)  0.000
TermDTSort  587.24 (6.1%)  404.20 (2.9%)  -31.2% ( -37% - -23%)  0.000
IntNRQ  85.55 (14.7%)  73.16 (1.6%)  -14.5% ( -26% - 2%)  0.000
OrHighNotMed  1301.37 (3.7%)  1218.64 (2.3%)  -6.4% ( -11% - 0%)  0.000
OrNotHighHigh  1121.91 (4.1%)  1089.27 (2.7%)  -2.9% ( -9% - 4%)  0.008
MedTerm  2156.71 (3.3%)  2103.32 (3.6%)  -2.5% ( -9% - 4%)  0.022
Fuzzy2  67.41 (4.6%)  65.74 (4.9%)  -2.5% ( -11% - 7%)  0.098
OrNotHighLow  1099.66 (3.7%)  1078.60 (3.0%)  -1.9% ( -8% - 4%)  0.073
MedIntervalsOrdered  79.39 (3.0%)  77.94 (3.7%)  -1.8% ( -8% - 5%)  0.088
MedPhrase  403.62 (2.8%)  397.19 (2.3%)  -1.6% ( -6% - 3%)  0.050
OrHighMed  130.57 (3.0%)  128.64 (2.6%)  -1.5% ( -6% - 4%)  0.099
LowIntervalsOrdered  20.82 (2.5%)  20.55 (3.4%)  -1.3% ( -6% - 4%)  0.167
HighIntervalsOrdered  2.95 (5.1%)  2.91 (5.8%)  -1.1% ( -11% - 10%)  0.530
OrHighLow  579.45 (2.9%)  574.45 (2.4%)  -0.9% ( -5% - 4%)  0.306
LowSpanNear  33.20 (2.9%)  33.06 (3.5%)  -0.4% ( -6% - 6%)  0.668
HighSpanNear  9.79 (3.5%)  9.79 (3.7%)  -0.0% ( -7% - 7%)  0.996
Respell  221.47 (2.1%)  221.62 (2.8%)  0.1% ( -4% - 4%)  0.931
HighSloppyPhrase  36.64 (3.4%)  36.69 (4.0%)  0.1% ( -7% - 7%)  0.915
Wildcard  283.85 (6.5%)  285.06 (7.2%)  0.4% ( -12% - 15%)  0.845
LowSloppyPhrase  175.77 (4.3%)  176.56 (4.4%)  0.5% ( -7% - 9%)  0.740
AndHighHigh  64.34 (2.5%)  64.84 (3.4%)  0.8% ( -5% - 6%)  0.410
HighTerm  2146.56 (3.3%)  2164.26 (4.5%)  0.8% ( -6% - 8%)  0.505
HighTermTitleBDVSort  27.18 (4.6%)  27.41 (2.1%)  0.8% ( -5% - 7%)  0.461
OrHighNotLow  1261.38 (2.3%)  1274.89 (3.0%)  1.1% ( -4% - 6%)  0.210
MedSpanNear  26.96 (4.1%)  27.28 (3.5%)  1.2% ( -6% - 9%)  0.336
MedSloppyPhrase  102.18 (4.7%)  103.51 (5.1%)  1.3% ( -8% - 11%)  0.399
BrowseDateTaxoFacets  3.15 (4.0%)  3.19 (4.0%)  1.4% ( -6% - 9%)  0.281
BrowseDayOfYearTaxoFacets  3.15 (4.0%)  3.20 (4.0%)  1.5% ( -6% - 9%)  0.250
AndHighLow  1295.59 (3.3%)  1318.11 (3.4%)  1.7% ( -4% - 8%)  0.105
Prefix3  63.21 (15.4%)  64.49 (17.1%)  2.0% ( -26% - 40%)  0.694
OrHighHigh  35.41 (3.1%)  36.24 (3.1%)  2.4% ( -3% - 8%)  0.015
Fuzzy1  253.74 (6.1%)  260.89 (7.1%)  2.8% ( -9% - 16%)  0.175
BrowseMonthTaxoFacets  3.42 (7.7%)  3.52 (4.1%)  2.9% ( -8% - 15%)  0.135
AndHighMed  164.48 (2.6%)  169.43 (3.3%)  3.0% ( -2% - 9%)  0.001
LowTerm  2645.26 (4.9%)  2752.43 (5.6%)  4.1% ( -6% - 15%)  0.015
OrHighNotHigh  1286.12 (3.7%)  1349.66 (4.6%)  4.9% ( -3% - 13%)  0.000
HighPhrase  105.61 (3.7%)  111.65 (4.8%)  5.7% ( -2% - 14%)  0.000
LowPhrase  35.85 (2.6%)  38.76 (3.3%)  8.1% ( 2% - 14%)  0.000
OrNotHighMed  1241.35 (3.1%)  1368.49 (3.6%)  10.2% ( 3% - 17%)  0.000
HighTermDayOfYearSort  573.92 (9.5%)  687.19 (7.9%)  19.7% ( 2% - 40%)  0.000
BrowseMonthSSDVFacets  11.52 (5.1%)  17.81 (23.5%)  54.6% ( 24% - 87%)  0.000
BrowseDayOfYearSSDVFacets  11.24 (3.9%)  18.15 (23.1%)  61.4% ( 33% - 91%)  0.000
{noformat}
>
[jira] [Commented] (LUCENE-10030) [DrillSidewaysScorer] redundant score() calculations in doQueryFirstScoring
[ https://issues.apache.org/jira/browse/LUCENE-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388181#comment-17388181 ] Greg Miller commented on LUCENE-10030: -- Yeah [~gtroitskiy], I caught the missing CHANGES entry right after I pushed and took care of it. As for backporting, as [~mikemccand] noted, you'll need to create a PR against the "old" repo (lucene-solr) in github with the same change. Using cherry-pick is a little "tricky" since there's a ton of formatting changes between the two repos, which we'd prefer to leave out of the diff. Given that the change is relatively small, you could probably copy/paste stuff across. Let me know if you need any help! > [DrillSidewaysScorer] redundant score() calculations in doQueryFirstScoring > --- > > Key: LUCENE-10030 > URL: https://issues.apache.org/jira/browse/LUCENE-10030 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Grigoriy Troitskiy >Priority: Major > Time Spent: 2h 20m > Remaining Estimate: 0h > > *Diff* > {code:java} > @@ -195,11 +195,8 @@ class DrillSidewaysScorer extends BulkScorer { > > collectDocID = docID; > > - // TODO: we could score on demand instead since we are > - // daat here: > - collectScore = baseScorer.score(); > - > if (failedCollector == null) { > + collectScore = baseScorer.score(); > // Hit passed all filters, so it's "real": > collectHit(collector, dims); > } else { > {code} > > *Motivation* > 1. Performance degradation: we have quite heavy custom implementation of > score(). So when we started using DrillSideways, this call became top-1 in a > profiler snapshot (top-3 with default scoring). 
We tried doUnionScoring and > doDrillDownAdvanceScoring, but no luck: > doUnionScoring scores all baseQuery docIds > doDrillDownAdvanceScoring avoids some redundant docIds scorings, considering > symmetric difference of top two iterator's docIds, but still scores some > docIds, that will be filtered out by 3rd, 4th, ... dimension iterators > doQueryFirstScoring scores near-miss docIds > Best way is to score only true hits (where baseQuery and all N drill-down > iterators match). So we suggest a small modification of doQueryFirstScoring. > > 2. Speaking of doQueryFirstScoring, it doesn't look like we need to > calculate a score for near-miss hit, because it won't be used anywhere. > FacetsCollectorManager creates FacetsCollector with default constructor > > [https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/FacetsCollectorManager.java#L35] > so FacetCollector has false for keepScores > > [https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/FacetsCollector.java#L119] > and collectScore is not being used > > [https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysScorer.java#L200] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
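The effect of the patch can be sketched outside of Lucene. In this minimal, self-contained model (class and method names here are illustrative stand-ins, not the real DrillSidewaysScorer API), score() is invoked only after a doc has passed every drill-down dimension, so near-miss docs are never scored:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class LazyScoringSketch {
    interface Scorer {
        float score(int docId);
    }

    // Collect only "true hits": docs matched by the base query and by every
    // drill-down dimension. score() runs lazily, only for those docs, which
    // mirrors moving collectScore inside the failedCollector == null branch.
    static List<Integer> collectTrueHits(int[] baseDocs, List<Set<Integer>> dims, Scorer scorer) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : baseDocs) {
            boolean passedAll = true;
            for (Set<Integer> dim : dims) {
                if (!dim.contains(doc)) {
                    passedAll = false;
                    break;
                }
            }
            if (passedAll) {
                scorer.score(doc); // only true hits are ever scored
                hits.add(doc);
            }
        }
        return hits;
    }
}
```

With a heavy custom score() implementation, the saving is proportional to the number of near-miss docs that previously reached the eager score() call.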
[GitHub] [lucene] zhaih commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly
zhaih commented on a change in pull request #225: URL: https://github.com/apache/lucene/pull/225#discussion_r677666989 ## File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java ## @@ -0,0 +1,225 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.util.automaton; + +import java.util.Arrays; +import java.util.HashMap; +import java.util.Map; +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.hppc.BitMixer; + +/** + * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA + * state along with the run + * + * implemented based on: https://swtch.com/~rsc/regexp/regexp1.html + */ +public class NFARunAutomaton { + + /** state ordinal of "no such state" */ + public static final int MISSING = -1; + + private static final int NOT_COMPUTED = -2; + + private final Automaton automaton; + private final int[] points; + private final Map dStateToOrd = new HashMap<>(); // could init lazily? 
+ private DState[] dStates; + private final int alphabetSize; + + /** + * Constructor, assuming alphabet size is the whole codepoint space + * + * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for + * better efficiency + */ + public NFARunAutomaton(Automaton automaton) { +this(automaton, Character.MAX_CODE_POINT); + } + + /** + * Constructor + * + * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} * + * for better efficiency + * @param alphabetSize alphabet size + */ + public NFARunAutomaton(Automaton automaton, int alphabetSize) { +this.automaton = automaton; +points = automaton.getStartPoints(); +this.alphabetSize = alphabetSize; +dStates = new DState[10]; +findDState(new DState(new int[] {0})); + } + + /** + * For a given state and an incoming character (codepoint), return the next state + * + * @param state incoming state, should either be 0 or some state that is returned previously by + * this function + * @param c codepoint + * @return the next state or {@link #MISSING} if the transition doesn't exist + */ + public int step(int state, int c) { +assert dStates[state] != null; +return step(dStates[state], c); + } + + /** + * Run through a given codepoint array, return accepted or not, should only be used in test + * + * @param s String represented by an int array + * @return accept or not + */ + boolean run(int[] s) { +int p = 0; +for (int c : s) { + p = step(p, c); + if (p == MISSING) return false; +} +return dStates[p].isAccept; + } + + /** + * From an existing DFA state, step to next DFA state given character c if the transition is + * previously tried then this operation will just use the cached result, otherwise it will call + * {@link #step(int[], int)} to get the next state and cache the result + */ + private int step(DState dState, int c) { +int charClass = getCharClass(c); +if (dState.nextState(charClass) == NOT_COMPUTED) { + // the next dfa state has not been 
computed yet + dState.setNextState(charClass, findDState(step(dState.nfaStates, c))); +} +return dState.nextState(charClass); + } + + /** + * given a list of NFA states and a character c, compute the output list of NFA state which is + * wrapped as a DFA state + */ + private DState step(int[] nfaStates, int c) { +Transition transition = new Transition(); +StateSet stateSet = new StateSet(5); // fork IntHashSet from hppc instead? +int numTransitions; +for (int nfaState : nfaStates) { + numTransitions = automaton.initTransition(nfaState, transition); + for (int i = 0; i < numTransitions; i++) { +automaton.getNextTransition(transition); +if (transition.min <= c && transition.max >= c) { + stateSet.incr(transition.dest); +} + } +} +if (stateSet.size() == 0) { + return null; +} +return new DState(stateSet.getArray()); + } + + /** + * return the ordinal of given DFA state, generate a new ordinal if the given DFA state
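The quoted class determinizes the NFA lazily, caching each discovered DFA state (a set of NFA states) under an ordinal. A compact, self-contained sketch of that lazy subset construction follows; the names (NfaSketch, its map-based transition representation) are illustrative stand-ins and not the actual NFARunAutomaton API, and the per-character transition caching of the real class is omitted for brevity:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class NfaSketch {
    static final int MISSING = -1;

    // transitions.get(nfaState).get(symbol) -> list of next NFA states
    private final Map<Integer, Map<Character, List<Integer>>> transitions;
    private final Set<Integer> acceptStates;
    // Memoized DFA: each sorted set of NFA states gets a stable ordinal.
    private final Map<List<Integer>, Integer> dStateToOrd = new HashMap<>();
    private final List<List<Integer>> dStates = new ArrayList<>();
    private final List<Boolean> dAccept = new ArrayList<>();

    NfaSketch(Map<Integer, Map<Character, List<Integer>>> transitions, Set<Integer> acceptStates) {
        this.transitions = transitions;
        this.acceptStates = acceptStates;
        findDState(Collections.singletonList(0)); // DFA start state = {0}
    }

    // Return the ordinal of the DFA state, creating it on first sight.
    private int findDState(List<Integer> nfaStates) {
        return dStateToOrd.computeIfAbsent(nfaStates, s -> {
            dStates.add(s);
            boolean accept = false;
            for (int st : s) accept |= acceptStates.contains(st);
            dAccept.add(accept);
            return dStates.size() - 1;
        });
    }

    // Step the DFA: union the NFA transitions of every member state.
    int step(int dState, char c) {
        TreeSet<Integer> next = new TreeSet<>();
        for (int nfaState : dStates.get(dState)) {
            next.addAll(transitions.getOrDefault(nfaState, Collections.emptyMap())
                    .getOrDefault(c, Collections.emptyList()));
        }
        if (next.isEmpty()) return MISSING;
        return findDState(new ArrayList<>(next)); // sorted -> canonical key
    }

    boolean run(String s) {
        int p = 0;
        for (char c : s.toCharArray()) {
            p = step(p, c);
            if (p == MISSING) return false;
        }
        return dAccept.get(p);
    }
}
```

Only the DFA states actually reached by some input are ever materialized, which is the point of determinizing "along with the run" rather than up front.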
[GitHub] [lucene] gautamworah96 commented on a change in pull request #220: LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version
gautamworah96 commented on a change in pull request #220: URL: https://github.com/apache/lucene/pull/220#discussion_r677669790 ## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java ## @@ -475,8 +477,15 @@ private int addCategoryDocument(FacetLabel categoryPath, int parent) throws IOEx String fieldPath = FacetsConfig.pathToString(categoryPath.components, categoryPath.length); fullPathField.setStringValue(fieldPath); + +if (useOlderStoredFieldIndex) { + fullPathField = new StringField(Consts.FULL, fieldPath, Field.Store.YES); Review comment: So when we first initialize the `fieldPath` variable it is [initialized](https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java#L185) with `Field.Store.NO`. Then when we realize that the field has to be stored as a StringField, we need its value as well, so I reassign it with a `new StringField(Consts.FULL, fieldPath, Field.Store.YES)` statement. > Maybe just move the above line inside the if? Sure > why don't we also re-use the BinaryDocValues field Hmmm. Could you elaborate a bit? The field depends on the input `FacetLabel categoryPath` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zhaih commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly
zhaih commented on a change in pull request #225: URL: https://github.com/apache/lucene/pull/225#discussion_r677669213 ## File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java ## @@ -0,0 +1,225 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.util.automaton; + +import java.util.Arrays; +import java.util.HashMap; +import java.util.Map; +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.hppc.BitMixer; + +/** + * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA + * state along with the run + * + * implemented based on: https://swtch.com/~rsc/regexp/regexp1.html + */ +public class NFARunAutomaton { + + /** state ordinal of "no such state" */ + public static final int MISSING = -1; + + private static final int NOT_COMPUTED = -2; + + private final Automaton automaton; + private final int[] points; + private final Map dStateToOrd = new HashMap<>(); // could init lazily? 
+ private DState[] dStates; + private final int alphabetSize; + + /** + * Constructor, assuming alphabet size is the whole codepoint space Review comment: ++ ## File path: lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java ## @@ -0,0 +1,83 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.util.automaton; + +import java.util.Arrays; +import org.apache.lucene.util.IntsRef; +import org.apache.lucene.util.LuceneTestCase; + +public class TestNFARunAutomaton extends LuceneTestCase { + + public void testRandom() { +for (int i = 0; i < 100; i++) { + RegExp regExp = null; + while (regExp == null) { +try { + regExp = new RegExp(AutomatonTestUtil.randomRegexp(random())); +} catch (IllegalArgumentException e) { + ignoreException(e); +} + } + Automaton dfa = regExp.toAutomaton(); Review comment: Yeah I think that's better, will change it. ## File path: lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java ## @@ -0,0 +1,83 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. 
+ * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.util.automaton; + +import java.util.Arrays; +import org.apache.lucene.util.IntsRef; +import org.apache.lucene.util.LuceneTestCase; + +public class TestNFARunAutomaton extends LuceneTestCase { + + public void testRandom() { +for (int i = 0; i < 100; i++) { + RegExp regExp = null; + while (regExp == null) { +try { + regExp = new RegExp(AutomatonTestUtil.randomRegexp(random())); +} catch (IllegalArgument
[jira] [Resolved] (LUCENE-10036) Ensure ScoreCachingWrappingScorer doesn't unnecessarily wrap another ScoreCachingWrappingScorer
[ https://issues.apache.org/jira/browse/LUCENE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller resolved LUCENE-10036. -- Fix Version/s: 8.10 main (9.0) Resolution: Fixed > Ensure ScoreCachingWrappingScorer doesn't unnecessarily wrap another > ScoreCachingWrappingScorer > --- > > Key: LUCENE-10036 > URL: https://issues.apache.org/jira/browse/LUCENE-10036 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Affects Versions: main (9.0), 8.10 >Reporter: Greg Miller >Assignee: Greg Miller >Priority: Trivial > Fix For: main (9.0), 8.10 > > Time Spent: 1h > Remaining Estimate: 0h > > This is a trivial issue, but it's easy to mistakenly "double wrap" an > instance of {{ScoreCachingWrappingScorer}}, which is a bit wasteful. The > calling code currently needs to check the instance type of the {{Scorable}} > they intend to wrap to avoid this. {{FieldComparator}} is actually the only > calling code that does this check. > It would be nice to add a factory method that encapsulates this check in > {{ScoreCachingWrappingScorer}} so that calling code doesn't need to worry > about it.
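The factory-method idea from this issue is a general pattern. A hedged sketch, with stand-in class names rather than Lucene's actual Scorable/ScoreCachingWrappingScorer API (the real class also caches per doc ID, which is simplified away here): the factory hides the instanceof check so callers cannot double-wrap.

```java
class CachingScorerSketch {
    interface Scorable {
        float score();
    }

    static final class ScoreCaching implements Scorable {
        private final Scorable in;
        private float cached;
        private boolean valid = false;

        private ScoreCaching(Scorable in) {
            this.in = in;
        }

        @Override
        public float score() {
            if (!valid) { // delegate at most once
                cached = in.score();
                valid = true;
            }
            return cached;
        }
    }

    // Encapsulates the check that each call site (e.g. FieldComparator)
    // previously had to perform itself.
    static Scorable wrap(Scorable in) {
        if (in instanceof ScoreCaching) {
            return in; // already caching; avoid a wasteful second wrapper
        }
        return new ScoreCaching(in);
    }
}
```

Because the constructor is private, wrap() is the only entry point, so the invariant "never two caching layers" holds everywhere by construction.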
[GitHub] [lucene-solr] gautamworah96 commented on pull request #2521: Backport LUCENE-9948 LUCENE-9964
gautamworah96 commented on pull request #2521: URL: https://github.com/apache/lucene-solr/pull/2521#issuecomment-887734156 @gsmiller Yes. I think the change is simple enough to be coded manually. Git did not allow direct cherry picking due to conflicts and that is why I suggested that I backport the other change as well. I'll revise it
[GitHub] [lucene] gautamworah96 commented on a change in pull request #220: LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version
gautamworah96 commented on a change in pull request #220: URL: https://github.com/apache/lucene/pull/220#discussion_r677794298 ## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java ## @@ -164,9 +159,16 @@ public DirectoryTaxonomyWriter(Directory directory, OpenMode openMode, TaxonomyW openMode = config.getOpenMode(); if (!DirectoryReader.indexExists(directory)) { Review comment: Sure
[jira] [Created] (LUCENE-10037) Explore a single scoring implementation in DrillSidewaysScorer
Greg Miller created LUCENE-10037: Summary: Explore a single scoring implementation in DrillSidewaysScorer Key: LUCENE-10037 URL: https://issues.apache.org/jira/browse/LUCENE-10037 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Affects Versions: main (9.0) Reporter: Greg Miller {{DrillSidewaysScorer}} currently implements three separate strategies for bulk scoring documents: {{doQueryFirstScoring}}, {{doUnionScoring}} and {{doDrillDownAdvanceScoring}}. As far as I can tell, this code dates back to 2013 and two of the three approaches appear to emulate the {{BooleanScorer}} "window scoring" / "term-at-a-time" strategy. While this strategy in {{BooleanScorer}} is still useful in some cases, the primary benefit, from what I can tell, is to avoid re-heap operations in disjunction cases (as recently [described|http://mail-archives.apache.org/mod_mbox/lucene-dev/202106.mbox/%3CCAPsWd%2BMbYckCR2LHxHy4-%3DoZPnvX%3D9Er8hwb%2BG76jHb85JePvw%40mail.gmail.com%3E] by [~jpountz]). I can't see any reason why we'd prefer these two approaches anymore in {{DrillSidewaysScorer}} since we're doing pure conjunctions (no re-heaping to worry about) and {{doQueryFirstScoring}} takes advantage of skipping by advancing postings (while the other two approaches iterate their postings entirely, only relying on nextDoc functionality). Finally, we added an optimization (LUCENE-10030) that can only work for {{doQueryFirstScoring}} that lazily evaluates the {{score}} (where-as {{doUnionScoring}} and {{doDrillDownAdvanceScoring}} eagerly evaluate it). All this is to say we should try sending all scoring through {{doQueryFirstScoring}} and see how it benchmarks. I'm not sure if we have benchmarks setup already for drill sideways, but I'd love to see if we can't optimize {{DrillSidewaysScorer}} while also reducing its code complexity! 
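The advantage Greg attributes to doQueryFirstScoring comes from advance()-style skipping over a pure conjunction. A rough, self-contained sketch of that leapfrog loop over sorted doc-ID lists (illustrative only; not the DrillSidewaysScorer code, and advance() here is a linear scan where real iterators use skip data):

```java
import java.util.ArrayList;
import java.util.List;

class ConjunctionSketch {
    // Smallest docId in docs that is >= target, or MAX_VALUE if exhausted.
    static int advance(int[] docs, int target) {
        for (int d : docs) {
            if (d >= target) return d;
        }
        return Integer.MAX_VALUE;
    }

    // Leapfrog: advance every iterator to the current candidate; whole
    // ranges of non-matching docs are skipped rather than iterated.
    static List<Integer> conjunction(int[][] iterators) {
        List<Integer> matches = new ArrayList<>();
        int doc = 0;
        while (doc != Integer.MAX_VALUE) {
            int candidate = advance(iterators[0], doc);
            if (candidate == Integer.MAX_VALUE) break;
            boolean agreed = true;
            for (int i = 1; i < iterators.length; i++) {
                int next = advance(iterators[i], candidate);
                if (next != candidate) {
                    candidate = next; // restart leapfrog from the new candidate
                    agreed = false;
                    break;
                }
            }
            if (agreed) {
                matches.add(candidate);
                doc = candidate + 1;
            } else {
                doc = candidate;
            }
        }
        return matches;
    }
}
```

The window-scoring strategies instead visit every doc of each clause via nextDoc(), which is why they cannot benefit from skipping in a conjunction like this.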
[GitHub] [lucene] mrkm4ntr opened a new pull request #228: Remove impossible assert condition
mrkm4ntr opened a new pull request #228: URL: https://github.com/apache/lucene/pull/228 # Description DWPTs that do not have the same deleteQueue as the flushingQueue are already filtered out. # Checklist Please review the following and check all that apply: - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability. - [ ] I have created a Jira issue and added the issue ID to my pull request title. I think this is too trivial to create a Jira issue for, but if one is required, I will create it. - [x] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended) - [x] I have developed this patch against the `main` branch. - [x] I have run `./gradlew check`. - [ ] I have added tests for my changes.
[jira] [Updated] (LUCENE-10037) Explore a single scoring implementation in DrillSidewaysScorer
[ https://issues.apache.org/jira/browse/LUCENE-10037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller updated LUCENE-10037: - Issue Type: Task (was: Improvement)
[GitHub] [lucene] mikemccand commented on a change in pull request #220: LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version
mikemccand commented on a change in pull request #220: URL: https://github.com/apache/lucene/pull/220#discussion_r677908217 ## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java ## @@ -475,8 +477,15 @@ private int addCategoryDocument(FacetLabel categoryPath, int parent) throws IOEx String fieldPath = FacetsConfig.pathToString(categoryPath.components, categoryPath.length); fullPathField.setStringValue(fieldPath); + +if (useOlderStoredFieldIndex) { + fullPathField = new StringField(Consts.FULL, fieldPath, Field.Store.YES); Review comment: OK it's just super confusing that we act like we will reuse it (by creating it in ctor) but then sometimes overwrite it and other times re-use it. Couldn't we init it with the right `Field.Store.YES|NO` option up front, since we know in ctor which way it'll go, and then just always re-use it? And let's scratch my suggestion to also reuse the `BDV` field -- let's just make a new one each time like you are doing.
[GitHub] [lucene] dnhatn commented on pull request #228: Remove impossible assert condition
dnhatn commented on pull request #228: URL: https://github.com/apache/lucene/pull/228#issuecomment-887965341 @mrkm4ntr I think we can remove the assertion entirely.
[GitHub] [lucene] mrkm4ntr commented on pull request #228: Remove unnecessary assertion
mrkm4ntr commented on pull request #228: URL: https://github.com/apache/lucene/pull/228#issuecomment-887970592 @dnhatn Thanks! I removed the assertion.
[jira] [Updated] (LUCENE-10035) Simple text codec add multi level skip list data
[ https://issues.apache.org/jira/browse/LUCENE-10035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuda updated LUCENE-10035: -- Issue Type: Wish (was: New Feature) > Simple text codec add multi level skip list data > -- > > Key: LUCENE-10035 > URL: https://issues.apache.org/jira/browse/LUCENE-10035 > Project: Lucene - Core > Issue Type: Wish > Components: core/codecs >Affects Versions: main (9.0) >Reporter: wuda >Priority: Major > Labels: Impact, MultiLevelSkipList, SimpleTextCodec > Time Spent: 10m > Remaining Estimate: 0h > > Simple text codec add skip list data( include impact) to help understand > index format,For debugging, curiosity, transparency only!! When term's > docFreq greater than or equal to SimpleTextSkipWriter.BLOCK_SIZE (default > value is 8), Simple text codec will write skip list, the *.pst (simple text > term dictionary file)* file will looks like this > {code:java} > field title > term args > doc 2 > freq 2 > pos 7 > pos 10 > ## we omit docs for better view .. > doc 98 > freq 2 > pos 2 > pos 6 > skipList > ? > level 1 > skipDoc 65 > skipDocFP 949 > impacts > impact > freq 1 > norm 2 > impact > freq 2 > norm 12 > impact > freq 3 > norm 13 > impacts_end > ? > level 0 > skipDoc 17 > skipDocFP 284 > impacts > impact > freq 1 > norm 2 > impact > freq 2 > norm 12 > impacts_end > skipDoc 34 > skipDocFP 624 > impacts > impact > freq 1 > norm 2 > impact > freq 2 > norm 12 > impact > freq 3 > norm 14 > impacts_end > skipDoc 65 > skipDocFP 949 > impacts > impact > freq 1 > norm 2 > impact > freq 2 > norm 12 > impact > freq 3 > norm 13 > impacts_end > skipDoc 90 > skipDocFP 1311 > impacts > impact > freq 1 > norm 2 > impact > freq 2 > norm 10 > impact > freq 3 > norm 13 > impact > freq 4 > norm 14 > impacts_end > END > checksum 000829315543 > {code} > compare with previous,we add *skipList,level, skipDoc, skipDocFP, impacts, > impact, freq, norm* nodes, at the same, simple text codec can support > advanceShallow when search time. 
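For intuition on the level structure in the example above, here is a sketch of the general multi-level skip-list scheme (not SimpleTextSkipWriter's actual code; the helper name is hypothetical): with BLOCK_SIZE = 8, every 8th doc gets a level-0 skip entry, every 64th also appears on level 1, and so on, each level skipping blockSize times farther than the one below.

```java
class SkipLevelSketch {
    // Highest skip level on which the entry written after docCount docs
    // appears, or -1 if no skip entry is written at this point.
    static int highestLevel(int docCount, int blockSize) {
        if (docCount == 0 || docCount % blockSize != 0) {
            return -1; // only every blockSize-th doc gets a skip entry
        }
        int level = -1;
        while (docCount % blockSize == 0) {
            level++;              // entry also appears one level higher
            docCount /= blockSize;
        }
        return level;
    }
}
```

This is why, in the sample output, level 1 holds only a subset of the skipDoc entries that level 0 holds.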
[GitHub] [lucene] glawson0 commented on pull request #157: LUCENE-9963 Fix issue with FlattenGraphFilter throwing exceptions from holes
glawson0 commented on pull request #157: URL: https://github.com/apache/lucene/pull/157#issuecomment-888027281 We have been using this change internally for a few weeks now. We no longer encounter the `ArrayIndexOutOfBounds` exceptions that we were previously experiencing. Depending on the dataset/analyzer combination we have seen up to a 1% increase in the average number of tokens per field. This comes from the tokens that had previously been dropped now being correctly indexed.