[jira] [Created] (LUCENE-10038) i have no issue
mahnoor jabbar created LUCENE-10038: --- Summary: i have no issue Key: LUCENE-10038 URL: https://issues.apache.org/jira/browse/LUCENE-10038 Project: Lucene - Core Issue Type: New Feature Components: core/FSTs Affects Versions: 8.8.2 Reporter: mahnoor jabbar [Carters Coupon Codes|https://uttercoupons.com/front/store-profile/carters-coupon-codes] is best code provided by Carter's. Amazing Discount Offers, Get Carter's Coupons & Promo Codes and save up to 50% on the offer, so get the code helps you to save on coupon and promo codes -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7020) TieredMergePolicy - cascade maxMergeAtOnce setting to maxMergeAtOnceExplicit
[ https://issues.apache.org/jira/browse/LUCENE-7020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388572#comment-17388572 ] Adrien Grand commented on LUCENE-7020: -- I've just seen a similar issue to the one that Shawn is describing. A small index (3.3GB) has more than 30 segments and ends up needing two rounds to be force-merged down to 1 segment. With the default settings, it takes 264s to force-merge this index. If I set the max number of segments to merge at once to 50, then force-merging down to 1 segment takes 190s, 28% faster. An alternative I'd like to propose would be to raise the default value of maxMergeAtOnceExplicit to 50 instead of 30. While 30-segment indices can be as small as 2.2GB with the default configuration (10 2MB segments, 10 20MB segments and 10 200MB segments), a 50-segment index must be at least 72GB (10 2MB segments, 10 20MB segments, 10 200MB segments, 10 2GB segments and 10 5GB segments). Or maybe we shouldn't limit the number of segments to merge at once with explicit merges? I understand the argument about read-ahead, but we also have data structures that are very CPU-intensive to merge, like stored fields with index sorting, vectors, or multi-dimensional points (when N>1), because they may need to recompute the data structure entirely. Avoiding cascading merges in such cases is very helpful. For the record, the example I gave above falls in none of these cases and yet already yields a significant speedup if it doesn't need to cascade merges. > TieredMergePolicy - cascade maxMergeAtOnce setting to maxMergeAtOnceExplicit > > > Key: LUCENE-7020 > URL: https://issues.apache.org/jira/browse/LUCENE-7020 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 5.4.1 >Reporter: Shawn Heisey >Assignee: Shawn Heisey >Priority: Major > Attachments: LUCENE-7020.patch > > > SOLR-8621 covers improvements in configuring a merge policy in Solr. 
> Discussions on that issue brought up the fact that if large values are > configured for maxMergeAtOnce and segmentsPerTier, but maxMergeAtOnceExplicit > is not changed, then doing a forceMerge is likely to not work as expected. > When I first configured maxMergeAtOnce and segmentsPerTier to 35 in Solr, I > saw an optimize (forceMerge) fully rewrite most of the index *twice* in order > to achieve a single segment, because there were approximately 80 segments in > the index before the optimize, and maxMergeAtOnceExplicit defaults to 30. On > advice given via the solr-user mailing list, I configured > maxMergeAtOnceExplicit to 105 and have not had that problem since. > I propose that setting maxMergeAtOnce should also set maxMergeAtOnceExplicit > to three times the new value -- unless the setMaxMergeAtOnceExplicit method > has been invoked, indicating that the user wishes to set that value > themselves.
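The minimum index sizes in Adrien's comment can be sanity-checked with a quick arithmetic sketch (Python used purely for illustration; the tier model of 10 segments per tier, 10x size growth from a 2MB floor, and a 5GB max merged segment size follows the defaults described in the comment):

```python
def min_index_size_mb(num_segments, floor_mb=2.0, segs_per_tier=10,
                      tier_growth=10, max_merged_mb=5 * 1024):
    """Smallest index (in MB) that can hold `num_segments` segments under
    the tiered model: tiers of 10 segments whose size grows 10x per tier,
    capped at the 5GB max merged segment size."""
    total, tier_mb, remaining = 0.0, floor_mb, num_segments
    while remaining > 0:
        n = min(segs_per_tier, remaining)
        total += n * tier_mb
        remaining -= n
        tier_mb = min(tier_mb * tier_growth, max_merged_mb)
    return total

print(min_index_size_mb(30) / 1024)  # ~2.2 GB for a 30-segment index
print(min_index_size_mb(50) / 1024)  # ~72 GB for a 50-segment index
```

This reproduces the comment's point: bumping maxMergeAtOnceExplicit from 30 to 50 only affects force-merges of indices that are at least tens of gigabytes.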
[jira] [Resolved] (LUCENE-9999) CombinedFieldQuery can fail when document is missing fields
[ https://issues.apache.org/jira/browse/LUCENE-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julie Tibshirani resolved LUCENE-9999. -- Fix Version/s: 8.10 Resolution: Fixed > CombinedFieldQuery can fail when document is missing fields > --- > > Key: LUCENE-9999 > URL: https://issues.apache.org/jira/browse/LUCENE-9999 > Project: Lucene - Core > Issue Type: Bug >Reporter: Julie Tibshirani >Priority: Major > Fix For: 8.10 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > If some documents match but don't contain all fields, then > {{CombinedFieldQuery}} can fail when attempting to compute norms. This is > because {{MultiFieldNormValues}} assumes all fields in the document have > norms. > Originally surfaced in this Elasticsearch issue: > https://github.com/elastic/elasticsearch/issues/74037.
[jira] [Created] (LUCENE-10039) With a single field, CombinedFieldQuery can score incorrectly
Julie Tibshirani created LUCENE-10039: - Summary: With a single field, CombinedFieldQuery can score incorrectly Key: LUCENE-10039 URL: https://issues.apache.org/jira/browse/LUCENE-10039 Project: Lucene - Core Issue Type: Bug Reporter: Julie Tibshirani When there's only one field, {{CombinedFieldQuery}} will ignore its weight while scoring. This makes the scoring inconsistent, since the field weight is supposed to multiply its term frequency. This can also come up when searching over multiple fields, when some segment happens to contain only one field. The problem was caught by this test: {code} ant test -Dtestcase=TestCombinedFieldQuery -Dtests.method=testCopyFieldWithMissingFields -Dtests.seed=8FA982798BC8FEF6 -Dtests.nightly=true {code}
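The inconsistency is easier to see with a toy model of how a combined-field query builds its pseudo term frequency (a sketch of the BM25F-style combination, not the actual Lucene code; the field names are made up):

```python
def combined_tf(field_freqs, field_weights):
    """BM25F-style pseudo term frequency: each field's term frequency is
    scaled by that field's weight before the frequencies are summed."""
    return sum(field_weights[f] * tf for f, tf in field_freqs.items())

# With two fields the weight is applied as expected:
print(combined_tf({"title": 1, "body": 3}, {"title": 2.0, "body": 1.0}))  # 5.0

# The reported bug: with a single field the weight was skipped, so a
# "title" field boosted by 2.0 scored as if its weight were 1.0.
print(combined_tf({"title": 1}, {"title": 2.0}))  # 2.0 (correct), buggy path gave 1.0
```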
[GitHub] [lucene] jtibshirani opened a new pull request #229: LUCENE-10039: Fix single-field scoring for CombinedFieldQuery
jtibshirani opened a new pull request #229: URL: https://github.com/apache/lucene/pull/229 When there's only one field, CombinedFieldQuery will ignore its weight while scoring. This makes the scoring inconsistent, since the field weight is supposed to multiply its term frequency. This PR removes the optimizations around single-field scoring to make sure the weight is always taken into account. These optimizations don't seem critical since it should be rare for CombinedFieldQuery to run over only one field. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jtibshirani commented on pull request #229: LUCENE-10039: Fix single-field scoring for CombinedFieldQuery
jtibshirani commented on pull request #229: URL: https://github.com/apache/lucene/pull/229#issuecomment-888241745 Great point, the existing test `testCopyFieldWithMissingFields` only very rarely triggers this case.
[GitHub] [lucene] rmuir commented on pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly
rmuir commented on pull request #225: URL: https://github.com/apache/lucene/pull/225#issuecomment-888270205 > This is super exciting! I'm amazed how little code you needed to get this first version running. but a runautomaton for this won't run any queries on its own: brute force isn't how these queries actually work. the important part is the intersection (skipping around)... I suggest, please let's not try to "overshare" and refactor all this stuff alongside DFA stuff until there is a query we can actually benchmark to see if the performance is even viable
[GitHub] [lucene] rmuir commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly
rmuir commented on a change in pull request #225: URL: https://github.com/apache/lucene/pull/225#discussion_r678255582 ## File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java ## @@ -0,0 +1,225 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.util.automaton; + +import java.util.Arrays; +import java.util.HashMap; +import java.util.Map; +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.hppc.BitMixer; + +/** + * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA + * state along with the run + * + * implemented based on: https://swtch.com/~rsc/regexp/regexp1.html + */ +public class NFARunAutomaton { + + /** state ordinal of "no such state" */ + public static final int MISSING = -1; + + private static final int NOT_COMPUTED = -2; + + private final Automaton automaton; + private final int[] points; + private final Map dStateToOrd = new HashMap<>(); // could init lazily? 
+ private DState[] dStates; + private final int alphabetSize; + + /** + * Constructor, assuming alphabet size is the whole codepoint space + * + * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for + * better efficiency + */ + public NFARunAutomaton(Automaton automaton) { +this(automaton, Character.MAX_CODE_POINT); + } + + /** + * Constructor + * + * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} * + * for better efficiency + * @param alphabetSize alphabet size + */ + public NFARunAutomaton(Automaton automaton, int alphabetSize) { +this.automaton = automaton; +points = automaton.getStartPoints(); +this.alphabetSize = alphabetSize; +dStates = new DState[10]; +findDState(new DState(new int[] {0})); + } + + /** + * For a given state and an incoming character (codepoint), return the next state + * + * @param state incoming state, should either be 0 or some state that is returned previously by + * this function + * @param c codepoint + * @return the next state or {@link #MISSING} if the transition doesn't exist + */ + public int step(int state, int c) { +assert dStates[state] != null; +return step(dStates[state], c); + } + + /** + * Run through a given codepoint array, return accepted or not, should only be used in test + * + * @param s String represented by an int array + * @return accept or not + */ + boolean run(int[] s) { Review comment: see my comment: we should avoid oversharing for now.
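The core idea in the patch, running an NFA by determinizing on the fly rather than building a full DFA up front, can be illustrated with a minimal sketch (illustrative only; the real NFARunAutomaton works on codepoint ranges and memoizes each discovered DFA state by ordinal, which this sketch omits):

```python
def nfa_run(transitions, accept, start, inputs):
    """Run an NFA by tracking the set of live states. Each step computes
    the set of reachable states, which is exactly the DFA state that full
    determinization would have precomputed (Thompson's approach)."""
    states = {start}
    for c in inputs:
        states = {t for s in states for t in transitions.get((s, c), ())}
        if not states:  # no outgoing transition: reject early (like MISSING)
            return False
    return bool(states & accept)

# NFA for the regex a|ab: state 0 --a--> {1, 2}, state 2 --b--> {3}; accept {1, 3}
nfa = {(0, "a"): {1, 2}, (2, "b"): {3}}
print(nfa_run(nfa, {1, 3}, 0, "a"))   # True
print(nfa_run(nfa, {1, 3}, 0, "ab"))  # True
print(nfa_run(nfa, {1, 3}, 0, "b"))   # False
```

Caching the state sets as they are discovered, as the patch does with its DState map, makes repeated runs approach DFA speed without paying the worst-case exponential determinization cost up front.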
[GitHub] [lucene] jpountz commented on a change in pull request #224: LUCENE-10035: Simple text codec add multi level skip list data
jpountz commented on a change in pull request #224: URL: https://github.com/apache/lucene/pull/224#discussion_r678249370 ## File path: lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextSkipReader.java ## @@ -0,0 +1,179 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.codecs.simpletext; + +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.FREQ; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.IMPACT; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.IMPACTS; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.IMPACTS_END; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.NORM; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.SKIP_DOC; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.SKIP_DOC_FP; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.SKIP_LIST; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; +import org.apache.lucene.codecs.MultiLevelSkipListReader; +import org.apache.lucene.index.Impact; +import org.apache.lucene.index.Impacts; +import org.apache.lucene.search.DocIdSetIterator; +import org.apache.lucene.store.BufferedChecksumIndexInput; +import org.apache.lucene.store.ChecksumIndexInput; +import org.apache.lucene.store.IndexInput; +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.BytesRefBuilder; +import org.apache.lucene.util.CharsRefBuilder; +import org.apache.lucene.util.StringHelper; + +/** + * This class reads skip lists with multiple levels. + * + * See {@link SimpleTextFieldsWriter} for the information about the encoding of the multi level + * skip lists. + * + * @lucene.experimental + */ +public class SimpleTextSkipReader extends MultiLevelSkipListReader { Review comment: can it be made pkg-private?
[GitHub] [lucene] jtibshirani merged pull request #229: LUCENE-10039: Fix single-field scoring for CombinedFieldQuery
jtibshirani merged pull request #229: URL: https://github.com/apache/lucene/pull/229
[jira] [Commented] (LUCENE-10039) With a single field, CombinedFieldQuery can score incorrectly
[ https://issues.apache.org/jira/browse/LUCENE-10039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388737#comment-17388737 ] ASF subversion and git services commented on LUCENE-10039: -- Commit e8663b30b85c1d48a8d18d37866a553895ffb8ae in lucene's branch refs/heads/main from Julie Tibshirani [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e8663b3 ] LUCENE-10039: Fix single-field scoring for CombinedFieldQuery (#229) When there's only one field, CombinedFieldQuery will ignore its weight while scoring. This makes the scoring inconsistent, since the field weight is supposed to multiply its term frequency. This PR removes the optimizations around single-field scoring to make sure the weight is always taken into account. These optimizations are not critical since it should be uncommon to use CombinedFieldQuery with only one field. > With a single field, CombinedFieldQuery can score incorrectly > - > > Key: LUCENE-10039 > URL: https://issues.apache.org/jira/browse/LUCENE-10039 > Project: Lucene - Core > Issue Type: Bug >Reporter: Julie Tibshirani >Priority: Minor > Time Spent: 20m > Remaining Estimate: 0h > > When there's only one field, {{CombinedFieldQuery}} will ignore its weight > while scoring. This makes the scoring inconsistent, since the field weight is > supposed to multiply its term frequency. > This can also come up when searching over multiple fields, when some segment > happens to contain only one field. The problem was caught by this test: > {code} > ant test -Dtestcase=TestCombinedFieldQuery > -Dtests.method=testCopyFieldWithMissingFields -Dtests.seed=8FA982798BC8FEF6 > -Dtests.nightly=true > {code}
[GitHub] [lucene] mikemccand commented on pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly
mikemccand commented on pull request #225: URL: https://github.com/apache/lucene/pull/225#issuecomment-888278654 > I suggest, please let's not try to "overshare" and refactor all this stuff alongside DFA stuff until there is a query we can actually benchmark to see if the performance is even viable OK yeah +1 to keep it wholly separate (full fork) for now until we learn more how this `NFARegexpQuery` behaves.
[jira] [Commented] (LUCENE-10035) Simple text codec add multi level skip list data
[ https://issues.apache.org/jira/browse/LUCENE-10035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388751#comment-17388751 ] Adrien Grand commented on LUCENE-10035: --- Wow! This is impressive work! > Simple text codec add multi level skip list data > -- > > Key: LUCENE-10035 > URL: https://issues.apache.org/jira/browse/LUCENE-10035 > Project: Lucene - Core > Issue Type: Wish > Components: core/codecs >Affects Versions: main (9.0) >Reporter: wuda >Priority: Major > Labels: Impact, MultiLevelSkipList, SimpleTextCodec > Time Spent: 20m > Remaining Estimate: 0h > > The simple text codec adds skip list data (including impacts) to help understand > the index format, for debugging, curiosity, and transparency only! When a term's > docFreq is greater than or equal to SimpleTextSkipWriter.BLOCK_SIZE (default > value is 8), the simple text codec will write a skip list, and the *.pst (simple > text term dictionary) file* will look like this > {code:java} > field title > term args > doc 2 > freq 2 > pos 7 > pos 10 > ## we omit docs for better view .. > doc 98 > freq 2 > pos 2 > pos 6 > skipList > ? > level 1 > skipDoc 65 > skipDocFP 949 > impacts > impact > freq 1 > norm 2 > impact > freq 2 > norm 12 > impact > freq 3 > norm 13 > impacts_end > ? 
> level 0 > skipDoc 17 > skipDocFP 284 > impacts > impact > freq 1 > norm 2 > impact > freq 2 > norm 12 > impacts_end > skipDoc 34 > skipDocFP 624 > impacts > impact > freq 1 > norm 2 > impact > freq 2 > norm 12 > impact > freq 3 > norm 14 > impacts_end > skipDoc 65 > skipDocFP 949 > impacts > impact > freq 1 > norm 2 > impact > freq 2 > norm 12 > impact > freq 3 > norm 13 > impacts_end > skipDoc 90 > skipDocFP 1311 > impacts > impact > freq 1 > norm 2 > impact > freq 2 > norm 10 > impact > freq 3 > norm 13 > impact > freq 4 > norm 14 > impacts_end > END > checksum 000829315543 > {code} > Compared with the previous format, we add *skipList, level, skipDoc, skipDocFP, impacts, > impact, freq, norm* nodes; at the same time, the simple text codec can support > advanceShallow at search time. > >
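A rough model of how the levels in such a dump come about (a sketch only, not the SimpleText code: it assumes a skip entry is written every BLOCK_SIZE documents and every BLOCK_SIZE-th entry is promoted to the next level; the real writer's intervals may differ):

```python
def skip_levels(doc_count, block_size=8, max_levels=4):
    """Return, per level, how many skip entries a posting list gets:
    level 0 gets one entry per block of `block_size` docs, level 1 one
    entry per `block_size` level-0 entries, and so on."""
    levels = []
    entries = doc_count // block_size
    while entries > 0 and len(levels) < max_levels:
        levels.append(entries)
        entries //= block_size
    return levels

# Under this model, a 98-doc posting list gets two levels of skip data,
# which is why the dump above has both a "level 1" and a "level 0" section.
print(skip_levels(98))  # [12, 1]
print(skip_levels(7))   # [] -- below BLOCK_SIZE, no skip list at all
```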
[GitHub] [lucene] mikemccand commented on pull request #157: LUCENE-9963 Fix issue with FlattenGraphFilter throwing exceptions from holes
mikemccand commented on pull request #157: URL: https://github.com/apache/lucene/pull/157#issuecomment-888287420 > We have been using this change internally for a few weeks now. We no longer encounter the ArrayIndexOutOfBounds exceptions that we were previously experiencing. Depending on the dataset/analyzer combination we have seen up to a 1% increase in the average number of tokens per field. This comes from the tokens that had previously been dropped now being correctly indexed. Thanks for the update @glawson0! This is great news.
[GitHub] [lucene] jpountz commented on pull request #220: LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version
jpountz commented on pull request #220: URL: https://github.com/apache/lucene/pull/220#issuecomment-888288068 > What is the use of the LiveIndexWriterConfig.createdVersionMajor It's very expert. It's necessary if you have multiple workers creating indices that you then want to merge together using `IndexWriter#addIndexes`. `addIndexes` requires that all indices have the same major version, so if you are doing a rolling upgrade on your workers to a new Lucene major, this helps ensure that all indices are created in a way that they can be merged eventually.
[GitHub] [lucene] dnhatn merged pull request #228: Remove unnecessary assertion
dnhatn merged pull request #228: URL: https://github.com/apache/lucene/pull/228
[jira] [Commented] (LUCENE-9304) Clean up DWPTPool
[ https://issues.apache.org/jira/browse/LUCENE-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388803#comment-17388803 ] ASF subversion and git services commented on LUCENE-9304: - Commit 03b1db91f9e5b11816274efd0da8503db27ccce0 in lucene's branch refs/heads/main from Shintaro Murakami [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=03b1db9 ] LUCENE-9304: Remove assertion in DocumentsWriterFlushControl (#228) This assertion becomes obvious after LUCENE-9304. > Clean up DWPTPool > -- > > Key: LUCENE-9304 > URL: https://issues.apache.org/jira/browse/LUCENE-9304 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: main (9.0), 8.6 >Reporter: Simon Willnauer >Assignee: Simon Willnauer >Priority: Major > Fix For: main (9.0), 8.6 > > Time Spent: 11h 10m > Remaining Estimate: 0h > > DWPTPool currently uses an indirection called ThreadState to hold DWPT > instances. This class holds several pieces of information that belong in other places, > inherits from ReentrantLock, and has a mutable nature. Instead we could pool > the DWPT directly and remove other indirections inside DWPTFlushControl if > we move some of the ThreadState properties to DWPT directly. The thread pool > also has a problem in that it grows its ThreadStates to the number of > concurrently indexing threads but never shrinks them if that number is reduced. With > pooling DWPT directly this limitation could be removed. > In summary, this component has seen quite some refactoring and requires some > cleanups and docs changes in order to stand the test of time.
[GitHub] [lucene] dnhatn commented on pull request #228: Remove unnecessary assertion
dnhatn commented on pull request #228: URL: https://github.com/apache/lucene/pull/228#issuecomment-888337488 Merged, thanks @mrkm4ntr. It's preferable to open a Jira ticket before submitting a pull request, but it's okay for this change as it's straightforward.
[jira] [Created] (LUCENE-10040) Handle deletions in nearest vector search
Julie Tibshirani created LUCENE-10040: - Summary: Handle deletions in nearest vector search Key: LUCENE-10040 URL: https://issues.apache.org/jira/browse/LUCENE-10040 Project: Lucene - Core Issue Type: Improvement Reporter: Julie Tibshirani Currently nearest vector search doesn't account for deleted documents. Even if a document is not in {{LeafReader#getLiveDocs}}, it could still be returned from {{LeafReader#searchNearestVectors}}. This seems like it'd be surprising + difficult for users, since other search APIs account for deleted docs. We've discussed extending {{searchNearestVectors}} to take a parameter like {{Bits liveDocs}}. This issue discusses options around adding support. One approach is to just filter out deleted docs after running the KNN search. This behavior seems hard to work with as a user: fewer than {{k}} docs might come back from your KNN search! Alternatively, {{LeafReader#searchNearestVectors}} could always return the {{k}} nearest undeleted docs. To implement this, HNSW could omit deleted docs while assembling its candidate list. It would traverse further into the graph, visiting more nodes to ensure it gathers the required candidates. (Note deleted docs would still be visited/traversed). The [hnswlib library|https://github.com/nmslib/hnswlib] contains an implementation like this, where you can mark documents as deleted and they're skipped during search. This approach seems reasonable to me, but there are some challenges: * Performance can be unpredictable. If deletions are random, it shouldn't have a huge effect. But in the worst case, a segment could have 50% deleted docs, and they all happen to be near the query vector. HNSW would need to traverse through around half the entire graph to collect neighbors. * As far as I know, there hasn't been academic research or any testing into how well this performs in terms of recall. 
I have a vague intuition it could be harder to achieve high recall as the algorithm traverses areas further from the "natural" entry points. The HNSW paper doesn't mention deletions/filtering, and I haven't seen community benchmarks around it. Background links: * Thoughts on deletions from the author of the HNSW paper: [https://github.com/nmslib/hnswlib/issues/4#issuecomment-378739892] * Blog from Vespa team which mentions combining KNN and search filters (very similar to applying deleted docs): [https://blog.vespa.ai/approximate-nearest-neighbor-search-in-vespa-part-1/]. The "Exact vs Approximate" section shows good performance even when a large percentage of documents are filtered out. The team mentioned to me they didn't have the chance to measure recall, only latency.
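The second option, traversing through deleted nodes while never returning them, can be sketched on a plain graph (a toy best-first search over a made-up graph, not HNSW itself, and with no early-termination condition):

```python
import heapq

def nearest_undeleted(graph, dist, entry, k, deleted):
    """Best-first graph search that still routes the traversal through
    deleted nodes but only collects undeleted ones as results."""
    visited, results = {entry}, []
    frontier = [(dist(entry), entry)]
    while frontier:
        d, node = heapq.heappop(frontier)
        if node not in deleted:
            results.append((d, node))
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(frontier, (dist(nb), nb))
    return [n for _, n in sorted(results)[:k]]

# Nodes on a line, query at position 0; node 1 is deleted but still
# bridges 0 -> 2, so the search finds k undeleted neighbors anyway.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
position = {0: 0.0, 1: 1.0, 2: 2.0, 3: 3.0}
print(nearest_undeleted(graph, lambda n: position[n], 0, 2, deleted={1}))  # [0, 2]
```

The worst case described above is visible in this model too: if many nodes near the query are deleted, the frontier must expand much further before k undeleted results accumulate.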
[jira] [Commented] (LUCENE-10016) VectorReader.search needs rethought, o.a.l.search integration?
[ https://issues.apache.org/jira/browse/LUCENE-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388806#comment-17388806 ] Julie Tibshirani commented on LUCENE-10016: --- Deletions are an interesting topic, I opened https://issues.apache.org/jira/browse/LUCENE-10040 for a dedicated discussion. Maybe we could close this issue in favor of that one and also https://issues.apache.org/jira/browse/LUCENE-9614, which discusses a high-level API for KNN search? If we close this, we should decide if we want to transfer its "blocker" status to those issues. > VectorReader.search needs rethought, o.a.l.search integration? > -- > > Key: LUCENE-10016 > URL: https://issues.apache.org/jira/browse/LUCENE-10016 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Blocker > Fix For: 9.0 > > Time Spent: 50m > Remaining Estimate: 0h > > There's no search integration (e.g. queries) for the current vector values, > no documentation/examples that I can find. > Instead the codec has this method: > {code} > TopDocs search(String field, float[] target, int k, int fanout) > {code} > First, the "fanout" parameter needs to go, this is specific to HNSW impl, get > it out of here. > Second, How am I supposed to skip over deleted documents? How can I use > filters? How should i search across multiple segments? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Deleted] (LUCENE-10038) i have no issue
[ https://issues.apache.org/jira/browse/LUCENE-10038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe deleted LUCENE-10038: - > i have no issue > --- > > Key: LUCENE-10038 > URL: https://issues.apache.org/jira/browse/LUCENE-10038 > Project: Lucene - Core > Issue Type: New Feature >Reporter: mahnoor jabbar >Priority: Major
[GitHub] [lucene-solr] jtibshirani opened a new pull request #2535: LUCENE-10039: Fix single-field scoring for CombinedFieldQuery
jtibshirani opened a new pull request #2535: URL: https://github.com/apache/lucene-solr/pull/2535 When there's only one field, CombinedFieldQuery will ignore its weight while scoring. This makes the scoring inconsistent, since the field weight is supposed to multiply its term frequency. This PR removes the optimizations around single-field scoring to make sure the weight is always taken into account. These optimizations are not critical since it should be uncommon to use CombinedFieldQuery with only one field. This backport also incorporates the part of LUCENE-9823 that applies to CombinedFieldQuery. We no longer rewrite single-field queries, which can also change their scoring. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
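The scoring rule at stake can be sketched with a simplified combined-frequency model (hypothetical names, not the actual CombinedFieldQuery code): each field's weight multiplies its term frequency before the frequencies are combined, so even with a single field the weight must be applied:

```java
public class CombinedTf {
    // Hedged sketch of the intended semantics: each field's term frequency is
    // scaled by its weight, then the scaled frequencies are summed into one
    // pseudo-frequency that feeds the similarity.
    static double combinedFreq(double[] weights, int[] freqs) {
        double tf = 0;
        for (int i = 0; i < freqs.length; i++) {
            tf += weights[i] * freqs[i];
        }
        return tf;
    }

    public static void main(String[] args) {
        // With a single field of weight 2.0 and tf=3, the combined frequency
        // must be 6.0; a code path that drops the weight (the bug) yields 3.0.
        System.out.println(combinedFreq(new double[]{2.0}, new int[]{3})); // 6.0
    }
}
```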
[GitHub] [lucene-solr] jtibshirani merged pull request #2535: LUCENE-10039: Fix single-field scoring for CombinedFieldQuery
jtibshirani merged pull request #2535: URL: https://github.com/apache/lucene-solr/pull/2535
[jira] [Resolved] (LUCENE-10039) With a single field, CombinedFieldQuery can score incorrectly
[ https://issues.apache.org/jira/browse/LUCENE-10039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julie Tibshirani resolved LUCENE-10039. --- Fix Version/s: 8.10 Resolution: Fixed > With a single field, CombinedFieldQuery can score incorrectly > - > > Key: LUCENE-10039 > URL: https://issues.apache.org/jira/browse/LUCENE-10039 > Project: Lucene - Core > Issue Type: Bug >Reporter: Julie Tibshirani >Priority: Minor > Fix For: 8.10 > > Time Spent: 50m > Remaining Estimate: 0h > > When there's only one field, {{CombinedFieldQuery}} will ignore its weight > while scoring. This makes the scoring inconsistent, since the field weight is > supposed to multiply its term frequency. > This can also come up when searching over multiple fields, when some segment > happens to contain only one field. The problem was caught by this test: > {code} > ant test -Dtestcase=TestCombinedFieldQuery > -Dtests.method=testCopyFieldWithMissingFields -Dtests.seed=8FA982798BC8FEF6 > -Dtests.nightly=true > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9823) SynonymQuery rewrite can change field boost calculation
[ https://issues.apache.org/jira/browse/LUCENE-9823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388814#comment-17388814 ] ASF subversion and git services commented on LUCENE-9823: - Commit e31762253fcf7ef85fa0c09fdb40d3daf201a9d1 in lucene-solr's branch refs/heads/branch_8x from Julie Tibshirani [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e317622 ] LUCENE-10039: Fix single-field scoring for CombinedFieldQuery (#2535) When there's only one field, CombinedFieldQuery will ignore its weight while scoring. This makes the scoring inconsistent, since the field weight is supposed to multiply its term frequency. This PR removes the optimizations around single-field scoring to make sure the weight is always taken into account. These optimizations are not critical since it should be uncommon to use CombinedFieldQuery with only one field. This backport also incorporates the part of LUCENE-9823 that applies to CombinedFieldQuery. We no longer rewrite single-field queries, which can also change their scoring. > SynonymQuery rewrite can change field boost calculation > --- > > Key: LUCENE-9823 > URL: https://issues.apache.org/jira/browse/LUCENE-9823 > Project: Lucene - Core > Issue Type: Bug >Reporter: Julie Tibshirani >Priority: Minor > Labels: newdev > Fix For: main (9.0) > > Time Spent: 1h > Remaining Estimate: 0h > > SynonymQuery accepts a boost per term, which acts as a multiplier on the term > frequency in the document. When rewriting a SynonymQuery with a single term, > we create a BoostQuery wrapping a TermQuery. This changes the meaning of the > boost: it now multiplies the final TermQuery score instead of multiplying the > term frequency before it's passed to the score calculation. > This is a small point, but maybe it's worth avoiding rewriting a single-term > SynonymQuery unless the boost is 1.0. > The same consideration affects CombinedFieldQuery in sandbox. 
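The difference between boosting the term frequency and boosting the final score, as described in LUCENE-9823 above, comes from the non-linearity of the tf curve. A sketch with a simplified saturating tf function in the spirit of BM25 (not Lucene's actual similarity code):

```java
public class BoostPlacement {
    // Simplified saturating term-frequency weight, in the spirit of BM25's
    // tf component: tf / (tf + k1).
    static double tfWeight(double tf) {
        double k1 = 1.2;
        return tf / (tf + k1);
    }

    public static void main(String[] args) {
        double boost = 2.0, tf = 3.0;
        double boostFreq = tfWeight(boost * tf);  // SynonymQuery semantics: boost scales tf
        double boostScore = boost * tfWeight(tf); // BoostQuery(TermQuery) semantics: boost scales score
        // Because the curve saturates, 6/(6+1.2) ≈ 0.83 while 2 * 3/(3+1.2) ≈ 1.43,
        // so rewriting a boosted single-term SynonymQuery changes the score.
        System.out.println(boostFreq != boostScore); // true
    }
}
```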
[jira] [Commented] (LUCENE-10039) With a single field, CombinedFieldQuery can score incorrectly
[ https://issues.apache.org/jira/browse/LUCENE-10039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388813#comment-17388813 ] ASF subversion and git services commented on LUCENE-10039: -- Commit e31762253fcf7ef85fa0c09fdb40d3daf201a9d1 in lucene-solr's branch refs/heads/branch_8x from Julie Tibshirani [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e317622 ] LUCENE-10039: Fix single-field scoring for CombinedFieldQuery (#2535) When there's only one field, CombinedFieldQuery will ignore its weight while scoring. This makes the scoring inconsistent, since the field weight is supposed to multiply its term frequency. This PR removes the optimizations around single-field scoring to make sure the weight is always taken into account. These optimizations are not critical since it should be uncommon to use CombinedFieldQuery with only one field. This backport also incorporates the part of LUCENE-9823 that applies to CombinedFieldQuery. We no longer rewrite single-field queries, which can also change their scoring. > With a single field, CombinedFieldQuery can score incorrectly > - > > Key: LUCENE-10039 > URL: https://issues.apache.org/jira/browse/LUCENE-10039 > Project: Lucene - Core > Issue Type: Bug >Reporter: Julie Tibshirani >Priority: Minor > Fix For: 8.10 > > Time Spent: 50m > Remaining Estimate: 0h > > When there's only one field, {{CombinedFieldQuery}} will ignore its weight > while scoring. This makes the scoring inconsistent, since the field weight is > supposed to multiply its term frequency. > This can also come up when searching over multiple fields, when some segment > happens to contain only one field. 
The problem was caught by this test: > {code} > ant test -Dtestcase=TestCombinedFieldQuery > -Dtests.method=testCopyFieldWithMissingFields -Dtests.seed=8FA982798BC8FEF6 > -Dtests.nightly=true > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10015) Remove VectorValues.SimilarityFunction.NONE
[ https://issues.apache.org/jira/browse/LUCENE-10015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388834#comment-17388834 ] Julie Tibshirani commented on LUCENE-10015: --- > In case it's helpful context: currently we only support Euclidean and cosine >distance, which is technically redundant We actually only support Euclidean and dot product! Sorry if this caused confusion, I have no idea why I thought we implemented cosine instead. > Remove VectorValues.SimilarityFunction.NONE > --- > > Key: LUCENE-10015 > URL: https://issues.apache.org/jira/browse/LUCENE-10015 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Blocker > Fix For: 9.0 > > Time Spent: 50m > Remaining Estimate: 0h > > This stuff is HNSW-implementation specific. It can be moved to a codec > parameter. > The NONE option should be removed: it just makes the codec more complex. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10015) Remove VectorValues.SimilarityFunction.NONE
[ https://issues.apache.org/jira/browse/LUCENE-10015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388834#comment-17388834 ] Julie Tibshirani edited comment on LUCENE-10015 at 7/28/21, 3:09 PM: - {quote}In case it's helpful context: currently we only support Euclidean and cosine distance, which is technically redundant {quote} We actually only support Euclidean and dot product! Sorry if this caused confusion, I have no idea why I thought we implemented cosine instead. was (Author: julietibs): > In case it's helpful context: currently we only support Euclidean and cosine >distance, which is technically redundant We actually only support Euclidean and dot product! Sorry if this caused confusion, I have no idea why I thought we implemented cosine instead. > Remove VectorValues.SimilarityFunction.NONE > --- > > Key: LUCENE-10015 > URL: https://issues.apache.org/jira/browse/LUCENE-10015 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Blocker > Fix For: 9.0 > > Time Spent: 50m > Remaining Estimate: 0h > > This stuff is HNSW-implementation specific. It can be moved to a codec > parameter. > The NONE option should be removed: it just makes the codec more complex. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
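For reference, the two supported similarity functions can be written out in plain Java (hypothetical helper names, not Lucene's VectorValues code). Cosine similarity reduces to dot product on unit-length vectors, which is why supporting dot product makes a separate cosine option largely redundant:

```java
public class VectorSimilarity {
    // Squared Euclidean distance between two vectors of equal dimension.
    static double squaredEuclidean(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Dot product; equals cosine similarity when both vectors are unit-normalized.
    static double dotProduct(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    public static void main(String[] args) {
        float[] a = {1f, 0f}, b = {0f, 1f};
        System.out.println(squaredEuclidean(a, b)); // 2.0
        System.out.println(dotProduct(a, b));       // 0.0
    }
}
```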
[GitHub] [lucene] wuda0112 commented on a change in pull request #224: LUCENE-10035: Simple text codec add multi level skip list data
wuda0112 commented on a change in pull request #224: URL: https://github.com/apache/lucene/pull/224#discussion_r678424267 ## File path: lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextSkipReader.java ## @@ -0,0 +1,179 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.codecs.simpletext; + +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.FREQ; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.IMPACT; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.IMPACTS; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.IMPACTS_END; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.NORM; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.SKIP_DOC; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.SKIP_DOC_FP; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.SKIP_LIST; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; +import org.apache.lucene.codecs.MultiLevelSkipListReader; +import org.apache.lucene.index.Impact; +import org.apache.lucene.index.Impacts; +import org.apache.lucene.search.DocIdSetIterator; +import org.apache.lucene.store.BufferedChecksumIndexInput; +import org.apache.lucene.store.ChecksumIndexInput; +import org.apache.lucene.store.IndexInput; +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.BytesRefBuilder; +import org.apache.lucene.util.CharsRefBuilder; +import org.apache.lucene.util.StringHelper; + +/** + * This class reads skip lists with multiple levels. + * + * See {@link SimpleTextFieldsWriter} for the information about the encoding of the multi level + * skip lists. + * + * @lucene.experimental + */ +public class SimpleTextSkipReader extends MultiLevelSkipListReader { Review comment: Sorry, I ran the tests on the wrong git branch, which is why they passed. When I realized that, I converted the PR to a draft; I will test it carefully again until all unit tests pass. And yes, it should be pkg-private.
[GitHub] [lucene] mikemccand commented on a change in pull request #157: LUCENE-9963 Fix issue with FlattenGraphFilter throwing exceptions from holes
mikemccand commented on a change in pull request #157: URL: https://github.com/apache/lucene/pull/157#discussion_r678412320 ## File path: lucene/core/src/java/org/apache/lucene/analysis/AutomatonToTokenStream.java ## @@ -0,0 +1,197 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.analysis; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.LinkedList; +import java.util.List; +import java.util.Map; +import org.apache.lucene.analysis.tokenattributes.CharTermAttribute; +import org.apache.lucene.analysis.tokenattributes.OffsetAttribute; +import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute; +import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute; +import org.apache.lucene.util.automaton.Automaton; +import org.apache.lucene.util.automaton.Operations; +import org.apache.lucene.util.automaton.Transition; + +/** Converts an Automaton into a TokenStream. */ +public class AutomatonToTokenStream { Review comment: Whoa, awesome! This will be a really helpful infrastructure for future testing! 
And of course now I really feel compelled to take a Lev1("lucene") and run it through here and watch what the resulting tokens are! ## File path: lucene/core/src/java/org/apache/lucene/analysis/AutomatonToTokenStream.java ## @@ -0,0 +1,197 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.analysis; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.LinkedList; +import java.util.List; +import java.util.Map; +import org.apache.lucene.analysis.tokenattributes.CharTermAttribute; +import org.apache.lucene.analysis.tokenattributes.OffsetAttribute; +import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute; +import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute; +import org.apache.lucene.util.automaton.Automaton; +import org.apache.lucene.util.automaton.Operations; +import org.apache.lucene.util.automaton.Transition; + +/** Converts an Automaton into a TokenStream. */ +public class AutomatonToTokenStream { + + private AutomatonToTokenStream() {} + + /** + * converts an automaton into a TokenStream. This is done by first Topo sorting the nodes in the + * Automaton. 
Nodes that have the same distance from the start are grouped together to form the + * position nodes for the TokenStream. The resulting TokenStream releases edges from the automaton + * as tokens in order from the position nodes. This requires the automaton be a finite DAG. + * + * @param automaton automaton to convert. Must be a finite DAG. + * @return TokenStream representation of automaton. + */ + public static TokenStream toTokenStream(Automaton automaton) { +if (Operations.isFinite(automaton) == false) { + throw new IllegalArgumentException("Automaton must be finite"); +} + +List> positionNodes = new ArrayList<>(); + +Transition[][] transitions = automaton.getSortedTransitions(); + +int[] indegree = new int[transitions.length]; + +for (int i = 0; i < transitions.length; i++) { + for (int edge = 0; edge < transitions[i].length; edge++) { +indegree[transitions[i][edge].dest] += 1; + } +} +if (indegree[0] != 0) { + throw new IllegalArgumentException("Start node has incoming edges, crea
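The grouping step described in the javadoc above can be sketched independently of Lucene (hypothetical names; Kahn's algorithm over an adjacency list, bucketing nodes by their longest distance from the start node so each bucket becomes one token position):

```java
import java.util.*;

public class TopoLevels {
    // Hedged sketch: bucket the nodes of a DAG by longest distance from node 0.
    // In AutomatonToTokenStream-style conversion, each bucket would become one
    // position node of the resulting token stream.
    static List<List<Integer>> levels(int numNodes, int[][] edges) {
        int[] indegree = new int[numNodes];
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < numNodes; i++) adj.add(new ArrayList<>());
        for (int[] e : edges) {
            adj.get(e[0]).add(e[1]);
            indegree[e[1]]++;
        }
        if (indegree[0] != 0) throw new IllegalArgumentException("Start node has incoming edges");
        int[] dist = new int[numNodes];
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(0);
        int maxDist = 0;
        while (!queue.isEmpty()) { // Kahn's algorithm: process nodes in topological order
            int node = queue.poll();
            maxDist = Math.max(maxDist, dist[node]);
            for (int dest : adj.get(node)) {
                dist[dest] = Math.max(dist[dest], dist[node] + 1);
                if (--indegree[dest] == 0) queue.add(dest);
            }
        }
        List<List<Integer>> groups = new ArrayList<>();
        for (int i = 0; i <= maxDist; i++) groups.add(new ArrayList<>());
        for (int i = 0; i < numNodes; i++) groups.get(dist[i]).add(i);
        return groups;
    }

    public static void main(String[] args) {
        // 0 -> 1 -> 3 and 0 -> 2 -> 3: nodes 1 and 2 share a position.
        System.out.println(levels(4, new int[][]{{0, 1}, {0, 2}, {1, 3}, {2, 3}}));
        // [[0], [1, 2], [3]]
    }
}
```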
[jira] [Commented] (LUCENE-10016) VectorReader.search needs rethought, o.a.l.search integration?
[ https://issues.apache.org/jira/browse/LUCENE-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388866#comment-17388866 ] Adrien Grand commented on LUCENE-10016: --- One thing that would still be missing would be the oal.demo integration. At the same time I'm unsure if we can easily add vector search to the demo as we'd need a way to turn some data that exists on the user computer into vectors in a way that nearest-neighbor search makes sense. > VectorReader.search needs rethought, o.a.l.search integration? > -- > > Key: LUCENE-10016 > URL: https://issues.apache.org/jira/browse/LUCENE-10016 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Blocker > Fix For: 9.0 > > Time Spent: 50m > Remaining Estimate: 0h > > There's no search integration (e.g. queries) for the current vector values, > no documentation/examples that I can find. > Instead the codec has this method: > {code} > TopDocs search(String field, float[] target, int k, int fanout) > {code} > First, the "fanout" parameter needs to go, this is specific to HNSW impl, get > it out of here. > Second, How am I supposed to skip over deleted documents? How can I use > filters? How should i search across multiple segments? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10016) VectorReader.search needs rethought, o.a.l.search integration?
[ https://issues.apache.org/jira/browse/LUCENE-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388868#comment-17388868 ] Robert Muir commented on LUCENE-10016: -- Even if it isn't in the o.a.l.demo module, a simple test similar to "TestDemo" would be a great step: https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/TestDemo.java By this, I mean a high-level unit test that uses indexwriter/indexsearcher/queries and not low-level codec apis. > VectorReader.search needs rethought, o.a.l.search integration? > -- > > Key: LUCENE-10016 > URL: https://issues.apache.org/jira/browse/LUCENE-10016 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Blocker > Fix For: 9.0 > > Time Spent: 50m > Remaining Estimate: 0h > > There's no search integration (e.g. queries) for the current vector values, > no documentation/examples that I can find. > Instead the codec has this method: > {code} > TopDocs search(String field, float[] target, int k, int fanout) > {code} > First, the "fanout" parameter needs to go, this is specific to HNSW impl, get > it out of here. > Second, How am I supposed to skip over deleted documents? How can I use > filters? How should i search across multiple segments? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10040) Handle deletions in nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388877#comment-17388877 ] Adrien Grand commented on LUCENE-10040: --- bq. Performance can be unpredictable. If deletions are random, it shouldn't have a huge effect. But in the worst case, a segment could have 50% deleted docs, and they all happen to be near the query vector. HNSW would need to traverse through around half the entire graph to collect neighbors. FWIW this is a general problem with Lucene. For instance if you run a term query, we'll use impacts to know which blocks may contain competitive documents, but we could hit a worst-case scenario where the documents that make the block competitive are deleted, and all the non-deleted documents of the block are not competitive. > Handle deletions in nearest vector search > - > > Key: LUCENE-10040 > URL: https://issues.apache.org/jira/browse/LUCENE-10040 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Julie Tibshirani >Priority: Major > > Currently nearest vector search doesn't account for deleted documents. Even > if a document is not in {{LeafReader#getLiveDocs}}, it could still be > returned from {{LeafReader#searchNearestVectors}}. This seems like it'd be > surprising + difficult for users, since other search APIs account for deleted > docs. We've discussed extending {{searchNearestVectors}} to take a parameter > like {{Bits liveDocs}}. This issue discusses options around adding support. > One approach is to just filter out deleted docs after running the KNN search. > This behavior seems hard to work with as a user: fewer than {{k}} docs might > come back from your KNN search! > Alternatively, {{LeafReader#searchNearestVectors}} could always return the > {{k}} nearest undeleted docs. To implement this, HNSW could omit deleted docs > while assembling its candidate list. 
It would traverse further into the > graph, visiting more nodes to ensure it gathers the required candidates. > (Note deleted docs would still be visited/ traversed). The [hnswlib > library|https://github.com/nmslib/hnswlib] contains an implementation like > this, where you can mark documents as deleted and they're skipped during > search. > This approach seems reasonable to me, but there are some challenges: > * Performance can be unpredictable. If deletions are random, it shouldn't > have a huge effect. But in the worst case, a segment could have 50% deleted > docs, and they all happen to be near the query vector. HNSW would need to > traverse through around half the entire graph to collect neighbors. > * As far as I know, there hasn't been academic research or any testing into > how well this performs in terms of recall. I have a vague intuition it could > be harder to achieve high recall as the algorithm traverses areas further > from the "natural" entry points. The HNSW paper doesn't mention deletions/ > filtering, and I haven't seen community benchmarks around it. > Background links: > * Thoughts on deletions from the author of the HNSW paper: > [https://github.com/nmslib/hnswlib/issues/4#issuecomment-378739892] > * Blog from Vespa team which mentions combining KNN and search filters (very > similar to applying deleted docs): > [https://blog.vespa.ai/approximate-nearest-neighbor-search-in-vespa-part-1/]. > The "Exact vs Approximate" section shows good performance even when a large > percentage of documents are filtered out. The team mentioned to me they > didn't have the chance to measure recall, only latency. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #227: LUCENE-10033: Encode numeric doc values and ordinals of SORTED(_SET) doc values in blocks.
gsmiller commented on pull request #227: URL: https://github.com/apache/lucene/pull/227#issuecomment-888465657 This is really interesting/exciting! I'm working through this PR now but I notice you've used a slightly different approach to the FOR encoding (compared to what's done in the postings). Is this intentional for some reason, or is it more to get something out quickly for benchmarking (results were interesting by the way!)? Is there a reason you chose not to use the existing `ForUtil` directly?
[GitHub] [lucene] gautamworah96 commented on a change in pull request #220: LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version
gautamworah96 commented on a change in pull request #220: URL: https://github.com/apache/lucene/pull/220#discussion_r678493344 ## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java ## @@ -475,8 +477,15 @@ private int addCategoryDocument(FacetLabel categoryPath, int parent) throws IOEx String fieldPath = FacetsConfig.pathToString(categoryPath.components, categoryPath.length); fullPathField.setStringValue(fieldPath); + +if (useOlderStoredFieldIndex) { + fullPathField = new StringField(Consts.FULL, fieldPath, Field.Store.YES); Review comment: Hmmm. I guess I did not find it confusing but it is a bit strange for sure. The next commit initializes it upfront.
[GitHub] [lucene] jpountz commented on pull request #227: LUCENE-10033: Encode numeric doc values and ordinals of SORTED(_SET) doc values in blocks.
jpountz commented on pull request #227: URL: https://github.com/apache/lucene/pull/227#issuecomment-888479181 Indeed I wanted to get something out quickly for benchmarking where I could easily play with different block sizes, while ForUtil is very rigid (hardcoded block size of 128 and explicitly rejects numbers of bits per value > 32).
[jira] [Commented] (LUCENE-10033) Encode doc values in smaller blocks of values, like postings
[ https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388892#comment-17388892 ] Adrien Grand commented on LUCENE-10033: --- Unfortunately I noticed that the sorted queries that didn't become slower only didn't become slower because the field was also indexed with points, so the short-circuiting logic we have to progressively add a filter that only matches competitive documents hid the slowdown. If I hack the benchmark code to not use this optimization then sorted queries are all about 40-50% slower. > Encode doc values in smaller blocks of values, like postings > > > Key: LUCENE-10033 > URL: https://issues.apache.org/jira/browse/LUCENE-10033 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 0.5h > Remaining Estimate: 0h > > This is a follow-up to the discussion on this thread: > https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E. > Our current approach for doc values uses large blocks of 16k values where > values can be decompressed independently, using DirectWriter/DirectReader. > This is a bit inefficient in some cases, e.g. a single outlier can grow the > number of bits per value for the entire block, we can't easily use run-length > compression, etc. Plus, it encourages using a different sub-class for every > compression technique, which puts pressure on the JVM. > We'd like to move to an approach that would be more similar to postings with > smaller blocks (e.g. 128 values) whose values get all decompressed at once > (using SIMD instructions), with skip data within blocks in order to > efficiently skip to arbitrary doc IDs (or maybe still use jump tables as > today's doc values, and as discussed here for postings: > https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E). 
[jira] [Commented] (LUCENE-10016) VectorReader.search needs rethought, o.a.l.search integration?
[ https://issues.apache.org/jira/browse/LUCENE-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388899#comment-17388899 ] Michael Sokolov commented on LUCENE-10016:
---
As for the demo, there is a start on something we could use in luceneutil. It would require a fairly large word->vector dictionary though. I think maybe the way to do it is to provide instructions for downloading the dictionary rather than shipping it as part of the demo.

> VectorReader.search needs rethought, o.a.l.search integration?
>
> Key: LUCENE-10016
> URL: https://issues.apache.org/jira/browse/LUCENE-10016
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Robert Muir
> Priority: Blocker
> Fix For: 9.0
> Time Spent: 50m
> Remaining Estimate: 0h
>
> There's no search integration (e.g. queries) for the current vector values,
> no documentation/examples that I can find.
> Instead the codec has this method:
> {code}
> TopDocs search(String field, float[] target, int k, int fanout)
> {code}
> First, the "fanout" parameter needs to go; this is specific to the HNSW impl, get
> it out of here.
> Second, how am I supposed to skip over deleted documents? How can I use
> filters? How should I search across multiple segments?
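One way to think about the "how am I supposed to skip over deleted documents" question is that deletions need to be consulted while collecting the top-k, not by post-filtering a fixed-size result (which can silently return fewer than k live hits). Here is a toy brute-force sketch in plain Java; it is not the Lucene API, and all names are invented for illustration.

```java
// Toy sketch (not the Lucene API): honoring deletions during top-k
// collection by consulting a live-docs bitset inside the search loop.
import java.util.Arrays;
import java.util.BitSet;
import java.util.PriorityQueue;

public class KnnWithDeletesDemo {
    static float dot(float[] a, float[] b) {
        float s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    /** Doc IDs of the k live vectors most similar to target, best first. */
    static int[] search(float[][] vectors, BitSet liveDocs, float[] target, int k) {
        // Min-heap on score: the weakest of the current top-k is evicted first.
        PriorityQueue<double[]> heap =
            new PriorityQueue<>((x, y) -> Double.compare(x[0], y[0]));
        for (int doc = 0; doc < vectors.length; doc++) {
            if (!liveDocs.get(doc)) continue; // deleted docs never enter the heap
            double score = dot(vectors[doc], target);
            if (heap.size() < k) {
                heap.offer(new double[] {score, doc});
            } else if (score > heap.peek()[0]) {
                heap.poll();
                heap.offer(new double[] {score, doc});
            }
        }
        int[] result = new int[heap.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = (int) heap.poll()[1]; // pop worst-first, fill backwards
        }
        return result;
    }

    public static void main(String[] args) {
        float[][] vectors = { {1f, 0f}, {0.9f, 0.1f}, {0f, 1f}, {0.8f, 0f} };
        BitSet liveDocs = new BitSet();
        liveDocs.set(0, vectors.length);
        liveDocs.clear(0); // "delete" doc 0, the best match for the target
        System.out.println(
            Arrays.toString(search(vectors, liveDocs, new float[] {1f, 0f}, 2)));
        // prints [1, 3]: doc 0 is skipped and we still return k live hits
    }
}
```

A graph-based index like HNSW cannot do this by brute force, which is part of why the integration question (deletes, filters, multiple segments) is raised as a blocker above.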
[GitHub] [lucene] mayya-sharipova commented on a change in pull request #214: LUCENE-10027 provide leaf sorter from commit
mayya-sharipova commented on a change in pull request #214:
URL: https://github.com/apache/lucene/pull/214#discussion_r678507441

File path: lucene/core/src/java/org/apache/lucene/index/DirectoryReader.java

@@ -122,6 +122,23 @@ public static DirectoryReader open(final IndexCommit commit) throws IOException
     return StandardDirectoryReader.open(commit.getDirectory(), commit, null);
   }
+  /**
+   * Expert: returns an IndexReader reading the index in the given {@link IndexCommit}.
+   *
+   * @param commit the commit point to open
+   * @param leafSorter a comparator for sorting leaf readers. Providing leafSorter is useful for
+   *     indices on which it is expected to run many queries with particular sort criteria (e.g. for
+   *     time-based indices, this is usually a descending sort on timestamp). In this case {@code
+   *     leafSorter} should sort leaves according to this sort criteria. Providing leafSorter allows
+   *     to speed up this particular type of sort queries by early terminating while iterating
+   *     through segments and segments' documents
+   * @throws IOException if there is a low-level IO error
+   */
+  public static DirectoryReader open(final IndexCommit commit, Comparator leafSorter)

Review comment:
@dnhatn @jpountz Thanks for your comments. @jpountz About this comment:

> Since both minSupportedMajorVersion and leafSorter are super expert parameters, I think it's fine to require that users provide both instead of keeping adding new variants of DirectoryReader#open.

I am not happy either about adding a new variant of `DirectoryReader#open`, but are we OK to modify the current public API, `DirectoryReader open(IndexCommit, minSupportedMajorVersion)`, to add a new parameter? The modification will be in the minor release 8.10.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
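To illustrate why the leafSorter comparator under review helps time-based indices, here is a self-contained toy model in plain Java. This is not the Lucene API; the `Leaf` record and all numbers are invented. If leaves are visited in descending order of their max timestamp, a "latest k" query can stop as soon as the worst collected hit is at least the next leaf's maximum.

```java
// Toy model (not Lucene code): early termination over leaves sorted by
// a per-leaf max timestamp, newest first.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class LeafSorterDemo {
    /** Stand-in for a segment: its doc timestamps, sorted descending. */
    record Leaf(String name, long[] timestampsDesc) {
        long maxTimestamp() { return timestampsDesc[0]; }
    }

    /** Collect the k most recent timestamps, visiting leaves newest-first. */
    static List<Long> latest(List<Leaf> leaves, int k) {
        // The kind of comparator a caller would supply: newest leaf first.
        Comparator<Leaf> leafSorter =
            Comparator.comparingLong(Leaf::maxTimestamp).reversed();
        List<Leaf> ordered = new ArrayList<>(leaves);
        ordered.sort(leafSorter);

        PriorityQueue<Long> heap = new PriorityQueue<>(); // min-heap of best k
        for (Leaf leaf : ordered) {
            // Early termination: every later leaf has an even smaller max.
            if (heap.size() == k && heap.peek() >= leaf.maxTimestamp()) break;
            for (long ts : leaf.timestampsDesc) {
                if (heap.size() < k) heap.offer(ts);
                else if (ts > heap.peek()) { heap.poll(); heap.offer(ts); }
                else break; // timestamps within a leaf are descending too
            }
        }
        List<Long> out = new ArrayList<>(heap);
        out.sort(Comparator.reverseOrder());
        return out;
    }

    public static void main(String[] args) {
        List<Leaf> leaves = List.of(
            new Leaf("old", new long[] {60, 50}),
            new Leaf("new", new long[] {100, 90}),
            new Leaf("mid", new long[] {80, 70}));
        System.out.println(latest(leaves, 2)); // [100, 90]; "mid"/"old" never scanned
    }
}
```

Without the comparator, every leaf would have to be scanned; with it, the query terminates after the first leaf in this example, which is the speedup the javadoc in the diff above describes.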
[GitHub] [lucene] mayya-sharipova commented on a change in pull request #214: LUCENE-10027 provide leaf sorter from commit
mayya-sharipova commented on a change in pull request #214:
URL: https://github.com/apache/lucene/pull/214#discussion_r678516508

File path: lucene/core/src/java/org/apache/lucene/index/DirectoryReader.java (same review location and diff hunk as the comment above)

Review comment:
Actually, after reading the `CHANGES.txt` file, I've noticed many API changes even in minor versions. So, @jpountz, the commit a44c8120133aa01588c76b58321b8bffae0dd0c7 addresses your comment.
[GitHub] [lucene] gsmiller commented on pull request #227: LUCENE-10033: Encode numeric doc values and ordinals of SORTED(_SET) doc values in blocks.
gsmiller commented on pull request #227:
URL: https://github.com/apache/lucene/pull/227#issuecomment-888503779

> and explicitly rejects numbers of bits per value > 32

Ah right, of course this would be an issue here. Thanks for clarifying!
[GitHub] [lucene] gautamworah96 commented on pull request #175: LUCENE-9990: gradle7 support
gautamworah96 commented on pull request #175:
URL: https://github.com/apache/lucene/pull/175#issuecomment-888545634

I gave this PR another shot (since the Palantir plugin has been patched in v2.0.0 for Gradle 7 support), but some new issues came up. The good news: I *think* that using the `-Porg.gradle.java.installations.paths` command-line param points Gradle at that specific JDK for building and running the project. The bad news: since `JavaInstallationRegistry` is now deprecated, the build fails in multiple places (some that use the Java version to add specific JVM params, and others where we use the plugin to get the Java command to generate some javadoc). As of right now, I am just trial-and-erroring some code to see what works. Some WIP code is pushed [here](https://github.com/gautamworah96/lucene/pull/new/LUCENE-9990).
[GitHub] [lucene] sejal-pawar commented on pull request #159: LUCENE-9945: extend DrillSideways to expose FacetCollector and drill-down dimensions
sejal-pawar commented on pull request #159:
URL: https://github.com/apache/lucene/pull/159#issuecomment-888702505

> Thanks @sejal-pawar! This is more what I was originally describing in the Jira issue. Thanks for updating your PR!
>
> I left some small comments on variable naming, javadoc, etc. Otherwise this seems pretty close to me.
>
> It would be nice to add a test case though around this new functionality. Maybe you could write a test that relies on the newly-exposed FacetsCollectors and computes a Facets result that is expected to agree with the Facets result exposed already? That would be a nice way to confirm the correct collectors are getting exposed (and don't regress somehow with a future change). Because there are a number of different cases here (many different implementations of the static `search` method), you could leverage Lucene's randomized testing to randomly invoke different code paths (e.g., randomly provide a CollectorManager vs. a Collector; randomly provide an ExecutorService to the ctor; etc.).
>
> I suppose instead of randomized testing, you could also add on some checks to the existing test cases that also grab the FacetsCollectors from the result and validate them against the Facets that are already tested. That might actually be the easiest way to go about the testing. Have a look in `TestDrillSideways` for what we do currently.

Hey Greg, (apologies for the late reply!) I resolved the other comments, but while writing the test I noticed that a lot of test cases in DrillSidewaysResult involve the same logic for initialising DrillSideways, e.g. [1](https://code.amazon.com/packages/lucene/blobs/7a7003c51c8c0470f04e9df2ee9cb6002e124689/--/lucene/facet/src/test/org/apache/lucene/facet/TestDrillSideways.java#L1762). Would it perhaps make sense to extract the initialisation of DrillSideways into a helper test class called `DrillSidewaysInitialiser`?
I was thinking of encapsulating all the required pieces, like Directory and DirectoryTaxonomyWriter, into a single class. Something similar has been done for document generation and initialisation in `org.apache.lucene.index.DocHelper`.
[GitHub] [lucene] jpountz merged pull request #221: LUCENE-10031: Speed up SortedDocIdMerger on low-cardinality sort fields.
jpountz merged pull request #221: URL: https://github.com/apache/lucene/pull/221
[jira] [Commented] (LUCENE-10031) Speedup to SortedDocIDMerger when sorting on low-cardinality fields
[ https://issues.apache.org/jira/browse/LUCENE-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17389289#comment-17389289 ] ASF subversion and git services commented on LUCENE-10031:
---
Commit 0e6c3146d7853d27037213dc58eddc16a0e05daa in lucene's branch refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=0e6c314 ]

LUCENE-10031: Speed up SortedDocIdMerger on low-cardinality sort fields. (#221)

When sorting by low-cardinality fields, the same sub remains current for long sequences of doc IDs. This speeds up SortedDocIdMerger a bit by extracting the sub that leads iteration.

> Speedup to SortedDocIDMerger when sorting on low-cardinality fields
>
> Key: LUCENE-10031
> URL: https://issues.apache.org/jira/browse/LUCENE-10031
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Time Spent: 50m
> Remaining Estimate: 0h
>
> I've been looking at profiles of indexing with index sorting enabled and saw
> non-negligible time spent in SortedDocIDMerger. This isn't completely
> surprising, as this little class is called on every document whenever merging
> postings, doc values, stored fields, etc.
> I'm especially interested in cases where the sort key is on a low-cardinality
> field, so the priority queue doesn't get reordered often. I've been playing
> with a change to SortedDocIdMerger that makes merging significantly faster in
> that case.
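The "extract the sub that leads iteration" idea in the commit message can be illustrated with a self-contained sketch. This is plain Java, not the actual SortedDocIdMerger code; class names are invented. Instead of re-inserting into the priority queue after every value, we stay on the current leading sub until its next value exceeds the queue's top, which is cheap exactly when long runs of values come from the same sub (i.e., low-cardinality sort keys).

```java
// Sketch (not Lucene code): a k-way merge that keeps the leading sub out of
// the priority queue while it remains smallest, avoiding per-value reorders.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class LeadSubMergeDemo {
    // A sub to merge: a sorted array plus a cursor.
    static final class Sub {
        final int[] values;
        int pos = 0;
        Sub(int[] values) { this.values = values; }
        int current() { return values[pos]; }
        boolean exhausted() { return pos >= values.length; }
    }

    /** Merge sorted arrays, caching the leading sub between queue operations. */
    static List<Integer> merge(int[][] arrays) {
        PriorityQueue<Sub> queue =
            new PriorityQueue<>(Comparator.comparingInt(Sub::current));
        for (int[] a : arrays) if (a.length > 0) queue.offer(new Sub(a));
        List<Integer> out = new ArrayList<>();
        Sub lead = queue.poll();
        while (lead != null) {
            // Stay on the lead while it is still smallest: no queue reorder.
            int bound = queue.isEmpty() ? Integer.MAX_VALUE : queue.peek().current();
            while (!lead.exhausted() && lead.current() <= bound) {
                out.add(lead.current());
                lead.pos++;
            }
            if (!lead.exhausted()) queue.offer(lead); // lead fell behind: re-enter
            lead = queue.poll();                      // pick the new leader
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] arrays = { {1, 2, 3, 10, 11}, {4, 5, 6}, {7, 8, 9} };
        System.out.println(merge(arrays)); // [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
    }
}
```

When runs are long, the inner loop does one comparison per value against a cached bound instead of a heap sift, which matches the profile-driven motivation quoted in the issue.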