[jira] [Created] (LUCENE-10038) i have no issue
mahnoor jabbar created LUCENE-10038: --- Summary: i have no issue Key: LUCENE-10038 URL: https://issues.apache.org/jira/browse/LUCENE-10038 Project: Lucene - Core Issue Type: New Feature Components: core/FSTs Affects Versions: 8.8.2 Reporter: mahnoor jabbar [Carters Coupon Codes|https://uttercoupons.com/front/store-profile/carters-coupon-codes] is best code provided by Carter's. Amazing Discount Offers, Get Carter's Coupons & Promo Codes and save up to 50% on the offer, so get the code helps you to save on coupon and promo codes -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7020) TieredMergePolicy - cascade maxMergeAtOnce setting to maxMergeAtOnceExplicit
[ https://issues.apache.org/jira/browse/LUCENE-7020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388572#comment-17388572 ] Adrien Grand commented on LUCENE-7020: -- I've just seen a similar issue to the one that Shawn is describing. A small index (3.3GB) has more than 30 segments and ends up needing two rounds to be force-merged down to 1 segment. With the default settings, it takes 264s to force-merge this index. If I set the max number of segments to merge at once to 50, then force-merging down to 1 segment takes 190s, 28% faster. An alternative I'd like to propose would be to raise the default value of maxMergeAtOnceExplicit to 50 instead of 30. While 30-segment indices can be as small as 2.2GB with the default configuration (10 2MB segments, 10 20MB segments and 10 200MB segments), a 50-segment index must be at least 72GB (10 2MB segments, 10 20MB segments, 10 200MB segments, 10 2GB segments and 10 5GB segments). Or maybe we shouldn't limit the number of segments to merge at once with explicit merges? I understand the argument about read-ahead, but we also have data structures that are very CPU-intensive to merge, like stored fields with index sorting, vectors, or multi-dimensional points (when N>1), because they may need to recompute the data structure entirely. Avoiding cascading merges in such cases is very helpful. For the record, the example I gave above falls in none of these cases and yet already yields a significant speedup if it doesn't need to cascade merges. > TieredMergePolicy - cascade maxMergeAtOnce setting to maxMergeAtOnceExplicit > > > Key: LUCENE-7020 > URL: https://issues.apache.org/jira/browse/LUCENE-7020 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 5.4.1 >Reporter: Shawn Heisey >Assignee: Shawn Heisey >Priority: Major > Attachments: LUCENE-7020.patch > > > SOLR-8621 covers improvements in configuring a merge policy in Solr. 
> Discussions on that issue brought up the fact that if large values are > configured for maxMergeAtOnce and segmentsPerTier, but maxMergeAtOnceExplicit > is not changed, then doing a forceMerge is likely to not work as expected. > When I first configured maxMergeAtOnce and segmentsPerTier to 35 in Solr, I > saw an optimize (forceMerge) fully rewrite most of the index *twice* in order > to achieve a single segment, because there were approximately 80 segments in > the index before the optimize, and maxMergeAtOnceExplicit defaults to 30. On > advice given via the solr-user mailing list, I configured > maxMergeAtOnceExplicit to 105 and have not had that problem since. > I propose that setting maxMergeAtOnce should also set maxMergeAtOnceExplicit > to three times the new value -- unless the setMaxMergeAtOnceExplicit method > has been invoked, indicating that the user wishes to set that value > themselves.
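The minimum index sizes in Adrien's comment can be sanity-checked with a quick arithmetic sketch (Python used purely for illustration; the tier model of 10 segments per tier, 10x size growth from a 2MB floor, and a 5GB max merged segment size follows the defaults described in the comment):

```python
def min_index_size_mb(num_segments, floor_mb=2.0, segs_per_tier=10,
                      tier_growth=10, max_merged_mb=5 * 1024):
    """Smallest index (in MB) that can hold `num_segments` segments under
    the tiered model: tiers of 10 segments whose size grows 10x per tier,
    capped at the 5GB max merged segment size."""
    total, tier_mb, remaining = 0.0, floor_mb, num_segments
    while remaining > 0:
        n = min(segs_per_tier, remaining)
        total += n * tier_mb
        remaining -= n
        tier_mb = min(tier_mb * tier_growth, max_merged_mb)
    return total

print(min_index_size_mb(30) / 1024)  # ~2.2 GB for a 30-segment index
print(min_index_size_mb(50) / 1024)  # ~72 GB for a 50-segment index
```

This reproduces the comment's point: bumping maxMergeAtOnceExplicit from 30 to 50 only affects force-merges of indices that are at least tens of gigabytes.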
[jira] [Resolved] (LUCENE-9999) CombinedFieldQuery can fail when document is missing fields
[ https://issues.apache.org/jira/browse/LUCENE-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julie Tibshirani resolved LUCENE-9999. -- Fix Version/s: 8.10 Resolution: Fixed > CombinedFieldQuery can fail when document is missing fields > --- > > Key: LUCENE-9999 > URL: https://issues.apache.org/jira/browse/LUCENE-9999 > Project: Lucene - Core > Issue Type: Bug >Reporter: Julie Tibshirani >Priority: Major > Fix For: 8.10 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > If some documents match but don't contain all fields, then > {{CombinedFieldQuery}} can fail when attempting to compute norms. This is > because {{MultiFieldNormValues}} assumes all fields in the document have > norms. > Originally surfaced in this Elasticsearch issue: > https://github.com/elastic/elasticsearch/issues/74037.
[jira] [Created] (LUCENE-10039) With a single field, CombinedFieldQuery can score incorrectly
Julie Tibshirani created LUCENE-10039: - Summary: With a single field, CombinedFieldQuery can score incorrectly Key: LUCENE-10039 URL: https://issues.apache.org/jira/browse/LUCENE-10039 Project: Lucene - Core Issue Type: Bug Reporter: Julie Tibshirani When there's only one field, {{CombinedFieldQuery}} will ignore its weight while scoring. This makes the scoring inconsistent, since the field weight is supposed to multiply its term frequency. This can also come up when searching over multiple fields, when some segment happens to contain only one field. The problem was caught by this test: {code} ant test -Dtestcase=TestCombinedFieldQuery -Dtests.method=testCopyFieldWithMissingFields -Dtests.seed=8FA982798BC8FEF6 -Dtests.nightly=true {code}
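The inconsistency is easier to see with a toy model of how a combined-field query builds its pseudo term frequency (a sketch of the BM25F-style combination, not the actual Lucene code; the field names are made up):

```python
def combined_tf(field_freqs, field_weights):
    """BM25F-style pseudo term frequency: each field's term frequency is
    scaled by that field's weight before the frequencies are summed."""
    return sum(field_weights[f] * tf for f, tf in field_freqs.items())

# With two fields the weight is applied as expected:
print(combined_tf({"title": 1, "body": 3}, {"title": 2.0, "body": 1.0}))  # 5.0

# The reported bug: with a single field the weight was skipped, so a
# "title" field boosted by 2.0 scored as if its weight were 1.0.
print(combined_tf({"title": 1}, {"title": 2.0}))  # 2.0 (correct), buggy path gave 1.0
```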
[GitHub] [lucene] jtibshirani opened a new pull request #229: LUCENE-10039: Fix single-field scoring for CombinedFieldQuery
jtibshirani opened a new pull request #229: URL: https://github.com/apache/lucene/pull/229 When there's only one field, CombinedFieldQuery will ignore its weight while scoring. This makes the scoring inconsistent, since the field weight is supposed to multiply its term frequency. This PR removes the optimizations around single-field scoring to make sure the weight is always taken into account. These optimizations don't seem critical since it should be rare for CombinedFieldQuery to run over only one field. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jtibshirani commented on pull request #229: LUCENE-10039: Fix single-field scoring for CombinedFieldQuery
jtibshirani commented on pull request #229: URL: https://github.com/apache/lucene/pull/229#issuecomment-888241745 Great point, the existing test `testCopyFieldWithMissingFields` only very rarely triggers this case.
[GitHub] [lucene] rmuir commented on pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly
rmuir commented on pull request #225: URL: https://github.com/apache/lucene/pull/225#issuecomment-888270205 > This is super exciting! I'm amazed how little code you needed to get this first version running. but a runautomaton for this won't run any queries on its own: brute force isn't how these queries actually work. the important part is the intersection (skipping around)... I suggest, please let's not try to "overshare" and refactor all this stuff alongside DFA stuff until there is a query we can actually benchmark to see if the performance is even viable
[GitHub] [lucene] rmuir commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly
rmuir commented on a change in pull request #225: URL: https://github.com/apache/lucene/pull/225#discussion_r678255582 ## File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java ## @@ -0,0 +1,225 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.util.automaton; + +import java.util.Arrays; +import java.util.HashMap; +import java.util.Map; +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.hppc.BitMixer; + +/** + * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA + * state along with the run + * + * implemented based on: https://swtch.com/~rsc/regexp/regexp1.html + */ +public class NFARunAutomaton { + + /** state ordinal of "no such state" */ + public static final int MISSING = -1; + + private static final int NOT_COMPUTED = -2; + + private final Automaton automaton; + private final int[] points; + private final Map dStateToOrd = new HashMap<>(); // could init lazily? 
+ private DState[] dStates; + private final int alphabetSize; + + /** + * Constructor, assuming alphabet size is the whole codepoint space + * + * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for + * better efficiency + */ + public NFARunAutomaton(Automaton automaton) { +this(automaton, Character.MAX_CODE_POINT); + } + + /** + * Constructor + * + * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} * + * for better efficiency + * @param alphabetSize alphabet size + */ + public NFARunAutomaton(Automaton automaton, int alphabetSize) { +this.automaton = automaton; +points = automaton.getStartPoints(); +this.alphabetSize = alphabetSize; +dStates = new DState[10]; +findDState(new DState(new int[] {0})); + } + + /** + * For a given state and an incoming character (codepoint), return the next state + * + * @param state incoming state, should either be 0 or some state that is returned previously by + * this function + * @param c codepoint + * @return the next state or {@link #MISSING} if the transition doesn't exist + */ + public int step(int state, int c) { +assert dStates[state] != null; +return step(dStates[state], c); + } + + /** + * Run through a given codepoint array, return accepted or not, should only be used in test + * + * @param s String represented by an int array + * @return accept or not + */ + boolean run(int[] s) { Review comment: see my comment: we should avoid oversharing for now.
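The core idea in the patch, running an NFA by determinizing on the fly rather than building a full DFA up front, can be illustrated with a minimal sketch (illustrative only; the real NFARunAutomaton works on codepoint ranges and memoizes each discovered DFA state by ordinal, which this sketch omits):

```python
def nfa_run(transitions, accept, start, inputs):
    """Run an NFA by tracking the set of live states. Each step computes
    the set of reachable states, which is exactly the DFA state that full
    determinization would have precomputed (Thompson's approach)."""
    states = {start}
    for c in inputs:
        states = {t for s in states for t in transitions.get((s, c), ())}
        if not states:  # no outgoing transition: reject early (like MISSING)
            return False
    return bool(states & accept)

# NFA for the regex a|ab: state 0 --a--> {1, 2}, state 2 --b--> {3}; accept {1, 3}
nfa = {(0, "a"): {1, 2}, (2, "b"): {3}}
print(nfa_run(nfa, {1, 3}, 0, "a"))   # True
print(nfa_run(nfa, {1, 3}, 0, "ab"))  # True
print(nfa_run(nfa, {1, 3}, 0, "b"))   # False
```

Caching the state sets as they are discovered, as the patch does with its DState map, makes repeated runs approach DFA speed without paying the worst-case exponential determinization cost up front.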
[GitHub] [lucene] jpountz commented on a change in pull request #224: LUCENE-10035: Simple text codec add multi level skip list data
jpountz commented on a change in pull request #224: URL: https://github.com/apache/lucene/pull/224#discussion_r678249370 ## File path: lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextSkipReader.java ## @@ -0,0 +1,179 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.codecs.simpletext; + +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.FREQ; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.IMPACT; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.IMPACTS; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.IMPACTS_END; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.NORM; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.SKIP_DOC; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.SKIP_DOC_FP; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.SKIP_LIST; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; +import org.apache.lucene.codecs.MultiLevelSkipListReader; +import org.apache.lucene.index.Impact; +import org.apache.lucene.index.Impacts; +import org.apache.lucene.search.DocIdSetIterator; +import org.apache.lucene.store.BufferedChecksumIndexInput; +import org.apache.lucene.store.ChecksumIndexInput; +import org.apache.lucene.store.IndexInput; +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.BytesRefBuilder; +import org.apache.lucene.util.CharsRefBuilder; +import org.apache.lucene.util.StringHelper; + +/** + * This class reads skip lists with multiple levels. + * + * See {@link SimpleTextFieldsWriter} for the information about the encoding of the multi level + * skip lists. + * + * @lucene.experimental + */ +public class SimpleTextSkipReader extends MultiLevelSkipListReader { Review comment: can it be made pkg-private?
[GitHub] [lucene] jtibshirani merged pull request #229: LUCENE-10039: Fix single-field scoring for CombinedFieldQuery
jtibshirani merged pull request #229: URL: https://github.com/apache/lucene/pull/229
[jira] [Commented] (LUCENE-10039) With a single field, CombinedFieldQuery can score incorrectly
[ https://issues.apache.org/jira/browse/LUCENE-10039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388737#comment-17388737 ] ASF subversion and git services commented on LUCENE-10039: -- Commit e8663b30b85c1d48a8d18d37866a553895ffb8ae in lucene's branch refs/heads/main from Julie Tibshirani [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e8663b3 ] LUCENE-10039: Fix single-field scoring for CombinedFieldQuery (#229) When there's only one field, CombinedFieldQuery will ignore its weight while scoring. This makes the scoring inconsistent, since the field weight is supposed to multiply its term frequency. This PR removes the optimizations around single-field scoring to make sure the weight is always taken into account. These optimizations are not critical since it should be uncommon to use CombinedFieldQuery with only one field. > With a single field, CombinedFieldQuery can score incorrectly > - > > Key: LUCENE-10039 > URL: https://issues.apache.org/jira/browse/LUCENE-10039 > Project: Lucene - Core > Issue Type: Bug >Reporter: Julie Tibshirani >Priority: Minor > Time Spent: 20m > Remaining Estimate: 0h > > When there's only one field, {{CombinedFieldQuery}} will ignore its weight > while scoring. This makes the scoring inconsistent, since the field weight is > supposed to multiply its term frequency. > This can also come up when searching over multiple fields, when some segment > happens to contain only one field. The problem was caught by this test: > {code} > ant test -Dtestcase=TestCombinedFieldQuery > -Dtests.method=testCopyFieldWithMissingFields -Dtests.seed=8FA982798BC8FEF6 > -Dtests.nightly=true > {code}
[GitHub] [lucene] mikemccand commented on pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly
mikemccand commented on pull request #225: URL: https://github.com/apache/lucene/pull/225#issuecomment-888278654 > I suggest, please let's not try to "overshare" and refactor all this stuff alongside DFA stuff until there is a query we can actually benchmark to see if the performance is even viable OK yeah +1 to keep it wholly separate (full fork) for now until we learn more how this `NFARegexpQuery` behaves.
[jira] [Commented] (LUCENE-10035) Simple text codec add multi level skip list data
[ https://issues.apache.org/jira/browse/LUCENE-10035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388751#comment-17388751 ] Adrien Grand commented on LUCENE-10035: --- Wow! This is impressive work! > Simple text codec add multi level skip list data > -- > > Key: LUCENE-10035 > URL: https://issues.apache.org/jira/browse/LUCENE-10035 > Project: Lucene - Core > Issue Type: Wish > Components: core/codecs >Affects Versions: main (9.0) >Reporter: wuda >Priority: Major > Labels: Impact, MultiLevelSkipList, SimpleTextCodec > Time Spent: 20m > Remaining Estimate: 0h > > The simple text codec adds skip list data (including impacts) to help understand > the index format, for debugging, curiosity, and transparency only! When a term's > docFreq is greater than or equal to SimpleTextSkipWriter.BLOCK_SIZE (default > value is 8), the simple text codec will write a skip list, and the *.pst (simple > text term dictionary) file* will look like this > {code:java} > field title > term args > doc 2 > freq 2 > pos 7 > pos 10 > ## we omit docs for better view .. > doc 98 > freq 2 > pos 2 > pos 6 > skipList > ? > level 1 > skipDoc 65 > skipDocFP 949 > impacts > impact > freq 1 > norm 2 > impact > freq 2 > norm 12 > impact > freq 3 > norm 13 > impacts_end > ? 
> level 0 > skipDoc 17 > skipDocFP 284 > impacts > impact > freq 1 > norm 2 > impact > freq 2 > norm 12 > impacts_end > skipDoc 34 > skipDocFP 624 > impacts > impact > freq 1 > norm 2 > impact > freq 2 > norm 12 > impact > freq 3 > norm 14 > impacts_end > skipDoc 65 > skipDocFP 949 > impacts > impact > freq 1 > norm 2 > impact > freq 2 > norm 12 > impact > freq 3 > norm 13 > impacts_end > skipDoc 90 > skipDocFP 1311 > impacts > impact > freq 1 > norm 2 > impact > freq 2 > norm 10 > impact > freq 3 > norm 13 > impact > freq 4 > norm 14 > impacts_end > END > checksum 000829315543 > {code} > Compared with the previous format, we add *skipList, level, skipDoc, skipDocFP, impacts, > impact, freq, norm* nodes; at the same time, the simple text codec can support > advanceShallow at search time. > >
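A rough model of how the levels in such a dump come about (a sketch only, not the SimpleText code: it assumes a skip entry is written every BLOCK_SIZE documents and every BLOCK_SIZE-th entry is promoted to the next level; the real writer's intervals may differ):

```python
def skip_levels(doc_count, block_size=8, max_levels=4):
    """Return, per level, how many skip entries a posting list gets:
    level 0 gets one entry per block of `block_size` docs, level 1 one
    entry per `block_size` level-0 entries, and so on."""
    levels = []
    entries = doc_count // block_size
    while entries > 0 and len(levels) < max_levels:
        levels.append(entries)
        entries //= block_size
    return levels

# Under this model, a 98-doc posting list gets two levels of skip data,
# which is why the dump above has both a "level 1" and a "level 0" section.
print(skip_levels(98))  # [12, 1]
print(skip_levels(7))   # [] -- below BLOCK_SIZE, no skip list at all
```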
[GitHub] [lucene] mikemccand commented on pull request #157: LUCENE-9963 Fix issue with FlattenGraphFilter throwing exceptions from holes
mikemccand commented on pull request #157: URL: https://github.com/apache/lucene/pull/157#issuecomment-888287420 > We have been using this change internally for a few weeks now. We no longer encounter the ArrayIndexOutOfBounds exceptions that we were previously experiencing. Depending on the dataset/analyzer combination we have seen up to a 1% increase in the average number of tokens per field. This comes from the tokens that had previously been dropped now being correctly indexed. Thanks for the update @glawson0! This is great news.
[GitHub] [lucene] jpountz commented on pull request #220: LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version
jpountz commented on pull request #220: URL: https://github.com/apache/lucene/pull/220#issuecomment-888288068 > What is the use of the LiveIndexWriterConfig.createdVersionMajor It's very expert. It's necessary if you have multiple workers creating indices that you then want to merge together using `IndexWriter#addIndexes`. `addIndexes` requires that all indices have the same major version, so if you are doing a rolling upgrade on your workers to a new Lucene major, this helps ensure that all indices are created in a way that they can be merged eventually.
[GitHub] [lucene] dnhatn merged pull request #228: Remove unnecessary assertion
dnhatn merged pull request #228: URL: https://github.com/apache/lucene/pull/228
[jira] [Commented] (LUCENE-9304) Clean up DWPTPool
[ https://issues.apache.org/jira/browse/LUCENE-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388803#comment-17388803 ] ASF subversion and git services commented on LUCENE-9304: - Commit 03b1db91f9e5b11816274efd0da8503db27ccce0 in lucene's branch refs/heads/main from Shintaro Murakami [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=03b1db9 ] LUCENE-9304: Remove assertion in DocumentsWriterFlushControl (#228) This assertion becomes obvious after LUCENE-9304. > Clean up DWPTPool > -- > > Key: LUCENE-9304 > URL: https://issues.apache.org/jira/browse/LUCENE-9304 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: main (9.0), 8.6 >Reporter: Simon Willnauer >Assignee: Simon Willnauer >Priority: Major > Fix For: main (9.0), 8.6 > > Time Spent: 11h 10m > Remaining Estimate: 0h > > DWPTPool currently uses an indirection called ThreadState to hold DWPT > instances. This class holds several pieces of information that belong in other places, > inherits from ReentrantLock, and has a mutable nature. Instead we could pool > the DWPT directly and remove other indirections inside DWPTFlushControl if > we move some of the ThreadState properties to DWPT directly. The thread pool > also has a problem in that it grows its ThreadStates to the number of > concurrently indexing threads but never shrinks them if that number is reduced. With > pooling DWPT directly this limitation could be removed. > In summary, this component has seen quite some refactoring and requires some > cleanups and docs changes in order to stand the test of time.
[GitHub] [lucene] dnhatn commented on pull request #228: Remove unnecessary assertion
dnhatn commented on pull request #228: URL: https://github.com/apache/lucene/pull/228#issuecomment-888337488 Merged, thanks @mrkm4ntr. It's preferable to open a Jira ticket before submitting a pull request, but it's okay for this change as it's straightforward.
[jira] [Created] (LUCENE-10040) Handle deletions in nearest vector search
Julie Tibshirani created LUCENE-10040: - Summary: Handle deletions in nearest vector search Key: LUCENE-10040 URL: https://issues.apache.org/jira/browse/LUCENE-10040 Project: Lucene - Core Issue Type: Improvement Reporter: Julie Tibshirani Currently nearest vector search doesn't account for deleted documents. Even if a document is not in {{LeafReader#getLiveDocs}}, it could still be returned from {{LeafReader#searchNearestVectors}}. This seems like it'd be surprising + difficult for users, since other search APIs account for deleted docs. We've discussed extending {{searchNearestVectors}} to take a parameter like {{Bits liveDocs}}. This issue discusses options around adding support. One approach is to just filter out deleted docs after running the KNN search. This behavior seems hard to work with as a user: fewer than {{k}} docs might come back from your KNN search! Alternatively, {{LeafReader#searchNearestVectors}} could always return the {{k}} nearest undeleted docs. To implement this, HNSW could omit deleted docs while assembling its candidate list. It would traverse further into the graph, visiting more nodes to ensure it gathers the required candidates. (Note deleted docs would still be visited/traversed). The [hnswlib library|https://github.com/nmslib/hnswlib] contains an implementation like this, where you can mark documents as deleted and they're skipped during search. This approach seems reasonable to me, but there are some challenges: * Performance can be unpredictable. If deletions are random, it shouldn't have a huge effect. But in the worst case, a segment could have 50% deleted docs, and they all happen to be near the query vector. HNSW would need to traverse through around half the entire graph to collect neighbors. * As far as I know, there hasn't been academic research or any testing into how well this performs in terms of recall. 
I have a vague intuition it could be harder to achieve high recall as the algorithm traverses areas further from the "natural" entry points. The HNSW paper doesn't mention deletions/filtering, and I haven't seen community benchmarks around it. Background links: * Thoughts on deletions from the author of the HNSW paper: [https://github.com/nmslib/hnswlib/issues/4#issuecomment-378739892] * Blog from Vespa team which mentions combining KNN and search filters (very similar to applying deleted docs): [https://blog.vespa.ai/approximate-nearest-neighbor-search-in-vespa-part-1/]. The "Exact vs Approximate" section shows good performance even when a large percentage of documents are filtered out. The team mentioned to me they didn't have the chance to measure recall, only latency.
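The second option, traversing through deleted nodes while never returning them, can be sketched on a plain graph (a toy best-first search over a made-up graph, not HNSW itself, and with no early-termination condition):

```python
import heapq

def nearest_undeleted(graph, dist, entry, k, deleted):
    """Best-first graph search that still routes the traversal through
    deleted nodes but only collects undeleted ones as results."""
    visited, results = {entry}, []
    frontier = [(dist(entry), entry)]
    while frontier:
        d, node = heapq.heappop(frontier)
        if node not in deleted:
            results.append((d, node))
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(frontier, (dist(nb), nb))
    return [n for _, n in sorted(results)[:k]]

# Nodes on a line, query at position 0; node 1 is deleted but still
# bridges 0 -> 2, so the search finds k undeleted neighbors anyway.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
position = {0: 0.0, 1: 1.0, 2: 2.0, 3: 3.0}
print(nearest_undeleted(graph, lambda n: position[n], 0, 2, deleted={1}))  # [0, 2]
```

The worst case described above is visible in this model too: if many nodes near the query are deleted, the frontier must expand much further before k undeleted results accumulate.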
[jira] [Commented] (LUCENE-10016) VectorReader.search needs rethought, o.a.l.search integration?
[ https://issues.apache.org/jira/browse/LUCENE-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388806#comment-17388806 ] Julie Tibshirani commented on LUCENE-10016: --- Deletions are an interesting topic, I opened https://issues.apache.org/jira/browse/LUCENE-10040 for a dedicated discussion. Maybe we could close this issue in favor of that one and also https://issues.apache.org/jira/browse/LUCENE-9614, which discusses a high-level API for KNN search? If we close this, we should decide if we want to transfer its "blocker" status to those issues. > VectorReader.search needs rethought, o.a.l.search integration? > -- > > Key: LUCENE-10016 > URL: https://issues.apache.org/jira/browse/LUCENE-10016 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Blocker > Fix For: 9.0 > > Time Spent: 50m > Remaining Estimate: 0h > > There's no search integration (e.g. queries) for the current vector values, > no documentation/examples that I can find. > Instead the codec has this method: > {code} > TopDocs search(String field, float[] target, int k, int fanout) > {code} > First, the "fanout" parameter needs to go, this is specific to HNSW impl, get > it out of here. > Second, How am I supposed to skip over deleted documents? How can I use > filters? How should i search across multiple segments? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Deleted] (LUCENE-10038) i have no issue
[ https://issues.apache.org/jira/browse/LUCENE-10038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe deleted LUCENE-10038: - > i have no issue > --- > > Key: LUCENE-10038 > URL: https://issues.apache.org/jira/browse/LUCENE-10038 > Project: Lucene - Core > Issue Type: New Feature >Reporter: mahnoor jabbar >Priority: Major
[GitHub] [lucene-solr] jtibshirani opened a new pull request #2535: LUCENE-10039: Fix single-field scoring for CombinedFieldQuery
jtibshirani opened a new pull request #2535: URL: https://github.com/apache/lucene-solr/pull/2535 When there's only one field, CombinedFieldQuery will ignore its weight while scoring. This makes the scoring inconsistent, since the field weight is supposed to multiply its term frequency. This PR removes the optimizations around single-field scoring to make sure the weight is always taken into account. These optimizations are not critical since it should be uncommon to use CombinedFieldQuery with only one field. This backport also incorporates the part of LUCENE-9823 that applies to CombinedFieldQuery. We no longer rewrite single-field queries, which can also change their scoring. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
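The scoring rule at stake can be sketched with a simplified combined-frequency model (hypothetical names, not the actual CombinedFieldQuery code): each field's weight multiplies its term frequency before the frequencies are combined, so even with a single field the weight must be applied:

```java
public class CombinedTf {
    // Hedged sketch of the intended semantics: each field's term frequency is
    // scaled by its weight, then the scaled frequencies are summed into one
    // pseudo-frequency that feeds the similarity.
    static double combinedFreq(double[] weights, int[] freqs) {
        double tf = 0;
        for (int i = 0; i < freqs.length; i++) {
            tf += weights[i] * freqs[i];
        }
        return tf;
    }

    public static void main(String[] args) {
        // With a single field of weight 2.0 and tf=3, the combined frequency
        // must be 6.0; a code path that drops the weight (the bug) yields 3.0.
        System.out.println(combinedFreq(new double[]{2.0}, new int[]{3})); // 6.0
    }
}
```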
[GitHub] [lucene-solr] jtibshirani merged pull request #2535: LUCENE-10039: Fix single-field scoring for CombinedFieldQuery
jtibshirani merged pull request #2535: URL: https://github.com/apache/lucene-solr/pull/2535
[jira] [Resolved] (LUCENE-10039) With a single field, CombinedFieldQuery can score incorrectly
[ https://issues.apache.org/jira/browse/LUCENE-10039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julie Tibshirani resolved LUCENE-10039. --- Fix Version/s: 8.10 Resolution: Fixed > With a single field, CombinedFieldQuery can score incorrectly > - > > Key: LUCENE-10039 > URL: https://issues.apache.org/jira/browse/LUCENE-10039 > Project: Lucene - Core > Issue Type: Bug >Reporter: Julie Tibshirani >Priority: Minor > Fix For: 8.10 > > Time Spent: 50m > Remaining Estimate: 0h > > When there's only one field, {{CombinedFieldQuery}} will ignore its weight > while scoring. This makes the scoring inconsistent, since the field weight is > supposed to multiply its term frequency. > This can also come up when searching over multiple fields, when some segment > happens to contain only one field. The problem was caught by this test: > {code} > ant test -Dtestcase=TestCombinedFieldQuery > -Dtests.method=testCopyFieldWithMissingFields -Dtests.seed=8FA982798BC8FEF6 > -Dtests.nightly=true > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9823) SynonymQuery rewrite can change field boost calculation
[ https://issues.apache.org/jira/browse/LUCENE-9823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388814#comment-17388814 ] ASF subversion and git services commented on LUCENE-9823: - Commit e31762253fcf7ef85fa0c09fdb40d3daf201a9d1 in lucene-solr's branch refs/heads/branch_8x from Julie Tibshirani [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e317622 ] LUCENE-10039: Fix single-field scoring for CombinedFieldQuery (#2535) When there's only one field, CombinedFieldQuery will ignore its weight while scoring. This makes the scoring inconsistent, since the field weight is supposed to multiply its term frequency. This PR removes the optimizations around single-field scoring to make sure the weight is always taken into account. These optimizations are not critical since it should be uncommon to use CombinedFieldQuery with only one field. This backport also incorporates the part of LUCENE-9823 that applies to CombinedFieldQuery. We no longer rewrite single-field queries, which can also change their scoring. > SynonymQuery rewrite can change field boost calculation > --- > > Key: LUCENE-9823 > URL: https://issues.apache.org/jira/browse/LUCENE-9823 > Project: Lucene - Core > Issue Type: Bug >Reporter: Julie Tibshirani >Priority: Minor > Labels: newdev > Fix For: main (9.0) > > Time Spent: 1h > Remaining Estimate: 0h > > SynonymQuery accepts a boost per term, which acts as a multiplier on the term > frequency in the document. When rewriting a SynonymQuery with a single term, > we create a BoostQuery wrapping a TermQuery. This changes the meaning of the > boost: it now multiplies the final TermQuery score instead of multiplying the > term frequency before it's passed to the score calculation. > This is a small point, but maybe it's worth avoiding rewriting a single-term > SynonymQuery unless the boost is 1.0. > The same consideration affects CombinedFieldQuery in sandbox. 
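The difference between boosting the term frequency and boosting the final score, as described in LUCENE-9823 above, comes from the non-linearity of the tf curve. A sketch with a simplified saturating tf function in the spirit of BM25 (not Lucene's actual similarity code):

```java
public class BoostPlacement {
    // Simplified saturating term-frequency weight, in the spirit of BM25's
    // tf component: tf / (tf + k1).
    static double tfWeight(double tf) {
        double k1 = 1.2;
        return tf / (tf + k1);
    }

    public static void main(String[] args) {
        double boost = 2.0, tf = 3.0;
        double boostFreq = tfWeight(boost * tf);  // SynonymQuery semantics: boost scales tf
        double boostScore = boost * tfWeight(tf); // BoostQuery(TermQuery) semantics: boost scales score
        // Because the curve saturates, 6/(6+1.2) ≈ 0.83 while 2 * 3/(3+1.2) ≈ 1.43,
        // so rewriting a boosted single-term SynonymQuery changes the score.
        System.out.println(boostFreq != boostScore); // true
    }
}
```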
[jira] [Commented] (LUCENE-10039) With a single field, CombinedFieldQuery can score incorrectly
[ https://issues.apache.org/jira/browse/LUCENE-10039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388813#comment-17388813 ] ASF subversion and git services commented on LUCENE-10039: -- Commit e31762253fcf7ef85fa0c09fdb40d3daf201a9d1 in lucene-solr's branch refs/heads/branch_8x from Julie Tibshirani [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e317622 ] LUCENE-10039: Fix single-field scoring for CombinedFieldQuery (#2535) When there's only one field, CombinedFieldQuery will ignore its weight while scoring. This makes the scoring inconsistent, since the field weight is supposed to multiply its term frequency. This PR removes the optimizations around single-field scoring to make sure the weight is always taken into account. These optimizations are not critical since it should be uncommon to use CombinedFieldQuery with only one field. This backport also incorporates the part of LUCENE-9823 that applies to CombinedFieldQuery. We no longer rewrite single-field queries, which can also change their scoring. > With a single field, CombinedFieldQuery can score incorrectly > - > > Key: LUCENE-10039 > URL: https://issues.apache.org/jira/browse/LUCENE-10039 > Project: Lucene - Core > Issue Type: Bug >Reporter: Julie Tibshirani >Priority: Minor > Fix For: 8.10 > > Time Spent: 50m > Remaining Estimate: 0h > > When there's only one field, {{CombinedFieldQuery}} will ignore its weight > while scoring. This makes the scoring inconsistent, since the field weight is > supposed to multiply its term frequency. > This can also come up when searching over multiple fields, when some segment > happens to contain only one field. 
The problem was caught by this test: > {code} > ant test -Dtestcase=TestCombinedFieldQuery > -Dtests.method=testCopyFieldWithMissingFields -Dtests.seed=8FA982798BC8FEF6 > -Dtests.nightly=true > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10015) Remove VectorValues.SimilarityFunction.NONE
[ https://issues.apache.org/jira/browse/LUCENE-10015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388834#comment-17388834 ] Julie Tibshirani commented on LUCENE-10015: --- > In case it's helpful context: currently we only support Euclidean and cosine >distance, which is technically redundant We actually only support Euclidean and dot product! Sorry if this caused confusion, I have no idea why I thought we implemented cosine instead. > Remove VectorValues.SimilarityFunction.NONE > --- > > Key: LUCENE-10015 > URL: https://issues.apache.org/jira/browse/LUCENE-10015 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Blocker > Fix For: 9.0 > > Time Spent: 50m > Remaining Estimate: 0h > > This stuff is HNSW-implementation specific. It can be moved to a codec > parameter. > The NONE option should be removed: it just makes the codec more complex. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10015) Remove VectorValues.SimilarityFunction.NONE
[ https://issues.apache.org/jira/browse/LUCENE-10015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388834#comment-17388834 ] Julie Tibshirani edited comment on LUCENE-10015 at 7/28/21, 3:09 PM: - {quote}In case it's helpful context: currently we only support Euclidean and cosine distance, which is technically redundant {quote} We actually only support Euclidean and dot product! Sorry if this caused confusion, I have no idea why I thought we implemented cosine instead. was (Author: julietibs): > In case it's helpful context: currently we only support Euclidean and cosine >distance, which is technically redundant We actually only support Euclidean and dot product! Sorry if this caused confusion, I have no idea why I thought we implemented cosine instead. > Remove VectorValues.SimilarityFunction.NONE > --- > > Key: LUCENE-10015 > URL: https://issues.apache.org/jira/browse/LUCENE-10015 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Blocker > Fix For: 9.0 > > Time Spent: 50m > Remaining Estimate: 0h > > This stuff is HNSW-implementation specific. It can be moved to a codec > parameter. > The NONE option should be removed: it just makes the codec more complex. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
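For reference, the two supported similarity functions can be written out in plain Java (hypothetical helper names, not Lucene's VectorValues code). Cosine similarity reduces to dot product on unit-length vectors, which is why supporting dot product makes a separate cosine option largely redundant:

```java
public class VectorSimilarity {
    // Squared Euclidean distance between two vectors of equal dimension.
    static double squaredEuclidean(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Dot product; equals cosine similarity when both vectors are unit-normalized.
    static double dotProduct(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    public static void main(String[] args) {
        float[] a = {1f, 0f}, b = {0f, 1f};
        System.out.println(squaredEuclidean(a, b)); // 2.0
        System.out.println(dotProduct(a, b));       // 0.0
    }
}
```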
[GitHub] [lucene] wuda0112 commented on a change in pull request #224: LUCENE-10035: Simple text codec add multi level skip list data
wuda0112 commented on a change in pull request #224: URL: https://github.com/apache/lucene/pull/224#discussion_r678424267 ## File path: lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextSkipReader.java ## @@ -0,0 +1,179 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.codecs.simpletext; + +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.FREQ; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.IMPACT; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.IMPACTS; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.IMPACTS_END; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.NORM; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.SKIP_DOC; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.SKIP_DOC_FP; +import static org.apache.lucene.codecs.simpletext.SimpleTextSkipWriter.SKIP_LIST; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; +import org.apache.lucene.codecs.MultiLevelSkipListReader; +import org.apache.lucene.index.Impact; +import org.apache.lucene.index.Impacts; +import org.apache.lucene.search.DocIdSetIterator; +import org.apache.lucene.store.BufferedChecksumIndexInput; +import org.apache.lucene.store.ChecksumIndexInput; +import org.apache.lucene.store.IndexInput; +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.BytesRefBuilder; +import org.apache.lucene.util.CharsRefBuilder; +import org.apache.lucene.util.StringHelper; + +/** + * This class reads skip lists with multiple levels. + * + * See {@link SimpleTextFieldsWriter} for the information about the encoding of the multi level + * skip lists. + * + * @lucene.experimental + */ +public class SimpleTextSkipReader extends MultiLevelSkipListReader { Review comment: Sorry, I ran the tests on the wrong git branch, which is why they passed. When I realized that, I converted the PR to a draft; I will test it carefully again until all unit tests pass. And yes, it should be pkg-private.
[GitHub] [lucene] mikemccand commented on a change in pull request #157: LUCENE-9963 Fix issue with FlattenGraphFilter throwing exceptions from holes
mikemccand commented on a change in pull request #157: URL: https://github.com/apache/lucene/pull/157#discussion_r678412320 ## File path: lucene/core/src/java/org/apache/lucene/analysis/AutomatonToTokenStream.java ## @@ -0,0 +1,197 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.analysis; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.LinkedList; +import java.util.List; +import java.util.Map; +import org.apache.lucene.analysis.tokenattributes.CharTermAttribute; +import org.apache.lucene.analysis.tokenattributes.OffsetAttribute; +import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute; +import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute; +import org.apache.lucene.util.automaton.Automaton; +import org.apache.lucene.util.automaton.Operations; +import org.apache.lucene.util.automaton.Transition; + +/** Converts an Automaton into a TokenStream. */ +public class AutomatonToTokenStream { Review comment: Whoa, awesome! This will be a really helpful infrastructure for future testing! 
And of course now I really feel compelled to take a Lev1("lucene") and run it through here and watch what the resulting tokens are! ## File path: lucene/core/src/java/org/apache/lucene/analysis/AutomatonToTokenStream.java ## @@ -0,0 +1,197 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.analysis; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.LinkedList; +import java.util.List; +import java.util.Map; +import org.apache.lucene.analysis.tokenattributes.CharTermAttribute; +import org.apache.lucene.analysis.tokenattributes.OffsetAttribute; +import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute; +import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute; +import org.apache.lucene.util.automaton.Automaton; +import org.apache.lucene.util.automaton.Operations; +import org.apache.lucene.util.automaton.Transition; + +/** Converts an Automaton into a TokenStream. */ +public class AutomatonToTokenStream { + + private AutomatonToTokenStream() {} + + /** + * converts an automaton into a TokenStream. This is done by first Topo sorting the nodes in the + * Automaton. 
Nodes that have the same distance from the start are grouped together to form the + * position nodes for the TokenStream. The resulting TokenStream releases edges from the automaton + * as tokens in order from the position nodes. This requires the automaton be a finite DAG. + * + * @param automaton automaton to convert. Must be a finite DAG. + * @return TokenStream representation of automaton. + */ + public static TokenStream toTokenStream(Automaton automaton) { +if (Operations.isFinite(automaton) == false) { + throw new IllegalArgumentException("Automaton must be finite"); +} + +List> positionNodes = new ArrayList<>(); + +Transition[][] transitions = automaton.getSortedTransitions(); + +int[] indegree = new int[transitions.length]; + +for (int i = 0; i < transitions.length; i++) { + for (int edge = 0; edge < transitions[i].length; edge++) { +indegree[transitions[i][edge].dest] += 1; + } +} +if (indegree[0] != 0) { + throw new IllegalArgumentException("Start node has incoming edges, crea
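The grouping step described in the javadoc above can be sketched independently of Lucene (hypothetical names; Kahn's algorithm over an adjacency list, bucketing nodes by their longest distance from the start node so each bucket becomes one token position):

```java
import java.util.*;

public class TopoLevels {
    // Hedged sketch: bucket the nodes of a DAG by longest distance from node 0.
    // In AutomatonToTokenStream-style conversion, each bucket would become one
    // position node of the resulting token stream.
    static List<List<Integer>> levels(int numNodes, int[][] edges) {
        int[] indegree = new int[numNodes];
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < numNodes; i++) adj.add(new ArrayList<>());
        for (int[] e : edges) {
            adj.get(e[0]).add(e[1]);
            indegree[e[1]]++;
        }
        if (indegree[0] != 0) throw new IllegalArgumentException("Start node has incoming edges");
        int[] dist = new int[numNodes];
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(0);
        int maxDist = 0;
        while (!queue.isEmpty()) { // Kahn's algorithm: process nodes in topological order
            int node = queue.poll();
            maxDist = Math.max(maxDist, dist[node]);
            for (int dest : adj.get(node)) {
                dist[dest] = Math.max(dist[dest], dist[node] + 1);
                if (--indegree[dest] == 0) queue.add(dest);
            }
        }
        List<List<Integer>> groups = new ArrayList<>();
        for (int i = 0; i <= maxDist; i++) groups.add(new ArrayList<>());
        for (int i = 0; i < numNodes; i++) groups.get(dist[i]).add(i);
        return groups;
    }

    public static void main(String[] args) {
        // 0 -> 1 -> 3 and 0 -> 2 -> 3: nodes 1 and 2 share a position.
        System.out.println(levels(4, new int[][]{{0, 1}, {0, 2}, {1, 3}, {2, 3}}));
        // [[0], [1, 2], [3]]
    }
}
```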
[jira] [Commented] (LUCENE-10016) VectorReader.search needs rethought, o.a.l.search integration?
[ https://issues.apache.org/jira/browse/LUCENE-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388866#comment-17388866 ] Adrien Grand commented on LUCENE-10016: --- One thing that would still be missing would be the oal.demo integration. At the same time I'm unsure if we can easily add vector search to the demo as we'd need a way to turn some data that exists on the user computer into vectors in a way that nearest-neighbor search makes sense. > VectorReader.search needs rethought, o.a.l.search integration? > -- > > Key: LUCENE-10016 > URL: https://issues.apache.org/jira/browse/LUCENE-10016 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Blocker > Fix For: 9.0 > > Time Spent: 50m > Remaining Estimate: 0h > > There's no search integration (e.g. queries) for the current vector values, > no documentation/examples that I can find. > Instead the codec has this method: > {code} > TopDocs search(String field, float[] target, int k, int fanout) > {code} > First, the "fanout" parameter needs to go, this is specific to HNSW impl, get > it out of here. > Second, How am I supposed to skip over deleted documents? How can I use > filters? How should i search across multiple segments? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10016) VectorReader.search needs rethought, o.a.l.search integration?
[ https://issues.apache.org/jira/browse/LUCENE-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388868#comment-17388868 ] Robert Muir commented on LUCENE-10016: -- Even if it isn't in the o.a.l.demo module, a simple test similar to "TestDemo" would be a great step: https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/TestDemo.java By this, I mean a high-level unit test that uses indexwriter/indexsearcher/queries and not low-level codec apis. > VectorReader.search needs rethought, o.a.l.search integration? > -- > > Key: LUCENE-10016 > URL: https://issues.apache.org/jira/browse/LUCENE-10016 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Blocker > Fix For: 9.0 > > Time Spent: 50m > Remaining Estimate: 0h > > There's no search integration (e.g. queries) for the current vector values, > no documentation/examples that I can find. > Instead the codec has this method: > {code} > TopDocs search(String field, float[] target, int k, int fanout) > {code} > First, the "fanout" parameter needs to go, this is specific to HNSW impl, get > it out of here. > Second, How am I supposed to skip over deleted documents? How can I use > filters? How should i search across multiple segments? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10040) Handle deletions in nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388877#comment-17388877 ] Adrien Grand commented on LUCENE-10040: --- bq. Performance can be unpredictable. If deletions are random, it shouldn't have a huge effect. But in the worst case, a segment could have 50% deleted docs, and they all happen to be near the query vector. HNSW would need to traverse through around half the entire graph to collect neighbors. FWIW this is a general problem with Lucene. For instance if you run a term query, we'll use impacts to know which blocks may contain competitive documents, but we could hit a worst-case scenario where the documents that make the block competitive are deleted, and all the non-deleted documents of the block are not competitive. > Handle deletions in nearest vector search > - > > Key: LUCENE-10040 > URL: https://issues.apache.org/jira/browse/LUCENE-10040 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Julie Tibshirani >Priority: Major > > Currently nearest vector search doesn't account for deleted documents. Even > if a document is not in {{LeafReader#getLiveDocs}}, it could still be > returned from {{LeafReader#searchNearestVectors}}. This seems like it'd be > surprising + difficult for users, since other search APIs account for deleted > docs. We've discussed extending {{searchNearestVectors}} to take a parameter > like {{Bits liveDocs}}. This issue discusses options around adding support. > One approach is to just filter out deleted docs after running the KNN search. > This behavior seems hard to work with as a user: fewer than {{k}} docs might > come back from your KNN search! > Alternatively, {{LeafReader#searchNearestVectors}} could always return the > {{k}} nearest undeleted docs. To implement this, HNSW could omit deleted docs > while assembling its candidate list. 
It would traverse further into the > graph, visiting more nodes to ensure it gathers the required candidates. > (Note deleted docs would still be visited/ traversed). The [hnswlib > library|https://github.com/nmslib/hnswlib] contains an implementation like > this, where you can mark documents as deleted and they're skipped during > search. > This approach seems reasonable to me, but there are some challenges: > * Performance can be unpredictable. If deletions are random, it shouldn't > have a huge effect. But in the worst case, a segment could have 50% deleted > docs, and they all happen to be near the query vector. HNSW would need to > traverse through around half the entire graph to collect neighbors. > * As far as I know, there hasn't been academic research or any testing into > how well this performs in terms of recall. I have a vague intuition it could > be harder to achieve high recall as the algorithm traverses areas further > from the "natural" entry points. The HNSW paper doesn't mention deletions/ > filtering, and I haven't seen community benchmarks around it. > Background links: > * Thoughts on deletions from the author of the HNSW paper: > [https://github.com/nmslib/hnswlib/issues/4#issuecomment-378739892] > * Blog from Vespa team which mentions combining KNN and search filters (very > similar to applying deleted docs): > [https://blog.vespa.ai/approximate-nearest-neighbor-search-in-vespa-part-1/]. > The "Exact vs Approximate" section shows good performance even when a large > percentage of documents are filtered out. The team mentioned to me they > didn't have the chance to measure recall, only latency. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #227: LUCENE-10033: Encode numeric doc values and ordinals of SORTED(_SET) doc values in blocks.
gsmiller commented on pull request #227: URL: https://github.com/apache/lucene/pull/227#issuecomment-888465657 This is really interesting/exciting! I'm working through this PR now but I notice you've used a slightly different approach to the FOR encoding (compared to what's done in the postings). Is this intentional for some reason, or is it more to get something out quickly for benchmarking (results were interesting by the way!)? Is there a reason you chose not to use the existing `ForUtil` directly?
[GitHub] [lucene] gautamworah96 commented on a change in pull request #220: LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version
gautamworah96 commented on a change in pull request #220: URL: https://github.com/apache/lucene/pull/220#discussion_r678493344 ## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java ## @@ -475,8 +477,15 @@ private int addCategoryDocument(FacetLabel categoryPath, int parent) throws IOEx String fieldPath = FacetsConfig.pathToString(categoryPath.components, categoryPath.length); fullPathField.setStringValue(fieldPath); + +if (useOlderStoredFieldIndex) { + fullPathField = new StringField(Consts.FULL, fieldPath, Field.Store.YES); Review comment: Hmmm. I guess I did not find it confusing but it is a bit strange for sure. The next commit initializes it upfront.
[GitHub] [lucene] jpountz commented on pull request #227: LUCENE-10033: Encode numeric doc values and ordinals of SORTED(_SET) doc values in blocks.
jpountz commented on pull request #227: URL: https://github.com/apache/lucene/pull/227#issuecomment-888479181 Indeed I wanted to get something out quickly for benchmarking where I could easily play with different block sizes, while ForUtil is very rigid (hardcoded block size of 128 and explicitly rejects numbers of bits per value > 32).
[jira] [Commented] (LUCENE-10033) Encode doc values in smaller blocks of values, like postings
[ https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388892#comment-17388892 ] Adrien Grand commented on LUCENE-10033: --- Unfortunately I noticed that the sorted queries that didn't become slower only didn't become slower because the field was also indexed with points, so the short-circuiting logic we have to progressively add a filter that only matches competitive documents hid the slowdown. If I hack the benchmark code to not use this optimization then sorted queries are all about 40-50% slower. > Encode doc values in smaller blocks of values, like postings > > > Key: LUCENE-10033 > URL: https://issues.apache.org/jira/browse/LUCENE-10033 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 0.5h > Remaining Estimate: 0h > > This is a follow-up to the discussion on this thread: > https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E. > Our current approach for doc values uses large blocks of 16k values where > values can be decompressed independently, using DirectWriter/DirectReader. > This is a bit inefficient in some cases, e.g. a single outlier can grow the > number of bits per value for the entire block, we can't easily use run-length > compression, etc. Plus, it encourages using a different sub-class for every > compression technique, which puts pressure on the JVM. > We'd like to move to an approach that would be more similar to postings with > smaller blocks (e.g. 128 values) whose values get all decompressed at once > (using SIMD instructions), with skip data within blocks in order to > efficiently skip to arbitrary doc IDs (or maybe still use jump tables as > today's doc values, and as discussed here for postings: > https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E). 
[jira] [Commented] (LUCENE-10016) VectorReader.search needs rethought, o.a.l.search integration?
[ https://issues.apache.org/jira/browse/LUCENE-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388899#comment-17388899 ] Michael Sokolov commented on LUCENE-10016:
---
As for the demo, there is a start on something we could use in luceneutil. It would require a fairly large word->vector dictionary though. I think maybe the way to do it is to provide instructions for downloading the dictionary rather than shipping it as part of the demo.

> VectorReader.search needs rethought, o.a.l.search integration?
>
> Key: LUCENE-10016
> URL: https://issues.apache.org/jira/browse/LUCENE-10016
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Robert Muir
> Priority: Blocker
> Fix For: 9.0
> Time Spent: 50m
> Remaining Estimate: 0h
>
> There's no search integration (e.g. queries) for the current vector values,
> no documentation/examples that I can find.
> Instead the codec has this method:
> {code}
> TopDocs search(String field, float[] target, int k, int fanout)
> {code}
> First, the "fanout" parameter needs to go; this is specific to the HNSW impl, get
> it out of here.
> Second, how am I supposed to skip over deleted documents? How can I use
> filters? How should I search across multiple segments?
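One way to think about the "how am I supposed to skip over deleted documents" question is that deletions need to be consulted while collecting the top-k, not by post-filtering a fixed-size result (which can silently return fewer than k live hits). Here is a toy brute-force sketch in plain Java; it is not the Lucene API, and all names are invented for illustration.

```java
// Toy sketch (not the Lucene API): honoring deletions during top-k
// collection by consulting a live-docs bitset inside the search loop.
import java.util.Arrays;
import java.util.BitSet;
import java.util.PriorityQueue;

public class KnnWithDeletesDemo {
    static float dot(float[] a, float[] b) {
        float s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    /** Doc IDs of the k live vectors most similar to target, best first. */
    static int[] search(float[][] vectors, BitSet liveDocs, float[] target, int k) {
        // Min-heap on score: the weakest of the current top-k is evicted first.
        PriorityQueue<double[]> heap =
            new PriorityQueue<>((x, y) -> Double.compare(x[0], y[0]));
        for (int doc = 0; doc < vectors.length; doc++) {
            if (!liveDocs.get(doc)) continue; // deleted docs never enter the heap
            double score = dot(vectors[doc], target);
            if (heap.size() < k) {
                heap.offer(new double[] {score, doc});
            } else if (score > heap.peek()[0]) {
                heap.poll();
                heap.offer(new double[] {score, doc});
            }
        }
        int[] result = new int[heap.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = (int) heap.poll()[1]; // pop worst-first, fill backwards
        }
        return result;
    }

    public static void main(String[] args) {
        float[][] vectors = { {1f, 0f}, {0.9f, 0.1f}, {0f, 1f}, {0.8f, 0f} };
        BitSet liveDocs = new BitSet();
        liveDocs.set(0, vectors.length);
        liveDocs.clear(0); // "delete" doc 0, the best match for the target
        System.out.println(
            Arrays.toString(search(vectors, liveDocs, new float[] {1f, 0f}, 2)));
        // prints [1, 3]: doc 0 is skipped and we still return k live hits
    }
}
```

A graph-based index like HNSW cannot do this by brute force, which is part of why the integration question (deletes, filters, multiple segments) is raised as a blocker above.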
[GitHub] [lucene] mayya-sharipova commented on a change in pull request #214: LUCENE-10027 provide leaf sorter from commit
mayya-sharipova commented on a change in pull request #214:
URL: https://github.com/apache/lucene/pull/214#discussion_r678507441

File path: lucene/core/src/java/org/apache/lucene/index/DirectoryReader.java

@@ -122,6 +122,23 @@ public static DirectoryReader open(final IndexCommit commit) throws IOException
     return StandardDirectoryReader.open(commit.getDirectory(), commit, null);
   }
+  /**
+   * Expert: returns an IndexReader reading the index in the given {@link IndexCommit}.
+   *
+   * @param commit the commit point to open
+   * @param leafSorter a comparator for sorting leaf readers. Providing leafSorter is useful for
+   *     indices on which it is expected to run many queries with particular sort criteria (e.g. for
+   *     time-based indices, this is usually a descending sort on timestamp). In this case {@code
+   *     leafSorter} should sort leaves according to this sort criteria. Providing leafSorter allows
+   *     to speed up this particular type of sort queries by early terminating while iterating
+   *     through segments and segments' documents
+   * @throws IOException if there is a low-level IO error
+   */
+  public static DirectoryReader open(final IndexCommit commit, Comparator leafSorter)

Review comment:
@dnhatn @jpountz Thanks for your comments. @jpountz About this comment:

> Since both minSupportedMajorVersion and leafSorter are super expert parameters, I think it's fine to require that users provide both instead of keeping adding new variants of DirectoryReader#open.

I am not happy either about adding a new variant of `DirectoryReader#open`, but are we OK to modify the current public API, `DirectoryReader open(IndexCommit, minSupportedMajorVersion)`, to add a new parameter? The modification will be in the minor release 8.10.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
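To illustrate why the leafSorter comparator under review helps time-based indices, here is a self-contained toy model in plain Java. This is not the Lucene API; the `Leaf` record and all numbers are invented. If leaves are visited in descending order of their max timestamp, a "latest k" query can stop as soon as the worst collected hit is at least the next leaf's maximum.

```java
// Toy model (not Lucene code): early termination over leaves sorted by
// a per-leaf max timestamp, newest first.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class LeafSorterDemo {
    /** Stand-in for a segment: its doc timestamps, sorted descending. */
    record Leaf(String name, long[] timestampsDesc) {
        long maxTimestamp() { return timestampsDesc[0]; }
    }

    /** Collect the k most recent timestamps, visiting leaves newest-first. */
    static List<Long> latest(List<Leaf> leaves, int k) {
        // The kind of comparator a caller would supply: newest leaf first.
        Comparator<Leaf> leafSorter =
            Comparator.comparingLong(Leaf::maxTimestamp).reversed();
        List<Leaf> ordered = new ArrayList<>(leaves);
        ordered.sort(leafSorter);

        PriorityQueue<Long> heap = new PriorityQueue<>(); // min-heap of best k
        for (Leaf leaf : ordered) {
            // Early termination: every later leaf has an even smaller max.
            if (heap.size() == k && heap.peek() >= leaf.maxTimestamp()) break;
            for (long ts : leaf.timestampsDesc) {
                if (heap.size() < k) heap.offer(ts);
                else if (ts > heap.peek()) { heap.poll(); heap.offer(ts); }
                else break; // timestamps within a leaf are descending too
            }
        }
        List<Long> out = new ArrayList<>(heap);
        out.sort(Comparator.reverseOrder());
        return out;
    }

    public static void main(String[] args) {
        List<Leaf> leaves = List.of(
            new Leaf("old", new long[] {60, 50}),
            new Leaf("new", new long[] {100, 90}),
            new Leaf("mid", new long[] {80, 70}));
        System.out.println(latest(leaves, 2)); // [100, 90]; "mid"/"old" never scanned
    }
}
```

Without the comparator, every leaf would have to be scanned; with it, the query terminates after the first leaf in this example, which is the speedup the javadoc in the diff above describes.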
[GitHub] [lucene] mayya-sharipova commented on a change in pull request #214: LUCENE-10027 provide leaf sorter from commit
mayya-sharipova commented on a change in pull request #214:
URL: https://github.com/apache/lucene/pull/214#discussion_r678516508

File path: lucene/core/src/java/org/apache/lucene/index/DirectoryReader.java (same review location and diff hunk as the comment above)

Review comment:
Actually, after reading the `CHANGES.txt` file, I've noticed many API changes even in minor versions. So, @jpountz, the commit a44c8120133aa01588c76b58321b8bffae0dd0c7 addresses your comment.
[GitHub] [lucene] gsmiller commented on pull request #227: LUCENE-10033: Encode numeric doc values and ordinals of SORTED(_SET) doc values in blocks.
gsmiller commented on pull request #227:
URL: https://github.com/apache/lucene/pull/227#issuecomment-888503779

> and explicitly rejects numbers of bits per value > 32

Ah right, of course this would be an issue here. Thanks for clarifying!
[GitHub] [lucene] gautamworah96 commented on pull request #175: LUCENE-9990: gradle7 support
gautamworah96 commented on pull request #175:
URL: https://github.com/apache/lucene/pull/175#issuecomment-888545634

I gave this PR another shot (since the Palantir plugin has been patched in v2.0.0 for Gradle 7 support), but some new issues came up. The good news: I *think* that using the `-Porg.gradle.java.installations.paths` command-line param points Gradle at that specific JDK for building and running the project. The bad news: since `JavaInstallationRegistry` is now deprecated, the build fails in multiple places (some that use the Java version to add specific JVM params, and others where we use the plugin to get the Java command to generate some javadoc). As of right now, I am just trial-and-erroring some code to see what works. Some WIP code is pushed [here](https://github.com/gautamworah96/lucene/pull/new/LUCENE-9990).
[GitHub] [lucene] sejal-pawar commented on pull request #159: LUCENE-9945: extend DrillSideways to expose FacetCollector and drill-down dimensions
sejal-pawar commented on pull request #159:
URL: https://github.com/apache/lucene/pull/159#issuecomment-888702505

> Thanks @sejal-pawar! This is more what I was originally describing in the Jira issue. Thanks for updating your PR!
>
> I left some small comments on variable naming, javadoc, etc. Otherwise this seems pretty close to me.
>
> It would be nice to add a test case though around this new functionality. Maybe you could write a test that relies on the newly-exposed FacetsCollectors and computes a Facets result that is expected to agree with the Facets result exposed already? That would be a nice way to confirm the correct collectors are getting exposed (and don't regress somehow with a future change). Because there are a number of different cases here (many different implementations of the static `search` method), you could leverage Lucene's randomized testing to randomly invoke different code paths (e.g., randomly provide a CollectorManager vs. a Collector; randomly provide an ExecutorService to the ctor; etc.).
>
> I suppose instead of randomized testing, you could also add on some checks to the existing test cases that also grab the FacetsCollectors from the result and validate them against the Facets that are already tested. That might actually be the easiest way to go about the testing. Have a look in `TestDrillSideways` for what we do currently.

Hey Greg, (apologies for the late reply!) I resolved the other comments, but while writing the test I noticed that a lot of test cases in DrillSidewaysResult involve the same logic for initialising DrillSideways, e.g. [1](https://code.amazon.com/packages/lucene/blobs/7a7003c51c8c0470f04e9df2ee9cb6002e124689/--/lucene/facet/src/test/org/apache/lucene/facet/TestDrillSideways.java#L1762). Would it perhaps make sense to extract the initialisation of DrillSideways into a helper test class called `DrillSidewaysInitialiser`?
I was thinking of encapsulating all the required pieces, like Directory and DirectoryTaxonomyWriter, into a single class. Something similar has been done for document generation and initialisation in `org.apache.lucene.index.DocHelper`.
[GitHub] [lucene] jpountz merged pull request #221: LUCENE-10031: Speed up SortedDocIdMerger on low-cardinality sort fields.
jpountz merged pull request #221: URL: https://github.com/apache/lucene/pull/221
[jira] [Commented] (LUCENE-10031) Speedup to SortedDocIDMerger when sorting on low-cardinality fields
[ https://issues.apache.org/jira/browse/LUCENE-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17389289#comment-17389289 ] ASF subversion and git services commented on LUCENE-10031:
---
Commit 0e6c3146d7853d27037213dc58eddc16a0e05daa in lucene's branch refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=0e6c314 ]

LUCENE-10031: Speed up SortedDocIdMerger on low-cardinality sort fields. (#221)

When sorting by low-cardinality fields, the same sub remains current for long sequences of doc IDs. This speeds up SortedDocIdMerger a bit by extracting the sub that leads iteration.

> Speedup to SortedDocIDMerger when sorting on low-cardinality fields
>
> Key: LUCENE-10031
> URL: https://issues.apache.org/jira/browse/LUCENE-10031
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Time Spent: 50m
> Remaining Estimate: 0h
>
> I've been looking at profiles of indexing with index sorting enabled and saw
> non-negligible time spent in SortedDocIDMerger. This isn't completely
> surprising, as this little class is called on every document whenever merging
> postings, doc values, stored fields, etc.
> I'm especially interested in cases where the sort key is on a low-cardinality
> field, so the priority queue doesn't get reordered often. I've been playing
> with a change to SortedDocIdMerger that makes merging significantly faster in
> that case.
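The "extract the sub that leads iteration" idea in the commit message can be illustrated with a self-contained sketch. This is plain Java, not the actual SortedDocIdMerger code; class names are invented. Instead of re-inserting into the priority queue after every value, we stay on the current leading sub until its next value exceeds the queue's top, which is cheap exactly when long runs of values come from the same sub (i.e., low-cardinality sort keys).

```java
// Sketch (not Lucene code): a k-way merge that keeps the leading sub out of
// the priority queue while it remains smallest, avoiding per-value reorders.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class LeadSubMergeDemo {
    // A sub to merge: a sorted array plus a cursor.
    static final class Sub {
        final int[] values;
        int pos = 0;
        Sub(int[] values) { this.values = values; }
        int current() { return values[pos]; }
        boolean exhausted() { return pos >= values.length; }
    }

    /** Merge sorted arrays, caching the leading sub between queue operations. */
    static List<Integer> merge(int[][] arrays) {
        PriorityQueue<Sub> queue =
            new PriorityQueue<>(Comparator.comparingInt(Sub::current));
        for (int[] a : arrays) if (a.length > 0) queue.offer(new Sub(a));
        List<Integer> out = new ArrayList<>();
        Sub lead = queue.poll();
        while (lead != null) {
            // Stay on the lead while it is still smallest: no queue reorder.
            int bound = queue.isEmpty() ? Integer.MAX_VALUE : queue.peek().current();
            while (!lead.exhausted() && lead.current() <= bound) {
                out.add(lead.current());
                lead.pos++;
            }
            if (!lead.exhausted()) queue.offer(lead); // lead fell behind: re-enter
            lead = queue.poll();                      // pick the new leader
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] arrays = { {1, 2, 3, 10, 11}, {4, 5, 6}, {7, 8, 9} };
        System.out.println(merge(arrays)); // [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
    }
}
```

When runs are long, the inner loop does one comparison per value against a cached bound instead of a heap sift, which matches the profile-driven motivation quoted in the issue.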