[jira] [Commented] (LUCENE-10030) [DrillSidewaysScorer] redundant score() calculations in doQueryFirstScoring

2021-07-27 Thread Grigoriy Troitskiy (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17387972#comment-17387972
 ] 

Grigoriy Troitskiy commented on LUCENE-10030:
-

Yes, it would be great to backport this to 8.x. I didn't find any hints in the 
wiki; could you please describe the steps?

Btw, should we have added an entry to 
[https://github.com/apache/lucene/blob/main/lucene/CHANGES.txt]?

> [DrillSidewaysScorer] redundant score() calculations in doQueryFirstScoring
> ---
>
> Key: LUCENE-10030
> URL: https://issues.apache.org/jira/browse/LUCENE-10030
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Grigoriy Troitskiy
>Priority: Major
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> *Diff*
> {code:java}
> @@ -195,11 +195,8 @@ class DrillSidewaysScorer extends BulkScorer {
>  
>        collectDocID = docID;
>  
> -      // TODO: we could score on demand instead since we are
> -      // daat here:
> -      collectScore = baseScorer.score();
> -
>        if (failedCollector == null) {
> +        collectScore = baseScorer.score();
>          // Hit passed all filters, so it's "real":
>          collectHit(collector, dims);
>        } else {
> {code}
>  
>  *Motivation*
>  1. Performance degradation: we have a quite heavy custom implementation of 
> score(). So when we started using DrillSideways, this call became the top-1 entry 
> in a profiler snapshot (top-3 with default scoring). We tried doUnionScoring and 
> doDrillDownAdvanceScoring, but no luck:
>  doUnionScoring scores all baseQuery docIds
>  doDrillDownAdvanceScoring avoids some redundant docId scorings by considering 
> the symmetric difference of the top two iterators' docIds, but still scores some 
> docIds that will be filtered out by the 3rd, 4th, ... dimension iterators
>  doQueryFirstScoring scores near-miss docIds
>  The best way is to score only true hits (where baseQuery and all N drill-down 
> iterators match), so we suggest a small modification of doQueryFirstScoring.
>   
>  2. Speaking of doQueryFirstScoring, it doesn't look like we need to 
> calculate a score for a near-miss hit, because it won't be used anywhere.
>  FacetsCollectorManager creates FacetsCollector with default constructor
>  
> [https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/FacetsCollectorManager.java#L35]
>  so FacetsCollector has keepScores set to false 
>  
> [https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/FacetsCollector.java#L119]
>  and collectScore is not being used
>  
> [https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysScorer.java#L200]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a change in pull request #226: LUCENE-10036: Add factory method to ScoreCachingWrappingScorer that ensures unnecessary wrapping doesn't occur

2021-07-27 Thread GitBox


jpountz commented on a change in pull request #226:
URL: https://github.com/apache/lucene/pull/226#discussion_r677348827



##
File path: 
lucene/core/src/test/org/apache/lucene/search/TestScoreCachingWrappingScorer.java
##
@@ -156,4 +156,28 @@ public void testGetScores() throws Exception {
 ir.close();
 directory.close();
   }
+
+  public void testNoUnnecessaryWrap() throws Exception {
+Scorable base =
+new Scorable() {
+  @Override
+  public float score() throws IOException {
+return -1;
+  }
+
+  @Override
+  public int docID() {
+return -1;
+  }
+};
+
+// Wrapping the first time should produce a different instance:
+Scorable wrapped = ScoreCachingWrappingScorer.wrap(base);
+assertNotEquals(base, wrapped);
+
+// But if we try to wrap an instance of ScoreCachingWrappingScorer, it shouldn't
+// unnecessarily wrap again:
+Scorable doubleWrapped = ScoreCachingWrappingScorer.wrap(wrapped);
+assertEquals(wrapped, doubleWrapped);

Review comment:
   Use `assertSame`?
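For reference, `assertEquals` compares with `equals()`, while `assertSame` asserts reference identity (`==`) — which is exactly what this test intends to state. A JUnit-free sketch of the distinction:

```java
public class AssertSameDemo {
  public static void main(String[] args) {
    String a = new String("scorer");
    String b = new String("scorer");
    // equals-based comparison (what assertEquals uses) passes:
    if (!a.equals(b)) throw new AssertionError("expected equal content");
    // identity comparison (what assertSame uses) fails for distinct instances:
    if (a == b) throw new AssertionError("expected distinct instances");
    System.out.println("equal but not the same instance");
  }
}
```

Because `ScoreCachingWrappingScorer.wrap` is expected to return the very same object, `assertSame` expresses the contract more precisely than `assertEquals`.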




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mikemccand commented on pull request #220: LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version

2021-07-27 Thread GitBox


mikemccand commented on pull request #220:
URL: https://github.com/apache/lucene/pull/220#issuecomment-887472869


   Also, to be clear: even though the opening comment says the PR implemented 
option 1, it has since iterated to option 2 (switching based on the index 
created version metadata).





[GitHub] [lucene] mikemccand commented on a change in pull request #220: LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version

2021-07-27 Thread GitBox


mikemccand commented on a change in pull request #220:
URL: https://github.com/apache/lucene/pull/220#discussion_r677406777



##
File path: 
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestBackwardsCompatibility.java
##
@@ -49,13 +49,8 @@
   //
  // Then move the zip file to your trunk checkout and use it in your test cases
 
-  public static final String oldTaxonomyIndexName = "taxonomy.8.6.3-cfs";
+  public static final String oldTaxonomyIndexName = "taxonomy.8.10.0-cfs";
 
-  // LUCENE-9334 requires consistency of field data structures between documents.
-  // Old taxonomy index had $full_path$ field indexed only with postings,
-  // It is not allowed to add the same field $full_path$ indexed with BinaryDocValues
-  // for a new segment, that this test is trying to do.
-  @AwaitsFix(bugUrl = "https://issues.apache.org/jira/browse/LUCENE-9334")

Review comment:
   Woot!

##
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
##
@@ -475,8 +477,15 @@ private int addCategoryDocument(FacetLabel categoryPath, 
int parent) throws IOEx
 
 String fieldPath = FacetsConfig.pathToString(categoryPath.components, 
categoryPath.length);
 fullPathField.setStringValue(fieldPath);
+
+if (useOlderStoredFieldIndex) {
+  fullPathField = new StringField(Consts.FULL, fieldPath, Field.Store.YES);

Review comment:
   Hmm this is odd -- up above this `if` we are already calling 
`setStringValue` on the reused `fullPathField`, but then inside this `if` we 
override `fullPathField` with a new `StringField` instance?
   
   Maybe just move the above line inside the `if`?
   
   Also, why don't we also re-use the `BinaryDocValues` field?

##
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
##
@@ -164,9 +159,16 @@ public DirectoryTaxonomyWriter(Directory directory, 
OpenMode openMode, TaxonomyW
 openMode = config.getOpenMode();
 if (!DirectoryReader.indexExists(directory)) {

Review comment:
   Maybe take the opportunity to repair this horrible `!` by switching it 
to `== false`?







[jira] [Commented] (LUCENE-10034) Vectors NeighborQueue MIN/MAX heap reversed?

2021-07-27 Thread Mayya Sharipova (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388054#comment-17388054
 ] 

Mayya Sharipova commented on LUCENE-10034:
--

Tomoko and Michael, thank you for the discussion. From my side, this part of 
the code is very clear now.

About naming "similarity" vs. "distance": IMHO, the current "similarity" naming 
for a vector metric function suits Lucene better, as top docs are arranged in 
decreasing order of scores, the same way a "similarity" metric arranges them. 
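To make the ordering point concrete: since top docs come back in decreasing score order, a distance metric (lower is better) has to be flipped into a similarity (higher is better). A toy sketch, using an assumed 1 / (1 + distance) mapping rather than Lucene's actual formula:

```java
public class SimilarityDemo {
  static double euclidean(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  // Map a distance into a similarity: nearer vectors get higher scores,
  // which matches the descending top-docs order.
  static double similarity(double[] a, double[] b) {
    return 1.0 / (1.0 + euclidean(a, b));
  }

  public static void main(String[] args) {
    double[] query = {0, 0};
    double[] near = {0, 1};
    double[] far = {3, 4};
    // the nearer neighbor scores higher
    System.out.println(similarity(query, near) > similarity(query, far)); // true
  }
}
```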

> Vectors NeighborQueue MIN/MAX heap reversed?
> 
>
> Key: LUCENE-10034
> URL: https://issues.apache.org/jira/browse/LUCENE-10034
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Mayya Sharipova
>Priority: Trivial
>
> NeighborQueue is defined as follows:
> {code:java}
> NeighborQueue(int initialSize, boolean reversed) {
>   if (reversed) {
> heap = LongHeap.create(LongHeap.Order.MAX, initialSize);
>   } else {
> heap = LongHeap.create(LongHeap.Order.MIN, initialSize);
>   }
> }
> {code}
> should it be reversed? Should it instead use a MIN heap for reversed 
> functions such as EUCLIDEAN distance, as we are interested in neighbors with 
> min euclidean distances? 
> I apologize if I missed some broader context where this definition makes 
> sense. 
>  






[jira] [Commented] (LUCENE-10030) [DrillSidewaysScorer] redundant score() calculations in doQueryFirstScoring

2021-07-27 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388092#comment-17388092
 ] 

Michael McCandless commented on LUCENE-10030:
-

bq. Btw, should we have added changes to 
https://github.com/apache/lucene/blob/main/lucene/CHANGES.txt ?

Looks like [~gsmiller] just added the CHANGES entry.

What an awesome performance bug fix!  Thanks [~gtroitskiy].

+1 to backport to 8.x, but that is a bit tricky because it is a different git 
repository: https://github.com/apache/lucene-solr

You could use diff/patch to carry the change over, or, I think, you could add both 
remotes to your local git repository and then `git cherry-pick` across them to 
pull this change from 9.0 back to 8.x.
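For the second approach, the steps might look roughly like this (the repository URLs are real; `<fix-commit-hash>` is a placeholder for the commit to pick):

```shell
# Sketch: cherry-pick a commit from the new lucene repo into lucene-solr's 8.x branch.
git clone https://github.com/apache/lucene-solr.git
cd lucene-solr
git checkout branch_8x

# Add the 9.0 (lucene) repository as a second remote and fetch its history:
git remote add lucene https://github.com/apache/lucene.git
git fetch lucene main

# Pick the fix across repositories (may need conflict resolution,
# since the two repos' histories are unrelated):
git cherry-pick <fix-commit-hash>
```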







[jira] [Comment Edited] (LUCENE-10034) Vectors NeighborQueue MIN/MAX heap reversed?

2021-07-27 Thread Mayya Sharipova (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388054#comment-17388054
 ] 

Mayya Sharipova edited comment on LUCENE-10034 at 7/27/21, 2:35 PM:


Tomoko and Michael, thank you for the discussion. From my side, this part of 
the code is very clear now.

About naming "similarity" vs. "distance": while "distance" is indeed easier to 
understand, IMHO the current "similarity" naming for a vector metric function 
suits Lucene better, as top docs are arranged in decreasing order of scores, 
the same way a "similarity" metric arranges them. 


was (Author: mayya):
Tomoko and Michael, thank you for the discussion. From my side, this part of 
the code is very clear now.

About naming "similarity" VS "distance",  IMH,  the current "Similarity" naming 
for a vector metric function suits Lucene better, as top docs are arranged in a 
decreasing order of scores, the same way as a "similarity" metric arranges 
them. 







[GitHub] [lucene] gsmiller commented on a change in pull request #226: LUCENE-10036: Add factory method to ScoreCachingWrappingScorer that ensures unnecessary wrapping doesn't occur

2021-07-27 Thread GitBox


gsmiller commented on a change in pull request #226:
URL: https://github.com/apache/lucene/pull/226#discussion_r677530466



##
File path: 
lucene/core/src/test/org/apache/lucene/search/TestScoreCachingWrappingScorer.java
##
@@ -156,4 +156,28 @@ public void testGetScores() throws Exception {
 ir.close();
 directory.close();
   }
+
+  public void testNoUnnecessaryWrap() throws Exception {
+Scorable base =
+new Scorable() {
+  @Override
+  public float score() throws IOException {
+return -1;
+  }
+
+  @Override
+  public int docID() {
+return -1;
+  }
+};
+
+// Wrapping the first time should produce a different instance:
+Scorable wrapped = ScoreCachingWrappingScorer.wrap(base);
+assertNotEquals(base, wrapped);
+
+// But if we try to wrap an instance of ScoreCachingWrappingScorer, it shouldn't
+// unnecessarily wrap again:
+Scorable doubleWrapped = ScoreCachingWrappingScorer.wrap(wrapped);
+assertEquals(wrapped, doubleWrapped);

Review comment:
   Ooop, yep. Changed.







[GitHub] [lucene] gsmiller merged pull request #226: LUCENE-10036: Add factory method to ScoreCachingWrappingScorer that ensures unnecessary wrapping doesn't occur

2021-07-27 Thread GitBox


gsmiller merged pull request #226:
URL: https://github.com/apache/lucene/pull/226


   





[jira] [Commented] (LUCENE-10036) Ensure ScoreCachingWrappingScorer doesn't unnecessarily wrap another ScoreCachingWrappingScorer

2021-07-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388119#comment-17388119
 ] 

ASF subversion and git services commented on LUCENE-10036:
--

Commit e44636c28080055502bd30a42c8b1c12801d6b75 in lucene's branch 
refs/heads/main from Greg Miller
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e44636c ]

LUCENE-10036: Add factory method to ScoreCachingWrappingScorer that ensures 
unnecessary wrapping doesn't occur (#226)



> Ensure ScoreCachingWrappingScorer doesn't unnecessarily wrap another 
> ScoreCachingWrappingScorer
> ---
>
> Key: LUCENE-10036
> URL: https://issues.apache.org/jira/browse/LUCENE-10036
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: main (9.0), 8.10
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Trivial
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> This is a trivial issue, but it's easy to mistakenly "double wrap" an 
> instance of {{ScoreCachingWrappingScorer}}, which is a bit wasteful. The 
> calling code currently needs to check the instance type of the {{Scorable}} 
> they intend to wrap to avoid this. {{FieldComparator}} is actually the only 
> calling code that does this check.
> It would be nice to add a factory method that encapsulates this check in 
> {{ScoreCachingWrappingScorer}} so that calling code doesn't need to worry 
> about it.
>  
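As a standalone illustration of the factory-method idea (simplified stand-in types, not Lucene's actual classes; the real ScoreCachingWrappingScorer also tracks the current doc ID so the cached score is invalidated per document):

```java
interface SimpleScorable {
  float score();
}

final class CachingScorable implements SimpleScorable {
  private final SimpleScorable in;
  private boolean cached;
  private float value;

  private CachingScorable(SimpleScorable in) {
    this.in = in;
  }

  // Factory method: returns the input unchanged if it is already caching,
  // so callers can't accidentally double-wrap.
  static SimpleScorable wrap(SimpleScorable in) {
    if (in instanceof CachingScorable) {
      return in;
    }
    return new CachingScorable(in);
  }

  @Override
  public float score() {
    if (!cached) {
      value = in.score();
      cached = true;
    }
    return value;
  }
}

public class WrapDemo {
  public static void main(String[] args) {
    SimpleScorable base = () -> 42f;
    SimpleScorable wrapped = CachingScorable.wrap(base);
    SimpleScorable again = CachingScorable.wrap(wrapped);
    System.out.println(wrapped != base);   // true: first wrap creates a new instance
    System.out.println(again == wrapped);  // true: second wrap is a no-op
  }
}
```

The instanceof check lives in one place, so callers never need to test the type themselves.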






[GitHub] [lucene-solr] gsmiller merged pull request #2534: LUCENE-10036: Add factory method to ScoreCachingWrappingScorer that ensures unnecessary wrapping doesn't occur

2021-07-27 Thread GitBox


gsmiller merged pull request #2534:
URL: https://github.com/apache/lucene-solr/pull/2534


   





[jira] [Commented] (LUCENE-10036) Ensure ScoreCachingWrappingScorer doesn't unnecessarily wrap another ScoreCachingWrappingScorer

2021-07-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388120#comment-17388120
 ] 

ASF subversion and git services commented on LUCENE-10036:
--

Commit f444a69f703fdd4b9617251411b09a648f955732 in lucene-solr's branch 
refs/heads/branch_8x from Greg Miller
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f444a69 ]

LUCENE-10036: Add factory method to ScoreCachingWrappingScorer that ensures 
unnecessary wrapping doesn't occur (#2534)









[GitHub] [lucene] mikemccand commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

2021-07-27 Thread GitBox


mikemccand commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r677506600



##
File path: 
lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space

Review comment:
   Maybe say `whole Unicode code point space`?

##
File path: 
lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton}
+   *     for better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton}
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+this.automaton = automaton;
+points = automaton.getStartPoints();
+this.alphabetSize = alphabetSize;
+dStates = new DState[10];
+findDState(new DState(new int[] {0}));
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously
+   *     by this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  public int step(int state, int c) {
+assert dStates[state] != null;
+return step(d

[GitHub] [lucene] jpountz opened a new pull request #227: LUCENE-10033: Encode numeric doc values and ordinals of SORTED(_SET) doc values in blocks.

2021-07-27 Thread GitBox


jpountz opened a new pull request #227:
URL: https://github.com/apache/lucene/pull/227


   This moves doc values to an approach that is more similar to postings, where
   values are grouped in blocks of 128 values that are compressed together.
   Decoding a single value requires decoding the entire block that contains the
   value.
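A toy sketch of the trade-off the PR describes (illustrative only, not the actual codec): values are decoded a whole block at a time, so random access to one value pays for 128 decodes, while sequential consumers amortize the cost:

```java
public class BlockedValuesDemo {
  static final int BLOCK_SIZE = 128;

  // Stand-in "compression": each block stores its minimum plus small deltas.
  // Real codecs bit-pack the deltas; the access pattern is what matters here.
  static long[] decodeBlock(long blockMin, int[] packedDeltas) {
    long[] out = new long[packedDeltas.length];
    for (int i = 0; i < out.length; i++) {
      out[i] = blockMin + packedDeltas[i];
    }
    return out;
  }

  // Random access: must decode the entire block containing index i.
  static long get(long[] blockMins, int[][] blockDeltas, int i) {
    int block = i / BLOCK_SIZE;
    long[] decoded = decodeBlock(blockMins[block], blockDeltas[block]);
    return decoded[i % BLOCK_SIZE];
  }

  public static void main(String[] args) {
    long[] mins = {1000L};
    int[][] deltas = {new int[BLOCK_SIZE]};
    deltas[0][5] = 7;
    System.out.println(get(mins, deltas, 5)); // 1007
  }
}
```

This is why the benchmark below shows faceting tasks (which scan most values) speeding up while some sorting tasks (which sample few values) slow down.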





[jira] [Commented] (LUCENE-10033) Encode doc values in smaller blocks of values, like postings

2021-07-27 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388178#comment-17388178
 ] 

Adrien Grand commented on LUCENE-10033:
---

I opened a PR with this idea. Queries that consume most values, like the Browse* 
faceting tasks, become faster, but queries that only consume a small subset of 
values, like some sorting tasks (not all; one of them is faster), become slower.

{noformat}
Task                       QPS baseline  StdDev   QPS patch  StdDev  Pct diff  p-value
   HighTermMonthSort  101.33  (9.7%)   51.93  (2.8%)  -48.7% ( -55% -  -40%) 0.000
  TermDTSort  587.24  (6.1%)  404.20  (2.9%)  -31.2% ( -37% -  -23%) 0.000
  IntNRQ   85.55 (14.7%)   73.16  (1.6%)  -14.5% ( -26% -    2%) 0.000
OrHighNotMed 1301.37  (3.7%) 1218.64  (2.3%)   -6.4% ( -11% -    0%) 0.000
   OrNotHighHigh 1121.91  (4.1%) 1089.27  (2.7%)   -2.9% (  -9% -    4%) 0.008
 MedTerm 2156.71  (3.3%) 2103.32  (3.6%)   -2.5% (  -9% -    4%) 0.022
  Fuzzy2   67.41  (4.6%)   65.74  (4.9%)   -2.5% ( -11% -    7%) 0.098
OrNotHighLow 1099.66  (3.7%) 1078.60  (3.0%)   -1.9% (  -8% -    4%) 0.073
 MedIntervalsOrdered   79.39  (3.0%)   77.94  (3.7%)   -1.8% (  -8% -    5%) 0.088
   MedPhrase  403.62  (2.8%)  397.19  (2.3%)   -1.6% (  -6% -    3%) 0.050
   OrHighMed  130.57  (3.0%)  128.64  (2.6%)   -1.5% (  -6% -    4%) 0.099
 LowIntervalsOrdered   20.82  (2.5%)   20.55  (3.4%)   -1.3% (  -6% -    4%) 0.167
HighIntervalsOrdered    2.95  (5.1%)    2.91  (5.8%)   -1.1% ( -11% -   10%) 0.530
   OrHighLow  579.45  (2.9%)  574.45  (2.4%)   -0.9% (  -5% -    4%) 0.306
 LowSpanNear   33.20  (2.9%)   33.06  (3.5%)   -0.4% (  -6% -    6%) 0.668
HighSpanNear    9.79  (3.5%)    9.79  (3.7%)   -0.0% (  -7% -    7%) 0.996
 Respell  221.47  (2.1%)  221.62  (2.8%)    0.1% (  -4% -    4%) 0.931
HighSloppyPhrase   36.64  (3.4%)   36.69  (4.0%)    0.1% (  -7% -    7%) 0.915
Wildcard  283.85  (6.5%)  285.06  (7.2%)    0.4% ( -12% -   15%) 0.845
 LowSloppyPhrase  175.77  (4.3%)  176.56  (4.4%)    0.5% (  -7% -    9%) 0.740
 AndHighHigh   64.34  (2.5%)   64.84  (3.4%)    0.8% (  -5% -    6%) 0.410
HighTerm 2146.56  (3.3%) 2164.26  (4.5%)    0.8% (  -6% -    8%) 0.505
HighTermTitleBDVSort   27.18  (4.6%)   27.41  (2.1%)    0.8% (  -5% -    7%) 0.461
OrHighNotLow 1261.38  (2.3%) 1274.89  (3.0%)    1.1% (  -4% -    6%) 0.210
 MedSpanNear   26.96  (4.1%)   27.28  (3.5%)    1.2% (  -6% -    9%) 0.336
 MedSloppyPhrase  102.18  (4.7%)  103.51  (5.1%)    1.3% (  -8% -   11%) 0.399
BrowseDateTaxoFacets    3.15  (4.0%)    3.19  (4.0%)    1.4% (  -6% -    9%) 0.281
BrowseDayOfYearTaxoFacets    3.15  (4.0%)    3.20  (4.0%)    1.5% (  -6% -    9%) 0.250
  AndHighLow 1295.59  (3.3%) 1318.11  (3.4%)    1.7% (  -4% -    8%) 0.105
 Prefix3   63.21 (15.4%)   64.49 (17.1%)    2.0% ( -26% -   40%) 0.694
  OrHighHigh   35.41  (3.1%)   36.24  (3.1%)    2.4% (  -3% -    8%) 0.015
  Fuzzy1  253.74  (6.1%)  260.89  (7.1%)    2.8% (  -9% -   16%) 0.175
   BrowseMonthTaxoFacets    3.42  (7.7%)    3.52  (4.1%)    2.9% (  -8% -   15%) 0.135
  AndHighMed  164.48  (2.6%)  169.43  (3.3%)    3.0% (  -2% -    9%) 0.001
 LowTerm 2645.26  (4.9%) 2752.43  (5.6%)    4.1% (  -6% -   15%) 0.015
   OrHighNotHigh 1286.12  (3.7%) 1349.66  (4.6%)    4.9% (  -3% -   13%) 0.000
  HighPhrase  105.61  (3.7%)  111.65  (4.8%)    5.7% (  -2% -   14%) 0.000
   LowPhrase   35.85  (2.6%)   38.76  (3.3%)    8.1% (   2% -   14%) 0.000
OrNotHighMed 1241.35  (3.1%) 1368.49  (3.6%)   10.2% (   3% -   17%) 0.000
   HighTermDayOfYearSort  573.92  (9.5%)  687.19  (7.9%)   19.7% (   2% -   40%) 0.000
   BrowseMonthSSDVFacets   11.52  (5.1%)   17.81 (23.5%)   54.6% (  24% -   87%) 0.000
BrowseDayOfYearSSDVFacets   11.24  (3.9%)   18.15 (23.1%)   61.4% (  33% -   91%) 0.000
{noformat}


[jira] [Commented] (LUCENE-10030) [DrillSidewaysScorer] redundant score() calculations in doQueryFirstScoring

2021-07-27 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388181#comment-17388181
 ] 

Greg Miller commented on LUCENE-10030:
--

Yeah [~gtroitskiy], I caught the missing CHANGES entry right after I pushed and 
took care of it. As for backporting, as [~mikemccand] noted, you'll need to 
create a PR against the "old" repo (lucene-solr) in github with the same 
change. Using cherry-pick is a little "tricky" since there's a ton of 
formatting changes between the two repos, which we'd prefer to leave out of the 
diff. Given that the change is relatively small, you could probably copy/paste 
stuff across. Let me know if you need any help!

> [DrillSidewaysScorer] redundant score() calculations in doQueryFirstScoring
> ---
>
> Key: LUCENE-10030
> URL: https://issues.apache.org/jira/browse/LUCENE-10030
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Grigoriy Troitskiy
>Priority: Major
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> *Diff*
> {code:java}
> @@ -195,11 +195,8 @@ class DrillSidewaysScorer extends BulkScorer {
>  
>        collectDocID = docID;
>  
> -      // TODO: we could score on demand instead since we are
> -      // daat here:
> -      collectScore = baseScorer.score();
> -
>        if (failedCollector == null) {
> +        collectScore = baseScorer.score();
>          // Hit passed all filters, so it's "real":
>          collectHit(collector, dims);
>        } else {
> {code}
>  
>  *Motivation*
>  1. Performance degradation: we have quite heavy custom implementation of 
> score(). So when we started using DrillSideways, this call became top-1 in a 
> profiler snapshot (top-3 with default scoring). We tried doUnionScoring and 
> doDrillDownAdvanceScoring, but no luck:
>  doUnionScoring scores all baseQuery docIds
>  doDrillDownAdvanceScoring avoids some redundant docIds scorings, considering 
> symmetric difference of top two iterator's docIds, but still scores some 
> docIds, that will be filtered out by 3rd, 4th, ... dimension iterators
>  doQueryFirstScoring scores near-miss docIds
>  Best way is to score only true hits (where baseQuery and all N drill-down 
> iterators match). So we suggest a small modification of doQueryFirstScoring.
>   
>  2. Speaking of doQueryFirstScoring, it doesn't look like we need to 
> calculate a score for near-miss hit, because it won't be used anywhere.
>  FacetsCollectorManager creates FacetsCollector with default constructor
>  
> [https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/FacetsCollectorManager.java#L35]
>  so FacetCollector has false for keepScores 
>  
> [https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/FacetsCollector.java#L119]
>  and collectScore is not being used
>  
> [https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysScorer.java#L200]
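The lazy-evaluation idea described in the issue body above can be sketched outside of Lucene. This is a minimal illustration of the technique, not the actual DrillSidewaysScorer code; all names here (LazyScoreSketch, expensiveScore, dims) are invented for the example:

```java
import java.util.List;
import java.util.function.IntPredicate;

/**
 * Minimal sketch of the LUCENE-10030 idea: compute the (possibly expensive)
 * score only for documents that pass every drill-down dimension, instead of
 * eagerly scoring every base-query document. Illustrative names only.
 */
public class LazyScoreSketch {
  static int scoreCalls = 0; // counts how often the "expensive" scorer runs

  static double expensiveScore(int docId) {
    scoreCalls++;
    return docId * 1.5; // stand-in for a heavy custom score() implementation
  }

  /** Sums scores of docs that pass all dimension filters, scoring lazily. */
  static double collect(int[] docIds, List<IntPredicate> dims) {
    double total = 0;
    for (int doc : docIds) {
      boolean passedAll = true;
      for (IntPredicate dim : dims) {
        if (!dim.test(doc)) {
          passedAll = false; // near-miss doc: never scored
          break;
        }
      }
      if (passedAll) {
        total += expensiveScore(doc); // score only true hits
      }
    }
    return total;
  }

  public static void main(String[] args) {
    int[] docs = {1, 2, 3, 4, 5, 6};
    List<IntPredicate> dims = List.of(d -> d % 2 == 0, d -> d > 2);
    // Only docs 4 and 6 pass both dimensions, so score() runs twice.
    System.out.println(collect(docs, dims) + " after " + scoreCalls + " score() calls");
  }
}
```

The patch in the issue does the analogous move: `baseScorer.score()` runs only inside the `failedCollector == null` branch, i.e. only for docs that survived all sideways dimensions.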



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zhaih commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

2021-07-27 Thread GitBox


zhaih commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r677666989



##
File path: 
lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton}
+   *     for better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+    this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton}
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+    this.automaton = automaton;
+    points = automaton.getStartPoints();
+    this.alphabetSize = alphabetSize;
+    dStates = new DState[10];
+    findDState(new DState(new int[] {0}));
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously by
+   *     this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  public int step(int state, int c) {
+    assert dStates[state] != null;
+    return step(dStates[state], c);
+  }
+
+  /**
+   * Run through a given codepoint array, return accepted or not, should only be used in test
+   *
+   * @param s String represented by an int array
+   * @return accept or not
+   */
+  boolean run(int[] s) {
+    int p = 0;
+    for (int c : s) {
+      p = step(p, c);
+      if (p == MISSING) return false;
+    }
+    return dStates[p].isAccept;
+  }
+
+  /**
+   * From an existing DFA state, step to next DFA state given character c, if the transition is
+   * previously tried then this operation will just use the cached result, otherwise it will call
+   * {@link #step(int[], int)} to get the next state and cache the result
+   */
+  private int step(DState dState, int c) {
+    int charClass = getCharClass(c);
+    if (dState.nextState(charClass) == NOT_COMPUTED) {
+      // the next dfa state has not been computed yet
+      dState.setNextState(charClass, findDState(step(dState.nfaStates, c)));
+    }
+    return dState.nextState(charClass);
+  }
+
+  /**
+   * given a list of NFA states and a character c, compute the output list of NFA state which is
+   * wrapped as a DFA state
+   */
+  private DState step(int[] nfaStates, int c) {
+    Transition transition = new Transition();
+    StateSet stateSet = new StateSet(5); // fork IntHashSet from hppc instead?
+    int numTransitions;
+    for (int nfaState : nfaStates) {
+      numTransitions = automaton.initTransition(nfaState, transition);
+      for (int i = 0; i < numTransitions; i++) {
+        automaton.getNextTransition(transition);
+        if (transition.min <= c && transition.max >= c) {
+          stateSet.incr(transition.dest);
+        }
+      }
+    }
+    if (stateSet.size() == 0) {
+      return null;
+    }
+    return new DState(stateSet.getArray());
+  }
+
+  /**
+   * return the ordinal of given DFA state, generate a new ordinal if the given DFA state
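The "determinize while running" pattern in the excerpt (DFA states are sets of NFA states, computed lazily and cached per transition) can be sketched standalone. The tiny hard-coded NFA below, which recognizes strings ending in "ab", is invented for illustration and does not use Lucene's Automaton API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/**
 * Standalone sketch of on-the-fly subset construction: each DFA state is a
 * set of NFA states, and each (DFA state, char) edge is computed at most
 * once and then served from a cache, as in the NFARunAutomaton excerpt.
 */
public class OnTheFlyDfa {
  // NFA transitions: state -> (char -> set of next states)
  private final Map<Integer, Map<Character, Set<Integer>>> nfa = new HashMap<>();
  private final Set<Integer> acceptStates = Set.of(2);
  // cache: (current DFA state, char) -> next DFA state
  private final Map<String, Set<Integer>> cache = new HashMap<>();
  int subsetComputations = 0; // how many edges were actually computed

  OnTheFlyDfa() {
    // state 0 loops on a/b; on 'a' it may also move to 1; 1 -> 2 on 'b'
    nfa.put(0, Map.of('a', Set.of(0, 1), 'b', Set.of(0)));
    nfa.put(1, Map.of('b', Set.of(2)));
    nfa.put(2, Map.of());
  }

  private Set<Integer> step(Set<Integer> dState, char c) {
    String key = dState + "|" + c; // TreeSet gives a deterministic key
    Set<Integer> next = cache.get(key);
    if (next == null) { // not computed yet: do subset construction for this edge
      subsetComputations++;
      next = new TreeSet<>();
      for (int s : dState) {
        next.addAll(nfa.getOrDefault(s, Map.of()).getOrDefault(c, Set.of()));
      }
      cache.put(key, next);
    }
    return next;
  }

  boolean run(String input) {
    Set<Integer> state = new TreeSet<>(Set.of(0)); // start DFA state {0}
    for (char c : input.toCharArray()) {
      state = step(state, c);
      if (state.isEmpty()) return false; // analogous to MISSING
    }
    for (int s : state) if (acceptStates.contains(s)) return true;
    return false;
  }

  public static void main(String[] args) {
    OnTheFlyDfa a = new OnTheFlyDfa();
    System.out.println(a.run("aab")); // true: ends in "ab"
    System.out.println(a.run("abb")); // false
  }
}
```

Repeated runs reuse cached edges, so only previously unseen (state, char) pairs trigger new subset computations; Lucene's version additionally maps each state set to a small integer ordinal and buckets characters into classes via `getStartPoints()`.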

[GitHub] [lucene] gautamworah96 commented on a change in pull request #220: LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version

2021-07-27 Thread GitBox


gautamworah96 commented on a change in pull request #220:
URL: https://github.com/apache/lucene/pull/220#discussion_r677669790



##
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
##
@@ -475,8 +477,15 @@ private int addCategoryDocument(FacetLabel categoryPath, int parent) throws IOEx
 
     String fieldPath = FacetsConfig.pathToString(categoryPath.components, categoryPath.length);
     fullPathField.setStringValue(fieldPath);
+
+    if (useOlderStoredFieldIndex) {
+      fullPathField = new StringField(Consts.FULL, fieldPath, Field.Store.YES);

Review comment:
   So when we first initialize the `fieldPath` variable it is 
[initialized](https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java#L185) 
with `Field.Store.NO`. Then when we realize that the field has to be stored as 
a StringField, we need its value as well, so I reassign it with a 
`new StringField(Consts.FULL, fieldPath, Field.Store.YES)` statement.
   
   > Maybe just move the above line inside the if?
   
   Sure
   
   > why don't we also re-use the BinaryDocValues field
   
   Hmmm. Could you elaborate a bit? The field depends on the input `FacetLabel categoryPath`.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org






[GitHub] [lucene] zhaih commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

2021-07-27 Thread GitBox


zhaih commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r677669213



##
File path: 
lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space

Review comment:
   ++

##
File path: 
lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java
##
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import org.apache.lucene.util.IntsRef;
+import org.apache.lucene.util.LuceneTestCase;
+
+public class TestNFARunAutomaton extends LuceneTestCase {
+
+  public void testRandom() {
+    for (int i = 0; i < 100; i++) {
+      RegExp regExp = null;
+      while (regExp == null) {
+        try {
+          regExp = new RegExp(AutomatonTestUtil.randomRegexp(random()));
+        } catch (IllegalArgumentException e) {
+          ignoreException(e);
+        }
+      }
+      Automaton dfa = regExp.toAutomaton();

Review comment:
   Yeah I think that's better, will change it.

##
File path: 
lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java
##
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import org.apache.lucene.util.IntsRef;
+import org.apache.lucene.util.LuceneTestCase;
+
+public class TestNFARunAutomaton extends LuceneTestCase {
+
+  public void testRandom() {
+    for (int i = 0; i < 100; i++) {
+      RegExp regExp = null;
+      while (regExp == null) {
+        try {
+          regExp = new RegExp(AutomatonTestUtil.randomRegexp(random()));
+        } catch (IllegalArgument

[jira] [Resolved] (LUCENE-10036) Ensure ScoreCachingWrappingScorer doesn't unnecessarily wrap another ScoreCachingWrappingScorer

2021-07-27 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-10036.
--
Fix Version/s: 8.10
   main (9.0)
   Resolution: Fixed

> Ensure ScoreCachingWrappingScorer doesn't unnecessarily wrap another 
> ScoreCachingWrappingScorer
> ---
>
> Key: LUCENE-10036
> URL: https://issues.apache.org/jira/browse/LUCENE-10036
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: main (9.0), 8.10
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Trivial
> Fix For: main (9.0), 8.10
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> This is a trivial issue, but it's easy to mistakenly "double wrap" an 
> instance of {{ScoreCachingWrappingScorer}}, which is a bit wasteful. The 
> calling code currently needs to check the instance type of the {{Scorable}} 
> they intend to wrap to avoid this. {{FieldComparator}} is actually the only 
> calling code that does this check.
> It would be nice to add a factory method that encapsulates this check in 
> {{ScoreCachingWrappingScorer}} so that calling code doesn't need to worry 
> about it.
>  
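The factory-method idea in the issue body above can be sketched generically. This is a rough standalone illustration, not Lucene's actual API: the real ScoreCachingWrappingScorer also invalidates its cache per collected document, which is omitted here, and the class names below are invented:

```java
/**
 * Sketch of the LUCENE-10036 idea: a static factory that refuses to
 * double-wrap a caching decorator, so callers need no instanceof check.
 * Names are illustrative, not Lucene's.
 */
public class CachingScorable {
  interface Scorable { double score(); }

  static class ScoreCache implements Scorable {
    private final Scorable in;
    private double cached;
    private boolean valid = false;

    private ScoreCache(Scorable in) { this.in = in; } // private: use wrap()

    /** Wrap only if not already wrapped, so callers can never double-wrap. */
    static Scorable wrap(Scorable in) {
      if (in instanceof ScoreCache) {
        return in; // already caching: return unchanged
      }
      return new ScoreCache(in);
    }

    @Override public double score() {
      if (!valid) { cached = in.score(); valid = true; }
      return cached; // subsequent calls reuse the cached value
    }
  }

  public static void main(String[] args) {
    Scorable base = () -> 42.0;
    Scorable once = ScoreCache.wrap(base);
    Scorable twice = ScoreCache.wrap(once); // no extra layer added
    System.out.println(once == twice);      // true
  }
}
```

Making the constructor private and exposing only `wrap` moves the instanceof check from every caller (e.g. FieldComparator in the issue) into one place.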






[GitHub] [lucene-solr] gautamworah96 commented on pull request #2521: Backport LUCENE-9948 LUCENE-9964

2021-07-27 Thread GitBox


gautamworah96 commented on pull request #2521:
URL: https://github.com/apache/lucene-solr/pull/2521#issuecomment-887734156


   @gsmiller Yes. I think the change is simple enough to be coded manually. Git 
did not allow direct cherry picking due to conflicts and that is why I 
suggested that I backport the other change as well. I'll revise it





[GitHub] [lucene] gautamworah96 commented on a change in pull request #220: LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version

2021-07-27 Thread GitBox


gautamworah96 commented on a change in pull request #220:
URL: https://github.com/apache/lucene/pull/220#discussion_r677794298



##
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
##
@@ -164,9 +159,16 @@ public DirectoryTaxonomyWriter(Directory directory, OpenMode openMode, TaxonomyW
 openMode = config.getOpenMode();
 if (!DirectoryReader.indexExists(directory)) {

Review comment:
   Sure







[jira] [Created] (LUCENE-10037) Explore a single scoring implementation in DrillSidewaysScorer

2021-07-27 Thread Greg Miller (Jira)
Greg Miller created LUCENE-10037:


 Summary: Explore a single scoring implementation in 
DrillSidewaysScorer
 Key: LUCENE-10037
 URL: https://issues.apache.org/jira/browse/LUCENE-10037
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Affects Versions: main (9.0)
Reporter: Greg Miller


{{DrillSidewaysScorer}} currently implements three separate strategies for bulk 
scoring documents: {{doQueryFirstScoring}}, {{doUnionScoring}} and 
{{doDrillDownAdvanceScoring}}. As far as I can tell, this code dates back to 
2013 and two of the three approaches appear to emulate the {{BooleanScorer}} 
"window scoring" / "term-at-a-time" strategy. While this strategy in 
{{BooleanScorer}} is still useful in some cases, the primary benefit, from what 
I can tell, is to avoid re-heap operations in disjunction cases (as recently 
[described|http://mail-archives.apache.org/mod_mbox/lucene-dev/202106.mbox/%3CCAPsWd%2BMbYckCR2LHxHy4-%3DoZPnvX%3D9Er8hwb%2BG76jHb85JePvw%40mail.gmail.com%3E]
 by [~jpountz]). I can't see any reason why we'd prefer these two approaches 
anymore in {{DrillSidewaysScorer}} since we're doing pure conjunctions (no 
re-heaping to worry about) and {{doQueryFirstScoring}} takes advantage of 
skipping by advancing postings (while the other two approaches iterate their 
postings entirely, only relying on nextDoc functionality). Finally, we added an 
optimization (LUCENE-10030) that can only work for {{doQueryFirstScoring}} that 
lazily evaluates the {{score}} (where-as {{doUnionScoring}} and 
{{doDrillDownAdvanceScoring}} eagerly evaluate it).

 

All this is to say we should try sending all scoring through 
{{doQueryFirstScoring}} and see how it benchmarks. I'm not sure if we have 
benchmarks setup already for drill sideways, but I'd love to see if we can't 
optimize {{DrillSidewaysScorer}} while also reducing its code complexity!






[GitHub] [lucene] mrkm4ntr opened a new pull request #228: Remove impossible assert condition

2021-07-27 Thread GitBox


mrkm4ntr opened a new pull request #228:
URL: https://github.com/apache/lucene/pull/228


   
   
   
   # Description
   
   DWPTs that do not have the same deleteQueue as the flushingQueue are already filtered out.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code 
conforms to the standards described there to the best of my ability.
   - [ ] I have created a Jira issue and added the issue ID to my pull request title.
  I think this is too trivial to create a Jira issue for. But if it is required, I will create one.
   - [x] I have given Lucene maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [x] I have run `./gradlew check`.
   - [ ] I have added tests for my changes.
   





[jira] [Updated] (LUCENE-10037) Explore a single scoring implementation in DrillSidewaysScorer

2021-07-27 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller updated LUCENE-10037:
-
Issue Type: Task  (was: Improvement)

> Explore a single scoring implementation in DrillSidewaysScorer
> --
>
> Key: LUCENE-10037
> URL: https://issues.apache.org/jira/browse/LUCENE-10037
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/facet
>Affects Versions: main (9.0)
>Reporter: Greg Miller
>Priority: Minor
>
> {{DrillSidewaysScorer}} currently implements three separate strategies for 
> bulk scoring documents: {{doQueryFirstScoring}}, {{doUnionScoring}} and 
> {{doDrillDownAdvanceScoring}}. As far as I can tell, this code dates back to 
> 2013 and two of the three approaches appear to emulate the {{BooleanScorer}} 
> "window scoring" / "term-at-a-time" strategy. While this strategy in 
> {{BooleanScorer}} is still useful in some cases, the primary benefit, from 
> what I can tell, is to avoid re-heap operations in disjunction cases (as 
> recently 
> [described|http://mail-archives.apache.org/mod_mbox/lucene-dev/202106.mbox/%3CCAPsWd%2BMbYckCR2LHxHy4-%3DoZPnvX%3D9Er8hwb%2BG76jHb85JePvw%40mail.gmail.com%3E]
>  by [~jpountz]). I can't see any reason why we'd prefer these two approaches 
> anymore in {{DrillSidewaysScorer}} since we're doing pure conjunctions (no 
> re-heaping to worry about) and {{doQueryFirstScoring}} takes advantage of 
> skipping by advancing postings (while the other two approaches iterate their 
> postings entirely, only relying on nextDoc functionality). Finally, we added 
> an optimization (LUCENE-10030) that can only work for {{doQueryFirstScoring}} 
> that lazily evaluates the {{score}} (where-as {{doUnionScoring}} and 
> {{doDrillDownAdvanceScoring}} eagerly evaluate it).
>  
> All this is to say we should try sending all scoring through 
> {{doQueryFirstScoring}} and see how it benchmarks. I'm not sure if we have 
> benchmarks setup already for drill sideways, but I'd love to see if we can't 
> optimize {{DrillSidewaysScorer}} while also reducing its code complexity!






[GitHub] [lucene] mikemccand commented on a change in pull request #220: LUCENE-9450: Use BinaryDocValue fields in the taxonomy index based on the existing index version

2021-07-27 Thread GitBox


mikemccand commented on a change in pull request #220:
URL: https://github.com/apache/lucene/pull/220#discussion_r677908217



##
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
##
@@ -475,8 +477,15 @@ private int addCategoryDocument(FacetLabel categoryPath, int parent) throws IOEx
 
     String fieldPath = FacetsConfig.pathToString(categoryPath.components, categoryPath.length);
     fullPathField.setStringValue(fieldPath);
+
+    if (useOlderStoredFieldIndex) {
+      fullPathField = new StringField(Consts.FULL, fieldPath, Field.Store.YES);

Review comment:
   OK it's just super confusing that we act like we will reuse it (by 
creating it in ctor) but then sometimes overwrite it and other times re-use it.
   
   Couldn't we init it with the right `Field.Store.YES|NO` option up front, 
since we know in ctor which way it'll go, and then just always re-use it?
   
   And let's scratch my suggestion to also reuse the `BDV` field -- let's just 
make a new one each time like you are doing.







[GitHub] [lucene] dnhatn commented on pull request #228: Remove impossible assert condition

2021-07-27 Thread GitBox


dnhatn commented on pull request #228:
URL: https://github.com/apache/lucene/pull/228#issuecomment-887965341


   @mrkm4ntr I think we can remove the assertion entirely.





[GitHub] [lucene] mrkm4ntr commented on pull request #228: Remove unnecessary assertion

2021-07-27 Thread GitBox


mrkm4ntr commented on pull request #228:
URL: https://github.com/apache/lucene/pull/228#issuecomment-887970592


   @dnhatn Thanks! I removed the assertion.





[jira] [Updated] (LUCENE-10035) Simple text codec add multi level skip list data

2021-07-27 Thread wuda (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuda updated LUCENE-10035:
--
Issue Type: Wish  (was: New Feature)

> Simple text codec add  multi level skip list data 
> --
>
> Key: LUCENE-10035
> URL: https://issues.apache.org/jira/browse/LUCENE-10035
> Project: Lucene - Core
>  Issue Type: Wish
>  Components: core/codecs
>Affects Versions: main (9.0)
>Reporter: wuda
>Priority: Major
>  Labels: Impact, MultiLevelSkipList, SimpleTextCodec
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Simple text codec adds skip list data (including impacts) to help understand 
> the index format; for debugging, curiosity, and transparency only! When a 
> term's docFreq is greater than or equal to SimpleTextSkipWriter.BLOCK_SIZE 
> (default value is 8), the simple text codec will write a skip list, and the 
> *.pst (simple text term dictionary)* file will look like this:
> {code:java}
> field title
>   term args
> doc 2
>   freq 2
>   pos 7
>   pos 10
> ## we omit docs for better view ..
> doc 98
>   freq 2
>   pos 2
>   pos 6
> skipList 
> ?
>   level 1
> skipDoc 65
> skipDocFP 949
> impacts 
>   impact 
> freq 1
> norm 2
>   impact 
> freq 2
> norm 12
>   impact 
> freq 3
> norm 13
> impacts_end 
> ?
>   level 0
> skipDoc 17
> skipDocFP 284
> impacts 
>   impact 
> freq 1
> norm 2
>   impact 
> freq 2
> norm 12
> impacts_end 
> skipDoc 34
> skipDocFP 624
> impacts 
>   impact 
> freq 1
> norm 2
>   impact 
> freq 2
> norm 12
>   impact 
> freq 3
> norm 14
> impacts_end 
> skipDoc 65
> skipDocFP 949
> impacts 
>   impact 
> freq 1
> norm 2
>   impact 
> freq 2
> norm 12
>   impact 
> freq 3
> norm 13
> impacts_end 
> skipDoc 90
> skipDocFP 1311
> impacts 
>   impact 
> freq 1
> norm 2
>   impact 
> freq 2
> norm 10
>   impact 
> freq 3
> norm 13
>   impact 
> freq 4
> norm 14
> impacts_end 
> END
> checksum 000829315543
> {code}
> Compared with the previous format, we add *skipList, level, skipDoc, skipDocFP, 
> impacts, impact, freq, norm* nodes; at the same time, the simple text codec can 
> support advanceShallow at search time.
>  
>  
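As a rough companion to the output above: in the classic multi-level skip list scheme, level k gets an entry every blockSize^(k+1) documents, so with a block size of 8 every 8th position has a level-0 entry and every 64th also a level-1 entry. The sketch below illustrates only that interval rule; the exact positions SimpleTextSkipWriter emits depend on the doc ids and may differ:

```java
/**
 * Illustrative sketch of multi-level skip intervals: with block size 8,
 * position 8 gets 1 skip level, position 64 gets 2, position 512 gets 3.
 * This mirrors the general multi-level skip list convention, not the exact
 * SimpleText layout.
 */
public class SkipLevels {
  /** Number of skip levels that record an entry at this doc position. */
  static int levelsFor(int docPosition, int blockSize) {
    int levels = 0;
    int interval = blockSize;
    while (docPosition > 0 && docPosition % interval == 0) {
      levels++;            // this level records an entry here
      interval *= blockSize; // next level is blockSize times sparser
    }
    return levels; // 0 means: no skip entry for this position
  }

  public static void main(String[] args) {
    for (int pos : new int[] {7, 8, 16, 64, 100, 128}) {
      // prints: 7 -> 0, 8 -> 1, 16 -> 1, 64 -> 2, 100 -> 0, 128 -> 2
      System.out.println(pos + " -> " + levelsFor(pos, 8));
    }
  }
}
```

Sparser upper levels are what make advanceShallow cheap: a reader descends from the sparsest level that overshoots the target to progressively denser ones.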






[GitHub] [lucene] glawson0 commented on pull request #157: LUCENE-9963 Fix issue with FlattenGraphFilter throwing exceptions from holes

2021-07-27 Thread GitBox


glawson0 commented on pull request #157:
URL: https://github.com/apache/lucene/pull/157#issuecomment-888027281


   We have been using this change internally for a few weeks now. We no longer 
encounter the `ArrayIndexOutOfBounds` exceptions that we were previously 
experiencing. Depending on the dataset/analyzer combination we have seen up to 
a 1% increase in the average number of tokens per field. This comes from the 
tokens that had previously been dropped now being correctly indexed.

