[jira] [Updated] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Guo updated LUCENE-10233:
------------------------------
    Attachment: SparseFixedBitSet.png

> Store docIds as bitset when leafCardinality = 1 to speed up addAll
> ------------------------------------------------------------------
>
>                 Key: LUCENE-10233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10233
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Feng Guo
>            Priority: Major
>         Attachments: SparseFixedBitSet.png
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> In low-cardinality points cases, id blocks will usually store doc ids that
> have the same point value, and {{intersect}} will get into the {{addAll}} logic.
> If we store ids as a bitset and give the IntersectVisitor bulk-visiting
> ability, we can speed up addAll because we can just execute the 'or' logic
> between the result and the block ids.
> The optimization is triggered when the following conditions are met at the
> same time:
> # leafCardinality = 1
> # max(docId) - min(docId) <= 16 * pointCount (in order to avoid expanding
> storage too much)
> # no duplicate doc ids
> I mocked a field that has 10,000,000 docs per value and searched it with a
> one-term PointInSetQuery; the build-scorer time decreased from 71ms to 8ms.
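The 'or' idea above can be pictured with a short sketch. This is illustrative only, not the actual patch: the method name, the 64-bit alignment assumption, and the way the leaf exposes its stored words are all simplifications.

```java
import org.apache.lucene.util.FixedBitSet;

// Illustrative sketch of bulk-visiting a bitset-encoded leaf (not the actual
// patch): instead of calling visit(docID) once per document, OR the leaf's
// stored words straight into the query's result bitset.
public final class BitsetLeafSketch {
  /**
   * @param result   accumulating bitset over the whole segment
   * @param leafBits words of the leaf's stored bitset, covering doc IDs
   *                 [minDocId, minDocId + 64 * leafBits.length)
   * @param minDocId assumed to be a multiple of 64 to keep the sketch simple;
   *                 a real encoding would handle arbitrary offsets
   */
  public static void orLeafIntoResult(FixedBitSet result, long[] leafBits, int minDocId) {
    long[] resultBits = result.getBits();
    int wordOffset = minDocId >>> 6; // first result word that the leaf touches
    for (int i = 0; i < leafBits.length; i++) {
      resultBits[wordOffset + i] |= leafBits[i];
    }
  }
}
```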
[jira] [Commented] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445792#comment-17445792 ]

Feng Guo commented on LUCENE-10233:
-----------------------------------
Hi [~jpountz], would you like to give me some suggestions on how to speed up
SparseFixedBitSet? Here are some of my current thoughts:

This is the CPU profile of SparseFixedBitSet: [^SparseFixedBitSet.png]. Most of
the CPU is spent constructing SparseFixedBitSet; I think this is because we
always allocate new big arrays there. Constructing a single global
SparseFixedBitSet and reusing it for each block may help, but a global
SparseFixedBitSet needs the segment's maxDoc, and maxDoc is not available in
the BKDReader. I'm not sure it is worth changing the BKDReader constructor
signature for this, since BKDReader is a public class.
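A minimal sketch of the reuse idea under discussion, assuming maxDoc were somehow available to the reader (which, as the comment notes, it currently is not); clearing only the range the previous block touched keeps the reset cheap.

```java
import org.apache.lucene.util.SparseFixedBitSet;

// Sketch only: allocate the bitset once and clear just the touched range
// between blocks, instead of constructing a new SparseFixedBitSet per block.
// Assumes maxDoc is available, which is exactly the open question above.
final class ReusableBitSetSketch {
  private final SparseFixedBitSet scratch;

  ReusableBitSetSketch(int maxDoc) {
    this.scratch = new SparseFixedBitSet(maxDoc);
  }

  SparseFixedBitSet forBlock(int prevMinDoc, int prevMaxDoc) {
    scratch.clear(prevMinDoc, prevMaxDoc + 1); // reset only what the last block set
    return scratch;
  }
}
```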
[GitHub] [lucene] iverase commented on a change in pull request #7: LUCENE-9820: Separate logic for reading the BKD index from logic to intersecting it
iverase commented on a change in pull request #7:
URL: https://github.com/apache/lucene/pull/7#discussion_r752128672

## File path: lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java

@@ -328,31 +288,97 @@ public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
             // Not deleted!
             docID = mappedDocID;
             System.arraycopy(
-                packedValues,
-                index * bkd.config.packedBytesLength,
-                state.scratchDataPackedValue,
+                mergeIntersectsVisitor.packedValues,
+                index * packedBytesLength,
+                packedValue,
                 0,
-                bkd.config.packedBytesLength);
+                packedBytesLength);
             return true;
           }
         }
       }
+
+    private boolean collectNextLeaf() throws IOException {
+      assert pointTree.moveToChild() == false;
+      mergeIntersectsVisitor.reset();
+      do {
+        if (pointTree.moveToSibling()) {
+          // move to first child of this node and collect docs
+          while (pointTree.moveToChild()) {}
+          pointTree.visitDocValues(mergeIntersectsVisitor);
+          return true;
+        }
+      } while (pointTree.moveToParent());
+      return false;
+    }
+  }
+
+  private static class MergeIntersectsVisitor implements IntersectVisitor {
+
+    int docsInBlock = 0;
+    byte[] packedValues;
+    int[] docIDs;
+    private final int packedBytesLength;
+
+    MergeIntersectsVisitor(int packedBytesLength) {
+      this.docIDs = new int[0];
+      this.packedValues = new byte[0];
+      this.packedBytesLength = packedBytesLength;
+    }
+
+    void reset() {
+      docsInBlock = 0;
+    }
+
+    @Override
+    public void grow(int count) {
+      assert docsInBlock == 0;
+      if (docIDs.length < count) {
+        docIDs = ArrayUtil.grow(docIDs, count);
+        int packedValuesSize = Math.toIntExact(docIDs.length * (long) packedBytesLength);
+        if (packedValuesSize > ArrayUtil.MAX_ARRAY_LENGTH) {

Review comment:
   I think that is ok. We control how we are iterating the tree, and we are
   visiting one leaf at a time. One leaf cannot contain more than
   `max_points_in_leaf_node` points, so it is not related to the points
   cardinality.
[GitHub] [lucene] jpountz commented on a change in pull request #7: LUCENE-9820: Separate logic for reading the BKD index from logic to intersecting it
jpountz commented on a change in pull request #7:
URL: https://github.com/apache/lucene/pull/7#discussion_r752133629

## File path: lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java

Review comment:
   Ah, thanks for explaining, maybe this is the thing I was confused about. I
   thought it could be called by e.g. `BKDReader#addAll`, which tries to grow
   by as much as possible (up to Integer#MAX_VALUE), but actually this merge
   visitor always visits leaf-by-leaf?
[GitHub] [lucene-solr] janhoy opened a new pull request #2612: SOLR-15804 Allow content-types with ;charset in ShowFileRequestHandler
janhoy opened a new pull request #2612:
URL: https://github.com/apache/lucene-solr/pull/2612

   * Request .html as text/xml, .css as text/plain and .js as application/javascript
   * Highlight css as txt (crashed browser)
   * Set extension to 'xml' for managed-schema instead of 'text/xml'
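The ";charset" handling the PR title implies boils down to comparing only the MIME type ahead of any parameters. A hypothetical illustration of that comparison; this is not the actual Solr code and the method name is made up.

```java
// Hypothetical illustration (not the actual Solr code): a content type such as
// "text/xml;charset=utf-8" should satisfy a check written against the bare
// MIME type "text/xml", so strip everything from the first ';' before comparing.
final class ContentTypeSketch {
  static boolean mimeTypeEquals(String contentType, String expectedMimeType) {
    int semi = contentType.indexOf(';');
    String mimeType = (semi < 0 ? contentType : contentType.substring(0, semi)).trim();
    return mimeType.equalsIgnoreCase(expectedMimeType);
  }
}
```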
[GitHub] [lucene-solr] janhoy merged pull request #2612: SOLR-15804 Allow content-types with ;charset in ShowFileRequestHandler
janhoy merged pull request #2612:
URL: https://github.com/apache/lucene-solr/pull/2612
[GitHub] [lucene] iverase commented on a change in pull request #7: LUCENE-9820: Separate logic for reading the BKD index from logic to intersecting it
iverase commented on a change in pull request #7:
URL: https://github.com/apache/lucene/pull/7#discussion_r752168490

## File path: lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java

Review comment:
   Yes, the magic happens in the method `collectNextLeaf` where we fill up the
   buffers.
[jira] [Updated] (LUCENE-10231) Do not cache too large PointInSetQueries
[ https://issues.apache.org/jira/browse/LUCENE-10231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Guo updated LUCENE-10231:
------------------------------
    Priority: Minor  (was: Major)

> Do not cache too large PointInSetQueries
> ----------------------------------------
>
>                 Key: LUCENE-10231
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10231
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Feng Guo
>            Priority: Minor
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> TermInSetQuery avoids caching queries that use too much memory, but there is
> no such limit in PointInSetQuery and PointInSetQueryWithScore, which may have
> similar problems. I added the logic to these two queries, but I'm not sure
> whether this is the most reasonable way. Maybe this logic is common and we
> could handle it in LRUQueryCache?
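A rough sketch of the gating idea, assuming a byte threshold (the constant below is made up): queries that expose their footprint via Accountable, as TermInSetQuery does, can be declined by a caching policy when they are too large.

```java
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Accountable;

// Rough sketch of the gating idea (the threshold is a made-up value): decline
// to cache queries that report a large memory footprint through Accountable.
final class CacheSizeGate {
  private static final long MAX_RAM_BYTES_PER_QUERY = 1024 * 1024; // hypothetical limit

  static boolean worthCaching(Query query) {
    if (query instanceof Accountable) {
      return ((Accountable) query).ramBytesUsed() <= MAX_RAM_BYTES_PER_QUERY;
    }
    return true; // no size information: defer to the existing policy
  }
}
```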
[GitHub] [lucene-solr] janhoy commented on pull request #2604: SOLR-15768 Tune zookeeper request handler permissions
janhoy commented on pull request #2604:
URL: https://github.com/apache/lucene-solr/pull/2604#issuecomment-972859726

   @dsmiley I wrapped this up for the 8x branch, added a changes entry, made
   the JIRA public, and this is ready to merge. Any other feedback?
[jira] [Commented] (LUCENE-10162) Add IntField, LongField, FloatField and DoubleField classes to index both points and doc values
[ https://issues.apache.org/jira/browse/LUCENE-10162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445924#comment-17445924 ]

Quentin Pradet commented on LUCENE-10162:
-----------------------------------------
This is very interesting, as I've been confused by this when working on
LUCENE-10085.

* Is it correct to assume that this is only "syntactic sugar" and that under
the hood we'll still add an IntPoint and a NumericDocValuesField to the doc?
* If yes, would it make sense to desugar in Document#add? If the field is
composite (e.g. IntField), then we add the two fields instead of one.
* If no, at what point should we split IntField into its components?

> Add IntField, LongField, FloatField and DoubleField classes to index both
> points and doc values
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-10162
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10162
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>
> Currently we have IntPoint, LongPoint, FloatPoint and DoublePoint on the one
> hand, and NumericDocValuesField and SortedNumericDocValuesField on the other
> hand.
> When we introduced these classes, this distinction made sense: use the
> XXXPoint classes if you want your numeric fields to be searchable and the
> XXXDocValuesField classes if you want your numeric fields to be
> sortable/aggregatable.
> However, since then we introduced logic to take advantage of doc values for
> filtering (IndexOrDocValuesQuery) and enhanced sorting to take advantage of
> the points index to skip non-competitive documents. So even if you only need
> searching, or if you only need sorting, it's likely a good idea to index both
> with points *and* doc values.
> Could we make this easier on users by having XXXField classes that
> automatically do it, as opposed to requiring users to add both an XXXPoint
> and an XXXDocValuesField for every numeric field to their index? This could
> also make consuming these fields easier, e.g. factory methods for range
> queries could automatically use IndexOrDocValuesQuery.
[jira] [Commented] (LUCENE-10162) Add IntField, LongField, FloatField and DoubleField classes to index both points and doc values
[ https://issues.apache.org/jira/browse/LUCENE-10162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445929#comment-17445929 ]

Adrien Grand commented on LUCENE-10162:
---------------------------------------
bq. Is it correct to assume that this is only "syntactic sugar" and that under
the hood we'll still add an IntPoint and a NumericDocValuesField to the doc?

Correct.

bq. would it make sense to desugar in Document#add? If the field is composite
(e.g. IntField), then we add the two fields instead of one

I don't think there's an easy way to do this. Lucene Field classes tend to be
very simple: they mostly wrap a value and configure a FieldType object that
describes how the data should be indexed. The way I was thinking of addressing
this issue would consist of implementing these LongField/IntField/... classes
from scratch, without trying to reuse LongPoint/IntPoint/... or
NumericDocValuesField/SortedNumericDocValuesField.
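A rough sketch of the from-scratch approach described above, not the final API: one FieldType enables both a 1D point and SORTED_NUMERIC doc values, and the field hands the packed bytes to the points writer and the raw long to the doc-values writer. Class name and details are illustrative.

```java
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.DocValuesType;
import org.apache.lucene.util.BytesRef;

// Sketch of a combined field (not the final API): one FieldType carrying both
// a 1D point and SORTED_NUMERIC doc values, so a single add() indexes both.
public class LongFieldSketch extends Field {
  private static final FieldType TYPE = new FieldType();

  static {
    TYPE.setDimensions(1, Long.BYTES);                   // points side
    TYPE.setDocValuesType(DocValuesType.SORTED_NUMERIC); // doc-values side
    TYPE.freeze();
  }

  public LongFieldSketch(String name, long value) {
    super(name, TYPE);
    fieldsData = value;
  }

  @Override
  public BytesRef binaryValue() {
    // the points writer reads the packed encoding
    return LongPoint.pack((Long) fieldsData);
  }

  @Override
  public Number numericValue() {
    // the doc-values writer reads the raw long
    return (Number) fieldsData;
  }
}
```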
[GitHub] [lucene] gsmiller commented on a change in pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
gsmiller commented on a change in pull request #443:
URL: https://github.com/apache/lucene/pull/443#discussion_r752270520

## File path: lucene/facet/src/java/org/apache/lucene/facet/FacetsConfig.java

@@ -409,9 +410,26 @@ private void processFacetFields(
       indexDrillDownTerms(doc, indexFieldName, dimConfig, facetLabel);
     }

-    // Facet counts:
-    // DocValues are considered stored fields:
-    doc.add(new BinaryDocValuesField(indexFieldName, dedupAndEncode(ordinals.get())));
+    // Store the taxonomy ordinals associated with each doc. Prefer to use SortedNumericDocValues
+    // but "fall back" to a custom binary format to maintain backwards compatibility with Lucene 8
+    // indexes.
+    if (taxoWriter.useNumericDocValuesForOrdinals()) {

Review comment:
   OK sounds like this should work then. Thanks @mikemccand!
[GitHub] [lucene] gsmiller commented on a change in pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
gsmiller commented on a change in pull request #443:
URL: https://github.com/apache/lucene/pull/443#discussion_r752271392

## File path: lucene/facet/src/java/org/apache/lucene/facet/FacetUtils.java

@@ -81,4 +82,17 @@ public long cost() {
       }
     };
   }
+
+  /**
+   * Determine whether-or-not an index segment is using the older-style binary format or the newer
+   * NumericDocValues format for storing taxonomy faceting ordinals (for the specified field).
+   *
+   * @deprecated Please do not rely on this method. It is added as a temporary measure for providing
+   *     index backwards-compatibility with Lucene 8 and earlier indexes, and will be removed in
+   *     Lucene 10.
+   */
+  @Deprecated
+  public static boolean usesOlderBinaryOrdinals(LeafReader reader) {
+    return reader.getMetaData().getCreatedVersionMajor() <= 8;
+  }

Review comment:
   Makes sense. I went ahead and made this change. Thanks for the suggestion.
[GitHub] [lucene] rmuir commented on a change in pull request #452: LUCENE-10239: upgrade jflex (1.7.0 -> 1.8.2)
rmuir commented on a change in pull request #452:
URL: https://github.com/apache/lucene/pull/452#discussion_r752274558

## File path: dev-tools/missing-doclet/src/main/java/org/apache/lucene/missingdoclet/MissingDoclet.java

@@ -331,7 +331,6 @@ private boolean hasInheritedJavadocs(Element element) {
     if (hasOverrides) {
       // If an element has explicit @Overrides annotation, assume it does
       // have inherited javadocs somewhere.
-      reporter.print(Diagnostic.Kind.NOTE, element, "javadoc empty but @Override declared, skipping.");

Review comment:
   We could add back the debugging guarded by a sysprop. Maybe we could enable
   additional output from the gradle build if someone runs `gradlew --debug`?
   I keep thinking there should already be a "proper" way for this (e.g. a
   cmdline flag to the javadoc tool), but I'm not finding it.
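The sysprop guard could look like this minimal sketch; the property name is made up, and the helper is shown as a standalone class rather than inside MissingDoclet.

```java
import javax.lang.model.element.Element;
import javax.tools.Diagnostic;
import jdk.javadoc.doclet.Reporter;

// Sketch of the sysprop guard suggested above (the property name is made up):
// keep the NOTE-level diagnostic, but only emit it when explicitly requested.
final class VerboseNotes {
  private static final boolean VERBOSE = Boolean.getBoolean("missingdoclet.verbose");

  static void noteSkipped(Reporter reporter, Element element) {
    if (VERBOSE) {
      reporter.print(
          Diagnostic.Kind.NOTE, element, "javadoc empty but @Override declared, skipping.");
    }
  }
}
```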
[GitHub] [lucene] dweiss commented on a change in pull request #452: LUCENE-10239: upgrade jflex (1.7.0 -> 1.8.2)
dweiss commented on a change in pull request #452:
URL: https://github.com/apache/lucene/pull/452#discussion_r752281787

## File path: dev-tools/missing-doclet/src/main/java/org/apache/lucene/missingdoclet/MissingDoclet.java

Review comment:
   I think that code is sort of correct in logging at NOTE level; the problem
   is I can't find the right option in javadoc to skip NOTE messages (and leave
   only warnings). I'd comment it out instead of removing it so that perhaps it
   can be tackled separately. There should be a way of doing it somehow.
[GitHub] [lucene] gsmiller commented on a change in pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
gsmiller commented on a change in pull request #443:
URL: https://github.com/apache/lucene/pull/443#discussion_r752290342

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/CachedOrdinalsReader.java

@@ -45,7 +45,11 @@
  *
  * NOTE: create one instance of this and re-use it for all facet implementations (the
  * cache is per-instance, not static).
+ *
+ * @deprecated Custom binary encodings for taxonomy ordinals are no longer supported starting with

Review comment:
   Added. Thanks!
[GitHub] [lucene] rmuir commented on a change in pull request #452: LUCENE-10239: upgrade jflex (1.7.0 -> 1.8.2)
rmuir commented on a change in pull request #452:
URL: https://github.com/apache/lucene/pull/452#discussion_r752297406

## File path: dev-tools/missing-doclet/src/main/java/org/apache/lucene/missingdoclet/MissingDoclet.java

Review comment:
   Yeah, I just haven't experimented with all the options, such as the default
   level vs `-verbose` vs `-quiet`.
[GitHub] [lucene] gsmiller commented on a change in pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
gsmiller commented on a change in pull request #443:
URL: https://github.com/apache/lucene/pull/443#discussion_r752298802

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/OrdinalMappingLeafReader.java

@@ -107,6 +113,64 @@ public BytesRef binaryValue() {
     }
   }

+  private class OrdinalMappingSortedNumericDocValues extends FilterSortedNumericDocValues {
+    private final IntArrayList currentValues;
+    private int currIndex;
+
+    OrdinalMappingSortedNumericDocValues(SortedNumericDocValues in) {
+      super(in);
+      currentValues = new IntArrayList(32);
+    }
+
+    @Override
+    public boolean advanceExact(int target) throws IOException {
+      boolean result = in.advanceExact(target);
+      if (result) {
+        reloadValues();
+      }
+      return result;
+    }
+
+    @Override
+    public int advance(int target) throws IOException {
+      int result = in.advance(target);
+      if (result != DocIdSetIterator.NO_MORE_DOCS) {
+        reloadValues();
+      }
+      return result;
+    }
+
+    @Override
+    public int nextDoc() throws IOException {
+      int result = in.nextDoc();
+      if (result != DocIdSetIterator.NO_MORE_DOCS) {
+        reloadValues();
+      }
+      return result;
+    }
+
+    @Override
+    public int docValueCount() {
+      return currentValues.elementsCount;
+    }
+
+    private void reloadValues() throws IOException {
+      currIndex = 0;
+      currentValues.clear();
+      for (int i = 0; i < in.docValueCount(); i++) {
+        currentValues.add(ordinalMap[(int) in.nextValue()]);

Review comment:
   @rmuir Hmm, yeah, interesting. We have a couple of cases related to taxonomy
   faceting now where we're explicitly storing ordinals in a dv field, but not
   using SORTED_SET, since the "mapping" back to the string values is handled
   by the taxonomy index (this change and LUCENE-10122). I wonder if there are
   any other cases like this outside of our taxonomy faceting? I wonder if it
   would make any sense at all to consider a new dv field type for explicitly
   storing ordinals? I suspect we don't have enough general use-cases to
   support such an idea, but just throwing it out there for conversation. Such
   a field could make assumptions about the data that would probably allow it
   to be more efficient, and some of the logic we're currently handling
   ourselves (e.g., de-duping) could be done by the field type. Like I said,
   probably not enough real-world use-cases to justify adding a new type, but
   figured I'd float the suggestion anyway for input.
[GitHub] [lucene] costin opened a new pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]
costin opened a new pull request #453:
URL: https://github.com/apache/lucene/pull/453

   Add VarHandle variants for `Packed64` to investigate whether using
   VarHandles improves performance or not. This PR introduces two new Packed64
   variants:

   ~ Packed64LongLong
   Same as Packed64, however instead of using direct array access it relies on
   `VarHandle`. Its main purpose is benchmarking.

   ~ Packed64LongByte
   A different implementation of Packed64 backed by a byte[]. It reads the
   data as a long and, in case of overflow, reads an extra byte. The algorithm
   is slightly different since the overflowing bits need to be computed on a
   different block size (8 vs 64), which requires extra cycles.

   Related LUCENE-10205
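For readers following along, here is a condensed sketch of the byte[]-plus-VarHandle read path the description outlines. Because the in-window shift is at most 7 bits, an overflowing value never needs more than one extra byte. This is a simplification of the idea, not the PR's code; the padding and bounds handling are assumptions.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Condensed sketch of the byte[]-backed read path (not the PR's code): read a
// long at an arbitrary byte offset via a VarHandle view, then fetch one extra
// byte when the value straddles the 8-byte window.
final class Packed64ByteSketch {
  private static final VarHandle LONG_VIEW =
      MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

  private final byte[] blocks; // padded so unaligned reads never run off the end
  private final int bitsPerValue;
  private final long mask;

  Packed64ByteSketch(int valueCount, int bitsPerValue) {
    this.bitsPerValue = bitsPerValue;
    this.mask = bitsPerValue == 64 ? ~0L : (1L << bitsPerValue) - 1;
    this.blocks = new byte[(int) (((long) valueCount * bitsPerValue + 7) / 8) + 8];
  }

  long get(int index) {
    long bitPos = (long) index * bitsPerValue;
    int bytePos = (int) (bitPos >>> 3);
    int shift = (int) (bitPos & 7); // 0..7, so at most one extra byte is needed
    long value = ((long) LONG_VIEW.get(blocks, bytePos) >>> shift) & mask;
    int bitsRead = 64 - shift;
    if (bitsRead < bitsPerValue) {
      // overflow: the remaining high bits live in the following byte
      long extra = blocks[bytePos + 8] & 0xFFL;
      value |= (extra << bitsRead) & mask;
    }
    return value;
  }
}
```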
[GitHub] [lucene] rmuir commented on a change in pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
rmuir commented on a change in pull request #443:
URL: https://github.com/apache/lucene/pull/443#discussion_r752303321

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/OrdinalMappingLeafReader.java

Review comment:
   Currently in the benchmarks, isn't the sorted-set approach faster? So
   personally I wouldn't want to add another DV type that just lets you do
   things slower?
[jira] [Commented] (LUCENE-10162) Add IntField, LongField, FloatField and DoubleField classes to index both points and doc values
[ https://issues.apache.org/jira/browse/LUCENE-10162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445971#comment-17445971 ]

Artem Prigoda commented on LUCENE-10162:
----------------------------------------
Ah, and I tried to create a new Field which combines the functionality of
XXXPoint and `NumericDocValuesField`! Anyway, that was an interesting journey
into the Lucene internals.
[GitHub] [lucene] rmuir commented on a change in pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
rmuir commented on a change in pull request #443:
URL: https://github.com/apache/lucene/pull/443#discussion_r752328265

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/OrdinalMappingLeafReader.java

Review comment:
   @gsmiller Put another way, I think it would be much more interesting to
   think about something like a new 'taxonomy' or 'hierarchy' type that
   supports the additional features the current taxo index provides, rather
   than providing a lot of hooks for apps to manage ordinals themselves. It
   could probably ultimately perform better and be easier to use (e.g. not
   juggle a sidecar index). But this PR is a great step forward, as at least
   we get the performance closer :)
[GitHub] [lucene] gsmiller commented on pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
gsmiller commented on pull request #443:
URL: https://github.com/apache/lucene/pull/443#issuecomment-972948547

   > Maybe clarify that full index rebuild means start a new 9.x index and
   > index all your documents. I.e. merely re-indexing all docs into your
   > previous (8.x created) index is not sufficient.

   Makes sense. Will tweak the language.
[GitHub] [lucene] gsmiller commented on a change in pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
gsmiller commented on a change in pull request #443:
URL: https://github.com/apache/lucene/pull/443#discussion_r752335273

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FastTaxonomyFacetCounts.java

@@ -69,31 +69,34 @@ public FastTaxonomyFacetCounts(
   }

   private final void count(List matchingDocs) throws IOException {
+    if (matchingDocs.isEmpty()) {
+      return;
+    }
+
+    boolean useBinaryDv = FacetUtils.usesOlderBinaryOrdinals(matchingDocs.get(0).context.reader());
+
     for (MatchingDocs hits : matchingDocs) {
-      BinaryDocValues dv = hits.context.reader().getBinaryDocValues(indexFieldName);
-      if (dv == null) { // this reader does not have DocValues for the requested category list
+      assert useBinaryDv == FacetUtils.usesOlderBinaryOrdinals(hits.context.reader());

Review comment:
   Hmm, interesting @jpountz. So I think this still works as expected in the
   case of a `MultiReader`, but do you see any concerns there? In a situation
   where a `MultiReader` wrapped an 8.x and 9.x index, it seems like the
   `LeafReader`s will correctly report the 8.x / 9.x versions, and the older
   format will be retained for the 8.x index and the newer for 9.x. I'm not
   very familiar with these MultiReader cases though, so I'd like to confirm
   this understanding. Thanks!
[GitHub] [lucene] gsmiller commented on a change in pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
gsmiller commented on a change in pull request #443:
URL: https://github.com/apache/lucene/pull/443#discussion_r752355654

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/OrdinalMappingLeafReader.java

Review comment:
   > @gsmiller put another way, I think it would be much more interesting to
   > think about something like a new 'taxonomy' or 'hierarchy' type that
   > supports the additional features the current taxo index provides, rather
   > than providing a lot of hooks for apps to manage ordinals themselves. It
   > could probably ultimately perform better and be easier to use (e.g. not
   > juggle a sidecar index).

   I like this way of thinking about the problem. As far as benchmarking
   performance goes, the big improvement was getting away from the custom
   binary format (which was packing ordinals into a single binary payload
   using a v-int style approach) and using SORTED_NUMERIC instead to encode
   the ordinals for each doc. So there could be room for more improvement (or
   at least remaining performance-neutral) by having an "ordinal specific" doc
   values field type. But I like your way of thinking and wouldn't hold up
   this PR by any means to explore something different at this point. Going to
   SORTED_NUMERIC is a good step forward for now. Thanks!
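To make the comparison concrete: with SORTED_NUMERIC, counting reduces to a plain iterator loop over already-decoded longs, versus per-document v-int decoding of a binary payload under the old format. A simplified sketch; the field name and surrounding plumbing are illustrative, not the actual FastTaxonomyFacetCounts code.

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SortedNumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;

// Simplified sketch of counting ordinals from SORTED_NUMERIC doc values:
// no custom decoding per document, just nextDoc()/nextValue().
final class OrdinalCountSketch {
  static void countOrdinals(LeafReader reader, int[] counts) throws IOException {
    SortedNumericDocValues ords = reader.getSortedNumericDocValues("$facets");
    if (ords == null) {
      return; // this segment has no ordinals for the field
    }
    for (int doc = ords.nextDoc();
        doc != DocIdSetIterator.NO_MORE_DOCS;
        doc = ords.nextDoc()) {
      int valueCount = ords.docValueCount();
      for (int i = 0; i < valueCount; i++) {
        counts[(int) ords.nextValue()]++;
      }
    }
  }
}
```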
[GitHub] [lucene] spyk commented on a change in pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue
spyk commented on a change in pull request #380:
URL: https://github.com/apache/lucene/pull/380#discussion_r752359873

## File path: lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java

@@ -169,11 +169,14 @@ public static String getLemmatizerDictionary(String dictionaryFile, ResourceLoader loader)
           builder.append(chars, 0, numRead);
         }
       } while (numRead > 0);
-      dictionary = builder.toString();
-      lemmaDictionaries.put(dictionaryFile, dictionary);
+      String dictionary = builder.toString();
+      InputStream dictionaryInputStream =
+          new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8));
+      dictionaryLemmatizer = new DictionaryLemmatizer(dictionaryInputStream);

Review comment:
   Oh, you're right, it's the same exact issue with all `DictionaryLemmatizer`
   constructors. So I guess it can be replaced anyway by
   `loader.openResource(dictionaryFile)` for the sake of simplicity. I was not
   aware of the ReaderInputStream solution btw, but yes, looks like it's not
   on the classpath.
[GitHub] [lucene] jpountz merged pull request #450: LUCENE-10242: The TopScoreDocCollector::createSharedManager should use ScoreDoc instead of FieldDoc
jpountz merged pull request #450:
URL: https://github.com/apache/lucene/pull/450
[GitHub] [lucene] reta commented on pull request #450: LUCENE-10242: The TopScoreDocCollector::createSharedManager should use ScoreDoc instead of FieldDoc
reta commented on pull request #450:
URL: https://github.com/apache/lucene/pull/450#issuecomment-972980540

   Thank you @jpountz!
[jira] [Commented] (LUCENE-10242) The TopScoreDocCollector::createSharedManager should use ScoreDoc instead of FieldDoc
[ https://issues.apache.org/jira/browse/LUCENE-10242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445993#comment-17445993 ]

ASF subversion and git services commented on LUCENE-10242:
----------------------------------------------------------
Commit 6bd5c14bf3c6113244773d2e1eb39233dc33ad71 in lucene's branch
refs/heads/main from Andriy Redko
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=6bd5c14 ]

LUCENE-10242: The TopScoreDocCollector::createSharedManager should use
ScoreDoc instead of FieldDoc (#450)

Signed-off-by: Andriy Redko

> The TopScoreDocCollector::createSharedManager should use ScoreDoc instead
> of FieldDoc
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-10242
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10242
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 8.9, 8.10
>            Reporter: Andriy Redko
>            Priority: Minor
>             Fix For: 9.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The `TopScoreDocCollector::createSharedManager` should use `ScoreDoc`
> instead of `FieldDoc` as the type of the `after` argument. The `FieldDoc` is
> really needed only in the case of `TopFieldCollector`, but now its usage is
> mandated by `TopScoreDocCollector` as well.
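Assuming the post-change signature, usage looks roughly like the sketch below: relevance-only paging needs nothing more than the previous page's last ScoreDoc, with no FieldDoc (and hence no sort fields) involved.

```java
import org.apache.lucene.search.CollectorManager;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopScoreDocCollector;

// Sketch of usage after the change: the shared manager accepts any ScoreDoc
// as the "after" anchor instead of requiring a FieldDoc.
final class SearchAfterExample {
  static CollectorManager<TopScoreDocCollector, TopDocs> pageAfter(ScoreDoc lastHit) {
    // numHits = 10, totalHitsThreshold = 1000 are arbitrary example values
    return TopScoreDocCollector.createSharedManager(10, lastHit, 1000);
  }
}
```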
[jira] [Resolved] (LUCENE-10242) The TopScoreDocCollector::createSharedManager should use ScoreDoc instead of FieldDoc
[ https://issues.apache.org/jira/browse/LUCENE-10242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand resolved LUCENE-10242.
-----------------------------------
    Fix Version/s: 9.0
       Resolution: Fixed

Thank you!
[jira] [Commented] (LUCENE-10242) The TopScoreDocCollector::createSharedManager should use ScoreDoc instead of FieldDoc
[ https://issues.apache.org/jira/browse/LUCENE-10242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445994#comment-17445994 ]

ASF subversion and git services commented on LUCENE-10242:
----------------------------------------------------------
Commit 42bee6f22302dad73b2a336db4b4725f69282e33 in lucene's branch
refs/heads/branch_9x from Andriy Redko
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=42bee6f ]

LUCENE-10242: The TopScoreDocCollector::createSharedManager should use
ScoreDoc instead of FieldDoc (#450)

Signed-off-by: Andriy Redko
[GitHub] [lucene] costin commented on pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]
costin commented on pull request #453:
URL: https://github.com/apache/lucene/pull/453#issuecomment-972988872

   I've tested `Packed64LongByte` multiple times against `TestPackedInts` and
   it seems to pass. Note that I've focused mainly on the `get` and `set`
   methods and removed the specialized fill and bulk get/set methods due to
   the slightly different encoding.

   tl;dr - `Packed64` has the best performance, followed by
   `Packed64VarHandleLongLong` and lastly by `Packed64LongByte`, with a
   significant gap. My guess is that the long[] array is more cache friendly
   while the byte[] isn't: by sliding across the byte[] to read longs, the
   caches are thrashed, which is much more expensive than doing bitwise
   operations. The data in the byte[] case supports this - performance is
   great for the 1 and 64 bpv cases (where the sliding is minimal since new
   positions belong to the same byte block), while for the rest it falls to
   the ground.

   I've included the benchmark in the commit; the tests are running on the
   latest JDK 17.0.1 on a Ryzen 5900x.

   ```
   openjdk version "17.0.1" 2021-10-19
   OpenJDK Runtime Environment Temurin-17.0.1+12 (build 17.0.1+12)
   OpenJDK 64-Bit Server VM Temurin-17.0.1+12 (build 17.0.1+12, mixed mode, sharing)
   ```

   ```
   # JMH version: 1.33
   # VM version: JDK 17.0.1, OpenJDK 64-Bit Server VM, 17.0.1+12

   Benchmark                                        (bpv)  (size)   Mode  Cnt      Score      Error  Units
   Packed64Benchmark.packed64                           1   10240  thrpt    3  80129.482 ± 2603.161  ops/s
   Packed64Benchmark.packed64                           4   10240  thrpt    3  87680.030 ± 2725.245  ops/s
   Packed64Benchmark.packed64                           5   10240  thrpt    3  56768.849 ±  633.855  ops/s
   Packed64Benchmark.packed64                           8   10240  thrpt    3  91869.977 ± 1579.788  ops/s
   Packed64Benchmark.packed64                          11   10240  thrpt    3  71802.824 ± 2124.577  ops/s
   Packed64Benchmark.packed64                          16   10240  thrpt    3  94892.804 ± 1876.298  ops/s
   Packed64Benchmark.packed64                          23   10240  thrpt    3  65401.053 ±  690.302  ops/s
   Packed64Benchmark.packed64                          25   10240  thrpt    3  69623.043 ±  679.354  ops/s
   Packed64Benchmark.packed64                          31   10240  thrpt    3  66732.829 ± 1788.781  ops/s
   Packed64Benchmark.packed64                          32   10240  thrpt    3  99938.841 ±  197.846  ops/s
   Packed64Benchmark.packed64                          47   10240  thrpt    3  52174.874 ± 2369.976  ops/s
   Packed64Benchmark.packed64                          59   10240  thrpt    3  55240.166 ±  165.920  ops/s
   Packed64Benchmark.packed64                          61   10240  thrpt    3  54909.784 ± 5064.028  ops/s
   Packed64Benchmark.packed64                          64   10240  thrpt    3  99261.182 ± 2746.589  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte       1   10240  thrpt    3  56909.631 ± 1986.897  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte       4   10240  thrpt    3  31584.303 ± 1063.071  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte       5   10240  thrpt    3  23703.179 ± 2127.908  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte       8   10240  thrpt    3  17958.177 ±  825.106  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      11   10240  thrpt    3  17810.271 ±  379.489  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      16   10240  thrpt    3  19043.959 ± 2185.749  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      23   10240  thrpt    3  17784.221 ±  824.539  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      25   10240  thrpt    3  17728.510 ± 1063.605  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      31   10240  thrpt    3  17742.034 ±  198.159  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      32   10240  thrpt    3  21951.514 ± 7648.112  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      47   10240  thrpt    3  17741.987 ±  866.598  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      59   10240  thrpt    3  19481.548 ±  446.086  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      61   10240  thrpt    3  18999.020 ±  949.164  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      64   10240  thrpt    3  74178.395 ± 1219.876  ops/s
   Packed64Benchmark.packed64VarHandleLongLong          1   10240  thrpt    3  73824.221 ± 3210.970  ops/s
   Packed64Benchmark.packed64VarHandleLongLong          4   10240  thrpt    3  82908.298 ± 8099.224  ops/s
   Packed64Benchmark.packed64VarHandleLongLong          5   10240  thrpt    3  54176.175 ± 2005.872  ops/s
   Packed64Benchmark.packed64VarHandleLongLong          8   10240  thrpt    3  83556.558 ± 1658.929  ops/s
   Packed64Benchmark.packed64VarHandleLongLong         11   10240  thrpt    3  56512.522 ±  380.595  ops/s
   ```
[GitHub] [lucene] jpountz commented on a change in pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
jpountz commented on a change in pull request #443:
URL: https://github.com/apache/lucene/pull/443#discussion_r752375283

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FastTaxonomyFacetCounts.java

Review comment:
   @gsmiller Maybe we should recompute `useBinaryDv` on every `MatchingDocs`
   instead of pre-computing it ahead of the loop based on the first leaf of
   the index? It's probably a corner case for taxonomy-based faceting though,
   since we expect the main index and the taxonomy index to have been built
   side-by-side, so we wouldn't expect users to bring in an index that has
   been built on the side with a different creation version and plug it in
   through a MultiReader?
[jira] [Commented] (LUCENE-10242) The TopScoreDocCollector::createSharedManager should use ScoreDoc instead of FieldDoc
[ https://issues.apache.org/jira/browse/LUCENE-10242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446004 ] ASF subversion and git services commented on LUCENE-10242: -- Commit de4402cf622921141aa90ceca826979c14cf0db7 in lucene's branch refs/heads/branch_9_0 from Andriy Redko [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=de4402c ] LUCENE-10242: The TopScoreDocCollector::createSharedManager should use ScoreDoc instead of FieldDoc (#450) Signed-off-by: Andriy Redko > The TopScoreDocCollector::createSharedManager should use ScoreDoc instead of > FieldDoc > - > > Key: LUCENE-10242 > URL: https://issues.apache.org/jira/browse/LUCENE-10242 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 8.9, 8.10 >Reporter: Andriy Redko >Priority: Minor > Fix For: 9.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > The `TopScoreDocCollector::createSharedManager` should use `ScoreDoc` instead > of `FieldDoc` as the type of the `after` argument. The `FieldDoc` is really needed > only in the case of `TopFieldCollector`, but currently its usage is mandated by > `TopScoreDocCollector` as well. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] magibney commented on a change in pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue
magibney commented on a change in pull request #380: URL: https://github.com/apache/lucene/pull/380#discussion_r752385177

## File path: lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java
##
@@ -169,11 +169,14 @@ public static String getLemmatizerDictionary(String dictionaryFile, ResourceLoad
         builder.append(chars, 0, numRead);
       }
     } while (numRead > 0);
-    dictionary = builder.toString();
-    lemmaDictionaries.put(dictionaryFile, dictionary);
+    String dictionary = builder.toString();
+    InputStream dictionaryInputStream =
+        new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8));
+    dictionaryLemmatizer = new DictionaryLemmatizer(dictionaryInputStream);

Review comment: All `ReaderInputStream` (from Apache commons-io) does is re-encode a `Reader` as an input stream in some specified encoding. It would be relatively straightforward to do this manually, but again it is orthogonal to the change you're looking to introduce. Ideally, opennlp would have a `DictionaryLemmatizer` ctor that accepts a `Reader` directly -- I can't imagine that would be a controversial upstream PR?

>for the sake of simplicity

Yes; I think in its current state this PR would make things better in some respects, even if it leaves some problems unchanged. fwiw, I'm not necessarily inclined to assume a logical motivation behind the current recreation-from-String of `DictionaryLemmatizer` for every call to `.create()`. Even if there _had_ been concerns about thread safety or whatever, that wouldn't explain why the dictionary is stored as a `String` as opposed to a `byte[]`, given that as a consequence every invocation of `.create()` superfluously re-encodes the String as UTF-8 :-/ Ultimately it looks like the heap requirement here is currently ~3x what it could be, given the on-heap String, encoded to a transient monolithic `byte[]`, parsed into a `DictionaryLemmatizer` ... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] costin commented on pull request #453: Investigate whether Packed64 should use a byte[] plus VarHandles instead of a long[]
costin commented on pull request #453: URL: https://github.com/apache/lucene/pull/453#issuecomment-972999769

Adding a bit more info to this ticket regarding the underlying assembly. Going by the dumps below, the p64 method is the shortest, followed by p64lb and p64ll.
```
====================== C1-compiled nmethod ======================
--- Assembly ---
Compiled method (c1)  194  774    3  org.apache.lucene.util.packed.Packed64::set (145 bytes)
 total in heap  [0x01de36e94990,0x01de36e94f20] = 1424
 relocation     [0x01de36e94ae8,0x01de36e94b30] = 72
 main code      [0x01de36e94b40,0x01de36e94dc0] = 640
 stub code      [0x01de36e94dc0,0x01de36e94df8] = 56
 oops           [0x01de36e94df8,0x01de36e94e00] = 8
 metadata       [0x01de36e94e00,0x01de36e94e08] = 8
 scopes data    [0x01de36e94e08,0x01de36e94e50] = 72
 scopes pcs     [0x01de36e94e50,0x01de36e94ef0] = 160
 dependencies   [0x01de36e94ef0,0x01de36e94ef8] = 8
 nul chk table  [0x01de36e94ef8,0x01de36e94f20] = 40

--- Assembly ---
Compiled method (c1)  173  799    3  org.apache.lucene.util.packed.Packed64VHLongAndByte::set (144 bytes)
 total in heap  [0x01c7d511be10,0x01c7d511d720] = 6416
 relocation     [0x01c7d511bf68,0x01c7d511c0c8] = 352
 main code      [0x01c7d511c0e0,0x01c7d511d0e0] = 4096
 stub code      [0x01c7d511d0e0,0x01c7d511d158] = 120
 oops           [0x01c7d511d158,0x01c7d511d190] = 56
 metadata       [0x01c7d511d190,0x01c7d511d1e8] = 88
 scopes data    [0x01c7d511d1e8,0x01c7d511d500] = 792
 scopes pcs     [0x01c7d511d500,0x01c7d511d6e0] = 480
 dependencies   [0x01c7d511d6e0,0x01c7d511d6e8] = 8
 nul chk table  [0x01c7d511d6e8,0x01c7d511d720] = 56

--- Assembly ---
Compiled method (c1)  177  797    3  org.apache.lucene.util.packed.Packed64VHLongLong::set (204 bytes)
 total in heap  [0x018a00893b90,0x018a00897390] = 14336
 relocation     [0x018a00893ce8,0x018a00894060] = 888
 main code      [0x018a00894060,0x018a008965a0] = 9536
 stub code      [0x018a008965a0,0x018a00896698] = 248
 oops           [0x018a00896698,0x018a008966d0] = 56
 metadata       [0x018a008966d0,0x018a00896728] = 88
 scopes data    [0x018a00896728,0x018a00896e70] = 1864
 scopes pcs     [0x018a00896e70,0x018a00897320] = 1200
 dependencies   [0x018a00897320,0x018a00897328] = 8
 nul chk table  [0x018a00897328,0x018a00897390] = 104
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
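For anyone wanting to reproduce dumps like these: HotSpot can print per-method compilation output with its diagnostic flags (full disassembly additionally requires the hsdis plugin to be installed); the runner class name below is just a placeholder.
```
java -XX:+UnlockDiagnosticVMOptions \
     -XX:CompileCommand=print,org.apache.lucene.util.packed.Packed64::set \
     MyBenchmarkRunner
```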
[GitHub] [lucene] gsmiller commented on a change in pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
gsmiller commented on a change in pull request #443: URL: https://github.com/apache/lucene/pull/443#discussion_r752406697

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FastTaxonomyFacetCounts.java
##
@@ -69,31 +69,34 @@ public FastTaxonomyFacetCounts(
   }

   private final void count(List<MatchingDocs> matchingDocs) throws IOException {
+    if (matchingDocs.isEmpty()) {
+      return;
+    }
+
+    boolean useBinaryDv = FacetUtils.usesOlderBinaryOrdinals(matchingDocs.get(0).context.reader());
+
     for (MatchingDocs hits : matchingDocs) {
-      BinaryDocValues dv = hits.context.reader().getBinaryDocValues(indexFieldName);
-      if (dv == null) { // this reader does not have DocValues for the requested category list
+      assert useBinaryDv == FacetUtils.usesOlderBinaryOrdinals(hits.context.reader());

Review comment: Thanks @jpountz. I _think_ I'm effectively doing this now in the latest revision, since `FacetUtil#loadOrdinalValues` does the check for the provided leaf when determining what format to expect, and that is invoked for each `MatchingDocs` instance. So I _think_ we're covered here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10122) Explore using NumericDocValue to store taxonomy parent array
[ https://issues.apache.org/jira/browse/LUCENE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446030 ] Adrien Grand commented on LUCENE-10122: --- It looks like this change is missing from branch_9x? > Explore using NumericDocValue to store taxonomy parent array > > > Key: LUCENE-10122 > URL: https://issues.apache.org/jira/browse/LUCENE-10122 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: 9.0 >Reporter: Haoyu Zhai >Priority: Minor > Time Spent: 5h 10m > Remaining Estimate: 0h > > We currently use the term position of a hardcoded term in a hardcoded field to > represent the parent ordinal of each taxonomy label. That is an old approach that > perhaps dates back to the time when doc values didn't exist. > We probably want to use NumericDocValues instead, given we have spent > quite a lot of effort optimizing them. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a change in pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
jpountz commented on a change in pull request #443: URL: https://github.com/apache/lucene/pull/443#discussion_r752431752

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FastTaxonomyFacetCounts.java
##
@@ -69,31 +69,34 @@ public FastTaxonomyFacetCounts(
   }

   private final void count(List<MatchingDocs> matchingDocs) throws IOException {
+    if (matchingDocs.isEmpty()) {
+      return;
+    }
+
+    boolean useBinaryDv = FacetUtils.usesOlderBinaryOrdinals(matchingDocs.get(0).context.reader());
+
     for (MatchingDocs hits : matchingDocs) {
-      BinaryDocValues dv = hits.context.reader().getBinaryDocValues(indexFieldName);
-      if (dv == null) { // this reader does not have DocValues for the requested category list
+      assert useBinaryDv == FacetUtils.usesOlderBinaryOrdinals(hits.context.reader());

Review comment: Ah, you are right. I got confused because this comment was linked to a previous version of your code that didn't use `FacetUtil#loadOrdinalValues` yet! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10208) Minimum score can decrease in concurrent search
[ https://issues.apache.org/jira/browse/LUCENE-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446041 ] ASF subversion and git services commented on LUCENE-10208: -- Commit 2e5c4bb5a53f8367fb44f8db1bf6454d2d9b8e37 in lucene's branch refs/heads/branch_9x from Jim Ferenczi [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=2e5c4bb ] LUCENE-10208: Ensure that the minimum competitive score does not decrease in concurrent search (#431) Co-authored-by: Adrien Grand > Minimum score can decrease in concurrent search > --- > > Key: LUCENE-10208 > URL: https://issues.apache.org/jira/browse/LUCENE-10208 > Project: Lucene - Core > Issue Type: Bug >Reporter: Jim Ferenczi >Priority: Minor > Fix For: 9.0, 8.11 > > Time Spent: 1h > Remaining Estimate: 0h > > TestLatLonPointDistanceFeatureQuery#testCompareSorting started to fail > sporadically after https://github.com/apache/lucene/pull/331. > The test change added in this PR exposes an existing bug in the top docs > collectors: they re-set the minimum score multiple times per segment when a bulk scorer > is used. > In practice this is not a problem because the local minimum score cannot > decrease. > However, when concurrent search is used, the global minimum score is updated > after the local one, which breaks the assertion. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
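A minimal sketch of the invariant the fix enforces, assuming a collector that pushes its bound through `Scorable#setMinCompetitiveScore`; the field and method names here are illustrative, not the actual patch.
```java
import java.io.IOException;
import org.apache.lucene.search.Scorable;

class MinScoreGuardSketch {
  private float minCompetitiveScore; // bound last sent to the scorer

  // Take the max of the local and global candidates so the bound only rises,
  // even when the global minimum arrives after the local one.
  void updateMinScore(Scorable scorer, float localMin, float globalMin) throws IOException {
    float candidate = Math.max(localMin, globalMin);
    if (candidate > minCompetitiveScore) { // never let the bound decrease
      minCompetitiveScore = candidate;
      scorer.setMinCompetitiveScore(candidate);
    }
  }
}
```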
[GitHub] [lucene] spyk commented on a change in pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue
spyk commented on a change in pull request #380: URL: https://github.com/apache/lucene/pull/380#discussion_r752443863

## File path: lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java
##
@@ -169,11 +169,14 @@ public static String getLemmatizerDictionary(String dictionaryFile, ResourceLoad
         builder.append(chars, 0, numRead);
       }
     } while (numRead > 0);
-    dictionary = builder.toString();
-    lemmaDictionaries.put(dictionaryFile, dictionary);
+    String dictionary = builder.toString();
+    InputStream dictionaryInputStream =
+        new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8));
+    dictionaryLemmatizer = new DictionaryLemmatizer(dictionaryInputStream);

Review comment: Just to clarify: the main issue with the String being cached instead of the DictionaryLemmatizer (aside from any thread-safety concerns) is that the DictionaryLemmatizer parses the dictionary and creates a new HashMap each time `create()` is called. So, in our case, with a 5MB lemmas.txt file used across several fields, it crashed a 64GB cluster with OOM, with the heap containing several hundred DictionaryLemmatizer instances, each with its own ~80MB internal hashmap. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
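A minimal sketch of the caching direction discussed here: parse once per dictionary file and share the resulting `DictionaryLemmatizer`, relying on OpenNLP's `DictionaryLemmatizer(InputStream)` constructor. The cache shape and names are illustrative, not the PR's actual code.
```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import opennlp.tools.lemmatizer.DictionaryLemmatizer;

class LemmatizerCacheSketch {
  // one parsed lemmatizer per dictionary file, shared across create() calls
  private static final Map<String, DictionaryLemmatizer> CACHE = new ConcurrentHashMap<>();

  static DictionaryLemmatizer get(String dictionaryFile, String dictionaryContents) {
    // computeIfAbsent parses the dictionary at most once per file, so repeated
    // create() calls share the same internal hash maps instead of building new ones
    return CACHE.computeIfAbsent(dictionaryFile, k -> {
      try (InputStream in =
          new ByteArrayInputStream(dictionaryContents.getBytes(StandardCharsets.UTF_8))) {
        return new DictionaryLemmatizer(in);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    });
  }
}
```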
[GitHub] [lucene] magibney commented on a change in pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue
magibney commented on a change in pull request #380: URL: https://github.com/apache/lucene/pull/380#discussion_r752448693

## File path: lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java
##
@@ -169,11 +169,14 @@ public static String getLemmatizerDictionary(String dictionaryFile, ResourceLoad
         builder.append(chars, 0, numRead);
       }
     } while (numRead > 0);
-    dictionary = builder.toString();
-    lemmaDictionaries.put(dictionaryFile, dictionary);
+    String dictionary = builder.toString();
+    InputStream dictionaryInputStream =
+        new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8));
+    dictionaryLemmatizer = new DictionaryLemmatizer(dictionaryInputStream);

Review comment: Yes, sorry -- thanks for the clarification. What I mentioned was tangential, narrowly focused on (reading between the lines of) one minor aspect of the current implementation. I didn't mean to imply the String/re-parsing _per se_ was the main issue. The hashmaps are surely more important, as you say. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10208) Minimum score can decrease in concurrent search
[ https://issues.apache.org/jira/browse/LUCENE-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446050 ] ASF subversion and git services commented on LUCENE-10208: -- Commit 0018567e8b16f3ba2f88a912afdc6367bb3ce675 in lucene's branch refs/heads/branch_9_0 from Jim Ferenczi [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=0018567 ] LUCENE-10208: Ensure that the minimum competitive score does not decrease in concurrent search (#431) Co-authored-by: Adrien Grand > Minimum score can decrease in concurrent search > --- > > Key: LUCENE-10208 > URL: https://issues.apache.org/jira/browse/LUCENE-10208 > Project: Lucene - Core > Issue Type: Bug >Reporter: Jim Ferenczi >Priority: Minor > Fix For: 9.0, 8.11 > > Time Spent: 1h > Remaining Estimate: 0h > > TestLatLonPointDistanceFeatureQuery#testCompareSorting started to fail > sporadically after https://github.com/apache/lucene/pull/331. > The test change added in this PR exposes an existing bug in the top docs > collectors: they re-set the minimum score multiple times per segment when a bulk scorer > is used. > In practice this is not a problem because the local minimum score cannot > decrease. > However, when concurrent search is used, the global minimum score is updated > after the local one, which breaks the assertion. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zhaih commented on a change in pull request #442: LUCENE-10122 Use NumericDocValue to store taxonomy parent array
zhaih commented on a change in pull request #442: URL: https://github.com/apache/lucene/pull/442#discussion_r752459454

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##
@@ -130,6 +125,46 @@ private void initParents(IndexReader reader, int first) throws IOException {
       return;
     }

+    if (getMajorVersion(reader) <= 8) {
+      loadParentUsingTermPosition(reader, first);
+      return;
+    }
+
+    for (LeafReaderContext leafContext : reader.leaves()) {
+      int leafDocNum = leafContext.reader().maxDoc();
+      if (leafContext.docBase + leafDocNum <= first) {
+        // skip this leaf if it does not contain new categories
+        continue;
+      }
+      NumericDocValues parentValues =
+          leafContext.reader().getNumericDocValues(Consts.FIELD_PARENT_ORDINAL_NDV);
+      if (parentValues == null) {
+        throw new CorruptIndexException(
+            "Parent data field " + Consts.FIELD_PARENT_ORDINAL_NDV + " not exists",
+            leafContext.reader().toString());
+      }
+
+      for (int doc = Math.max(first - leafContext.docBase, 0); doc < leafDocNum; doc++) {
+        if (parentValues.advanceExact(doc) == false) {
+          throw new CorruptIndexException(
+              "Missing parent data for category " + (doc + leafContext.docBase), reader.toString());
+        }
+        // we're putting an int and converting it back so it should be safe
+        parents[doc + leafContext.docBase] = Math.toIntExact(parentValues.longValue());
+      }
+    }
+  }
+
+  private static int getMajorVersion(IndexReader reader) {
+    return reader.leaves().get(0).reader().getMetaData().getCreatedVersionMajor();

Review comment: Yeah, it is guaranteed to be non-empty, but it might be a good idea to add an assertion (fortunately this is just a private method). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
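The assertion being suggested is a one-liner; a sketch of what the follow-up adds (wording of the message is illustrative):
```java
private static int getMajorVersion(IndexReader reader) {
  // guard the leaves().get(0) below: an empty reader has no leaf to ask
  assert reader.leaves().isEmpty() == false : "empty index has no version metadata";
  return reader.leaves().get(0).reader().getMetaData().getCreatedVersionMajor();
}
```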
[GitHub] [lucene] zhaih opened a new pull request #454: LUCENE-10122 Use NumericDocValue to store taxonomy parent array
zhaih opened a new pull request #454: URL: https://github.com/apache/lucene/pull/454

# Note
PR towards branch_9x, mostly the same as https://github.com/apache/lucene/pull/442, adding one additional assertion to make sure that, when checking the index version, the index is not empty.

# Checklist
Please review the following and check all that apply:
- [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `main` branch.
- [x] I have run `./gradlew check`.
- [ ] I have added tests for my changes.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10122) Explore using NumericDocValue to store taxonomy parent array
[ https://issues.apache.org/jira/browse/LUCENE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446058 ] Haoyu Zhai commented on LUCENE-10122: - Ah, thanks [~jpountz] for the reminder, I forgot that. Here we go: https://github.com/apache/lucene/pull/454 > Explore using NumericDocValue to store taxonomy parent array > > > Key: LUCENE-10122 > URL: https://issues.apache.org/jira/browse/LUCENE-10122 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: 9.0 >Reporter: Haoyu Zhai >Priority: Minor > Time Spent: 5.5h > Remaining Estimate: 0h > > We currently use the term position of a hardcoded term in a hardcoded field to > represent the parent ordinal of each taxonomy label. That is an old approach that > perhaps dates back to the time when doc values didn't exist. > We probably want to use NumericDocValues instead, given we have spent > quite a lot of effort optimizing them. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on pull request #264: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals (instead of custom binary format)
msokolov commented on pull request #264: URL: https://github.com/apache/lucene/pull/264#issuecomment-973094350 > I still suggest we refrain from reviewing this PR in detail oops, I only read that at the end; still and all I don't have much to say; this area of the code is fairly new for me. Looks like a great simplification and the speedups you found are awesome! On a related topic, while I was stumbling around in here I found we have the option to accumulate scores in MatchingDocs which I'd be excited to explore for doing relevance-weighted sums. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical
msokolov commented on a change in pull request #416: URL: https://github.com/apache/lucene/pull/416#discussion_r752476787

## File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##
@@ -56,31 +59,50 @@ public final class HnswGraph extends KnnGraphValues {

   private final int maxConn;
+  private int numLevels; // the current number of levels in the graph
+  private int entryNode; // the current graph entry node on the top level

-  // Each entry lists the top maxConn neighbors of a node. The nodes correspond to vectors added to
-  // HnswBuilder, and the
-  // node values are the ordinals of those vectors.
-  private final List<NeighborArray> graph;
+  // Nodes by level expressed as the level 0's nodes' ordinals.
+  // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+  private final List<int[]> nodesByLevel;
+
+  // graph is a list of graph levels.
+  // Each level is represented as a List<NeighborArray> – nodes' connections on this level.
+  // Each entry in the list has the top maxConn neighbors of a node. The nodes correspond to vectors
+  // added to HnswBuilder, and the node values are the ordinals of those vectors.
+  // Thus, on all levels, neighbors are expressed as the level 0's nodes' ordinals.
+  private final List<List<NeighborArray>> graph;

   // KnnGraphValues iterator members
   private int upto;
   private NeighborArray cur;

-  HnswGraph(int maxConn) {
-    graph = new ArrayList<>();
-    // Typically with diversity criteria we see nodes not fully occupied; average fanout seems to be
-    // about 1/2 maxConn. There is some indexing time penalty for under-allocating, but saves RAM
-    graph.add(new NeighborArray(Math.max(32, maxConn / 4)));
+  HnswGraph(int maxConn, int levelOfFirstNode) {
     this.maxConn = maxConn;
+    this.numLevels = levelOfFirstNode + 1;
+    this.graph = new ArrayList<>(numLevels);
+    this.entryNode = 0;
+    for (int i = 0; i < numLevels; i++) {
+      graph.add(new ArrayList<>());
+      // Typically with diversity criteria we see nodes not fully occupied;
+      // average fanout seems to be about 1/2 maxConn.
+      // There is some indexing time penalty for under-allocating, but saves RAM
+      graph.get(i).add(new NeighborArray(Math.max(32, maxConn / 4)));

Review comment: heh, this makes no sense at all! Git-spelunking, it looks like I added this fancy dancing when adding the diversity criteria, but overlooked `addNode` - which does the majority of the allocation! It could be worth taking a look to see if we can cheaply save some RAM -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical
msokolov commented on a change in pull request #416: URL: https://github.com/apache/lucene/pull/416#discussion_r752477271

## File path: lucene/core/src/java/org/apache/lucene/index/KnnGraphValues.java
##
@@ -32,25 +33,41 @@ protected KnnGraphValues() {}

   /**
-   * Move the pointer to exactly {@code target}, the id of a node in the graph. After this method
+   * Move the pointer to exactly the given {@code level}'s {@code target}. After this method
    * returns, call {@link #nextNeighbor()} to return successive (ordered) connected node ordinals.
    *
-   * @param target must be a valid node in the graph, ie. ≥ 0 and < {@link
+   * @param level level of the graph
+   * @param target ordinal of a node in the graph, must be ≥ 0 and < {@link
    *     VectorValues#size()}.
    */
-  public abstract void seek(int target) throws IOException;
+  public abstract void seek(int level, int target) throws IOException;

   /** Returns the number of nodes in the graph */
   public abstract int size();

   /**
    * Iterates over the neighbor list. It is illegal to call this method after it returns
-   * NO_MORE_DOCS without calling {@link #seek(int)}, which resets the iterator.
+   * NO_MORE_DOCS without calling {@link #seek(int, int)}, which resets the iterator.
    *
    * @return a node ordinal in the graph, or NO_MORE_DOCS if the iteration is complete.
    */
   public abstract int nextNeighbor() throws IOException;

+  /** Returns the number of levels of the graph */
+  public abstract int numLevels() throws IOException;
+
+  /** Returns graph's entry point on the top level */
+  public abstract int entryNode() throws IOException;
+
+  /**
+   * Get all nodes on a given level as node 0th ordinals
+   *
+   * @param level level for which to get all nodes
+   * @return an iterator over nodes where {@code nextDoc} returns a next node ordinal
+   */
+  // TODO: return a more suitable iterator over nodes than DocIdSetIterator

Review comment: I guess we could make one up -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
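Something like the following minimal iterator would do; the name and shape here are made up for illustration, not what the PR ships.
```java
// A bare-bones iterator over a level's node ordinals, avoiding the
// DocIdSetIterator contract (advance(), NO_MORE_DOCS) that doesn't fit here.
final class NodesIterator {
  private final int[] nodes; // the level's node ordinals
  private int upto;

  NodesIterator(int[] nodes) {
    this.nodes = nodes;
  }

  boolean hasNext() {
    return upto < nodes.length;
  }

  int next() {
    return nodes[upto++];
  }
}
```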
[GitHub] [lucene] gsmiller commented on pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
gsmiller commented on pull request #443: URL: https://github.com/apache/lucene/pull/443#issuecomment-973151611 For posterity, I re-ran benchmarks with this most recent change, comparing against `branch_9_0` as a baseline (so it includes all the back-compat checks and such). Results still look very good:
```
Task                        QPS baseline  StdDev   QPS candidate  StdDev   Pct diff          p-value
HighTermDayOfYearSort             127.51  (15.4%)         124.04  (12.1%)     -2.7% ( -26% -  29%)  0.534
TermDTSort                         96.36  (15.3%)          94.54  (15.1%)     -1.9% ( -28% -  33%)  0.695
HighTermMonthSort                 135.11  (13.3%)         133.31  (16.3%)     -1.3% ( -27% -  32%)  0.778
Fuzzy1                             63.28   (8.3%)          62.55   (8.6%)     -1.2% ( -16% -  17%)  0.662
PKLookup                          152.55   (2.1%)         151.14   (1.8%)     -0.9% (  -4% -   3%)  0.139
MedIntervalsOrdered                82.93   (3.3%)          82.62   (2.4%)     -0.4% (  -5% -   5%)  0.681
BrowseDayOfYearSSDVFacets          14.45  (14.5%)          14.40  (14.1%)     -0.3% ( -25% -  33%)  0.945
HighTermTitleBDVSort              106.51  (21.3%)         106.18  (21.6%)     -0.3% ( -35% -  54%)  0.964
MedTerm                          1238.25   (2.3%)        1235.30   (2.5%)     -0.2% (  -4% -   4%)  0.753
LowIntervalsOrdered               112.01   (2.2%)         111.75   (1.9%)     -0.2% (  -4% -   4%)  0.724
Prefix3                           246.62   (2.2%)         246.10   (2.2%)     -0.2% (  -4% -   4%)  0.764
MedSloppyPhrase                    72.18   (2.9%)          72.06   (3.0%)     -0.2% (  -5% -   5%)  0.861
OrHighNotHigh                     530.67   (2.9%)         529.98   (2.1%)     -0.1% (  -5% -   5%)  0.873
OrHighNotLow                      581.58   (3.3%)         581.28   (2.3%)     -0.1% (  -5% -   5%)  0.954
BrowseMonthSSDVFacets              14.32   (4.5%)          14.32   (4.5%)     -0.0% (  -8% -   9%)  0.977
MedPhrase                         200.56   (1.9%)         200.54   (1.9%)     -0.0% (  -3% -   3%)  0.992
OrHighNotMed                      600.02   (2.7%)         599.99   (3.1%)     -0.0% (  -5% -   5%)  0.995
IntNRQ                             56.03   (3.5%)          56.03   (3.5%)      0.0% (  -6% -   7%)  0.995
HighIntervalsOrdered                1.86   (2.2%)           1.86   (2.2%)      0.1% (  -4% -   4%)  0.908
LowSpanNear                        44.66   (1.7%)          44.70   (1.6%)      0.1% (  -3% -   3%)  0.874
HighTerm                          974.66   (3.1%)         975.67   (2.8%)      0.1% (  -5% -   6%)  0.912
Respell                            59.10   (1.8%)          59.17   (1.9%)      0.1% (  -3% -   3%)  0.849
HighPhrase                         30.48   (1.9%)          30.52   (1.8%)      0.1% (  -3% -   3%)  0.819
OrHighLow                         310.16   (2.6%)         310.61   (2.3%)      0.1% (  -4% -   5%)  0.849
LowTerm                          1419.43   (2.7%)        1421.64   (2.5%)      0.2% (  -4% -   5%)  0.851
MedSpanNear                        14.05   (2.3%)          14.08   (2.2%)      0.2% (  -4% -   4%)  0.798
LowPhrase                         218.41   (4.5%)         218.81   (4.7%)      0.2% (  -8% -   9%)  0.899
OrNotHighHigh                     531.27   (2.5%)         532.41   (2.1%)      0.2% (  -4% -   4%)  0.768
LowSloppyPhrase                    23.57   (1.6%)          23.63   (1.8%)      0.2% (  -3% -   3%)  0.649
HighSloppyPhrase                   34.72   (2.0%)          34.83   (2.0%)      0.3% (  -3% -   4%)  0.595
OrNotHighLow                      786.86   (2.0%)         789.56   (1.9%)      0.3% (  -3% -   4%)  0.580
HighSpanNear                       32.06   (2.9%)          32.17   (2.6%)      0.3% (  -4% -   5%)  0.690
AndHighMed                        210.90   (3.2%)         212.04   (3.1%)      0.5% (  -5% -   7%)  0.589
Wildcard                          106.13   (3.0%)         106.83   (3.3%)      0.7% (  -5% -   7%)  0.505
OrHighMed                          60.24   (2.3%)          60.69   (2.4%)      0.7% (  -3% -   5%)  0.318
AndHighHigh                        54.27   (3.3%)          54.70   (3.3%)      0.8% (  -5% -   7%)  0.448
AndHighLow                        514.75   (2.5%)         518.95   (2.3%)      0.8% (  -3% -   5%)  0.284
OrHighHigh                         25.30   (2.5%)          25.51   (2.3%)      0.8% (  -3% -   5%)  0.274
OrNotHighMed                      636.45   (2.8%)         643.57   (2.8%)      1.1% (  -4% -   6%)  0.204
Fuzzy2                             49.98   (7.1%)          50.55
```
[GitHub] [lucene] gsmiller commented on a change in pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
gsmiller commented on a change in pull request #443: URL: https://github.com/apache/lucene/pull/443#discussion_r752525277

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FastTaxonomyFacetCounts.java
##
@@ -69,31 +69,34 @@ public FastTaxonomyFacetCounts(
   }

   private final void count(List<MatchingDocs> matchingDocs) throws IOException {
+    if (matchingDocs.isEmpty()) {
+      return;
+    }
+
+    boolean useBinaryDv = FacetUtils.usesOlderBinaryOrdinals(matchingDocs.get(0).context.reader());
+
     for (MatchingDocs hits : matchingDocs) {
-      BinaryDocValues dv = hits.context.reader().getBinaryDocValues(indexFieldName);
-      if (dv == null) { // this reader does not have DocValues for the requested category list
+      assert useBinaryDv == FacetUtils.usesOlderBinaryOrdinals(hits.context.reader());

Review comment: Yeah, no problem. Thanks for the suggestions! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
gsmiller commented on pull request #443: URL: https://github.com/apache/lucene/pull/443#issuecomment-973173432 OK, I believe this change is good-to-go now. I _think_ I've addressed all the feedback so far and just did another benchmark run to make sure all the performance benefits are still showing up with the back-compat considerations in place, etc. Please let me know if anyone has additional feedback. Thanks again everyone! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on a change in pull request #442: LUCENE-10122 Use NumericDocValue to store taxonomy parent array
gsmiller commented on a change in pull request #442: URL: https://github.com/apache/lucene/pull/442#discussion_r752558188

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
##
@@ -92,15 +93,18 @@
   private final Directory dir;
   private final IndexWriter indexWriter;
   private final boolean useOlderStoredFieldIndex;
+  private final boolean useOlderTermPositionForParentOrdinal;

Review comment: Just merged this into my working branch. Thanks for making that merge so easy! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] costin commented on pull request #453: Investigate whether Packed64 should use a byte[] plus VarHandles instead of a long[]
costin commented on pull request #453: URL: https://github.com/apache/lucene/pull/453#issuecomment-973193877

Thanks to @jpountz's suggestions, I've added two more tests besides the JMH one. `PackedRandomTest` does a brute-force performance test of random locations instead of consecutive ones. Surprisingly, the VHLongByte implementation comes out ahead, followed by Packed64 and finally Packed64VHLongLong - the results on my machine are roughly:
```
1. VHLongAndByte  939100
2. Packed64      1358500
3. VHLongLong    1510400
```
Adrien suggested another, more realistic workload: `PackedMacroTest` does `SortedDocValues` merging, which creates an `OrdinalMap` backed by `PackedInts`; I then recorded the merge times. In all cases I manually updated the PackedInts class to return the desired implementation. Based on this set of results, Packed64 is in first place, followed by VHLongAndByte, with VHLongLong coming in third. I'm attaching the merge times for reference: [p64vhll.log](https://github.com/apache/lucene/files/7565523/p64vhll.log) [p64.log](https://github.com/apache/lucene/files/7565524/p64.log) [p64vhlb.log](https://github.com/apache/lucene/files/7565525/p64vhlb.log) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
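For context, a random-location test of this kind amounts to a loop of roughly the following shape; this is an assumed sketch, not the actual `PackedRandomTest` code.
```java
import org.apache.lucene.util.packed.PackedInts;

class RandomAccessBenchSketch {
  // Time random get()s over a packed structure; the sink keeps the JIT from
  // eliminating the loop as dead code.
  static long timeRandomGets(PackedInts.Reader packed, int[] randomIndices) {
    long start = System.nanoTime();
    long sink = 0;
    for (int i : randomIndices) {
      sink += packed.get(i);
    }
    if (sink == Long.MIN_VALUE) {
      System.out.println(sink); // unlikely branch; prevents dead-code elimination
    }
    return System.nanoTime() - start;
  }
}
```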
[jira] [Created] (LUCENE-10244) Please consider opening MultiCollector::getCollectors for public use
Andriy Redko created LUCENE-10244:
-
Summary: Please consider opening MultiCollector::getCollectors for public use
Key: LUCENE-10244
URL: https://issues.apache.org/jira/browse/LUCENE-10244
Project: Lucene - Core
Issue Type: Improvement
Components: core/search
Affects Versions: 8.10.1
Reporter: Andriy Redko

The usage of *MultiCollector* is sometimes very convenient along with {*}CollectorManager{*} instances (as an alternative to {*}MultiCollectorManager{*}), for example:
{code:java}
class SomeCollectorManager implements CollectorManager {

  @Override
  public Collector newCollector() throws IOException {
    return MultiCollector.wrap(new Collector1(), new Collector2(), ...);
  }

  @Override
  public List reduce(Collection collectors) throws IOException {
    return ...;
  }
}{code}
Unfortunately, the *reduce()* phase lacks access to the *MultiCollector::getCollectors* method, since it is package-private at the moment. Making *MultiCollector::getCollectors* public would make it convenient to use *MultiCollector +* *CollectorManager* and implement the reduce phase by accessing the individual collectors. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
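A self-contained sketch of what the proposal enables, using two `TotalHitCountCollector`s purely for illustration; the cast and the `getCollectors()` call assume the visibility change this issue asks for.
```java
import java.io.IOException;
import java.util.Collection;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.CollectorManager;
import org.apache.lucene.search.MultiCollector;
import org.apache.lucene.search.TotalHitCountCollector;

// One manager produces MultiCollector-wrapped pairs of hit counters, then
// unwraps them in reduce() to combine each side separately.
class PairedHitCountManager implements CollectorManager<Collector, int[]> {
  @Override
  public Collector newCollector() {
    return MultiCollector.wrap(new TotalHitCountCollector(), new TotalHitCountCollector());
  }

  @Override
  public int[] reduce(Collection<Collector> collectors) throws IOException {
    int first = 0, second = 0;
    for (Collector c : collectors) {
      Collector[] subs = ((MultiCollector) c).getCollectors(); // requires the proposed change
      first += ((TotalHitCountCollector) subs[0]).getTotalHits();
      second += ((TotalHitCountCollector) subs[1]).getTotalHits();
    }
    return new int[] {first, second};
  }
}
```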
[GitHub] [lucene] costin edited a comment on pull request #453: Investigate whether Packed64 should use a byte[] plus VarHandles instead of a long[]
costin edited a comment on pull request #453: URL: https://github.com/apache/lucene/pull/453#issuecomment-973193877

Thanks to @jpountz's suggestions, I've added two more tests besides the JMH one. `PackedRandomTest` does a brute-force performance test of random locations instead of consecutive ones. Surprisingly, the VHLongByte implementation comes out ahead, followed by Packed64 and finally Packed64VHLongLong - the results on my machine are roughly:
```
1. VHLongAndByte  939100
2. Packed64      1358500
3. VHLongLong    1510400
```
Adrien suggested another, more realistic workload: `PackedMacroTest` does `SortedDocValues` merging, which creates an `OrdinalMap` backed by `PackedInts`; I then recorded the merge times. In all cases I manually updated the PackedInts class to return the desired implementation. Based on this set of results, Packed64 is in first place, followed by VHLongAndByte, with VHLongLong coming in third. I'm attaching the merge times for reference: [p64vhll.log](https://github.com/apache/lucene/files/7565523/p64vhll.log) [p64.log](https://github.com/apache/lucene/files/7565524/p64.log) [p64vhlb.log](https://github.com/apache/lucene/files/7565525/p64vhlb.log) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10244) Please consider opening MultiCollector::getCollectors for public use
[ https://issues.apache.org/jira/browse/LUCENE-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446118 ] Andriy Redko commented on LUCENE-10244: --- https://github.com/apache/lucene/pull/455

> Please consider opening MultiCollector::getCollectors for public use
> 
>
>                 Key: LUCENE-10244
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10244
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: 8.10.1
>           Reporter: Andriy Redko
>           Priority: Minor
>
> The usage of *MultiCollector* is sometimes very convenient along with {*}CollectorManager{*} instances (as an alternative to {*}MultiCollectorManager{*}), for example:
> {code:java}
> class SomeCollectorManager implements CollectorManager {
>
>   @Override
>   public Collector newCollector() throws IOException {
>     return MultiCollector.wrap(new Collector1(), new Collector2(), ...);
>   }
>
>   @Override
>   public List reduce(Collection collectors) throws IOException {
>     return ...;
>   }
> }{code}
> Unfortunately, the *reduce()* phase lacks access to the *MultiCollector::getCollectors* method, since it is package-private at the moment. Making *MultiCollector::getCollectors* public would make it convenient to use *MultiCollector +* *CollectorManager* and implement the reduce phase by accessing the individual collectors.

-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] reta opened a new pull request #455: LUCENE-10244: Please consider opening MultiCollector::getCollectors for public use
reta opened a new pull request #455: URL: https://github.com/apache/lucene/pull/455 Signed-off-by: Andriy Redko

# Description
The usage of `MultiCollector` is sometimes very convenient along with `CollectorManager` instances (as an alternative to `MultiCollectorManager`), for example:
```
class SomeCollectorManager implements CollectorManager {

  @Override
  public Collector newCollector() throws IOException {
    return MultiCollector.wrap(new Collector1(), new Collector2(), ...);
  }

  @Override
  public List reduce(Collection collectors) throws IOException {
    return ...;
  }
}
```
Unfortunately, the reduce() phase lacks access to the `MultiCollector::getCollectors` method, since it is package-private at the moment. Making `MultiCollector::getCollectors` public would make it convenient to use `MultiCollector` + `CollectorManager` and implement the reduce phase by accessing the individual collectors.

# Solution
Make `MultiCollector::getCollectors` public.

# Tests
A test illustrating the usage of `MultiCollector` + `CollectorManager` has been added.

# Checklist
Please review the following and check all that apply:
- [X] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [X] I have created a Jira issue and added the issue ID to my pull request title.
- [X] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [X] I have developed this patch against the `main` branch.
- [X] I have run `./gradlew check`.
- [X] I have added tests for my changes.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] reta commented on pull request #455: LUCENE-10244: Please consider opening MultiCollector::getCollectors for public use
reta commented on pull request #455: URL: https://github.com/apache/lucene/pull/455#issuecomment-973198716 @jpountz if you don't mind taking a look, really appreciate it, thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mayya-sharipova edited a comment on pull request #416: LUCENE-10054 Make HnswGraph hierarchical
mayya-sharipova edited a comment on pull request #416: URL: https://github.com/apache/lucene/pull/416#issuecomment-972170715

@jtibshirani Thanks for your extensive review. I've tried to address some comments, and some comments I think we still need to discuss.

> does it have an impact on indexing speed?

I've run another benchmarking experiment with `KnnGraphTester`, and as can be seen below there seems to be a small impact on indexing speed: a 6% increase in indexing time.

Comparison of glove-100; baseline: main branch; candidate: this PR

| Baseline | Candidate |
| --- | --- |
| built 118 in 23061/2280885 ms | built 118 in 23113/2295566 ms |
| flushed: segment=_1 ramUsed=474.061 MB newFlushedSize=513.284 MB docs/MB=2,305.768 | flushed: segment=_0 ramUsed=474.061 MB newFlushedSize=517.153 MB docs/MB=2,288.517 |

| | baseline recall | baseline QPS | candidate recall | candidate QPS |
| --- | --: | ---: | ---: | ---: |
| n_cands=10 | 0.482 | 4199.190 | 0.496 | 3982.133 |
| n_cands=20 | 0.553 | 3563.974 | 0.561 | 3412.251 |
| n_cands=40 | 0.631 | 2713.881 | 0.635 | 2455.465 |
| n_cands=80 | 0.707 | 1902.787 | 0.709 | 1863.012 |
| n_cands=120 | 0.748 | 1497.979 | 0.747 | 1400.822 |
| n_cands=200 | 0.790 | 1031.353 | 0.791 | 1018.244 |
| n_cands=400 | 0.840 | 577.106 | 0.841 | 600.762 |
| n_cands=600 | 0.865 | 417.961 | 0.865 | 436.488 |
| n_cands=800 | 0.881 | 319.900 | 0.881 | 334.503 |

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] thelabdude opened a new pull request #2613: SOLR-15774: Avoid weird off-by-one errors with Angular's 'chosen' select box directive for the security and schema-designer screens i
thelabdude opened a new pull request #2613: URL: https://github.com/apache/lucene-solr/pull/2613 backport of https://github.com/apache/solr/pull/404 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10243) increase unicode versions of tokenizers to unicode 12.1
[ https://issues.apache.org/jira/browse/LUCENE-10243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446161 ] Robert Muir commented on LUCENE-10243: -- I tried this out on top of the LUCENE-10239 branch, and also ran perl scripts to regenerate new Unicode 12.1 tests. But some conformance tests fail. I haven't debugged it yet; my guess is there were some changes to UAX#29 between 9 and 12.1. I will look into it. > increase unicode versions of tokenizers to unicode 12.1 > --- > > Key: LUCENE-10243 > URL: https://issues.apache.org/jira/browse/LUCENE-10243 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > > Followup from LUCENE-10239 > Bump the Unicode version of these tokenizers from Unicode 9 to 12.1, which is > the most recent supported by the jflex release. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] thelabdude merged pull request #2613: SOLR-15774: Avoid weird off-by-one errors with Angular's 'chosen' select box directive for the security and schema-designer screens in Admi
thelabdude merged pull request #2613: URL: https://github.com/apache/lucene-solr/pull/2613 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #452: LUCENE-10239: upgrade jflex (1.7.0 -> 1.8.2)
rmuir commented on pull request #452: URL: https://github.com/apache/lucene/pull/452#issuecomment-973323310 For convenience of reviewing: here is the diff between default skeleton and "buffer-expansion-disabled" skeleton. It is kinda the only way to review it since we brought in all the upstream changes.
```
think:lucene[LUCENE-10239]$ diff -u gradle/generation/jflex/skeleton.default.txt gradle/generation/jflex/skeleton.disable.buffer.expansion.txt
--- gradle/generation/jflex/skeleton.default.txt	2021-11-17 19:04:46.844620167 -0500
+++ gradle/generation/jflex/skeleton.disable.buffer.expansion.txt	2021-11-17 19:05:01.124853267 -0500
@@ -5,7 +5,7 @@
 
   /** Initial size of the lookahead buffer. */
--- private static final int ZZ_BUFFERSIZE = ...;
 
-  /** Lexical states. */
+  /** Lexical States. */
--- lexical states, charmap
 
  /** Error code for "Unknown internal scanner error". */
@@ -94,18 +94,11 @@
       zzStartRead = 0;
     }
 
-    /* is the buffer big enough? */
-    if (zzCurrentPos >= zzBuffer.length - zzFinalHighSurrogate) {
-      /* if not: blow it up */
-      char newBuffer[] = new char[zzBuffer.length * 2];
-      System.arraycopy(zzBuffer, 0, newBuffer, 0, zzBuffer.length);
-      zzBuffer = newBuffer;
-      zzEndRead += zzFinalHighSurrogate;
-      zzFinalHighSurrogate = 0;
-    }
-
     /* fill the buffer with new input */
-    int requested = zzBuffer.length - zzEndRead;
+    int requested = zzBuffer.length - zzEndRead - zzFinalHighSurrogate;
+    if (requested == 0) {
+      return true;
+    }
     int numRead = zzReader.read(zzBuffer, zzEndRead, requested);
 
     /* not supposed to occur according to specification of java.io.Reader */
@@ -119,6 +112,9 @@
       if (numRead == requested) { // We requested too few chars to encode a full Unicode character
         --zzEndRead;
         zzFinalHighSurrogate = 1;
+        if (numRead == 1) {
+          return true;
+        }
       } else {    // There is room in the buffer for at least one more char
         int c = zzReader.read();  // Expecting to read a paired low surrogate char
         if (c == -1) {
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] thelabdude opened a new pull request #2614: SOLR-15774: Avoid weird off-by-one errors with Angular's 'chosen' select box directive for the security and schema-designer screens i
thelabdude opened a new pull request #2614: URL: https://github.com/apache/lucene-solr/pull/2614 backport of https://github.com/apache/lucene-solr/pull/2613 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] thelabdude merged pull request #2614: SOLR-15774: Avoid weird off-by-one errors with Angular's 'chosen' select box directive for the security and schema-designer screens in Admi
thelabdude merged pull request #2614: URL: https://github.com/apache/lucene-solr/pull/2614 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] LifeIsStrange commented on pull request #2012: SOLR-14935: Solr can forward request ( remoteQuery ) even if there are local cores present
LifeIsStrange commented on pull request #2012: URL: https://github.com/apache/lucene-solr/pull/2012#issuecomment-973596388 @vthacker any update? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zacharymorn merged pull request #444: LUCENE-10236: Updated field-weight used in CombinedFieldQuery scoring calculation, and added a test
zacharymorn merged pull request #444: URL: https://github.com/apache/lucene/pull/444 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10236) CombinedFieldsQuery to use fieldAndWeights.values() when constructing MultiNormsLeafSimScorer for scoring
[ https://issues.apache.org/jira/browse/LUCENE-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446291#comment-17446291 ]

ASF subversion and git services commented on LUCENE-10236:
--

Commit 07ee3ba83a4c9f3abc24bf9d3fbb3c3102c4a102 in lucene's branch refs/heads/main from zacharymorn
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=07ee3ba ]

LUCENE-10236: Update field-weight used in CombinedFieldQuery scoring calculation (#444)

> CombinedFieldsQuery to use fieldAndWeights.values() when constructing
> MultiNormsLeafSimScorer for scoring
> -
>
> Key: LUCENE-10236
> URL: https://issues.apache.org/jira/browse/LUCENE-10236
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/sandbox
> Reporter: Zach Chen
> Assignee: Zach Chen
> Priority: Minor
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> This is a spin-off issue from discussion in
> [https://github.com/apache/lucene/pull/418#issuecomment-967790816], for a
> quick fix in CombinedFieldsQuery scoring.
> Currently CombinedFieldsQuery would use a constructed
> [fields|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L420-L421]
> object to create a MultiNormsLeafSimScorer for scoring, but the fields
> object may contain duplicated field-weight pairs, as it is [built by looping
> over
> fieldTerms|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L404-L414],
> resulting in duplicated norms being added during the scoring calculation in
> MultiNormsLeafSimScorer.
> E.g. for a CombinedFieldsQuery with two fields and two values matching a
> particular doc:
> {code:java}
> CombinedFieldQuery query =
>     new CombinedFieldQuery.Builder()
>         .addField("field1", (float) 1.0)
>         .addField("field2", (float) 1.0)
>         .addTerm(new BytesRef("foo"))
>         .addTerm(new BytesRef("zoo"))
>         .build(); {code}
> I would expect the scoring to be based on the following:
> # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) + freq(field2:zoo)
> # Sum of norms on doc = norm(field1) + norm(field2)
> but the current logic uses the following for scoring:
> # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) + freq(field2:zoo)
> # Sum of norms on doc = norm(field1) + norm(field2) + norm(field1) + norm(field2)
>
> In addition, this differs from how MultiNormsLeafSimScorer is constructed
> in CombinedFieldsQuery's explain function, which [uses
> fieldAndWeights.values()|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L387-L389]
> and does not contain duplicated field-weight pairs.

-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
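To see the double-counting concretely, here is a small self-contained sketch; every name in it (NormDedupSketch, FieldAndWeight, sumNorms) is hypothetical and merely stands in for the CombinedFieldQuery internals described above, contrasting the per-term field list with a deduplicated per-field view such as fieldAndWeights.values() provides.
```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Two terms over the same two fields yield four (field, weight) pairs, so
// each field's norm is summed twice; deduplicating by field name restores
// the intended norm(field1) + norm(field2).
public class NormDedupSketch {
  record FieldAndWeight(String field, float weight) {}

  static float sumNorms(List<FieldAndWeight> fields, Map<String, Float> norms) {
    float sum = 0f;
    for (FieldAndWeight fw : fields) {
      sum += fw.weight() * norms.get(fw.field());
    }
    return sum;
  }

  public static void main(String[] args) {
    Map<String, Float> norms = Map.of("field1", 0.5f, "field2", 0.25f);

    // One entry per (term, field) combination, as the term loop builds it.
    List<FieldAndWeight> perTerm =
        List.of(
            new FieldAndWeight("field1", 1f), new FieldAndWeight("field2", 1f), // term "foo"
            new FieldAndWeight("field1", 1f), new FieldAndWeight("field2", 1f)); // term "zoo"

    // One entry per field, as fieldAndWeights.values() would yield.
    Map<String, FieldAndWeight> deduped = new LinkedHashMap<>();
    for (FieldAndWeight fw : perTerm) {
      deduped.putIfAbsent(fw.field(), fw);
    }

    System.out.println(sumNorms(perTerm, norms)); // 1.5: norms double-counted
    System.out.println(sumNorms(List.copyOf(deduped.values()), norms)); // 0.75: intended sum
  }
}
```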
[GitHub] [lucene] zacharymorn commented on pull request #444: LUCENE-10236: Updated field-weight used in CombinedFieldQuery scoring calculation, and added a test
zacharymorn commented on pull request #444: URL: https://github.com/apache/lucene/pull/444#issuecomment-973796460 Will backport this to 8.11 and 8x. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] iverase merged pull request #7: LUCENE-9820: Separate logic for reading the BKD index from logic to intersecting it
iverase merged pull request #7: URL: https://github.com/apache/lucene/pull/7 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9820) Separate logic for reading the BKD index from logic to intersecting it.
[ https://issues.apache.org/jira/browse/LUCENE-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446335#comment-17446335 ] ASF subversion and git services commented on LUCENE-9820: - Commit ad911df2605591a63ca58d237182f84aa8384325 in lucene's branch refs/heads/main from Ignacio Vera [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ad911df ] LUCENE-9820: Separate logic for reading the BKD index from logic to intersecting it (#7) Extract BKD tree interface and move intersecting logic to the PointValues abstract class. > Separate logic for reading the BKD index from logic to intersecting it. > --- > > Key: LUCENE-9820 > URL: https://issues.apache.org/jira/browse/LUCENE-9820 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Ignacio Vera >Priority: Major > Time Spent: 10h 40m > Remaining Estimate: 0h > > Currently the class BKDReader contains all the logic for traversing the KD > tree and the logic to read the actual index. This makes it difficult to develop > new visiting strategies, for example LUCENE-9619, where it is proposed to > move Points from a visitor API to a cursor-style API. > The first step is to isolate the logic that reads the index from the logic that > visits the tree. Another benefit of doing this is that it will help > evolve the index, for example moving the current index format to a backwards > codec without moving the visiting logic. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
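For context on what the extracted interface looks like in practice: after this change, PointValues exposes a cursor-style PointTree (moveToChild/moveToSibling/moveToParent) that callers can walk directly. The sketch below is illustrative only, written against my reading of the 9.x API, and is not code from the commit.
```java
import java.io.IOException;
import org.apache.lucene.index.PointValues;

// Recursively visits every node of a PointTree with the cursor-style API
// and sums the points below the leaves; size() reports the number of
// points under the current node, so each leaf is counted exactly once.
final class PointTreeWalk {
  static long countPoints(PointValues.PointTree tree) throws IOException {
    if (tree.moveToChild() == false) {
      return tree.size(); // leaf: no children to descend into
    }
    long count = 0;
    do {
      count += countPoints(tree);
    } while (tree.moveToSibling());
    tree.moveToParent(); // restore the cursor for the caller
    return count;
  }
}
```
A caller would obtain the tree from a leaf reader, e.g. `countPoints(reader.getPointValues("field").getPointTree())`.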
[GitHub] [lucene] iverase opened a new pull request #457: [Backport] LUCENE-9820: Separate logic for reading the BKD index from logic to intersecting it
iverase opened a new pull request #457: URL: https://github.com/apache/lucene/pull/457 backport of #7. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-9820) Separate logic for reading the BKD index from logic to intersecting it.
[ https://issues.apache.org/jira/browse/LUCENE-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ignacio Vera resolved LUCENE-9820. -- Fix Version/s: 9.1 Assignee: Ignacio Vera Resolution: Fixed > Separate logic for reading the BKD index from logic to intersecting it. > --- > > Key: LUCENE-9820 > URL: https://issues.apache.org/jira/browse/LUCENE-9820 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Ignacio Vera >Assignee: Ignacio Vera >Priority: Major > Fix For: 9.1 > > Time Spent: 10h 50m > Remaining Estimate: 0h > > Currently the class BKDReader contains all the logic for traversing the KD > tree and the logic to read the actual index. This makes it difficult to develop > new visiting strategies, for example LUCENE-9619, where it is proposed to > move Points from a visitor API to a cursor-style API. > The first step is to isolate the logic that reads the index from the logic that > visits the tree. Another benefit of doing this is that it will help > evolve the index, for example moving the current index format to a backwards > codec without moving the visiting logic. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] iverase merged pull request #457: [Backport] LUCENE-9820: Separate logic for reading the BKD index from logic to intersecting it
iverase merged pull request #457: URL: https://github.com/apache/lucene/pull/457 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9820) Separate logic for reading the BKD index from logic to intersecting it.
[ https://issues.apache.org/jira/browse/LUCENE-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446338#comment-17446338 ] ASF subversion and git services commented on LUCENE-9820: - Commit 9adf7e27f97b76751429c3bc83eaf13006c07982 in lucene's branch refs/heads/branch_9x from Ignacio Vera [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=9adf7e2 ] LUCENE-9820: Separate logic for reading the BKD index from logic to intersecting it (#7) (#457) Extract BKD tree interface and move intersecting logic to the PointValues abstract class. > Separate logic for reading the BKD index from logic to intersecting it. > --- > > Key: LUCENE-9820 > URL: https://issues.apache.org/jira/browse/LUCENE-9820 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Ignacio Vera >Assignee: Ignacio Vera >Priority: Major > Fix For: 9.1 > > Time Spent: 10h 50m > Remaining Estimate: 0h > > Currently the class BKDReader contains all the logic for traversing the KD > tree and the logic to read the actual index. This makes it difficult to develop > new visiting strategies, for example LUCENE-9619, where it is proposed to > move Points from a visitor API to a cursor-style API. > The first step is to isolate the logic that reads the index from the logic that > visits the tree. Another benefit of doing this is that it will help > evolve the index, for example moving the current index format to a backwards > codec without moving the visiting logic. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zacharymorn commented on a change in pull request #418: LUCENE-10061: Implements dynamic pruning support for CombinedFieldsQuery
zacharymorn commented on a change in pull request #418:
URL: https://github.com/apache/lucene/pull/418#discussion_r752930339

## File path: lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java

## @@ -441,6 +491,273 @@ public boolean isCacheable(LeafReaderContext ctx) {
     }
   }

+  /** Merge impacts for combined field. */
+  static ImpactsSource mergeImpacts(
+      Map<String, List<ImpactsEnum>> fieldsWithImpactsEnums,
+      Map<String, List<Impacts>> fieldsWithImpacts,
+      Map<String, Float> fieldWeights) {
+    return new ImpactsSource() {
+
+      class SubIterator {
+        final Iterator<Impact> iterator;
+        int previousFreq;
+        Impact current;
+
+        SubIterator(Iterator<Impact> iterator) {
+          this.iterator = iterator;
+          this.current = iterator.next();
+        }
+
+        void next() {
+          previousFreq = current.freq;
+          if (iterator.hasNext() == false) {
+            current = null;
+          } else {
+            current = iterator.next();
+          }
+        }
+      }
+
+      @Override
+      public Impacts getImpacts() throws IOException {
+        // Use the impacts that have the lower next boundary (doc id in skip entry) as a lead for
+        // each field
+        // They collectively will decide on the number of levels and the block boundaries.
+        Map<String, Impacts> leadingImpactsPerField = new HashMap<>(fieldsWithImpactsEnums.size());
+
+        for (Map.Entry<String, List<ImpactsEnum>> fieldImpacts :
+            fieldsWithImpactsEnums.entrySet()) {
+          String field = fieldImpacts.getKey();
+          List<ImpactsEnum> impactsEnums = fieldImpacts.getValue();
+          fieldsWithImpacts.put(field, new ArrayList<>(impactsEnums.size()));
+
+          Impacts tmpLead = null;
+          // find the impact that has the lowest next boundary for this field
+          for (int i = 0; i < impactsEnums.size(); ++i) {
+            Impacts impacts = impactsEnums.get(i).getImpacts();
+            fieldsWithImpacts.get(field).add(impacts);
+
+            if (tmpLead == null || impacts.getDocIdUpTo(0) < tmpLead.getDocIdUpTo(0)) {
+              tmpLead = impacts;
+            }
+          }
+
+          leadingImpactsPerField.put(field, tmpLead);
+        }
+
+        return new Impacts() {
+
+          @Override
+          public int numLevels() {
+            // max of levels across fields' impactEnums
+            int result = 0;
+
+            for (Impacts impacts : leadingImpactsPerField.values()) {
+              result = Math.max(result, impacts.numLevels());
+            }
+
+            return result;
+          }
+
+          @Override
+          public int getDocIdUpTo(int level) {
+            // min of docIdUpTo across fields' impactEnums
+            int result = Integer.MAX_VALUE;
+
+            for (Impacts impacts : leadingImpactsPerField.values()) {
+              if (impacts.numLevels() > level) {
+                result = Math.min(result, impacts.getDocIdUpTo(level));
+              }
+            }
+
+            return result;
+          }

Review comment:
   Hi @jpountz, I just noticed a place in the implementation that was called multiple times, and so tried to "cache" the result to save the repeated computation: https://github.com/apache/lucene/pull/418/commits/6d5e7808fd847d5849e7c4ff3b552aeb9c196bbb
It turned out the improvement was quite obvious. Results from `python3 src/python/localrun.py -source combinedFieldsBig`:

Run 1
```
Task           QPS baseline   StdDev   QPS my_modified_version   StdDev    Pct diff                p-value
CFQHighHigh          3.17     (3.5%)          2.70               (4.6%)    -14.8% ( -22% -   -6%)  0.000
PKLookup           117.15     (6.7%)        116.31              (10.3%)     -0.7% ( -16% -   17%)  0.792
CFQHighMed          12.98     (5.2%)         18.26              (13.9%)     40.7% (  20% -   63%)  0.000
CFQHighLow          17.30     (3.6%)         36.19              (20.1%)    109.2% (  82% -  137%)  0.000
```
Run 2
```
Task           QPS baseline   StdDev   QPS my_modified_version   StdDev    Pct diff                p-value
PKLookup           104.98     (6.9%)        115.76               (8.5%)     10.3% (  -4% -   27%)  0.000
CFQHighMed           5.73     (3.9%)          6.53               (8.4%)     14.0% (   1% -   27%)  0.000
CFQHighLow          17.87     (3.4%)         20.37               (8.1%)     14.0% (   2% -   26%)  0.000
CFQHighHigh          5.41     (4.2%)          6.53               (8.9%)     20.7% (   7% -   35%)  0.000
```
Run 3
```
Task           QPS baseline   StdDev   QPS my_modified_version   StdDev    Pct diff                p-value
CFQHighHigh          3.18     (3.6%)          2.66               (5.3%)    -16.4% ( -24% -   -7%)  0.000
CFQHighMed           5.75     (4.0%)          5.07               (5.5%)    -11.7% ( -20% -   -2%)  0.000
PKLookup           136.87     (5.9%)        133.97               (9.4%)     -2.1% ( -16% -   14%)  0.393
CFQHighLow          23.19     (2.9%)         38.74              (16.0%)     67
```
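To illustrate the kind of caching the comment refers to (a sketch under assumptions, not the linked commit itself): numLevels() and getDocIdUpTo(level) in the snippet above rescan leadingImpactsPerField on every call, so the per-level minima can be computed once per getImpacts() and then served from an array. The class name ImpactsBoundaryCache is invented here for illustration.
```java
import java.util.Arrays;
import java.util.Collection;
import org.apache.lucene.index.Impacts;

// Sketch: fold the repeated max/min scans into one pass, so later calls to
// numLevels()/getDocIdUpTo(level) become plain array lookups.
final class ImpactsBoundaryCache {
  private final int[] docIdUpTo; // docIdUpTo[level] = min boundary across fields

  ImpactsBoundaryCache(Collection<Impacts> leadingImpactsPerField) {
    int numLevels = 0;
    for (Impacts impacts : leadingImpactsPerField) {
      numLevels = Math.max(numLevels, impacts.numLevels());
    }
    docIdUpTo = new int[numLevels];
    Arrays.fill(docIdUpTo, Integer.MAX_VALUE);
    for (Impacts impacts : leadingImpactsPerField) {
      // Only fields that still have this level contribute, matching the
      // numLevels() > level guard in the original loop.
      for (int level = 0; level < impacts.numLevels(); level++) {
        docIdUpTo[level] = Math.min(docIdUpTo[level], impacts.getDocIdUpTo(level));
      }
    }
  }

  int numLevels() {
    return docIdUpTo.length;
  }

  int getDocIdUpTo(int level) {
    return docIdUpTo[level];
  }
}
```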