[jira] [Updated] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Guo updated LUCENE-10233:
------------------------------
    Attachment: SparseFixedBitSet.png

> Store docIds as bitset when leafCardinality = 1 to speed up addAll
> ------------------------------------------------------------------
>
>                 Key: LUCENE-10233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10233
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Feng Guo
>            Priority: Major
>         Attachments: SparseFixedBitSet.png
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> In low-cardinality points cases, id blocks will usually store doc ids that
> have the same point value, and {{intersect}} will get into the {{addAll}} logic.
> If we store ids as a bitset and give the IntersectVisitor bulk-visiting
> ability, we can speed up addAll because we can just execute the 'or' logic
> between the result and the block ids.
> The optimization is triggered when the following conditions are met at the
> same time:
> # leafCardinality = 1
> # max(docId) - min(docId) <= 16 * pointCount (in order to avoid expanding
> storage too much)
> # no duplicate doc ids
> I mocked a field that has 10,000,000 docs per value and searched it with a
> one-term PointInSetQuery; the build-scorer time decreased from 71ms to 8ms.
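The 'or' idea above can be pictured with a short sketch. This is illustrative only, not the actual patch: the method name, the 64-bit alignment assumption, and the way the leaf exposes its stored words are all simplifications.

```java
import org.apache.lucene.util.FixedBitSet;

// Illustrative sketch of bulk-visiting a bitset-encoded leaf (not the actual
// patch): instead of calling visit(docID) once per document, OR the leaf's
// stored words straight into the query's result bitset.
public final class BitsetLeafSketch {
  /**
   * @param result   accumulating bitset over the whole segment
   * @param leafBits words of the leaf's stored bitset, covering doc IDs
   *                 [minDocId, minDocId + 64 * leafBits.length)
   * @param minDocId assumed to be a multiple of 64 to keep the sketch simple;
   *                 a real encoding would handle arbitrary offsets
   */
  public static void orLeafIntoResult(FixedBitSet result, long[] leafBits, int minDocId) {
    long[] resultBits = result.getBits();
    int wordOffset = minDocId >>> 6; // first result word that the leaf touches
    for (int i = 0; i < leafBits.length; i++) {
      resultBits[wordOffset + i] |= leafBits[i];
    }
  }
}
```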
[jira] [Commented] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445792#comment-17445792 ]

Feng Guo commented on LUCENE-10233:
-----------------------------------
Hi [~jpountz], would you like to give me some suggestions on how to speed up
SparseFixedBitSet? Here are some of my current thoughts:

This is the CPU profile of SparseFixedBitSet: [^SparseFixedBitSet.png]. Most of
the CPU is spent constructing SparseFixedBitSet; I think this is because we
always allocate new big arrays there. Constructing a single global
SparseFixedBitSet and reusing it for each block may help, but a global
SparseFixedBitSet needs the segment's maxDoc, and maxDoc is not available in
the BKDReader. I'm not sure it is worth changing the BKDReader constructor
signature for this, since BKDReader is a public class.
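A minimal sketch of the reuse idea under discussion, assuming maxDoc were somehow available to the reader (which, as the comment notes, it currently is not); clearing only the range the previous block touched keeps the reset cheap.

```java
import org.apache.lucene.util.SparseFixedBitSet;

// Sketch only: allocate the bitset once and clear just the touched range
// between blocks, instead of constructing a new SparseFixedBitSet per block.
// Assumes maxDoc is available, which is exactly the open question above.
final class ReusableBitSetSketch {
  private final SparseFixedBitSet scratch;

  ReusableBitSetSketch(int maxDoc) {
    this.scratch = new SparseFixedBitSet(maxDoc);
  }

  SparseFixedBitSet forBlock(int prevMinDoc, int prevMaxDoc) {
    scratch.clear(prevMinDoc, prevMaxDoc + 1); // reset only what the last block set
    return scratch;
  }
}
```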
[GitHub] [lucene] iverase commented on a change in pull request #7: LUCENE-9820: Separate logic for reading the BKD index from logic to intersecting it
iverase commented on a change in pull request #7:
URL: https://github.com/apache/lucene/pull/7#discussion_r752128672

## File path: lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java

@@ -328,31 +288,97 @@ public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
             // Not deleted!
             docID = mappedDocID;
             System.arraycopy(
-                packedValues,
-                index * bkd.config.packedBytesLength,
-                state.scratchDataPackedValue,
+                mergeIntersectsVisitor.packedValues,
+                index * packedBytesLength,
+                packedValue,
                 0,
-                bkd.config.packedBytesLength);
+                packedBytesLength);
             return true;
           }
         }
       }
+
+    private boolean collectNextLeaf() throws IOException {
+      assert pointTree.moveToChild() == false;
+      mergeIntersectsVisitor.reset();
+      do {
+        if (pointTree.moveToSibling()) {
+          // move to first child of this node and collect docs
+          while (pointTree.moveToChild()) {}
+          pointTree.visitDocValues(mergeIntersectsVisitor);
+          return true;
+        }
+      } while (pointTree.moveToParent());
+      return false;
+    }
+  }
+
+  private static class MergeIntersectsVisitor implements IntersectVisitor {
+
+    int docsInBlock = 0;
+    byte[] packedValues;
+    int[] docIDs;
+    private final int packedBytesLength;
+
+    MergeIntersectsVisitor(int packedBytesLength) {
+      this.docIDs = new int[0];
+      this.packedValues = new byte[0];
+      this.packedBytesLength = packedBytesLength;
+    }
+
+    void reset() {
+      docsInBlock = 0;
+    }
+
+    @Override
+    public void grow(int count) {
+      assert docsInBlock == 0;
+      if (docIDs.length < count) {
+        docIDs = ArrayUtil.grow(docIDs, count);
+        int packedValuesSize = Math.toIntExact(docIDs.length * (long) packedBytesLength);
+        if (packedValuesSize > ArrayUtil.MAX_ARRAY_LENGTH) {

Review comment:
   I think that is ok. We control how we are iterating the tree, and we are
   visiting one leaf at a time. One leaf cannot contain more than
   `max_points_in_leaf_node` points, so it is not related to the points
   cardinality.
[GitHub] [lucene] jpountz commented on a change in pull request #7: LUCENE-9820: Separate logic for reading the BKD index from logic to intersecting it
jpountz commented on a change in pull request #7:
URL: https://github.com/apache/lucene/pull/7#discussion_r752133629

## File path: lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java

Review comment:
   Ah, thanks for explaining, maybe this is the thing I was confused about. I
   thought it could be called by e.g. `BKDReader#addAll`, which tries to grow
   by as much as possible (up to Integer#MAX_VALUE), but actually this merge
   visitor always visits leaf-by-leaf?
[GitHub] [lucene-solr] janhoy opened a new pull request #2612: SOLR-15804 Allow content-types with ;charset in ShowFileRequestHandler
janhoy opened a new pull request #2612:
URL: https://github.com/apache/lucene-solr/pull/2612

   * Request .html as text/xml, .css as text/plain and .js as application/javascript
   * Highlight css as txt (crashed browser)
   * Set extension to 'xml' for managed-schema instead of 'text/xml'
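The ";charset" handling the PR title implies boils down to comparing only the MIME type ahead of any parameters. A hypothetical illustration of that comparison; this is not the actual Solr code and the method name is made up.

```java
// Hypothetical illustration (not the actual Solr code): a content type such as
// "text/xml;charset=utf-8" should satisfy a check written against the bare
// MIME type "text/xml", so strip everything from the first ';' before comparing.
final class ContentTypeSketch {
  static boolean mimeTypeEquals(String contentType, String expectedMimeType) {
    int semi = contentType.indexOf(';');
    String mimeType = (semi < 0 ? contentType : contentType.substring(0, semi)).trim();
    return mimeType.equalsIgnoreCase(expectedMimeType);
  }
}
```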
[GitHub] [lucene-solr] janhoy merged pull request #2612: SOLR-15804 Allow content-types with ;charset in ShowFileRequestHandler
janhoy merged pull request #2612:
URL: https://github.com/apache/lucene-solr/pull/2612
[GitHub] [lucene] iverase commented on a change in pull request #7: LUCENE-9820: Separate logic for reading the BKD index from logic to intersecting it
iverase commented on a change in pull request #7:
URL: https://github.com/apache/lucene/pull/7#discussion_r752168490

## File path: lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java

Review comment:
   Yes, the magic happens in the method `collectNextLeaf` where we fill up the
   buffers.
[jira] [Updated] (LUCENE-10231) Do not cache too large PointInSetQueries
[ https://issues.apache.org/jira/browse/LUCENE-10231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Guo updated LUCENE-10231:
------------------------------
    Priority: Minor  (was: Major)

> Do not cache too large PointInSetQueries
> ----------------------------------------
>
>                 Key: LUCENE-10231
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10231
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Feng Guo
>            Priority: Minor
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> TermInSetQuery avoids caching queries that use too much memory, but there is
> no such limit in PointInSetQuery and PointInSetQueryWithScore, which may have
> similar problems. I added the logic to these two queries, but I'm not sure
> whether this is the most reasonable way. Maybe this logic is common and we
> could handle it in LRUQueryCache?
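A rough sketch of the gating idea, assuming a byte threshold (the constant below is made up): queries that expose their footprint via Accountable, as TermInSetQuery does, can be declined by a caching policy when they are too large.

```java
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Accountable;

// Rough sketch of the gating idea (the threshold is a made-up value): decline
// to cache queries that report a large memory footprint through Accountable.
final class CacheSizeGate {
  private static final long MAX_RAM_BYTES_PER_QUERY = 1024 * 1024; // hypothetical limit

  static boolean worthCaching(Query query) {
    if (query instanceof Accountable) {
      return ((Accountable) query).ramBytesUsed() <= MAX_RAM_BYTES_PER_QUERY;
    }
    return true; // no size information: defer to the existing policy
  }
}
```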
[GitHub] [lucene-solr] janhoy commented on pull request #2604: SOLR-15768 Tune zookeeper request handler permissions
janhoy commented on pull request #2604:
URL: https://github.com/apache/lucene-solr/pull/2604#issuecomment-972859726

   @dsmiley I wrapped this up for the 8x branch, added a changes entry, made
   the JIRA public, and this is ready to merge. Any other feedback?
[jira] [Commented] (LUCENE-10162) Add IntField, LongField, FloatField and DoubleField classes to index both points and doc values
[ https://issues.apache.org/jira/browse/LUCENE-10162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445924#comment-17445924 ]

Quentin Pradet commented on LUCENE-10162:
-----------------------------------------
This is very interesting, as I've been confused by this when working on
LUCENE-10085.

* Is it correct to assume that this is only "syntactic sugar" and that under
the hood we'll still add an IntPoint and a NumericDocValuesField to the doc?
* If yes, would it make sense to desugar in Document#add? If the field is
composite (e.g. IntField), then we add the two fields instead of one.
* If no, at what point should we split IntField into its components?

> Add IntField, LongField, FloatField and DoubleField classes to index both
> points and doc values
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-10162
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10162
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>
> Currently we have IntPoint, LongPoint, FloatPoint and DoublePoint on the one
> hand, and NumericDocValuesField and SortedNumericDocValuesField on the other
> hand.
> When we introduced these classes, this distinction made sense: use the
> XXXPoint classes if you want your numeric fields to be searchable and the
> XXXDocValuesField classes if you want your numeric fields to be
> sortable/aggregatable.
> However, since then we introduced logic to take advantage of doc values for
> filtering (IndexOrDocValuesQuery) and enhanced sorting to take advantage of
> the points index to skip non-competitive documents. So even if you only need
> searching, or if you only need sorting, it's likely a good idea to index both
> with points *and* doc values.
> Could we make this easier on users by having XXXField classes that
> automatically do it, as opposed to requiring users to add both an XXXPoint
> and an XXXDocValuesField for every numeric field to their index? This could
> also make consuming these fields easier, e.g. factory methods for range
> queries could automatically use IndexOrDocValuesQuery.
[jira] [Commented] (LUCENE-10162) Add IntField, LongField, FloatField and DoubleField classes to index both points and doc values
[ https://issues.apache.org/jira/browse/LUCENE-10162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445929#comment-17445929 ]

Adrien Grand commented on LUCENE-10162:
---------------------------------------
bq. Is it correct to assume that this is only "syntactic sugar" and that under
the hood we'll still add an IntPoint and a NumericDocValuesField to the doc?

Correct.

bq. would it make sense to desugar in Document#add? If the field is composite
(e.g. IntField), then we add the two fields instead of one

I don't think there's an easy way to do this. Lucene Field classes tend to be
very simple: they mostly wrap a value and configure a FieldType object that
describes how the data should be indexed. The way I was thinking of addressing
this issue would consist of implementing these LongField/IntField/... classes
from scratch, without trying to reuse LongPoint/IntPoint/... or
NumericDocValuesField/SortedNumericDocValuesField.
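A rough sketch of the from-scratch approach described above, not the final API: one FieldType enables both a 1D point and SORTED_NUMERIC doc values, and the field hands the packed bytes to the points writer and the raw long to the doc-values writer. Class name and details are illustrative.

```java
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.DocValuesType;
import org.apache.lucene.util.BytesRef;

// Sketch of a combined field (not the final API): one FieldType carrying both
// a 1D point and SORTED_NUMERIC doc values, so a single add() indexes both.
public class LongFieldSketch extends Field {
  private static final FieldType TYPE = new FieldType();

  static {
    TYPE.setDimensions(1, Long.BYTES);                   // points side
    TYPE.setDocValuesType(DocValuesType.SORTED_NUMERIC); // doc-values side
    TYPE.freeze();
  }

  public LongFieldSketch(String name, long value) {
    super(name, TYPE);
    fieldsData = value;
  }

  @Override
  public BytesRef binaryValue() {
    // the points writer reads the packed encoding
    return LongPoint.pack((Long) fieldsData);
  }

  @Override
  public Number numericValue() {
    // the doc-values writer reads the raw long
    return (Number) fieldsData;
  }
}
```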
[GitHub] [lucene] gsmiller commented on a change in pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
gsmiller commented on a change in pull request #443:
URL: https://github.com/apache/lucene/pull/443#discussion_r752270520

## File path: lucene/facet/src/java/org/apache/lucene/facet/FacetsConfig.java

@@ -409,9 +410,26 @@ private void processFacetFields(
       indexDrillDownTerms(doc, indexFieldName, dimConfig, facetLabel);
     }

-    // Facet counts:
-    // DocValues are considered stored fields:
-    doc.add(new BinaryDocValuesField(indexFieldName, dedupAndEncode(ordinals.get())));
+    // Store the taxonomy ordinals associated with each doc. Prefer to use SortedNumericDocValues
+    // but "fall back" to a custom binary format to maintain backwards compatibility with Lucene 8
+    // indexes.
+    if (taxoWriter.useNumericDocValuesForOrdinals()) {

Review comment:
   OK sounds like this should work then. Thanks @mikemccand!
[GitHub] [lucene] gsmiller commented on a change in pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
gsmiller commented on a change in pull request #443:
URL: https://github.com/apache/lucene/pull/443#discussion_r752271392

## File path: lucene/facet/src/java/org/apache/lucene/facet/FacetUtils.java

@@ -81,4 +82,17 @@ public long cost() {
       }
     };
   }
+
+  /**
+   * Determine whether-or-not an index segment is using the older-style binary format or the newer
+   * NumericDocValues format for storing taxonomy faceting ordinals (for the specified field).
+   *
+   * @deprecated Please do not rely on this method. It is added as a temporary measure for providing
+   *     index backwards-compatibility with Lucene 8 and earlier indexes, and will be removed in
+   *     Lucene 10.
+   */
+  @Deprecated
+  public static boolean usesOlderBinaryOrdinals(LeafReader reader) {
+    return reader.getMetaData().getCreatedVersionMajor() <= 8;
+  }

Review comment:
   Makes sense. I went ahead and made this change. Thanks for the suggestion.
[GitHub] [lucene] rmuir commented on a change in pull request #452: LUCENE-10239: upgrade jflex (1.7.0 -> 1.8.2)
rmuir commented on a change in pull request #452:
URL: https://github.com/apache/lucene/pull/452#discussion_r752274558

## File path: dev-tools/missing-doclet/src/main/java/org/apache/lucene/missingdoclet/MissingDoclet.java

@@ -331,7 +331,6 @@ private boolean hasInheritedJavadocs(Element element) {
     if (hasOverrides) {
       // If an element has explicit @Overrides annotation, assume it does
       // have inherited javadocs somewhere.
-      reporter.print(Diagnostic.Kind.NOTE, element, "javadoc empty but @Override declared, skipping.");

Review comment:
   We could add back the debugging guarded by a sysprop. Maybe we could enable
   additional output from the gradle build if someone runs `gradlew --debug`?
   I keep thinking there should already be a "proper" way for this (e.g. a
   cmdline flag to the javadoc tool), but I'm not finding it.
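The sysprop guard could look like this minimal sketch; the property name is made up, and the helper is shown as a standalone class rather than inside MissingDoclet.

```java
import javax.lang.model.element.Element;
import javax.tools.Diagnostic;
import jdk.javadoc.doclet.Reporter;

// Sketch of the sysprop guard suggested above (the property name is made up):
// keep the NOTE-level diagnostic, but only emit it when explicitly requested.
final class VerboseNotes {
  private static final boolean VERBOSE = Boolean.getBoolean("missingdoclet.verbose");

  static void noteSkipped(Reporter reporter, Element element) {
    if (VERBOSE) {
      reporter.print(
          Diagnostic.Kind.NOTE, element, "javadoc empty but @Override declared, skipping.");
    }
  }
}
```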
[GitHub] [lucene] dweiss commented on a change in pull request #452: LUCENE-10239: upgrade jflex (1.7.0 -> 1.8.2)
dweiss commented on a change in pull request #452:
URL: https://github.com/apache/lucene/pull/452#discussion_r752281787

## File path: dev-tools/missing-doclet/src/main/java/org/apache/lucene/missingdoclet/MissingDoclet.java

Review comment:
   I think that code is sort of correct in logging at NOTE level; the problem
   is I can't find the right option in javadoc to skip NOTE messages (and leave
   only warnings). I'd comment it out instead of removing it so that perhaps it
   can be tackled separately. There should be a way of doing it somehow.
[GitHub] [lucene] gsmiller commented on a change in pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
gsmiller commented on a change in pull request #443:
URL: https://github.com/apache/lucene/pull/443#discussion_r752290342

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/CachedOrdinalsReader.java

@@ -45,7 +45,11 @@
  *
  * NOTE: create one instance of this and re-use it for all facet implementations (the
  * cache is per-instance, not static).
+ *
+ * @deprecated Custom binary encodings for taxonomy ordinals are no longer supported starting with

Review comment:
   Added. Thanks!
[GitHub] [lucene] rmuir commented on a change in pull request #452: LUCENE-10239: upgrade jflex (1.7.0 -> 1.8.2)
rmuir commented on a change in pull request #452:
URL: https://github.com/apache/lucene/pull/452#discussion_r752297406

## File path: dev-tools/missing-doclet/src/main/java/org/apache/lucene/missingdoclet/MissingDoclet.java

Review comment:
   Yeah, I just haven't experimented with all the options, such as the default
   level vs `-verbose` vs `-quiet`.
[GitHub] [lucene] gsmiller commented on a change in pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
gsmiller commented on a change in pull request #443:
URL: https://github.com/apache/lucene/pull/443#discussion_r752298802

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/OrdinalMappingLeafReader.java

@@ -107,6 +113,64 @@ public BytesRef binaryValue() {
     }
   }

+  private class OrdinalMappingSortedNumericDocValues extends FilterSortedNumericDocValues {
+    private final IntArrayList currentValues;
+    private int currIndex;
+
+    OrdinalMappingSortedNumericDocValues(SortedNumericDocValues in) {
+      super(in);
+      currentValues = new IntArrayList(32);
+    }
+
+    @Override
+    public boolean advanceExact(int target) throws IOException {
+      boolean result = in.advanceExact(target);
+      if (result) {
+        reloadValues();
+      }
+      return result;
+    }
+
+    @Override
+    public int advance(int target) throws IOException {
+      int result = in.advance(target);
+      if (result != DocIdSetIterator.NO_MORE_DOCS) {
+        reloadValues();
+      }
+      return result;
+    }
+
+    @Override
+    public int nextDoc() throws IOException {
+      int result = in.nextDoc();
+      if (result != DocIdSetIterator.NO_MORE_DOCS) {
+        reloadValues();
+      }
+      return result;
+    }
+
+    @Override
+    public int docValueCount() {
+      return currentValues.elementsCount;
+    }
+
+    private void reloadValues() throws IOException {
+      currIndex = 0;
+      currentValues.clear();
+      for (int i = 0; i < in.docValueCount(); i++) {
+        currentValues.add(ordinalMap[(int) in.nextValue()]);

Review comment:
   @rmuir Hmm, yeah, interesting. We have a couple of cases related to taxonomy
   faceting now where we're explicitly storing ordinals in a dv field, but not
   using SORTED_SET, since the "mapping" back to the string values is handled
   by the taxonomy index (this change and LUCENE-10122). I wonder if there are
   any other cases like this outside of our taxonomy faceting? I wonder if it
   would make any sense at all to consider a new dv field type for explicitly
   storing ordinals? I suspect we don't have enough general use-cases to
   support such an idea, but just throwing it out there for conversation. Such
   a field could make assumptions about the data that would probably allow it
   to be more efficient, and some of the logic we're currently handling
   ourselves (e.g., de-duping) could be done by the field type. Like I said,
   probably not enough real-world use-cases to justify adding a new type, but
   figured I'd float the suggestion anyway for input.
[GitHub] [lucene] costin opened a new pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]
costin opened a new pull request #453:
URL: https://github.com/apache/lucene/pull/453

   Add VarHandle variants for `Packed64` to investigate whether using
   VarHandles improves performance or not. This PR introduces two new Packed64
   variants:

   ~ Packed64LongLong
   Same as Packed64, however instead of using direct array access it relies on
   `VarHandle`. Its main purpose is benchmarking.

   ~ Packed64LongByte
   A different implementation of Packed64 backed by a byte[]. It reads the
   data as a long and, in case of overflow, reads an extra byte. The algorithm
   is slightly different since the overflowing bits need to be computed on a
   different block size (8 vs 64), which requires extra cycles.

   Related LUCENE-10205
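For readers following along, here is a condensed sketch of the byte[]-plus-VarHandle read path the description outlines. Because the in-window shift is at most 7 bits, an overflowing value never needs more than one extra byte. This is a simplification of the idea, not the PR's code; the padding and bounds handling are assumptions.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Condensed sketch of the byte[]-backed read path (not the PR's code): read a
// long at an arbitrary byte offset via a VarHandle view, then fetch one extra
// byte when the value straddles the 8-byte window.
final class Packed64ByteSketch {
  private static final VarHandle LONG_VIEW =
      MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

  private final byte[] blocks; // padded so unaligned reads never run off the end
  private final int bitsPerValue;
  private final long mask;

  Packed64ByteSketch(int valueCount, int bitsPerValue) {
    this.bitsPerValue = bitsPerValue;
    this.mask = bitsPerValue == 64 ? ~0L : (1L << bitsPerValue) - 1;
    this.blocks = new byte[(int) (((long) valueCount * bitsPerValue + 7) / 8) + 8];
  }

  long get(int index) {
    long bitPos = (long) index * bitsPerValue;
    int bytePos = (int) (bitPos >>> 3);
    int shift = (int) (bitPos & 7); // 0..7, so at most one extra byte is needed
    long value = ((long) LONG_VIEW.get(blocks, bytePos) >>> shift) & mask;
    int bitsRead = 64 - shift;
    if (bitsRead < bitsPerValue) {
      // overflow: the remaining high bits live in the following byte
      long extra = blocks[bytePos + 8] & 0xFFL;
      value |= (extra << bitsRead) & mask;
    }
    return value;
  }
}
```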
[GitHub] [lucene] rmuir commented on a change in pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
rmuir commented on a change in pull request #443:
URL: https://github.com/apache/lucene/pull/443#discussion_r752303321

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/OrdinalMappingLeafReader.java

Review comment:
   Currently in the benchmarks, isn't the sorted-set approach faster? So
   personally I wouldn't want to add another DV type that just lets you do
   things slower?
[jira] [Commented] (LUCENE-10162) Add IntField, LongField, FloatField and DoubleField classes to index both points and doc values
[ https://issues.apache.org/jira/browse/LUCENE-10162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445971#comment-17445971 ]

Artem Prigoda commented on LUCENE-10162:
----------------------------------------
Ah, and I tried to create a new Field which combines the functionality of
XXXPoint and `NumericDocValuesField`! Anyway, that was an interesting journey
into the Lucene internals.
[GitHub] [lucene] rmuir commented on a change in pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
rmuir commented on a change in pull request #443:
URL: https://github.com/apache/lucene/pull/443#discussion_r752328265

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/OrdinalMappingLeafReader.java

Review comment:
   @gsmiller Put another way, I think it would be much more interesting to
   think about something like a new 'taxonomy' or 'hierarchy' type that
   supports the additional features the current taxo index provides, rather
   than providing a lot of hooks for apps to manage ordinals themselves. It
   could probably ultimately perform better and be easier to use (e.g. not
   juggle a sidecar index). But this PR is a great step forward, as at least
   we get the performance closer :)
[GitHub] [lucene] gsmiller commented on pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
gsmiller commented on pull request #443:
URL: https://github.com/apache/lucene/pull/443#issuecomment-972948547

   > Maybe clarify that full index rebuild means start a new 9.x index and
   > index all your documents. I.e. merely re-indexing all docs into your
   > previous (8.x created) index is not sufficient.

   Makes sense. Will tweak the language.
[GitHub] [lucene] gsmiller commented on a change in pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
gsmiller commented on a change in pull request #443:
URL: https://github.com/apache/lucene/pull/443#discussion_r752335273

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FastTaxonomyFacetCounts.java

@@ -69,31 +69,34 @@ public FastTaxonomyFacetCounts(
   }

   private final void count(List matchingDocs) throws IOException {
+    if (matchingDocs.isEmpty()) {
+      return;
+    }
+
+    boolean useBinaryDv = FacetUtils.usesOlderBinaryOrdinals(matchingDocs.get(0).context.reader());
+
     for (MatchingDocs hits : matchingDocs) {
-      BinaryDocValues dv = hits.context.reader().getBinaryDocValues(indexFieldName);
-      if (dv == null) { // this reader does not have DocValues for the requested category list
+      assert useBinaryDv == FacetUtils.usesOlderBinaryOrdinals(hits.context.reader());

Review comment:
   Hmm, interesting @jpountz. So I think this still works as expected in the
   case of a `MultiReader`, but do you see any concerns there? In a situation
   where a `MultiReader` wrapped an 8.x and 9.x index, it seems like the
   `LeafReader`s will correctly report the 8.x / 9.x versions, and the older
   format will be retained for the 8.x index and the newer for 9.x. I'm not
   very familiar with these MultiReader cases though, so I'd like to confirm
   this understanding. Thanks!
[GitHub] [lucene] gsmiller commented on a change in pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
gsmiller commented on a change in pull request #443:
URL: https://github.com/apache/lucene/pull/443#discussion_r752355654

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/OrdinalMappingLeafReader.java

Review comment:
   > @gsmiller put another way, I think it would be much more interesting to
   > think about something like a new 'taxonomy' or 'hierarchy' type that
   > supports the additional features the current taxo index provides, rather
   > than providing a lot of hooks for apps to manage ordinals themselves. It
   > could probably ultimately perform better and be easier to use (e.g. not
   > juggle a sidecar index).

   I like this way of thinking about the problem. As far as benchmarking
   performance goes, the big improvement was getting away from the custom
   binary format (which was packing ordinals into a single binary payload
   using a v-int style approach) and using SORTED_NUMERIC instead to encode
   the ordinals for each doc. So there could be room for more improvement (or
   at least remaining performance-neutral) by having an "ordinal specific" doc
   values field type. But I like your way of thinking and wouldn't hold up
   this PR by any means to explore something different at this point. Going to
   SORTED_NUMERIC is a good step forward for now. Thanks!
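To make the comparison concrete: with SORTED_NUMERIC, counting reduces to a plain iterator loop over already-decoded longs, versus per-document v-int decoding of a binary payload under the old format. A simplified sketch; the field name and surrounding plumbing are illustrative, not the actual FastTaxonomyFacetCounts code.

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SortedNumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;

// Simplified sketch of counting ordinals from SORTED_NUMERIC doc values:
// no custom decoding per document, just nextDoc()/nextValue().
final class OrdinalCountSketch {
  static void countOrdinals(LeafReader reader, int[] counts) throws IOException {
    SortedNumericDocValues ords = reader.getSortedNumericDocValues("$facets");
    if (ords == null) {
      return; // this segment has no ordinals for the field
    }
    for (int doc = ords.nextDoc();
        doc != DocIdSetIterator.NO_MORE_DOCS;
        doc = ords.nextDoc()) {
      int valueCount = ords.docValueCount();
      for (int i = 0; i < valueCount; i++) {
        counts[(int) ords.nextValue()]++;
      }
    }
  }
}
```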
[GitHub] [lucene] spyk commented on a change in pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue
spyk commented on a change in pull request #380:
URL: https://github.com/apache/lucene/pull/380#discussion_r752359873

## File path: lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java

@@ -169,11 +169,14 @@ public static String getLemmatizerDictionary(String dictionaryFile, ResourceLoader loader)
           builder.append(chars, 0, numRead);
         }
       } while (numRead > 0);
-      dictionary = builder.toString();
-      lemmaDictionaries.put(dictionaryFile, dictionary);
+      String dictionary = builder.toString();
+      InputStream dictionaryInputStream =
+          new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8));
+      dictionaryLemmatizer = new DictionaryLemmatizer(dictionaryInputStream);

Review comment:
   Oh, you're right, it's the same exact issue with all `DictionaryLemmatizer`
   constructors. So I guess it can be replaced anyway by
   `loader.openResource(dictionaryFile)` for the sake of simplicity. I was not
   aware of the ReaderInputStream solution btw, but yes, looks like it's not
   on the classpath.
[GitHub] [lucene] jpountz merged pull request #450: LUCENE-10242: The TopScoreDocCollector::createSharedManager should use ScoreDoc instead of FieldDoc
jpountz merged pull request #450:
URL: https://github.com/apache/lucene/pull/450
[GitHub] [lucene] reta commented on pull request #450: LUCENE-10242: The TopScoreDocCollector::createSharedManager should use ScoreDoc instead of FieldDoc
reta commented on pull request #450:
URL: https://github.com/apache/lucene/pull/450#issuecomment-972980540

   Thank you @jpountz!
[jira] [Commented] (LUCENE-10242) The TopScoreDocCollector::createSharedManager should use ScoreDoc instead of FieldDoc
[ https://issues.apache.org/jira/browse/LUCENE-10242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445993#comment-17445993 ]

ASF subversion and git services commented on LUCENE-10242:
----------------------------------------------------------
Commit 6bd5c14bf3c6113244773d2e1eb39233dc33ad71 in lucene's branch
refs/heads/main from Andriy Redko
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=6bd5c14 ]

LUCENE-10242: The TopScoreDocCollector::createSharedManager should use
ScoreDoc instead of FieldDoc (#450)

Signed-off-by: Andriy Redko

> The TopScoreDocCollector::createSharedManager should use ScoreDoc instead
> of FieldDoc
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-10242
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10242
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 8.9, 8.10
>            Reporter: Andriy Redko
>            Priority: Minor
>             Fix For: 9.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The `TopScoreDocCollector::createSharedManager` should use `ScoreDoc`
> instead of `FieldDoc` as the type of the `after` argument. The `FieldDoc` is
> really needed only in the case of `TopFieldCollector`, but now its usage is
> mandated by `TopScoreDocCollector` as well.
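Assuming the post-change signature, usage looks roughly like the sketch below: relevance-only paging needs nothing more than the previous page's last ScoreDoc, with no FieldDoc (and hence no sort fields) involved.

```java
import org.apache.lucene.search.CollectorManager;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopScoreDocCollector;

// Sketch of usage after the change: the shared manager accepts any ScoreDoc
// as the "after" anchor instead of requiring a FieldDoc.
final class SearchAfterExample {
  static CollectorManager<TopScoreDocCollector, TopDocs> pageAfter(ScoreDoc lastHit) {
    // numHits = 10, totalHitsThreshold = 1000 are arbitrary example values
    return TopScoreDocCollector.createSharedManager(10, lastHit, 1000);
  }
}
```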
[jira] [Resolved] (LUCENE-10242) The TopScoreDocCollector::createSharedManager should use ScoreDoc instead of FieldDoc
[ https://issues.apache.org/jira/browse/LUCENE-10242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand resolved LUCENE-10242.
-----------------------------------
    Fix Version/s: 9.0
       Resolution: Fixed

Thank you!
[jira] [Commented] (LUCENE-10242) The TopScoreDocCollector::createSharedManager should use ScoreDoc instead of FieldDoc
[ https://issues.apache.org/jira/browse/LUCENE-10242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445994#comment-17445994 ]

ASF subversion and git services commented on LUCENE-10242:
----------------------------------------------------------
Commit 42bee6f22302dad73b2a336db4b4725f69282e33 in lucene's branch
refs/heads/branch_9x from Andriy Redko
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=42bee6f ]

LUCENE-10242: The TopScoreDocCollector::createSharedManager should use
ScoreDoc instead of FieldDoc (#450)

Signed-off-by: Andriy Redko
[GitHub] [lucene] costin commented on pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]
costin commented on pull request #453:
URL: https://github.com/apache/lucene/pull/453#issuecomment-972988872

   I've tested `Packed64LongByte` multiple times against `TestPackedInts` and
   it seems to pass. Note that I've focused mainly on the `get` and `set`
   methods and removed the specialized fill and bulk get/set methods due to
   the slightly different encoding.

   tl;dr - `Packed64` has the best performance, followed by
   `Packed64VarHandleLongLong` and lastly by `Packed64LongByte`, with a
   significant gap. My guess is that the long[] array is more cache friendly
   while the byte[] isn't: by sliding across the byte[] to read longs, the
   caches are thrashed, which is much more expensive than doing bitwise
   operations. The data in the byte[] case supports this - performance is
   great for the 1 and 64 bpv cases (where the sliding is minimal since new
   positions belong to the same byte block), while for the rest it falls to
   the ground.

   I've included the benchmark in the commit; the tests are running on the
   latest JDK 17.0.1 on a Ryzen 5900x.

   ```
   openjdk version "17.0.1" 2021-10-19
   OpenJDK Runtime Environment Temurin-17.0.1+12 (build 17.0.1+12)
   OpenJDK 64-Bit Server VM Temurin-17.0.1+12 (build 17.0.1+12, mixed mode, sharing)
   ```

   ```
   # JMH version: 1.33
   # VM version: JDK 17.0.1, OpenJDK 64-Bit Server VM, 17.0.1+12

   Benchmark                                        (bpv)  (size)   Mode  Cnt      Score      Error  Units
   Packed64Benchmark.packed64                           1   10240  thrpt    3  80129.482 ± 2603.161  ops/s
   Packed64Benchmark.packed64                           4   10240  thrpt    3  87680.030 ± 2725.245  ops/s
   Packed64Benchmark.packed64                           5   10240  thrpt    3  56768.849 ±  633.855  ops/s
   Packed64Benchmark.packed64                           8   10240  thrpt    3  91869.977 ± 1579.788  ops/s
   Packed64Benchmark.packed64                          11   10240  thrpt    3  71802.824 ± 2124.577  ops/s
   Packed64Benchmark.packed64                          16   10240  thrpt    3  94892.804 ± 1876.298  ops/s
   Packed64Benchmark.packed64                          23   10240  thrpt    3  65401.053 ±  690.302  ops/s
   Packed64Benchmark.packed64                          25   10240  thrpt    3  69623.043 ±  679.354  ops/s
   Packed64Benchmark.packed64                          31   10240  thrpt    3  66732.829 ± 1788.781  ops/s
   Packed64Benchmark.packed64                          32   10240  thrpt    3  99938.841 ±  197.846  ops/s
   Packed64Benchmark.packed64                          47   10240  thrpt    3  52174.874 ± 2369.976  ops/s
   Packed64Benchmark.packed64                          59   10240  thrpt    3  55240.166 ±  165.920  ops/s
   Packed64Benchmark.packed64                          61   10240  thrpt    3  54909.784 ± 5064.028  ops/s
   Packed64Benchmark.packed64                          64   10240  thrpt    3  99261.182 ± 2746.589  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte       1   10240  thrpt    3  56909.631 ± 1986.897  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte       4   10240  thrpt    3  31584.303 ± 1063.071  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte       5   10240  thrpt    3  23703.179 ± 2127.908  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte       8   10240  thrpt    3  17958.177 ±  825.106  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      11   10240  thrpt    3  17810.271 ±  379.489  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      16   10240  thrpt    3  19043.959 ± 2185.749  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      23   10240  thrpt    3  17784.221 ±  824.539  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      25   10240  thrpt    3  17728.510 ± 1063.605  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      31   10240  thrpt    3  17742.034 ±  198.159  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      32   10240  thrpt    3  21951.514 ± 7648.112  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      47   10240  thrpt    3  17741.987 ±  866.598  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      59   10240  thrpt    3  19481.548 ±  446.086  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      61   10240  thrpt    3  18999.020 ±  949.164  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      64   10240  thrpt    3  74178.395 ± 1219.876  ops/s
   Packed64Benchmark.packed64VarHandleLongLong          1   10240  thrpt    3  73824.221 ± 3210.970  ops/s
   Packed64Benchmark.packed64VarHandleLongLong          4   10240  thrpt    3  82908.298 ± 8099.224  ops/s
   Packed64Benchmark.packed64VarHandleLongLong          5   10240  thrpt    3  54176.175 ± 2005.872  ops/s
   Packed64Benchmark.packed64VarHandleLongLong          8   10240  thrpt    3  83556.558 ± 1658.929  ops/s
   Packed64Benchmark.packed64VarHandleLongLong         11   10240  thrpt    3  56512.522 ±  380.595  ops/s
   ```
[GitHub] [lucene] jpountz commented on a change in pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
jpountz commented on a change in pull request #443:
URL: https://github.com/apache/lucene/pull/443#discussion_r752375283

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FastTaxonomyFacetCounts.java

Review comment:
   @gsmiller Maybe we should recompute `useBinaryDv` on every `MatchingDocs`
   instead of pre-computing it ahead of the loop based on the first leaf of
   the index? It's probably a corner case for taxonomy-based faceting though,
   since we expect the main index and the taxonomy index to have been built
   side-by-side, so we wouldn't expect users to bring in an index that has
   been built on the side with a different creation version and plug it in
   through a MultiReader?
[jira] [Commented] (LUCENE-10242) The TopScoreDocCollector::createSharedManager should use ScoreDoc instead of FieldDoc
[ https://issues.apache.org/jira/browse/LUCENE-10242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446004 ] ASF subversion and git services commented on LUCENE-10242: -- Commit de4402cf622921141aa90ceca826979c14cf0db7 in lucene's branch refs/heads/branch_9_0 from Andriy Redko [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=de4402c ] LUCENE-10242: The TopScoreDocCollector::createSharedManager should use ScoreDoc instead of FieldDoc (#450) Signed-off-by: Andriy Redko > The TopScoreDocCollector::createSharedManager should use ScoreDoc instead of > FieldDoc > - > > Key: LUCENE-10242 > URL: https://issues.apache.org/jira/browse/LUCENE-10242 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 8.9, 8.10 >Reporter: Andriy Redko >Priority: Minor > Fix For: 9.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > The `TopScoreDocCollector::createSharedManager` should use `ScoreDoc` instead > of `FieldDoc` as the type of the `after` argument. The `FieldDoc` is really needed > only in the case of `TopFieldCollector`, but currently its usage is mandated by > `TopScoreDocCollector` as well. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] magibney commented on a change in pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue
magibney commented on a change in pull request #380: URL: https://github.com/apache/lucene/pull/380#discussion_r752385177

## File path: lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java
##
@@ -169,11 +169,14 @@ public static String getLemmatizerDictionary(String dictionaryFile, ResourceLoad
         builder.append(chars, 0, numRead);
       }
     } while (numRead > 0);
-    dictionary = builder.toString();
-    lemmaDictionaries.put(dictionaryFile, dictionary);
+    String dictionary = builder.toString();
+    InputStream dictionaryInputStream =
+        new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8));
+    dictionaryLemmatizer = new DictionaryLemmatizer(dictionaryInputStream);

Review comment: All `ReaderInputStream` (from Apache commons-io) does is re-encode a `Reader` as an input stream in some specified encoding. It would be relatively straightforward to do this manually, but again it is orthogonal to the change you're looking to introduce. Ideally, opennlp would have a `DictionaryLemmatizer` ctor that accepts a `Reader` directly -- I can't imagine that would be a controversial upstream PR?

>for the sake of simplicity

Yes; I think in its current state this PR would make things better in some respects, even if it leaves some problems unchanged. fwiw, I'm not necessarily inclined to assume a logical motivation behind the current recreation-from-String of `DictionaryLemmatizer` for every call to `.create()`. Even if there _had_ been concerns about thread safety or whatever, that wouldn't explain why the dictionary is stored as a `String` as opposed to a `byte[]`, given that as a consequence every invocation of `.create()` superfluously re-encodes the String as UTF-8 :-/ Ultimately it looks like the heap requirement here is currently ~3x what it could be, given the on-heap String, encoded to a transient monolithic `byte[]`, parsed into a `DictionaryLemmatizer` ... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] costin commented on pull request #453: Investigate whether Packed64 should use a byte[] plus VarHandles instead of a long[]
costin commented on pull request #453: URL: https://github.com/apache/lucene/pull/453#issuecomment-972999769

Adding a bit more info to this ticket regarding the underlying assembly. Going by the dumps below, the p64 method is the shortest, followed by p64lb and p64ll.
```
====================== C1-compiled nmethod ======================
--- Assembly ---
Compiled method (c1)  194  774    3  org.apache.lucene.util.packed.Packed64::set (145 bytes)
 total in heap  [0x01de36e94990,0x01de36e94f20] = 1424
 relocation     [0x01de36e94ae8,0x01de36e94b30] = 72
 main code      [0x01de36e94b40,0x01de36e94dc0] = 640
 stub code      [0x01de36e94dc0,0x01de36e94df8] = 56
 oops           [0x01de36e94df8,0x01de36e94e00] = 8
 metadata       [0x01de36e94e00,0x01de36e94e08] = 8
 scopes data    [0x01de36e94e08,0x01de36e94e50] = 72
 scopes pcs     [0x01de36e94e50,0x01de36e94ef0] = 160
 dependencies   [0x01de36e94ef0,0x01de36e94ef8] = 8
 nul chk table  [0x01de36e94ef8,0x01de36e94f20] = 40

--- Assembly ---
Compiled method (c1)  173  799    3  org.apache.lucene.util.packed.Packed64VHLongAndByte::set (144 bytes)
 total in heap  [0x01c7d511be10,0x01c7d511d720] = 6416
 relocation     [0x01c7d511bf68,0x01c7d511c0c8] = 352
 main code      [0x01c7d511c0e0,0x01c7d511d0e0] = 4096
 stub code      [0x01c7d511d0e0,0x01c7d511d158] = 120
 oops           [0x01c7d511d158,0x01c7d511d190] = 56
 metadata       [0x01c7d511d190,0x01c7d511d1e8] = 88
 scopes data    [0x01c7d511d1e8,0x01c7d511d500] = 792
 scopes pcs     [0x01c7d511d500,0x01c7d511d6e0] = 480
 dependencies   [0x01c7d511d6e0,0x01c7d511d6e8] = 8
 nul chk table  [0x01c7d511d6e8,0x01c7d511d720] = 56

--- Assembly ---
Compiled method (c1)  177  797    3  org.apache.lucene.util.packed.Packed64VHLongLong::set (204 bytes)
 total in heap  [0x018a00893b90,0x018a00897390] = 14336
 relocation     [0x018a00893ce8,0x018a00894060] = 888
 main code      [0x018a00894060,0x018a008965a0] = 9536
 stub code      [0x018a008965a0,0x018a00896698] = 248
 oops           [0x018a00896698,0x018a008966d0] = 56
 metadata       [0x018a008966d0,0x018a00896728] = 88
 scopes data    [0x018a00896728,0x018a00896e70] = 1864
 scopes pcs     [0x018a00896e70,0x018a00897320] = 1200
 dependencies   [0x018a00897320,0x018a00897328] = 8
 nul chk table  [0x018a00897328,0x018a00897390] = 104
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
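For anyone wanting to reproduce dumps like these: HotSpot can print per-method compilation output with its diagnostic flags (full disassembly additionally requires the hsdis plugin to be installed); the runner class name below is just a placeholder.
```
java -XX:+UnlockDiagnosticVMOptions \
     -XX:CompileCommand=print,org.apache.lucene.util.packed.Packed64::set \
     MyBenchmarkRunner
```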
[GitHub] [lucene] gsmiller commented on a change in pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
gsmiller commented on a change in pull request #443: URL: https://github.com/apache/lucene/pull/443#discussion_r752406697

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FastTaxonomyFacetCounts.java
##
@@ -69,31 +69,34 @@ public FastTaxonomyFacetCounts(
   }

   private final void count(List<MatchingDocs> matchingDocs) throws IOException {
+    if (matchingDocs.isEmpty()) {
+      return;
+    }
+
+    boolean useBinaryDv = FacetUtils.usesOlderBinaryOrdinals(matchingDocs.get(0).context.reader());
+
     for (MatchingDocs hits : matchingDocs) {
-      BinaryDocValues dv = hits.context.reader().getBinaryDocValues(indexFieldName);
-      if (dv == null) { // this reader does not have DocValues for the requested category list
+      assert useBinaryDv == FacetUtils.usesOlderBinaryOrdinals(hits.context.reader());

Review comment: Thanks @jpountz. I _think_ I'm effectively doing this now in the latest revision, since `FacetUtil#loadOrdinalValues` does the check for the provided leaf when determining what format to expect, and that is invoked for each `MatchingDocs` instance. So I _think_ we're covered here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10122) Explore using NumericDocValue to store taxonomy parent array
[ https://issues.apache.org/jira/browse/LUCENE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446030 ] Adrien Grand commented on LUCENE-10122: --- It looks like this change is missing from branch_9x? > Explore using NumericDocValue to store taxonomy parent array > > > Key: LUCENE-10122 > URL: https://issues.apache.org/jira/browse/LUCENE-10122 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: 9.0 >Reporter: Haoyu Zhai >Priority: Minor > Time Spent: 5h 10m > Remaining Estimate: 0h > > We currently use the term position of a hardcoded term in a hardcoded field to > represent the parent ordinal of each taxonomy label. That is an old approach that > perhaps dates back to the time when doc values didn't exist. > We probably want to use NumericDocValues instead, given we have spent > quite a lot of effort optimizing them. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a change in pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
jpountz commented on a change in pull request #443: URL: https://github.com/apache/lucene/pull/443#discussion_r752431752

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FastTaxonomyFacetCounts.java
##
@@ -69,31 +69,34 @@ public FastTaxonomyFacetCounts(
   }

   private final void count(List<MatchingDocs> matchingDocs) throws IOException {
+    if (matchingDocs.isEmpty()) {
+      return;
+    }
+
+    boolean useBinaryDv = FacetUtils.usesOlderBinaryOrdinals(matchingDocs.get(0).context.reader());
+
     for (MatchingDocs hits : matchingDocs) {
-      BinaryDocValues dv = hits.context.reader().getBinaryDocValues(indexFieldName);
-      if (dv == null) { // this reader does not have DocValues for the requested category list
+      assert useBinaryDv == FacetUtils.usesOlderBinaryOrdinals(hits.context.reader());

Review comment: Ah, you are right. I got confused because this comment was linked to a previous version of your code that didn't use `FacetUtil#loadOrdinalValues` yet! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10208) Minimum score can decrease in concurrent search
[ https://issues.apache.org/jira/browse/LUCENE-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446041 ] ASF subversion and git services commented on LUCENE-10208: -- Commit 2e5c4bb5a53f8367fb44f8db1bf6454d2d9b8e37 in lucene's branch refs/heads/branch_9x from Jim Ferenczi [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=2e5c4bb ] LUCENE-10208: Ensure that the minimum competitive score does not decrease in concurrent search (#431) Co-authored-by: Adrien Grand > Minimum score can decrease in concurrent search > --- > > Key: LUCENE-10208 > URL: https://issues.apache.org/jira/browse/LUCENE-10208 > Project: Lucene - Core > Issue Type: Bug >Reporter: Jim Ferenczi >Priority: Minor > Fix For: 9.0, 8.11 > > Time Spent: 1h > Remaining Estimate: 0h > > TestLatLonPointDistanceFeatureQuery#testCompareSorting started to fail > sporadically after https://github.com/apache/lucene/pull/331. > The test change added in this PR exposes an existing bug in the top docs > collectors: they re-set the minimum score multiple times per segment when a bulk scorer > is used. > In practice this is not a problem because the local minimum score cannot > decrease. > However, when concurrent search is used, the global minimum score is updated > after the local one, which breaks the assertion. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
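A minimal sketch of the invariant the fix enforces, assuming a collector that pushes its bound through `Scorable#setMinCompetitiveScore`; the field and method names here are illustrative, not the actual patch.
```java
import java.io.IOException;
import org.apache.lucene.search.Scorable;

class MinScoreGuardSketch {
  private float minCompetitiveScore; // bound last sent to the scorer

  // Take the max of the local and global candidates so the bound only rises,
  // even when the global minimum arrives after the local one.
  void updateMinScore(Scorable scorer, float localMin, float globalMin) throws IOException {
    float candidate = Math.max(localMin, globalMin);
    if (candidate > minCompetitiveScore) { // never let the bound decrease
      minCompetitiveScore = candidate;
      scorer.setMinCompetitiveScore(candidate);
    }
  }
}
```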
[GitHub] [lucene] spyk commented on a change in pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue
spyk commented on a change in pull request #380: URL: https://github.com/apache/lucene/pull/380#discussion_r752443863

## File path: lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java
##
@@ -169,11 +169,14 @@ public static String getLemmatizerDictionary(String dictionaryFile, ResourceLoad
         builder.append(chars, 0, numRead);
       }
     } while (numRead > 0);
-    dictionary = builder.toString();
-    lemmaDictionaries.put(dictionaryFile, dictionary);
+    String dictionary = builder.toString();
+    InputStream dictionaryInputStream =
+        new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8));
+    dictionaryLemmatizer = new DictionaryLemmatizer(dictionaryInputStream);

Review comment: Just to clarify: the main issue with the String being cached instead of the DictionaryLemmatizer (aside from any thread-safety concerns) is that the DictionaryLemmatizer parses the dictionary and creates a new HashMap each time `create()` is called. So, in our case, with a 5MB lemmas.txt file used across several fields, it crashed a 64GB cluster with OOM, with the heap containing several hundred DictionaryLemmatizer instances, each with its own ~80MB internal hashmap. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
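A minimal sketch of the caching direction discussed here: parse once per dictionary file and share the resulting `DictionaryLemmatizer`, relying on OpenNLP's `DictionaryLemmatizer(InputStream)` constructor. The cache shape and names are illustrative, not the PR's actual code.
```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import opennlp.tools.lemmatizer.DictionaryLemmatizer;

class LemmatizerCacheSketch {
  // one parsed lemmatizer per dictionary file, shared across create() calls
  private static final Map<String, DictionaryLemmatizer> CACHE = new ConcurrentHashMap<>();

  static DictionaryLemmatizer get(String dictionaryFile, String dictionaryContents) {
    // computeIfAbsent parses the dictionary at most once per file, so repeated
    // create() calls share the same internal hash maps instead of building new ones
    return CACHE.computeIfAbsent(dictionaryFile, k -> {
      try (InputStream in =
          new ByteArrayInputStream(dictionaryContents.getBytes(StandardCharsets.UTF_8))) {
        return new DictionaryLemmatizer(in);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    });
  }
}
```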
[GitHub] [lucene] magibney commented on a change in pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue
magibney commented on a change in pull request #380: URL: https://github.com/apache/lucene/pull/380#discussion_r752448693

## File path: lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java
##
@@ -169,11 +169,14 @@ public static String getLemmatizerDictionary(String dictionaryFile, ResourceLoad
         builder.append(chars, 0, numRead);
       }
     } while (numRead > 0);
-    dictionary = builder.toString();
-    lemmaDictionaries.put(dictionaryFile, dictionary);
+    String dictionary = builder.toString();
+    InputStream dictionaryInputStream =
+        new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8));
+    dictionaryLemmatizer = new DictionaryLemmatizer(dictionaryInputStream);

Review comment: Yes, sorry -- thanks for the clarification. What I mentioned was tangential, narrowly focused on (reading between the lines of) one minor aspect of the current implementation. I didn't mean to imply the String/re-parsing _per se_ was the main issue. The hashmaps are surely more important, as you say. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10208) Minimum score can decrease in concurrent search
[ https://issues.apache.org/jira/browse/LUCENE-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446050 ] ASF subversion and git services commented on LUCENE-10208: -- Commit 0018567e8b16f3ba2f88a912afdc6367bb3ce675 in lucene's branch refs/heads/branch_9_0 from Jim Ferenczi [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=0018567 ] LUCENE-10208: Ensure that the minimum competitive score does not decrease in concurrent search (#431) Co-authored-by: Adrien Grand > Minimum score can decrease in concurrent search > --- > > Key: LUCENE-10208 > URL: https://issues.apache.org/jira/browse/LUCENE-10208 > Project: Lucene - Core > Issue Type: Bug >Reporter: Jim Ferenczi >Priority: Minor > Fix For: 9.0, 8.11 > > Time Spent: 1h > Remaining Estimate: 0h > > TestLatLonPointDistanceFeatureQuery#testCompareSorting started to fail > sporadically after https://github.com/apache/lucene/pull/331. > The test change added in this PR exposes an existing bug in the top docs > collectors: they re-set the minimum score multiple times per segment when a bulk scorer > is used. > In practice this is not a problem because the local minimum score cannot > decrease. > However, when concurrent search is used, the global minimum score is updated > after the local one, which breaks the assertion. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zhaih commented on a change in pull request #442: LUCENE-10122 Use NumericDocValue to store taxonomy parent array
zhaih commented on a change in pull request #442: URL: https://github.com/apache/lucene/pull/442#discussion_r752459454

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java
##
@@ -130,6 +125,46 @@ private void initParents(IndexReader reader, int first) throws IOException {
       return;
     }

+    if (getMajorVersion(reader) <= 8) {
+      loadParentUsingTermPosition(reader, first);
+      return;
+    }
+
+    for (LeafReaderContext leafContext : reader.leaves()) {
+      int leafDocNum = leafContext.reader().maxDoc();
+      if (leafContext.docBase + leafDocNum <= first) {
+        // skip this leaf if it does not contain new categories
+        continue;
+      }
+      NumericDocValues parentValues =
+          leafContext.reader().getNumericDocValues(Consts.FIELD_PARENT_ORDINAL_NDV);
+      if (parentValues == null) {
+        throw new CorruptIndexException(
+            "Parent data field " + Consts.FIELD_PARENT_ORDINAL_NDV + " not exists",
+            leafContext.reader().toString());
+      }
+
+      for (int doc = Math.max(first - leafContext.docBase, 0); doc < leafDocNum; doc++) {
+        if (parentValues.advanceExact(doc) == false) {
+          throw new CorruptIndexException(
+              "Missing parent data for category " + (doc + leafContext.docBase), reader.toString());
+        }
+        // we're putting an int and converting it back so it should be safe
+        parents[doc + leafContext.docBase] = Math.toIntExact(parentValues.longValue());
+      }
+    }
+  }
+
+  private static int getMajorVersion(IndexReader reader) {
+    return reader.leaves().get(0).reader().getMetaData().getCreatedVersionMajor();

Review comment: Yeah, it is guaranteed to be non-empty, but it might be a good idea to add an assertion (fortunately this is just a private method). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
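The assertion being suggested is a one-liner; a sketch of what the follow-up adds (wording of the message is illustrative):
```java
private static int getMajorVersion(IndexReader reader) {
  // guard the leaves().get(0) below: an empty reader has no leaf to ask
  assert reader.leaves().isEmpty() == false : "empty index has no version metadata";
  return reader.leaves().get(0).reader().getMetaData().getCreatedVersionMajor();
}
```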
[GitHub] [lucene] zhaih opened a new pull request #454: LUCENE-10122 Use NumericDocValue to store taxonomy parent array
zhaih opened a new pull request #454: URL: https://github.com/apache/lucene/pull/454

# Note
PR towards branch_9x, mostly the same as https://github.com/apache/lucene/pull/442, adding one additional assertion to make sure that, when checking the index version, the index is not empty.

# Checklist
Please review the following and check all that apply:
- [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `main` branch.
- [x] I have run `./gradlew check`.
- [ ] I have added tests for my changes.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10122) Explore using NumericDocValue to store taxonomy parent array
[ https://issues.apache.org/jira/browse/LUCENE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446058 ] Haoyu Zhai commented on LUCENE-10122: - Ah, thanks [~jpountz] for the reminder, I forgot that. Here we go: https://github.com/apache/lucene/pull/454 > Explore using NumericDocValue to store taxonomy parent array > > > Key: LUCENE-10122 > URL: https://issues.apache.org/jira/browse/LUCENE-10122 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: 9.0 >Reporter: Haoyu Zhai >Priority: Minor > Time Spent: 5.5h > Remaining Estimate: 0h > > We currently use the term position of a hardcoded term in a hardcoded field to > represent the parent ordinal of each taxonomy label. That is an old approach that > perhaps dates back to the time when doc values didn't exist. > We probably want to use NumericDocValues instead, given we have spent > quite a lot of effort optimizing them. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on pull request #264: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals (instead of custom binary format)
msokolov commented on pull request #264: URL: https://github.com/apache/lucene/pull/264#issuecomment-973094350 > I still suggest we refrain from reviewing this PR in detail oops, I only read that at the end; still and all I don't have much to say; this area of the code is fairly new for me. Looks like a great simplification and the speedups you found are awesome! On a related topic, while I was stumbling around in here I found we have the option to accumulate scores in MatchingDocs which I'd be excited to explore for doing relevance-weighted sums. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical
msokolov commented on a change in pull request #416: URL: https://github.com/apache/lucene/pull/416#discussion_r752476787

## File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##
@@ -56,31 +59,50 @@ public final class HnswGraph extends KnnGraphValues {

   private final int maxConn;
+  private int numLevels; // the current number of levels in the graph
+  private int entryNode; // the current graph entry node on the top level

-  // Each entry lists the top maxConn neighbors of a node. The nodes correspond to vectors added to
-  // HnswBuilder, and the
-  // node values are the ordinals of those vectors.
-  private final List<NeighborArray> graph;
+  // Nodes by level expressed as the level 0's nodes' ordinals.
+  // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+  private final List<int[]> nodesByLevel;
+
+  // graph is a list of graph levels.
+  // Each level is represented as a List<NeighborArray> – nodes' connections on this level.
+  // Each entry in the list has the top maxConn neighbors of a node. The nodes correspond to vectors
+  // added to HnswBuilder, and the node values are the ordinals of those vectors.
+  // Thus, on all levels, neighbors are expressed as the level 0's nodes' ordinals.
+  private final List<List<NeighborArray>> graph;

   // KnnGraphValues iterator members
   private int upto;
   private NeighborArray cur;

-  HnswGraph(int maxConn) {
-    graph = new ArrayList<>();
-    // Typically with diversity criteria we see nodes not fully occupied; average fanout seems to be
-    // about 1/2 maxConn. There is some indexing time penalty for under-allocating, but saves RAM
-    graph.add(new NeighborArray(Math.max(32, maxConn / 4)));
+  HnswGraph(int maxConn, int levelOfFirstNode) {
     this.maxConn = maxConn;
+    this.numLevels = levelOfFirstNode + 1;
+    this.graph = new ArrayList<>(numLevels);
+    this.entryNode = 0;
+    for (int i = 0; i < numLevels; i++) {
+      graph.add(new ArrayList<>());
+      // Typically with diversity criteria we see nodes not fully occupied;
+      // average fanout seems to be about 1/2 maxConn.
+      // There is some indexing time penalty for under-allocating, but saves RAM
+      graph.get(i).add(new NeighborArray(Math.max(32, maxConn / 4)));

Review comment: heh, this makes no sense at all! Git-spelunking, it looks like I added this fancy dancing when adding the diversity criteria, but overlooked `addNode` - which does the majority of the allocation! It could be worth taking a look to see if we can cheaply save some RAM -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical
msokolov commented on a change in pull request #416: URL: https://github.com/apache/lucene/pull/416#discussion_r752477271

## File path: lucene/core/src/java/org/apache/lucene/index/KnnGraphValues.java
##
@@ -32,25 +33,41 @@ protected KnnGraphValues() {}

   /**
-   * Move the pointer to exactly {@code target}, the id of a node in the graph. After this method
+   * Move the pointer to exactly the given {@code level}'s {@code target}. After this method
    * returns, call {@link #nextNeighbor()} to return successive (ordered) connected node ordinals.
    *
-   * @param target must be a valid node in the graph, ie. ≥ 0 and < {@link
+   * @param level level of the graph
+   * @param target ordinal of a node in the graph, must be ≥ 0 and < {@link
    *     VectorValues#size()}.
    */
-  public abstract void seek(int target) throws IOException;
+  public abstract void seek(int level, int target) throws IOException;

   /** Returns the number of nodes in the graph */
   public abstract int size();

   /**
    * Iterates over the neighbor list. It is illegal to call this method after it returns
-   * NO_MORE_DOCS without calling {@link #seek(int)}, which resets the iterator.
+   * NO_MORE_DOCS without calling {@link #seek(int, int)}, which resets the iterator.
    *
    * @return a node ordinal in the graph, or NO_MORE_DOCS if the iteration is complete.
    */
   public abstract int nextNeighbor() throws IOException;

+  /** Returns the number of levels of the graph */
+  public abstract int numLevels() throws IOException;
+
+  /** Returns graph's entry point on the top level */
+  public abstract int entryNode() throws IOException;
+
+  /**
+   * Get all nodes on a given level as node 0th ordinals
+   *
+   * @param level level for which to get all nodes
+   * @return an iterator over nodes where {@code nextDoc} returns a next node ordinal
+   */
+  // TODO: return a more suitable iterator over nodes than DocIdSetIterator

Review comment: I guess we could make one up -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
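Something like the following minimal iterator would do; the name and shape here are made up for illustration, not what the PR ships.
```java
// A bare-bones iterator over a level's node ordinals, avoiding the
// DocIdSetIterator contract (advance(), NO_MORE_DOCS) that doesn't fit here.
final class NodesIterator {
  private final int[] nodes; // the level's node ordinals
  private int upto;

  NodesIterator(int[] nodes) {
    this.nodes = nodes;
  }

  boolean hasNext() {
    return upto < nodes.length;
  }

  int next() {
    return nodes[upto++];
  }
}
```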
[GitHub] [lucene] gsmiller commented on pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
gsmiller commented on pull request #443: URL: https://github.com/apache/lucene/pull/443#issuecomment-973151611 For posterity, I re-ran benchmarks with this most recent change, comparing against `branch_9_0` as a baseline (so it includes all the back-compat checks and such). Results still look very good:
```
Task                        QPS baseline  StdDev   QPS candidate  StdDev   Pct diff          p-value
HighTermDayOfYearSort             127.51  (15.4%)         124.04  (12.1%)     -2.7% ( -26% -  29%)  0.534
TermDTSort                         96.36  (15.3%)          94.54  (15.1%)     -1.9% ( -28% -  33%)  0.695
HighTermMonthSort                 135.11  (13.3%)         133.31  (16.3%)     -1.3% ( -27% -  32%)  0.778
Fuzzy1                             63.28   (8.3%)          62.55   (8.6%)     -1.2% ( -16% -  17%)  0.662
PKLookup                          152.55   (2.1%)         151.14   (1.8%)     -0.9% (  -4% -   3%)  0.139
MedIntervalsOrdered                82.93   (3.3%)          82.62   (2.4%)     -0.4% (  -5% -   5%)  0.681
BrowseDayOfYearSSDVFacets          14.45  (14.5%)          14.40  (14.1%)     -0.3% ( -25% -  33%)  0.945
HighTermTitleBDVSort              106.51  (21.3%)         106.18  (21.6%)     -0.3% ( -35% -  54%)  0.964
MedTerm                          1238.25   (2.3%)        1235.30   (2.5%)     -0.2% (  -4% -   4%)  0.753
LowIntervalsOrdered               112.01   (2.2%)         111.75   (1.9%)     -0.2% (  -4% -   4%)  0.724
Prefix3                           246.62   (2.2%)         246.10   (2.2%)     -0.2% (  -4% -   4%)  0.764
MedSloppyPhrase                    72.18   (2.9%)          72.06   (3.0%)     -0.2% (  -5% -   5%)  0.861
OrHighNotHigh                     530.67   (2.9%)         529.98   (2.1%)     -0.1% (  -5% -   5%)  0.873
OrHighNotLow                      581.58   (3.3%)         581.28   (2.3%)     -0.1% (  -5% -   5%)  0.954
BrowseMonthSSDVFacets              14.32   (4.5%)          14.32   (4.5%)     -0.0% (  -8% -   9%)  0.977
MedPhrase                         200.56   (1.9%)         200.54   (1.9%)     -0.0% (  -3% -   3%)  0.992
OrHighNotMed                      600.02   (2.7%)         599.99   (3.1%)     -0.0% (  -5% -   5%)  0.995
IntNRQ                             56.03   (3.5%)          56.03   (3.5%)      0.0% (  -6% -   7%)  0.995
HighIntervalsOrdered                1.86   (2.2%)           1.86   (2.2%)      0.1% (  -4% -   4%)  0.908
LowSpanNear                        44.66   (1.7%)          44.70   (1.6%)      0.1% (  -3% -   3%)  0.874
HighTerm                          974.66   (3.1%)         975.67   (2.8%)      0.1% (  -5% -   6%)  0.912
Respell                            59.10   (1.8%)          59.17   (1.9%)      0.1% (  -3% -   3%)  0.849
HighPhrase                         30.48   (1.9%)          30.52   (1.8%)      0.1% (  -3% -   3%)  0.819
OrHighLow                         310.16   (2.6%)         310.61   (2.3%)      0.1% (  -4% -   5%)  0.849
LowTerm                          1419.43   (2.7%)        1421.64   (2.5%)      0.2% (  -4% -   5%)  0.851
MedSpanNear                        14.05   (2.3%)          14.08   (2.2%)      0.2% (  -4% -   4%)  0.798
LowPhrase                         218.41   (4.5%)         218.81   (4.7%)      0.2% (  -8% -   9%)  0.899
OrNotHighHigh                     531.27   (2.5%)         532.41   (2.1%)      0.2% (  -4% -   4%)  0.768
LowSloppyPhrase                    23.57   (1.6%)          23.63   (1.8%)      0.2% (  -3% -   3%)  0.649
HighSloppyPhrase                   34.72   (2.0%)          34.83   (2.0%)      0.3% (  -3% -   4%)  0.595
OrNotHighLow                      786.86   (2.0%)         789.56   (1.9%)      0.3% (  -3% -   4%)  0.580
HighSpanNear                       32.06   (2.9%)          32.17   (2.6%)      0.3% (  -4% -   5%)  0.690
AndHighMed                        210.90   (3.2%)         212.04   (3.1%)      0.5% (  -5% -   7%)  0.589
Wildcard                          106.13   (3.0%)         106.83   (3.3%)      0.7% (  -5% -   7%)  0.505
OrHighMed                          60.24   (2.3%)          60.69   (2.4%)      0.7% (  -3% -   5%)  0.318
AndHighHigh                        54.27   (3.3%)          54.70   (3.3%)      0.8% (  -5% -   7%)  0.448
AndHighLow                        514.75   (2.5%)         518.95   (2.3%)      0.8% (  -3% -   5%)  0.284
OrHighHigh                         25.30   (2.5%)          25.51   (2.3%)      0.8% (  -3% -   5%)  0.274
OrNotHighMed                      636.45   (2.8%)         643.57   (2.8%)      1.1% (  -4% -   6%)  0.204
Fuzzy2                             49.98   (7.1%)          50.55
```
[GitHub] [lucene] gsmiller commented on a change in pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
gsmiller commented on a change in pull request #443: URL: https://github.com/apache/lucene/pull/443#discussion_r752525277

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FastTaxonomyFacetCounts.java
##
@@ -69,31 +69,34 @@ public FastTaxonomyFacetCounts(
   }

   private final void count(List<MatchingDocs> matchingDocs) throws IOException {
+    if (matchingDocs.isEmpty()) {
+      return;
+    }
+
+    boolean useBinaryDv = FacetUtils.usesOlderBinaryOrdinals(matchingDocs.get(0).context.reader());
+
     for (MatchingDocs hits : matchingDocs) {
-      BinaryDocValues dv = hits.context.reader().getBinaryDocValues(indexFieldName);
-      if (dv == null) { // this reader does not have DocValues for the requested category list
+      assert useBinaryDv == FacetUtils.usesOlderBinaryOrdinals(hits.context.reader());

Review comment: Yeah, no problem. Thanks for the suggestions! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
gsmiller commented on pull request #443: URL: https://github.com/apache/lucene/pull/443#issuecomment-973173432 OK, I believe this change is good-to-go now. I _think_ I've addressed all the feedback so far and just did another benchmark run to make sure all the performance benefits are still showing up with the back-compat considerations in place, etc. Please let me know if anyone has additional feedback. Thanks again everyone! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on a change in pull request #442: LUCENE-10122 Use NumericDocValue to store taxonomy parent array
gsmiller commented on a change in pull request #442: URL: https://github.com/apache/lucene/pull/442#discussion_r752558188

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
##
@@ -92,15 +93,18 @@
   private final Directory dir;
   private final IndexWriter indexWriter;
   private final boolean useOlderStoredFieldIndex;
+  private final boolean useOlderTermPositionForParentOrdinal;

Review comment: Just merged this into my working branch. Thanks for making that merge so easy! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] costin commented on pull request #453: Investigate whether Packed64 should use a byte[] plus VarHandles instead of a long[]
costin commented on pull request #453: URL: https://github.com/apache/lucene/pull/453#issuecomment-973193877

Thanks to @jpountz's suggestions, I've added two more tests besides the JMH one. `PackedRandomTest` does a brute-force performance test of random locations instead of consecutive ones. Surprisingly, the VHLongByte implementation comes out ahead, followed by Packed64 and finally Packed64VHLongLong - the results on my machine are roughly:
```
1. VHLongAndByte  939100
2. Packed64      1358500
3. VHLongLong    1510400
```
Adrien suggested another, more realistic workload: `PackedMacroTest` does `SortedDocValues` merging, which creates an `OrdinalMap` backed by `PackedInts`; I then recorded the merge times. In all cases I manually updated the PackedInts class to return the desired implementation. Based on this set of results, Packed64 is in first place, followed by VHLongAndByte, with VHLongLong coming in third. I'm attaching the merge times for reference: [p64vhll.log](https://github.com/apache/lucene/files/7565523/p64vhll.log) [p64.log](https://github.com/apache/lucene/files/7565524/p64.log) [p64vhlb.log](https://github.com/apache/lucene/files/7565525/p64vhlb.log) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
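For context, a random-location test of this kind amounts to a loop of roughly the following shape; this is an assumed sketch, not the actual `PackedRandomTest` code.
```java
import org.apache.lucene.util.packed.PackedInts;

class RandomAccessBenchSketch {
  // Time random get()s over a packed structure; the sink keeps the JIT from
  // eliminating the loop as dead code.
  static long timeRandomGets(PackedInts.Reader packed, int[] randomIndices) {
    long start = System.nanoTime();
    long sink = 0;
    for (int i : randomIndices) {
      sink += packed.get(i);
    }
    if (sink == Long.MIN_VALUE) {
      System.out.println(sink); // unlikely branch; prevents dead-code elimination
    }
    return System.nanoTime() - start;
  }
}
```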
[jira] [Created] (LUCENE-10244) Please consider opening MultiCollector::getCollectors for public use
Andriy Redko created LUCENE-10244:
-
Summary: Please consider opening MultiCollector::getCollectors for public use
Key: LUCENE-10244
URL: https://issues.apache.org/jira/browse/LUCENE-10244
Project: Lucene - Core
Issue Type: Improvement
Components: core/search
Affects Versions: 8.10.1
Reporter: Andriy Redko

The usage of *MultiCollector* is sometimes very convenient along with {*}CollectorManager{*} instances (as an alternative to {*}MultiCollectorManager{*}), for example:
{code:java}
class SomeCollectorManager implements CollectorManager {

  @Override
  public Collector newCollector() throws IOException {
    return MultiCollector.wrap(new Collector1(), new Collector2(), ...);
  }

  @Override
  public List reduce(Collection collectors) throws IOException {
    return ...;
  }
}{code}
Unfortunately, the *reduce()* phase lacks access to the *MultiCollector::getCollectors* method, since it is package-private at the moment. Making *MultiCollector::getCollectors* public would make it convenient to use *MultiCollector +* *CollectorManager* and implement the reduce phase by accessing the individual collectors. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
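A self-contained sketch of what the proposal enables, using two `TotalHitCountCollector`s purely for illustration; the cast and the `getCollectors()` call assume the visibility change this issue asks for.
```java
import java.io.IOException;
import java.util.Collection;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.CollectorManager;
import org.apache.lucene.search.MultiCollector;
import org.apache.lucene.search.TotalHitCountCollector;

// One manager produces MultiCollector-wrapped pairs of hit counters, then
// unwraps them in reduce() to combine each side separately.
class PairedHitCountManager implements CollectorManager<Collector, int[]> {
  @Override
  public Collector newCollector() {
    return MultiCollector.wrap(new TotalHitCountCollector(), new TotalHitCountCollector());
  }

  @Override
  public int[] reduce(Collection<Collector> collectors) throws IOException {
    int first = 0, second = 0;
    for (Collector c : collectors) {
      Collector[] subs = ((MultiCollector) c).getCollectors(); // requires the proposed change
      first += ((TotalHitCountCollector) subs[0]).getTotalHits();
      second += ((TotalHitCountCollector) subs[1]).getTotalHits();
    }
    return new int[] {first, second};
  }
}
```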
[GitHub] [lucene] costin edited a comment on pull request #453: Investigate whether Packed64 should use a byte[] plus VarHandles instead of a long[]
costin edited a comment on pull request #453: URL: https://github.com/apache/lucene/pull/453#issuecomment-973193877

Thanks to @jpountz's suggestions, I've added two more tests besides the JMH one. `PackedRandomTest` does a brute-force performance test of random locations instead of consecutive ones. Surprisingly, the VHLongByte implementation comes out ahead, followed by Packed64 and finally Packed64VHLongLong - the results on my machine are roughly:
```
1. VHLongAndByte  939100
2. Packed64      1358500
3. VHLongLong    1510400
```
Adrien suggested another, more realistic workload: `PackedMacroTest` does `SortedDocValues` merging, which creates an `OrdinalMap` backed by `PackedInts`; I then recorded the merge times. In all cases I manually updated the PackedInts class to return the desired implementation. Based on this set of results, Packed64 is in first place, followed by VHLongAndByte, with VHLongLong coming in third. I'm attaching the merge times for reference: [p64vhll.log](https://github.com/apache/lucene/files/7565523/p64vhll.log) [p64.log](https://github.com/apache/lucene/files/7565524/p64.log) [p64vhlb.log](https://github.com/apache/lucene/files/7565525/p64vhlb.log) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10244) Please consider opening MultiCollector::getCollectors for public use
[ https://issues.apache.org/jira/browse/LUCENE-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446118 ] Andriy Redko commented on LUCENE-10244: --- https://github.com/apache/lucene/pull/455

> Please consider opening MultiCollector::getCollectors for public use
> 
>
>                 Key: LUCENE-10244
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10244
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: 8.10.1
>           Reporter: Andriy Redko
>           Priority: Minor
>
> The usage of *MultiCollector* is sometimes very convenient along with {*}CollectorManager{*} instances (as an alternative to {*}MultiCollectorManager{*}), for example:
> {code:java}
> class SomeCollectorManager implements CollectorManager {
>
>   @Override
>   public Collector newCollector() throws IOException {
>     return MultiCollector.wrap(new Collector1(), new Collector2(), ...);
>   }
>
>   @Override
>   public List reduce(Collection collectors) throws IOException {
>     return ...;
>   }
> }{code}
> Unfortunately, the *reduce()* phase lacks access to the *MultiCollector::getCollectors* method, since it is package-private at the moment. Making *MultiCollector::getCollectors* public would make it convenient to use *MultiCollector +* *CollectorManager* and implement the reduce phase by accessing the individual collectors.

-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] reta opened a new pull request #455: LUCENE-10244: Please consider opening MultiCollector::getCollectors for public use
reta opened a new pull request #455: URL: https://github.com/apache/lucene/pull/455 Signed-off-by: Andriy Redko

# Description
The usage of `MultiCollector` is sometimes very convenient along with `CollectorManager` instances (as an alternative to `MultiCollectorManager`), for example:
```
class SomeCollectorManager implements CollectorManager {

  @Override
  public Collector newCollector() throws IOException {
    return MultiCollector.wrap(new Collector1(), new Collector2(), ...);
  }

  @Override
  public List reduce(Collection collectors) throws IOException {
    return ...;
  }
}
```
Unfortunately, the reduce() phase lacks access to the `MultiCollector::getCollectors` method, since it is package-private at the moment. Making `MultiCollector::getCollectors` public would make it convenient to use `MultiCollector` + `CollectorManager` and implement the reduce phase by accessing the individual collectors.

# Solution
Make `MultiCollector::getCollectors` public.

# Tests
A test illustrating the usage of `MultiCollector` + `CollectorManager` has been added.

# Checklist
Please review the following and check all that apply:
- [X] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [X] I have created a Jira issue and added the issue ID to my pull request title.
- [X] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [X] I have developed this patch against the `main` branch.
- [X] I have run `./gradlew check`.
- [X] I have added tests for my changes.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] reta commented on pull request #455: LUCENE-10244: Please consider opening MultiCollector::getCollectors for public use
reta commented on pull request #455: URL: https://github.com/apache/lucene/pull/455#issuecomment-973198716 @jpountz if you don't mind taking a look, really appreciate it, thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mayya-sharipova edited a comment on pull request #416: LUCENE-10054 Make HnswGraph hierarchical
mayya-sharipova edited a comment on pull request #416: URL: https://github.com/apache/lucene/pull/416#issuecomment-972170715

@jtibshirani Thanks for your extensive review. I've tried to address some comments, and some comments I think we still need to discuss.

> does it have an impact on indexing speed?

I've run another benchmarking experiment with `KnnGraphTester`, and as can be seen below there seems to be a small impact on indexing speed: a 6% increase in indexing time.

Comparison of glove-100; baseline: main branch; candidate: this PR

| Baseline | Candidate |
| --- | --- |
| built 118 in 23061/2280885 ms | built 118 in 23113/2295566 ms |
| flushed: segment=_1 ramUsed=474.061 MB newFlushedSize=513.284 MB docs/MB=2,305.768 | flushed: segment=_0 ramUsed=474.061 MB newFlushedSize=517.153 MB docs/MB=2,288.517 |

| | baseline recall | baseline QPS | candidate recall | candidate QPS |
| --- | --: | ---: | ---: | ---: |
| n_cands=10 | 0.482 | 4199.190 | 0.496 | 3982.133 |
| n_cands=20 | 0.553 | 3563.974 | 0.561 | 3412.251 |
| n_cands=40 | 0.631 | 2713.881 | 0.635 | 2455.465 |
| n_cands=80 | 0.707 | 1902.787 | 0.709 | 1863.012 |
| n_cands=120 | 0.748 | 1497.979 | 0.747 | 1400.822 |
| n_cands=200 | 0.790 | 1031.353 | 0.791 | 1018.244 |
| n_cands=400 | 0.840 | 577.106 | 0.841 | 600.762 |
| n_cands=600 | 0.865 | 417.961 | 0.865 | 436.488 |
| n_cands=800 | 0.881 | 319.900 | 0.881 | 334.503 |

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] thelabdude opened a new pull request #2613: SOLR-15774: Avoid weird off-by-one errors with Angular's 'chosen' select box directive for the security and schema-designer screens i
thelabdude opened a new pull request #2613: URL: https://github.com/apache/lucene-solr/pull/2613 backport of https://github.com/apache/solr/pull/404 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10243) increase unicode versions of tokenizers to unicode 12.1
[ https://issues.apache.org/jira/browse/LUCENE-10243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446161 ] Robert Muir commented on LUCENE-10243: -- I tried this out on top of the LUCENE-10239 branch, and also ran perl scripts to regenerate new Unicode 12.1 tests. But some conformance tests fail. I haven't debugged it yet; my guess is there were some changes to UAX#29 between 9 and 12.1. I will look into it. > increase unicode versions of tokenizers to unicode 12.1 > --- > > Key: LUCENE-10243 > URL: https://issues.apache.org/jira/browse/LUCENE-10243 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > > Followup from LUCENE-10239 > Bump the Unicode version of these tokenizers from Unicode 9 to 12.1, which is > the most recent supported by the jflex release. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] thelabdude merged pull request #2613: SOLR-15774: Avoid weird off-by-one errors with Angular's 'chosen' select box directive for the security and schema-designer screens in Admi
thelabdude merged pull request #2613: URL: https://github.com/apache/lucene-solr/pull/2613 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #452: LUCENE-10239: upgrade jflex (1.7.0 -> 1.8.2)
rmuir commented on pull request #452: URL: https://github.com/apache/lucene/pull/452#issuecomment-973323310 For convenience of reviewing: here is the diff between default skeleton and "buffer-expansion-disabled" skeleton. It is kinda the only way to review it since we brought in all the upstream changes.
```
think:lucene[LUCENE-10239]$ diff -u gradle/generation/jflex/skeleton.default.txt gradle/generation/jflex/skeleton.disable.buffer.expansion.txt
--- gradle/generation/jflex/skeleton.default.txt	2021-11-17 19:04:46.844620167 -0500
+++ gradle/generation/jflex/skeleton.disable.buffer.expansion.txt	2021-11-17 19:05:01.124853267 -0500
@@ -5,7 +5,7 @@
 
   /** Initial size of the lookahead buffer. */
--- private static final int ZZ_BUFFERSIZE = ...;
 
-  /** Lexical states. */
+  /** Lexical States. */
--- lexical states, charmap
 
  /** Error code for "Unknown internal scanner error". */
@@ -94,18 +94,11 @@
       zzStartRead = 0;
     }
 
-    /* is the buffer big enough? */
-    if (zzCurrentPos >= zzBuffer.length - zzFinalHighSurrogate) {
-      /* if not: blow it up */
-      char newBuffer[] = new char[zzBuffer.length * 2];
-      System.arraycopy(zzBuffer, 0, newBuffer, 0, zzBuffer.length);
-      zzBuffer = newBuffer;
-      zzEndRead += zzFinalHighSurrogate;
-      zzFinalHighSurrogate = 0;
-    }
-
     /* fill the buffer with new input */
-    int requested = zzBuffer.length - zzEndRead;
+    int requested = zzBuffer.length - zzEndRead - zzFinalHighSurrogate;
+    if (requested == 0) {
+      return true;
+    }
     int numRead = zzReader.read(zzBuffer, zzEndRead, requested);
 
     /* not supposed to occur according to specification of java.io.Reader */
@@ -119,6 +112,9 @@
       if (numRead == requested) { // We requested too few chars to encode a full Unicode character
         --zzEndRead;
         zzFinalHighSurrogate = 1;
+        if (numRead == 1) {
+          return true;
+        }
       } else {    // There is room in the buffer for at least one more char
         int c = zzReader.read();  // Expecting to read a paired low surrogate char
         if (c == -1) {
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] thelabdude opened a new pull request #2614: SOLR-15774: Avoid weird off-by-one errors with Angular's 'chosen' select box directive for the security and schema-designer screens i
thelabdude opened a new pull request #2614: URL: https://github.com/apache/lucene-solr/pull/2614 backport of https://github.com/apache/lucene-solr/pull/2613 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] thelabdude merged pull request #2614: SOLR-15774: Avoid weird off-by-one errors with Angular's 'chosen' select box directive for the security and schema-designer screens in Admi
thelabdude merged pull request #2614: URL: https://github.com/apache/lucene-solr/pull/2614 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] LifeIsStrange commented on pull request #2012: SOLR-14935: Solr can forward request ( remoteQuery ) even if there are local cores present
LifeIsStrange commented on pull request #2012: URL: https://github.com/apache/lucene-solr/pull/2012#issuecomment-973596388 @vthacker any update? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zacharymorn merged pull request #444: LUCENE-10236: Updated field-weight used in CombinedFieldQuery scoring calculation, and added a test
zacharymorn merged pull request #444: URL: https://github.com/apache/lucene/pull/444 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10236) CombinedFieldsQuery to use fieldAndWeights.values() when constructing MultiNormsLeafSimScorer for scoring
[ https://issues.apache.org/jira/browse/LUCENE-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446291#comment-17446291 ]

ASF subversion and git services commented on LUCENE-10236:
--

Commit 07ee3ba83a4c9f3abc24bf9d3fbb3c3102c4a102 in lucene's branch refs/heads/main from zacharymorn
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=07ee3ba ]

LUCENE-10236: Update field-weight used in CombinedFieldQuery scoring calculation (#444)

> CombinedFieldsQuery to use fieldAndWeights.values() when constructing
> MultiNormsLeafSimScorer for scoring
> -
>
> Key: LUCENE-10236
> URL: https://issues.apache.org/jira/browse/LUCENE-10236
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/sandbox
> Reporter: Zach Chen
> Assignee: Zach Chen
> Priority: Minor
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> This is a spin-off issue from discussion in
> [https://github.com/apache/lucene/pull/418#issuecomment-967790816], for a
> quick fix in CombinedFieldsQuery scoring.
> Currently CombinedFieldsQuery would use a constructed
> [fields|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L420-L421]
> object to create a MultiNormsLeafSimScorer for scoring, but the fields
> object may contain duplicated field-weight pairs, as it is [built by looping
> over
> fieldTerms|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L404-L414],
> resulting in duplicated norms being added during the scoring calculation in
> MultiNormsLeafSimScorer.
> E.g. for a CombinedFieldsQuery with two fields and two values matching a
> particular doc:
> {code:java}
> CombinedFieldQuery query =
>     new CombinedFieldQuery.Builder()
>         .addField("field1", (float) 1.0)
>         .addField("field2", (float) 1.0)
>         .addTerm(new BytesRef("foo"))
>         .addTerm(new BytesRef("zoo"))
>         .build(); {code}
> I would expect the scoring to be based on the following:
> # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) + freq(field2:zoo)
> # Sum of norms on doc = norm(field1) + norm(field2)
> but the current logic uses the following for scoring:
> # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) + freq(field2:zoo)
> # Sum of norms on doc = norm(field1) + norm(field2) + norm(field1) + norm(field2)
>
> In addition, this differs from how MultiNormsLeafSimScorer is constructed
> in CombinedFieldsQuery's explain function, which [uses
> fieldAndWeights.values()|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L387-L389]
> and does not contain duplicated field-weight pairs.

-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
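To see the double-counting concretely, here is a small self-contained sketch; every name in it (NormDedupSketch, FieldAndWeight, sumNorms) is hypothetical and merely stands in for the CombinedFieldQuery internals described above, contrasting the per-term field list with a deduplicated per-field view such as fieldAndWeights.values() provides.
```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Two terms over the same two fields yield four (field, weight) pairs, so
// each field's norm is summed twice; deduplicating by field name restores
// the intended norm(field1) + norm(field2).
public class NormDedupSketch {
  record FieldAndWeight(String field, float weight) {}

  static float sumNorms(List<FieldAndWeight> fields, Map<String, Float> norms) {
    float sum = 0f;
    for (FieldAndWeight fw : fields) {
      sum += fw.weight() * norms.get(fw.field());
    }
    return sum;
  }

  public static void main(String[] args) {
    Map<String, Float> norms = Map.of("field1", 0.5f, "field2", 0.25f);

    // One entry per (term, field) combination, as the term loop builds it.
    List<FieldAndWeight> perTerm =
        List.of(
            new FieldAndWeight("field1", 1f), new FieldAndWeight("field2", 1f), // term "foo"
            new FieldAndWeight("field1", 1f), new FieldAndWeight("field2", 1f)); // term "zoo"

    // One entry per field, as fieldAndWeights.values() would yield.
    Map<String, FieldAndWeight> deduped = new LinkedHashMap<>();
    for (FieldAndWeight fw : perTerm) {
      deduped.putIfAbsent(fw.field(), fw);
    }

    System.out.println(sumNorms(perTerm, norms)); // 1.5: norms double-counted
    System.out.println(sumNorms(List.copyOf(deduped.values()), norms)); // 0.75: intended sum
  }
}
```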
[GitHub] [lucene] zacharymorn commented on pull request #444: LUCENE-10236: Updated field-weight used in CombinedFieldQuery scoring calculation, and added a test
zacharymorn commented on pull request #444: URL: https://github.com/apache/lucene/pull/444#issuecomment-973796460 Will backport this to 8.11 and 8x. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] iverase merged pull request #7: LUCENE-9820: Separate logic for reading the BKD index from logic to intersecting it
iverase merged pull request #7: URL: https://github.com/apache/lucene/pull/7 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9820) Separate logic for reading the BKD index from logic to intersecting it.
[ https://issues.apache.org/jira/browse/LUCENE-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446335#comment-17446335 ] ASF subversion and git services commented on LUCENE-9820: - Commit ad911df2605591a63ca58d237182f84aa8384325 in lucene's branch refs/heads/main from Ignacio Vera [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ad911df ] LUCENE-9820: Separate logic for reading the BKD index from logic to intersecting it (#7) Extract BKD tree interface and move intersecting logic to the PointValues abstract class. > Separate logic for reading the BKD index from logic to intersecting it. > --- > > Key: LUCENE-9820 > URL: https://issues.apache.org/jira/browse/LUCENE-9820 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Ignacio Vera >Priority: Major > Time Spent: 10h 40m > Remaining Estimate: 0h > > Currently the class BKDReader contains all the logic for traversing the KD > tree and the logic to read the actual index. This makes it difficult to develop > new visiting strategies, for example LUCENE-9619, where it is proposed to > move Points from a visitor API to a cursor-style API. > The first step is to isolate the logic that reads the index from the logic that > visits the tree. Another benefit of doing this is that it will help > evolve the index, for example moving the current index format to a backwards > codec without moving the visiting logic. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
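For context on what the extracted interface looks like in practice: after this change, PointValues exposes a cursor-style PointTree (moveToChild/moveToSibling/moveToParent) that callers can walk directly. The sketch below is illustrative only, written against my reading of the 9.x API, and is not code from the commit.
```java
import java.io.IOException;
import org.apache.lucene.index.PointValues;

// Recursively visits every node of a PointTree with the cursor-style API
// and sums the points below the leaves; size() reports the number of
// points under the current node, so each leaf is counted exactly once.
final class PointTreeWalk {
  static long countPoints(PointValues.PointTree tree) throws IOException {
    if (tree.moveToChild() == false) {
      return tree.size(); // leaf: no children to descend into
    }
    long count = 0;
    do {
      count += countPoints(tree);
    } while (tree.moveToSibling());
    tree.moveToParent(); // restore the cursor for the caller
    return count;
  }
}
```
A caller would obtain the tree from a leaf reader, e.g. `countPoints(reader.getPointValues("field").getPointTree())`.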
[GitHub] [lucene] iverase opened a new pull request #457: [Backport] LUCENE-9820: Separate logic for reading the BKD index from logic to intersecting it
iverase opened a new pull request #457: URL: https://github.com/apache/lucene/pull/457 backport of #7. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-9820) Separate logic for reading the BKD index from logic to intersecting it.
[ https://issues.apache.org/jira/browse/LUCENE-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ignacio Vera resolved LUCENE-9820. -- Fix Version/s: 9.1 Assignee: Ignacio Vera Resolution: Fixed > Separate logic for reading the BKD index from logic to intersecting it. > --- > > Key: LUCENE-9820 > URL: https://issues.apache.org/jira/browse/LUCENE-9820 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Ignacio Vera >Assignee: Ignacio Vera >Priority: Major > Fix For: 9.1 > > Time Spent: 10h 50m > Remaining Estimate: 0h > > Currently the class BKDReader contains all the logic for traversing the KD > tree and the logic to read the actual index. This makes it difficult to develop > new visiting strategies, for example LUCENE-9619, where it is proposed to > move Points from a visitor API to a cursor-style API. > The first step is to isolate the logic that reads the index from the logic that > visits the tree. Another benefit of doing this is that it will help > evolve the index, for example moving the current index format to a backwards > codec without moving the visiting logic. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] iverase merged pull request #457: [Backport] LUCENE-9820: Separate logic for reading the BKD index from logic to intersecting it
iverase merged pull request #457: URL: https://github.com/apache/lucene/pull/457 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9820) Separate logic for reading the BKD index from logic to intersecting it.
[ https://issues.apache.org/jira/browse/LUCENE-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446338#comment-17446338 ] ASF subversion and git services commented on LUCENE-9820: - Commit 9adf7e27f97b76751429c3bc83eaf13006c07982 in lucene's branch refs/heads/branch_9x from Ignacio Vera [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=9adf7e2 ] LUCENE-9820: Separate logic for reading the BKD index from logic to intersecting it (#7) (#457) Extract BKD tree interface and move intersecting logic to the PointValues abstract class. > Separate logic for reading the BKD index from logic to intersecting it. > --- > > Key: LUCENE-9820 > URL: https://issues.apache.org/jira/browse/LUCENE-9820 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Ignacio Vera >Assignee: Ignacio Vera >Priority: Major > Fix For: 9.1 > > Time Spent: 10h 50m > Remaining Estimate: 0h > > Currently the class BKDReader contains all the logic for traversing the KD > tree and the logic to read the actual index. This makes it difficult to develop > new visiting strategies, for example LUCENE-9619, where it is proposed to > move Points from a visitor API to a cursor-style API. > The first step is to isolate the logic that reads the index from the logic that > visits the tree. Another benefit of doing this is that it will help > evolve the index, for example moving the current index format to a backwards > codec without moving the visiting logic. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zacharymorn commented on a change in pull request #418: LUCENE-10061: Implements dynamic pruning support for CombinedFieldsQuery
zacharymorn commented on a change in pull request #418:
URL: https://github.com/apache/lucene/pull/418#discussion_r752930339

## File path: lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java

## @@ -441,6 +491,273 @@ public boolean isCacheable(LeafReaderContext ctx) {
     }
   }

+  /** Merge impacts for combined field. */
+  static ImpactsSource mergeImpacts(
+      Map<String, List<ImpactsEnum>> fieldsWithImpactsEnums,
+      Map<String, List<Impacts>> fieldsWithImpacts,
+      Map<String, Float> fieldWeights) {
+    return new ImpactsSource() {
+
+      class SubIterator {
+        final Iterator<Impact> iterator;
+        int previousFreq;
+        Impact current;
+
+        SubIterator(Iterator<Impact> iterator) {
+          this.iterator = iterator;
+          this.current = iterator.next();
+        }
+
+        void next() {
+          previousFreq = current.freq;
+          if (iterator.hasNext() == false) {
+            current = null;
+          } else {
+            current = iterator.next();
+          }
+        }
+      }
+
+      @Override
+      public Impacts getImpacts() throws IOException {
+        // Use the impacts that have the lower next boundary (doc id in skip entry) as a lead for
+        // each field
+        // They collectively will decide on the number of levels and the block boundaries.
+        Map<String, Impacts> leadingImpactsPerField = new HashMap<>(fieldsWithImpactsEnums.size());
+
+        for (Map.Entry<String, List<ImpactsEnum>> fieldImpacts :
+            fieldsWithImpactsEnums.entrySet()) {
+          String field = fieldImpacts.getKey();
+          List<ImpactsEnum> impactsEnums = fieldImpacts.getValue();
+          fieldsWithImpacts.put(field, new ArrayList<>(impactsEnums.size()));
+
+          Impacts tmpLead = null;
+          // find the impact that has the lowest next boundary for this field
+          for (int i = 0; i < impactsEnums.size(); ++i) {
+            Impacts impacts = impactsEnums.get(i).getImpacts();
+            fieldsWithImpacts.get(field).add(impacts);
+
+            if (tmpLead == null || impacts.getDocIdUpTo(0) < tmpLead.getDocIdUpTo(0)) {
+              tmpLead = impacts;
+            }
+          }
+
+          leadingImpactsPerField.put(field, tmpLead);
+        }
+
+        return new Impacts() {
+
+          @Override
+          public int numLevels() {
+            // max of levels across fields' impactEnums
+            int result = 0;
+
+            for (Impacts impacts : leadingImpactsPerField.values()) {
+              result = Math.max(result, impacts.numLevels());
+            }
+
+            return result;
+          }
+
+          @Override
+          public int getDocIdUpTo(int level) {
+            // min of docIdUpTo across fields' impactEnums
+            int result = Integer.MAX_VALUE;
+
+            for (Impacts impacts : leadingImpactsPerField.values()) {
+              if (impacts.numLevels() > level) {
+                result = Math.min(result, impacts.getDocIdUpTo(level));
+              }
+            }
+
+            return result;
+          }

Review comment:
   Hi @jpountz, I just noticed a place in the implementation that was called multiple times, and so tried to "cache" the result to save the repeated computation: https://github.com/apache/lucene/pull/418/commits/6d5e7808fd847d5849e7c4ff3b552aeb9c196bbb
It turned out the improvement was quite obvious. Results from `python3 src/python/localrun.py -source combinedFieldsBig`:

Run 1
```
Task           QPS baseline   StdDev   QPS my_modified_version   StdDev    Pct diff                p-value
CFQHighHigh          3.17     (3.5%)          2.70               (4.6%)    -14.8% ( -22% -   -6%)  0.000
PKLookup           117.15     (6.7%)        116.31              (10.3%)     -0.7% ( -16% -   17%)  0.792
CFQHighMed          12.98     (5.2%)         18.26              (13.9%)     40.7% (  20% -   63%)  0.000
CFQHighLow          17.30     (3.6%)         36.19              (20.1%)    109.2% (  82% -  137%)  0.000
```
Run 2
```
Task           QPS baseline   StdDev   QPS my_modified_version   StdDev    Pct diff                p-value
PKLookup           104.98     (6.9%)        115.76               (8.5%)     10.3% (  -4% -   27%)  0.000
CFQHighMed           5.73     (3.9%)          6.53               (8.4%)     14.0% (   1% -   27%)  0.000
CFQHighLow          17.87     (3.4%)         20.37               (8.1%)     14.0% (   2% -   26%)  0.000
CFQHighHigh          5.41     (4.2%)          6.53               (8.9%)     20.7% (   7% -   35%)  0.000
```
Run 3
```
Task           QPS baseline   StdDev   QPS my_modified_version   StdDev    Pct diff                p-value
CFQHighHigh          3.18     (3.6%)          2.66               (5.3%)    -16.4% ( -24% -   -7%)  0.000
CFQHighMed           5.75     (4.0%)          5.07               (5.5%)    -11.7% ( -20% -   -2%)  0.000
PKLookup           136.87     (5.9%)        133.97               (9.4%)     -2.1% ( -16% -   14%)  0.393
CFQHighLow          23.19     (2.9%)         38.74              (16.0%)     67
```
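To illustrate the kind of caching the comment refers to (a sketch under assumptions, not the linked commit itself): numLevels() and getDocIdUpTo(level) in the snippet above rescan leadingImpactsPerField on every call, so the per-level minima can be computed once per getImpacts() and then served from an array. The class name ImpactsBoundaryCache is invented here for illustration.
```java
import java.util.Arrays;
import java.util.Collection;
import org.apache.lucene.index.Impacts;

// Sketch: fold the repeated max/min scans into one pass, so later calls to
// numLevels()/getDocIdUpTo(level) become plain array lookups.
final class ImpactsBoundaryCache {
  private final int[] docIdUpTo; // docIdUpTo[level] = min boundary across fields

  ImpactsBoundaryCache(Collection<Impacts> leadingImpactsPerField) {
    int numLevels = 0;
    for (Impacts impacts : leadingImpactsPerField) {
      numLevels = Math.max(numLevels, impacts.numLevels());
    }
    docIdUpTo = new int[numLevels];
    Arrays.fill(docIdUpTo, Integer.MAX_VALUE);
    for (Impacts impacts : leadingImpactsPerField) {
      // Only fields that still have this level contribute, matching the
      // numLevels() > level guard in the original loop.
      for (int level = 0; level < impacts.numLevels(); level++) {
        docIdUpTo[level] = Math.min(docIdUpTo[level], impacts.getDocIdUpTo(level));
      }
    }
  }

  int numLevels() {
    return docIdUpTo.length;
  }

  int getDocIdUpTo(int level) {
    return docIdUpTo[level];
  }
}
```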