[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1893: LUCENE-9444 Utility class to get facet labels from taxonomy for a fac…

GitBox Fri, 25 Sep 2020 06:17:42 -0700


mikemccand commented on a change in pull request #1893:
URL: https://github.com/apache/lucene-solr/pull/1893#discussion_r494395610




##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacetLabels.java
##########
@@ -0,0 +1,184 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.taxonomy;
+
+import org.apache.lucene.facet.FacetsConfig;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.util.IntsRef;
+
+import java.io.IOException;
+
+import static org.apache.lucene.facet.taxonomy.TaxonomyReader.INVALID_ORDINAL;
+import static org.apache.lucene.facet.taxonomy.TaxonomyReader.ROOT_ORDINAL;
+
+/**
+ * Utility class to easily retrieve previously indexed facet labels, allowing 
you to skip also adding stored fields for these values,
+ * reducing your index size.
+ *
+ * @lucene.experimental
+ **/
+public class TaxonomyFacetLabels {
+
+  /**
+   * Index field name provided to the constructor
+   */
+  private final String indexFieldName;
+
+  /**
+   * {@code TaxonomyReader} provided to the constructor
+   */
+  private final TaxonomyReader taxoReader;
+
+  /**
+   * {@code FacetsConfig} provided to the constructor
+   */
+  private final FacetsConfig config;
+
+  /**
+   * {@code OrdinalsReader} to decode ordinals previously indexed into the 
{@code BinaryDocValues} facet field
+   */
+  private final OrdinalsReader ordsReader;
+
+  /**
+   * Sole constructor.  Do not close the provided {@link TaxonomyReader} while 
still using this instance!
+   */
+  public TaxonomyFacetLabels(TaxonomyReader taxoReader, FacetsConfig config, 
String indexFieldName) throws IOException {
+    this.taxoReader = taxoReader;
+    this.config = config;
+    this.indexFieldName = indexFieldName;
+    this.ordsReader = new DocValuesOrdinalsReader(indexFieldName);
+  }
+
+  /**
+   * Create and return an instance of {@link FacetLabelReader} to retrieve 
facet labels for
+   * multiple documents and (optionally) for a specific dimension.  You must 
create this per-segment,
+   * and then step through all hits, in order, for that segment.
+   *
+   * <p><b>NOTE</b>: This class is not thread-safe, so you must use a new 
instance of this
+   * class for each thread.</p>
+   *
+   * @param readerContext LeafReaderContext used to access the {@code 
BinaryDocValues} facet field
+   * @return an instance of {@link FacetLabelReader}
+   * @throws IOException when a low-level IO issue occurs
+   */
+  public FacetLabelReader getFacetLabelReader(LeafReaderContext readerContext) 
throws IOException {
+    return new FacetLabelReader(ordsReader, readerContext);
+  }
+
+  /**
+   * Utility class to retrieve facet labels for multiple documents.
+   *
+   * @lucene.experimental
+   */
+  public class FacetLabelReader {
+    private final OrdinalsReader.OrdinalsSegmentReader ordinalsSegmentReader;
+    private final IntsRef decodedOrds = new IntsRef();
+    private int currentDocId = -1;
+    private int currentPos = -1;
+
+    // Lazily set when nextFacetLabel(int docId, String facetDimension) is 
first called
+    private int[] parents;
+
+    /**
+     * Sole constructor.
+     */
+    public FacetLabelReader(OrdinalsReader ordsReader, LeafReaderContext 
readerContext) throws IOException {
+      ordinalsSegmentReader = ordsReader.getReader(readerContext);
+    }
+
+    /**
+     * Retrieves the next {@link FacetLabel} for the specified {@code docId}, 
or {@code null} if there are no more.
+     * This method has state: if the provided {@code docId} is the same as the 
previous invocation, it returns the
+     * next {@link FacetLabel} for that document.  Otherwise, it advances to 
the new {@code docId} and provides the
+     * first {@link FacetLabel} for that document, or {@code null} if that 
document has no indexed facets.  Each
+     * new {@code docId} must be in strictly monotonic (increasing) order.
+     *
+     * @param docId input docId provided in monotonic (non-decreasing) order
+     * @return the first or next {@link FacetLabel}, or {@code null} if there 
are no more
+     * @throws IOException when a low-level IO issue occurs
+     */
+    public FacetLabel nextFacetLabel(int docId) throws IOException {
+      if (currentDocId != docId) {
+        assert docId > currentDocId: "docs out of order!  currentDocId=" + 
currentDocId + " docId=" + docId;
+        ordinalsSegmentReader.get(docId, decodedOrds);
+        currentDocId = docId;
+        currentPos = decodedOrds.offset;
+      }
+
+      int endPos = decodedOrds.offset + decodedOrds.length;
+      assert currentPos <= endPos;
+
+      if (currentPos == endPos) {
+        // no more FacetLabels
+        return null;
+      }
+
+      int ord = decodedOrds.ints[currentPos++];
+      return taxoReader.getPath(ord);
+    }
+
+    private boolean isDescendant(int ord, int ancestorOrd) {
+      while (ord != INVALID_ORDINAL && ord != ROOT_ORDINAL) {
+        if (parents[ord] == ancestorOrd) {
+          return true;
+        }
+        ord = parents[ord];
+      }
+      return false;
+    }
+
+    /**
+     * Retrieves the next {@link FacetLabel} for the specified {@code docId} 
under the requested {@code facetDimension},
+     * or {@code null} if there are no more. This method has state: if the 
provided {@code docId} is the same as the
+     * previous invocation, it returns the next {@link FacetLabel} for that 
document.  Otherwise, it advances to
+     * the new {@code docId} and provides the first {@link FacetLabel} for 
that document, or {@code null} if that document
+     * has no indexed facets.  Each new {@code docId} must be in strictly 
monotonic (increasing) order.
+     *
+     * <p><b>NOTE</b>: this method loads the {@code int[] parents} array from 
the taxonomy index</p>
+     *
+     * @param docId input docId provided in non-decreasing order
+     * @return the first or next {@link FacetLabel}, or {@code null} if there 
are no more
+     * @throws IOException if {@link TaxonomyReader} has problems getting path 
for an ordinal
+     */
+    public FacetLabel nextFacetLabel(int docId, String facetDimension) throws 
IOException {
+      final int parentOrd = taxoReader.getOrdinal(new 
FacetLabel(facetDimension));
+      assert parentOrd != INVALID_ORDINAL : "Category ordinal not found for 
facet dimension: " + facetDimension;
+
+      if (currentDocId != docId) {
+        assert docId > currentDocId: "docs out of order!  currentDocId=" + 
currentDocId + " docId=" + docId;
+        ordinalsSegmentReader.get(docId, decodedOrds);
+        currentPos = decodedOrds.offset;
+        currentDocId = docId;
+      }
+
+      if (parents == null) {
+        parents = taxoReader.getParallelTaxonomyArrays().parents();
+      }
+
+      int endPos = decodedOrds.offset + decodedOrds.length;
+      assert currentPos <= endPos;
+
+      for (; currentPos < endPos; ) {
+        int ord = decodedOrds.ints[currentPos++];
+        if (isDescendant(ord, parentOrd) == true) {
+          return taxoReader.getPath(ord);

Review comment:
       Hmm, I was thinking this new per-segment class (`FacetLabelReader`) 
would hold onto a single `BinaryDocValues` instance, but you're right, that 
won't work.
   
   The facet ordinals are stored / compacted in sorted order in a single 
`BinaryDocValues` field which we decode here into `decodedOrds`.  Resolving 
those ordinals to `FacetLabel` is costly, currently retrieving one stored 
document per ord.  After LUCENE-9450 (switching from stored fields to BDV), we 
could pull a new BDV for each unique `docId` passed to `nextFacetLabel`, and 
then bulk resolve all the ordinals?
   
   Anyway, we can pursue all of this later -- this PR already looks awesome -- 
progress not perfection!

##########
File path: lucene/facet/src/test/org/apache/lucene/facet/FacetTestCase.java
##########
@@ -56,6 +60,28 @@ public Facets getTaxonomyFacetCounts(TaxonomyReader 
taxoReader, FacetsConfig con
     return facets;
   }
 
+  public List<List<FacetLabel>> getTaxonomyFacetLabels(TaxonomyReader 
taxoReader, FacetsConfig config, FacetsCollector fc) throws IOException {

Review comment:
       Thank you for adding this utility method so tests can easily use the new 
utility class!
   
   Can we rename this to `getAllTaxonomyFacetLabels`, and add javadoc 
explaining that the outer list is one entry per matched hit, and the inner list 
is one entry per `FacetLabel` belonging to that hit?

##########
File path: 
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/TestTaxonomyFacetCounts.java
##########
@@ -726,6 +743,39 @@ public void testRandom() throws Exception {
     IOUtils.close(tw, searcher.getIndexReader(), tr, indexDir, taxoDir);
   }
 
+  private static List<List<FacetLabel>> 
sortedFacetLabels(List<List<FacetLabel>> allfacetLabels) {
+    for (List<FacetLabel> facetLabels : allfacetLabels) {
+      Collections.sort(facetLabels);
+    }
+
+    Collections.sort(allfacetLabels, (o1, o2) -> {
+      if (o1 == null) {

Review comment:
       Hmm why are these `null` checks necessary?  Are we really seeing `null` 
in the argument?  Oh, I guess this legitimately happens when the hit had no 
facets?  Maybe add a comment?  Hmm, actually, looking at how actual and 
expected are populated, neither of them seems to add `null`?  One of them 
filters out empty list but the other does not?

##########
File path: 
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/TestTaxonomyFacetCounts.java
##########
@@ -711,6 +723,11 @@ public void testRandom() throws Exception {
         }
       }
 
+      // Test facet labels for each matching test doc
+      List<List<FacetLabel>> actualLabels = getTaxonomyFacetLabels(tr, config, 
fc);
+      assertEquals(expectedLabels.size(), actualLabels.size());

Review comment:
       Hmm I think `expectedLabels` filters out empty `List<FacetLabel>` but 
`actualLabels` does not, so this might false trip?

##########
File path: 
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/TestTaxonomyFacetCounts.java
##########
@@ -726,6 +743,39 @@ public void testRandom() throws Exception {
     IOUtils.close(tw, searcher.getIndexReader(), tr, indexDir, taxoDir);
   }
 
+  private static List<List<FacetLabel>> 
sortedFacetLabels(List<List<FacetLabel>> allfacetLabels) {
+    for (List<FacetLabel> facetLabels : allfacetLabels) {
+      Collections.sort(facetLabels);
+    }
+
+    Collections.sort(allfacetLabels, (o1, o2) -> {

Review comment:
       I'm confused why we are sorting the top list?  Isn't the top list in 
order of the hits?  And we want to confirm, for a given `docId` hit, that 
expected and actual labels match?
   
   OK, I think I understand: this test does not index anything allowing you to 
track which original doc mapped to which `FacetLabel`, so then you cannot know, 
per segment, which docs ended up where :)
   
   Given that, I think it's OK to do the top-level sort of all 
`List<FacetLabel>` across all hits.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1893: LUCENE-9444 Utility class to get facet labels from taxonomy for a fac…

Reply via email to