[GitHub] [lucene] gsmiller commented on a change in pull request #747: LUCENE-10325: Add getTopDims functionality to Facets

GitBox Tue, 15 Mar 2022 10:18:32 -0700


gsmiller commented on a change in pull request #747:
URL: https://github.com/apache/lucene/pull/747#discussion_r826448307




##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java
##########
@@ -414,4 +493,107 @@ public int compare(FacetResult a, FacetResult b) {
 
     return results;
   }
+
+  @Override
+  public List<FacetResult> getTopDims(int topNDims, int topNChildren) throws 
IOException {
+    // get topNDims by their values
+    SortedSetDocValueDimValuePriorityQueue pq =
+        new SortedSetDocValueDimValuePriorityQueue(topNDims);
+    cacheChildOrdsResult = new HashMap<>();
+    for (String dim : state.getDims()) {
+      FacetsConfig.DimConfig dimConfig = stateConfig.getDimConfig(dim);
+      if (dimConfig.hierarchical) {
+        DimTree dimTree = state.getDimTree(dim);
+        int dimOrd = dimTree.dimStartOrd;
+        // get dim value
+        Number value = getDimValue(dimConfig, dim, dimOrd, dimTree.iterator(), 
topNChildren);
+        if (value != null) {
+          // use priority queue to store SortedSetDocValuesDimValueResult for 
topNDims
+          pq.insertWithOverflow(new SortedSetDocValuesDimValueResult(dim, 
value));
+        }
+      } else {
+        OrdRange ordRange = state.getOrdRange(dim);
+        int dimOrd = ordRange.start;
+        PrimitiveIterator.OfInt childIt = ordRange.iterator();
+        if (dimConfig.multiValued && dimConfig.requireDimCount) {
+          // If the dim is multi-valued and requires dim counts, we know we've 
explicitly indexed
+          // the dimension and we need to skip past it so the iterator is 
positioned on the first
+          // child:
+          childIt.next();
+        }
+        Number value = getDimValue(dimConfig, dim, dimOrd, childIt, 
topNChildren);
+        if (value != null) {
+          pq.insertWithOverflow(new SortedSetDocValuesDimValueResult(dim, 
value));
+        }
+      }
+    }
+
+    // get FacetResult for topNDims
+    List<FacetResult> results = new LinkedList<>();
+    while (pq.size() > 0) {
+      SortedSetDocValuesDimValueResult dimValueResult = pq.pop();
+      if (dimValueResult != null) {
+        FacetResult factResult = getFacetResultForDim(dimValueResult.dim, 
topNChildren);
+        if (factResult != null) {
+          results.add(0, factResult);
+        }
+      }
+    }
+    // free cacheChildOrdsResult after getting results
+    cacheChildOrdsResult = null;
+    return results;
+  }
+
+  /**
+   * Creates SortedSetDocValuesChildOrdsResult to store dimCount, childCount, 
and TopOrdAndIntQueue
+   * q for getPathResult.
+   */
+  private class SortedSetDocValuesChildOrdsResult {
+    final int dimCount;
+    final int childCount;
+    final TopOrdAndIntQueue q;
+
+    SortedSetDocValuesChildOrdsResult(int dimCount, int childCount, 
TopOrdAndIntQueue q) {
+      this.dimCount = dimCount;
+      this.childCount = childCount;
+      this.q = q;
+    }
+  }
+
+  /**
+   * Creates SortedSetDocValuesDimValueResult to store the label and count of 
dim in order to sort
+   * by these two fields.
+   */
+  private class SortedSetDocValuesDimValueResult {
+    final String dim;
+    final Number value;
+
+    SortedSetDocValuesDimValueResult(String dim, Number value) {
+      this.dim = dim;
+      this.value = value;
+    }
+  }
+
+  /**
+   * Creates priority queue to store top dimensions and sort by their 
aggregated values/hits and
+   * string values.
+   */
+  private class SortedSetDocValueDimValuePriorityQueue

Review comment:
       minor: you could also consider inlining this as an anonymous class since 
you're only using it from one location currently. There's a bit of an existing 
convention for doing that, but if you think this is more readable, I don't have 
a strong opinion.

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java
##########
@@ -190,20 +235,45 @@ private FacetResult getPathResult(
       String[] parts = FacetsConfig.stringToPath(term.utf8ToString());
       labelValues[i] = new LabelAndValue(parts[parts.length - 1], 
ordAndValue.value);
     }
+    return labelValues;
+  }
 
-    if (dimConfig.hierarchical == false) {
+  /** Returns value/count of a dimension. */
+  private Number getDimValue(
+      FacetsConfig.DimConfig dimConfig,
+      String dim,
+      int pathOrd,
+      PrimitiveIterator.OfInt childOrds,
+      int topN)
+      throws IOException {
+
+    // if dimConfig.hierarchical == true, return dimCount directly

Review comment:
       Ah I see. This is because we _always_ index dim counts directly when 
hierarchical. Nice.

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java
##########
@@ -190,20 +235,45 @@ private FacetResult getPathResult(
       String[] parts = FacetsConfig.stringToPath(term.utf8ToString());
       labelValues[i] = new LabelAndValue(parts[parts.length - 1], 
ordAndValue.value);
     }
+    return labelValues;
+  }
 
-    if (dimConfig.hierarchical == false) {
+  /** Returns value/count of a dimension. */
+  private Number getDimValue(

Review comment:
       I'm not sure abstracting to a `Number` here gets us much right? We know 
that we're counting as integers, so I think we can just return `int` and 
simplify things (and avoid object creation and unnecessary indirection). Is 
there something I'm maybe missing though where `Number` is helpful?

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java
##########
@@ -76,6 +79,7 @@
   final String field;
   final int[] counts;
 
+  private HashMap<String, SortedSetDocValuesChildOrdsResult> 
cacheChildOrdsResult;

Review comment:
       I like the idea you've got here of memorizing partial results in the 
`getTopDims` case so they don't need to be recomputed (e.g., the dim counts, 
the child PQ, etc.). I think that's smart! I might suggest an approach though 
where you avoid using a member field to maintain this and instead pass the map 
around to the various helper methods that care about it. Providing a `null` map 
on input to the helper methods can indicate you don't need it populated, or 
that partial results aren't available. The advantage is that you don't need to 
remember to clear the map, etc., and I think it will be a little easier to 
reason about (avoiding mutable state can be a nice thing).
   
   I've got some other comments on the diff for this class that I can follow up 
with, but this is sort of a main structural bit that would be nice to bottom 
out on before I follow up with additional feedback.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gsmiller commented on a change in pull request #747: LUCENE-10325: Add getTopDims functionality to Facets

Reply via email to