gsmiller commented on a change in pull request #509: URL: https://github.com/apache/lucene/pull/509#discussion_r779892023
########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetField.java ########## @@ -41,19 +42,28 @@ public final String dim; /** Label. */ - public final String label; + public final String[] path; Review comment: It's frustrating that these fields are public, but because they are, this isn't actually backwards compatible. It seems like these are public to make them visible to other code in the facets module that's in a different java package (specifically, `FacetsConfig`), but users could still be reading this field directly for some reason and have their code no longer compile because of this change. If we really want to keep this fully back-compat, I think we need to add a new field for `path` and maintain the `label` string (marked as deprecated). The label could be created in the ctor with `FacetsConfig#pathToString`. ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java ########## @@ -95,17 +96,34 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I if (topN <= 0) { throw new IllegalArgumentException("topN must be > 0 (got: " + topN + ")"); } - if (path.length > 0) { - throw new IllegalArgumentException("path should be 0 length"); - } - OrdRange ordRange = state.getOrdRange(dim); - if (ordRange == null) { - return null; // means dimension was never indexed + + if (state.isHierarchicalDim(dim)) { + int pathOrd = (int) dv.lookupTerm(new BytesRef(FacetsConfig.pathToString(dim, path))); + if (pathOrd == -1) { Review comment: If you take a look at the javadoc (or implementation) for `SortedSetDocValues#lookupTerm`, I think you'll want to change this condition to check for a return value < 0 (instead of `-1` explicitly). Any negative return value indicates "not found". It would also be good to ensure there's a unit test covering this case. I suspect there may not be if you didn't find this as a bug in testing? ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/DefaultSortedSetDocValuesReaderState.java ########## @@ -79,38 +104,160 @@ public DefaultSortedSetDocValuesReaderState(IndexReader reader, String field) th } valueCount = (int) dv.getValueCount(); - // TODO: we can make this more efficient if eg we can be - // "involved" when OrdinalMap is being created? Ie see - // each term/ord it's assigning as it goes... - String lastDim = null; - int startOrd = -1; - - // TODO: this approach can work for full hierarchy?; - // TaxoReader can't do this since ords are not in - // "sorted order" ... but we should generalize this to - // support arbitrary hierarchy: - for (int ord = 0; ord < valueCount; ord++) { - final BytesRef term = dv.lookupOrd(ord); + int ord = 0; + while (ord != valueCount) { + BytesRef term = dv.lookupOrd(ord); String[] components = FacetsConfig.stringToPath(term.utf8ToString()); - if (components.length != 2) { - throw new IllegalArgumentException( - "this class can only handle 2 level hierarchy (dim/value); got: " - + Arrays.toString(components) - + " " - + term.utf8ToString()); + String dim = components[0]; + if (config != null && config.getDimConfig(dim).hierarchical) { + ord = createOneHierarchicalFacetDimState(dv, ord) + 1; + } else { + ord = createOneFlatFacetDimState(dv, ord) + 1; + } + } + } + + // returns last ord of dimension + private int createOneHierarchicalFacetDimState(SortedSetDocValues dv, int dimStartOrd) + throws IOException { + List<Boolean> hasChildren = new ArrayList<>(); + List<Integer> siblings = new ArrayList<>(); + + String dim = null; + + // stack of paths with unfulfilled siblings + Stack<OrdAndComponent> siblingStack = new Stack<>(); + + String[] nextComponents = null; + int dimEndOrd = dimStartOrd; + while (true) { + String[] components; + + int ord = dimEndOrd - dimStartOrd; + + if (nextComponents == null) { + BytesRef term = dv.lookupOrd(dimEndOrd); + components = FacetsConfig.stringToPath(term.utf8ToString()); + dim = components[0]; + } else { + components = nextComponents; } - if (!components[0].equals(lastDim)) { - if (lastDim != null) { - prefixToOrdRange.put(lastDim, new OrdRange(startOrd, ord - 1)); + + while (siblingStack.empty() == false + && siblingStack.peek().component.length >= components.length) { + OrdAndComponent possibleSibling = siblingStack.pop(); + if (possibleSibling.component.length == components.length) { + // lengths are equal + boolean isSibling = true; + for (int i = 0; i < possibleSibling.component.length - 1; i++) { + if (possibleSibling.component[i].equals(components[i]) == false) { Review comment: My reasoning might be incomplete, but I don't think you need this loop check. Isn't it enough that the lengths are equal? I think for some ancestry component to be different, we would have already popped this off right? ########## File path: lucene/facet/src/java/org/apache/lucene/facet/FacetsConfig.java ########## @@ -471,19 +471,35 @@ private void indexDrillDownTerms( private void processSSDVFacetFields( Map<String, List<SortedSetDocValuesFacetField>> byField, Document doc) { + for (Map.Entry<String, List<SortedSetDocValuesFacetField>> ent : byField.entrySet()) { String indexFieldName = ent.getKey(); for (SortedSetDocValuesFacetField facetField : ent.getValue()) { - FacetLabel facetLabel = new FacetLabel(facetField.dim, facetField.label); - String fullPath = pathToString(facetLabel.components, facetLabel.length); + FacetLabel facetLabel = new FacetLabel(facetField.dim, facetField.path); + if (getDimConfig(facetField.dim).hierarchical) { Review comment: minor: Maybe pull out the DimConfig into a local variable since you reference it a couple times. ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesReaderState.java ########## @@ -62,15 +159,27 @@ protected SortedSetDocValuesReaderState() {} /** Indexed field we are reading. */ public abstract String getField(); + /** Returns top-level index reader. */ + public abstract IndexReader getReader(); + + /** Number of unique labels. */ + public abstract int getSize(); + + /** Returns if dim is configured as hierarchical ** */ + public abstract boolean isHierarchicalDim(String dim); + + /*** Only used for flat facets (dim/value) ***/ + /** Returns the {@link OrdRange} for this dimension. */ public abstract OrdRange getOrdRange(String dim); /** Returns mapping from prefix to {@link OrdRange}. */ public abstract Map<String, OrdRange> getPrefixToOrdRange(); - /** Returns top-level index reader. */ - public abstract IndexReader getReader(); + /*** Only used for hierarchical facets ***/ - /** Number of unique labels. */ - public abstract int getSize(); + public abstract DimTree getDimTree(String dim); + + /** Returns a list of all dimensions and their respective ordinals */ Review comment: This doesn't return the ordinals right? Just the dimension strings? ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java ########## @@ -145,12 +164,19 @@ private FacetResult getDim(String dim, OrdRange ordRange, int topN) throws IOExc LabelAndValue[] labelValues = new LabelAndValue[q.size()]; for (int i = labelValues.length - 1; i >= 0; i--) { TopOrdAndIntQueue.OrdAndValue ordAndValue = q.pop(); + assert ordAndValue != null; final BytesRef term = dv.lookupOrd(ordAndValue.ord); String[] parts = FacetsConfig.stringToPath(term.utf8ToString()); - labelValues[i] = new LabelAndValue(parts[1], ordAndValue.value); + labelValues[i] = new LabelAndValue(parts[parts.length - 1], ordAndValue.value); } - return new FacetResult(dim, new String[0], dimCount, labelValues, childCount); + if (pathOrd == -1) { + // not hierarchical facet + return new FacetResult(dim, new String[0], dimCount, labelValues, childCount); Review comment: Maybe define a local constant as `new String[0]` that you reference in these spots instead? ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesReaderState.java ########## @@ -51,8 +51,105 @@ public OrdRange(int start, int end) { this.start = start; this.end = end; } + + /** Iterates from start to end ord (inclusive) * */ Review comment: minor: I see some one-line javadoc that has a funky extra '*' near the end. Could you do a pass and try to clean these up? I'm sure it's some combo of and IDE + spotless? ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesReaderState.java ########## @@ -62,15 +159,27 @@ protected SortedSetDocValuesReaderState() {} /** Indexed field we are reading. */ public abstract String getField(); + /** Returns top-level index reader. */ + public abstract IndexReader getReader(); + + /** Number of unique labels. */ + public abstract int getSize(); + + /** Returns if dim is configured as hierarchical ** */ + public abstract boolean isHierarchicalDim(String dim); Review comment: What about actually exposing the `FacetsConfig` instead of just this one-off attribute of the config? That would also help with a separate issue where dim counts can be over-counted since SSDV faceting doesn't know whether-or-not it's multi-valued (LUCENE-9952). ########## File path: lucene/facet/src/java/org/apache/lucene/facet/FacetsConfig.java ########## @@ -471,19 +471,35 @@ private void indexDrillDownTerms( private void processSSDVFacetFields( Map<String, List<SortedSetDocValuesFacetField>> byField, Document doc) { + for (Map.Entry<String, List<SortedSetDocValuesFacetField>> ent : byField.entrySet()) { String indexFieldName = ent.getKey(); for (SortedSetDocValuesFacetField facetField : ent.getValue()) { - FacetLabel facetLabel = new FacetLabel(facetField.dim, facetField.label); - String fullPath = pathToString(facetLabel.components, facetLabel.length); + FacetLabel facetLabel = new FacetLabel(facetField.dim, facetField.path); + if (getDimConfig(facetField.dim).hierarchical) { + for (int i = 0; i < facetLabel.length; i++) { + String fullPath = pathToString(facetLabel.components, i + 1); + // For facet counts: + doc.add(new SortedSetDocValuesField(indexFieldName, new BytesRef(fullPath))); + + // For drill-down: + indexDrillDownTerms(doc, indexFieldName, getDimConfig(facetField.dim), facetLabel); + } + } else { + if (facetLabel.length != 2) { + throw new IllegalArgumentException( Review comment: minor: Maybe try to base the exception message on the error checking already done for taxo-facets to make the wording a little more consistent? ``` if (facetField.path.length > 1 && dimConfig.hierarchical == false) { throw new IllegalArgumentException( "dimension \"" + facetField.dim + "\" is not hierarchical yet has " + facetField.path.length + " components"); } ``` ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java ########## @@ -95,17 +96,34 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I if (topN <= 0) { throw new IllegalArgumentException("topN must be > 0 (got: " + topN + ")"); } - if (path.length > 0) { - throw new IllegalArgumentException("path should be 0 length"); - } - OrdRange ordRange = state.getOrdRange(dim); - if (ordRange == null) { - return null; // means dimension was never indexed + + if (state.isHierarchicalDim(dim)) { + int pathOrd = (int) dv.lookupTerm(new BytesRef(FacetsConfig.pathToString(dim, path))); + if (pathOrd == -1) { + // path was never indexed + return null; + } + DimTree dimTree = state.getDimTree(dim); + if (dimTree == null) { + return null; + } + return getDim(dim, path, pathOrd, dimTree.iterator(pathOrd), topN); + } else { + if (path.length > 0) { + throw new IllegalArgumentException( + "Field is not configured as hierarchical, path should be 0 length"); + } + OrdRange ordRange = state.getOrdRange(dim); + if (ordRange == null) { + return null; Review comment: minor: Could you please retain the comment on this line? `// means dimension was never indexed` ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesReaderState.java ########## @@ -51,8 +51,105 @@ public OrdRange(int start, int end) { this.start = start; this.end = end; } + + /** Iterates from start to end ord (inclusive) * */ + public PrimitiveIterator.OfInt iterator() { + return new PrimitiveIterator.OfInt() { + int current = start; + + @Override + public int nextInt() { + if (current > end) { + return INVALID_ORDINAL; + } + return current++; + } + + @Override + public boolean hasNext() { + return current <= end; + } + }; + } + } + + /** + * Holds children and sibling information for a single dimension. Only used with hierarchical + * dimensions. + */ + public static final class DimTree { + private final FixedBitSet hasChildren; + // TODO: This array can take up a lot of space. Change type based on input size maybe? + private final int[] siblings; + + /** The first ord of the dimension * */ + public final int dimStartOrd; + + /** Sibling and children must be of same length * */ + public DimTree(int dimStartOrd, List<Integer> sibling, List<Boolean> hasChildren) { + assert sibling.size() == hasChildren.size(); Review comment: Throwing an IllegalArumentException might be a better fit here since these are user-provided parameters. ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java ########## @@ -95,17 +96,34 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I if (topN <= 0) { throw new IllegalArgumentException("topN must be > 0 (got: " + topN + ")"); } - if (path.length > 0) { - throw new IllegalArgumentException("path should be 0 length"); - } - OrdRange ordRange = state.getOrdRange(dim); - if (ordRange == null) { - return null; // means dimension was never indexed + + if (state.isHierarchicalDim(dim)) { + int pathOrd = (int) dv.lookupTerm(new BytesRef(FacetsConfig.pathToString(dim, path))); + if (pathOrd == -1) { + // path was never indexed + return null; + } + DimTree dimTree = state.getDimTree(dim); + if (dimTree == null) { Review comment: Seems like this should never happen right? If we reach this point, it means the path was indexed. Shouldn't there always be a dim tree then? ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/DefaultSortedSetDocValuesReaderState.java ########## @@ -79,38 +104,160 @@ public DefaultSortedSetDocValuesReaderState(IndexReader reader, String field) th } valueCount = (int) dv.getValueCount(); - // TODO: we can make this more efficient if eg we can be - // "involved" when OrdinalMap is being created? Ie see - // each term/ord it's assigning as it goes... - String lastDim = null; - int startOrd = -1; - - // TODO: this approach can work for full hierarchy?; - // TaxoReader can't do this since ords are not in - // "sorted order" ... but we should generalize this to - // support arbitrary hierarchy: - for (int ord = 0; ord < valueCount; ord++) { - final BytesRef term = dv.lookupOrd(ord); + int ord = 0; + while (ord != valueCount) { + BytesRef term = dv.lookupOrd(ord); String[] components = FacetsConfig.stringToPath(term.utf8ToString()); - if (components.length != 2) { - throw new IllegalArgumentException( - "this class can only handle 2 level hierarchy (dim/value); got: " - + Arrays.toString(components) - + " " - + term.utf8ToString()); + String dim = components[0]; + if (config != null && config.getDimConfig(dim).hierarchical) { + ord = createOneHierarchicalFacetDimState(dv, ord) + 1; + } else { + ord = createOneFlatFacetDimState(dv, ord) + 1; + } + } + } + + // returns last ord of dimension + private int createOneHierarchicalFacetDimState(SortedSetDocValues dv, int dimStartOrd) + throws IOException { + List<Boolean> hasChildren = new ArrayList<>(); + List<Integer> siblings = new ArrayList<>(); + + String dim = null; + + // stack of paths with unfulfilled siblings + Stack<OrdAndComponent> siblingStack = new Stack<>(); + + String[] nextComponents = null; + int dimEndOrd = dimStartOrd; + while (true) { + String[] components; + + int ord = dimEndOrd - dimStartOrd; + + if (nextComponents == null) { + BytesRef term = dv.lookupOrd(dimEndOrd); + components = FacetsConfig.stringToPath(term.utf8ToString()); + dim = components[0]; + } else { + components = nextComponents; } - if (!components[0].equals(lastDim)) { - if (lastDim != null) { - prefixToOrdRange.put(lastDim, new OrdRange(startOrd, ord - 1)); + + while (siblingStack.empty() == false + && siblingStack.peek().component.length >= components.length) { + OrdAndComponent possibleSibling = siblingStack.pop(); + if (possibleSibling.component.length == components.length) { + // lengths are equal + boolean isSibling = true; + for (int i = 0; i < possibleSibling.component.length - 1; i++) { + if (possibleSibling.component[i].equals(components[i]) == false) { + isSibling = false; + break; + } + } + if (isSibling) { + siblings.set(possibleSibling.ord, ord); + } } - startOrd = ord; - lastDim = components[0]; } + + if (dimEndOrd + 1 == valueCount) { + // current ord needs to be added, can't have children or siblings + siblings.add(-1); + hasChildren.add(false); + break; + } + + BytesRef nextTerm = dv.lookupOrd(dimEndOrd + 1); + nextComponents = FacetsConfig.stringToPath(nextTerm.utf8ToString()); + + if (nextComponents[0].equals(components[0]) == false) { + // current ord needs to be added, can't have children or siblings + siblings.add(-1); + hasChildren.add(false); + break; + } + + if (components.length < nextComponents.length) { + // next ord must be a direct child of current ord, this is because we are indexing all + // ancestral paths + hasChildren.add(ord, true); + // we don't know if this ord has a sibling or where it's sibling could be yet + siblingStack.push(new OrdAndComponent(ord, components)); + // we still add INVALID_ORDINAL, which will be replaced if a valid sibling is found + siblings.add(ord, INVALID_ORDINAL); + } else if (components.length == nextComponents.length) { + // next ord must be a sibling of current and there are no direct children or current, this + // is because we + // are indexing all ancestral paths + siblings.add(ord, ord + 1); + hasChildren.add(ord, false); + } else { + // components.length > nextComponents.length + // next ord is neither sibling nor child + siblings.add(ord, INVALID_ORDINAL); + hasChildren.add(ord, false); + } + + dimEndOrd++; } - if (lastDim != null) { - prefixToOrdRange.put(lastDim, new OrdRange(startOrd, valueCount - 1)); + prefixToDimTree.put(dim, new DimTree(dimStartOrd, siblings, hasChildren)); + + return dimEndOrd; + } + + // returns last ord of dimension + private int createOneFlatFacetDimState(SortedSetDocValues dv, int dimStartOrd) + throws IOException { + + String dim = null; + + String[] nextComponents = null; + int dimEndOrd = dimStartOrd; + + while (true) { + String[] components; + + if (nextComponents == null) { Review comment: This should only be true on the first iteration of the loop right? Maybe it would be a little more readable to pull this outside the loop? So you'd initialize nextComponents to be the first entry for the dim and then enter the loop? ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java ########## @@ -317,10 +343,19 @@ public Number getSpecificValue(String dim, String... path) throws IOException { public List<FacetResult> getAllDims(int topN) throws IOException { List<FacetResult> results = new ArrayList<>(); - for (Map.Entry<String, OrdRange> ent : state.getPrefixToOrdRange().entrySet()) { - FacetResult fr = getDim(ent.getKey(), ent.getValue(), topN); - if (fr != null) { - results.add(fr); + for (String dim : state.getDims()) { + if (state.isHierarchicalDim(dim)) { + DimTree dimTree = state.getDimTree(dim); + FacetResult fr = getDim(dim, new String[0], dimTree.dimStartOrd, dimTree.iterator(), topN); Review comment: Can you just pass `-1` instead of `dimTree.dimStartOrd` here? If I'm understanding the logic in `getDim`, I think you can right? Then you can make `dimStartOrd` `private`, which would be nice. ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/DefaultSortedSetDocValuesReaderState.java ########## @@ -51,20 +54,42 @@ private final Map<String, OrdinalMap> cachedOrdMaps = new HashMap<>(); + private final FacetsConfig config; + + /*** Used for hierarchical dims ***/ Review comment: I commented elsewhere but there's something going on with extra '*' in some of these javadoc comments :) ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/DefaultSortedSetDocValuesReaderState.java ########## @@ -79,38 +104,160 @@ public DefaultSortedSetDocValuesReaderState(IndexReader reader, String field) th } valueCount = (int) dv.getValueCount(); - // TODO: we can make this more efficient if eg we can be - // "involved" when OrdinalMap is being created? Ie see - // each term/ord it's assigning as it goes... - String lastDim = null; - int startOrd = -1; - - // TODO: this approach can work for full hierarchy?; - // TaxoReader can't do this since ords are not in - // "sorted order" ... but we should generalize this to - // support arbitrary hierarchy: - for (int ord = 0; ord < valueCount; ord++) { - final BytesRef term = dv.lookupOrd(ord); + int ord = 0; + while (ord != valueCount) { + BytesRef term = dv.lookupOrd(ord); String[] components = FacetsConfig.stringToPath(term.utf8ToString()); - if (components.length != 2) { - throw new IllegalArgumentException( - "this class can only handle 2 level hierarchy (dim/value); got: " - + Arrays.toString(components) - + " " - + term.utf8ToString()); + String dim = components[0]; + if (config != null && config.getDimConfig(dim).hierarchical) { + ord = createOneHierarchicalFacetDimState(dv, ord) + 1; + } else { + ord = createOneFlatFacetDimState(dv, ord) + 1; + } + } + } + + // returns last ord of dimension + private int createOneHierarchicalFacetDimState(SortedSetDocValues dv, int dimStartOrd) + throws IOException { + List<Boolean> hasChildren = new ArrayList<>(); + List<Integer> siblings = new ArrayList<>(); + + String dim = null; + + // stack of paths with unfulfilled siblings + Stack<OrdAndComponent> siblingStack = new Stack<>(); + + String[] nextComponents = null; + int dimEndOrd = dimStartOrd; + while (true) { + String[] components; + + int ord = dimEndOrd - dimStartOrd; + + if (nextComponents == null) { Review comment: I made a similar comment elsewhere, but it seems like nextComponents could be initialized outside of the loop right? ########## File path: lucene/facet/src/java/org/apache/lucene/facet/FacetsConfig.java ########## @@ -471,19 +471,35 @@ private void indexDrillDownTerms( private void processSSDVFacetFields( Map<String, List<SortedSetDocValuesFacetField>> byField, Document doc) { + for (Map.Entry<String, List<SortedSetDocValuesFacetField>> ent : byField.entrySet()) { String indexFieldName = ent.getKey(); for (SortedSetDocValuesFacetField facetField : ent.getValue()) { - FacetLabel facetLabel = new FacetLabel(facetField.dim, facetField.label); - String fullPath = pathToString(facetLabel.components, facetLabel.length); + FacetLabel facetLabel = new FacetLabel(facetField.dim, facetField.path); + if (getDimConfig(facetField.dim).hierarchical) { + for (int i = 0; i < facetLabel.length; i++) { + String fullPath = pathToString(facetLabel.components, i + 1); + // For facet counts: + doc.add(new SortedSetDocValuesField(indexFieldName, new BytesRef(fullPath))); + + // For drill-down: + indexDrillDownTerms(doc, indexFieldName, getDimConfig(facetField.dim), facetLabel); Review comment: I don't think you intended to put this in the loop right? In fact, I think it should be outside of the conditional altogether since the logic should be the same regardless of whether-or-not it's hierarchical right? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org