mikemccand commented on code in PR #12829: URL: https://github.com/apache/lucene/pull/12829#discussion_r1406124683
########## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java: ########## @@ -262,6 +277,73 @@ long updateDocuments( } } + private interface DocValidator extends Consumer<IndexableField> { Review Comment: Do we really need a new interface here? We have only one validator now (`ParentBlockValidator`) -- can we just make a concrete instance of that which IW calls? ########## lucene/core/src/java/org/apache/lucene/search/Sort.java: ########## @@ -59,10 +61,28 @@ public Sort() { * is still a tie after all SortFields are checked, the internal Lucene docid is used to break it. */ public Sort(SortField... fields) { + this(null, fields); + } + + /** + * Sets the sort to the given criteria in succession: the first SortField is checked first, but if + * it produces a tie, then the second SortField is used to break the tie, etc. Finally, if there + * is still a tie after all SortFields are checked, the internal Lucene docid is used to break it. + * + * @param parentField the name of a numeric doc values field that marks the last document of a + * document blocks indexed with {@link + * org.apache.lucene.index.IndexWriter#addDocuments(Iterable)} or it's update relatives. This + * is required for indices that use index sorting in combination with document blocks in order + * to maintain the document order of the blocks documents. Index sorting will effectively + * compare the parent (last document) of a block in order to stable sort all it's adjacent Review Comment: s/it's/its ########## lucene/core/src/java/org/apache/lucene/search/Sort.java: ########## @@ -59,10 +61,28 @@ public Sort() { * is still a tie after all SortFields are checked, the internal Lucene docid is used to break it. */ public Sort(SortField... fields) { + this(null, fields); + } + + /** + * Sets the sort to the given criteria in succession: the first SortField is checked first, but if + * it produces a tie, then the second SortField is used to break the tie, etc. Finally, if there + * is still a tie after all SortFields are checked, the internal Lucene docid is used to break it. + * + * @param parentField the name of a numeric doc values field that marks the last document of a Review Comment: Can we make it clearer that you only use this for index-time sorting? It has no meaning at search time (and should not be set)? Or ... does it have search-time meaning? Can I use this to retrieve whole blocks in my search hits? ########## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java: ########## @@ -262,6 +277,73 @@ long updateDocuments( } } + private interface DocValidator extends Consumer<IndexableField> { + + default void afterDocument() {} + + default void beforeDocument() {} + + default void afterDocuments(int numDocs) {} + + default void reset() {} + } + + private static class ParentBlockValidator implements DocValidator { + private boolean lastDocHasParentField = false; Review Comment: You don't need the `= false` -- it's java's default already :) ########## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java: ########## @@ -262,6 +277,73 @@ long updateDocuments( } } + private interface DocValidator extends Consumer<IndexableField> { + + default void afterDocument() {} + + default void beforeDocument() {} + + default void afterDocuments(int numDocs) {} + + default void reset() {} + } + + private static class ParentBlockValidator implements DocValidator { + private boolean lastDocHasParentField = false; + private boolean childDocHasParentField = false; + private final String parentFieldName; + + @Override + public void reset() { + lastDocHasParentField = false; + childDocHasParentField = false; + } + + private ParentBlockValidator(Sort sort) { + this.parentFieldName = sort.getParentField(); + } + + @Override + public void beforeDocument() { + if (childDocHasParentField == false && lastDocHasParentField) { + childDocHasParentField = true; + } + lastDocHasParentField = false; + } + + @Override + public void accept(IndexableField field) { + if (parentFieldName != null) { + if (parentFieldName.equals(field.name()) + && DocValuesType.NUMERIC == field.fieldType().docValuesType()) { + lastDocHasParentField = true; + } + } + } + + @Override + public void afterDocument() { + if (childDocHasParentField) { + throw new IllegalArgumentException( + "only the last document in the block must contain a numeric doc values field named: " Review Comment: Maybe reword to `"child documents must not contain the parent doc values field \"" + parentFieldName + "\""` or so? ########## lucene/core/src/java/org/apache/lucene/search/Sort.java: ########## @@ -59,10 +61,28 @@ public Sort() { * is still a tie after all SortFields are checked, the internal Lucene docid is used to break it. */ public Sort(SortField... fields) { + this(null, fields); + } + + /** + * Sets the sort to the given criteria in succession: the first SortField is checked first, but if + * it produces a tie, then the second SortField is used to break the tie, etc. Finally, if there + * is still a tie after all SortFields are checked, the internal Lucene docid is used to break it. + * + * @param parentField the name of a numeric doc values field that marks the last document of a + * document blocks indexed with {@link + * org.apache.lucene.index.IndexWriter#addDocuments(Iterable)} or it's update relatives. This + * is required for indices that use index sorting in combination with document blocks in order + * to maintain the document order of the blocks documents. Index sorting will effectively + * compare the parent (last document) of a block in order to stable sort all it's adjacent + * documents that belong to a block. This field must be a numeric doc values field a + */ + public Sort(String parentField, SortField... fields) { Review Comment: It's sort of strange that we are pushing this down into `Sort` since when used at search time (for sorting hits) a `parentField` doesn't make sense? Do we check that `parentField` is not specified when sorting hits during searching? ########## lucene/core/src/java/org/apache/lucene/search/Sort.java: ########## @@ -85,7 +110,7 @@ public SortField[] getSort() { */ public Sort rewrite(IndexSearcher searcher) throws IOException { boolean changed = false; - + assert parentField == null; Review Comment: Ahh we do here -- maybe make this a real `if` since a user might incorrectly set a `parentField` hoping for something to happen. ########## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java: ########## @@ -262,6 +277,73 @@ long updateDocuments( } } + private interface DocValidator extends Consumer<IndexableField> { + + default void afterDocument() {} + + default void beforeDocument() {} + + default void afterDocuments(int numDocs) {} + + default void reset() {} + } + + private static class ParentBlockValidator implements DocValidator { + private boolean lastDocHasParentField = false; + private boolean childDocHasParentField = false; + private final String parentFieldName; + + @Override + public void reset() { + lastDocHasParentField = false; + childDocHasParentField = false; + } + + private ParentBlockValidator(Sort sort) { + this.parentFieldName = sort.getParentField(); + } + + @Override + public void beforeDocument() { + if (childDocHasParentField == false && lastDocHasParentField) { + childDocHasParentField = true; + } + lastDocHasParentField = false; + } + + @Override + public void accept(IndexableField field) { + if (parentFieldName != null) { + if (parentFieldName.equals(field.name()) + && DocValuesType.NUMERIC == field.fieldType().docValuesType()) { + lastDocHasParentField = true; + } + } + } + + @Override + public void afterDocument() { + if (childDocHasParentField) { + throw new IllegalArgumentException( + "only the last document in the block must contain a numeric doc values field named: " + + parentFieldName); + } + } + + @Override + public void afterDocuments(int numDocs) { + if (numDocs > 1 && parentFieldName == null) { + throw new IllegalArgumentException( + "A parent field must be set in order to use document blocks with index sorting"); + } + if (parentFieldName != null && lastDocHasParentField == false) { + throw new IllegalArgumentException( + "the last document in the block must contain a numeric doc values field named: " Review Comment: Maybe `must contain the parent doc values field`? ########## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java: ########## @@ -262,6 +277,73 @@ long updateDocuments( } } + private interface DocValidator extends Consumer<IndexableField> { + + default void afterDocument() {} + + default void beforeDocument() {} + + default void afterDocuments(int numDocs) {} + + default void reset() {} + } + + private static class ParentBlockValidator implements DocValidator { + private boolean lastDocHasParentField = false; + private boolean childDocHasParentField = false; + private final String parentFieldName; + + @Override + public void reset() { + lastDocHasParentField = false; + childDocHasParentField = false; + } + + private ParentBlockValidator(Sort sort) { + this.parentFieldName = sort.getParentField(); + } + + @Override + public void beforeDocument() { + if (childDocHasParentField == false && lastDocHasParentField) { + childDocHasParentField = true; + } + lastDocHasParentField = false; + } + + @Override + public void accept(IndexableField field) { + if (parentFieldName != null) { + if (parentFieldName.equals(field.name()) + && DocValuesType.NUMERIC == field.fieldType().docValuesType()) { + lastDocHasParentField = true; + } + } + } + + @Override + public void afterDocument() { + if (childDocHasParentField) { + throw new IllegalArgumentException( + "only the last document in the block must contain a numeric doc values field named: " + + parentFieldName); + } + } + + @Override + public void afterDocuments(int numDocs) { + if (numDocs > 1 && parentFieldName == null) { + throw new IllegalArgumentException( + "A parent field must be set in order to use document blocks with index sorting"); Review Comment: Lowercase `a` for consistency? And maybe point to `Sort.setParentField` or so? ########## lucene/core/src/java/org/apache/lucene/search/Sort.java: ########## @@ -59,10 +61,28 @@ public Sort() { * is still a tie after all SortFields are checked, the internal Lucene docid is used to break it. */ public Sort(SortField... fields) { + this(null, fields); + } + + /** + * Sets the sort to the given criteria in succession: the first SortField is checked first, but if + * it produces a tie, then the second SortField is used to break the tie, etc. Finally, if there + * is still a tie after all SortFields are checked, the internal Lucene docid is used to break it. + * + * @param parentField the name of a numeric doc values field that marks the last document of a + * document blocks indexed with {@link + * org.apache.lucene.index.IndexWriter#addDocuments(Iterable)} or it's update relatives. This + * is required for indices that use index sorting in combination with document blocks in order + * to maintain the document order of the blocks documents. Index sorting will effectively Review Comment: Does this mean the Sort is otherwise free to break up the block? I.e. this `parentField` will override sort fields that might attempt to sort child docs away from one another? ########## lucene/core/src/java/org/apache/lucene/search/Sort.java: ########## @@ -59,10 +61,28 @@ public Sort() { * is still a tie after all SortFields are checked, the internal Lucene docid is used to break it. */ public Sort(SortField... fields) { + this(null, fields); + } + + /** + * Sets the sort to the given criteria in succession: the first SortField is checked first, but if + * it produces a tie, then the second SortField is used to break the tie, etc. Finally, if there + * is still a tie after all SortFields are checked, the internal Lucene docid is used to break it. + * + * @param parentField the name of a numeric doc values field that marks the last document of a + * document blocks indexed with {@link + * org.apache.lucene.index.IndexWriter#addDocuments(Iterable)} or it's update relatives. This + * is required for indices that use index sorting in combination with document blocks in order + * to maintain the document order of the blocks documents. Index sorting will effectively + * compare the parent (last document) of a block in order to stable sort all it's adjacent + * documents that belong to a block. This field must be a numeric doc values field a Review Comment: Remove that errant last `a` and replace with a period maybe? ########## lucene/core/src/java/org/apache/lucene/search/Sort.java: ########## @@ -59,10 +61,28 @@ public Sort() { * is still a tie after all SortFields are checked, the internal Lucene docid is used to break it. */ public Sort(SortField... fields) { + this(null, fields); + } + + /** + * Sets the sort to the given criteria in succession: the first SortField is checked first, but if + * it produces a tie, then the second SortField is used to break the tie, etc. Finally, if there + * is still a tie after all SortFields are checked, the internal Lucene docid is used to break it. + * + * @param parentField the name of a numeric doc values field that marks the last document of a + * document blocks indexed with {@link + * org.apache.lucene.index.IndexWriter#addDocuments(Iterable)} or it's update relatives. This Review Comment: s/it's/its ########## lucene/core/src/test/org/apache/lucene/index/TestIndexSorting.java: ########## @@ -3173,4 +3173,184 @@ public void testSortDocsAndFreqsAndPositionsAndOffsets() throws IOException { reader.close(); dir.close(); } + + public void testBlockIsMissingParentField() throws IOException { + try (Directory dir = newDirectory()) { + IndexWriterConfig iwc = new IndexWriterConfig(new MockAnalyzer(random())); + String parentField = "parent"; + Sort indexSort = new Sort(parentField, new SortField("foo", SortField.Type.INT)); + iwc.setIndexSort(indexSort); + try (IndexWriter writer = new IndexWriter(dir, iwc)) { + List<Runnable> runnabels = + Arrays.asList( + () -> { + IllegalArgumentException ex = + expectThrows( + IllegalArgumentException.class, + () -> { + writer.addDocuments(Arrays.asList(new Document(), new Document())); + }); + assertEquals( + "the last document in the block must contain a numeric doc values field named: parent", + ex.getMessage()); + }, + () -> { + IllegalArgumentException ex = + expectThrows( + IllegalArgumentException.class, + () -> { + Document doc = new Document(); + doc.add(new NumericDocValuesField("parent", 0)); + writer.addDocuments(Arrays.asList(doc, new Document())); + }); + assertEquals( + "only the last document in the block must contain a numeric doc values field named: parent", Review Comment: @msokolov I think multiple joins (nested levels) would work, except the user cannot use this parent DV field to record the sub-structure? Another field must be added by the user to record the child -> grandchild blocks. I agree it'd be nice if this parent field would accommodate more than one level of nesting, but maybe that can be a followon improvement. This field would only handle a single level (the entire block of child and grandchild docs). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org