mikemccand commented on code in PR #12829:
URL: https://github.com/apache/lucene/pull/12829#discussion_r1406124683


##########
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java:
##########
@@ -262,6 +277,73 @@ long updateDocuments(
     }
   }
 
+  private interface DocValidator extends Consumer<IndexableField> {

Review Comment:
   Do we really need a new interface here?  We have only one validator now 
(`ParentBlockValidator`) -- can we just make a concrete instance of that which 
IW calls?



##########
lucene/core/src/java/org/apache/lucene/search/Sort.java:
##########
@@ -59,10 +61,28 @@ public Sort() {
    * is still a tie after all SortFields are checked, the internal Lucene 
docid is used to break it.
    */
   public Sort(SortField... fields) {
+    this(null, fields);
+  }
+
+  /**
+   * Sets the sort to the given criteria in succession: the first SortField is 
checked first, but if
+   * it produces a tie, then the second SortField is used to break the tie, 
etc. Finally, if there
+   * is still a tie after all SortFields are checked, the internal Lucene 
docid is used to break it.
+   *
+   * @param parentField the name of a numeric doc values field that marks the 
last document of a
+   *     document blocks indexed with {@link
+   *     org.apache.lucene.index.IndexWriter#addDocuments(Iterable)} or it's 
update relatives. This
+   *     is required for indices that use index sorting in combination with 
document blocks in order
+   *     to maintain the document order of the blocks documents. Index sorting 
will effectively
+   *     compare the parent (last document) of a block in order to stable sort 
all it's adjacent

Review Comment:
   s/it's/its



##########
lucene/core/src/java/org/apache/lucene/search/Sort.java:
##########
@@ -59,10 +61,28 @@ public Sort() {
    * is still a tie after all SortFields are checked, the internal Lucene 
docid is used to break it.
    */
   public Sort(SortField... fields) {
+    this(null, fields);
+  }
+
+  /**
+   * Sets the sort to the given criteria in succession: the first SortField is 
checked first, but if
+   * it produces a tie, then the second SortField is used to break the tie, 
etc. Finally, if there
+   * is still a tie after all SortFields are checked, the internal Lucene 
docid is used to break it.
+   *
+   * @param parentField the name of a numeric doc values field that marks the 
last document of a

Review Comment:
   Can we make it clearer that you only use this for index-time sorting?  It 
has no meaning at search time (and should not be set)?
   
   Or ... does it have search-time meaning?  Can I use this to retrieve whole 
blocks in my search hits?



##########
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java:
##########
@@ -262,6 +277,73 @@ long updateDocuments(
     }
   }
 
+  private interface DocValidator extends Consumer<IndexableField> {
+
+    default void afterDocument() {}
+
+    default void beforeDocument() {}
+
+    default void afterDocuments(int numDocs) {}
+
+    default void reset() {}
+  }
+
+  private static class ParentBlockValidator implements DocValidator {
+    private boolean lastDocHasParentField = false;

Review Comment:
   You don't need the `= false` -- it's java's default already :)



##########
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java:
##########
@@ -262,6 +277,73 @@ long updateDocuments(
     }
   }
 
+  private interface DocValidator extends Consumer<IndexableField> {
+
+    default void afterDocument() {}
+
+    default void beforeDocument() {}
+
+    default void afterDocuments(int numDocs) {}
+
+    default void reset() {}
+  }
+
+  private static class ParentBlockValidator implements DocValidator {
+    private boolean lastDocHasParentField = false;
+    private boolean childDocHasParentField = false;
+    private final String parentFieldName;
+
+    @Override
+    public void reset() {
+      lastDocHasParentField = false;
+      childDocHasParentField = false;
+    }
+
+    private ParentBlockValidator(Sort sort) {
+      this.parentFieldName = sort.getParentField();
+    }
+
+    @Override
+    public void beforeDocument() {
+      if (childDocHasParentField == false && lastDocHasParentField) {
+        childDocHasParentField = true;
+      }
+      lastDocHasParentField = false;
+    }
+
+    @Override
+    public void accept(IndexableField field) {
+      if (parentFieldName != null) {
+        if (parentFieldName.equals(field.name())
+            && DocValuesType.NUMERIC == field.fieldType().docValuesType()) {
+          lastDocHasParentField = true;
+        }
+      }
+    }
+
+    @Override
+    public void afterDocument() {
+      if (childDocHasParentField) {
+        throw new IllegalArgumentException(
+            "only the last document in the block must contain a numeric doc 
values field named: "

Review Comment:
   Maybe reword to `"child documents must not contain the parent doc values 
field \"" + parentFieldName + "\""` or so?



##########
lucene/core/src/java/org/apache/lucene/search/Sort.java:
##########
@@ -59,10 +61,28 @@ public Sort() {
    * is still a tie after all SortFields are checked, the internal Lucene 
docid is used to break it.
    */
   public Sort(SortField... fields) {
+    this(null, fields);
+  }
+
+  /**
+   * Sets the sort to the given criteria in succession: the first SortField is 
checked first, but if
+   * it produces a tie, then the second SortField is used to break the tie, 
etc. Finally, if there
+   * is still a tie after all SortFields are checked, the internal Lucene 
docid is used to break it.
+   *
+   * @param parentField the name of a numeric doc values field that marks the 
last document of a
+   *     document blocks indexed with {@link
+   *     org.apache.lucene.index.IndexWriter#addDocuments(Iterable)} or it's 
update relatives. This
+   *     is required for indices that use index sorting in combination with 
document blocks in order
+   *     to maintain the document order of the blocks documents. Index sorting 
will effectively
+   *     compare the parent (last document) of a block in order to stable sort 
all it's adjacent
+   *     documents that belong to a block. This field must be a numeric doc 
values field a
+   */
+  public Sort(String parentField, SortField... fields) {

Review Comment:
   It's sort of strange that we are pushing this down into `Sort` since when 
used at search time (for sorting hits) a `parentField` doesn't make sense?
   
   Do we check that `parentField` is not specified when sorting hits during 
searching?



##########
lucene/core/src/java/org/apache/lucene/search/Sort.java:
##########
@@ -85,7 +110,7 @@ public SortField[] getSort() {
    */
   public Sort rewrite(IndexSearcher searcher) throws IOException {
     boolean changed = false;
-
+    assert parentField == null;

Review Comment:
   Ahh we do here -- maybe make this a real `if` since a user might incorrectly 
set a `parentField` hoping for something to happen.



##########
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java:
##########
@@ -262,6 +277,73 @@ long updateDocuments(
     }
   }
 
+  private interface DocValidator extends Consumer<IndexableField> {
+
+    default void afterDocument() {}
+
+    default void beforeDocument() {}
+
+    default void afterDocuments(int numDocs) {}
+
+    default void reset() {}
+  }
+
+  private static class ParentBlockValidator implements DocValidator {
+    private boolean lastDocHasParentField = false;
+    private boolean childDocHasParentField = false;
+    private final String parentFieldName;
+
+    @Override
+    public void reset() {
+      lastDocHasParentField = false;
+      childDocHasParentField = false;
+    }
+
+    private ParentBlockValidator(Sort sort) {
+      this.parentFieldName = sort.getParentField();
+    }
+
+    @Override
+    public void beforeDocument() {
+      if (childDocHasParentField == false && lastDocHasParentField) {
+        childDocHasParentField = true;
+      }
+      lastDocHasParentField = false;
+    }
+
+    @Override
+    public void accept(IndexableField field) {
+      if (parentFieldName != null) {
+        if (parentFieldName.equals(field.name())
+            && DocValuesType.NUMERIC == field.fieldType().docValuesType()) {
+          lastDocHasParentField = true;
+        }
+      }
+    }
+
+    @Override
+    public void afterDocument() {
+      if (childDocHasParentField) {
+        throw new IllegalArgumentException(
+            "only the last document in the block must contain a numeric doc 
values field named: "
+                + parentFieldName);
+      }
+    }
+
+    @Override
+    public void afterDocuments(int numDocs) {
+      if (numDocs > 1 && parentFieldName == null) {
+        throw new IllegalArgumentException(
+            "A parent field must be set in order to use document blocks with 
index sorting");
+      }
+      if (parentFieldName != null && lastDocHasParentField == false) {
+        throw new IllegalArgumentException(
+            "the last document in the block must contain a numeric doc values 
field named: "

Review Comment:
   Maybe `must contain the parent doc values field`?



##########
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java:
##########
@@ -262,6 +277,73 @@ long updateDocuments(
     }
   }
 
+  private interface DocValidator extends Consumer<IndexableField> {
+
+    default void afterDocument() {}
+
+    default void beforeDocument() {}
+
+    default void afterDocuments(int numDocs) {}
+
+    default void reset() {}
+  }
+
+  private static class ParentBlockValidator implements DocValidator {
+    private boolean lastDocHasParentField = false;
+    private boolean childDocHasParentField = false;
+    private final String parentFieldName;
+
+    @Override
+    public void reset() {
+      lastDocHasParentField = false;
+      childDocHasParentField = false;
+    }
+
+    private ParentBlockValidator(Sort sort) {
+      this.parentFieldName = sort.getParentField();
+    }
+
+    @Override
+    public void beforeDocument() {
+      if (childDocHasParentField == false && lastDocHasParentField) {
+        childDocHasParentField = true;
+      }
+      lastDocHasParentField = false;
+    }
+
+    @Override
+    public void accept(IndexableField field) {
+      if (parentFieldName != null) {
+        if (parentFieldName.equals(field.name())
+            && DocValuesType.NUMERIC == field.fieldType().docValuesType()) {
+          lastDocHasParentField = true;
+        }
+      }
+    }
+
+    @Override
+    public void afterDocument() {
+      if (childDocHasParentField) {
+        throw new IllegalArgumentException(
+            "only the last document in the block must contain a numeric doc 
values field named: "
+                + parentFieldName);
+      }
+    }
+
+    @Override
+    public void afterDocuments(int numDocs) {
+      if (numDocs > 1 && parentFieldName == null) {
+        throw new IllegalArgumentException(
+            "A parent field must be set in order to use document blocks with 
index sorting");

Review Comment:
   Lowercase `a` for consistency?  And maybe point to `Sort.setParentField` or 
so?



##########
lucene/core/src/java/org/apache/lucene/search/Sort.java:
##########
@@ -59,10 +61,28 @@ public Sort() {
    * is still a tie after all SortFields are checked, the internal Lucene 
docid is used to break it.
    */
   public Sort(SortField... fields) {
+    this(null, fields);
+  }
+
+  /**
+   * Sets the sort to the given criteria in succession: the first SortField is 
checked first, but if
+   * it produces a tie, then the second SortField is used to break the tie, 
etc. Finally, if there
+   * is still a tie after all SortFields are checked, the internal Lucene 
docid is used to break it.
+   *
+   * @param parentField the name of a numeric doc values field that marks the 
last document of a
+   *     document blocks indexed with {@link
+   *     org.apache.lucene.index.IndexWriter#addDocuments(Iterable)} or it's 
update relatives. This
+   *     is required for indices that use index sorting in combination with 
document blocks in order
+   *     to maintain the document order of the blocks documents. Index sorting 
will effectively

Review Comment:
   Does this mean the Sort is otherwise free to break up the block?  I.e. this 
`parentField` will override sort fields that might attempt to sort child docs 
away from one another?



##########
lucene/core/src/java/org/apache/lucene/search/Sort.java:
##########
@@ -59,10 +61,28 @@ public Sort() {
    * is still a tie after all SortFields are checked, the internal Lucene 
docid is used to break it.
    */
   public Sort(SortField... fields) {
+    this(null, fields);
+  }
+
+  /**
+   * Sets the sort to the given criteria in succession: the first SortField is 
checked first, but if
+   * it produces a tie, then the second SortField is used to break the tie, 
etc. Finally, if there
+   * is still a tie after all SortFields are checked, the internal Lucene 
docid is used to break it.
+   *
+   * @param parentField the name of a numeric doc values field that marks the 
last document of a
+   *     document blocks indexed with {@link
+   *     org.apache.lucene.index.IndexWriter#addDocuments(Iterable)} or it's 
update relatives. This
+   *     is required for indices that use index sorting in combination with 
document blocks in order
+   *     to maintain the document order of the blocks documents. Index sorting 
will effectively
+   *     compare the parent (last document) of a block in order to stable sort 
all it's adjacent
+   *     documents that belong to a block. This field must be a numeric doc 
values field a

Review Comment:
   Remove that errant last `a` and replace with a period maybe?



##########
lucene/core/src/java/org/apache/lucene/search/Sort.java:
##########
@@ -59,10 +61,28 @@ public Sort() {
    * is still a tie after all SortFields are checked, the internal Lucene 
docid is used to break it.
    */
   public Sort(SortField... fields) {
+    this(null, fields);
+  }
+
+  /**
+   * Sets the sort to the given criteria in succession: the first SortField is 
checked first, but if
+   * it produces a tie, then the second SortField is used to break the tie, 
etc. Finally, if there
+   * is still a tie after all SortFields are checked, the internal Lucene 
docid is used to break it.
+   *
+   * @param parentField the name of a numeric doc values field that marks the 
last document of a
+   *     document blocks indexed with {@link
+   *     org.apache.lucene.index.IndexWriter#addDocuments(Iterable)} or it's 
update relatives. This

Review Comment:
   s/it's/its



##########
lucene/core/src/test/org/apache/lucene/index/TestIndexSorting.java:
##########
@@ -3173,4 +3173,184 @@ public void 
testSortDocsAndFreqsAndPositionsAndOffsets() throws IOException {
     reader.close();
     dir.close();
   }
+
+  public void testBlockIsMissingParentField() throws IOException {
+    try (Directory dir = newDirectory()) {
+      IndexWriterConfig iwc = new IndexWriterConfig(new 
MockAnalyzer(random()));
+      String parentField = "parent";
+      Sort indexSort = new Sort(parentField, new SortField("foo", 
SortField.Type.INT));
+      iwc.setIndexSort(indexSort);
+      try (IndexWriter writer = new IndexWriter(dir, iwc)) {
+        List<Runnable> runnabels =
+            Arrays.asList(
+                () -> {
+                  IllegalArgumentException ex =
+                      expectThrows(
+                          IllegalArgumentException.class,
+                          () -> {
+                            writer.addDocuments(Arrays.asList(new Document(), 
new Document()));
+                          });
+                  assertEquals(
+                      "the last document in the block must contain a numeric 
doc values field named: parent",
+                      ex.getMessage());
+                },
+                () -> {
+                  IllegalArgumentException ex =
+                      expectThrows(
+                          IllegalArgumentException.class,
+                          () -> {
+                            Document doc = new Document();
+                            doc.add(new NumericDocValuesField("parent", 0));
+                            writer.addDocuments(Arrays.asList(doc, new 
Document()));
+                          });
+                  assertEquals(
+                      "only the last document in the block must contain a 
numeric doc values field named: parent",

Review Comment:
   @msokolov I think multiple joins (nested levels) would work, except the user 
cannot use this parent DV field to record the sub-structure?  Another field 
must be added by the user to record the child -> grandchild blocks.  I agree 
it'd be nice if this parent field would accommodate more than one level of 
nesting, but maybe that can be a followon improvement.
   
   This field would only handle a single level (the entire block of child and 
grandchild docs).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to