mikemccand commented on code in PR #12829:
URL: https://github.com/apache/lucene/pull/12829#discussion_r1406124683
##########
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java:
##########
@@ -262,6 +277,73 @@ long updateDocuments(
}
}
+ private interface DocValidator extends Consumer<IndexableField> {
Review Comment:
Do we really need a new interface here? We have only one validator now
(`ParentBlockValidator`) -- can we just make a concrete instance of that which
IW calls?
##########
lucene/core/src/java/org/apache/lucene/search/Sort.java:
##########
@@ -59,10 +61,28 @@ public Sort() {
* is still a tie after all SortFields are checked, the internal Lucene
docid is used to break it.
*/
public Sort(SortField... fields) {
+ this(null, fields);
+ }
+
+ /**
+ * Sets the sort to the given criteria in succession: the first SortField is
checked first, but if
+ * it produces a tie, then the second SortField is used to break the tie,
etc. Finally, if there
+ * is still a tie after all SortFields are checked, the internal Lucene
docid is used to break it.
+ *
+ * @param parentField the name of a numeric doc values field that marks the
last document of a
+ * document blocks indexed with {@link
+ * org.apache.lucene.index.IndexWriter#addDocuments(Iterable)} or it's
update relatives. This
+ * is required for indices that use index sorting in combination with
document blocks in order
+ * to maintain the document order of the blocks documents. Index sorting
will effectively
+ * compare the parent (last document) of a block in order to stable sort
all it's adjacent
Review Comment:
s/it's/its
##########
lucene/core/src/java/org/apache/lucene/search/Sort.java:
##########
@@ -59,10 +61,28 @@ public Sort() {
* is still a tie after all SortFields are checked, the internal Lucene
docid is used to break it.
*/
public Sort(SortField... fields) {
+ this(null, fields);
+ }
+
+ /**
+ * Sets the sort to the given criteria in succession: the first SortField is
checked first, but if
+ * it produces a tie, then the second SortField is used to break the tie,
etc. Finally, if there
+ * is still a tie after all SortFields are checked, the internal Lucene
docid is used to break it.
+ *
+ * @param parentField the name of a numeric doc values field that marks the
last document of a
Review Comment:
Can we make it clearer that you only use this for index-time sorting? It
has no meaning at search time (and should not be set)?
Or ... does it have search-time meaning? Can I use this to retrieve whole
blocks in my search hits?
##########
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java:
##########
@@ -262,6 +277,73 @@ long updateDocuments(
}
}
+ private interface DocValidator extends Consumer<IndexableField> {
+
+ default void afterDocument() {}
+
+ default void beforeDocument() {}
+
+ default void afterDocuments(int numDocs) {}
+
+ default void reset() {}
+ }
+
+ private static class ParentBlockValidator implements DocValidator {
+ private boolean lastDocHasParentField = false;
Review Comment:
You don't need the `= false` -- it's java's default already :)
##########
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java:
##########
@@ -262,6 +277,73 @@ long updateDocuments(
}
}
+ private interface DocValidator extends Consumer<IndexableField> {
+
+ default void afterDocument() {}
+
+ default void beforeDocument() {}
+
+ default void afterDocuments(int numDocs) {}
+
+ default void reset() {}
+ }
+
+ private static class ParentBlockValidator implements DocValidator {
+ private boolean lastDocHasParentField = false;
+ private boolean childDocHasParentField = false;
+ private final String parentFieldName;
+
+ @Override
+ public void reset() {
+ lastDocHasParentField = false;
+ childDocHasParentField = false;
+ }
+
+ private ParentBlockValidator(Sort sort) {
+ this.parentFieldName = sort.getParentField();
+ }
+
+ @Override
+ public void beforeDocument() {
+ if (childDocHasParentField == false && lastDocHasParentField) {
+ childDocHasParentField = true;
+ }
+ lastDocHasParentField = false;
+ }
+
+ @Override
+ public void accept(IndexableField field) {
+ if (parentFieldName != null) {
+ if (parentFieldName.equals(field.name())
+ && DocValuesType.NUMERIC == field.fieldType().docValuesType()) {
+ lastDocHasParentField = true;
+ }
+ }
+ }
+
+ @Override
+ public void afterDocument() {
+ if (childDocHasParentField) {
+ throw new IllegalArgumentException(
+ "only the last document in the block must contain a numeric doc
values field named: "
Review Comment:
Maybe reword to `"child documents must not contain the parent doc values
field \"" + parentFieldName + "\""` or so?
##########
lucene/core/src/java/org/apache/lucene/search/Sort.java:
##########
@@ -59,10 +61,28 @@ public Sort() {
* is still a tie after all SortFields are checked, the internal Lucene
docid is used to break it.
*/
public Sort(SortField... fields) {
+ this(null, fields);
+ }
+
+ /**
+ * Sets the sort to the given criteria in succession: the first SortField is
checked first, but if
+ * it produces a tie, then the second SortField is used to break the tie,
etc. Finally, if there
+ * is still a tie after all SortFields are checked, the internal Lucene
docid is used to break it.
+ *
+ * @param parentField the name of a numeric doc values field that marks the
last document of a
+ * document blocks indexed with {@link
+ * org.apache.lucene.index.IndexWriter#addDocuments(Iterable)} or it's
update relatives. This
+ * is required for indices that use index sorting in combination with
document blocks in order
+ * to maintain the document order of the blocks documents. Index sorting
will effectively
+ * compare the parent (last document) of a block in order to stable sort
all it's adjacent
+ * documents that belong to a block. This field must be a numeric doc
values field a
+ */
+ public Sort(String parentField, SortField... fields) {
Review Comment:
It's sort of strange that we are pushing this down into `Sort` since when
used at search time (for sorting hits) a `parentField` doesn't make sense?
Do we check that `parentField` is not specified when sorting hits during
searching?
##########
lucene/core/src/java/org/apache/lucene/search/Sort.java:
##########
@@ -85,7 +110,7 @@ public SortField[] getSort() {
*/
public Sort rewrite(IndexSearcher searcher) throws IOException {
boolean changed = false;
-
+ assert parentField == null;
Review Comment:
Ahh we do here -- maybe make this a real `if` since a user might incorrectly
set a `parentField` hoping for something to happen.
##########
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java:
##########
@@ -262,6 +277,73 @@ long updateDocuments(
}
}
+ private interface DocValidator extends Consumer<IndexableField> {
+
+ default void afterDocument() {}
+
+ default void beforeDocument() {}
+
+ default void afterDocuments(int numDocs) {}
+
+ default void reset() {}
+ }
+
+ private static class ParentBlockValidator implements DocValidator {
+ private boolean lastDocHasParentField = false;
+ private boolean childDocHasParentField = false;
+ private final String parentFieldName;
+
+ @Override
+ public void reset() {
+ lastDocHasParentField = false;
+ childDocHasParentField = false;
+ }
+
+ private ParentBlockValidator(Sort sort) {
+ this.parentFieldName = sort.getParentField();
+ }
+
+ @Override
+ public void beforeDocument() {
+ if (childDocHasParentField == false && lastDocHasParentField) {
+ childDocHasParentField = true;
+ }
+ lastDocHasParentField = false;
+ }
+
+ @Override
+ public void accept(IndexableField field) {
+ if (parentFieldName != null) {
+ if (parentFieldName.equals(field.name())
+ && DocValuesType.NUMERIC == field.fieldType().docValuesType()) {
+ lastDocHasParentField = true;
+ }
+ }
+ }
+
+ @Override
+ public void afterDocument() {
+ if (childDocHasParentField) {
+ throw new IllegalArgumentException(
+ "only the last document in the block must contain a numeric doc
values field named: "
+ + parentFieldName);
+ }
+ }
+
+ @Override
+ public void afterDocuments(int numDocs) {
+ if (numDocs > 1 && parentFieldName == null) {
+ throw new IllegalArgumentException(
+ "A parent field must be set in order to use document blocks with
index sorting");
+ }
+ if (parentFieldName != null && lastDocHasParentField == false) {
+ throw new IllegalArgumentException(
+ "the last document in the block must contain a numeric doc values
field named: "
Review Comment:
Maybe `must contain the parent doc values field`?
##########
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java:
##########
@@ -262,6 +277,73 @@ long updateDocuments(
}
}
+ private interface DocValidator extends Consumer<IndexableField> {
+
+ default void afterDocument() {}
+
+ default void beforeDocument() {}
+
+ default void afterDocuments(int numDocs) {}
+
+ default void reset() {}
+ }
+
+ private static class ParentBlockValidator implements DocValidator {
+ private boolean lastDocHasParentField = false;
+ private boolean childDocHasParentField = false;
+ private final String parentFieldName;
+
+ @Override
+ public void reset() {
+ lastDocHasParentField = false;
+ childDocHasParentField = false;
+ }
+
+ private ParentBlockValidator(Sort sort) {
+ this.parentFieldName = sort.getParentField();
+ }
+
+ @Override
+ public void beforeDocument() {
+ if (childDocHasParentField == false && lastDocHasParentField) {
+ childDocHasParentField = true;
+ }
+ lastDocHasParentField = false;
+ }
+
+ @Override
+ public void accept(IndexableField field) {
+ if (parentFieldName != null) {
+ if (parentFieldName.equals(field.name())
+ && DocValuesType.NUMERIC == field.fieldType().docValuesType()) {
+ lastDocHasParentField = true;
+ }
+ }
+ }
+
+ @Override
+ public void afterDocument() {
+ if (childDocHasParentField) {
+ throw new IllegalArgumentException(
+ "only the last document in the block must contain a numeric doc
values field named: "
+ + parentFieldName);
+ }
+ }
+
+ @Override
+ public void afterDocuments(int numDocs) {
+ if (numDocs > 1 && parentFieldName == null) {
+ throw new IllegalArgumentException(
+ "A parent field must be set in order to use document blocks with
index sorting");
Review Comment:
Lowercase `a` for consistency? And maybe point to `Sort.setParentField` or
so?
##########
lucene/core/src/java/org/apache/lucene/search/Sort.java:
##########
@@ -59,10 +61,28 @@ public Sort() {
* is still a tie after all SortFields are checked, the internal Lucene
docid is used to break it.
*/
public Sort(SortField... fields) {
+ this(null, fields);
+ }
+
+ /**
+ * Sets the sort to the given criteria in succession: the first SortField is
checked first, but if
+ * it produces a tie, then the second SortField is used to break the tie,
etc. Finally, if there
+ * is still a tie after all SortFields are checked, the internal Lucene
docid is used to break it.
+ *
+ * @param parentField the name of a numeric doc values field that marks the
last document of a
+ * document blocks indexed with {@link
+ * org.apache.lucene.index.IndexWriter#addDocuments(Iterable)} or it's
update relatives. This
+ * is required for indices that use index sorting in combination with
document blocks in order
+ * to maintain the document order of the blocks documents. Index sorting
will effectively
Review Comment:
Does this mean the Sort is otherwise free to break up the block? I.e. this
`parentField` will override sort fields that might attempt to sort child docs
away from one another?
##########
lucene/core/src/java/org/apache/lucene/search/Sort.java:
##########
@@ -59,10 +61,28 @@ public Sort() {
* is still a tie after all SortFields are checked, the internal Lucene
docid is used to break it.
*/
public Sort(SortField... fields) {
+ this(null, fields);
+ }
+
+ /**
+ * Sets the sort to the given criteria in succession: the first SortField is
checked first, but if
+ * it produces a tie, then the second SortField is used to break the tie,
etc. Finally, if there
+ * is still a tie after all SortFields are checked, the internal Lucene
docid is used to break it.
+ *
+ * @param parentField the name of a numeric doc values field that marks the
last document of a
+ * document blocks indexed with {@link
+ * org.apache.lucene.index.IndexWriter#addDocuments(Iterable)} or it's
update relatives. This
+ * is required for indices that use index sorting in combination with
document blocks in order
+ * to maintain the document order of the blocks documents. Index sorting
will effectively
+ * compare the parent (last document) of a block in order to stable sort
all it's adjacent
+ * documents that belong to a block. This field must be a numeric doc
values field a
Review Comment:
Remove that errant last `a` and replace with a period maybe?
##########
lucene/core/src/java/org/apache/lucene/search/Sort.java:
##########
@@ -59,10 +61,28 @@ public Sort() {
* is still a tie after all SortFields are checked, the internal Lucene
docid is used to break it.
*/
public Sort(SortField... fields) {
+ this(null, fields);
+ }
+
+ /**
+ * Sets the sort to the given criteria in succession: the first SortField is
checked first, but if
+ * it produces a tie, then the second SortField is used to break the tie,
etc. Finally, if there
+ * is still a tie after all SortFields are checked, the internal Lucene
docid is used to break it.
+ *
+ * @param parentField the name of a numeric doc values field that marks the
last document of a
+ * document blocks indexed with {@link
+ * org.apache.lucene.index.IndexWriter#addDocuments(Iterable)} or it's
update relatives. This
Review Comment:
s/it's/its
##########
lucene/core/src/test/org/apache/lucene/index/TestIndexSorting.java:
##########
@@ -3173,4 +3173,184 @@ public void
testSortDocsAndFreqsAndPositionsAndOffsets() throws IOException {
reader.close();
dir.close();
}
+
+ public void testBlockIsMissingParentField() throws IOException {
+ try (Directory dir = newDirectory()) {
+ IndexWriterConfig iwc = new IndexWriterConfig(new
MockAnalyzer(random()));
+ String parentField = "parent";
+ Sort indexSort = new Sort(parentField, new SortField("foo",
SortField.Type.INT));
+ iwc.setIndexSort(indexSort);
+ try (IndexWriter writer = new IndexWriter(dir, iwc)) {
+ List<Runnable> runnabels =
+ Arrays.asList(
+ () -> {
+ IllegalArgumentException ex =
+ expectThrows(
+ IllegalArgumentException.class,
+ () -> {
+ writer.addDocuments(Arrays.asList(new Document(),
new Document()));
+ });
+ assertEquals(
+ "the last document in the block must contain a numeric
doc values field named: parent",
+ ex.getMessage());
+ },
+ () -> {
+ IllegalArgumentException ex =
+ expectThrows(
+ IllegalArgumentException.class,
+ () -> {
+ Document doc = new Document();
+ doc.add(new NumericDocValuesField("parent", 0));
+ writer.addDocuments(Arrays.asList(doc, new
Document()));
+ });
+ assertEquals(
+ "only the last document in the block must contain a
numeric doc values field named: parent",
Review Comment:
@msokolov I think multiple joins (nested levels) would work, except the user
cannot use this parent DV field to record the sub-structure? Another field
must be added by the user to record the child -> grandchild blocks. I agree
it'd be nice if this parent field would accommodate more than one level of
nesting, but maybe that can be a followon improvement.
This field would only handle a single level (the entire block of child and
grandchild docs).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]