Re: [PR] Add support for index sorting with document blocks [lucene]

via GitHub Tue, 02 Jan 2024 11:00:22 -0800


mikemccand commented on code in PR #12829:
URL: https://github.com/apache/lucene/pull/12829#discussion_r1439716624



##########
lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:
##########
@@ -1176,34 +1176,44 @@ public static Status.IndexSortStatus testSort(
         comparators[i] = fields[i].getComparator(1, 
Pruning.NONE).getLeafComparator(readerContext);
       }
 
-      int maxDoc = reader.maxDoc();
-
       try {
-
-        for (int docID = 1; docID < maxDoc; docID++) {
-
+        LeafMetaData metaData = reader.getMetaData();
+        FieldInfos fieldInfos = reader.getFieldInfos();
+        if (metaData.hasBlocks()
+            && fieldInfos.getParentField() == null
+            && metaData.getCreatedVersionMajor() >= 
Version.LUCENE_10_0_0.major) {
+          throw new IllegalStateException(
+              "parent field is not set but the index has document blocks and 
was created with version: "
+                  + metaData.getCreatedVersionMajor());
+        }
+        final DocIdSetIterator iter =

Review Comment:
   Hmm maybe un-ternary this one?



##########
lucene/CHANGES.txt:
##########
@@ -90,6 +90,11 @@ New Features
 * LUCENE-10626 Hunspell: add tools to aid dictionary editing:
   analysis introspection, stem expansion and stem/flag suggestion (Peter 
Gromov)
 
+* GITHUB#12829: IndexWriter now preserves document blocks indexed via 
IndexWriter#addDocuments
+  et.al. when index sorting is configured. Document blocks are maintained 
alongside their
+  parent documents during sort and merge. IndexWriterConfig requires a parent 
field to be specified
+  if index sorting is used together with documents blocks. (Simon Willnauer)

Review Comment:
   `documents blocks` -> `document blocks`



##########
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java:
##########
@@ -134,6 +137,8 @@ void abort() throws IOException {
   private final ReentrantLock lock = new ReentrantLock();
   private int[] deleteDocIDs = new int[0];
   private int numDeletedDocIds = 0;
+  private final int indexVersionCreated;

Review Comment:
   Rename to `indexMajorVersionCreated`?



##########
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java:
##########
@@ -231,7 +244,23 @@ long updateDocuments(
       final int docsInRamBefore = numDocsInRAM;
       boolean allDocsIndexed = false;
       try {
-        for (Iterable<? extends IndexableField> doc : docs) {
+        final Iterator<? extends Iterable<? extends IndexableField>> iterator 
= docs.iterator();
+        while (iterator.hasNext()) {
+          Iterable<? extends IndexableField> doc = iterator.next();
+          if (parentField != null) {
+            if (iterator.hasNext() == false) {
+              doc = addParentField(doc, parentField);
+            }
+          } else if (segmentInfo.getIndexSort() != null
+              && iterator.hasNext()
+              && indexVersionCreated >= Version.LUCENE_10_0_0.major) {
+            // sort is configured but parent field is missing, yet we have a 
doc-block
+            // yet we must not fail if this index was created in an earlier 
version where this
+            // behavior was permitted.
+            throw new IllegalArgumentException(
+                "a parent field must be set in order to use document blocks 
with index sorting; see IndexWriterConfig#getParentField");

Review Comment:
   Maybe `#setParentField` instead?  Or is the javadoc mostly on 
`getParentField` if so leave this be.



##########
lucene/core/src/java/org/apache/lucene/index/IndexingChain.java:
##########
@@ -1512,4 +1557,77 @@ void assertSameSchema(FieldInfo fi) {
       assertSame("point num bytes", fi.getPointNumBytes(), pointNumBytes);
     }
   }
+
+  /**
+   * Wraps the given field in a reserved field and registers it as reserved. 
Only DWPT should do
+   * this to mark fields as private / reserved to prevent this fieldname to be 
used from the outside

Review Comment:
   +1 to tighten up the language.



##########
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java:
##########
@@ -245,10 +274,11 @@ long updateDocuments(
             onNewDocOnRAM.run();
           }
         }
-        allDocsIndexed = true;
-        if (numDocsInRAM - docsInRamBefore > 1) {
+        final int numDocs = numDocsInRAM - docsInRamBefore;
+        if (numDocs > 1) {
           segmentInfo.setHasBlocks();

Review Comment:
   This might mean some segments don't have `hasBlocks` set but some do, if all 
doc blocks in a given segment happened to have just one doc.  That should be 
OK?  Nothing checks that all segments have the same boolean value for this?  
Maybe add a test case to confirm?



##########
lucene/core/src/java/org/apache/lucene/index/FieldInfos.java:
##########
@@ -437,6 +488,31 @@ private void verifySoftDeletedFieldName(String fieldName, 
boolean isSoftDeletesF
       }
     }
 
+    private void verifyParentFieldName(String fieldName, boolean 
isParentField) {
+      if (isParentField) {
+        if (parentFieldName == null) {
+          throw new IllegalArgumentException(
+              "this index has ["
+                  + fieldName
+                  + "] as parent document field already but parent document 
field is not configured in IWC");
+        } else if (fieldName.equals(parentFieldName) == false) {
+          throw new IllegalArgumentException(
+              "cannot configure ["
+                  + parentFieldName
+                  + "] as parent document field ; this index uses ["
+                  + fieldName
+                  + "] as parent document field  already");

Review Comment:
   Extra space between `field` and `already`?



##########
lucene/CHANGES.txt:
##########
@@ -90,6 +90,11 @@ New Features
 * LUCENE-10626 Hunspell: add tools to aid dictionary editing:
   analysis introspection, stem expansion and stem/flag suggestion (Peter 
Gromov)
 
+* GITHUB#12829: IndexWriter now preserves document blocks indexed via 
IndexWriter#addDocuments
+  et.al. when index sorting is configured. Document blocks are maintained 
alongside their

Review Comment:
   Maybe say `or #updateDocuments`?  Those are the only two methods?



##########
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java:
##########
@@ -262,6 +292,35 @@ long updateDocuments(
     }
   }
 
+  private Iterable<? extends IndexableField> addParentField(
+      Iterable<? extends IndexableField> doc, IndexableField parentField) {
+    return () -> {
+      final Iterator<? extends IndexableField> first = doc.iterator();
+      return new Iterator<>() {
+        IndexableField additionalField = parentField;
+
+        @Override
+        public boolean hasNext() {
+          return additionalField != null || first.hasNext();
+        }
+
+        @Override
+        public IndexableField next() {
+          if (additionalField != null) {
+            IndexableField field = additionalField;
+            additionalField = null;
+            return field;
+          }
+          if (first.hasNext()) {
+            IndexableField field = first.next();
+            return field;

Review Comment:
   Maybe just `return first.next();`?



##########
lucene/CHANGES.txt:
##########
@@ -131,6 +136,13 @@ Bug Fixes
 * GITHUB#12878: Fix the declared Exceptions of Expression#evaluate() to match 
those
   of DoubleValues#doubleValue(). (Uwe Schindler)
 
+Changes in Backwards Compatibility Policy
+-----------------------------------------
+
+* GITHUB#12829: IndexWriter#addDocuments et.al. now require a parent field 
name to be

Review Comment:
   Maybe insert `For indices newly created as of 10.0.0 onwards, ...`?
   
   Also, maybe move this to `MIGRATE.txt`, or keep this one here but also add 
one to `MIGRATE.txt`?
   
   And `et.al.` -> `or #updateDocuments`?



##########
lucene/core/src/java/org/apache/lucene/index/FieldInfos.java:
##########
@@ -437,6 +486,31 @@ private void verifySoftDeletedFieldName(String fieldName, 
boolean isSoftDeletesF
       }
     }
 
+    private void verifyParentFieldName(String fieldName, boolean 
isParentField) {
+      if (isParentField) {
+        if (parentFieldName == null) {
+          throw new IllegalArgumentException(
+              "this index has ["
+                  + fieldName
+                  + "] as parent document field already but parent document 
field is not configured in IWC");
+        } else if (fieldName.equals(parentFieldName) == false) {
+          throw new IllegalArgumentException(
+              "cannot configure ["
+                  + parentFieldName
+                  + "] as parent document field ; this index uses ["
+                  + fieldName
+                  + "] as parent document field  already");
+        }
+      } else if (fieldName.equals(parentFieldName)) {
+        throw new IllegalArgumentException(
+            "cannot configure ["
+                + parentFieldName
+                + "] as parent document field ; this index uses ["

Review Comment:
   OK I think I'm confused about how this code is working/catching something.
   
   `parentFieldName` is what was configured in IWC, and the incoming 
`fieldName` is coming from ... a document being newly indexed where this 
in-memory segment is seeing this `fieldName` for the first time and creating a 
`FieldInfo` for it?  Hmm but then how is that new `FieldInfo` setting its 
`.isParentField`?  Sorry I gotta stare at this / think about this a bit more :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Add support for index sorting with document blocks [lucene]

Reply via email to