mikemccand commented on code in PR #12829: URL: https://github.com/apache/lucene/pull/12829#discussion_r1439716624
########## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java: ########## @@ -1176,34 +1176,44 @@ public static Status.IndexSortStatus testSort( comparators[i] = fields[i].getComparator(1, Pruning.NONE).getLeafComparator(readerContext); } - int maxDoc = reader.maxDoc(); - try { - - for (int docID = 1; docID < maxDoc; docID++) { - + LeafMetaData metaData = reader.getMetaData(); + FieldInfos fieldInfos = reader.getFieldInfos(); + if (metaData.hasBlocks() + && fieldInfos.getParentField() == null + && metaData.getCreatedVersionMajor() >= Version.LUCENE_10_0_0.major) { + throw new IllegalStateException( + "parent field is not set but the index has document blocks and was created with version: " + + metaData.getCreatedVersionMajor()); + } + final DocIdSetIterator iter = Review Comment: Hmm maybe un-ternary this one? ########## lucene/CHANGES.txt: ########## @@ -90,6 +90,11 @@ New Features * LUCENE-10626 Hunspell: add tools to aid dictionary editing: analysis introspection, stem expansion and stem/flag suggestion (Peter Gromov) +* GITHUB#12829: IndexWriter now preserves document blocks indexed via IndexWriter#addDocuments + et.al. when index sorting is configured. Document blocks are maintained alongside their + parent documents during sort and merge. IndexWriterConfig requires a parent field to be specified + if index sorting is used together with documents blocks. (Simon Willnauer) Review Comment: `documents blocks` -> `document blocks` ########## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java: ########## @@ -134,6 +137,8 @@ void abort() throws IOException { private final ReentrantLock lock = new ReentrantLock(); private int[] deleteDocIDs = new int[0]; private int numDeletedDocIds = 0; + private final int indexVersionCreated; Review Comment: Rename to `indexMajorVersionCreated`? ########## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java: ########## @@ -231,7 +244,23 @@ long updateDocuments( final int docsInRamBefore = numDocsInRAM; boolean allDocsIndexed = false; try { - for (Iterable<? extends IndexableField> doc : docs) { + final Iterator<? extends Iterable<? extends IndexableField>> iterator = docs.iterator(); + while (iterator.hasNext()) { + Iterable<? extends IndexableField> doc = iterator.next(); + if (parentField != null) { + if (iterator.hasNext() == false) { + doc = addParentField(doc, parentField); + } + } else if (segmentInfo.getIndexSort() != null + && iterator.hasNext() + && indexVersionCreated >= Version.LUCENE_10_0_0.major) { + // sort is configured but parent field is missing, yet we have a doc-block + // yet we must not fail if this index was created in an earlier version where this + // behavior was permitted. + throw new IllegalArgumentException( + "a parent field must be set in order to use document blocks with index sorting; see IndexWriterConfig#getParentField"); Review Comment: Maybe `#setParentField` instead? Or is the javadoc mostly on `getParentField` if so leave this be. ########## lucene/core/src/java/org/apache/lucene/index/IndexingChain.java: ########## @@ -1512,4 +1557,77 @@ void assertSameSchema(FieldInfo fi) { assertSame("point num bytes", fi.getPointNumBytes(), pointNumBytes); } } + + /** + * Wraps the given field in a reserved field and registers it as reserved. Only DWPT should do + * this to mark fields as private / reserved to prevent this fieldname to be used from the outside Review Comment: +1 to tighten up the language. ########## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java: ########## @@ -245,10 +274,11 @@ long updateDocuments( onNewDocOnRAM.run(); } } - allDocsIndexed = true; - if (numDocsInRAM - docsInRamBefore > 1) { + final int numDocs = numDocsInRAM - docsInRamBefore; + if (numDocs > 1) { segmentInfo.setHasBlocks(); Review Comment: This might mean some segments don't have `hasBlocks` set but some do, if all doc blocks in a given segment happened to have just one doc. That should be OK? Nothing checks that all segments have the same boolean value for this? Maybe add a test case to confirm? ########## lucene/core/src/java/org/apache/lucene/index/FieldInfos.java: ########## @@ -437,6 +488,31 @@ private void verifySoftDeletedFieldName(String fieldName, boolean isSoftDeletesF } } + private void verifyParentFieldName(String fieldName, boolean isParentField) { + if (isParentField) { + if (parentFieldName == null) { + throw new IllegalArgumentException( + "this index has [" + + fieldName + + "] as parent document field already but parent document field is not configured in IWC"); + } else if (fieldName.equals(parentFieldName) == false) { + throw new IllegalArgumentException( + "cannot configure [" + + parentFieldName + + "] as parent document field ; this index uses [" + + fieldName + + "] as parent document field already"); Review Comment: Extra space between `field` and `already`? ########## lucene/CHANGES.txt: ########## @@ -90,6 +90,11 @@ New Features * LUCENE-10626 Hunspell: add tools to aid dictionary editing: analysis introspection, stem expansion and stem/flag suggestion (Peter Gromov) +* GITHUB#12829: IndexWriter now preserves document blocks indexed via IndexWriter#addDocuments + et.al. when index sorting is configured. Document blocks are maintained alongside their Review Comment: Maybe say `or #updateDocuments`? Those are the only two methods? ########## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java: ########## @@ -262,6 +292,35 @@ long updateDocuments( } } + private Iterable<? extends IndexableField> addParentField( + Iterable<? extends IndexableField> doc, IndexableField parentField) { + return () -> { + final Iterator<? extends IndexableField> first = doc.iterator(); + return new Iterator<>() { + IndexableField additionalField = parentField; + + @Override + public boolean hasNext() { + return additionalField != null || first.hasNext(); + } + + @Override + public IndexableField next() { + if (additionalField != null) { + IndexableField field = additionalField; + additionalField = null; + return field; + } + if (first.hasNext()) { + IndexableField field = first.next(); + return field; Review Comment: Maybe just `return first.next();`? ########## lucene/CHANGES.txt: ########## @@ -131,6 +136,13 @@ Bug Fixes * GITHUB#12878: Fix the declared Exceptions of Expression#evaluate() to match those of DoubleValues#doubleValue(). (Uwe Schindler) +Changes in Backwards Compatibility Policy +----------------------------------------- + +* GITHUB#12829: IndexWriter#addDocuments et.al. now require a parent field name to be Review Comment: Maybe insert `For indices newly created as of 10.0.0 onwards, ...`? Also, maybe move this to `MIGRATE.txt`, or keep this one here but also add one to `MIGRATE.txt`? And `et.al.` -> `or #updateDocuments`? ########## lucene/core/src/java/org/apache/lucene/index/FieldInfos.java: ########## @@ -437,6 +486,31 @@ private void verifySoftDeletedFieldName(String fieldName, boolean isSoftDeletesF } } + private void verifyParentFieldName(String fieldName, boolean isParentField) { + if (isParentField) { + if (parentFieldName == null) { + throw new IllegalArgumentException( + "this index has [" + + fieldName + + "] as parent document field already but parent document field is not configured in IWC"); + } else if (fieldName.equals(parentFieldName) == false) { + throw new IllegalArgumentException( + "cannot configure [" + + parentFieldName + + "] as parent document field ; this index uses [" + + fieldName + + "] as parent document field already"); + } + } else if (fieldName.equals(parentFieldName)) { + throw new IllegalArgumentException( + "cannot configure [" + + parentFieldName + + "] as parent document field ; this index uses [" Review Comment: OK I think I'm confused about how this code is working/catching something. `parentFieldName` is what was configured in IWC, and the incoming `fieldName` is coming from ... a document being newly indexed where this in-memory segment is seeing this `fieldName` for the first time and creating a `FieldInfo` for it? Hmm but then how is that new `FieldInfo` setting its `.isParentField`? Sorry I gotta stare at this / think about this a bit more :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org