Re: [PR] Optimize FST on-heap BytesReader [lucene]
mikemccand commented on code in PR #12879:
URL: https://github.com/apache/lucene/pull/12879#discussion_r1442807365

## lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java:

```
@@ -56,14 +66,59 @@ public long ramBytesUsed() {
   public void freeze() {
     frozen = true;
-    // this operation are costly, so we want to compute it once and cache
-    dataInput = dataOutput.toDataInput();
+    // this operation is costly, so we want to compute it once and cache
+    this.byteBuffers = dataOutput.toWriteableBufferList();
   }

   @Override
   public FST.BytesReader getReverseBytesReader() {
-    assert dataInput != null; // freeze() must be called first
-    return new ReverseRandomAccessReader(dataInput);
+    assert byteBuffers != null; // freeze() must be called first
+    if (byteBuffers.size() == 1) {
+      // use a faster implementation for single-block case
+      return new ReverseBytesReader(byteBuffers.get(0).array());
```

Review Comment:

Awesome, thanks, looks great @dungba88!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
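[Editor's note] The single-block fast path in the diff above works because reading backwards over one `byte[]` is a plain array access with a decrementing position, with no per-read block lookup. A minimal standalone sketch of that core idea (this is not Lucene's actual `ReverseBytesReader`, just an illustration):

```java
// Illustrative sketch: a reverse reader over a single byte[] only needs
// an index that counts down. Class and method names are hypothetical.
public class ReverseReadDemo {
  static class SimpleReverseReader {
    private final byte[] bytes;
    private int pos;

    SimpleReverseReader(byte[] bytes) {
      this.bytes = bytes;
      this.pos = bytes.length - 1; // start at the last byte
    }

    byte readByte() {
      return bytes[pos--]; // position moves backwards on each read
    }
  }

  public static void main(String[] args) {
    SimpleReverseReader r = new SimpleReverseReader(new byte[] {1, 2, 3});
    System.out.println(r.readByte()); // last byte first
    System.out.println(r.readByte());
  }
}
```

A multi-block reader would additionally have to map each position to a (block, offset) pair, which is the overhead the single-block path avoids.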
Re: [PR] Add support for index sorting with document blocks [lucene]
mikemccand commented on code in PR #12829:
URL: https://github.com/apache/lucene/pull/12829#discussion_r1442838857

## lucene/core/src/test/org/apache/lucene/index/TestAddIndexes.java:

```
@@ -1937,4 +1937,97 @@ public void setMergeInfo(SegmentCommitInfo info) {
     targetDir.close();
     sourceDir.close();
   }
+
+  public void testIllegalParentDocChange() throws Exception {
+    Directory dir1 = newDirectory();
+    IndexWriterConfig iwc1 = newIndexWriterConfig(new MockAnalyzer(random()));
+    iwc1.setParentField("foobar");
+    RandomIndexWriter w1 = new RandomIndexWriter(random(), dir1, iwc1);
+    Document parent = new Document();
+    w1.addDocuments(Arrays.asList(new Document(), new Document(), parent));
+    w1.commit();
+    w1.addDocuments(Arrays.asList(new Document(), new Document(), parent));
+    w1.commit();
+    // so the index sort is in fact burned into the index:
+    w1.forceMerge(1);
+    w1.close();
+
+    Directory dir2 = newDirectory();
+    IndexWriterConfig iwc2 = newIndexWriterConfig(new MockAnalyzer(random()));
+    iwc2.setParentField("foo");
+    RandomIndexWriter w2 = new RandomIndexWriter(random(), dir2, iwc2);
+
+    IndexReader r1 = DirectoryReader.open(dir1);
+    String message =
+        expectThrows(
+                IllegalArgumentException.class,
+                () -> {
+                  w2.addIndexes((SegmentReader) getOnlyLeafReader(r1));
+                })
+            .getMessage();
+    assertEquals(
+        "cannot configure [foo] as parent document field ; this index uses [foobar] as parent document field already",
```

Review Comment:

See I think this exception message is backwards :) Shouldn't it be:
```
cannot configure [foobar] as parent document field ; this index uses [foo] as parent document field already
```
I.e. "this index" is referring to `dir2`/`w2` (the index into which we are adding an external index), and it is configured to use `foo` as the parent field, while the incoming index via `addIndexes` (`dir1`) is attempting (illegally) to change that to `foobar`?
Edit: OK I think this all comes down to `this index` being super confusing to me :) If it's referring to the incoming (newly added) index via addIndexes, the message makes sense. But I would think `this index` means the one you opened with IW...

## lucene/core/src/java/org/apache/lucene/index/FieldInfos.java:

```
@@ -78,6 +80,7 @@ public FieldInfos(FieldInfo[] infos) {
     boolean hasPointValues = false;
     boolean hasVectorValues = false;
     String softDeletesField = null;
+    String parentField = null;
```

Review Comment:

Maybe fix these silly pointless initializers ^^? This `FieldInfos.java` is some of the oldest, most "pristine"/"untouched" code in Lucene ;) You can almost count the tree rings ...

## lucene/core/src/test/org/apache/lucene/index/TestAddIndexes.java:

```
@@ -1937,4 +1937,97 @@ public void setMergeInfo(SegmentCommitInfo info) {
     targetDir.close();
     sourceDir.close();
   }
+
+  public void testIllegalParentDocChange() throws Exception {
+    Directory dir1 = newDirectory();
+    IndexWriterConfig iwc1 = newIndexWriterConfig(new MockAnalyzer(random()));
+    iwc1.setParentField("foobar");
+    RandomIndexWriter w1 = new RandomIndexWriter(random(), dir1, iwc1);
+    Document parent = new Document();
+    w1.addDocuments(Arrays.asList(new Document(), new Document(), parent));
+    w1.commit();
+    w1.addDocuments(Arrays.asList(new Document(), new Document(), parent));
+    w1.commit();
+    // so the index sort is in fact burned into the index:
+    w1.forceMerge(1);
+    w1.close();
+
+    Directory dir2 = newDirectory();
+    IndexWriterConfig iwc2 = newIndexWriterConfig(new MockAnalyzer(random()));
+    iwc2.setParentField("foo");
+    RandomIndexWriter w2 = new RandomIndexWriter(random(), dir2, iwc2);
+
+    IndexReader r1 = DirectoryReader.open(dir1);
+    String message =
+        expectThrows(
+                IllegalArgumentException.class,
+                () -> {
+                  w2.addIndexes((SegmentReader) getOnlyLeafReader(r1));
+                })
+            .getMessage();
+    assertEquals(
+        "cannot configure [foo] as parent document field ; this index uses [foobar] as parent document field already",
+        message);
+
+    message =
+        expectThrows(
+                IllegalArgumentException.class,
+                () -> {
+                  w2.addIndexes(dir1);
+                })
+            .getMessage();
+    assertEquals(
+        "cannot configure [foo] as parent document field ; this index uses [foobar] as parent document field already",
```

Review Comment:

Same here?

## lucene/core/src/java/org/apache/lucene/index/FieldInfos.java:

```
@@ -437,6 +488,33 @@ private void verifySoftDeletedFieldName(String fieldName, boolean isSoftDeletesF
     }
   }

+  private void verifyParentFieldName(St
```
Re: [PR] Improve vector search speed by using FixedBitSet [lucene]
benwtrent closed pull request #12789: Improve vector search speed by using FixedBitSet
URL: https://github.com/apache/lucene/pull/12789
Re: [I] Taxonomy facets: can we change massive `int[]` for parent/child/sibling tree to paged/block `int[]` to reduce RAM pressure? [lucene]
msfroh commented on issue #12989:
URL: https://github.com/apache/lucene/issues/12989#issuecomment-1879227454

> What do you think, did you have something else in mind?

Oh -- I didn't have anything in mind. I just saw the issue and thought, "Hey, I could figure out how to do that!"

Sounds like you've got it in hand, though!
Re: [I] Taxonomy facets: can we change massive `int[]` for parent/child/sibling tree to paged/block `int[]` to reduce RAM pressure? [lucene]
stefanvodita commented on issue #12989:
URL: https://github.com/apache/lucene/issues/12989#issuecomment-1879330131

I'd be happy to work together on it! If we go the route I was proposing, there's a non-trivial amount of work to do:
1. Create the new interface for taxonomy arrays and use it with taxonomy facets and in the taxo reader.
2. Augment the `IntBlockPool` with convenience methods that support this new use-case and implement the new taxo array interface using the `IntBlockPool`.

1 and 2 can be done independently, so we could each take one of those work streams. I'll start on it in the next few days, but feel free to jump in if you get the chance.
Re: [I] Taxonomy facets: can we change massive `int[]` for parent/child/sibling tree to paged/block `int[]` to reduce RAM pressure? [lucene]
msfroh commented on issue #12989:
URL: https://github.com/apache/lucene/issues/12989#issuecomment-1879403796

I took a look and I think we might be able to do it a little easier:

```java
public abstract class ParallelTaxonomyArrays {
  public class ChunkedArray {
    private final int chunkSize;
    private final int[][] chunks;

    public ChunkedArray(int chunkSize, int[][] chunks) {
      this.chunkSize = chunkSize;
      this.chunks = chunks;
    }

    public int get(int i) {
      int chunkNum = i / chunkSize;
      return chunks[chunkNum][i - (chunkNum * chunkSize)];
    }

    public int length() {
      return chunkSize * chunks.length;
    }
  }

  /** Sole constructor. */
  public ParallelTaxonomyArrays() {}

  public abstract ChunkedArray parents();

  public abstract ChunkedArray children();

  public abstract ChunkedArray siblings();
}
```

Then within `TaxonomyIndexArrays`, we can focus on building `int[][]` instances that get wrapped in `ChunkedArray`. I feel like `IntBlockPool` might be overkill?
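[Editor's note] The index arithmetic in msfroh's sketch (flat index `i` maps to chunk `i / chunkSize` and offset `i - chunkNum * chunkSize`) can be checked standalone. This compact version mirrors the snippet; the class and method names follow the sketch, not Lucene's actual API:

```java
// Minimal standalone version of the ChunkedArray idea from the comment
// above, made a static nested class so it compiles on its own.
public class ChunkedArrayDemo {
  static class ChunkedArray {
    private final int chunkSize;
    private final int[][] chunks;

    ChunkedArray(int chunkSize, int[][] chunks) {
      this.chunkSize = chunkSize;
      this.chunks = chunks;
    }

    int get(int i) {
      // Divide to find the chunk, subtract to find the offset within it.
      int chunkNum = i / chunkSize;
      return chunks[chunkNum][i - (chunkNum * chunkSize)];
    }

    int length() {
      return chunkSize * chunks.length;
    }
  }

  public static void main(String[] args) {
    int[][] chunks = {{10, 11, 12, 13}, {14, 15, 16, 17}};
    ChunkedArray arr = new ChunkedArray(4, chunks);
    System.out.println(arr.get(0));   // element 0 lives in chunk 0 -> 10
    System.out.println(arr.get(5));   // element 5 lives in chunk 1, offset 1 -> 15
    System.out.println(arr.length()); // 2 chunks of 4 -> 8
  }
}
```

Note one consequence of `length()` as written: it reports whole chunks, so a logically shorter array rounds up to a chunk-size multiple unless the caller tracks the true count separately.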
[PR] Split taxonomy arrays across chunks [lucene]
msfroh opened a new pull request, #12995:
URL: https://github.com/apache/lucene/pull/12995

### Description

Taxonomy ordinals are added in an append-only way. Instead of reallocating a single big array when loading new taxonomy ordinals and copying all the values from the previous arrays over individually, we can keep blocks of ordinals and reuse blocks from the previous arrays.

Resolves https://github.com/apache/lucene/issues/12989
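[Editor's note] The reuse described in this PR can be sketched as follows: when the taxonomy grows, already-filled chunks from the old `int[][]` are carried over by reference and only new tail chunks are allocated, instead of copying every element into one bigger flat array. The names `growChunks` and `CHUNK_SIZE` are illustrative, not the PR's actual implementation:

```java
// Hedged sketch of append-only growth with chunk reuse: old chunks are
// shared by reference between the old and new outer arrays.
public class ChunkReuseDemo {
  static final int CHUNK_SIZE = 4; // illustrative; the PR uses a larger chunk

  static int[][] growChunks(int[][] old, int newLength) {
    int newChunkCount = (newLength + CHUNK_SIZE - 1) / CHUNK_SIZE; // ceil
    int[][] grown = new int[newChunkCount][];
    // Reuse existing chunks by reference -- no element-by-element copy.
    for (int i = 0; i < old.length; i++) {
      grown[i] = old[i];
    }
    // Allocate only the chunks that did not exist before.
    for (int i = old.length; i < newChunkCount; i++) {
      grown[i] = new int[CHUNK_SIZE];
    }
    return grown;
  }

  public static void main(String[] args) {
    int[][] v1 = growChunks(new int[0][], 4); // first load: 1 chunk
    java.util.Arrays.fill(v1[0], 7);
    int[][] v2 = growChunks(v1, 8);           // grow: old chunk carried over
    System.out.println(v2[0] == v1[0]);       // same array object, not a copy
  }
}
```

The trade-off is that readers holding the old outer array keep seeing the shared chunks, which is safe precisely because taxonomy ordinals are append-only.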
Re: [I] Taxonomy facets: can we change massive `int[]` for parent/child/sibling tree to paged/block `int[]` to reduce RAM pressure? [lucene]
msfroh commented on issue #12989:
URL: https://github.com/apache/lucene/issues/12989#issuecomment-1879502787

I ended up running with that idea (sort of) and implemented this: https://github.com/apache/lucene/pull/12995

The unit tests pass, but I don't think any of them allocate more than 8192 ordinals (the chunk size that I set).