Re: [PR] Optimize FST on-heap BytesReader [lucene]

2024-01-05 Thread via GitHub


mikemccand commented on code in PR #12879:
URL: https://github.com/apache/lucene/pull/12879#discussion_r1442807365


##
lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java:
##
@@ -56,14 +66,59 @@ public long ramBytesUsed() {
 
   public void freeze() {
     frozen = true;
-    // this operation are costly, so we want to compute it once and cache
-    dataInput = dataOutput.toDataInput();
+    // this operation is costly, so we want to compute it once and cache
+    this.byteBuffers = dataOutput.toWriteableBufferList();
   }
 
   @Override
   public FST.BytesReader getReverseBytesReader() {
-    assert dataInput != null; // freeze() must be called first
-    return new ReverseRandomAccessReader(dataInput);
+    assert byteBuffers != null; // freeze() must be called first
+    if (byteBuffers.size() == 1) {
+      // use a faster implementation for single-block case
+      return new ReverseBytesReader(byteBuffers.get(0).array());

Review Comment:
   Awesome, thanks, looks great @dungba88!





Re: [PR] Add support for index sorting with document blocks [lucene]

2024-01-05 Thread via GitHub


mikemccand commented on code in PR #12829:
URL: https://github.com/apache/lucene/pull/12829#discussion_r1442838857


##
lucene/core/src/test/org/apache/lucene/index/TestAddIndexes.java:
##
@@ -1937,4 +1937,97 @@ public void setMergeInfo(SegmentCommitInfo info) {
     targetDir.close();
     sourceDir.close();
   }
+
+  public void testIllegalParentDocChange() throws Exception {
+    Directory dir1 = newDirectory();
+    IndexWriterConfig iwc1 = newIndexWriterConfig(new MockAnalyzer(random()));
+    iwc1.setParentField("foobar");
+    RandomIndexWriter w1 = new RandomIndexWriter(random(), dir1, iwc1);
+    Document parent = new Document();
+    w1.addDocuments(Arrays.asList(new Document(), new Document(), parent));
+    w1.commit();
+    w1.addDocuments(Arrays.asList(new Document(), new Document(), parent));
+    w1.commit();
+    // so the index sort is in fact burned into the index:
+    w1.forceMerge(1);
+    w1.close();
+
+    Directory dir2 = newDirectory();
+    IndexWriterConfig iwc2 = newIndexWriterConfig(new MockAnalyzer(random()));
+    iwc2.setParentField("foo");
+    RandomIndexWriter w2 = new RandomIndexWriter(random(), dir2, iwc2);
+
+    IndexReader r1 = DirectoryReader.open(dir1);
+    String message =
+        expectThrows(
+                IllegalArgumentException.class,
+                () -> {
+                  w2.addIndexes((SegmentReader) getOnlyLeafReader(r1));
+                })
+            .getMessage();
+    assertEquals(
+        "cannot configure [foo] as parent document field ; this index uses [foobar] as parent document field already",

Review Comment:
   See I think this exception message is backwards :)  Shouldn't it be:
   
   ```
   cannot configure [foobar] as parent document field ; this index uses [foo] as parent document field already
   ```
   
   I.e. "this index" is referring to `dir2`/`w2` (the index into which we are adding an external index), and it is configured to use `foo` as the parent field, while the incoming index via `addIndexes` (`dir1`) is attempting (illegally) to change that to `foobar`?
   
   Edit: OK I think this all comes down to `this index` being super confusing to me :)  If it's referring to the incoming (newly added) index via addIndexes, the message makes sense.  But I would think `this index` means the one you opened with IW...
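   
   To make my reading concrete, here is a tiny sketch of the check as I'd expect it to behave (made-up method and variable names, not the actual `FieldInfos` code):
   
   ```
   // "targetField" = what the IndexWriter doing the addIndexes was configured
   // with (dir2/w2 -> "foo"); "incomingField" = what the added segments carry
   // (dir1 -> "foobar").  Names are made up for illustration.
   static void verifyParentField(String targetField, String incomingField) {
     if (targetField != null && targetField.equals(incomingField) == false) {
       throw new IllegalArgumentException(
           "cannot configure ["
               + incomingField
               + "] as parent document field; this index uses ["
               + targetField
               + "] as parent document field already");
     }
   }
   ```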



##
lucene/core/src/java/org/apache/lucene/index/FieldInfos.java:
##
@@ -78,6 +80,7 @@ public FieldInfos(FieldInfo[] infos) {
     boolean hasPointValues = false;
     boolean hasVectorValues = false;
     String softDeletesField = null;
+    String parentField = null;

Review Comment:
   Maybe fix these silly pointless initializers ^^?
   
   This `FieldInfos.java` is some of the oldest, most "pristine"/"untouched" code in Lucene ;)  You can almost count the tree rings ...



##
lucene/core/src/test/org/apache/lucene/index/TestAddIndexes.java:
##
@@ -1937,4 +1937,97 @@ public void setMergeInfo(SegmentCommitInfo info) {
     targetDir.close();
     sourceDir.close();
   }
+
+  public void testIllegalParentDocChange() throws Exception {
+    Directory dir1 = newDirectory();
+    IndexWriterConfig iwc1 = newIndexWriterConfig(new MockAnalyzer(random()));
+    iwc1.setParentField("foobar");
+    RandomIndexWriter w1 = new RandomIndexWriter(random(), dir1, iwc1);
+    Document parent = new Document();
+    w1.addDocuments(Arrays.asList(new Document(), new Document(), parent));
+    w1.commit();
+    w1.addDocuments(Arrays.asList(new Document(), new Document(), parent));
+    w1.commit();
+    // so the index sort is in fact burned into the index:
+    w1.forceMerge(1);
+    w1.close();
+
+    Directory dir2 = newDirectory();
+    IndexWriterConfig iwc2 = newIndexWriterConfig(new MockAnalyzer(random()));
+    iwc2.setParentField("foo");
+    RandomIndexWriter w2 = new RandomIndexWriter(random(), dir2, iwc2);
+
+    IndexReader r1 = DirectoryReader.open(dir1);
+    String message =
+        expectThrows(
+                IllegalArgumentException.class,
+                () -> {
+                  w2.addIndexes((SegmentReader) getOnlyLeafReader(r1));
+                })
+            .getMessage();
+    assertEquals(
+        "cannot configure [foo] as parent document field ; this index uses [foobar] as parent document field already",
+        message);
+
+    message =
+        expectThrows(
+                IllegalArgumentException.class,
+                () -> {
+                  w2.addIndexes(dir1);
+                })
+            .getMessage();
+    assertEquals(
+        "cannot configure [foo] as parent document field ; this index uses [foobar] as parent document field already",

Review Comment:
   Same here?



##
lucene/core/src/java/org/apache/lucene/index/FieldInfos.java:
##
@@ -437,6 +488,33 @@ private void verifySoftDeletedFieldName(String fieldName, boolean isSoftDeletesF
   }
 }
 
+private void verifyParentFieldName(St

Re: [PR] Improve vector search speed by using FixedBitSet [lucene]

2024-01-05 Thread via GitHub


benwtrent closed pull request #12789: Improve vector search speed by using FixedBitSet
URL: https://github.com/apache/lucene/pull/12789




Re: [I] Taxonomy facets: can we change massive `int[]` for parent/child/sibling tree to paged/block `int[]` to reduce RAM pressure? [lucene]

2024-01-05 Thread via GitHub


msfroh commented on issue #12989:
URL: https://github.com/apache/lucene/issues/12989#issuecomment-1879227454

   > What do you think, did you have something else in mind?
   
   Oh -- I didn't have anything in mind. I just saw the issue and thought, "Hey, I could figure out how to do that!" Sounds like you've got it in hand, though!





Re: [I] Taxonomy facets: can we change massive `int[]` for parent/child/sibling tree to paged/block `int[]` to reduce RAM pressure? [lucene]

2024-01-05 Thread via GitHub


stefanvodita commented on issue #12989:
URL: https://github.com/apache/lucene/issues/12989#issuecomment-1879330131

   I'd be happy to work together on it! If we go the route I was proposing, there's a non-trivial amount of work to do:
   1. Create the new interface for taxonomy arrays and use it with taxonomy facets and in the taxo reader (a rough sketch is below).
   2. Augment the `IntBlockPool` with convenience methods that support this new use-case and implement the new taxo array interface using the `IntBlockPool`.
   
   1 and 2 can be done independently, so we could each take one of those work streams. I'll start on it in the next few days, but feel free to jump in if you get the chance.
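   
   For step 1, I'm imagining something minimal along these lines (just a sketch, names are not final):
   
   ```
   /** Read-only view over one of the parallel taxonomy arrays (sketch only). */
   public interface TaxonomyIntArray {
     /** Value stored for the given ordinal. */
     int get(int ordinal);

     /** Number of ordinals currently stored. */
     int length();
   }
   ```
   
   The `IntBlockPool`-backed implementation from step 2 would then hide the block arithmetic behind `get`.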





Re: [I] Taxonomy facets: can we change massive `int[]` for parent/child/sibling tree to paged/block `int[]` to reduce RAM pressure? [lucene]

2024-01-05 Thread via GitHub


msfroh commented on issue #12989:
URL: https://github.com/apache/lucene/issues/12989#issuecomment-1879403796

   I took a look and I think we might be able to do it a little easier:
   
   ```
   public abstract class ParallelTaxonomyArrays {
     public class ChunkedArray {
       private final int chunkSize;
       private final int[][] chunks;

       public ChunkedArray(int chunkSize, int[][] chunks) {
         this.chunkSize = chunkSize;
         this.chunks = chunks;
       }

       public int get(int i) {
         int chunkNum = i / chunkSize;
         return chunks[chunkNum][i - (chunkNum * chunkSize)];
       }

       public int length() {
         return chunkSize * chunks.length;
       }
     }

     /** Sole constructor. */
     public ParallelTaxonomyArrays() {}

     public abstract ChunkedArray parents();
     public abstract ChunkedArray children();
     public abstract ChunkedArray siblings();
   }
   ```
   
   Then within `TaxonomyIndexArrays`, we can focus on building `int[][]` instances that get wrapped in `ChunkedArray`. I feel like `IntBlockPool` might be overkill?
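   
   For example, a read that crosses a chunk boundary would look like this (arbitrary values, and assuming `ChunkedArray` ends up static or top-level so it can be constructed directly):
   
   ```
   // two chunks of 4 ordinals each; the values are arbitrary parent ordinals
   int[][] chunks = {
     {0, 0, 1, 1},
     {2, 2, 3, 3},
   };
   ChunkedArray parents = new ChunkedArray(4, chunks);
   // parents.length() == 8
   // parents.get(5) == 2   -> chunk 5 / 4 = 1, offset 5 - 4 = 1
   ```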





[PR] Split taxonomy arrays across chunks [lucene]

2024-01-05 Thread via GitHub


msfroh opened a new pull request, #12995:
URL: https://github.com/apache/lucene/pull/12995

   
   
   ### Description
   
   Taxonomy ordinals are added in an append-only way.
   
   Instead of reallocating a single big array when loading new taxonomy ordinals and copying all the values from the previous arrays over individually, we can keep blocks of ordinals and reuse blocks from the previous arrays.
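   
   Roughly, the growth step looks like this (a sketch of the idea, not the exact code in this PR; the helper and its name are illustrative):
   
   ```
   // Grow an array-of-chunks to hold newCount ordinals, reusing the chunks
   // that already exist instead of copying their contents element by element.
   static int[][] grow(int[][] oldChunks, int newCount, int chunkSize) {
     int numChunks = (newCount + chunkSize - 1) / chunkSize;
     int[][] newChunks = new int[numChunks][];
     // carry over existing chunks by reference (no per-element copy)
     System.arraycopy(oldChunks, 0, newChunks, 0, oldChunks.length);
     // allocate only the chunks that did not exist before
     for (int i = oldChunks.length; i < numChunks; i++) {
       newChunks[i] = new int[chunkSize];
     }
     return newChunks;
   }
   ```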
   
   Resolves https://github.com/apache/lucene/issues/12989





Re: [I] Taxonomy facets: can we change massive `int[]` for parent/child/sibling tree to paged/block `int[]` to reduce RAM pressure? [lucene]

2024-01-05 Thread via GitHub


msfroh commented on issue #12989:
URL: https://github.com/apache/lucene/issues/12989#issuecomment-1879502787

   I ended up running with that idea (sort of) and implemented this: https://github.com/apache/lucene/pull/12995
   
   The unit tests pass, but I don't think any of them allocate more than 8192 ordinals (the size of chunk that I set).
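   
   Something like this (a rough sketch, not part of the PR, assuming the chunk size stays at 8192 and that `parents()` returns the new chunked structure) would force more than one chunk:
   
   ```
   public void testManyOrdinals() throws Exception {
     Directory dir = newDirectory();
     DirectoryTaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(dir);
     int numCategories = 8192 + 1000; // more ordinals than a single chunk holds
     for (int i = 0; i < numCategories; i++) {
       taxoWriter.addCategory(new FacetLabel("dim", Integer.toString(i)));
     }
     taxoWriter.commit();
     taxoWriter.close();
     try (DirectoryTaxonomyReader taxoReader = new DirectoryTaxonomyReader(dir)) {
       // the parallel arrays must now span at least two chunks
       assertTrue(taxoReader.getParallelTaxonomyArrays().parents().length() > 8192);
     }
     dir.close();
   }
   ```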

