Re: [PR] Use Collections.addAll() instead of manual array copy [lucene]

2024-01-04 Thread via GitHub


dweiss commented on code in PR #12977:
URL: https://github.com/apache/lucene/pull/12977#discussion_r1441428909


##
lucene/misc/src/java/org/apache/lucene/misc/index/IndexSplitter.java:
##
@@ -67,18 +66,10 @@ public static void main(String[] args) throws Exception {
     if (args[1].equals("-l")) {
       is.listSegments();
     } else if (args[1].equals("-d")) {
-      List<String> segs = new ArrayList<>();
-      for (int x = 2; x < args.length; x++) {
-        segs.add(args[x]);
-      }
-      is.remove(segs.toArray(new String[0]));
+      is.remove(Arrays.copyOfRange(args, 2, args.length));

Review Comment:
   This passes the validation checks only because the whole method is suppressed 
from forbidden-API checks (it uses sysouts). `Arrays.copyOfRange` is on the 
forbidden-APIs list and should be replaced; the rule reads:
   
   java.util.Arrays#copyOfRange(**) @ Prefer using ArrayUtil as 
Arrays#copyOfRange fills zeros for bad bounds
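   
   A minimal sketch of the suggested fix, using `ArrayUtil.copyOfSubArray` 
(from `org.apache.lucene.util`) in place of `Arrays.copyOfRange`:
   
   ```java
   // Sketch only: same slice of the CLI args, but ArrayUtil.copyOfSubArray
   // checks the requested bounds instead of silently zero-filling them the
   // way Arrays.copyOfRange does.
   is.remove(ArrayUtil.copyOfSubArray(args, 2, args.length));
   ```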
   
   






Re: [PR] Remove unnecessary fields loop from extractWeightedSpanTerms() [lucene]

2024-01-04 Thread via GitHub


dweiss commented on PR #12965:
URL: https://github.com/apache/lucene/pull/12965#issuecomment-1876672741

   Yep, I think this early termination condition makes sense. Could you add it, 
please? Thank you.





Re: [I] Nightly benchmark regression for term dict queries [lucene]

2024-01-04 Thread via GitHub


gf2121 closed issue #12659: Nightly benchmark regression for term dict queries
URL: https://github.com/apache/lucene/issues/12659





[I] Test an FST bytes store that re-reverses (reads bytes forward) on-the-fly [lucene]

2024-01-04 Thread via GitHub


mikemccand opened a new issue, #12992:
URL: https://github.com/apache/lucene/issues/12992

   ### Description
   
   At read time the FST APIs must read bytes in reverse, which is perverse and 
unnatural for modern CPU / IO stacks that do read-ahead optimizations for 
forward reading.
   
   It's quite complex to change the underlying FST format to become 
fundamentally forward only.  It'd require rewriting node addresses, which may 
then take different numbers of `vInt` bytes, causing more renumbering, etc.
   
   A simpler first step might be, at FST `freeze()` time (when the FST is done 
being compiled), reverse all bytes in the underlying storage (on disk or on 
heap), and at read time, pretend to the caller that they are still reading 
backwards, yet actually read forwards.  We could even do this separately for 
each store, e.g. start by testing on-heap read-time double reversal, or maybe 
start with on-disk where the OS's readahead optimizations may matter more.
   
   It should be relatively trivial to implement yet hard to think about, and we 
could see if it helps performance.
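   
   A rough sketch of the read-time trick (illustrative only -- 
`DoubleReversedReader` is a made-up name, not the FST `BytesReader` API): if 
`freeze()` writes the bytes out reversed, the reader can keep exposing 
"backward" positions while every physical access walks the storage forward.
   
   ```java
   // Hypothetical sketch: bytes were written in reverse at freeze() time, so a
   // logical "backward" read at position p maps to the forward physical index
   // (length - 1 - p) of the re-reversed storage.
   class DoubleReversedReader {
     private final byte[] reversedBytes;
     private long position; // logical position, counts down as the FST reads "backwards"
   
     DoubleReversedReader(byte[] reversedBytes) {
       this.reversedBytes = reversedBytes;
     }
   
     void setPosition(long pos) {
       this.position = pos;
     }
   
     byte readByte() {
       // the logical position decreases while the physical index increases: a
       // forward scan that read-ahead friendly hardware and OSes can optimize
       return reversedBytes[(int) (reversedBytes.length - 1 - position--)];
     }
   }
   ```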





Re: [PR] Optimize FST on-heap BytesReader [lucene]

2024-01-04 Thread via GitHub


mikemccand commented on code in PR #12879:
URL: https://github.com/apache/lucene/pull/12879#discussion_r1441712407


##
lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java:
##
@@ -56,14 +66,59 @@ public long ramBytesUsed() {
 
   public void freeze() {
     frozen = true;
-    // this operation are costly, so we want to compute it once and cache
-    dataInput = dataOutput.toDataInput();
+    // this operation is costly, so we want to compute it once and cache
+    this.byteBuffers = dataOutput.toWriteableBufferList();
   }
 
   @Override
   public FST.BytesReader getReverseBytesReader() {
-    assert dataInput != null; // freeze() must be called first
-    return new ReverseRandomAccessReader(dataInput);
+    assert byteBuffers != null; // freeze() must be called first
+    if (byteBuffers.size() == 1) {
+      // use a faster implementation for single-block case
+      return new ReverseBytesReader(byteBuffers.get(0).array());

Review Comment:
   Can we `assert hasArray()` before this?   `.array()` is optional (only valid 
for `DirectByteBuffer` I think).
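   
   For example, a sketch against the names in the quoted diff (not the 
committed change):
   
   ```java
   // a heap ByteBuffer reports hasArray() == true; asserting it documents why
   // calling .array() below is safe before handing the backing array to the reader
   assert byteBuffers.get(0).hasArray();
   return new ReverseBytesReader(byteBuffers.get(0).array());
   ```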






Re: [PR] Optimize FST on-heap BytesReader [lucene]

2024-01-04 Thread via GitHub


mikemccand commented on code in PR #12879:
URL: https://github.com/apache/lucene/pull/12879#discussion_r1441713259


##
lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java:
##
@@ -56,14 +66,59 @@ public long ramBytesUsed() {
 
   public void freeze() {
     frozen = true;
-    // this operation are costly, so we want to compute it once and cache
-    dataInput = dataOutput.toDataInput();
+    // this operation is costly, so we want to compute it once and cache
+    this.byteBuffers = dataOutput.toWriteableBufferList();
   }
 
   @Override
   public FST.BytesReader getReverseBytesReader() {
-    assert dataInput != null; // freeze() must be called first
-    return new ReverseRandomAccessReader(dataInput);
+    assert byteBuffers != null; // freeze() must be called first
+    if (byteBuffers.size() == 1) {
+      // use a faster implementation for single-block case
+      return new ReverseBytesReader(byteBuffers.get(0).array());
+    }
+    return new FST.BytesReader() {

Review Comment:
   Hmm, I'm still worried about test coverage of this.  Conditionals like this 
are dangerous for Lucene: since our tests only build tiny FSTs, this code path 
would rarely or never be executed, likely not even by our nightly benchmarks.  
Ideally we would find a way to randomize the page size in many tests, or at 
least in one test?
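   
   A hedged sketch of what that randomization could look like -- `random()` and 
`TestUtil` are the usual test-framework helpers, while the 
`dataOutput`/`getOnHeapReaderWriter` hooks are assumptions about this PR's 
builder API rather than confirmed signatures:
   
   ```java
   // pick a random page size so even small test FSTs span multiple blocks
   int blockBits = TestUtil.nextInt(random(), 8, 15); // pages between 256 B and 32 KB
   FSTCompiler<Long> fstCompiler =
       new FSTCompiler.Builder<>(FST.INPUT_TYPE.BYTE1, PositiveIntOutputs.getSingleton())
           // hypothetical hook: route the FST bytes through a small-page on-heap store
           .dataOutput(FSTCompiler.getOnHeapReaderWriter(blockBits))
           .build();
   ```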






Re: [PR] Initial impl of MMapDirectory for Java 22 [lucene]

2024-01-04 Thread via GitHub


uschindler commented on PR #12706:
URL: https://github.com/apache/lucene/pull/12706#issuecomment-1877096268

   No no no, I am not stale!





Re: [PR] Optimize FST on-heap BytesReader [lucene]

2024-01-04 Thread via GitHub


dungba88 commented on code in PR #12879:
URL: https://github.com/apache/lucene/pull/12879#discussion_r1441917076


##
lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java:
##
@@ -56,14 +66,59 @@ public long ramBytesUsed() {
 
   public void freeze() {
     frozen = true;
-    // this operation are costly, so we want to compute it once and cache
-    dataInput = dataOutput.toDataInput();
+    // this operation is costly, so we want to compute it once and cache
+    this.byteBuffers = dataOutput.toWriteableBufferList();
   }
 
   @Override
   public FST.BytesReader getReverseBytesReader() {
-    assert dataInput != null; // freeze() must be called first
-    return new ReverseRandomAccessReader(dataInput);
+    assert byteBuffers != null; // freeze() must be called first
+    if (byteBuffers.size() == 1) {
+      // use a faster implementation for single-block case
+      return new ReverseBytesReader(byteBuffers.get(0).array());
+    }
+    return new FST.BytesReader() {

Review Comment:
   Sorry I meant TestFSTs.testRealTerms. The default block size is 32KB and the 
FST is 55-80KB, thus 2-3 blocks.






Re: [PR] Optimize FST on-heap BytesReader [lucene]

2024-01-04 Thread via GitHub


dungba88 commented on code in PR #12879:
URL: https://github.com/apache/lucene/pull/12879#discussion_r1441927006


##
lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java:
##
@@ -56,14 +66,59 @@ public long ramBytesUsed() {
 
   public void freeze() {
     frozen = true;
-    // this operation are costly, so we want to compute it once and cache
-    dataInput = dataOutput.toDataInput();
+    // this operation is costly, so we want to compute it once and cache
+    this.byteBuffers = dataOutput.toWriteableBufferList();
   }
 
   @Override
   public FST.BytesReader getReverseBytesReader() {
-    assert dataInput != null; // freeze() must be called first
-    return new ReverseRandomAccessReader(dataInput);
+    assert byteBuffers != null; // freeze() must be called first
+    if (byteBuffers.size() == 1) {
+      // use a faster implementation for single-block case
+      return new ReverseBytesReader(byteBuffers.get(0).array());

Review Comment:
   Since we call `toWriteableBufferList`, the array should be accessible. I 
think that if the array were somehow not accessible, `HeapByteBuffer` (the 
implementation being used) would throw an exception.






Re: [PR] Optimize FST on-heap BytesReader [lucene]

2024-01-04 Thread via GitHub


dungba88 commented on code in PR #12879:
URL: https://github.com/apache/lucene/pull/12879#discussion_r1441927006


##
lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java:
##
@@ -56,14 +66,59 @@ public long ramBytesUsed() {
 
   public void freeze() {
     frozen = true;
-    // this operation are costly, so we want to compute it once and cache
-    dataInput = dataOutput.toDataInput();
+    // this operation is costly, so we want to compute it once and cache
+    this.byteBuffers = dataOutput.toWriteableBufferList();
   }
 
   @Override
   public FST.BytesReader getReverseBytesReader() {
-    assert dataInput != null; // freeze() must be called first
-    return new ReverseRandomAccessReader(dataInput);
+    assert byteBuffers != null; // freeze() must be called first
+    if (byteBuffers.size() == 1) {
+      // use a faster implementation for single-block case
+      return new ReverseBytesReader(byteBuffers.get(0).array());

Review Comment:
   Since we call `toWriteableBufferList`, the array should be accessible. I 
think that if the array were somehow not accessible, `HeapByteBuffer` (the 
implementation being used) would throw an exception, which would then be 
caught by tests.
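   
   For reference, a small illustration of the JDK contract being relied on here 
(plain `java.nio`, nothing Lucene-specific):
   
   ```java
   import java.nio.ByteBuffer;
   
   ByteBuffer heap = ByteBuffer.allocate(16);         // HeapByteBuffer
   assert heap.hasArray();
   byte[] backing = heap.array();                     // fine for a writable heap buffer
   
   ByteBuffer readOnly = heap.asReadOnlyBuffer();     // array() would throw ReadOnlyBufferException
   ByteBuffer direct = ByteBuffer.allocateDirect(16); // array() would throw UnsupportedOperationException
   ```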






Re: [PR] Optimize FST on-heap BytesReader [lucene]

2024-01-04 Thread via GitHub


mikemccand commented on code in PR #12879:
URL: https://github.com/apache/lucene/pull/12879#discussion_r1441950879


##
lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java:
##
@@ -56,14 +66,59 @@ public long ramBytesUsed() {
 
   public void freeze() {
     frozen = true;
-    // this operation are costly, so we want to compute it once and cache
-    dataInput = dataOutput.toDataInput();
+    // this operation is costly, so we want to compute it once and cache
+    this.byteBuffers = dataOutput.toWriteableBufferList();
   }
 
   @Override
   public FST.BytesReader getReverseBytesReader() {
-    assert dataInput != null; // freeze() must be called first
-    return new ReverseRandomAccessReader(dataInput);
+    assert byteBuffers != null; // freeze() must be called first
+    if (byteBuffers.size() == 1) {
+      // use a faster implementation for single-block case
+      return new ReverseBytesReader(byteBuffers.get(0).array());

Review Comment:
   Yeah, I realize the code is safe, but it takes some thinking to confirm it -- 
maybe add the assert with your awesome explanation above?  That way future code 
readers know why this `ByteBuffer` is for sure backed by an array.






Re: [PR] [WIP] LUCENE-10002: Deprecate FacetsCollector#search helper methods as they internally use IndexSearcher#search(Query, Collector) API [lucene]

2024-01-04 Thread via GitHub


mikemccand commented on PR #12890:
URL: https://github.com/apache/lucene/pull/12890#issuecomment-1877386385

   Yeah I agree it's OK to deprecate without replacement, but maybe in the 
deprecated javadocs (and in `MIGRATE.txt` for 10.0) add a short explanation 
about using `MultiCollectorManager` for this?
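   
   A hedged sketch of what that explanation could show (collector-manager class 
names and constructor overloads vary a bit across versions, so treat this as 
illustrative):
   
   ```java
   // run facet counting alongside top-hits collection in one search,
   // instead of the deprecated FacetsCollector#search helper
   MultiCollectorManager mcm =
       new MultiCollectorManager(
           new TopScoreDocCollectorManager(10, 100), // top 10 hits, count up to 100
           new FacetsCollectorManager());
   Object[] results = searcher.search(query, mcm);
   TopDocs topDocs = (TopDocs) results[0];
   FacetsCollector facetsCollector = (FacetsCollector) results[1];
   ```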





Re: [I] Port PR management bot from Apache Beam [lucene]

2024-01-04 Thread via GitHub


uschindler commented on issue #12796:
URL: https://github.com/apache/lucene/issues/12796#issuecomment-1877414041

   As said in the mailing list thread: we should maybe exclude "draft" PRs from 
this. I regularly start them for stuff that I won't merge soon. One example is 
#12706.





Re: [PR] Introduce workflow for stale PRs [lucene]

2024-01-04 Thread via GitHub


uschindler commented on code in PR #12813:
URL: https://github.com/apache/lucene/pull/12813#discussion_r1442004702


##
.github/workflows/stale.yml:
##
@@ -22,6 +22,7 @@ jobs:
         with:
           repo-token: ${{ secrets.GITHUB_TOKEN }}
           days-before-pr-stale: 14
+          exempt-draft-pr: true

Review Comment:
   thanks!






Re: [PR] Introduce workflow for stale PRs [lucene]

2024-01-04 Thread via GitHub


stefanvodita commented on PR #12813:
URL: https://github.com/apache/lucene/pull/12813#issuecomment-1877494977

   I ended up checking more of the configurable parameters after looking for 
the one to exclude draft PRs. Three things to note:
   1. I have to configure a few more parameters to handle issues correctly and 
to not close PRs when they get too stale, which seems to be the default 
(scary!).
   2. There is a `debug-only` mode we can enable, which will make it so the 
workflow runs without applying any changes. We can read its logs to see if it's 
behaving the way we want it to. If it is, we can disable debug mode and run it 
for real.
   3. We can configure how many operations the workflow will do to avoid 
hitting the rate limit. The default is `30`, which is probably a good place to 
start. As a consequence, we would also avoid having most of the open PRs 
labeled stale in one go. Both knobs are sketched below.
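   
   For reference, a sketch of how those knobs look in `stale.yml` (the values 
here are illustrative, not the ones committed):
   
   ```yaml
   - uses: actions/stale@v9
     with:
       repo-token: ${{ secrets.GITHUB_TOKEN }}
       days-before-pr-stale: 14
       days-before-pr-close: -1   # label only, never auto-close PRs
       debug-only: true           # dry run: log what would happen, change nothing
       operations-per-run: 30     # cap API calls per run to stay under rate limits
   ```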
   
   





Re: [PR] Introduce workflow for stale PRs [lucene]

2024-01-04 Thread via GitHub


stefanvodita commented on PR #12813:
URL: https://github.com/apache/lucene/pull/12813#issuecomment-1877528163

   I did some testing with the parameters I mentioned and pushed a new revision.
   In the tests on my fork, a stale PR took 5 operations to process and a 
non-stale PR took 1.





Re: [PR] Add support for index sorting with document blocks [lucene]

2024-01-04 Thread via GitHub


s1monw commented on code in PR #12829:
URL: https://github.com/apache/lucene/pull/12829#discussion_r1442110320


##
lucene/core/src/java/org/apache/lucene/index/FieldInfos.java:
##
@@ -437,6 +488,33 @@ private void verifySoftDeletedFieldName(String fieldName, boolean isSoftDeletesF
       }
     }
 
+    private void verifyParentFieldName(String fieldName, boolean isParentField) {
+      if (isParentField) {
+        if (parentFieldName == null) {
+          throw new IllegalArgumentException(
+              "this index has ["
+                  + fieldName
+                  + "] as parent document field already but parent document field is not configured in IWC");
+        } else if (fieldName.equals(parentFieldName) == false) {
+          throw new IllegalArgumentException(
+              "cannot configure ["
+                  + parentFieldName
+                  + "] as parent document field ; this index uses ["
+                  + fieldName

Review Comment:
   I think this is confusing because of its name. The `parentFieldName` comes 
from the IWC and `fieldName` is the incoming field. I think it's right the way 
it is; maybe we need to do some rewording altogether, but when you look at the 
tests I think the error message makes sense? 
   I would like to find a way so that users don't have to specify these field 
names at all, but that should be a follow-up.






[PR] Remove unnecessary comment [lucene]

2024-01-04 Thread via GitHub


andrross opened a new pull request, #12993:
URL: https://github.com/apache/lucene/pull/12993

   ### Description
   
   A [previous iteration][1] of this code used an AtomicInteger and required 
this comment. The committed version uses a self-documenting boolean and the 
comment is not needed.
   
   [1]: 
https://github.com/apache/lucene/pull/12689/commits/5c847a0a3b2312414c8a4623d288d8fe96d52ff8
   
   
   
   





Re: [PR] Remove unnecessary comment [lucene]

2024-01-04 Thread via GitHub


andrross commented on PR #12993:
URL: https://github.com/apache/lucene/pull/12993#issuecomment-1877693422

   FYI @javanna





Re: [PR] Use Collections.addAll() instead of manual array copy [lucene]

2024-01-04 Thread via GitHub


dweiss commented on code in PR #12977:
URL: https://github.com/apache/lucene/pull/12977#discussion_r1442253173


##
lucene/misc/src/java/org/apache/lucene/misc/index/IndexSplitter.java:
##
@@ -67,18 +66,10 @@ public static void main(String[] args) throws Exception {
     if (args[1].equals("-l")) {
       is.listSegments();
     } else if (args[1].equals("-d")) {
-      List<String> segs = new ArrayList<>();
-      for (int x = 2; x < args.length; x++) {
-        segs.add(args[x]);
-      }
-      is.remove(segs.toArray(new String[0]));
+      is.remove(Arrays.copyOfRange(args, 2, args.length));

Review Comment:
   I allowed myself to correct this one and push it to your branch.






Re: [PR] Use Collections.addAll() instead of manual array copy and misc. code cleanups [lucene]

2024-01-04 Thread via GitHub


dweiss merged PR #12977:
URL: https://github.com/apache/lucene/pull/12977





Re: [PR] Make Lucene90 postings format to write FST off heap [lucene]

2024-01-04 Thread via GitHub


dungba88 commented on PR #12985:
URL: https://github.com/apache/lucene/pull/12985#issuecomment-1877952318

   @mikemccand I'm wondering if there is already some benchmark that can show 
the RAM saved by this change





Re: [I] Taxonomy facets: can we change massive `int[]` for parent/child/sibling tree to paged/block `int[]` to reduce RAM pressure? [lucene]

2024-01-04 Thread via GitHub


msfroh commented on issue #12989:
URL: https://github.com/apache/lucene/issues/12989#issuecomment-1877979164

   If nobody else is working on this, I think I'd like to take it!





Re: [PR] Optimize FST on-heap BytesReader [lucene]

2024-01-04 Thread via GitHub


dungba88 commented on code in PR #12879:
URL: https://github.com/apache/lucene/pull/12879#discussion_r1442419722


##
lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java:
##
@@ -56,14 +66,59 @@ public long ramBytesUsed() {
 
   public void freeze() {
     frozen = true;
-    // this operation are costly, so we want to compute it once and cache
-    dataInput = dataOutput.toDataInput();
+    // this operation is costly, so we want to compute it once and cache
+    this.byteBuffers = dataOutput.toWriteableBufferList();
   }
 
   @Override
   public FST.BytesReader getReverseBytesReader() {
-    assert dataInput != null; // freeze() must be called first
-    return new ReverseRandomAccessReader(dataInput);
+    assert byteBuffers != null; // freeze() must be called first
+    if (byteBuffers.size() == 1) {
+      // use a faster implementation for single-block case
+      return new ReverseBytesReader(byteBuffers.get(0).array());

Review Comment:
   That's a good idea. I've added the assertion. I decided to put it right 
after the ByteBuffer list is created (in `freeze()`) and to assert on every 
ByteBuffer in the list.
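   
   A sketch of the shape of that check (assuming the buffers handed back by 
`toWriteableBufferList` are heap-backed):
   
   ```java
   this.byteBuffers = dataOutput.toWriteableBufferList();
   // every buffer here is expected to be a writable heap buffer, so .array()
   // stays safe for the single-block fast path in getReverseBytesReader()
   assert byteBuffers.stream().allMatch(ByteBuffer::hasArray);
   ```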






Re: [I] Taxonomy facets: can we change massive `int[]` for parent/child/sibling tree to paged/block `int[]` to reduce RAM pressure? [lucene]

2024-01-04 Thread via GitHub


stefanvodita commented on issue #12989:
URL: https://github.com/apache/lucene/issues/12989#issuecomment-1878171579

   @msfroh - I was looking into this as well and had some thoughts about how to 
do it.
   
   We could replace 
[`ParallelTaxonomyArrays`](https://github.com/apache/lucene/blob/7b8aece125aabff2823626d5b939abf4747f63a7/lucene/facet/src/java/org/apache/lucene/facet/taxonomy/ParallelTaxonomyArrays.java)
 with a new interface that offers three operations for each of the arrays:
   ```
   interface ChunkedParallelTaxonomyArrays {
   
 /* Record new entry. */
 public void appendParent(int parent);
   
 /* Retrieve this ordinal's parent. From the user's perspective, this is 
like an array look-up. */
 public int getParent(int ord);
   
 /* There are some places where we need to know how many parents exist in 
total. */
 public int sizeParents();
   
 // Same for children and siblings
 ...
   }
   ```
   
   To implement this interface, we could use an 
[`IntBlockPool`](https://github.com/apache/lucene/blob/7b8aece125aabff2823626d5b939abf4747f63a7/lucene/core/src/java/org/apache/lucene/util/IntBlockPool.java).
 We would allocate new int buffers in the block pool as needed and preserve the 
block pool across 
[`DirectoryTaxonomyReader`](https://github.com/apache/lucene/blob/7b8aece125aabff2823626d5b939abf4747f63a7/lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java)
 refreshes.
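   
   As a rough illustration (the class shape and names below are made up for 
this sketch, not an existing Lucene API), the parent array could sit on an 
`IntBlockPool` like this:
   
   ```java
   // parents stored in fixed-size int blocks instead of one giant int[];
   // growing appends a new 8K-int block rather than copying the whole array
   class BlockPoolParentArray {
     private final IntBlockPool pool = new IntBlockPool();
     private int size;
   
     BlockPoolParentArray() {
       pool.nextBuffer(); // make the first block available for writes
     }
   
     public void appendParent(int parent) {
       if (pool.intUpto == IntBlockPool.INT_BLOCK_SIZE) {
         pool.nextBuffer(); // current block is full, allocate the next one
       }
       pool.buffer[pool.intUpto++] = parent;
       size++;
     }
   
     public int getParent(int ord) {
       // translate the flat ordinal into (block, offset) within the pool
       return pool.buffers[ord >> IntBlockPool.INT_BLOCK_SHIFT][ord & IntBlockPool.INT_BLOCK_MASK];
     }
   
     public int sizeParents() {
       return size;
     }
   }
   ```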
   
   There are definitely some disadvantages with the block pool idea:
   1. We're preserving a mutable data-structure across taxonomy refreshes. 
There is 
[precedent](https://github.com/apache/lucene/blob/7b8aece125aabff2823626d5b939abf4747f63a7/lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java#L172)
 though, with the caches in `DirectoryTaxonomyReader`.
   2. We would be slightly overallocating by having the last buffer in the pool 
not be completely used, but I think this is a good trade-off to take for the 
increased efficiency and simplicity.
   
   What do you think, did you have something else in mind?





[PR] Clean up unused code & variables [lucene]

2024-01-04 Thread via GitHub


dungba88 opened a new pull request, #12994:
URL: https://github.com/apache/lucene/pull/12994

   ### Description
   
   Clean up unused code & variables in FSTCompiler

