Re: [I] Reproducible failure in TestIndexWriter.testHasUncommittedChanges [lucene]

2023-11-13 Thread via GitHub


jpountz closed issue #12763: Reproducible failure in 
TestIndexWriter.testHasUncommittedChanges
URL: https://github.com/apache/lucene/issues/12763


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove patching for doc blocks. [lucene]

2023-11-13 Thread via GitHub


jpountz commented on PR #12741:
URL: https://github.com/apache/lucene/pull/12741#issuecomment-1807797987

   Thanks both, I pushed an annotation; it should show up tomorrow. I had high expectations based on preliminary results from https://github.com/apache/lucene/issues/12696#issue-1950239343, where `AndHighMed` had a reproducible 3-4% speedup, so I was expecting nightlies to show it too. @slow-J I'm curious if you had a chance to run benchmarks on this PR: did it also show a speedup?





Re: [PR] Refactor the use of runFinalization in tests and benchmarks [lucene]

2023-11-13 Thread via GitHub


ChrisHegarty merged PR #12768:
URL: https://github.com/apache/lucene/pull/12768





Re: [PR] Remove patching for doc blocks. [lucene]

2023-11-13 Thread via GitHub


slow-J commented on PR #12741:
URL: https://github.com/apache/lucene/pull/12741#issuecomment-1807962044

   I ran a new luceneutil benchmark on Saturday with my commit https://github.com/apache/lucene/commit/8ae598bae593e1faa4ff82a87f4cd45f120f1059 (using Lucene99PostingsFormat) as the candidate and the commit's parent (using Lucene90PostingsFormat) as the baseline.
   
   Other benchmark variables for transparency:
   * Java 19
   * EC2 instance: `m5.12xlarge`.
   * Disabled JFR
   
   ```
   Task                          QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff             p-value
   BrowseMonthTaxoFacets                 3.98  (6.7%)                     3.92  (2.3%)    -1.7% (  -9% -  7%)  0.286
   BrowseDayOfYearSSDVFacets             3.29  (1.6%)                     3.27  (1.1%)    -0.7% (  -3% -  2%)  0.126
   BrowseDateSSDVFacets                  1.01  (4.8%)                     1.00  (5.6%)    -0.6% ( -10% - 10%)  0.695
   PKLookup                            151.00  (2.5%)                   150.26  (2.6%)    -0.5% (  -5% -  4%)  0.543
   BrowseMonthSSDVFacets                 3.44  (1.6%)                     3.43  (1.5%)    -0.5% (  -3% -  2%)  0.321
   LowTerm                             345.74  (3.0%)                   344.39  (3.1%)    -0.4% (  -6% -  5%)  0.689
   Wildcard                            142.48  (1.9%)                   142.00  (1.8%)    -0.3% (  -3% -  3%)  0.569
   Prefix3                            1056.96  (4.0%)                  1054.07  (3.2%)    -0.3% (  -7% -  7%)  0.812
   BrowseRandomLabelSSDVFacets           2.55  (7.2%)                     2.54  (7.2%)    -0.1% ( -13% - 15%)  0.975
   BrowseDayOfYearTaxoFacets             3.96  (0.7%)                     3.96  (0.7%)    -0.0% (  -1% -  1%)  0.856
   BrowseDateTaxoFacets                  3.94  (0.7%)                     3.94  (0.7%)    -0.0% (  -1% -  1%)  0.912
   BrowseRandomLabelTaxoFacets           3.41  (0.8%)                     3.41  (0.9%)    -0.0% (  -1% -  1%)  0.977
   Fuzzy1                               69.71  (0.9%)                    69.74  (0.8%)     0.0% (  -1% -  1%)  0.902
   OrHighNotHigh                       205.50  (4.6%)                   205.67  (5.3%)     0.1% (  -9% - 10%)  0.958
   Fuzzy2                               58.39  (0.8%)                    58.44  (0.6%)     0.1% (  -1% -  1%)  0.688
   OrHighNotLow                        265.57  (5.4%)                   266.05  (5.9%)     0.2% ( -10% - 12%)  0.921
   HighIntervalsOrdered                  5.32  (4.5%)                     5.33  (4.1%)     0.2% (  -8% -  9%)  0.879
   HighTermTitleBDVSort                  7.81  (3.1%)                     7.83  (2.7%)     0.3% (  -5% -  6%)  0.756
   Respell                              34.00  (1.3%)                    34.10  (1.1%)     0.3% (  -2% -  2%)  0.451
   MedTermDayTaxoFacets                 16.00  (3.8%)                    16.04  (3.8%)     0.3% (  -7% -  8%)  0.801
   OrHighNotMed                        255.88  (5.0%)                   257.07  (5.2%)     0.5% (  -9% - 11%)  0.774
   AndHighHighDayTaxoFacets              2.83  (3.5%)                     2.84  (3.4%)     0.5% (  -6% -  7%)  0.658
   MedTerm                             374.43  (4.9%)                   376.51  (5.6%)     0.6% (  -9% - 11%)  0.738
   HighTerm                            472.29  (4.8%)                   474.92  (5.7%)     0.6% (  -9% - 11%)  0.738
   MedSpanNear                           5.01  (4.3%)                     5.04  (4.6%)     0.6% (  -7% -  9%)  0.690
   HighTermMonthSort                  2670.49  (4.1%)                  2689.05  (3.1%)     0.7% (  -6% -  8%)  0.549
   LowIntervalsOrdered                   6.65  (4.2%)                     6.70  (4.3%)     0.7% (  -7% -  9%)  0.584
   OrHighHigh                           22.54  (1.2%)                    22.71  (2.4%)     0.7% (  -2% -  4%)  0.204
   OrHighMedDayTaxoFacets                1.58  (4.6%)                     1.59  (3.5%)     0.8% (  -6% -  9%)  0.523
   IntNRQ                               27.99  (2.4%)                    28.25  (3.9%)     0.9% (  -5% -  7%)  0.371
   OrNotHighHigh                       244.47  (3.7%)                   246.97  (4.5%)     1.0% (  -6% -  9%)  0.432
   MedIntervalsOrdered                   3.14  (3.7%)                     3.18  (3.9%)     1.0% (  -6% -  9%)  0.391
   OrHighMed                            43.19  (1.3%)                    43.65  (1.7%)     1.1% (  -1% -  4%)  0.025
   HighSpanNear                          6.73  (2.7%)                     6.80  (3.3%)     1.2% (  -4% -  7%)  0.223
   LowPhrase                           255.75  (2.1%)                   259.11  (2.0%)     1.3% (  -2% -  5%)  0.043
   HighTermDayOfYearSort               253.38  (4.1%)                   257.57  (2.2%)     1.7% (  -4% -  8%)  0.112
   AndHighHigh                          15.70  (1.1%)                    15.98  (2.5%)     1.8% (  -1% -  5%)  0.004
   MedPhrase                            14.60  (2.4%)                    14.89  (2.2%)     1.9% (  -2% -  6%)  0.009
   ```

Re: [PR] Fix CheckIndex to detect major corruption with old (not the latest) commit point [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on code in PR #12530:
URL: https://github.com/apache/lucene/pull/12530#discussion_r1390999456


##
lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:
##
@@ -610,6 +610,39 @@ public Status checkIndex(List onlySegments, 
ExecutorService executorServ
   return result;
 }
 
+// https://github.com/apache/lucene/issues/7820: also attempt to open any older commit
+// points (segments_N), which will catch certain corruption like missing _N.si files
+// for segments not also referenced by the newest commit point (which was already
+// loaded, successfully, above).  Note that we do not do a deeper check of segments
+// referenced ONLY by these older commit points, because such corruption would not
+// prevent a new IndexWriter from opening on the newest commit point.  but it is still
+// corruption, e.g. a reader opened on those old commit points can hit corruption
+// exceptions which we (still) will not detect here.  progress not perfection!
+
+for (String fileName : files) {
+  if (fileName.startsWith(IndexFileNames.SEGMENTS)
+  && fileName.equals(lastSegmentsFile) == false) {

Review Comment:
   OK I merged the two cases.  It was a bit tricky because I still wanted to 
first attempt to load the last commit point, before the older ones, so that if 
both of them have problems, we always show the user the problem with the last 
commit point first.
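The ordering described here is easy to sketch outside of Lucene. The snippet below is a hypothetical illustration (not CheckIndex's actual code) that orders the commit points so the newest `segments_N` is validated first and older generations follow, assuming Lucene's base-36 generation suffix; all names are illustrative.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CommitPointOrder {
  // Lucene encodes the generation N of segments_N in base 36.
  static long generation(String segmentsFile) {
    return Long.parseLong(segmentsFile.substring("segments_".length()), 36);
  }

  // Returns the segments_N files sorted newest-generation first, so a problem
  // in the latest commit point is always reported before older ones.
  static List<String> checkOrder(List<String> files) {
    List<String> commits = new ArrayList<>();
    for (String f : files) {
      if (f.startsWith("segments_")) {
        commits.add(f);
      }
    }
    commits.sort(Comparator.comparingLong(CommitPointOrder::generation).reversed());
    return commits;
  }

  public static void main(String[] args) {
    List<String> order =
        checkOrder(List.of("_0.si", "segments_1", "_1.si", "segments_a", "segments_2"));
    // "segments_a" is generation 10 in base 36, so it is checked first.
    if (!order.equals(List.of("segments_a", "segments_2", "segments_1"))) {
      throw new AssertionError(order);
    }
  }
}
```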






Re: [PR] Fix CheckIndex to detect major corruption with old (not the latest) commit point [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on PR #12530:
URL: https://github.com/apache/lucene/pull/12530#issuecomment-1808000711

   I think this PR is ready.  Note that its scope is "only" to catch cases 
where the last commit point succeeds, but older commit points have problems.  
This case was previously passing `CheckIndex` yet `IndexWriter` would fail to 
start up since it attempts to load all commit points.
   
   We have a [separate issue open](https://github.com/apache/lucene/issues/7820) to more gracefully handle the tricky missing `_N.si` file case.





[PR] Generalize LSBRadixSorter and use it in SortingPostingsEnum [lucene]

2023-11-13 Thread via GitHub


gf2121 opened a new pull request, #12800:
URL: https://github.com/apache/lucene/pull/12800

   **Description**
   
   In https://github.com/apache/lucene/pull/12114, we had great numbers for the LSB radix sorter when sorting random docs in `SortingDocsEnum`. But we cannot take advantage of the LSB radix sorter in `SortingPostingsEnum` because (I guess) we need to swap a parallel offset array. This PR proposes to do the following:
   
   * Extract a generalized LSB radix sorter called `BaseLSBRadixSorter`, and rename the original `LSBRadixSorter` to `UnsignedIntLSBRadixSorter`.
   
   * Implement a custom LSB radix sorter for `SortingPostingsEnum`.
   
   **Some Notes**
   
   * I plan to mark `LSBRadixSorter` deprecated in 9.x and delete it in 10.0, so I called the abstract class `BaseLSBRadixSorter` instead of `LSBRadixSorter` to ease backporting.
   
   * I ran a benchmark to make sure we do not have a regression for the unsigned-int sorter. I attached the benchmark code in the PR and will remove it before merging.
   ```
   Benchmark                  (bit)  (order)  (size)   Mode  Cnt  Score   Error   Units
   LSBSorterBenchmark.newLSB     24  natural      10  thrpt    5  2.802 ± 0.038  ops/ms
   LSBSorterBenchmark.newLSB     24  reverse      10  thrpt    5  2.800 ± 0.040  ops/ms
   LSBSorterBenchmark.newLSB     24   random      10  thrpt    5  2.768 ± 0.039  ops/ms
   LSBSorterBenchmark.newLSB     31  natural      10  thrpt    5  2.125 ± 0.021  ops/ms
   LSBSorterBenchmark.newLSB     31  reverse      10  thrpt    5  2.128 ± 0.033  ops/ms
   LSBSorterBenchmark.newLSB     31   random      10  thrpt    5  2.101 ± 0.014  ops/ms
   LSBSorterBenchmark.oldLSB     24  natural      10  thrpt    5  2.847 ± 0.030  ops/ms
   LSBSorterBenchmark.oldLSB     24  reverse      10  thrpt    5  2.848 ± 0.013  ops/ms
   LSBSorterBenchmark.oldLSB     24   random      10  thrpt    5  2.787 ± 0.129  ops/ms
   LSBSorterBenchmark.oldLSB     31  natural      10  thrpt    5  2.154 ± 0.033  ops/ms
   LSBSorterBenchmark.oldLSB     31  reverse      10  thrpt    5  2.150 ± 0.018  ops/ms
   LSBSorterBenchmark.oldLSB     31   random      10  thrpt    5  2.121 ± 0.011  ops/ms
   ```
   
   * I ran a benchmark to compare the tim sorter and the LSB radix sorter impl in `SortingPostingsEnum`. The LSB radix sorter is much faster in most cases except for completely sorted docs, which looks fine to me. I attached the benchmark code in the PR and will remove it before merging as well.
   
   ```
   Benchmark                       (bit)  (order)  (size)   Mode  Cnt   Score    Error   Units
   DocSorterBenchmark.radixSorter     24  natural      10  thrpt    5   2.063 ±  0.088  ops/ms
   DocSorterBenchmark.radixSorter     24  reverse      10  thrpt    5   2.082 ±  0.016  ops/ms
   DocSorterBenchmark.radixSorter     24   random      10  thrpt    5   2.110 ±  0.014  ops/ms
   DocSorterBenchmark.radixSorter     24  partial      10  thrpt    5   2.091 ±  0.026  ops/ms
   DocSorterBenchmark.radixSorter     31  natural      10  thrpt    5   1.633 ±  0.014  ops/ms
   DocSorterBenchmark.radixSorter     31  reverse      10  thrpt    5   1.636 ±  0.023  ops/ms
   DocSorterBenchmark.radixSorter     31   random      10  thrpt    5   1.641 ±  0.009  ops/ms
   DocSorterBenchmark.radixSorter     31  partial      10  thrpt    5   1.640 ±  0.010  ops/ms
   DocSorterBenchmark.timSorter       24  natural      10  thrpt    5  49.726 ±  1.452  ops/ms
   DocSorterBenchmark.timSorter       24  reverse      10  thrpt    5   0.724 ±  0.006  ops/ms
   DocSorterBenchmark.timSorter       24   random      10  thrpt    5   0.061 ±  0.001  ops/ms
   DocSorterBenchmark.timSorter       24  partial      10  thrpt    5   0.203 ±  0.023  ops/ms
   DocSorterBenchmark.timSorter       31  natural      10  thrpt    5  49.475 ±  2.480  ops/ms
   DocSorterBenchmark.timSorter       31  reverse      10  thrpt    5   2.236 ±  0.192  ops/ms
   DocSorterBenchmark.timSorter       31   random      10  thrpt    5   0.058 ±  0.001  ops/ms
   DocSorterBenchmark.timSorter       31  partial      10  thrpt    5   0.209 ±  0.012  ops/ms
   ```
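   As a rough illustration of the constraint mentioned above (an illustrative sketch only, not this PR's implementation): an LSB radix sort that must carry a parallel offsets array has to scatter both arrays in lockstep on every pass, which a sorter that only moves the int keys cannot do.

```java
import java.util.Arrays;

public class LsbRadixWithParallelArray {
  // Stable LSB radix sort over non-negative doc IDs that reorders a parallel
  // long[] of offsets together with the docs, one byte per pass.
  static void sort(int[] docs, long[] offsets, int len) {
    int[] docBuf = new int[len];
    long[] offBuf = new long[len];
    for (int shift = 0; shift < 32; shift += 8) {
      int[] counts = new int[257];
      for (int i = 0; i < len; i++) {
        counts[((docs[i] >>> shift) & 0xFF) + 1]++;
      }
      for (int b = 1; b < 257; b++) {
        counts[b] += counts[b - 1]; // prefix sums -> bucket start positions
      }
      for (int i = 0; i < len; i++) {
        int dest = counts[(docs[i] >>> shift) & 0xFF]++;
        docBuf[dest] = docs[i];
        offBuf[dest] = offsets[i]; // the parallel array moves in lockstep
      }
      System.arraycopy(docBuf, 0, docs, 0, len);
      System.arraycopy(offBuf, 0, offsets, 0, len);
    }
  }

  public static void main(String[] args) {
    int[] docs = {5, 3, 9, 3};
    long[] offsets = {50, 30, 90, 31};
    sort(docs, offsets, 4);
    if (!Arrays.equals(docs, new int[] {3, 3, 5, 9})) throw new AssertionError();
    // Stability keeps the two doc-3 offsets in their original order.
    if (!Arrays.equals(offsets, new long[] {30, 31, 50, 90})) throw new AssertionError();
  }
}
```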
   
   
   





Re: [PR] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on PR #12716:
URL: https://github.com/apache/lucene/pull/12716#issuecomment-1808013188

   @shubhamvishu can we close this one?  Any other things to try?  For some 
reason, FST building of enwiki terms seems not to like this magical hashing...





Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on code in PR #12699:
URL: https://github.com/apache/lucene/pull/12699#discussion_r1391017311


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java:
##
@@ -484,15 +487,15 @@ public boolean seekExact(BytesRef target) throws IOException {
   //   System.out.println("no seek state; push root frame");
   // }
 
-  output = arc.output();
+  accumulator.push(arc.output());
 
   currentFrame = staticFrame;
 
   // term.length = 0;
   targetUpto = 0;
-  currentFrame =
-  pushFrame(
-  arc, Lucene90BlockTreeTermsReader.FST_OUTPUTS.add(output, 
arc.nextFinalOutput()), 0);
+  accumulator.push(arc.nextFinalOutput());

Review Comment:
   Could we rename these to `outputAccumulator`?



##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java:
##
@@ -1190,4 +1176,65 @@ public void seekExact(long ord) {
   public long ord() {
 throw new UnsupportedOperationException();
   }
+
+  static class OutputAccumulator extends DataInput {
+
+/** We could have at most 10 no-empty arcs: 9 for vLong and 1 for floor data. */
+private static final int MAX_ARC = 10;

Review Comment:
   Hmm -- makes me nervous.  What if we change our output format and FST can do 
more sharing than it does today?  Can we make it grow on demand if needed?



##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java:
##
@@ -680,14 +676,8 @@ public SeekStatus seekCeil(BytesRef target) throws IOException {
 + (char) arc.label()
 + " targetLabel="
 + (char) (target.bytes[target.offset + targetUpto] & 0xFF);
-// TODO: we could save the outputs in local
-// byte[][] instead of making new objs ever
-// seek; but, often the FST doesn't have any
-// shared bytes (but this could change if we
-// reverse vLong byte order)

Review Comment:
   Aha!  Another ancient yet important `TODO`!  Nice ;)



##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java:
##
@@ -1190,4 +1176,65 @@ public void seekExact(long ord) {
   public long ord() {
 throw new UnsupportedOperationException();
   }
+
+  static class OutputAccumulator extends DataInput {
+
+/** We could have at most 10 no-empty arcs: 9 for vLong and 1 for floor data. */
+private static final int MAX_ARC = 10;
+
+BytesRef[] outputs = new BytesRef[MAX_ARC];
+BytesRef current;
+int num = 0;
+int outputIndex = 0;
+int index = 0;

Review Comment:
   You don't need the `= 0` initializers -- it is already java's default.



##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/FieldReader.java:
##
@@ -118,13 +118,11 @@ long readVLongOutput(DataInput in) throws IOException {
* Package private for testing.
*/
   static long readMSBVLong(DataInput in) throws IOException {
-long l = 0L;
-while (true) {
-  byte b = in.readByte();
+byte b = in.readByte();
+long l = b & 0x7FL;
+while (b < 0) {

Review Comment:
   Why this change?  Is it a bit of specialization so CPU can predict branch a 
bit better?



##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/IntersectTermsEnumFrame.java:
##
@@ -142,12 +138,20 @@ public void setState(int state) {
   }
 
   void load(BytesRef frameIndexData) throws IOException {
-if (frameIndexData != null) {

Review Comment:
   Hmm was `frameIndexData` never `null`?



##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java:
##
@@ -1190,4 +1176,65 @@ public void seekExact(long ord) {
   public long ord() {
 throw new UnsupportedOperationException();
   }
+
+  static class OutputAccumulator extends DataInput {

Review Comment:
   So this class (efficiently) accumulates `BytesRef` outputs, and then after all appending is done, becomes a `DataInput` allowing the caller to read the bytes sequentially from the logically concatenated `byte[]`?  Cool.
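That reading is easy to sketch in isolation. The toy class below (names are illustrative, not Lucene's) accumulates several byte chunks and then reads them back as one logically concatenated stream without copying them into a single array, which is the core of the idea.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkReader {
  private final List<byte[]> chunks = new ArrayList<>();
  private int chunkIndex = 0; // which chunk we are currently reading
  private int pos = 0;        // position within that chunk

  void push(byte[] chunk) {
    chunks.add(chunk);
  }

  byte readByte() {
    // Skip exhausted (or empty) chunks before reading the next byte.
    while (pos == chunks.get(chunkIndex).length) {
      chunkIndex++;
      pos = 0;
    }
    return chunks.get(chunkIndex)[pos++];
  }

  public static void main(String[] args) {
    ChunkReader r = new ChunkReader();
    r.push(new byte[] {1, 2});
    r.push(new byte[] {});
    r.push(new byte[] {3});
    r.push(new byte[] {4, 5});
    for (byte expected = 1; expected <= 5; expected++) {
      if (r.readByte() != expected) throw new AssertionError("mismatch at " + expected);
    }
  }
}
```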






Re: [PR] Random access term dictionary [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1391027915


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> 
(type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+  private static long MAX_ORD = (1L << 60) - 1;
+
+  private final long[] countPerType = new long[TermType.NUM_TOTAL_TYPES];
+  private final FSTCompiler fstCompiler =
+  new FSTCompiler<>(FST.INPUT_TYPE.BYTE1, 
PositiveIntOutputs.getSingleton());
+
+  TermsIndexBuilder() {
+Arrays.fill(countPerType, -1);
+  }
+
+  public void addTerm(BytesRef term, TermType termType) throws IOException {
+IntsRefBuilder scratchInts = new IntsRefBuilder();
+long ord = ++countPerType[termType.getId()];
+fstCompiler.add(Util.toIntsRef(term, scratchInts), encode(ord, termType));
+  }
+
+  public TermsIndex build() throws IOException {
+return new TermsIndex(fstCompiler.compile());
+  }
+
+  private long encode(long ord, TermType termType) {
+// use a single long to encode `ord` and `termType`
+// also consider the special value of `PositiveIntOutputs.NO_OUTPUT == 0`
+// so it looks like this |...  ord ...| termType| ... hasOutput  ...|
+// where termType takes 3 bit and hasOutput takes the lowest bit. The rest 
is taken by ord
+if (ord < 0) {
+  throw new IllegalArgumentException("can't encode negative ord");
+}
+if (ord > MAX_ORD) {
+  throw new IllegalArgumentException(
+  "Input ord is too large for TermType: "

Review Comment:
   Well, maybe something like `Can only index XXX unique terms, but saw YYY 
terms for field ZZZ` or so?  (If that's what this error message really means to 
the user).
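The bit layout described in the quoted `encode` comment can be sketched in isolation. The snippet below is an assumption based on that comment, not necessarily the PR's exact packing: the low bit marks "has output" (so the encoded value never collides with `PositiveIntOutputs.NO_OUTPUT == 0`), the next 3 bits hold the term type, and the remaining 60 bits hold the ord.

```java
public class OrdTypeEncoding {
  static final long MAX_ORD = (1L << 60) - 1;

  // Layout: | ord (60 bits) | termType (3 bits) | hasOutput (1 bit) |
  static long encode(long ord, int termType) {
    if (ord < 0 || ord > MAX_ORD) {
      throw new IllegalArgumentException("ord out of range: " + ord);
    }
    return (ord << 4) | ((long) termType << 1) | 1L;
  }

  static long decodeOrd(long encoded) {
    return encoded >>> 4;
  }

  static int decodeTermType(long encoded) {
    return (int) ((encoded >>> 1) & 0x7);
  }

  public static void main(String[] args) {
    long enc = encode(123456L, 5);
    if (enc == 0) throw new AssertionError("must not collide with NO_OUTPUT");
    if (decodeOrd(enc) != 123456L) throw new AssertionError();
    if (decodeTermType(enc) != 5) throw new AssertionError();
  }
}
```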






Re: [PR] Random access term dictionary [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1808041346

   > Thanks for the tips! Yes, almost there. I'm working on the real compact 
bitpacker and unpacker. I still need to implement the PostingFormat afterwards. 
Do you think I need to implement a new Codec?
   
   You don't really need a new `Codec` -- you could test this easily by 
subclassing `Lucene99Codec` and overriding `getPostingsFormatForField` to use 
your new cool bit packing `PostingsFormat`.





Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-13 Thread via GitHub


jpountz commented on PR #12782:
URL: https://github.com/apache/lucene/pull/12782#issuecomment-1808043672

   Could you check in your benchmark under `lucene/benchmark-jmh` so that we 
could play with it?





Re: [PR] Random access term dictionary [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1391032959


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/TermStateCodec.java:
##
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene99.randomaccess;
+
+import 
org.apache.lucene.codecs.lucene99.Lucene99PostingsFormat.IntBlockTermState;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitPacker;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitUnpacker;
+import org.apache.lucene.util.BytesRef;
+
+interface TermStateCodec {

Review Comment:
   Hmm maybe don't name it `Codec` since it makes it seem like it could be a 
normal Lucene `Codec`?  But this is much lower level ... maybe 
`TermStateEncoder`?



##
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/TermStateCodecImpl.java:
##
@@ -0,0 +1,234 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene99.randomaccess;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import 
org.apache.lucene.codecs.lucene99.Lucene99PostingsFormat.IntBlockTermState;
+import org.apache.lucene.index.IndexOptions;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.DocFreq;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.DocStartFP;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.LastPositionBlockOffset;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.PayloadStartFP;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.PositionStartFP;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.SingletonDocId;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.SkipOffset;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.TotalTermFreq;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitPacker;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitUnpacker;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.store.ByteArrayDataOutput;
+import org.apache.lucene.util.BytesRef;
+
+final class TermStateCodecImpl implements TermStateCodec {

Review Comment:
   Do we really need the separate interface and impl yet?  Is there more than one impl we are exploring?



##
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/bitpacking/FixedSizeByteArrayBitPacker.java:
##
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,

Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]

2023-11-13 Thread via GitHub


gf2121 commented on code in PR #12699:
URL: https://github.com/apache/lucene/pull/12699#discussion_r1391055842


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/FieldReader.java:
##
@@ -118,13 +118,11 @@ long readVLongOutput(DataInput in) throws IOException {
* Package private for testing.
*/
   static long readMSBVLong(DataInput in) throws IOException {
-long l = 0L;
-while (true) {
-  byte b = in.readByte();
+byte b = in.readByte();
+long l = b & 0x7FL;
+while (b < 0) {

Review Comment:
   Background: 
https://github.com/apache/lucene/issues/12659#issuecomment-1759244758
   
   This was an attempt to get back the PKLookup speed, but luceneutil has shown that this change does not affect performance. I kept it here because I think this way is more consistent with `DataInput#readVInt` and more readable. But now I would revert it, since it is not really related to the point of this PR.
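For reference, the two equivalent decode shapes under discussion can be written self-contained over a plain `byte[]` instead of a `DataInput`. Both read an MSB-first vLong (high 7-bit group first, continuation flagged by the sign bit of each byte); the rewritten shape hoists the first read out of the loop. This is a sketch of the idea, restricted to non-negative values.

```java
public class MsbVLong {
  // Encodes a non-negative long as an MSB-first vLong: 7-bit groups from the
  // highest group down, with the 0x80 continuation bit on all but the last.
  static byte[] encode(long value) {
    int bits = 64 - Long.numberOfLeadingZeros(value);
    int groups = Math.max(1, (bits + 6) / 7);
    byte[] out = new byte[groups];
    for (int g = 0; g < groups; g++) {
      int shift = 7 * (groups - 1 - g);
      byte group = (byte) ((value >>> shift) & 0x7F);
      if (g < groups - 1) {
        group |= (byte) 0x80; // continuation bit
      }
      out[g] = group;
    }
    return out;
  }

  // Original loop shape: accumulate inside an unconditional loop, break when
  // the continuation bit is clear.
  static long decodeOriginal(byte[] b) {
    long l = 0L;
    int i = 0;
    while (true) {
      byte cur = b[i++];
      l = (l << 7) | (cur & 0x7FL);
      if (cur >= 0) {
        return l; // sign bit clear: last group
      }
    }
  }

  // Rewritten shape: read the first group before the loop, continue while the
  // sign (continuation) bit is set.
  static long decodeRewritten(byte[] b) {
    int i = 0;
    byte cur = b[i++];
    long l = cur & 0x7FL;
    while (cur < 0) {
      cur = b[i++];
      l = (l << 7) | (cur & 0x7FL);
    }
    return l;
  }

  public static void main(String[] args) {
    long[] samples = {0L, 1L, 127L, 128L, 300L, 123456789L, Long.MAX_VALUE};
    for (long v : samples) {
      byte[] enc = encode(v);
      if (decodeOriginal(enc) != v) throw new AssertionError("original failed on " + v);
      if (decodeRewritten(enc) != v) throw new AssertionError("rewritten failed on " + v);
    }
  }
}
```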






Re: [I] Can FST read bytes forward? [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on issue #12355:
URL: https://github.com/apache/lucene/issues/12355#issuecomment-1808084756

   +1 to find a way to reverse the bytes at compilation time.
   
   The reversal of bytes during FST compilation is so hard to think about!  It happens because the FST is logically append-only and sort of grows backwards (from the suffixes, inwards onto the prefixes), and the newly written nodes always point backwards to the already written ones (appended to the growing `byte[]`, or, soon, `DataOutput`).
   
   But logically we ought to be able to write all the bytes backwards, then reverse them; then, when resolving absolute or relative node addresses at FST read time, we'd need to re-reverse those addresses.  Or, we could try to rewrite the embedded node address references during/after reversal so we don't need to re-reverse on each node read?  The pointers will necessarily be different (take a different number of bytes after reversal) since small node addresses would become big node addresses and take more bytes to encode absolutely.  It might even make the FST larger, since the common suffixes today have smallish/earlyish node addresses.  This is similar to what `pack` used to do (actually rewrite addresses), and it was hairy.
   
   So maybe for starters we do the simple "reverse the `byte[]` after writing it all" and then "re-reverse addresses on decode"?  I wonder if Tantivy's FST has some sort of post-write reverse step?  Or does it always do cache-unfriendly backwards reads during FST traversal?
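The "reverse once, re-reverse addresses on decode" idea can be illustrated with a toy example (purely illustrative, not FST code): a byte appended at absolute position `p` in the backward-grown array lands at `length - 1 - p` after the one-time reversal, so stored addresses can be mapped on read instead of rewritten.

```java
public class ReverseAddressDemo {
  static byte[] reverse(byte[] in) {
    byte[] out = new byte[in.length];
    for (int i = 0; i < in.length; i++) {
      out[in.length - 1 - i] = in[i];
    }
    return out;
  }

  // A node written at absolute position oldPos in the backward-grown array is
  // found here after the whole array has been reversed once.
  static int reAddress(int oldPos, int length) {
    return length - 1 - oldPos;
  }

  public static void main(String[] args) {
    byte[] backward = {10, 20, 30, 40}; // bytes in append order during compilation
    byte[] forward = reverse(backward);
    if (forward[reAddress(2, backward.length)] != 30) throw new AssertionError();
    if (forward[reAddress(0, backward.length)] != 10) throw new AssertionError();
  }
}
```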





Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]

2023-11-13 Thread via GitHub


gf2121 commented on code in PR #12699:
URL: https://github.com/apache/lucene/pull/12699#discussion_r1391060906


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/IntersectTermsEnumFrame.java:
##
@@ -142,12 +138,20 @@ public void setState(int state) {
   }
 
   void load(BytesRef frameIndexData) throws IOException {
-if (frameIndexData != null) {

Review Comment:
   Yes. We checked null before for [line 
119](https://github.com/apache/lucene/pull/12699/files/71ef88acf5bc3b2c47086b55d63d64720b4232b7#diff-77d427360dcf008af183201e0e748d77ccadae1b2359fa96940c407b48b3f85bR119).
 It is cast to long now and we do the null check in [line 
153](https://github.com/apache/lucene/pull/12699/files/71ef88acf5bc3b2c47086b55d63d64720b4232b7#diff-77d427360dcf008af183201e0e748d77ccadae1b2359fa96940c407b48b3f85bR153)
 instead.






Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]

2023-11-13 Thread via GitHub


gf2121 commented on code in PR #12699:
URL: https://github.com/apache/lucene/pull/12699#discussion_r1391079102


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java:
##
@@ -1190,4 +1176,65 @@ public void seekExact(long ord) {
   public long ord() {
 throw new UnsupportedOperationException();
   }
+
+  static class OutputAccumulator extends DataInput {
+
+/** We could have at most 10 non-empty arcs: 9 for vLong and 1 for floor 
data. */
+private static final int MAX_ARC = 10;

Review Comment:
   +1, growing on demand is much more flexible!
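The grow-on-demand approach endorsed above can be sketched with plain JDK code, in the spirit of Lucene's `ArrayUtil.grow`. The 1.5x oversizing factor here is an illustrative choice, not the actual implementation.

```java
import java.util.Arrays;

// Grow-on-demand in place of a fixed MAX_ARC cap: the array is enlarged
// only when a node actually needs more slots, with 1.5x oversizing to
// amortize the copies.
public class GrowDemo {
  static long[] grow(long[] arr, int minSize) {
    if (arr.length >= minSize) {
      return arr; // already big enough
    }
    int newSize = Math.max(minSize, arr.length + (arr.length >> 1));
    return Arrays.copyOf(arr, newSize);
  }

  public static void main(String[] args) {
    long[] outputs = new long[4];
    outputs = grow(outputs, 10); // a node needing more than 4 arcs
    System.out.println(outputs.length); // 10
  }
}
```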






Re: [I] surpriseMePolygon and createRegularPolygon in test util class returns invalid polygon [lucene]

2023-11-13 Thread via GitHub


stefanvodita commented on issue #12596:
URL: https://github.com/apache/lucene/issues/12596#issuecomment-1808119452

   Another edge case that can cause problems is that of multiple points on the 
same line (e.g. 3 points in a row with the same x coordinate). This happens for 
regular and for "surprise" polygons.
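One common way to detect the degenerate case described above is a cross-product test on three consecutive vertices. This is a generic sketch (not the actual Lucene test-util code), using exact comparison; real geometry code would typically compare against a tolerance.

```java
// Cross-product collinearity test: three points a, b, c are collinear when
// the cross product of (b - a) and (c - a) is zero.
public class CollinearCheck {
  static boolean collinear(double ax, double ay, double bx, double by,
                           double cx, double cy) {
    return (bx - ax) * (cy - ay) - (by - ay) * (cx - ax) == 0;
  }

  public static void main(String[] args) {
    // Three points in a row with the same x coordinate, as in the report:
    System.out.println(collinear(0, 0, 0, 1, 0, 2)); // true
    System.out.println(collinear(0, 0, 1, 1, 2, 3)); // false
  }
}
```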





Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1391064549


##
lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java:
##
@@ -359,7 +383,9 @@ public void truncate(long newLen) {
 assert newLen == getPosition();
   }
 
-  public void finish() {
+  @Override
+  public void freeze() {
+this.frozen = true;

Review Comment:
   Should we `assert frozen == false` before setting it to `true`?  Is caller 
allowed to `freeze()` more than once?
   
   And maybe remove `this.` prefix?



##
lucene/core/src/java/org/apache/lucene/util/fst/Freezeable.java:
##
@@ -0,0 +1,24 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.fst;
+
+/** Represents a data structure that can be frozen and become immutable. */
+public interface Freezeable {

Review Comment:
   Can this be package private?
   
   Also `Freezeable -> Freezable`, tricky.



##
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##
@@ -120,31 +122,54 @@ public class FSTCompiler {
   final float directAddressingMaxOversizingFactor;
   long directAddressingExpansionCredit;
 
-  final BytesStore bytes;
+  // writer for frozen nodes
+  final FSTDataOutputWriter fstWriter;
+  // reader for the frozen nodes
+  final FSTReader fstReader;
+
+  // buffer to store the scratch bytes before writing to the fstWriter
+  final BytesStore bytes = new BytesStore(DEFAULT_SCRATCH_PAGE_BITS);

Review Comment:
   Rename to `scratchBytes`, and maybe say in the comment `// buffer to store 
bytes for the one node we are currently writing`?
   
   Also, instead of `BytesStore`, could we use Lucene's existing `oal.store` 
`ByteBuffersDataOutput.newResettableInstance()`?  Let's try to get away from 
FST implementing so many of its own storage classes/interfaces...
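The resettable scratch-buffer pattern suggested above can be illustrated with plain JDK classes; `ByteArrayOutputStream` stands in here for Lucene's `ByteBuffersDataOutput.newResettableInstance()`, which plays the same role without reallocating its pages on reset.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Scratch-buffer pattern: write one node's bytes, copy them out to the
// main output, then reset the buffer for the next node.
public class ScratchBufferDemo {
  public static void main(String[] args) {
    ByteArrayOutputStream scratch = new ByteArrayOutputStream();
    scratch.write(1);
    scratch.write(2);
    byte[] node = scratch.toByteArray(); // bytes for the node just written
    scratch.reset();                     // reuse for the next node
    System.out.println(Arrays.toString(node) + " " + scratch.size());
  }
}
```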



##
lucene/core/src/java/org/apache/lucene/util/fst/FSTDataOutputWriter.java:
##
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.fst;
+
+import java.io.IOException;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.util.Accountable;
+import org.apache.lucene.util.RamUsageEstimator;
+
+/**
+ * Allow the FST to be written to a {@link DataOutput}. Generally it only 
supports writing to the
+ * {@link DataOutput} and not reading from it. To read, you must either: 1. 
construct a
+ * corresponding {@link org.apache.lucene.store.DataInput} and use the {@link 
FSTStore} to read pr
+ * 2. use a DataOutput which also implements {@link FSTReader}
+ */
+final class FSTDataOutputWriter implements Freezeable {
+
+  private static final long BASE_RAM_BYTES_USED =
+  RamUsageEstimator.shallowSizeOfInstance(FSTDataOutputWriter.class);
+
+  /** the main DataOutput to store the FST bytes */
+  private final DataOutput dataOutput;
+
+  private long size = 0L;

Review Comment:
   You don't need the `= 0L` -- it's java's default already.



##
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##
@@ -120,31 +122,54 @@ public class FSTCompiler {
   final float directAddressingMaxOversizingFactor;
   long directAddressingExpansionCredit;
 
-  final BytesStore bytes;
+  // writer for frozen nodes
+  final FSTDataOutputWriter fstWriter;
+  // reader for the frozen nodes
+  final FSTReader fstReader;
+
+  // buffer to store the scratch bytes before writing to the fstWriter

Re: [I] Can FST read bytes forward? [lucene]

2023-11-13 Thread via GitHub


dungba88 commented on issue #12355:
URL: https://github.com/apache/lucene/issues/12355#issuecomment-1808180025

   Looking at https://github.com/BurntSushi/fst/blob/master/src/raw/node.rs, it 
seems Tantivy also reads bytes backward. However, this Node class only works 
with a byte array; I think the array was memory-mapped from a file before that.





Re: [PR] Fix CheckIndex to detect major corruption with old (not the latest) commit point [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on PR #12530:
URL: https://github.com/apache/lucene/pull/12530#issuecomment-1808188829

   Hmm the Gradle Precommit Checks failed with:
   
   ```
   * What went wrong:
   Execution failed for task ':lucene:documentation:createDocumentationIndex'.
   > Out of space in CodeCache for adapters
   ```
   
   I'm not sure what the adapters are, why they must be stored in `CodeCache`, 
nor why `CodeCache` ran out of space.  Maybe the adapters are massive?  I'll 
retry the job -- `precommit` passes locally.





Re: [PR] Generalize LSBRadixSorter and use it in SortingPostingsEnum [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on code in PR #12800:
URL: https://github.com/apache/lucene/pull/12800#discussion_r1391137486


##
lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/DocSorterBenchmark.java:
##
@@ -0,0 +1,241 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.benchmark.jmh;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.concurrent.ThreadLocalRandom;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BaseLSBRadixSorter;
+import org.apache.lucene.util.IntsRef;
+import org.apache.lucene.util.LongsRef;
+import org.apache.lucene.util.TimSorter;
+import org.apache.lucene.util.packed.PackedInts;
+import org.openjdk.jmh.annotations.*;
+
+@BenchmarkMode(Mode.Throughput)
+@OutputTimeUnit(TimeUnit.MILLISECONDS)
+@State(Scope.Benchmark)
+// first iteration is complete garbage, so make sure we really warmup
+@Warmup(iterations = 3, time = 1)
+// real iterations. not useful to spend tons of time here, better to fork more
+@Measurement(iterations = 5, time = 1)
+// engage some noise reduction
+@Fork(
+value = 1,
+jvmArgsAppend = {"-Xmx2g", "-Xms2g", "-XX:+AlwaysPreTouch"})
+public class DocSorterBenchmark {
+
+  private DocOffsetTimSorter timSorter;
+  private DocOffsetRadixSorter radixSorter;
+
+  @Param({"10"})
+  int size;
+
+  @Param({"natural", "reverse", "random", "partial"})
+  String order;
+
+  @Param({"24", "31"})
+  int bit;
+
+  @Setup(Level.Invocation)
+  public void init() {
+ThreadLocalRandom random = ThreadLocalRandom.current();
+
+int maxDoc = 1 << (bit - 1);
+assert PackedInts.bitsRequired(maxDoc) == bit;
+// random byte arrays for binary methods
+int[] doc = new int[size];
+long[] off = new long[size];
+for (int i = 0; i < size; ++i) {
+  doc[i] = random.nextInt(maxDoc);
+  off[i] = random.nextLong(Long.MAX_VALUE);
+}
+
+switch (order) {
+  case "natural" -> Arrays.sort(doc);
+  case "reverse" -> {
+List<Integer> docs = 
Arrays.stream(doc).sorted().boxed().collect(Collectors.toList());
+Collections.reverse(docs);
+doc = docs.stream().mapToInt(i -> i).toArray();
+  }

Review Comment:
   Hmm do you need `break` in each of these cases?  Else control just flows to 
the next case?  Or did java make that unnecessary at some point?






Re: [PR] Generalize LSBRadixSorter and use it in SortingPostingsEnum [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on code in PR #12800:
URL: https://github.com/apache/lucene/pull/12800#discussion_r1391138011


##
lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/DocSorterBenchmark.java:
##
@@ -0,0 +1,241 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.benchmark.jmh;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.concurrent.ThreadLocalRandom;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BaseLSBRadixSorter;
+import org.apache.lucene.util.IntsRef;
+import org.apache.lucene.util.LongsRef;
+import org.apache.lucene.util.TimSorter;
+import org.apache.lucene.util.packed.PackedInts;
+import org.openjdk.jmh.annotations.*;
+
+@BenchmarkMode(Mode.Throughput)
+@OutputTimeUnit(TimeUnit.MILLISECONDS)
+@State(Scope.Benchmark)
+// first iteration is complete garbage, so make sure we really warmup
+@Warmup(iterations = 3, time = 1)
+// real iterations. not useful to spend tons of time here, better to fork more
+@Measurement(iterations = 5, time = 1)
+// engage some noise reduction
+@Fork(
+value = 1,
+jvmArgsAppend = {"-Xmx2g", "-Xms2g", "-XX:+AlwaysPreTouch"})
+public class DocSorterBenchmark {
+
+  private DocOffsetTimSorter timSorter;
+  private DocOffsetRadixSorter radixSorter;
+
+  @Param({"10"})
+  int size;
+
+  @Param({"natural", "reverse", "random", "partial"})
+  String order;
+
+  @Param({"24", "31"})
+  int bit;
+
+  @Setup(Level.Invocation)
+  public void init() {
+ThreadLocalRandom random = ThreadLocalRandom.current();
+
+int maxDoc = 1 << (bit - 1);
+assert PackedInts.bitsRequired(maxDoc) == bit;
+// random byte arrays for binary methods
+int[] doc = new int[size];
+long[] off = new long[size];
+for (int i = 0; i < size; ++i) {
+  doc[i] = random.nextInt(maxDoc);
+  off[i] = random.nextLong(Long.MAX_VALUE);

Review Comment:
   Thank you for improving our sorting @gf2121!
   
   When possible I much prefer benchmarks based on real data/corpora -- if you 
run benchmarks on random data you draw random conclusions :)






Re: [PR] Generalize LSBRadixSorter and use it in SortingPostingsEnum [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on PR #12800:
URL: https://github.com/apache/lucene/pull/12800#issuecomment-1808206830

   I wonder whether `Arrays.sort` might be a good choice instead of making our 
own powerful sorting classes?  [OpenJDK is (gradually?) taking advantage of 
fast SIMD sorting](https://github.com/apache/lucene/issues/12399) so at some 
point / on some CPUs / for certain JDK versions, `Arrays.sort` should become 
faster than our own sorts.  (But, I don't think `Arrays.sort` has a parallel 
array option?).
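Since `Arrays.sort` indeed has no parallel-array variant, one standard workaround is to sort an index permutation by the key array and then apply it to both arrays. This is a generic sketch, not the `SortingPostingsEnum` implementation; it boxes the indices, which is exactly the overhead specialized sorters avoid.

```java
import java.util.Arrays;
import java.util.Comparator;

// Sorting parallel arrays (doc ids and their offsets) with Arrays.sort by
// sorting an index permutation first, then applying it to both arrays.
public class ParallelArraySort {
  public static void main(String[] args) {
    int[] docs = {30, 10, 20};
    long[] offsets = {300L, 100L, 200L};

    Integer[] perm = {0, 1, 2};
    Arrays.sort(perm, Comparator.comparingInt(i -> docs[i]));

    int[] sortedDocs = new int[docs.length];
    long[] sortedOffsets = new long[offsets.length];
    for (int i = 0; i < perm.length; i++) {
      sortedDocs[i] = docs[perm[i]];
      sortedOffsets[i] = offsets[perm[i]];
    }
    System.out.println(Arrays.toString(sortedDocs));    // [10, 20, 30]
    System.out.println(Arrays.toString(sortedOffsets)); // [100, 200, 300]
  }
}
```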





Re: [I] Would SIMD powered sort (on top of Panama) be worth it? [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on issue #12399:
URL: https://github.com/apache/lucene/issues/12399#issuecomment-1808206348

   At least OpenJDK 22 on modern-ish Intel x86-64 CPUs will [sometimes use 
SIMD for fast `Arrays.sort`](https://bugs.openjdk.org/browse/JDK-8309130)!





Re: [PR] Remove patching for doc blocks. [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on PR #12741:
URL: https://github.com/apache/lucene/pull/12741#issuecomment-1808208970

   > I ran a new luceneutil benchmark on Saturday with my commit 
[8ae598b](https://github.com/apache/lucene/commit/8ae598bae593e1faa4ff82a87f4cd45f120f1059)
 (using Lucene99PostingsFormat) as candidate and the commit's parent as 
baseline (using Lucene90PostingsFormat).
   
   Is this `wikimediumall` or `wikibigall`?





Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-13 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1391146752


##
lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java:
##
@@ -26,7 +26,8 @@
 // TODO: merge with PagedBytes, except PagedBytes doesn't
 // let you read while writing which FST needs
 
-class BytesStore extends DataOutput implements FSTReader {
+// TODO: Separate the scratch writer and reader functionality
+class BytesStore extends DataOutput implements FSTReader, Freezeable {

Review Comment:
   Hm, I'm wondering why we should let the FSTCompiler know about the 
RandomAccessInput. In most cases, we would let the FSTCompiler just write 
to a DataOutput, and then users would read it with the public FST constructor, 
right?
   
   The case where we want to read from the FST immediately after writing is the 
one with BytesStore.






Re: [PR] Generalize LSBRadixSorter and use it in SortingPostingsEnum [lucene]

2023-11-13 Thread via GitHub


gf2121 commented on code in PR #12800:
URL: https://github.com/apache/lucene/pull/12800#discussion_r1391150384


##
lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/DocSorterBenchmark.java:
##
@@ -0,0 +1,241 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.benchmark.jmh;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.concurrent.ThreadLocalRandom;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BaseLSBRadixSorter;
+import org.apache.lucene.util.IntsRef;
+import org.apache.lucene.util.LongsRef;
+import org.apache.lucene.util.TimSorter;
+import org.apache.lucene.util.packed.PackedInts;
+import org.openjdk.jmh.annotations.*;
+
+@BenchmarkMode(Mode.Throughput)
+@OutputTimeUnit(TimeUnit.MILLISECONDS)
+@State(Scope.Benchmark)
+// first iteration is complete garbage, so make sure we really warmup
+@Warmup(iterations = 3, time = 1)
+// real iterations. not useful to spend tons of time here, better to fork more
+@Measurement(iterations = 5, time = 1)
+// engage some noise reduction
+@Fork(
+value = 1,
+jvmArgsAppend = {"-Xmx2g", "-Xms2g", "-XX:+AlwaysPreTouch"})
+public class DocSorterBenchmark {
+
+  private DocOffsetTimSorter timSorter;
+  private DocOffsetRadixSorter radixSorter;
+
+  @Param({"10"})
+  int size;
+
+  @Param({"natural", "reverse", "random", "partial"})
+  String order;
+
+  @Param({"24", "31"})
+  int bit;
+
+  @Setup(Level.Invocation)
+  public void init() {
+ThreadLocalRandom random = ThreadLocalRandom.current();
+
+int maxDoc = 1 << (bit - 1);
+assert PackedInts.bitsRequired(maxDoc) == bit;
+// random byte arrays for binary methods
+int[] doc = new int[size];
+long[] off = new long[size];
+for (int i = 0; i < size; ++i) {
+  doc[i] = random.nextInt(maxDoc);
+  off[i] = random.nextLong(Long.MAX_VALUE);
+}
+
+switch (order) {
+  case "natural" -> Arrays.sort(doc);
+  case "reverse" -> {
+List<Integer> docs = 
Arrays.stream(doc).sorted().boxed().collect(Collectors.toList());
+Collections.reverse(docs);
+doc = docs.stream().mapToInt(i -> i).toArray();
+  }

Review Comment:
   > Or did java make that unnecessary at some point?
   
   Yes! The `->` switch expression will not fall through, see [JEP 
361](https://openjdk.org/jeps/361).
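A minimal demonstration of the JEP 361 semantics referenced above: the arrow form selects exactly one case, with no `break` and no fall-through. The case labels here mirror the benchmark's `order` parameter purely for illustration.

```java
// Arrow-form switch (JEP 361): each case body runs alone; control never
// falls through to the next label, so no break statements are needed.
public class SwitchDemo {
  public static void main(String[] args) {
    String order = "natural";
    switch (order) {
      case "natural" -> System.out.println("sorted ascending");
      case "reverse" -> System.out.println("sorted descending");
      default -> System.out.println("left as-is");
    }
    // prints only "sorted ascending"
  }
}
```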






Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-13 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1391146752


##
lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java:
##
@@ -26,7 +26,8 @@
 // TODO: merge with PagedBytes, except PagedBytes doesn't
 // let you read while writing which FST needs
 
-class BytesStore extends DataOutput implements FSTReader {
+// TODO: Separate the scratch writer and reader functionality
+class BytesStore extends DataOutput implements FSTReader, Freezeable {

Review Comment:
   Hm, I'm wondering why we should let the FSTCompiler know about the 
RandomAccessInput. In most cases, we would let the FSTCompiler just write 
to a DataOutput, and then users would read it with the public FST constructor, 
right?
   
   The case where we want to read from the FST immediately after writing is the 
one with BytesStore.
   
   > Do we really need to publish a Freezable interface? Can't it just be a 
private boolean frozen in this class? Is anything needing freezing besides this 
BytesStore?
   
   I could remove the Freezeable interface. It was only used for the 
writeTo(DataOutput), but that would also be unnecessary.









Re: [I] [DISCUSS] Should we change TieredMergePolicy's segment deletion accounting to use numDocs in the denominator rather than MaxDoc? [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on issue #12792:
URL: https://github.com/apache/lucene/issues/12792#issuecomment-1808225864

   +1, I like that interpretation @jpountz.  @yugushihuang maybe we could 
improve the javadocs to express the formula and @jpountz's sentiment about it?





Re: [PR] Deprecated public constructor of FSTCompiler in favor of the Builder. [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on code in PR #12715:
URL: https://github.com/apache/lucene/pull/12715#discussion_r1391162919


##
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##
@@ -125,8 +125,11 @@ public class FSTCompiler {
   /**
* Instantiates an FST/FSA builder with default settings and pruning options 
turned off. For more
* tuning and tweaking, see {@link Builder}.
+   *
+   * @deprecated Use FSTCompiler.Builder instead. This method will be removed 
in an upcoming major
+   * release.
*/
-  // TODO: remove this?  Builder API should be the only entry point?
+  @Deprecated
   public FSTCompiler(FST.INPUT_TYPE inputType, Outputs outputs) {

Review Comment:
   Let's remove instead?



##
lucene/CHANGES.txt:
##
@@ -7,6 +7,8 @@ http://s.apache.org/luceneversions
 
 API Changes
 -
+* GITHUB-12695: Deprecated public constructor of FSTCompiler. Please use 
FSTCompiler.Builder

Review Comment:
   Note that the FST API is `@lucene.internal` so we are free to 1) skip 
deprecation -- just remove the old API, and 2) backport to 9.9.
   
   We are delaying backporting all these recent FST changes for now (letting 
things bake in trunk) so we would first commit this to `main` and later 
backport with the full patch.  So for now, let's leave this `CHANGES` entry in 
10.0.0 section, but let's remove the deprecated APIs entirely?






Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-13 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1391165022


##
lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java:
##
@@ -26,7 +26,8 @@
 // TODO: merge with PagedBytes, except PagedBytes doesn't
 // let you read while writing which FST needs
 
-class BytesStore extends DataOutput implements FSTReader {
+// TODO: Separate the scratch writer and reader functionality
+class BytesStore extends DataOutput implements FSTReader, Freezeable {

Review Comment:
   > Could we remove this class entirely? And callers that want to write FST 
and immediately use it in RAM should just use ByteBuffersDataOutput for their 
scratch area?
   
   This BytesStore class currently serves 3 purposes:
   - Acts as a scratch-area writer. Its operations (copying random bytes from
one place to another, truncating, skipping bytes) seem too complicated to
model with another DataOutput.
   - Acts as an FST writer. Any other DataOutput can be used.
   - Acts as an FST reader. This is tricky, as DataOutput does not support
read operations, and a separate DataInput needs to be provided.
   
   To replace the third one, I think we should keep both the second and third
as a single class, because they have to match (a ByteBuffersDataOutput needs
to go with the ByteBufferRandomAccessInput). However, the third purpose is
only useful for write-then-read-immediately, which I assume is not the most
common use case. Here we can either use the existing BytesStore, or create
something based on ByteBuffersDataOutput as you suggested. Both are easily
replaceable in the future.






Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-13 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1391172089


##
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##
@@ -120,31 +122,54 @@ public class FSTCompiler {
   final float directAddressingMaxOversizingFactor;
   long directAddressingExpansionCredit;
 
-  final BytesStore bytes;
+  // writer for frozen nodes
+  final FSTDataOutputWriter fstWriter;
+  // reader for the frozen nodes
+  final FSTReader fstReader;

Review Comment:
   That's true, I'll remove it.






Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-13 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1391172089


##
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##
@@ -120,31 +122,54 @@ public class FSTCompiler {
   final float directAddressingMaxOversizingFactor;
   long directAddressingExpansionCredit;
 
-  final BytesStore bytes;
+  // writer for frozen nodes
+  final FSTDataOutputWriter fstWriter;
+  // reader for the frozen nodes
+  final FSTReader fstReader;

Review Comment:
   ~~That's true, I'll remove it.~~
   
   Actually, the FSTReader was needed to create the FST itself. This reader was 
mainly used for:
   - Read from the FST with `getBytesReader()`
   - Save the FST with `save()`
   
   So even if we track the size ourselves, we still need to have the FSTReader 
here






Re: [I] Port PR management bot from Apache Beam [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on issue #12796:
URL: https://github.com/apache/lucene/issues/12796#issuecomment-1808250719

   Maybe we could start super simple here, e.g. adding labels (hmm, how?), and 
commenting on PRs that are becoming stale?  I don't really like the 
auto-closing of very stale PRs ...
   
   Assigning reviewers is maybe a bit tricky since we'd need a component -> 
reviewer sort of mapping somewhere?





Re: [I] Port PR management bot from Apache Beam [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on issue #12796:
URL: https://github.com/apache/lucene/issues/12796#issuecomment-1808252953

   Where does this Apache Beam bot actually live?  Is it a GitHub action?  Do 
you have a link to its sources @stefanvodita?





Re: [PR] Remove patching for doc blocks. [lucene]

2023-11-13 Thread via GitHub


slow-J commented on PR #12741:
URL: https://github.com/apache/lucene/pull/12741#issuecomment-1808258638

   > > I ran a new luceneutil benchmark on Saturday with my commit 
[8ae598b](https://github.com/apache/lucene/commit/8ae598bae593e1faa4ff82a87f4cd45f120f1059)
 (using Lucene99PostingsFormat) as candidate and the commit's parent as 
baseline (using Lucene90PostingsFormat).
   > 
   > Is this `wikimediumall` or `wikibigall`?
   
   Should have specified, it's `wikimediumall`





Re: [I] Port PR management bot from Apache Beam [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on issue #12796:
URL: https://github.com/apache/lucene/issues/12796#issuecomment-1808259142

   Oooh I see some interesting GH action sources in Beam e.g. 
https://github.com/apache/beam/blob/master/.github/workflows/pr-bot-new-prs.yml





Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-13 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1391183424


##
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##
@@ -120,31 +122,54 @@ public class FSTCompiler {
   final float directAddressingMaxOversizingFactor;
   long directAddressingExpansionCredit;
 
-  final BytesStore bytes;
+  // writer for frozen nodes
+  final FSTDataOutputWriter fstWriter;
+  // reader for the frozen nodes
+  final FSTReader fstReader;
+
+  // buffer to store the scratch bytes before writing to the fstWriter
+  final BytesStore bytes = new BytesStore(DEFAULT_SCRATCH_PAGE_BITS);
 
   /**
* Instantiates an FST/FSA builder with default settings and pruning options 
turned off. For more
* tuning and tweaking, see {@link Builder}.
*/
   // TODO: remove this?  Builder API should be the only entry point?
   public FSTCompiler(FST.INPUT_TYPE inputType, Outputs outputs) {
-this(inputType, 32.0, outputs, true, 15, 1f);
+this(
+inputType,
+32.0,
+outputs,
+true,
+new FSTDataOutputWriter(getLegacyDataOutput(DEFAULT_BLOCK_BITS)),
+1f);
+  }
+
+  static DataOutput getLegacyDataOutput(int blockBits) {
+return new BytesStore(blockBits);
   }
 
   private FSTCompiler(
   FST.INPUT_TYPE inputType,
   double suffixRAMLimitMB,
   Outputs outputs,
   boolean allowFixedLengthArcs,
-  int bytesPageBits,
+  FSTDataOutputWriter fstWriter,
   float directAddressingMaxOversizingFactor) {
 this.allowFixedLengthArcs = allowFixedLengthArcs;
 this.directAddressingMaxOversizingFactor = 
directAddressingMaxOversizingFactor;
-bytes = new BytesStore(bytesPageBits);
 // pad: ensure no node gets address 0 which is reserved to mean
 // the stop state w/ no arcs
-bytes.writeByte((byte) 0);
-fst = new FST<>(new FST.FSTMetadata<>(inputType, null, -1, 
VERSION_CURRENT, 0), outputs, bytes);
+try {
+  fstWriter.writeByte((byte) 0);
+} catch (IOException e) {

Review Comment:
   Can we do this in a follow-up PR? Having this constructor throw an
exception means the public one also needs to throw one, and that is used in a
lot of places :)






Re: [PR] Speedup concurrent multi-segment HNWS graph search [lucene]

2023-11-13 Thread via GitHub


benwtrent commented on PR #12794:
URL: https://github.com/apache/lucene/pull/12794#issuecomment-1808282034

   @mayya-sharipova two important measurements we need to check here:
   
- When comparing baseline & candidate, can the `candidate` get to higher 
recall than baseline with lower latency?
- When using the same query parameters (fanout, k, etc.), how does
`candidate` measure against a fully merged graph (1 segment)? I am guessing a
single graph (at least when using only one search thread) will always be
faster; I am just wondering by how much.





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1808286805

   > I've got my framework set up for testing larger than memory indexes and 
have some somewhat interesting first results.
   
   Thank you for setting this up @kevindrosendahl -- these are very interesting 
results.  It's great that PQ goes quite a ways before hurting recall.





Re: [PR] [Minor] Improvements to slice pools [lucene]

2023-11-13 Thread via GitHub


mikemccand merged PR #12795:
URL: https://github.com/apache/lucene/pull/12795





Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-13 Thread via GitHub


jpountz commented on PR #12782:
URL: https://github.com/apache/lucene/pull/12782#issuecomment-1808314262

   At least in theory, group varint could be made faster than vints even with
single-byte integers, because a single check on `flag == 0` would tell us
that all 4 integers have a single byte. Now, I don't know if we should do it;
this doesn't sound like the most common case for doc IDs.
   
   Your change keeps mixing doc IDs and frequencies. I wonder if we should
write them in separate varint blocks?
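
To make the `flag == 0` fast path concrete, here is a minimal,
self-contained sketch of group-varint coding (my own illustration, not
Lucene's actual implementation): each group of 4 non-negative ints is
preceded by one flag byte holding four 2-bit lengths (byteLength - 1), so a
zero flag proves all four values fit in a single byte each.

```java
import java.util.Arrays;

public class GroupVarint {
  // Encodes 4 non-negative ints into dest; returns the encoded length.
  static int encode(int[] values, byte[] dest) {
    int flags = 0, pos = 1;
    for (int i = 0; i < 4; i++) {
      int v = values[i];
      int bytes = v < (1 << 8) ? 1 : v < (1 << 16) ? 2 : v < (1 << 24) ? 3 : 4;
      flags |= (bytes - 1) << (i * 2); // 2 bits per value
      for (int b = 0; b < bytes; b++) {
        dest[pos++] = (byte) (v >>> (8 * b)); // little-endian payload
      }
    }
    dest[0] = (byte) flags;
    return pos;
  }

  static int[] decode(byte[] src) {
    int[] out = new int[4];
    int flags = src[0] & 0xFF, pos = 1;
    if (flags == 0) { // fast path: all four values are single bytes
      for (int i = 0; i < 4; i++) out[i] = src[pos++] & 0xFF;
      return out;
    }
    for (int i = 0; i < 4; i++) {
      int bytes = ((flags >>> (i * 2)) & 0x3) + 1;
      int v = 0;
      for (int b = 0; b < bytes; b++) v |= (src[pos++] & 0xFF) << (8 * b);
      out[i] = v;
    }
    return out;
  }

  public static void main(String[] args) {
    byte[] buf = new byte[17]; // worst case: 1 flag byte + 4 * 4 bytes
    int[] vals = {3, 200, 70000, 12};
    int len = encode(vals, buf);
    System.out.println(len + " " + Arrays.toString(decode(buf)));
  }
}
```

Compared to plain vints, the decoder branches once per group on the flag
byte instead of once per payload byte.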





Re: [PR] Fix NFAQuery in TestRegexpRandom2 [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on PR #12793:
URL: https://github.com/apache/lucene/pull/12793#issuecomment-1808325688

   > I didn't realize our random searcher will use threadpool randomly, fixed 
it to use a rewrite method that will not do concurrent rewrite
   
   Ahh, sneaky.  Does this mean users must not pass an executor to
`IndexSearcher` if they use NFA regexp query?  Or is it just concurrent
rewrite?  Do we need to update javadocs to make this clear?  Or add checks
that multiple threads don't enter NFA rewrite?





Re: [I] [DISCUSS] Should we change TieredMergePolicy's segment deletion accounting to use numDocs in the denominator rather than MaxDoc? [lucene]

2023-11-13 Thread via GitHub


vigyasharma commented on issue #12792:
URL: https://github.com/apache/lucene/issues/12792#issuecomment-1808328118

   > There will be scenario that developers expect a segment deletion pct to be 
`delCount / (maxDoc-delCount)` and this accounting seems more realistic than 
current accounting.
   
   I do like the simplicity of `delCount / maxDoc`. In a simple example with 
`(maxDoc=100, delCount=30)`, it is easier to reason that merges will claim back 
30% of the docId space, and roughly similar disk space assuming similarly sized 
docs. I can use this as a mental model for `setDeletesPctAllowed()` API. 
   
   Would like to understand the scenarios where people would prefer
`delCount / (maxDoc-delCount)`, i.e. `30/(100-30) = 42.8%`... Perhaps some
examples would help. 
   
   +1 to improving documentation, we could add a line in the 
`setDeletesPctAllowed` docstring.
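
A tiny sketch of the two accountings using the numbers from the example
above (variable names are mine, not TieredMergePolicy's actual fields):

```java
import java.util.Locale;

public class DeletePctExample {
  public static void main(String[] args) {
    int maxDoc = 100, delCount = 30;
    // current proposal: deletions as a share of the whole docId space
    double pctOfMaxDoc = 100.0 * delCount / maxDoc;
    // alternative: deletions relative to the live (non-deleted) docs
    double pctOfLiveDocs = 100.0 * delCount / (maxDoc - delCount);
    System.out.printf(Locale.ROOT, "%.1f %.1f%n", pctOfMaxDoc, pctOfLiveDocs);
  }
}
```

The same 30 deleted docs read as 30% in the first accounting and roughly
42.9% in the second, which is why the docs matter for `setDeletesPctAllowed`.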
   





Re: [PR] Minor change to IndexOrDocValuesQuery#toString [lucene]

2023-11-13 Thread via GitHub


mikemccand merged PR #12791:
URL: https://github.com/apache/lucene/pull/12791





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-13 Thread via GitHub


benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1808340050

   Thank you @kevindrosendahl this does seem to confirm my suspicion that the 
improvement isn't necessarily due to the data structure, but due to 
quantization. But, this does confuse me as Vamana is supposedly good when 
things don't all fit in memory. It may be due to how we fetch things (MMAP). I 
wonder if Vamana is any good at all when using MMAP...
   
   For your testing infra, int8 quantization might close the gap at that
scale. FYI, a significant (and needed) refactor occurred recently for int8
quantization & HNSW, so your testing branch might be significantly impacted
:(.
   
   > recall actually improves when introducing pq, and only starts to decrease 
at a factor of 16
   
   I am surprised it decreases as the number of sub-spaces increases. This
makes me think that JVector's PQ implementation is weird.
   
   Or is `pq=` not the number of subspaces, but `vectorDim/pq == 
numberOfSubSpaces`? If so, then recall should reduce as it increases.
   
   Regardless, is there any oversampling that is occurring when PQ is enabled 
in JVector?
   
   >  It's great that PQ goes quite a ways before hurting recall.
   
   PQ is a sharp tool; at scale it can have significant drawbacks (eventually
you have to start dramatically oversampling as centroids get very noisy).
Though, I am not sure there is a significant recall cliff.
   
   Two significant issues with a Lucene implementation I can think of are:
   
- Segment merge time: We can maybe do some fancy things with better starter 
centroids in Lloyd's algorithm, but I think we will have to rerun Lloyd's 
algorithm on every merge. Additionally the graph building probably cannot be 
done with the PQ'd vectors.
- Graph quality: I don't think we can build the graph with PQ'd vectors and 
retain good recall. Meaning at merge time, we have to page in larger raw (or 
differently quantized) vectors and only do PQ after graph creation.
   
   There are [interesting approaches to PQ w/ graph 
exploration](https://medium.com/@masajiro.iwasaki/fusion-of-graph-based-indexing-and-product-quantization-for-ann-search-7d1f0336d0d0)
 and a different PQ implementation via Microsoft that is worthwhile 
[OPQ](https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/pami13opq.pdf)
   
   
   





Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-13 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1391247632


##
lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java:
##
@@ -337,11 +349,23 @@ public long size() {
 return getPosition();
   }
 
+  /** Similar to {@link #truncate(long)} with newLen=0 but keep the first 
block to reduce GC. */
+  public void reset() {

Review Comment:
   The public API of this class seems to allow appending only, but for the
scratch area we still need to do some backward writing, right?






Re: [PR] Cache buckets to speed up BytesRefHash#sort [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on PR #12784:
URL: https://github.com/apache/lucene/pull/12784#issuecomment-1808361557

   Did we see any bump in nightly benchmarks?  This should make the initial
segment flush faster when there are many terms in an inverted field?





Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-13 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1391172089


##
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##
@@ -120,31 +122,54 @@ public class FSTCompiler {
   final float directAddressingMaxOversizingFactor;
   long directAddressingExpansionCredit;
 
-  final BytesStore bytes;
+  // writer for frozen nodes
+  final FSTDataOutputWriter fstWriter;
+  // reader for the frozen nodes
+  final FSTReader fstReader;

Review Comment:
   ~~That's true, I'll remove it.~~
   
   Actually, the FSTReader was needed to create the FST itself. This reader was 
mainly used for:
   - Read from the FST with `getBytesReader()`
   - Save the FST with `save()`
   
   So even if we track the size ourselves, we still need to have the FSTReader 
when creating the FST object. But I have removed it from the class attributes.






Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on code in PR #240:
URL: https://github.com/apache/lucene/pull/240#discussion_r1391263968


##
lucene/core/src/java/org/apache/lucene/search/TopFieldCollectorManager.java:
##
@@ -0,0 +1,198 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.List;
+
+/**
+ * Create a TopFieldCollectorManager which uses a shared hit counter to 
maintain number of hits and
+ * a shared {@link MaxScoreAccumulator} to propagate the minimum score across 
segments if the
+ * primary sort is by relevancy.
+ *
+ * Note that a new collectorManager should be created for each search due 
to its internal states.
+ */
+public class TopFieldCollectorManager implements 
CollectorManager {
+  private final Sort sort;
+  private final int numHits;
+  private final FieldDoc after;
+  private final HitsThresholdChecker hitsThresholdChecker;
+  private final MaxScoreAccumulator minScoreAcc;
+  private final List collectors;
+  private final boolean supportsConcurrency;
+  private boolean collectorCreated;
+
+  /**
+   * Creates a new {@link TopFieldCollectorManager} from the given arguments.
+   *
+   * NOTE: The instances returned by this method pre-allocate a full 
array of length
+   * numHits.
+   *
+   * @param sort the sort criteria (SortFields).
+   * @param numHits the number of results to collect.
+   * @param after the previous doc after which matching docs will be collected.
+   * @param totalHitsThreshold the number of docs to count accurately. If the 
query matches more
+   * than {@code totalHitsThreshold} hits then its hit count will be a 
lower bound. On the other
+   * hand if the query matches less than or exactly {@code 
totalHitsThreshold} hits then the hit
+   * count of the result will be accurate. {@link Integer#MAX_VALUE} may 
be used to make the hit
+   * count accurate, but this will also make query processing slower.
+   * @param supportsConcurrency to use thread-safe and slower internal states 
for count tracking.
+   */
+  public TopFieldCollectorManager(
+  Sort sort, int numHits, FieldDoc after, int totalHitsThreshold, boolean 
supportsConcurrency) {

Review Comment:
   No need to do anything now -- let's focus on getting this awesome first step 
in :)






Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on code in PR #240:
URL: https://github.com/apache/lucene/pull/240#discussion_r1391264578


##
lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/SearchWithCollectorTask.java:
##
@@ -45,20 +43,6 @@ public boolean withCollector() {
 return true;
   }
 
-  @Override
-  protected Collector createCollector() throws Exception {

Review Comment:
   Yeah +1 to revert this part?






Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-13 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1391272833


##
lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java:
##
@@ -359,7 +383,9 @@ public void truncate(long newLen) {
 assert newLen == getPosition();
   }
 
-  public void finish() {
+  @Override
+  public void freeze() {
+this.frozen = true;

Review Comment:
   I removed the Freezeable as well as the freeze functionality. It was 
previously used in writeTo(DataOutput) but even then it's not really needed.






Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-13 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1391247632


##
lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java:
##
@@ -337,11 +349,23 @@ public long size() {
 return getPosition();
   }
 
+  /** Similar to {@link #truncate(long)} with newLen=0 but keep the first 
block to reduce GC. */
+  public void reset() {

Review Comment:
   The public API of the `ByteBuffersDataOutput` class seems to allow
appending only, but for the scratch area we still need to do some backward
writing, right?






Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-13 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1391285510


##
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##
@@ -120,31 +122,54 @@ public class FSTCompiler {
   final float directAddressingMaxOversizingFactor;
   long directAddressingExpansionCredit;
 
-  final BytesStore bytes;
+  // writer for frozen nodes
+  final FSTDataOutputWriter fstWriter;
+  // reader for the frozen nodes
+  final FSTReader fstReader;
+
+  // buffer to store the scratch bytes before writing to the fstWriter
+  final BytesStore bytes = new BytesStore(DEFAULT_SCRATCH_PAGE_BITS);
 
   /**
* Instantiates an FST/FSA builder with default settings and pruning options 
turned off. For more
* tuning and tweaking, see {@link Builder}.
*/
   // TODO: remove this?  Builder API should be the only entry point?
   public FSTCompiler(FST.INPUT_TYPE inputType, Outputs outputs) {
-this(inputType, 32.0, outputs, true, 15, 1f);
+this(
+inputType,
+32.0,
+outputs,
+true,
+new FSTDataOutputWriter(getLegacyDataOutput(DEFAULT_BLOCK_BITS)),
+1f);
+  }
+
+  static DataOutput getLegacyDataOutput(int blockBits) {
+return new BytesStore(blockBits);
   }
 
   private FSTCompiler(
   FST.INPUT_TYPE inputType,
   double suffixRAMLimitMB,
   Outputs outputs,
   boolean allowFixedLengthArcs,
-  int bytesPageBits,
+  FSTDataOutputWriter fstWriter,
   float directAddressingMaxOversizingFactor) {
 this.allowFixedLengthArcs = allowFixedLengthArcs;
 this.directAddressingMaxOversizingFactor = 
directAddressingMaxOversizingFactor;
-bytes = new BytesStore(bytesPageBits);
 // pad: ensure no node gets address 0 which is reserved to mean
 // the stop state w/ no arcs
-bytes.writeByte((byte) 0);
-fst = new FST<>(new FST.FSTMetadata<>(inputType, null, -1, 
VERSION_CURRENT, 0), outputs, bytes);
+try {
+  fstWriter.writeByte((byte) 0);
+} catch (IOException e) {

Review Comment:
   Ah, if the public ctor is be removed in 
https://github.com/apache/lucene/pull/12715 then we can throw it here. I'll 
wait for that PR to be merged.






Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on code in PR #12786:
URL: https://github.com/apache/lucene/pull/12786#discussion_r1391299752


##
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##
@@ -289,21 +273,38 @@ public long getNodeAddress(long hashSlot) {
 }
 
 /**
- * Set the node address and bytes from the provided hash slot
+ * Set the node address from the provided hash slot
  *
  * @param hashSlot the hash slot to write to
  * @param nodeAddress the node address
- * @param bytes the node bytes to be copied
  */
-public void setNode(long hashSlot, long nodeAddress, byte[] bytes) {
+public void setNodeAddress(long hashSlot, long nodeAddress) {
   assert fstNodeAddress.get(hashSlot) == 0;
   fstNodeAddress.set(hashSlot, nodeAddress);
   count++;
+}
 
+/** copy the node bytes from the FST */
+private void copyNodeBytes(long hashSlot, byte[] bytes) {
+  assert copiedNodeAddress.get(hashSlot) == 0;
   copiedNodes.append(bytes);
   // write the offset, which points to the last byte of the node we copied 
since we later read
   // this node in reverse
+  copiedNodeAddress.set(hashSlot, copiedNodes.getPosition() - 1);
+}
+
+/** promote the node bytes from the fallback table */
+private void copyFallbackNodeBytes(

Review Comment:
   Ooh looks like you added a test case, thanks.






Re: [PR] Cache buckets to speed up BytesRefHash#sort [lucene]

2023-11-13 Thread via GitHub


gf2121 commented on PR #12784:
URL: https://github.com/apache/lucene/pull/12784#issuecomment-1808443275

   Thanks for chiming in, @mikemccand!
   
   > Did we see any bump in nightly benchmarks?
   
   I would expect this change is more likely to bring some improvements when 
flushing high-cardinality `StringField`s, or in other places that take 
advantage of `BytesRefHash#sort`, like 
[`DeletedTerms`](https://github.com/apache/lucene/blob/be27303e3ace53b8c5be9a480d5ad7d2c609a28f/lucene/core/src/java/org/apache/lucene/index/BufferedUpdates.java#L180).
 
   
   However, I had not expected this change to bring a bump in nightly 
benchmarks - if you look at the [CPU profile of the nightly 
indexing](https://blunders.io/jfr-demo/indexing-4kb-2023.11.12.18.03.28/cpu-samples-drill-down),
 you may find that most of the CPU is used to tokenize / deduplicate terms 
before flushing. After deduplication the term count is reduced, and the sort is 
not the bottleneck of indexing speed (sorting only uses ~1% of CPU).
   
   A small difference can be seen in the flame graph: `reorder` is no longer 
there. But the proportion of sorting overhead is too low to make a noticeable 
E2E difference in nightly indexing.
   
   **Before this patch**
   
![image](https://github.com/apache/lucene/assets/52390227/b4e21371-9760-465b-9593-f7066c9bc7fd)
   
   **After this patch**
   
![image](https://github.com/apache/lucene/assets/52390227/23f48ac2-45b2-4275-bd96-3fd8569a2707)





Re: [PR] javadocs cleanup in Lucene99PostingsFormat [lucene]

2023-11-13 Thread via GitHub


mikemccand merged PR #12776:
URL: https://github.com/apache/lucene/pull/12776





Re: [PR] Use `instanceof` pattern-matching where possible [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on code in PR #12295:
URL: https://github.com/apache/lucene/pull/12295#discussion_r1391341171


##
lucene/analysis/morfologik/src/java/org/apache/lucene/analysis/morfologik/MorphosyntacticTagsAttributeImpl.java:
##
@@ -52,8 +52,8 @@ public void clear() {
 
   @Override
   public boolean equals(Object other) {
-if (other instanceof MorphosyntacticTagsAttribute) {
-  return equal(this.getTags(), ((MorphosyntacticTagsAttribute) 
other).getTags());
+if (other instanceof MorphosyntacticTagsAttribute 
morphosyntacticTagsAttribute) {

Review Comment:
   I guess we keep this `equals` since it was already using `instanceof`.






Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-13 Thread via GitHub


jbellis commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1808489834

   > recall actually improves when introducing pq, and only starts to decrease 
at a factor of 16
   
   I would guess that either there is a bug or you happen to be testing with a 
really unusual dataset.  PQ is fundamentally a lossy compression and can't 
magically create similarity that didn't exist in the original.
   
   > Regardless, is there any oversampling that is occurring when PQ is enabled 
in JVector?
   
   Today it's up to the caller (so on the Cassandra side, in Astra) but it's 
possible that it should move into JVector.
   
   > Additionally the graph building probably cannot be done with the PQ'd 
vectors.
   
   I suppose it's not impossible that you could compress first and then build 
but I have not seen anyone do it yet.  JVector follows DiskANN's lead and 
builds the graph using uncompressed vectors. 





Re: [PR] Fix NFAQuery in TestRegexpRandom2 [lucene]

2023-11-13 Thread via GitHub


zhaih commented on PR #12793:
URL: https://github.com/apache/lucene/pull/12793#issuecomment-1808581207

   @mikemccand It's just the concurrent rewrite; I already have a javadoc 
warning about this: 
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/RegexpQuery.java#L162
   
   But I'm thinking about adding a new API to `RewriteMethod` to indicate 
whether it is a concurrent rewrite or not (when the searcher has an executor), 
and then we can use that, combined with `IndexSearcher`, to throw an 
`IllegalArgumentException`?
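
A sketch of what such an API could look like - purely a hypothetical shape for the proposal above, using stub types rather than the real Lucene `RewriteMethod` and `IndexSearcher` classes:

```java
// Hypothetical shape of the proposed check, with stub types standing in for
// Lucene's RewriteMethod and IndexSearcher.
public class ConcurrentRewriteCheck {
  interface RewriteMethod {
    // Proposed new hook: can this rewrite method run on the searcher's
    // executor?
    default boolean supportsConcurrentRewrite() {
      return false;
    }
  }

  static class Searcher {
    final boolean hasExecutor;

    Searcher(boolean hasExecutor) {
      this.hasExecutor = hasExecutor;
    }

    void rewrite(RewriteMethod method) {
      if (hasExecutor && method.supportsConcurrentRewrite() == false) {
        throw new IllegalArgumentException(
            "this rewrite method does not support concurrent rewrite");
      }
      // ... perform the (possibly concurrent) rewrite ...
    }
  }

  // Returns true if the searcher rejected the rewrite method.
  static boolean rejects(boolean hasExecutor, boolean concurrent) {
    try {
      new Searcher(hasExecutor).rewrite(new RewriteMethod() {
        @Override
        public boolean supportsConcurrentRewrite() {
          return concurrent;
        }
      });
      return false;
    } catch (IllegalArgumentException e) {
      return true;
    }
  }
}
```

Whether the check belongs in the searcher or in the query's `rewrite` is an open design choice; this only illustrates the flag-plus-validation idea.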





Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on PR #12786:
URL: https://github.com/apache/lucene/pull/12786#issuecomment-1808651797

   `Test2BFST` is happy:
   
   ```
   The slowest tests (exceeding 500 ms) during this run:

 
 3752.61s Test2BFST.test (:lucene:core) 

 
   The slowest suites (exceeding 1s) during this run:   

 
 3752.74s Test2BFST (:lucene:core)  

 


 
   BUILD SUCCESSFUL in 1h 2m 51s

 
   246 actionable tasks: 158 executed, 88 up-to-date

 
   ```
   
   I'll merge to main, later to backport to 9.x.





Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]

2023-11-13 Thread via GitHub


mikemccand merged PR #12786:
URL: https://github.com/apache/lucene/pull/12786





[PR] Fix large 99 percentile match latency in Monitor during concurrent commits and purges [lucene]

2023-11-13 Thread via GitHub


daviscook477 opened a new pull request, #12801:
URL: https://github.com/apache/lucene/pull/12801

   ### Description
   Within Lucene Monitor, there is a thread contention issue that manifests as 
multi-second latencies in the `Monitor.match` function when it is called 
concurrently with `Monitor.register/Monitor.deleteById` and 
`QueryIndex.purgeCache`. The source of the issue is the usage of `purgeLock`, a 
ReentrantReadWriteLock.
   
   Within the QueryIndex, there are 3 usages of the `purgeLock`. They are 
`commit` (called by `Monitor.register`/`Monitor.deleteById`), `search` (called 
by `Monitor.match`), and `purgeCache` (called by a background thread at a 
consistent 
interval). Within `commit` and `search`, the read lock is acquired, and within 
`purgeCache`, the write lock is acquired. `search` calls are very fast and only 
hold the `purgeLock` for short durations. However, `commit` calls are quite 
long and will hold the `purgeLock` for multi-second durations. Ostensibly 
because both `commit` and `search` only need the read lock, they should be able 
to execute concurrently and the long duration of `commit` should not impact 
`search`. However, when `purgeCache` attempts to acquire the write lock this no 
longer holds true. This is because once `purgeCache` is waiting to acquire the 
write lock, attempts to acquire the read lock 
[probabilistically](https://github.com/openjdk/jdk17u-dev/blob/2486ba2899f1e51418fa73f6216b5224b4a348c5/src/java.base/share/classes/java/util/concurrent/locks/ReentrantReadWriteLock.java#L686)
 might wait for the write lock to be acquired + released (otherwise attempts to 
acquire the write lock would be starved). What this means is that if `commit` 
is holding the read lock, and then `purgeCache` attempts to acquire the write 
lock, some amount of calls to `match` that want to acquire the read lock will 
be queued behind the `purgeCache` attempt to acquire the write lock. Because 
`purgeCache` cannot acquire the write lock until `commit` releases the read 
lock, and because those `search` calls cannot acquire the read lock until 
`purgeCache` acquires + releases the write lock, those `search` calls end up 
having to wait for `commit` to complete. This makes it possible for `search` 
calls to take as long as `commit` to complete (multi-second rather than 
millisecond).
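
   The queueing behavior described above can be reproduced in isolation with a 
plain `ReentrantReadWriteLock`, independent of Monitor (a minimal sketch; the 
thread names mirror the roles above). While a long "commit" holds the read 
lock and a "purge" is parked waiting for the write lock, a new "search" 
reader's timed `tryLock` does not barge past the queued writer:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Demonstrates that a new reader queues behind a waiting writer even though
// the lock is currently held only by another reader (the long "commit").
public class QueuedWriterDemo {
  public static boolean newReaderBlocksBehindQueuedWriter() throws InterruptedException {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    CountDownLatch commitHoldsReadLock = new CountDownLatch(1);

    Thread commit = new Thread(() -> {
      lock.readLock().lock();   // long-running commit holds the read lock
      try {
        commitHoldsReadLock.countDown();
        Thread.sleep(1000);     // simulate a long commit
      } catch (InterruptedException ignored) {
      } finally {
        lock.readLock().unlock();
      }
    });
    commit.start();
    commitHoldsReadLock.await();

    Thread purge = new Thread(() -> {
      lock.writeLock().lock();  // parks until the commit releases the read lock
      lock.writeLock().unlock();
    });
    purge.start();
    Thread.sleep(200);          // let the purge thread enqueue for the write lock

    // A "search" arriving now cannot get the read lock within 100ms, even
    // though the lock is only held by another *reader*: the timed tryLock
    // honors the queued writer at the head of the wait queue.
    boolean searchGotLock = lock.readLock().tryLock(100, TimeUnit.MILLISECONDS);
    if (searchGotLock) {
      lock.readLock().unlock();
    }
    commit.join();
    purge.join();
    return !searchGotLock;
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("search blocked: " + newReaderBlocksBehindQueuedWriter());
  }
}
```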
   
   ### Diagrams
   For illustration purposes I have the following 2 diagrams of the problem. 
The first diagram demonstrates the wait order for the `ReentrantReadWriteLock` 
in the high latency match scenario. For the purpose of the diagram I am 
assuming that `commit` takes 3 seconds, `purgeCache` takes 3 milliseconds, and 
`search` takes 3 milliseconds as well. Most of the calls to `search` will not 
end up in this scenario, but without the fix included in this PR *some* of the 
calls to `search` will end up in this scenario and be slow.
   
![image](https://github.com/apache/lucene/assets/10566967/e05fe134-2ea2-4f05-8319-64773ae4feff)
   The way in which the calls to `search` end up being slow is more clearly 
shown in the second diagram, which shows a timeline with a 
multi-second-delayed `search` call - making the same assumptions about the 
runtime of the 3 
operations as the first diagram. In the below diagram, `Search Thread A` avoids 
the problem and has expected latency, but `Search Thread B` runs into the 
problem and has very high latency.
   
![image](https://github.com/apache/lucene/assets/10566967/7c462aa3-18fc-4183-a2d8-d98d8341f489)
   
   ### Solution
   This issue can be resolved by ensuring that `purgeCache` never attempts to 
acquire the write lock when `commit` is holding the read lock. This can be done 
by simply using a mutual exclusion lock between the two since:
   1. `purgeCache` is not time-sensitive so it can afford to wait until 
`commit` completes.
   2. `commit` is arguably time sensitive but `purgeCache` is a very quick 
operation so sometimes having to wait on it won’t make a meaningful difference 
in the time it takes for `commit` to complete by itself.
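
   A minimal sketch of the locking scheme described above (field names 
hypothetical; the real change lives in `QueryIndex`): `commit` and 
`purgeCache` are made mutually exclusive via a separate lock, so `purgeCache` 
never queues for the write lock while `commit` holds the read lock, and 
`search` only ever contends with the short `purgeCache` itself:

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the fix: a dedicated mutex serializes commit and purgeCache,
// so the write lock is never requested while a commit holds the read lock.
public class QueryIndexLocking {
  private final ReentrantReadWriteLock purgeLock = new ReentrantReadWriteLock();
  private final ReentrantLock commitPurgeMutex = new ReentrantLock();
  private int commits = 0;
  private int purges = 0;

  public void commit() {
    commitPurgeMutex.lock();
    try {
      purgeLock.readLock().lock();
      try {
        commits++; // long-running commit work goes here
      } finally {
        purgeLock.readLock().unlock();
      }
    } finally {
      commitPurgeMutex.unlock();
    }
  }

  public void purgeCache() {
    commitPurgeMutex.lock(); // waits for any in-flight commit first
    try {
      purgeLock.writeLock().lock();
      try {
        purges++; // quick cache purge
      } finally {
        purgeLock.writeLock().unlock();
      }
    } finally {
      commitPurgeMutex.unlock();
    }
  }

  public int search() {
    purgeLock.readLock().lock(); // only ever waits on the short purge
    try {
      return commits + purges;
    } finally {
      purgeLock.readLock().unlock();
    }
  }
}
```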
   
   ### Test Case / Demo
   To demonstrate the problem I have this 
[gist](https://gist.github.com/daviscook477/a1b1fbebe75b0fdc41c889e542a2ed42) 
that runs `commit` and `search` concurrently while recording the durations of 
the calls to `search`. Then it runs `commit`, `search` and `purgeCache` 
concurrently while recording the durations of `search`. Comparing the p99 - 
p100 durations between when `purgeCache` wasn’t run concurrently vs when it was 
run concurrently shows the high latency behavior for combining all 3 operations.
   
   Without this change, running the gist on my laptop gives:
   1. No `purgeCache` - mean of p99 thru p100 is 20ms
   2. Yes `purgeCache` - mean of p99 thru p100 is 651ms
   
   With this change, running the gist on my laptop gives:
   1. No `purgeCache` - mean of p99 thru p100 is 16ms
   2. Yes `purgeCache` - mean of p99 thru p100 is 9ms
   
   Even though the latencies in this tes

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-13 Thread via GitHub


kevindrosendahl commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1808717985

   @benwtrent 
   > Thank you @kevindrosendahl this does seem to confirm my suspicion that the 
improvement isn't necessarily due to the data structure, but due to 
quantization. But, this does confuse me as Vamana is supposedly good when 
things don't all fit in memory. It may be due to how we fetch things (MMAP). I 
wonder if Vamana is any good at all when using MMAP...
   
   Yeah, upon reflection the result makes sense. DiskANN only uses the 
in-memory PQ vectors for the distance calculations for deciding which new 
candidates it should consider visiting in the future. This is the critical 
piece which reduces I/O. The in-graph vectors mean that when you have decided 
the next candidate, you get the full fidelity vector for free on the I/O to get 
the adjacency list, but at that point you've already done the distance 
calculation against that candidate. If you were to grab the full fidelity 
vector from the graph for the distance calculation like the non-PQ vamana 
implementations do, you've moved the I/O complexity back from 
`O(candidates_explored)` to `O(nodes_considered)`, and probably need to fetch 
the page again when you've decided to actually consider a candidate to re-grab 
its adjacency list.
   
   What that full fidelity vector is useful for is re-scoring, so you can just 
keep that full fidelity vector in memory until the end of the search ([jvector 
reference](https://github.com/jbellis/jvector/blob/main/jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphSearcher.java#L218-L220))
 and re-sort the final result list using their true distances ([jvector 
reference](https://github.com/jbellis/jvector/blob/main/jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphSearcher.java#L258C6-L259)).
   
   I'm not sure how impactful this re-ranking is, hoping to A/B it when I get 
PQ working, but assuming it's impactful, the index structure could save up to 
`numCandidates` I/Os. In this case jvector was only doing ~70 I/Os, so adding 
up to 100 on top could be a pretty big deal.
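
   The candidate-selection / re-ranking split can be sketched with a toy 
quantizer (rounding components to whole numbers stands in for PQ; all names 
are hypothetical, not the jvector or Lucene API): lossy distances pick the 
candidates, and the full-fidelity vectors retained along the way re-sort the 
final list:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy sketch: lossy (quantized) distances choose candidates, exact distances
// rerank the shortlist at the end of the search.
public class RerankSketch {
  // Stand-in for PQ: round each component to the nearest whole number.
  static float[] quantize(float[] v) {
    float[] q = new float[v.length];
    for (int i = 0; i < v.length; i++) {
      q[i] = Math.round(v[i]);
    }
    return q;
  }

  static float squaredDistance(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  static int[] searchAndRerank(float[][] vectors, float[] query, int topK) {
    // Phase 1: score every candidate with the lossy quantized distance.
    float[][] quantized = new float[vectors.length][];
    List<Integer> candidates = new ArrayList<>();
    for (int i = 0; i < vectors.length; i++) {
      quantized[i] = quantize(vectors[i]);
      candidates.add(i);
    }
    candidates.sort(Comparator.comparingDouble(i -> squaredDistance(quantized[i], query)));

    // Phase 2: rerank a short candidate list with the exact vectors that were
    // kept in memory while exploring, then return the true top-K.
    List<Integer> shortlist =
        new ArrayList<>(candidates.subList(0, Math.min(topK * 2, candidates.size())));
    shortlist.sort(Comparator.comparingDouble(i -> squaredDistance(vectors[i], query)));

    int[] result = new int[Math.min(topK, shortlist.size())];
    for (int i = 0; i < result.length; i++) {
      result[i] = shortlist.get(i);
    }
    return result;
  }
}
```

In the test below the quantized distances tie, so only the exact rerank recovers the true nearest neighbor - the effect the A/B experiment above would measure at scale.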
   
   The main thing MMAP may prevent us from "easily" doing is parallelizing the 
I/O, which I believe the DiskANN paper does but jvector does not currently 
(possibly due to the results looking pretty decent without it, and it being 
hard with mmap/Java).
   
   > Or is `pq=` not the number of subspaces, but `vectorDim/pq == 
numberOfSubSpaces`? If so, then recall should reduce as it increases.
   
   This is correct, `pq` in the tables relates to the `pqFactor` in JVector, 
which is as you've described. Agreed with you and @jbellis that the observed 
result is bizarre.
   
   > For your testing infra, int8 quantization might close the gap at that 
scale.
   
   I ran some tests with int8 quantization, nothing very surprising in the 
results. If you can get the full index to fit in memory it performs great, but 
performance degrades significantly, in proportion to I/O, once it falls out of 
memory. The scalar quant vectors were 7.3 GB with the HNSW graph being 329 MB. For 
reference, jvector `pqFactor=8` was 32.45 GB graph with vectors + 0.97 GB 
quantized vectors.
   | index configuration | memory configuration | average recall | average latency | average page faults |
   |-|-|-|-|-|
   | hnsw scalar quant | fully in memory | 0.79819 | 0.001259s | 0 |
   | jvector `pqFactor=8` | fully in memory | 0.9121 | 0.00148s | 0 |
   | hnsw scalar quant | 10GB system 2GB heap | 0.79819 | 0.00119s | 0 |
   | hnsw scalar quant |  8GB system 4GB heap | 0.79819 | 0.856s | 606.98 |
   | jvector `pqFactor=8` | 4GB system 2GB heap | 0.9121 | 0.151s | 80.15 |
   
   So fundamentally with these graph structures, the key is to have the vectors 
needed for distance calculations of potential neighbor candidates in memory. 
Scalar quantization helps the cause here by reducing the size of the vectors 
by a factor of 4. PQ can help even more by reducing the memory impact even 
more drastically. 
JVector further ensures that the quantized vectors are in memory by storing 
them on the heap.
   
   > FYI, as significant (and needed) refactor occurred recently for int8 
quantization & HNSW, so your testing branch might be significantly impacted :(.
   
   Is there any expected performance difference, or is it mainly 
organizational? The code in my branch is not in a state I would be comfortable 
suggesting for upstreaming, so if it's "just" a problem for that, I'm ok to 
keep my branch a bit outdated, and we could make any suitable ideas fit in 
if/when we thought upstreaming could be worthwhile.
   
   
   > Two significant issues with a Lucene implementation I can think of are:
   > 
   > * Segment merge time: We can maybe do some fancy things with better 
starter centroids in Lloyd's algorithm, but I think we will have to rerun 
Lloyd's algorithm on every merge. Additionally the graph building probably 
cannot be done wi

Re: [PR] CheckIndex - Making -fast the default behaviour [lucene]

2023-11-13 Thread via GitHub


slow-J commented on code in PR #12797:
URL: https://github.com/apache/lucene/pull/12797#discussion_r1391501828


##
lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:
##
@@ -3974,7 +3974,7 @@ public static class Options {
 boolean doExorcise = false;
 boolean doSlowChecks = false;
 boolean verbose = false;
-boolean doChecksumsOnly = false;
+boolean doChecksumsOnly = true;

Review Comment:
   Thanks for the suggestions Robert!
   I should have some time in the week to implement them.






Re: [PR] Random access term dictionary [lucene]

2023-11-13 Thread via GitHub


Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1391564228


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> 
(type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+  private static long MAX_ORD = (1L << 60) - 1;
+
+  private final long[] countPerType = new long[TermType.NUM_TOTAL_TYPES];
+  private final FSTCompiler fstCompiler =
+  new FSTCompiler<>(FST.INPUT_TYPE.BYTE1, 
PositiveIntOutputs.getSingleton());
+
+  TermsIndexBuilder() {
+Arrays.fill(countPerType, -1);
+  }
+
+  public void addTerm(BytesRef term, TermType termType) throws IOException {
+IntsRefBuilder scratchInts = new IntsRefBuilder();
+long ord = ++countPerType[termType.getId()];
+fstCompiler.add(Util.toIntsRef(term, scratchInts), encode(ord, termType));
+  }
+
+  public TermsIndex build() throws IOException {
+return new TermsIndex(fstCompiler.compile());
+  }
+
+  private long encode(long ord, TermType termType) {
+// use a single long to encode `ord` and `termType`
+// also consider the special value of `PositiveIntOutputs.NO_OUTPUT == 0`
+// so it looks like this |...  ord ...| termType| ... hasOutput  ...|
+// where termType takes 3 bit and hasOutput takes the lowest bit. The rest 
is taken by ord
+if (ord < 0) {
+  throw new IllegalArgumentException("can't encode negative ord");
+}
+if (ord > MAX_ORD) {
+  throw new IllegalArgumentException(
+  "Input ord is too large for TermType: "

Review Comment:
   I see. That's a good point. I'll keep that in mind when implementing the 
main FieldsConsumer!






Re: [PR] Random access term dictionary [lucene]

2023-11-13 Thread via GitHub


Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1391566961


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/TermStateCodec.java:
##
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene99.randomaccess;
+
+import 
org.apache.lucene.codecs.lucene99.Lucene99PostingsFormat.IntBlockTermState;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitPacker;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitUnpacker;
+import org.apache.lucene.util.BytesRef;
+
+interface TermStateCodec {

Review Comment:
   I know I started off calling it `TermStateEncoder` but I soon realized 
it needs decoding, too. 
   
   I'm open to change its names :) 
   
   One good thing is that this is purely an implementation detail which need 
not be exposed outside this package.






Re: [PR] Random access term dictionary [lucene]

2023-11-13 Thread via GitHub


Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1391575474


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene99/randomaccess/TermStateCodecImpl.java:
##
@@ -0,0 +1,234 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene99.randomaccess;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import 
org.apache.lucene.codecs.lucene99.Lucene99PostingsFormat.IntBlockTermState;
+import org.apache.lucene.index.IndexOptions;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.DocFreq;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.DocStartFP;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.LastPositionBlockOffset;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.PayloadStartFP;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.PositionStartFP;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.SingletonDocId;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.SkipOffset;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.TermStateCodecComponent.TotalTermFreq;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitPacker;
+import 
org.apache.lucene.sandbox.codecs.lucene99.randomaccess.bitpacking.BitUnpacker;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.store.ByteArrayDataOutput;
+import org.apache.lucene.util.BytesRef;
+
+final class TermStateCodecImpl implements TermStateCodec {

Review Comment:
   Probably not -- I started off with an interface (perhaps because I was 
influenced by developing with Rust) since it emphasizes exactly what is 
needed. It also makes mocks easier (though I didn't need that).
   
   Let's revisit this later.






Re: [PR] Use `instanceof` pattern-matching where possible [lucene]

2023-11-13 Thread via GitHub


uschindler commented on code in PR #12295:
URL: https://github.com/apache/lucene/pull/12295#discussion_r1391562831


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90PointsWriter.java:
##
@@ -182,7 +181,7 @@ public void merge(MergeState mergeState) throws IOException 
{
  * a bulk merge of the points.
  */
 for (PointsReader reader : mergeState.pointsReaders) {
-  if (reader instanceof Lucene90PointsReader == false) {
+  if (!(reader instanceof Lucene90PointsReader)) {

Review Comment:
   This change is not needed. Some people here prefer the `== false` so let's 
keep it at least here.



##
lucene/codecs/src/java/org/apache/lucene/codecs/memory/DirectPostingsFormat.java:
##
@@ -1470,20 +1475,26 @@ public PostingsEnum postings(PostingsEnum reuse, int 
flags) {
 
 // TODO: the logic of which enum impl to choose should be refactored 
to be simpler...
 if (hasPos && PostingsEnum.featureRequested(flags, 
PostingsEnum.POSITIONS)) {
-  if (terms[termOrd] instanceof LowFreqTerm) {
-final LowFreqTerm term = ((LowFreqTerm) terms[termOrd]);
-final int[] postings = term.postings;
-final byte[] payloads = term.payloads;
+  var term = terms[termOrd];

Review Comment:
   I would revert this whole cleanup here. The code was much more readable 
before. The new syntax is fine if we only have one branch using the auto-casted 
instance. Here we have if/else now using different syntax, so please revert!
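
   For what it's worth, a minimal sketch of how both branches could bind 
pattern variables consistently (the stand-in types here are illustrative 
records, not the actual `DirectPostingsFormat` classes):

   ```java
   // Sketch only: LowFreqTerm/HighFreqTerm stand-ins are minimal records,
   // not Lucene's real classes. Both branches use an instanceof pattern
   // variable, avoiding the mixed var/explicit-cast style objected to above.
   public class PatternBranchSketch {
     sealed interface Term permits LowFreqTerm, HighFreqTerm {}

     record LowFreqTerm(int[] postings) implements Term {}

     record HighFreqTerm(int[] docIDs) implements Term {}

     static int firstEntry(Term term) {
       if (term instanceof LowFreqTerm low) {
         return low.postings()[0]; // pattern variable, no explicit cast
       } else if (term instanceof HighFreqTerm high) {
         return high.docIDs()[0];
       }
       throw new IllegalArgumentException("unknown term type");
     }
   }
   ```

   Whether this reads better than the explicit casts is of course the point 
under debate here.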



##
lucene/core/src/java/org/apache/lucene/util/RamUsageEstimator.java:
##
@@ -448,38 +448,38 @@ private static long sizeOfObject(Object o, int depth, 
long defSize) {
   return 0;
 }
 long size;
-if (o instanceof Accountable) {
-  size = ((Accountable) o).ramBytesUsed();
-} else if (o instanceof String) {
-  size = sizeOf((String) o);
-} else if (o instanceof boolean[]) {
-  size = sizeOf((boolean[]) o);
-} else if (o instanceof byte[]) {
-  size = sizeOf((byte[]) o);
-} else if (o instanceof char[]) {
-  size = sizeOf((char[]) o);
-} else if (o instanceof double[]) {
-  size = sizeOf((double[]) o);
-} else if (o instanceof float[]) {
-  size = sizeOf((float[]) o);
-} else if (o instanceof int[]) {
-  size = sizeOf((int[]) o);
-} else if (o instanceof Integer) {
-  size = sizeOf((Integer) o);
-} else if (o instanceof Long) {
-  size = sizeOf((Long) o);
-} else if (o instanceof long[]) {
-  size = sizeOf((long[]) o);
-} else if (o instanceof short[]) {
-  size = sizeOf((short[]) o);
-} else if (o instanceof String[]) {
-  size = sizeOf((String[]) o);
-} else if (o instanceof Query) {
-  size = sizeOf((Query) o, defSize);
-} else if (o instanceof Map) {
-  size = sizeOfMap((Map) o, ++depth, defSize);
-} else if (o instanceof Collection) {
-  size = sizeOfCollection((Collection) o, ++depth, defSize);
+if (o instanceof Accountable accountable) {

Review Comment:
   Rewrite this to pattern-matching (new) switch statement.
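
   A minimal sketch of what that could look like (Java 21 pattern-matching 
switch; the size values below are rough placeholders, not 
`RamUsageEstimator`'s real accounting):

   ```java
   // Illustrative only: replace the long if/else instanceof chain with a
   // pattern-matching switch. Sizes are placeholders for the sketch.
   public class SizeOfSketch {
     static long sizeOfObject(Object o) {
       return switch (o) {
         case null -> 0L;                  // null handled explicitly, no NPE
         case String s -> 2L * s.length(); // rough: 2 bytes per char
         case byte[] a -> a.length;
         case int[] a -> 4L * a.length;
         case long[] a -> 8L * a.length;
         default -> 16L;                   // fallback default object size
       };
     }
   }
   ```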



##
lucene/codecs/src/java/org/apache/lucene/codecs/memory/DirectPostingsFormat.java:
##
@@ -1470,20 +1475,26 @@ public PostingsEnum postings(PostingsEnum reuse, int 
flags) {
 
 // TODO: the logic of which enum impl to choose should be refactored 
to be simpler...
 if (hasPos && PostingsEnum.featureRequested(flags, 
PostingsEnum.POSITIONS)) {
-  if (terms[termOrd] instanceof LowFreqTerm) {
-final LowFreqTerm term = ((LowFreqTerm) terms[termOrd]);
-final int[] postings = term.postings;
-final byte[] payloads = term.payloads;
+  var term = terms[termOrd];
+  if (term instanceof LowFreqTerm) {
+final LowFreqTerm lowFreqTerm = (LowFreqTerm) terms[termOrd];
+final int[] postings = lowFreqTerm.postings;
+final byte[] payloads = lowFreqTerm.payloads;
 return new LowFreqPostingsEnum(hasOffsets, 
hasPayloads).reset(postings, payloads);
   } else {
-final HighFreqTerm term = (HighFreqTerm) terms[termOrd];
+final HighFreqTerm highFreqTerm = (HighFreqTerm) terms[termOrd];
 return new HighFreqPostingsEnum(hasOffsets)
-.reset(term.docIDs, term.freqs, term.positions, term.payloads);
+.reset(
+highFreqTerm.docIDs,
+highFreqTerm.freqs,
+highFreqTerm.positions,
+highFreqTerm.payloads);
   }
 }
 
-if (terms[termOrd] instanceof LowFreqTerm) {
-  final int[] postings = ((LowFreqTerm) terms[termOrd]).postings;
+var term = terms[termOrd];

Review Comment:
   same here.



##
lucene/core/src/java/org/apache/lucene/util/IOUtils.java:
##
@@ -390,16 +390,16 @@ public static Error rethrowAlways(Th

Re: [PR] Use `instanceof` pattern-matching where possible [lucene]

2023-11-13 Thread via GitHub


uschindler commented on PR #12295:
URL: https://github.com/apache/lucene/pull/12295#issuecomment-1808887454

   P.S.: When pushing more changes, please **do not squash and force-push**! 
This makes reviewing difficult. We squash later when merging; there is no 
need to do this while developing a PR.
   
   When you add separate commits I can verify what you have changed; when it's 
squashed I have to do all the work again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix large 99 percentile match latency in Monitor during concurrent commits and purges [lucene]

2023-11-13 Thread via GitHub


daviscook477 commented on PR #12801:
URL: https://github.com/apache/lucene/pull/12801#issuecomment-1809121843

   Thank you for taking a look!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Always collect sparsely in TaxonomyFacets & switch to dense if there are enough unique labels [lucene]

2023-11-13 Thread via GitHub


gautamworah96 commented on issue #12576:
URL: https://github.com/apache/lucene/issues/12576#issuecomment-1809159446

   I am not actively working on this problem as of now. 
   I am still in the process of figuring out what would be the correct thing to 
test/do here as a first step. Jotting down some thoughts that I had.
   
   1. We could try the `Collect into a sparse structure at first, and if it 
gets too big, switch to dense.` experiment. It adds a cost to the `setValue` 
call when it suddenly has to switch to a dense structure. But that should be 
fine? It is all within the same query, so no cases of `a single query in a 
sea of queries suddenly being too slow`. What heuristic should we use to 
decide when to switch?
   2. Do we need to make the `setValue` operation thread-safe? How will that 
work with the dynamic switching?
   
   `luceneutil` needs to be hacked to record JFR stats for determining if this 
change improves the heap allocation or not.
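
   A minimal sketch of the sparse-then-dense idea in point 1 (illustrative 
only; the class, the ~10% threshold, and the boxing-heavy `HashMap` are 
assumptions for clarity, not the actual `TaxonomyFacets` code):

   ```java
   import java.util.HashMap;
   import java.util.Map;

   // Sketch: count facet ordinals sparsely at first, then switch to a dense
   // int[] once enough unique ordinals have been seen. Threshold is a guess.
   final class SparseToDenseCounts {
     private final int maxOrd;
     private final int threshold;
     private Map<Integer, Integer> sparse = new HashMap<>();
     private int[] dense; // null while still in sparse mode

     SparseToDenseCounts(int maxOrd) {
       this.maxOrd = maxOrd;
       this.threshold = Math.max(1, maxOrd / 10); // densify at ~10% occupancy
     }

     void increment(int ord) {
       if (dense != null) {
         dense[ord]++;
         return;
       }
       sparse.merge(ord, 1, Integer::sum);
       if (sparse.size() > threshold) { // too many unique ords: go dense
         dense = new int[maxOrd];
         sparse.forEach((k, v) -> dense[k] = v);
         sparse = null;
       }
     }

     int get(int ord) {
       return dense != null ? dense[ord] : sparse.getOrDefault(ord, 0);
     }
   }
   ```

   The switch cost is a one-time copy per query, which fits the "all within 
the same query" observation in point 1.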


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-13 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1384518777


##
lucene/core/src/java/org/apache/lucene/util/fst/FSTWriter.java:
##
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.fst;
+
+import java.io.IOException;
+
+/** Abstract class which provides low-level functionality to write to a FST */
+public interface FSTWriter {
+
+  /**

Review Comment:
   These 2 methods are the same as in DataOutput, but DataOutput is a class, 
which we want to avoid extending.



##
lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java:
##
@@ -26,7 +27,8 @@
 // TODO: merge with PagedBytes, except PagedBytes doesn't
 // let you read while writing which FST needs
 
-class BytesStore extends DataOutput implements FSTReader {
+// TODO: Separate the scratch writer and reader functionality
+class BytesStore extends DataOutput implements FSTReader, Closeable {

Review Comment:
   I was thinking of changing this to something like Freezable or Finishable



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-13 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1391272833


##
lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java:
##
@@ -359,7 +383,9 @@ public void truncate(long newLen) {
 assert newLen == getPosition();
   }
 
-  public void finish() {
+  @Override
+  public void freeze() {
+this.frozen = true;

Review Comment:
   I removed the Freezeable as well as the freeze functionality. It was 
previously used in writeTo(DataOutput) but even then it's not really needed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Remove FST constructors with DataInput for metadata [lucene]

2023-11-13 Thread via GitHub


dungba88 opened a new pull request, #12803:
URL: https://github.com/apache/lucene/pull/12803

   ### Description
   
   Remove FST constructors with DataInput for metadata, in favor of the new 
constructor with FSTMetadata


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-13 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1391183424


##
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##
@@ -120,31 +122,54 @@ public class FSTCompiler {
   final float directAddressingMaxOversizingFactor;
   long directAddressingExpansionCredit;
 
-  final BytesStore bytes;
+  // writer for frozen nodes
+  final FSTDataOutputWriter fstWriter;
+  // reader for the frozen nodes
+  final FSTReader fstReader;
+
+  // buffer to store the scratch bytes before writing to the fstWriter
+  final BytesStore bytes = new BytesStore(DEFAULT_SCRATCH_PAGE_BITS);
 
   /**
* Instantiates an FST/FSA builder with default settings and pruning options 
turned off. For more
* tuning and tweaking, see {@link Builder}.
*/
   // TODO: remove this?  Builder API should be the only entry point?
   public FSTCompiler(FST.INPUT_TYPE inputType, Outputs outputs) {
-this(inputType, 32.0, outputs, true, 15, 1f);
+this(
+inputType,
+32.0,
+outputs,
+true,
+new FSTDataOutputWriter(getLegacyDataOutput(DEFAULT_BLOCK_BITS)),
+1f);
+  }
+
+  static DataOutput getLegacyDataOutput(int blockBits) {
+return new BytesStore(blockBits);
   }
 
   private FSTCompiler(
   FST.INPUT_TYPE inputType,
   double suffixRAMLimitMB,
   Outputs outputs,
   boolean allowFixedLengthArcs,
-  int bytesPageBits,
+  FSTDataOutputWriter fstWriter,
   float directAddressingMaxOversizingFactor) {
 this.allowFixedLengthArcs = allowFixedLengthArcs;
 this.directAddressingMaxOversizingFactor = 
directAddressingMaxOversizingFactor;
-bytes = new BytesStore(bytesPageBits);
 // pad: ensure no node gets address 0 which is reserved to mean
 // the stop state w/ no arcs
-bytes.writeByte((byte) 0);
-fst = new FST<>(new FST.FSTMetadata<>(inputType, null, -1, 
VERSION_CURRENT, 0), outputs, bytes);
+try {
+  fstWriter.writeByte((byte) 0);
+} catch (IOException e) {

Review Comment:
   Can we do this in a follow-up PR? Having this constructor throw an exception 
means the public one also needs to throw it, and that one is used in a lot of 
places :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Ensure DrillSidewaysScorer calls LeafCollector#finish on all sideways-dim FacetsCollectors [lucene]

2023-11-13 Thread via GitHub


gsmiller merged PR #12640:
URL: https://github.com/apache/lucene/pull/12640


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-13 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1391146752


##
lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java:
##
@@ -26,7 +26,8 @@
 // TODO: merge with PagedBytes, except PagedBytes doesn't
 // let you read while writing which FST needs
 
-class BytesStore extends DataOutput implements FSTReader {
+// TODO: Separate the scratch writer and reader functionality
+class BytesStore extends DataOutput implements FSTReader, Freezeable {

Review Comment:
   Hm, I'm wondering why we should let the FSTCompiler know about the 
RandomAccessInput. In most cases, we would let the FSTCompiler just write 
to a DataOutput, and then users read it with the public FST constructor, 
right?
   
   The case where we want to read from the FST immediately after writing is the 
one with BytesStore.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-13 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1391973476


##
lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java:
##
@@ -26,7 +26,8 @@
 // TODO: merge with PagedBytes, except PagedBytes doesn't
 // let you read while writing which FST needs
 
-class BytesStore extends DataOutput implements FSTReader {
+// TODO: Separate the scratch writer and reader functionality
+class BytesStore extends DataOutput implements FSTReader, Freezeable {

Review Comment:
   Now thinking again, I think even the ByteBuffersDataOutput would need 
freezing as well; otherwise its last block will always be a full block, 
which might be a waste of RAM.
   
   Anyhow I removed the interface and the method for now. It's easy to add back.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-13 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1373993909


##
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##
@@ -287,9 +315,9 @@ public long getMappedStateCount() {
 return dedupHash == null ? 0 : nodeCount;
   }
 
-  private CompiledNode compileNode(UnCompiledNode nodeIn, int tailLength) 
throws IOException {

Review Comment:
   This `tailLength` is not being used anywhere, hence removed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-13 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1391973476


##
lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java:
##
@@ -26,7 +26,8 @@
 // TODO: merge with PagedBytes, except PagedBytes doesn't
 // let you read while writing which FST needs
 
-class BytesStore extends DataOutput implements FSTReader {
+// TODO: Separate the scratch writer and reader functionality
+class BytesStore extends DataOutput implements FSTReader, Freezeable {

Review Comment:
   Now thinking again, I think even the ByteBuffersDataOutput would need 
freezing as well; otherwise its last block will always be a full block, 
which might be a waste of RAM. Freezable provides a chance to optimize those 
data structures, as they are not to be modified again.
   
   Anyhow I removed the interface and the method for now. It's easy to add back.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]

2023-11-13 Thread via GitHub


zacharymorn commented on code in PR #240:
URL: https://github.com/apache/lucene/pull/240#discussion_r1392083799


##
lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/SearchWithCollectorTask.java:
##
@@ -45,20 +43,6 @@ public boolean withCollector() {
 return true;
   }
 
-  @Override
-  protected Collector createCollector() throws Exception {

Review Comment:
   Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]

2023-11-13 Thread via GitHub


zacharymorn commented on PR #240:
URL: https://github.com/apache/lucene/pull/240#issuecomment-1809654589

   > Looks great, thanks @zacharymorn! I just think we should revert the 
`lucene/benchmark` change that breaks testing of a custom collector for this 
first step ...
   
   Thanks @mikemccand for the review feedback and approval!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org