Re: [I] Upgrade to Gradle 8.4 [lucene]

2023-10-12 Thread via GitHub


dweiss commented on issue #12655:
URL: https://github.com/apache/lucene/issues/12655#issuecomment-1759046099

   So it seems something has changed and the `javaCompiler` property is now always non-null, causing -X options to be passed to the forked compiler (instead of -J...). I've changed this check to this:
   ```patch
   diff --git a/gradle/hacks/turbocharge-jvm-opts.gradle b/gradle/hacks/turbocharge-jvm-opts.gradle
   index 0f404e096b8..e730d878080 100644
   --- a/gradle/hacks/turbocharge-jvm-opts.gradle
   +++ b/gradle/hacks/turbocharge-jvm-opts.gradle
   @@ -57,8 +57,8 @@ allprojects {
            // are not part of up-to-date checks but these are internal JVM flags so we
            // don't care.
            //
   -        // Pass VM options via -J only with a custom javaHome AND when java toolchains are not used.
   -        if (task.options.forkOptions.javaHome != null && task.javaCompiler.getOrNull() == null) {
   +        // Pass VM options via -J when a custom javaHome is used and we're in fork mode.
   +        if (task.options.fork && task.options.forkOptions.javaHome != null) {
              return vmOpts.collect {"-J" + it}
            } else {
              return vmOpts
   ```
   and it seems to pass for me. The toolchains condition was about dodging some problem with JDK API regeneration (like `generateJdkApiJar19`) but I just ran it with and without RUNTIME_JAVA_HOME and it worked fine... Perhaps @uschindler will remember what the problem was. I'll commit the above as a temporary workaround.
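   The intent of that check can be sketched in plain Java (names here are illustrative, not the actual Gradle API): when the compiler is forked with a custom javaHome, the VM options must carry a `-J` prefix so the forked javac applies them to its own JVM instead of treating them as compiler flags.

   ```java
   import java.util.List;

   // Sketch with illustrative names (not Gradle's API): decide whether VM
   // options need the "-J" prefix for a forked javac.
   class JvmOptsSketch {
       static List<String> effectiveOpts(List<String> vmOpts, boolean fork, boolean customJavaHome) {
           if (fork && customJavaHome) {
               // forked compiler: "-J-Xmx4g" tells javac to pass "-Xmx4g" to its own JVM
               return vmOpts.stream().map(o -> "-J" + o).toList();
           }
           // in-process compilation: pass the options through unchanged
           return vmOpts;
       }
   }
   ```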


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Increase the number of dims for KNN vectors to 2048 [LUCENE-10471] [lucene]

2023-10-12 Thread via GitHub


MarcusSorealheis commented on issue #11507:
URL: https://github.com/apache/lucene/issues/11507#issuecomment-1759050979

   I think we should close it for sure. 





Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2023-10-12 Thread via GitHub


vsop-479 commented on PR #11888:
URL: https://github.com/apache/lucene/pull/11888#issuecomment-1759050886

   Appending some performance data. Note that the results vary considerably between rounds.
   
   # round1
   ```
   Task                         QPS baseline  StdDev  QPS bsearch  StdDev  Pct diff            p-value
   BrowseRandomLabelTaxoFacets       2601.88  (5.0%)      2437.93 (11.1%)    -6.3% (-21% - 10%)  0.021
   BrowseRandomLabelSSDVFacets       1796.93  (7.1%)      1696.76  (8.0%)    -5.6% (-19% - 10%)  0.019
   MedPhrase                         4021.51  (6.7%)      3813.81  (7.6%)    -5.2% (-18% -  9%)  0.022
   HighPhrase                         821.97  (7.3%)       788.46  (6.3%)    -4.1% (-16% - 10%)  0.059
   HighTerm                          6221.04  (7.3%)      6032.11  (5.6%)    -3.0% (-14% - 10%)  0.138
   Fuzzy2                             276.46  (6.0%)       268.79  (4.8%)    -2.8% (-12% -  8%)  0.105
   Respell                            707.62  (5.9%)       692.48  (4.9%)    -2.1% (-12% -  9%)  0.211
   BrowseDayOfYearSSDVFacets         6517.91  (5.4%)      6392.55  (6.0%)    -1.9% (-12% - 10%)  0.287
   BrowseDateSSDVFacets              2195.69 (19.2%)      2155.70  (9.7%)    -1.8% (-25% - 33%)  0.705
   BrowseMonthSSDVFacets             6724.35  (6.7%)      6606.24  (6.1%)    -1.8% (-13% - 11%)  0.386
   Prefix3                           3023.99  (4.5%)      2974.60  (5.3%)    -1.6% (-10% -  8%)  0.293
   Fuzzy1                             886.09  (5.1%)       877.56  (6.0%)    -1.0% (-11% - 10%)  0.587
   PKLookup                           396.87 (15.8%)       393.83 (19.2%)    -0.8% (-30% - 40%)  0.890
   OrHighLow                         3242.80  (6.0%)      3223.53  (7.2%)    -0.6% (-13% - 13%)  0.778
   BrowseMonthTaxoFacets             5874.67  (6.1%)      5856.79  (5.1%)    -0.3% (-10% - 11%)  0.865
   IntNRQ                            2799.54  (4.7%)      2792.69  (5.4%)    -0.2% ( -9% - 10%)  0.878
   LowIntervalsOrdered               1336.37  (8.1%)      1333.20  (5.6%)    -0.2% (-12% - 14%)  0.914
   HighSpanNear                      2660.49  (6.7%)      2654.45  (6.1%)    -0.2% (-12% - 13%)  0.911
   LowTerm                           9965.56  (8.1%)      9961.77 (10.6%)    -0.0% (-17% - 20%)  0.990
   AndHighHigh                       3384.43  (7.0%)      3388.41 (11.3%)     0.1% (-16% - 19%)  0.968
   HighSloppyPhrase                  1984.76  (5.8%)      1988.83  (5.3%)     0.2% (-10% - 12%)  0.908
   MedIntervalsOrdered               7914.54  (8.2%)      7944.02 (10.2%)     0.4% (-16% - 20%)  0.899
   AndHighMed                        4097.29  (7.7%)      4121.75  (8.0%)     0.6% (-14% - 17%)  0.811
   LowSpanNear                       5107.67  (9.3%)      5145.19  (7.6%)     0.7% (-14% - 19%)  0.785
   HighTermMonthSort                 3221.73  (5.2%)      3245.54  (8.5%)     0.7% (-12% - 15%)  0.739
   HighIntervalsOrdered              1333.81  (7.6%)      1349.72  (5.3%)     1.2% (-10% - 15%)  0.564
   LowPhrase                         5029.07  (8.3%)      5091.95 (11.1%)     1.3% (-16% - 22%)  0.687
   Wildcard                          1327.36  (3.8%)      1346.91  (3.4%)     1.5% ( -5% -  9%)  0.197
   AndHighLow                        4382.38  (7.8%)      4447.59  (6.9%)     1.5% (-12% - 17%)  0.524
   OrHighMed                         3121.72  (7.2%)      3169.60  (6.4%)     1.5% (-11% - 16%)  0.478
   HighTermDayOfYearSort             3766.72  (6.2%)      3825.04  (7.2%)     1.5% (-11% - 15%)  0.467
   MedTerm                           8666.16  (7.3%)      8841.37  (8.1%)     2.0% (-12% - 18%)  0.406
   LowSloppyPhrase                   3303.11  (6.5%)      3374.89  (7.8%)     2.2% (-11% - 17%)  0.341
   OrHighHigh                        2458.90  (7.7%)      2512.82  (5.2%)     2.2% ( -9% - 16%)  0.289
   BrowseDateTaxoFacets              6229.11  (5.5%)      6366.43  (5.7%)     2.2% ( -8% - 14%)  0.211
   BrowseDayOfYearTaxoFacets         5695.81  (6.8%)      5830.89  (6.6%)     2.4% (-10% - 16%)  0.265
   MedSpanNear                       3161.49  (6.8%)      3242.45  (5.4%)     2.6% ( -8% - 15%)  0.186
   MedSloppyPhrase                   3363.11  (7.0%)      3456.66  (7.6%)     2.8% (-11% - 18%)  0.230
   ```
   
   # round2
   ```
   Task  QPS baseline  StdDev  QPS bsearch  StdDev  Pct diff  p-va
   ```

Re: [I] Upgrade to Gradle 8.4 [lucene]

2023-10-12 Thread via GitHub


uschindler commented on issue #12655:
URL: https://github.com/apache/lucene/issues/12655#issuecomment-1759069434

   I don't remember any problem there. The toolchain use for 
`generateJdkApiJarXX` was added much later and I did not touch that. Maybe 
check git blame.





Re: [I] Upgrade to Gradle 8.4 [lucene]

2023-10-12 Thread via GitHub


dweiss commented on issue #12655:
URL: https://github.com/apache/lucene/issues/12655#issuecomment-1759073292

   I did check that file's history - it used an explicit exclusion for 
compileMain19Java:
   ```
   if (task.path == ":lucene:core:compileMain19Java") {
     // uses "uschindler" toolchain method, which erases "dweiss" method, but later at configure time
     task.options.forkOptions.jvmArgs += vmOpts
   ```
   Not worth spending more time on it - if something fails, we'll fix it.





Re: [I] Nightly benchmark regression for term dict queries [lucene]

2023-10-12 Thread via GitHub


gf2121 commented on issue #12659:
URL: https://github.com/apache/lucene/issues/12659#issuecomment-1759141669

   I wrote a JMH benchmark to compare reading `vint` vs `msbvint`.
   ```
   Benchmark                         (size)  (valueBit)   Mode  Cnt   Score   Error   Units
   ReadVLongBenchmark.readVLong         128           7  thrpt    5  29.406 ± 0.148  ops/us
   ReadVLongBenchmark.readVLong         128          14  thrpt    5   3.644 ± 0.013  ops/us
   ReadVLongBenchmark.readVLong         128          21  thrpt    5   2.416 ± 0.004  ops/us
   ReadVLongBenchmark.readVLong         128          28  thrpt    5   2.074 ± 0.005  ops/us
   ReadVLongBenchmark.readVLong         128          35  thrpt    5   1.838 ± 0.001  ops/us
   ReadVLongBenchmark.readVLong         128          42  thrpt    5   1.958 ± 0.004  ops/us
   ReadVLongBenchmark.readVLong         128          49  thrpt    5   1.511 ± 0.012  ops/us
   ReadVLongBenchmark.readVLong         128          56  thrpt    5   1.463 ± 0.005  ops/us
   ReadVLongBenchmark.readVLong         128          63  thrpt    5   1.299 ± 0.092  ops/us
   ReadVLongBenchmark.writeVLongMSB     128           7  thrpt    5  28.622 ± 0.865  ops/us
   ReadVLongBenchmark.writeVLongMSB     128          14  thrpt    5   4.479 ± 0.013  ops/us
   ReadVLongBenchmark.writeVLongMSB     128          21  thrpt    5   3.647 ± 0.007  ops/us
   ReadVLongBenchmark.writeVLongMSB     128          28  thrpt    5   2.434 ± 0.372  ops/us
   ReadVLongBenchmark.writeVLongMSB     128          35  thrpt    5   1.885 ± 0.006  ops/us
   ReadVLongBenchmark.writeVLongMSB     128          42  thrpt    5   1.516 ± 0.005  ops/us
   ReadVLongBenchmark.writeVLongMSB     128          49  thrpt    5   1.271 ± 0.004  ops/us
   ReadVLongBenchmark.writeVLongMSB     128          56  thrpt    5   1.095 ± 0.003  ops/us
   ReadVLongBenchmark.writeVLongMSB     128          63  thrpt    5   0.961 ± 0.002  ops/us
   ```
   
   
   Code 
   
   ```
   package testing;

   import org.apache.lucene.store.ByteArrayDataOutput;
   import org.apache.lucene.store.DataOutput;
   import org.apache.lucene.util.packed.PackedInts;
   import org.openjdk.jmh.annotations.*;

   import java.io.IOException;
   import java.util.concurrent.ThreadLocalRandom;
   import java.util.concurrent.TimeUnit;

   @BenchmarkMode(Mode.Throughput)
   @OutputTimeUnit(TimeUnit.MICROSECONDS)
   @State(Scope.Benchmark)
   @Warmup(iterations = 3, time = 3)
   @Measurement(iterations = 5, time = 3)
   @Fork(value = 1, jvmArgsPrepend = {"--add-modules=jdk.incubator.vector"})
   public class ReadVLongBenchmark {

     private byte[] bytes;
     private byte[] msbBytes;
     private int written;

     @Param({"128"})
     int size;

     @Param({"7", "14", "21", "28", "35", "42", "49", "56", "63"})
     int valueBit;

     @Setup(Level.Trial)
     public void init() throws Exception {
       bytes = new byte[size * (valueBit / 7)];
       msbBytes = new byte[size * (valueBit / 7)];
       ByteArrayDataOutput out = new ByteArrayDataOutput(bytes);
       ByteArrayDataOutput msbOut = new ByteArrayDataOutput(msbBytes);
       written = 0;
       final long min = 1L << (valueBit - 1);
       final long max = valueBit == 63 ? Long.MAX_VALUE : (1L << valueBit) - 1;
       while (out.getPosition() < bytes.length - 10) {
         written++;
         long l = ThreadLocalRandom.current().nextLong(min, max);
         if (PackedInts.bitsRequired(l) != valueBit) {
           throw new IllegalStateException();
         }
         out.writeVLong(l);
         writeMSBVLong(l, msbOut);
       }
       if (readVLong() != writeVLongMSB()) {
         throw new IllegalStateException();
       }
     }

     @Benchmark
     public long readVLong() {
       int pos = 0;
       long res = 0;
       for (int iter = 0; iter < written; iter++) {
         byte b = bytes[pos++];
         long i = b & 0x7F;
         for (int shift = 7; (b & 0x80) != 0; shift += 7) {
           b = bytes[pos++];
           i |= (b & 0x7FL) << shift;
         }
         res ^= i;
       }
       return res;
     }

     @Benchmark
     public long writeVLongMSB() {
       int pos = 0;
       long res = 0;
       for (int iter = 0; iter < written; iter++) {
         long i = 0L;
         while (true) {
           byte b = msbBytes[pos++];
           i = (i << 7) | (b & 0x7FL);
           if ((b & 0x80) == 0) {
             break;
           }
         }
         res ^= i;
       }
       return res;
     }

     static void writeMSBVLong(long l, DataOutput scratchBytes) throws IOException {
       assert l >= 0;
       // Keep zero bits on most significant byte to have more chance to get
       // prefix bytes shared.
       // e.g. we expect 0x7FFF stored as [0x81, 0xFF, 0x7F] but not [0xFF, 0xFF, 0x40]
       final int bytesNeeded = (Long.SIZE - Long.numberOfLeadingZeros(l) - 1) / 7 + 1;
       l <<= Long.SIZE - bytesNeeded * 7;
       for (int i = 1; i < bytesNeeded; i++) {
         // (tail reconstructed -- the archived message is truncated from here on)
         scratchBytes.writeByte((byte) (((l >>> 57) & 0x7FL) | 0x80));
         l = l << 7;
       }
       scratchBytes.writeByte((byte) (l >>> 57));
     }
   }
   ```

Re: [I] Upgrade to Gradle 8.4 [lucene]

2023-10-12 Thread via GitHub


uschindler commented on issue #12655:
URL: https://github.com/apache/lucene/issues/12655#issuecomment-1759182112

   Hi, there is another problem with the regenerate task: toolchains do not download automatically, even though auto-download is enabled. I am digging.





Re: [I] Upgrade to Gradle 8.4 [lucene]

2023-10-12 Thread via GitHub


uschindler commented on issue #12655:
URL: https://github.com/apache/lucene/issues/12655#issuecomment-1759187489

   Very strange output if I clean the JDK cache in `~/.gradle/jdks`.





Re: [I] Upgrade to Gradle 8.4 [lucene]

2023-10-12 Thread via GitHub


uschindler commented on issue #12655:
URL: https://github.com/apache/lucene/issues/12655#issuecomment-1759191108

   Task info by `gradlew tasks` is correct:
   
   generateJdkApiJar19 - Regenerate the API-only JAR file with public Panama Foreign & Vector API from JDK 19
   generateJdkApiJar20 - Regenerate the API-only JAR file with public Panama Foreign & Vector API from JDK 20
   generateJdkApiJar21 - Regenerate the API-only JAR file with public Panama Foreign & Vector API from JDK 21
   
   





Re: [I] Upgrade to Gradle 8.4 [lucene]

2023-10-12 Thread via GitHub


uschindler commented on issue #12655:
URL: https://github.com/apache/lucene/issues/12655#issuecomment-1759203200

   > I did check that file's history - it used an explicit exclusion for compileMain19Java:
   > 
   > ```
   > if (task.path == ":lucene:core:compileMain19Java") {
   >   // uses "uschindler" toolchain method, which erases "dweiss" method, but later at configure time
   >   task.options.forkOptions.jvmArgs += vmOpts
   > ```
   > 
   > Not worth spending more time on it - if something fails, we'll fix it.
   
   This is no longer relevant, as we no longer use the toolchain-specific compiler. We now generate the API jars and patch them into the `java.base` module.
   
   Toolchains are only used for launching the API extractor. This is somehow broken since Gradle 8.4; I'm trying to figure out what's wrong.





Re: [I] Nightly benchmark regression for term dict queries [lucene]

2023-10-12 Thread via GitHub


gf2121 commented on issue #12659:
URL: https://github.com/apache/lucene/issues/12659#issuecomment-1759244758

   I tried another way of reading, like:
   ```
   byte b = msbBytes[pos++];
   long i = b & 0x7FL;
   while (b < 0) {
     b = msbBytes[pos++];
     i = (i << 7) | (b & 0x7FL);
   }
   ```
   
   Following are the benchmark results for this (`readVLongMSBNew`):
   
   * Comparing `readVLongMSB` and `readVLong`: some `valueBit`s are better, others worse.
   * Comparing `readVLongMSBNew` and `readVLong`: `readVLongMSBNew` wins for all `valueBit`s.
   
   I'm considering trying the new reading approach and seeing what happens in the nightly benchmark.
   ```
   Benchmark                           (size)  (valueBit)   Mode  Cnt   Score   Error   Units
   ReadVLongBenchmark.readVLong           128           7  thrpt    5  29.333 ± 0.626  ops/us
   ReadVLongBenchmark.readVLong           128          14  thrpt    5   3.643 ± 0.026  ops/us
   ReadVLongBenchmark.readVLong           128          21  thrpt    5   2.411 ± 0.047  ops/us
   ReadVLongBenchmark.readVLong           128          28  thrpt    5   2.075 ± 0.004  ops/us
   ReadVLongBenchmark.readVLong           128          35  thrpt    5   1.836 ± 0.002  ops/us
   ReadVLongBenchmark.readVLong           128          42  thrpt    5   1.943 ± 0.006  ops/us
   ReadVLongBenchmark.readVLong           128          49  thrpt    5   1.514 ± 0.006  ops/us
   ReadVLongBenchmark.readVLong           128          56  thrpt    5   1.463 ± 0.002  ops/us
   ReadVLongBenchmark.readVLong           128          63  thrpt    5   1.338 ± 0.002  ops/us
   ReadVLongBenchmark.readVLongMSB        128           7  thrpt    5  29.350 ± 0.140  ops/us
   ReadVLongBenchmark.readVLongMSB        128          14  thrpt    5   4.474 ± 0.025  ops/us
   ReadVLongBenchmark.readVLongMSB        128          21  thrpt    5   3.640 ± 0.033  ops/us
   ReadVLongBenchmark.readVLongMSB        128          28  thrpt    5   2.439 ± 0.324  ops/us
   ReadVLongBenchmark.readVLongMSB        128          35  thrpt    5   1.879 ± 0.009  ops/us
   ReadVLongBenchmark.readVLongMSB        128          42  thrpt    5   1.508 ± 0.037  ops/us
   ReadVLongBenchmark.readVLongMSB        128          49  thrpt    5   1.268 ± 0.012  ops/us
   ReadVLongBenchmark.readVLongMSB        128          56  thrpt    5   1.091 ± 0.024  ops/us
   ReadVLongBenchmark.readVLongMSB        128          63  thrpt    5   0.960 ± 0.005  ops/us
   ReadVLongBenchmark.readVLongMSBNew     128           7  thrpt    5  31.110 ± 0.254  ops/us
   ReadVLongBenchmark.readVLongMSBNew     128          14  thrpt    5   4.075 ± 0.010  ops/us
   ReadVLongBenchmark.readVLongMSBNew     128          21  thrpt    5   2.600 ± 0.008  ops/us
   ReadVLongBenchmark.readVLongMSBNew     128          28  thrpt    5   2.244 ± 0.006  ops/us
   ReadVLongBenchmark.readVLongMSBNew     128          35  thrpt    5   1.946 ± 0.005  ops/us
   ReadVLongBenchmark.readVLongMSBNew     128          42  thrpt    5   2.102 ± 0.016  ops/us
   ReadVLongBenchmark.readVLongMSBNew     128          49  thrpt    5   1.615 ± 0.209  ops/us
   ReadVLongBenchmark.readVLongMSBNew     128          56  thrpt    5   1.614 ± 0.032  ops/us
   ReadVLongBenchmark.readVLongMSBNew     128          63  thrpt    5   1.465 ± 0.008  ops/us
   ```
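   For reference, the MSB-first VLong layout under discussion can be sketched in a self-contained form (a byte-array variant for illustration; Lucene's real code writes to a `DataOutput` and reads from an `IndexInput`). The high bit of every byte except the last is a continuation flag, and the payload arrives most-significant 7-bit group first, which is what lets the decoder fold bytes in with `(i << 7) | (b & 0x7F)`:

   ```java
   // Sketch of MSB-first variable-length long encoding/decoding.
   class MsbVLong {
       // Write l in MSB-first groups of 7 bits; returns the new position.
       static int write(long l, byte[] dst, int pos) {
           // number of 7-bit groups needed (l | 1 makes l == 0 take one byte)
           int bytesNeeded = (64 - Long.numberOfLeadingZeros(l | 1) - 1) / 7 + 1;
           l <<= 64 - bytesNeeded * 7;                            // left-align the payload
           for (int i = 1; i < bytesNeeded; i++) {
               dst[pos++] = (byte) (((l >>> 57) & 0x7F) | 0x80);  // continuation bit set
               l <<= 7;
           }
           dst[pos++] = (byte) (l >>> 57);                        // last byte: no continuation bit
           return pos;
       }

       // The "new" reading loop from the comment above: loop while the
       // continuation bit (the sign bit of the byte) is set.
       static long read(byte[] src, int pos) {
           byte b = src[pos++];
           long i = b & 0x7FL;
           while (b < 0) {
               b = src[pos++];
               i = (i << 7) | (b & 0x7FL);
           }
           return i;
       }

       public static void main(String[] args) {
           byte[] buf = new byte[10];
           for (long v : new long[] {0, 1, 0x7FFF, Long.MAX_VALUE}) {
               write(v, buf, 0);
               if (read(buf, 0) != v) throw new AssertionError("round-trip failed for " + v);
           }
           System.out.println("ok");
       }
   }
   ```

   For example, 0x7FFF is stored as `[0x81, 0xFF, 0x7F]`, matching the comment in the benchmark code above: zero padding lands in the most significant byte, so values sharing a prefix share leading bytes.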
   
    
   
   CODE
   
   ```
   package testing;

   import org.apache.lucene.store.ByteArrayDataOutput;
   import org.apache.lucene.store.DataOutput;
   import org.apache.lucene.util.packed.PackedInts;
   import org.openjdk.jmh.annotations.*;

   import java.io.IOException;
   import java.util.concurrent.ThreadLocalRandom;
   import java.util.concurrent.TimeUnit;

   @BenchmarkMode(Mode.Throughput)
   @OutputTimeUnit(TimeUnit.MICROSECONDS)
   @State(Scope.Benchmark)
   @Warmup(iterations = 3, time = 3)
   @Measurement(iterations = 5, time = 3)
   @Fork(value = 1, jvmArgsPrepend = {"--add-modules=jdk.incubator.vector"})
   public class ReadVLongBenchmark {

     private byte[] bytes;
     private byte[] msbBytes;

     @Param({"128"})
     int size;

     @Param({"7", "14", "21", "28", "35", "42", "49", "56", "63"})
     int valueBit;

     @Setup(Level.Trial)
     public void init() throws Exception {
       bytes = new byte[size * (valueBit / 7)];
       msbBytes = new byte[size * (valueBit / 7)];
       ByteArrayDataOutput out = new ByteArrayDataOutput(bytes);
       ByteArrayDataOutput msbOut = new ByteArrayDataOutput(msbBytes);
       final long min = 1L << (valueBit - 1);
       final long max = valueBit == 63 ? Long.MAX_VALUE : (1L << valueBit) - 1;
       for (int i = 0; i < size; i++) {
         long l = ThreadLocalRandom.current().nextLong(min, max);
         if (PackedInts.bitsRequired(l) != valueBit) {
           throw new IllegalStateException();
         }
         out.writeVLong(l);
         writeMSBVLong(l, msbOut);
       }
       if (readVLong() != readVLongMSB()) {
         System.out.println("vlong m
   ```

[PR] read MSB VLong in new way [lucene]

2023-10-12 Thread via GitHub


gf2121 opened a new pull request, #12661:
URL: https://github.com/apache/lucene/pull/12661

   ISSUE: https://github.com/apache/lucene/issues/12659
   





Re: [I] Nightly benchmark regression for term dict queries [lucene]

2023-10-12 Thread via GitHub


gf2121 commented on issue #12659:
URL: https://github.com/apache/lucene/issues/12659#issuecomment-1759262343

   Another possible reason I'm thinking of: maybe `readMSBVLong` is not as hot as `readVLong`, so the compiler does not optimize it as much, or the `readByte` in the statistics-reading context may confuse the JVM because of the abstraction layer?





Re: [I] Upgrade to Gradle 8.4 [lucene]

2023-10-12 Thread via GitHub


uschindler commented on issue #12655:
URL: https://github.com/apache/lucene/issues/12655#issuecomment-1759281805

   OK, I found the bug behind the incorrect logger output, but that's not the issue here. The fix is easy; I will provide a PR for the regenerator.
   
   The reason is that in a Groovy `for (x : list) {}` loop you cannot use `x` in a closure, because the loop variable is not final (unlike in Java). This may be a change in newer Groovy versions.
   
   But it looks like, at least on Windows, Gradle toolchains can't auto-download the JDK version. I will dig further.
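   The capture pitfall can be reproduced in plain Java as a sketch (Java forbids capturing a mutable loop variable in a lambda directly, so a one-element array stands in for Groovy's non-final loop variable):

   ```java
   import java.util.ArrayList;
   import java.util.List;

   // Java analogue of the Groovy pitfall: a closure executed *after* the loop
   // sees only the final value of the mutable variable it captured.
   class DeferredCapture {
       static List<String> runDeferred() {
           List<Runnable> deferred = new ArrayList<>();
           List<String> outs = new ArrayList<>();
           int[] jdk = new int[1];  // mutable holder, like Groovy's loop variable
           for (int v : new int[] {19, 20, 21}) {
               jdk[0] = v;
               // the lambda captures the holder, not the value at add() time
               deferred.add(() -> outs.add("generateJdkApiJar" + jdk[0]));
           }
           deferred.forEach(Runnable::run);  // all three see jdk[0] == 21
           return outs;
       }

       public static void main(String[] args) {
           System.out.println(runDeferred());  // three copies of "generateJdkApiJar21"
       }
   }
   ```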





Re: [I] Upgrade to Gradle 8.4 [lucene]

2023-10-12 Thread via GitHub


uschindler commented on issue #12655:
URL: https://github.com/apache/lucene/issues/12655#issuecomment-1759287508

   This fixes the logging problem:
   
   ```patch
   diff --git a/gradle/generation/extract-jdk-apis.gradle b/gradle/generation/extract-jdk-apis.gradle
   index 28f56ea6432..020dc052813 100644
   --- a/gradle/generation/extract-jdk-apis.gradle
   +++ b/gradle/generation/extract-jdk-apis.gradle
   @@ -38,7 +38,7 @@ configure(project(":lucene:core")) {
        apiextractor "org.ow2.asm:asm:${scriptDepVersions['asm']}"
      }
    
   -  for (jdkVersion : mrjarJavaVersions) {
   +  mrjarJavaVersions.each { jdkVersion ->
        def task = tasks.create(name: "generateJdkApiJar${jdkVersion}", type: JavaExec) {
          description "Regenerate the API-only JAR file with public Panama Foreign & Vector API from JDK ${jdkVersion}"
          group "generation"
   ```





Re: [I] Upgrade to Gradle 8.4 [lucene]

2023-10-12 Thread via GitHub


uschindler commented on issue #12655:
URL: https://github.com/apache/lucene/issues/12655#issuecomment-1759309453

   It looks like Gradle toolchain download is fully broken in Gradle 8.4. It always says that it cannot find any JDK, no matter which OS (Windows, Linux).





[PR] Fix Gradle toolchain download with Gradle 8.4 (#12655) [lucene]

2023-10-12 Thread via GitHub


uschindler opened a new pull request, #12662:
URL: https://github.com/apache/lucene/pull/12662

   Fix the toolchain support when using 8.4:
   
   - Logging of failures is wrong: in Groovy the for-loop variable is not final, so when it is used in a closure that is executed later, you get the last value of the loop.
   - Figure out why toolchain auto-download is broken (not yet fixed)





Re: [I] Upgrade to Gradle 8.4 [lucene]

2023-10-12 Thread via GitHub


uschindler commented on issue #12655:
URL: https://github.com/apache/lucene/issues/12655#issuecomment-1759316894

   Here is the draft PR; it only fixes the logging, but I have no idea why the toolchains are not downloaded. If you clean `~/.gradle/jdks` and then run `gradlew :lucene:core:regenerate`, it tells you that it found no suitable JDK.





[PR] Initial take at adding JMH micro-benchmarks [lucene]

2023-10-12 Thread via GitHub


dweiss opened a new pull request, #12663:
URL: https://github.com/apache/lucene/pull/12663

   As per discussion in: https://github.com/apache/lucene/issues/12641
   
   This patch adds the ability to compile JMH microbenchmarks and run them from within the Lucene codebase. I didn't use any plugins, as they, in my opinion, obscure the view of what's happening and what's needed. I also wanted it to work with the module system and without uber-jars.
   
   ```
   gradlew -p lucene/benchmark-jmh assemble
   ```
   
   compiles microbenchmarks with JMH, then displays a short help on what to do 
next:
   ```
   JMH benchmarks compiled. Run them with:
   
   java -jar 
lucene\benchmark-jmh\build\benchmarks\lucene-benchmark-jmh-10.0.0-SNAPSHOT.jar
   
   or
   
   java --module-path lucene\benchmark-jmh\build\benchmarks --module 
org.apache.lucene.benchmark.jmh
   
   JMH options you can use with the above:
   
 -h  displays verbose help for all options
 -l  list available benchmarks
 -lp list benchmarks that pass the filter and their parameters
 regexp  execute all benchmark containing regexp
   ```
   
   I didn't add any gradle-specific tasks to run the benchmarks directly - it's 
technically easy but it seems like a bad idea to fork a subprocess from (quite 
heavy) gradle when such delicate things are being measured? I'm not sure here.
   
   Some future improvements/ remaining issues include:
   
   1) multi-source compilation so that code can be compiled against future APIs 
(not the one gradle runs with).
   2) forbiddenapis doesn't have access to vectors and I had to turn it off,
   3) jmh is GPL. It's a dev tool, I don't think it's a problem. I left license 
checks out of it for the time being.
   





Re: [I] ability to run JMH benchmarks from gradle [lucene]

2023-10-12 Thread via GitHub


dweiss commented on issue #12641:
URL: https://github.com/apache/lucene/issues/12641#issuecomment-1759370475

   I filed an initial take on this here:
   https://github.com/apache/lucene/pull/12663
   
   Copied just one of your existing benchmarks, Rob. Try it and see if it fits your needs. I mentioned on the PR that there is no specific task to launch a benchmark - this is easy to add, I'm just not sure it's a good idea to do so... Especially since you can always pipe:
   
   ```
   gradlew -p lucene/benchmark-jmh assemble && java --module-path 
lucene/benchmark-jmh/build/benchmarks --module org.apache.lucene.benchmark.jmh
   ```
   
   or even create a bash alias for it?...





Re: [I] Upgrade to Gradle 8.4 [lucene]

2023-10-12 Thread via GitHub


uschindler commented on issue #12655:
URL: https://github.com/apache/lucene/issues/12655#issuecomment-1759372394

   I found the issue: in Gradle 8+ you need to explicitly configure toolchain repositories:
   
   > [Using automatic toolchain downloading without having a repository 
configured](https://docs.gradle.org/current/userguide/upgrading_version_7.html#using_automatic_toolchain_downloading_without_having_a_repository_configured)
   > Automatic toolchain downloading without explicitly providing repositories 
to use is no longer supported. See [user 
manual](https://docs.gradle.org/current/userguide/toolchains.html#sub:download_repositories)
 for more information.
   
   I will add the foojay one in the PR.
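   For reference, the usual way to declare such a repository is the foojay resolver convention plugin in `settings.gradle` (a sketch; the version number is illustrative):

   ```
   // settings.gradle -- "settings" plugins go here, not in a project build script
   plugins {
     id "org.gradle.toolchains.foojay-resolver-convention" version "0.7.0"
   }
   ```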





Re: [I] Upgrade to Gradle 8.4 [lucene]

2023-10-12 Thread via GitHub


dweiss commented on issue #12655:
URL: https://github.com/apache/lucene/issues/12655#issuecomment-1759389210

   > because the loop variable is not final (like in Java)
   
   In Groovy, closures do have access to their full context (including local variables declared outside). That code used a GString, which is a dynamically evaluated string - if you had used string concatenation, the result would have been immutable and it would have worked fine.
   





Re: [I] Upgrade to Gradle 8.4 [lucene]

2023-10-12 Thread via GitHub


uschindler commented on issue #12655:
URL: https://github.com/apache/lucene/issues/12655#issuecomment-1759397039

   OK, the PR is ready. The plugin needs to be in settings.gradle (it is a so-called "settings" plugin, which works at the Gradle top level, without project context).
   
   PR: #12662 (please review)





Re: [I] Upgrade to Gradle 8.4 [lucene]

2023-10-12 Thread via GitHub


uschindler commented on issue #12655:
URL: https://github.com/apache/lucene/issues/12655#issuecomment-1759401511

   > > because the loop variable is not final (like in Java)
   > 
   > In groovy closures do have access to their full context (including local variables outside). That code used gstring, which is a dynamically evaluated string - if you used a string concatenation, it'd be immutable and worked fine.
   
   The problem was not the GString/strings. The problem was that the `onlyIf` closure is executed at execution time (so delayed), not at config time. By execution time, the for loop had long since finished (it ran at config time), so the shared `jdkVersion` variable was already at 21, the last entry. By using an `each` closure, `jdkVersion` is private to the closure, so the execution-time evaluation sees its own `jdkVersion` variable (which is effectively a constant).
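The capture difference can be sketched outside Gradle. Here is a toy Java version (not the actual build script): the shared loop variable is mimicked with a one-element array, since Java itself only allows capturing effectively-final variables.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntSupplier;

public class LateBinding {
    static List<IntSupplier> build() {
        int[] shared = new int[1]; // mutable holder, mimicking Groovy's shared loop variable
        List<IntSupplier> deferred = new ArrayList<>();
        for (int v : new int[] {17, 20, 21}) {
            shared[0] = v;
            deferred.add(() -> shared[0]); // late-bound: sees whatever the holder contains when run
            final int copy = v;
            deferred.add(() -> copy);      // per-iteration copy: bound immediately
        }
        return deferred;
    }

    public static void main(String[] args) {
        // The shared captures all report the last value (21); the copies report 17, 20, 21.
        for (IntSupplier s : build()) {
            System.out.println(s.getAsInt());
        }
    }
}
```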





Re: [PR] Fix Gradle toolchain download with Gradle 8.4 (#12655) [lucene]

2023-10-12 Thread via GitHub


uschindler merged PR #12662:
URL: https://github.com/apache/lucene/pull/12662





Re: [I] Upgrade to Gradle 8.4 [lucene]

2023-10-12 Thread via GitHub


uschindler closed issue #12655: Upgrade to Gradle 8.4
URL: https://github.com/apache/lucene/issues/12655





Re: [PR] Extract the hnsw graph merging from being part of the vector writer [lucene]

2023-10-12 Thread via GitHub


benwtrent commented on code in PR #12657:
URL: https://github.com/apache/lucene/pull/12657#discussion_r1356732427


##
lucene/core/src/java/org/apache/lucene/codecs/lucene95/IncrementalHnswGraphMerger.java:
##
@@ -0,0 +1,226 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene95;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.codecs.HnswGraphProvider;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat;
+import org.apache.lucene.index.ByteVectorValues;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.FloatVectorValues;
+import org.apache.lucene.index.MergeState;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.hnsw.HnswGraph;
+import org.apache.lucene.util.hnsw.HnswGraphBuilder;
+import org.apache.lucene.util.hnsw.InitializedHnswGraphBuilder;
+import org.apache.lucene.util.hnsw.RandomVectorScorerSupplier;
+
+/**
+ * This selects the biggest Hnsw graph from the provided merge state and initializes a new
+ * HnswGraphBuilder with that graph as a starting point.
+ */
+public class IncrementalHnswGraphMerger {
+
+  private final MergeState mergeState;
+  private final FieldInfo fieldInfo;
+
+  /**
+   * @param mergeState MergeState containing the readers to merge
+   * @param fieldInfo FieldInfo for the field being merged
+   */
+  public IncrementalHnswGraphMerger(MergeState mergeState, FieldInfo fieldInfo) {
+    this.mergeState = mergeState;
+    this.fieldInfo = fieldInfo;
+  }
+
+  /**
+   * Selects the biggest Hnsw graph from the provided merge state and initializes a new
+   * HnswGraphBuilder with that graph as a starting point.
+   *
+   * @param scorerSupplier ScorerSupplier to use for scoring vectors
+   * @param M The number of connections to allow per node
+   * @param beamWidth The number of nodes to consider when searching for the nearest neighbor
+   * @return HnswGraphBuilder initialized with the biggest graph from the merge state
+   * @throws IOException If an error occurs while reading from the merge state
+   */
+  public HnswGraphBuilder build(RandomVectorScorerSupplier scorerSupplier, int M, int beamWidth)
+      throws IOException {
+    // Find the KnnVectorReader with the most docs that meets the following criteria:

Review Comment:
   @zhaih I was also thinking of a "true" builder interface that is actually 
extensible. I will have another crack at it today.
   
   The tricky part is that we are not 100% sure what another merger kind would 
look like. 
   
   Another option that could work well would be `HnswGraphMerger#addGraph` and `HnswGraphMerger#merge`. I will iterate some more to see what I can come up with.
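A self-contained toy sketch of that addGraph/merge shape — `Graph` is a stand-in type, not Lucene's `HnswGraph`, and the names come from the comment above, not from any actual Lucene API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shape of the merger interface discussed above. The "merge" step
// here just picks the biggest graph as the starting point, mirroring
// IncrementalHnswGraphMerger's strategy.
public class GraphMergerSketch {
    record Graph(int size) {}

    private final List<Graph> graphs = new ArrayList<>();

    void addGraph(Graph g) {
        graphs.add(g); // feed each segment's graph
    }

    Graph merge() {
        // Select the largest accumulated graph to seed the merged result.
        return graphs.stream()
                .max((a, b) -> Integer.compare(a.size(), b.size()))
                .orElseThrow();
    }

    public static void main(String[] args) {
        GraphMergerSketch m = new GraphMergerSketch();
        m.addGraph(new Graph(10));
        m.addGraph(new Graph(250));
        m.addGraph(new Graph(40));
        System.out.println(m.merge().size()); // prints 250
    }
}
```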






Re: [PR] Extract the hnsw graph merging from being part of the vector writer [lucene]

2023-10-12 Thread via GitHub


benwtrent commented on code in PR #12657:
URL: https://github.com/apache/lucene/pull/12657#discussion_r1356733186


##
lucene/core/src/java/org/apache/lucene/codecs/lucene95/IncrementalHnswGraphMerger.java:
##
@@ -0,0 +1,226 @@
+  public HnswGraphBuilder build(RandomVectorScorerSupplier scorerSupplier, int 
M, int beamWidth)

Review Comment:
   @zhaih I think so. But thinking about this more, it makes sense to me that a `HnswGraphMerger` accepts graphs and then returns a fully merged graph. So I am going to iterate on the API a bit.






Re: [PR] Extract the hnsw graph merging from being part of the vector writer [lucene]

2023-10-12 Thread via GitHub


benwtrent commented on code in PR #12657:
URL: https://github.com/apache/lucene/pull/12657#discussion_r1356734118


##
lucene/core/src/java/org/apache/lucene/codecs/lucene95/IncrementalHnswGraphMerger.java:
##
@@ -0,0 +1,226 @@
+  public HnswGraphBuilder build(RandomVectorScorerSupplier scorerSupplier, int 
M, int beamWidth)

Review Comment:
   @zhaih yes, I am thinking of a concurrent merger & a more advanced merger 
(not one that just initializes from the largest graph, but instead actually 
merges graphs...)






Re: [PR] Optimize reading data from postings/impacts enums. [lucene]

2023-10-12 Thread via GitHub


jpountz commented on PR #12664:
URL: https://github.com/apache/lucene/pull/12664#issuecomment-1759536405

   `luceneutil` on `wikibigall` gave good results, better than I expected:
   
   ```
                      Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff             p-value
     HighTermDayOfYearSort        270.41  (2.0%)                   255.63  (2.3%)     -5.5% ( -9% -  -1%)  0.000
           CountOrHighHigh         57.44 (16.1%)                    54.37 (11.0%)     -5.4% (-27% -  25%)  0.220
            CountOrHighMed         89.17 (15.9%)                    84.60 (11.1%)     -5.1% (-27% -  25%)  0.237
           CountAndHighMed        122.57  (3.4%)                   120.69  (4.4%)     -1.5% ( -8% -   6%)  0.212
                   Prefix3        105.15  (5.9%)                   104.66  (5.4%)     -0.5% (-11% -  11%)  0.795
                  Wildcard        112.31  (3.2%)                   111.94  (3.4%)     -0.3% ( -6% -   6%)  0.758
               CountPhrase          4.41  (2.2%)                     4.40  (4.2%)     -0.3% ( -6% -   6%)  0.776
                    IntNRQ        164.43 (13.4%)                   163.96 (12.7%)     -0.3% (-23% -  29%)  0.945
                   Respell         73.48  (1.9%)                    73.51  (1.7%)      0.0% ( -3% -   3%)  0.944
                 CountTerm      17115.51  (4.7%)                 17122.81  (6.5%)      0.0% (-10% -  11%)  0.981
                    Fuzzy2        126.72  (1.2%)                   127.18  (1.4%)      0.4% ( -2% -   2%)  0.380
                  PKLookup        224.32  (1.8%)                   225.19  (2.1%)      0.4% ( -3% -   4%)  0.532
                    Fuzzy1        148.99  (1.3%)                   149.61  (1.5%)      0.4% ( -2% -   3%)  0.345
                 LowPhrase         22.22  (3.9%)                    22.38  (3.6%)      0.7% ( -6% -   8%)  0.529
                   LowTerm       1050.65  (6.0%)                  1058.66  (4.2%)      0.8% ( -8% -  11%)  0.641
                 MedPhrase         69.20  (4.0%)                    69.91  (3.3%)      1.0% ( -6% -   8%)  0.377
                   MedTerm        611.39  (7.0%)                   618.72  (4.8%)      1.2% ( -9% -  13%)  0.525
         HighTermMonthSort       5173.19  (2.6%)                  5241.52  (2.2%)      1.3% ( -3% -   6%)  0.084
                  HighTerm        425.31  (7.9%)                   431.22  (6.0%)      1.4% (-11% -  16%)  0.532
                HighPhrase         45.70  (5.1%)                    46.39  (4.3%)      1.5% ( -7% -  11%)  0.311
          CountAndHighHigh         40.59  (4.0%)                    41.33  (5.1%)      1.8% ( -6% -  11%)  0.208
                 OrHighLow        554.87  (3.7%)                   579.95  (3.8%)      4.5% ( -2% -  12%)  0.000
                OrHighHigh         47.17  (4.9%)                    49.35  (5.1%)      4.6% ( -5% -  15%)  0.003
               AndHighHigh         69.41  (4.1%)                    73.39  (3.9%)      5.7% ( -2% -  14%)  0.000
                 OrHighMed        244.77  (3.4%)                   261.52  (3.3%)      6.8% (  0% -  13%)  0.000
                AndHighMed        124.85  (3.4%)                   134.47  (3.4%)      7.7% (  0% -  15%)  0.000
                AndHighLow       1059.79  (2.7%)                  1152.15  (2.2%)      8.7% (  3% -  14%)  0.000
   ```
   
   All `OrXY` and `AndXY` tasks show a good speedup with a p-value equal to 0. I'm pretty sure that the queries that show a regression are noise.





Re: [PR] Optimize reading data from postings/impacts enums. [lucene]

2023-10-12 Thread via GitHub


jpountz commented on code in PR #12664:
URL: https://github.com/apache/lucene/pull/12664#discussion_r1356764107


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90PostingsReader.java:
##
@@ -1143,14 +1151,16 @@ private void refillDocs() throws IOException {
   if (left >= BLOCK_SIZE) {
 pforUtil.decodeAndPrefixSum(docIn, accum, docBuffer);
 if (indexHasFreqs) {
-  pforUtil.decode(docIn, freqBuffer);
+  isFreqsRead = false;

Review Comment:
   This was the change needed to decode freqs lazily.
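The lazy-decoding pattern behind that flag can be shown in a toy, self-contained form (illustrative only, not the `Lucene90PostingsReader` internals): refilling a block no longer decodes freqs eagerly; instead a dirty flag defers decoding until `freq()` is actually called.

```java
// Minimal sketch of lazy freq decoding: decode the freq block at most once
// per refill, and only if a consumer actually asks for a frequency.
public class LazyFreqBlock {
    private final int[] encoded; // stand-in for the on-disk freq block
    private final int[] freqBuffer = new int[4];
    private boolean isFreqsRead = false;
    int decodeCalls = 0; // observability for the sketch

    LazyFreqBlock(int[] encoded) {
        this.encoded = encoded;
    }

    void refill() {
        isFreqsRead = false; // mark stale instead of decoding eagerly
    }

    int freq(int i) {
        if (!isFreqsRead) { // decode on demand, once per block
            System.arraycopy(encoded, 0, freqBuffer, 0, freqBuffer.length);
            decodeCalls++;
            isFreqsRead = true;
        }
        return freqBuffer[i];
    }

    public static void main(String[] args) {
        LazyFreqBlock b = new LazyFreqBlock(new int[] {3, 1, 4, 1});
        b.refill();
        System.out.println(b.freq(2)); // prints 4; decoding happened here, not in refill()
    }
}
```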






Re: [I] Upgrade to Gradle 8.4 [lucene]

2023-10-12 Thread via GitHub


risdenk commented on issue #12655:
URL: https://github.com/apache/lucene/issues/12655#issuecomment-1759556593

   Thanks @uschindler and @dweiss, and sorry for the work here :( I'll try to make sure I test all these cases next time.





Re: [PR] Optimize reading data from postings/impacts enums. [lucene]

2023-10-12 Thread via GitHub


jpountz commented on code in PR #12664:
URL: https://github.com/apache/lucene/pull/12664#discussion_r1356769495


##
lucene/core/src/java/org/apache/lucene/codecs/MultiLevelSkipListReader.java:
##
@@ -113,13 +113,14 @@ public int skipTo(int target) throws IOException {
 
 // walk up the levels until highest level is found that has a skip
 // for this target
+// TODO: start skip docs and postings at -1 instead of 0 to avoid this special case on skipDoc == 0?
 int level = 0;
-while (level < numberOfSkipLevels - 1 && target > skipDoc[level + 1]) {
+while (level < numberOfSkipLevels - 1 && (target > skipDoc[level + 1] || skipDoc[level + 1] == 0)) {
   level++;
 }
 
 while (level >= 0) {
-  if (target > skipDoc[level]) {
+  if (target > skipDoc[level] || skipDoc[level] == 0) {

Review Comment:
   It's a bit annoying that I had to change skip lists, but it was required so that skip lists and postings blocks remain aligned. Otherwise `advanceShallow(0)` would set `nextSkipDoc` to 0 instead of the last doc ID of the first block.






Re: [PR] Initial take at adding JMH micro-benchmarks [lucene]

2023-10-12 Thread via GitHub


rmuir commented on PR #12663:
URL: https://github.com/apache/lucene/pull/12663#issuecomment-1759558774

   @dweiss thank you for this! I tried it out and it works great. Glad to see 
the `cosineDistanceVectorUtil()` working in this example which is exactly what 
I want to do.





Re: [PR] Optimize reading data from postings/impacts enums. [lucene]

2023-10-12 Thread via GitHub


jpountz commented on PR #12664:
URL: https://github.com/apache/lucene/pull/12664#issuecomment-1759583212

   I tried to re-inline advance() but it still suggested a small slowdown for 
`HighTermDayOfYearSort` and `CountAndHighMed` too so I suspect that there is 
something else. There is indeed one more check when one needs to move to a 
different block, but I would expect the cost to be dominated by the decoding of 
a full block in that case so that it shouldn't make a big difference.
   
   The slowdown on `CountOrHighHigh` and `CountOrHighMed` reported above should 
be either noise or due to the JVM compiling things differently because of other 
changes, given that `PostingsEnum#nextDoc()` wasn't changed.
   
   I'm tempted to not think too hard about `HighTermDayOfYearSort` and 
`CountAndHighMed` since there are more queries that get faster than slower with 
this change. While it's not big, I would expect the speedup on 
`CountAndHighHigh` to be real since it generally needs to advance postings by 
small increments, which is what the change to `PostingsEnum#advance` is 
supposed to help with.





Re: [PR] Optimize reading data from postings/impacts enums. [lucene]

2023-10-12 Thread via GitHub


jpountz commented on PR #12664:
URL: https://github.com/apache/lucene/pull/12664#issuecomment-1759592840

   Another potential reason for the slowdowns is that they're caused by the 
changes to `MultiLevelSkipListReader`. Removing the additional checks would 
require changing the way we encode postings and skip lists to start from -1 
instead of 0, ie. changes on the write side too. (This feels like it would be a 
good change in general anyway, e.g. if you have a term that is present in every 
document today, we don't enable the "all deltas are the same" optimization for 
the first block of postings because the first delta in the block is actually 0 
and not 1.)
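The parenthetical about the dense term can be checked with a toy delta encoder (illustrative arithmetic, not Lucene's actual encoding): starting the "previous doc" at -1 instead of 0 makes every delta of a dense first block equal to 1, which is what the "all deltas are the same" optimization needs.

```java
import java.util.Arrays;

// Encodes doc IDs as deltas against a starting "previous doc" value and
// checks whether the whole block shares a single delta.
public class DeltaStart {
    static int[] deltas(int[] docs, int prev) {
        int[] out = new int[docs.length];
        for (int i = 0; i < docs.length; i++) {
            out[i] = docs[i] - prev;
            prev = docs[i];
        }
        return out;
    }

    static boolean allSame(int[] a) {
        return Arrays.stream(a).allMatch(v -> v == a[0]);
    }

    public static void main(String[] args) {
        int[] dense = {0, 1, 2, 3}; // term present in every doc
        System.out.println(allSame(deltas(dense, 0)));  // prints false: first delta is 0, the rest are 1
        System.out.println(allSame(deltas(dense, -1))); // prints true: every delta is 1
    }
}
```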





Re: [I] Increase the number of dims for KNN vectors to 2048 [LUCENE-10471] [lucene]

2023-10-12 Thread via GitHub


mayya-sharipova commented on issue #11507:
URL: https://github.com/apache/lucene/issues/11507#issuecomment-1759594047

   Yes, thanks for the reminder. Now Codec is responsible for managing dims, we 
can close it.





Re: [I] Increase the number of dims for KNN vectors to 2048 [LUCENE-10471] [lucene]

2023-10-12 Thread via GitHub


mayya-sharipova closed issue #11507: Increase the number of dims for KNN 
vectors to 2048 [LUCENE-10471]
URL: https://github.com/apache/lucene/issues/11507





Re: [PR] Initial take at adding JMH micro-benchmarks [lucene]

2023-10-12 Thread via GitHub


dweiss commented on PR #12663:
URL: https://github.com/apache/lucene/pull/12663#issuecomment-1759605550

   Let me review the minor remaining bits - I had to leave the computer before 
the tests completed.





Re: [PR] Initial take at adding JMH micro-benchmarks [lucene]

2023-10-12 Thread via GitHub


benwtrent commented on PR #12663:
URL: https://github.com/apache/lucene/pull/12663#issuecomment-1759607611

   @dweiss this is awesome!!! Thank you so much!





Re: [PR] Extract the hnsw graph merging from being part of the vector writer [lucene]

2023-10-12 Thread via GitHub


benwtrent commented on code in PR #12657:
URL: https://github.com/apache/lucene/pull/12657#discussion_r1356818110


##
lucene/core/src/java/org/apache/lucene/codecs/lucene95/IncrementalHnswGraphMerger.java:
##
@@ -0,0 +1,226 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene95;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.codecs.HnswGraphProvider;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat;
+import org.apache.lucene.index.ByteVectorValues;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.FloatVectorValues;
+import org.apache.lucene.index.MergeState;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.hnsw.HnswGraph;
+import org.apache.lucene.util.hnsw.HnswGraphBuilder;
+import org.apache.lucene.util.hnsw.InitializedHnswGraphBuilder;
+import org.apache.lucene.util.hnsw.RandomVectorScorerSupplier;
+
+/**
+ * This selects the biggest Hnsw graph from the provided merge state and 
initializes a new
+ * HnswGraphBuilder with that graph as a starting point.
+ */
+public class IncrementalHnswGraphMerger {
+
+  private final MergeState mergeState;
+  private final FieldInfo fieldInfo;
+
+  /**
+   * @param mergeState MergeState containing the readers to merge
+   * @param fieldInfo FieldInfo for the field being merged
+   */
+  public IncrementalHnswGraphMerger(MergeState mergeState, FieldInfo fieldInfo) {
+this.mergeState = mergeState;
+this.fieldInfo = fieldInfo;
+  }
+
+  /**
+   * Selects the biggest Hnsw graph from the provided merge state and initializes a new
+   * HnswGraphBuilder with that graph as a starting point.
+   *
+   * @param scorerSupplier ScorerSupplier to use for scoring vectors
+   * @param M The number of connections to allow per node
+   * @param beamWidth The number of nodes to consider when searching for the nearest neighbor
+   * @return HnswGraphBuilder initialized with the biggest graph from the merge state
+   * @throws IOException If an error occurs while reading from the merge state
+   */
+  public HnswGraphBuilder build(RandomVectorScorerSupplier scorerSupplier, int M, int beamWidth)

Review Comment:
   I guess a "fully merged graph" isn't possible, as we will always need to account for vectors that couldn't be read from a graph (i.e. previous Lucene 95).
   
   So, accepting graphs and returning a builder seems like the best API. I will iterate on how it looks.
   
   I don't expect this to be perfect and we may have to change it when we actually add a new merger kind.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Avoid use docsSeen in BKDWriter [lucene]

2023-10-12 Thread via GitHub


jpountz commented on code in PR #12658:
URL: https://github.com/apache/lucene/pull/12658#discussion_r1356844946


##
lucene/core/src/java/org/apache/lucene/codecs/PointsWriter.java:
##
@@ -202,7 +202,7 @@ public long size() {
 
   @Override
   public int getDocCount() {
-throw new UnsupportedOperationException();
+return -1;

Review Comment:
   I don't like that we're returning -1 here since it's not supposed to be a 
legal return value for `getDocCount()`. Can we do it otherwise somehow?



##
lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java:
##
@@ -1248,7 +1259,8 @@ private void writeIndex(
 metaOut.writeBytes(maxPackedValue, 0, config.packedIndexBytesLength);
 
 metaOut.writeVLong(pointCount);
-metaOut.writeVInt(docsSeen.cardinality());
+assert docsSeen != null || docCount != -1;

Review Comment:
   I believe that the assertion is stronger than that, as we can't have 
docsSeen != null and docCount != -1?
   
   ```suggestion
   assert docsSeen != null ^ docCount != -1;
   ```
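
For illustration, the `||` form accepts the illegal state where both mechanisms are active, while the suggested `^` rejects it. A standalone sketch (the names mirror the review thread; these are not the actual `BKDWriter` fields):

```java
// Minimal illustration of the exclusive-or invariant: exactly one of the two
// bookkeeping mechanisms (docsSeen bitset vs. precomputed docCount) may be active.
public class XorInvariantDemo {
  // '^' on booleans is exclusive or: true iff exactly one operand is true.
  static boolean invariantHolds(Object docsSeen, int docCount) {
    return (docsSeen != null) ^ (docCount != -1);
  }

  public static void main(String[] args) {
    System.out.println(invariantHolds(new Object(), -1)); // docsSeen path: true
    System.out.println(invariantHolds(null, 42));         // docCount path: true
    System.out.println(invariantHolds(new Object(), 42)); // both set: false
    System.out.println(invariantHolds(null, -1));         // neither set: false
  }
}
```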



##
lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java:
##
@@ -519,9 +526,8 @@ private Runnable writeFieldNDims(
 // compute the min/max for this slice
 computePackedValueBounds(
 values, 0, Math.toIntExact(pointCount), minPackedValue, 
maxPackedValue, scratchBytesRef1);
-for (int i = 0; i < Math.toIntExact(pointCount); ++i) {
-  docsSeen.set(values.getDocID(i));
-}
+// docCount has already been set by {@link BKDWriter#writeField}
+assert docCount != -1;

Review Comment:
   Should we require that docCount cannot be `-1` in writeField then?






Re: [I] Upgrade to Gradle 8.4 [lucene]

2023-10-12 Thread via GitHub


dweiss commented on issue #12655:
URL: https://github.com/apache/lucene/issues/12655#issuecomment-1759671663

   > By using a each closure the jdkVersion is private to the closure and the 
build-time evaluation has its own jdkVersion variable (which is a constant).
   
   Ah, thanks for clarifying, Uwe. I only looked at the diff and suspected the GString was to blame - overall it's the same thing: the previous code referenced a local variable (not the current loop's value, but the variable itself). Switching to 'each' makes the variable a closure argument, so it's indeed constant... It is sometimes tricky because if you nest another loop/closure, it can still observe the outer variable's current value. It's a virtue and a pain, depending on when you use it...
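   
   A stdlib-only Java analogue of the same capture behavior (the `jdkVersion` name merely mirrors the build script; this is an illustrative sketch, not Lucene code):
   
```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntSupplier;

// A closure that reads a mutable variable observes its value at call time; a
// per-iteration local (like a Groovy 'each' closure parameter) freezes the value
// at creation time. Java forbids capturing mutable locals directly, so a
// one-element array stands in for the mutable script-level variable.
public class CaptureDemo {
  // Captures one shared mutable slot: every supplier later sees the final value.
  static List<Integer> captureSharedVariable() {
    int[] jdkVersion = new int[1];
    List<IntSupplier> lazy = new ArrayList<>();
    for (int v : new int[] {17, 19, 21}) {
      jdkVersion[0] = v;
      lazy.add(() -> jdkVersion[0]); // captures the variable, not the value
    }
    List<Integer> out = new ArrayList<>();
    lazy.forEach(s -> out.add(s.getAsInt()));
    return out; // [21, 21, 21]
  }

  // Captures a fresh effectively-final local per iteration: each value is constant.
  static List<Integer> capturePerIteration() {
    List<IntSupplier> lazy = new ArrayList<>();
    for (int v : new int[] {17, 19, 21}) {
      final int current = v;
      lazy.add(() -> current);
    }
    List<Integer> out = new ArrayList<>();
    lazy.forEach(s -> out.add(s.getAsInt()));
    return out; // [17, 19, 21]
  }

  public static void main(String[] args) {
    System.out.println(captureSharedVariable());
    System.out.println(capturePerIteration());
  }
}
```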
   
   For what it's worth, I've tried switching a few builds to Kotlin... but I 
wasn't convinced. It's nice to have a type checker but there were many things 
that required more verbose constructs to achieve.
   
   





Re: [I] Upgrade to Gradle 8.4 [lucene]

2023-10-12 Thread via GitHub


dweiss commented on issue #12655:
URL: https://github.com/apache/lucene/issues/12655#issuecomment-1759674565

   No worries, @risdenk - it's a big effort already, there'll always be 
problems, especially with major version upgrades. 





Re: [PR] Explicitly return needStats flag in TermStates [lucene]

2023-10-12 Thread via GitHub


jpountz commented on PR #12638:
URL: https://github.com/apache/lucene/pull/12638#issuecomment-1759681148

   @yugushihuang Your explanation suggests that a `TermStates` could be created 
somewhere and then used in a different context. But this is not how it's 
expected to be used in general, the expectation is that the same caller owns 
both the creation and then the consumption of the `TermStates` object, so it 
knows exactly how it's been created. An example is `TermWeight`: it creates the 
`TermStates` object itself based on the `ScoreMode` so it knows whether the 
`TermStates` object has stats or not later on when it consumes it.
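   
   A generic, stdlib-only sketch of that ownership pattern (illustrative names only; this is not Lucene's actual `TermStates`/`TermWeight` API):
   
```java
// The component that builds the state object also consumes it, so no flag needs
// to be read back later: the owner already knows how the object was configured.
public class OwnershipDemo {
  record TermStatesSketch(boolean hasStats, long docFreq) {}

  static final class TermWeightSketch {
    private final TermStatesSketch states;

    TermWeightSketch(boolean needsScores) {
      // creation: the owner decides up front whether stats are gathered
      this.states = new TermStatesSketch(needsScores, needsScores ? 100 : 0);
    }

    long docFreqOrDefault() {
      // consumption: the same owner already knows how 'states' was built
      return states.hasStats() ? states.docFreq() : -1;
    }
  }

  public static void main(String[] args) {
    System.out.println(new TermWeightSketch(true).docFreqOrDefault());  // 100
    System.out.println(new TermWeightSketch(false).docFreqOrDefault()); // -1
  }
}
```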





Re: [PR] Ability to compute vector similarity scores with DoubleValuesSource [lucene]

2023-10-12 Thread via GitHub


shubhamvishu commented on PR #12548:
URL: https://github.com/apache/lucene/pull/12548#issuecomment-1759689286

   Thanks for reviewing! 
   
   @benwtrent I have added a CHANGES.txt entry now - could you help me merge this if all looks good?





Re: [PR] Ability to compute vector similarity scores with DoubleValuesSource [lucene]

2023-10-12 Thread via GitHub


benwtrent commented on PR #12548:
URL: https://github.com/apache/lucene/pull/12548#issuecomment-1759701427

   @shubhamvishu running CI :). I will see about merging and such soon.





Re: [PR] Initial take at adding JMH micro-benchmarks [lucene]

2023-10-12 Thread via GitHub


benwtrent commented on PR #12663:
URL: https://github.com/apache/lucene/pull/12663#issuecomment-1759752950

   @dweiss one suggestion I have is adding a `README` for 
`lucene/benchmark-jmh` on how to run the benchmarks and add new ones. Your 
description on how to use it in this PR seems adequate.
   
   I tested this locally and was able to follow your directions and get the 
benchmarks to run without issue.





Re: [PR] Initial take at adding JMH micro-benchmarks [lucene]

2023-10-12 Thread via GitHub


dweiss commented on PR #12663:
URL: https://github.com/apache/lucene/pull/12663#issuecomment-1759758762

   I enabled forbiddenapis to the extent possible, to make Uwe happier. ;) I 
like those flame-graphs that Solr has, etc., but these can come later. I'll add 
a simple ```gradlew :helpJmh``` target similar to everything else.





Re: [PR] Cleanup flushing logic in DocumentsWriter [lucene]

2023-10-12 Thread via GitHub


s1monw merged PR #12647:
URL: https://github.com/apache/lucene/pull/12647





Re: [PR] Initial take at adding JMH micro-benchmarks [lucene]

2023-10-12 Thread via GitHub


dweiss commented on PR #12663:
URL: https://github.com/apache/lucene/pull/12663#issuecomment-1759777296

   @benwtrent Can you take a look? If there's any wording you think would work 
better, please do change it. Explaining how JMH works is perhaps beyond the 
scope of the readme file... ;)





Re: [PR] Initial take at adding JMH micro-benchmarks [lucene]

2023-10-12 Thread via GitHub


benwtrent commented on PR #12663:
URL: https://github.com/apache/lucene/pull/12663#issuecomment-1759795868

   Looks good @dweiss. I added a line with an example of running a single benchmark or a suite of benchmarks.





Re: [PR] Initial take at adding JMH micro-benchmarks [lucene]

2023-10-12 Thread via GitHub


benwtrent commented on code in PR #12663:
URL: https://github.com/apache/lucene/pull/12663#discussion_r1356970803


##
lucene/benchmark-jmh/README.txt:
##
@@ -0,0 +1,21 @@
+The :lucene:benchmark-jmh module can be used to compile
+and execute JMH (https://github.com/openjdk/jmh) micro-benchmarks.
+
+Look at existing classes and JMH documentation for inspiration on how
+to write good micro-benchmarks.
+
+To compile the project and prepare JMH launcher, run:
+
+gradlew :lucene:benchmark-jmh:assemble
+
+The above target will display the exact commands to execute JMH from the
+command line, for example:
+
+java --module-path lucene\benchmark-jmh\build\benchmarks --module org.apache.lucene.benchmark.jmh
+
+You can pass any JMH options to the above command, for example:
+
+  -h  displays verbose help for all options
+  -l  list available benchmarks
+  -lp list benchmarks that pass the filter and their parameters
+  regexp  execute all benchmarks matching regexp

Review Comment:
   ```suggestion
  regexp  execute all benchmarks matching regexp

   Here is an example running a single benchmark:
   
   java --module-path lucene\benchmark-jmh\build\benchmarks --module org.apache.lucene.benchmark.jmh org.apache.lucene.benchmark.jmh.BinaryCosineBenchmark.cosineDistanceNew
   
   Or running all of BinaryCosineBenchmark:
   
   java --module-path lucene\benchmark-jmh\build\benchmarks --module org.apache.lucene.benchmark.jmh org.apache.lucene.benchmark.jmh.BinaryCosineBenchmark
   ```






Re: [PR] Add interface VectorValues to be implemented by [Float/Byte]VectorValues [lucene]

2023-10-12 Thread via GitHub


shubhamvishu commented on PR #12636:
URL: https://github.com/apache/lucene/pull/12636#issuecomment-1759814356

   @benwtrent Do you mean we earlier had a similar implementation or such an interface?
   
   > we decided to switch it as a common interface required either
   
   Which interface are you referring to - `RandomAccessVectorValues`, since it uses generics to avoid duplication?
   
   > If we are worried about internal duplication there may be better options around using a better abstraction for scoring (see RandomVectorScorer) and writing to a flat file (on flush and merges)
   
   You mean we could do something similar to `RandomVectorScorer` etc., as has been done for `RandomAccessVectorValues` (i.e. use generics)?
   
   So are you suggesting that instead of making `[Float/Byte]VectorValues` use this `VectorValues` interface, we should only try to avoid duplication at the places you pointed out?





Re: [PR] Ability to compute vector similarity scores with DoubleValuesSource [lucene]

2023-10-12 Thread via GitHub


msokolov commented on code in PR #12548:
URL: https://github.com/apache/lucene/pull/12548#discussion_r1357027508


##
lucene/core/src/java/org/apache/lucene/search/DoubleValuesSource.java:
##
@@ -172,6 +173,52 @@ public LongValuesSource rewrite(IndexSearcher searcher) 
throws IOException {
 }
   }
 
+  /**
+   * Returns a DoubleValues instance for computing the vector similarity score 
per document against
+   * the byte query vector
+   *
+   * @param ctx the context for which to return the DoubleValues
+   * @param queryVector byte query vector
+   * @param vectorField knn byte field name
+   * @return DoubleValues instance
+   * @throws IOException if an {@link IOException} occurs
+   */
+  public static DoubleValues similarityToQueryVector(

Review Comment:
   Let's move these to VectorSimilaritySource? This file seems to be a dumping ground for all the things; let's not pile on.



##
lucene/core/src/java/org/apache/lucene/search/ByteVectorSimilarityValuesSource.java:
##
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Objects;
+import org.apache.lucene.index.ByteVectorValues;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.VectorSimilarityFunction;
+
+/**
+ * A {@link DoubleValuesSource} which computes the vector similarity scores 
between the query vector

Review Comment:
   nit: my English profs always said to use "that" in place of "which" when 
possible






Re: [PR] Ability to compute vector similarity scores with DoubleValuesSource [lucene]

2023-10-12 Thread via GitHub


msokolov commented on PR #12548:
URL: https://github.com/apache/lucene/pull/12548#issuecomment-1759887696

   oh I missed that @benwtrent had approved -- it's OK w/me to merge as-is, 
just had some idea about organizing the static methods into a different class...





Re: [PR] Avoid use docsSeen in BKDWriter [lucene]

2023-10-12 Thread via GitHub


easyice commented on code in PR #12658:
URL: https://github.com/apache/lucene/pull/12658#discussion_r1357032536


##
lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java:
##
@@ -1248,7 +1259,8 @@ private void writeIndex(
 metaOut.writeBytes(maxPackedValue, 0, config.packedIndexBytesLength);
 
 metaOut.writeVLong(pointCount);
-metaOut.writeVInt(docsSeen.cardinality());
+assert docsSeen != null || docCount != -1;

Review Comment:
   Wow, it is wonderful, thanks for your suggestion!






Re: [PR] Ability to compute vector similarity scores with DoubleValuesSource [lucene]

2023-10-12 Thread via GitHub


shubhamvishu commented on PR #12548:
URL: https://github.com/apache/lucene/pull/12548#issuecomment-1759892123

   > oh I missed that @benwtrent had approved -- it's OK w/me to merge as-is, 
just had some idea about organizing the static methods into a different class...
   
   Thanks @msokolov! I'll open up a follow-up PR to address your ideas, if that sounds good?





Re: [PR] Avoid use docsSeen in BKDWriter [lucene]

2023-10-12 Thread via GitHub


easyice commented on code in PR #12658:
URL: https://github.com/apache/lucene/pull/12658#discussion_r1357049989


##
lucene/core/src/java/org/apache/lucene/codecs/PointsWriter.java:
##
@@ -202,7 +202,7 @@ public long size() {
 
   @Override
   public int getDocCount() {
-throw new UnsupportedOperationException();
+return -1;

Review Comment:
   Thank you for catching it. Maybe we can add a function like `hasDocCount` to indicate that there is a valid `docCount` value, or, the other way, catch the `UnsupportedOperationException` and then decide whether to use `docCount`. I prefer the first one - what do you think?
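   
   A minimal sketch of what the `hasDocCount` capability probe could look like (hypothetical names; not the actual `PointsWriter`/`PointValues` API). A probe method lets callers avoid treating -1 as an in-band sentinel or catching the exception for control flow:
   
```java
// Callers consult hasDocCount() before calling getDocCount(), so implementations
// without a doc count never need to return an illegal sentinel value.
public class DocCountDemo {
  abstract static class PointsSketch {
    /** Whether getDocCount() returns a meaningful value for this implementation. */
    boolean hasDocCount() { return true; }
    abstract int getDocCount();
  }

  static final PointsSketch withCount = new PointsSketch() {
    @Override int getDocCount() { return 42; }
  };

  static final PointsSketch withoutCount = new PointsSketch() {
    @Override boolean hasDocCount() { return false; }
    @Override int getDocCount() { throw new UnsupportedOperationException(); }
  };

  static int docCountOr(PointsSketch p, int fallback) {
    return p.hasDocCount() ? p.getDocCount() : fallback;
  }

  public static void main(String[] args) {
    System.out.println(docCountOr(withCount, -1));    // 42
    System.out.println(docCountOr(withoutCount, -1)); // -1
  }
}
```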







Re: [I] [DISCUSS] Should there be a threshold-based vector search API? [lucene]

2023-10-12 Thread via GitHub


msokolov commented on issue #12579:
URL: https://github.com/apache/lucene/issues/12579#issuecomment-1759925268

   Interesting - so this is kind of like a noisy radius search in high 
dimensions? It makes sense to me intuitively since we don't generally expect 
searches to have the same number of results. If I search for `asd89HH8!@` I may 
get only very low-quality results, perhaps none that would exceed the threshold 
/ fall in the radius boundary whereas if I search for something that shares 
terms in common with many documents, I would expect vector-based search to find 
them all. It's kind of weird from an IR perspective that we just pick some 
query-independent K. I wonder if there is some precedent for this in semantic 
search literature?
   
   I found one system that supports a threshold: 
https://typesense.org/docs/0.25.0/api/vector-search.html#distance-threshold but 
I don't think we do support that yet, do we, in KnnVector*Query? Perhaps we 
ought to lead with that in order to introduce the utility of radius-searching, 
since you can simulate this with a threshold + large K, right?
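   
   A stdlib-only sketch of that "threshold + large K" simulation (illustrative types; not a Lucene API): run a top-K search with a deliberately generous K, then keep only the hits whose similarity meets the threshold.
   
```java
import java.util.ArrayList;
import java.util.List;

// Post-filter top-K results by a similarity threshold to approximate a radius search.
public class ThresholdDemo {
  record ScoredDoc(int doc, float score) {}

  // Keep only hits whose similarity score is inside the "radius" (>= threshold).
  static List<ScoredDoc> aboveThreshold(List<ScoredDoc> topK, float threshold) {
    List<ScoredDoc> out = new ArrayList<>();
    for (ScoredDoc sd : topK) {
      if (sd.score() >= threshold) {
        out.add(sd);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Pretend these came back from a KNN search with a large K.
    List<ScoredDoc> topK = List.of(
        new ScoredDoc(1, 0.95f), new ScoredDoc(7, 0.62f), new ScoredDoc(3, 0.30f));
    System.out.println(aboveThreshold(topK, 0.5f).size()); // 2
  }
}
```

The approximation only fails when more than K true neighbors fall inside the radius, which is why K must be chosen generously.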





Re: [PR] Avoid use docsSeen in BKDWriter [lucene]

2023-10-12 Thread via GitHub


easyice commented on code in PR #12658:
URL: https://github.com/apache/lucene/pull/12658#discussion_r1357058015


##
lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java:
##
@@ -519,9 +526,8 @@ private Runnable writeFieldNDims(
 // compute the min/max for this slice
 computePackedValueBounds(
 values, 0, Math.toIntExact(pointCount), minPackedValue, 
maxPackedValue, scratchBytesRef1);
-for (int i = 0; i < Math.toIntExact(pointCount); ++i) {
-  docsSeen.set(values.getDocID(i));
-}
+// docCount has already been set by {@link BKDWriter#writeField}
+assert docCount != -1;

Review Comment:
   Yes, you are right, we can add the assert in `writeField` too; this is more rigorous. Unless the merges can also avoid using `docsSeen`, there might be some differences.






Re: [PR] Ability to compute vector similarity scores with DoubleValuesSource [lucene]

2023-10-12 Thread via GitHub


shubhamvishu commented on code in PR #12548:
URL: https://github.com/apache/lucene/pull/12548#discussion_r1357050474


##
lucene/core/src/java/org/apache/lucene/search/DoubleValuesSource.java:
##
@@ -172,6 +173,52 @@ public LongValuesSource rewrite(IndexSearcher searcher) 
throws IOException {
 }
   }
 
+  /**
+   * Returns a DoubleValues instance for computing the vector similarity score 
per document against
+   * the byte query vector
+   *
+   * @param ctx the context for which to return the DoubleValues
+   * @param queryVector byte query vector
+   * @param vectorField knn byte field name
+   * @return DoubleValues instance
+   * @throws IOException if an {@link IOException} occurs
+   */
+  public static DoubleValues similarityToQueryVector(

Review Comment:
   Sure. I see different DVS use cases implemented as static classes here, making this less readable. How about we move those static classes and functions (like the ones you pointed out) into a new class, `DoubleValuesSourceUtil`, maybe? (The only thing is that those classes are currently private.) I'll open a follow-up PR to address that. 






Re: [PR] Ensure LeafCollector#finish is only called once on the main collector during drill-sideways [lucene]

2023-10-12 Thread via GitHub


gsmiller commented on code in PR #12642:
URL: https://github.com/apache/lucene/pull/12642#discussion_r1357072892


##
lucene/facet/src/test/org/apache/lucene/facet/TestDrillSideways.java:
##
@@ -316,6 +316,58 @@ public void testBasic() throws Exception {
 IOUtils.close(searcher.getIndexReader(), taxoReader, taxoWriter, dir, 
taxoDir);
   }
 
+  public void testCollectionTerminated() throws Exception {
+try (Directory dir = newDirectory();
+Directory taxoDir = newDirectory();
+RandomIndexWriter w = new RandomIndexWriter(random(), dir);
+DirectoryTaxonomyWriter taxoW =
+new DirectoryTaxonomyWriter(taxoDir, 
IndexWriterConfig.OpenMode.CREATE)) {
+  FacetsConfig facetsConfig = new FacetsConfig();
+
+  Document d = new Document();
+  d.add(new FacetField("foo", "bar"));
+  w.addDocument(facetsConfig.build(taxoW, d));
+
+  try (IndexReader r = w.getReader();
+  TaxonomyReader taxoR = new DirectoryTaxonomyReader(taxoW)) {
+IndexSearcher searcher = new IndexSearcher(r);
+
+Query baseQuery = new MatchAllDocsQuery();
+Query dimQ = new TermQuery(new Term("foo", "bar"));
+
+DrillDownQuery ddq = new DrillDownQuery(facetsConfig, baseQuery);
+ddq.add("foo", dimQ);
+DrillSideways drillSideways = new DrillSideways(searcher, 
facetsConfig, taxoR);
+
+CollectorManager cm =
+new CollectorManager<>() {
+  @Override
+  public Collector newCollector() throws IOException {
+return new Collector() {

Review Comment:
   Yeah, I'm kind of conflicted here to be honest. I'm hesitant to expose 
something like `hasFinishedCollectingPreviousLeaf` for this one testing 
use-case. Another option would be to somehow make `AssertingBulkScorer` "aware" 
of when it's allowed to score ranges of documents. The fundamental issue here 
is that we have this drill-sideways bulk scorer, `DrillSidewaysScorer`, that 
_must_ score all docs at once and cannot score ranges. This really just exposes 
some fundamental design friction with drill-sideways, but I don't want to 
spiral this change into something much bigger than it needs to be (i.e., I 
acknowledge that we might benefit from rethinking how we expose drill-sideways 
functionality, but that's a much bigger task).
   
   So... things we can do here without turning this into much more work:
   1. Expose `hasFinishedCollectingPreviousLeaf` as you suggest
   2. Skip using this "asserting" family of classes and just implement our own 
wrapper classes specific to this test that are meant to just test the calling 
of `finish`
   3. Add a new method to the `BulkScorer` API that communicates whether-or-not 
doc range scoring is legal, then make `AssertingBulkScorer` aware of that new 
method and have it check whether-or-not it can try scoring a range of docs
   4. Have `AssertingBulkScorer` check the instance type of the bulk scorer 
it's wrapping, which would require exposing the definition of 
`DrillSidewaysScorer` beyond the facets package (i.e., make it public)
   
   I don't like #3; adding a new method to the `BulkScorer` definition feels 
like overkill to solve this. I also don't really like #1 or #4 since they both 
require visibility modifications just for this test, but I think #1 is better 
than #4 since it only impacts testing code and not production code. I also 
don't love #2 since we miss out on other nice checks.
   
   I think I'm convinced that your suggestion of exposing 
`hasFinishedCollectingPreviousLeaf` is the least of all evils here. I'll give 
that a try and let's see what we think.
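   The contract under discussion can be sketched in isolation. The following is 
a minimal, self-contained illustration of the "finish() exactly once per leaf" 
check that the asserting wrappers enforce; the class and method names here are 
hypothetical stand-ins, not Lucene's actual API:

```java
// Illustrative only: a plain-Java sketch of the "finish() exactly once per
// leaf" contract. Names are hypothetical, not Lucene's actual classes.
final class FinishOnceLeafCollector {
  private boolean finished = false;

  void collect(int doc) {
    if (finished) {
      throw new IllegalStateException("collect() called after finish()");
    }
    // a real collector would record the hit here
  }

  void finish() {
    if (finished) {
      throw new IllegalStateException("finish() called more than once on this leaf");
    }
    finished = true;
  }
}
```

   A wrapper like this fails fast if a bulk scorer drives the same leaf 
collector past its lifecycle, which is the kind of bug the test wants to catch.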






Re: [PR] Avoid use docsSeen in BKDWriter [lucene]

2023-10-12 Thread via GitHub


easyice commented on PR #12658:
URL: https://github.com/apache/lucene/pull/12658#issuecomment-1759980060

   > I don't like when merges allocate O(maxDoc) memory so I'm keen to fixing 
this. The idea looks correct to me, I wonder if we can make it a bit more 
robust?
   
   Okay, I will take a look at this. Maybe we can apply the optimization only 
to segment merges without deleted docs?





[PR] 8.11 deps [lucene-solr]

2023-10-12 Thread via GitHub


risdenk opened a new pull request, #2681:
URL: https://github.com/apache/lucene-solr/pull/2681

   (no comment)





Re: [PR] Ensure LeafCollector#finish is only called once on the main collector during drill-sideways [lucene]

2023-10-12 Thread via GitHub


gsmiller commented on code in PR #12642:
URL: https://github.com/apache/lucene/pull/12642#discussion_r1357145658


##
lucene/facet/src/test/org/apache/lucene/facet/TestDrillSideways.java:
##
@@ -316,6 +317,68 @@ public void testBasic() throws Exception {
 IOUtils.close(searcher.getIndexReader(), taxoReader, taxoWriter, dir, 
taxoDir);
   }
 
+  public void testLeafCollectorSingleFinishCall() throws Exception {
+try (Directory dir = newDirectory();
+Directory taxoDir = newDirectory();
+RandomIndexWriter w = new RandomIndexWriter(random(), dir);
+DirectoryTaxonomyWriter taxoW =
+new DirectoryTaxonomyWriter(taxoDir, 
IndexWriterConfig.OpenMode.CREATE)) {
+  FacetsConfig facetsConfig = new FacetsConfig();
+
+  Document d = new Document();
+  d.add(new FacetField("foo", "bar"));
+  w.addDocument(facetsConfig.build(taxoW, d));
+
+  try (IndexReader r = w.getReader();
+  TaxonomyReader taxoR = new DirectoryTaxonomyReader(taxoW)) {
+IndexSearcher searcher = new IndexSearcher(r);
+
+Query baseQuery = new MatchAllDocsQuery();
+Query dimQ = new TermQuery(new Term("foo", "bar"));
+
+DrillDownQuery ddq = new DrillDownQuery(facetsConfig, baseQuery);
+ddq.add("foo", dimQ);
+DrillSideways drillSideways = new DrillSideways(searcher, 
facetsConfig, taxoR);
+
+CollectorManager cm =
+new CollectorManager<>() {
+  @Override
+  public Collector newCollector() throws IOException {
+// We don't need the collector to actually do anything; we 
just care about the logic
+// in the AssertingCollector / AssertingLeafCollector (and 
AssertingIndexSearcher)
+// to make sure #finish is called exactly once on the leaf 
collector:
+return AssertingCollector.wrap(
+new Collector() {
+  @Override
+  public LeafCollector getLeafCollector(LeafReaderContext 
context)
+  throws IOException {
+return new LeafCollector() {

Review Comment:
   Yeah fair. I think I'll actually use a `CollectorManager` that's already 
defined for drill-sideways testing to simplify even further. We don't really 
need to collect anything for this test, but it's easier to just use it and not 
do all this custom setup.






Re: [PR] Ensure LeafCollector#finish is only called once on the main collector during drill-sideways [lucene]

2023-10-12 Thread via GitHub


gf2121 commented on code in PR #12642:
URL: https://github.com/apache/lucene/pull/12642#discussion_r1357151214


##
lucene/facet/src/test/org/apache/lucene/facet/TestDrillSideways.java:
##
@@ -316,6 +316,58 @@ public void testBasic() throws Exception {
 IOUtils.close(searcher.getIndexReader(), taxoReader, taxoWriter, dir, 
taxoDir);
   }
 
+  public void testCollectionTerminated() throws Exception {
+try (Directory dir = newDirectory();
+Directory taxoDir = newDirectory();
+RandomIndexWriter w = new RandomIndexWriter(random(), dir);
+DirectoryTaxonomyWriter taxoW =
+new DirectoryTaxonomyWriter(taxoDir, 
IndexWriterConfig.OpenMode.CREATE)) {
+  FacetsConfig facetsConfig = new FacetsConfig();
+
+  Document d = new Document();
+  d.add(new FacetField("foo", "bar"));
+  w.addDocument(facetsConfig.build(taxoW, d));
+
+  try (IndexReader r = w.getReader();
+  TaxonomyReader taxoR = new DirectoryTaxonomyReader(taxoW)) {
+IndexSearcher searcher = new IndexSearcher(r);
+
+Query baseQuery = new MatchAllDocsQuery();
+Query dimQ = new TermQuery(new Term("foo", "bar"));
+
+DrillDownQuery ddq = new DrillDownQuery(facetsConfig, baseQuery);
+ddq.add("foo", dimQ);
+DrillSideways drillSideways = new DrillSideways(searcher, 
facetsConfig, taxoR);
+
+CollectorManager cm =
+new CollectorManager<>() {
+  @Override
+  public Collector newCollector() throws IOException {
+return new Collector() {

Review Comment:
   Thanks @gsmiller , conflicted +1 :)
   
   I think the root cause here is that `AssertingCollector` cannot be 
perfectly self-contained: it requires `AssertingIndexSearcher` to do an 
additional check. As we are considering making `AssertingCollector` public, we 
should at least expose something that allows users who want to use a public 
`AssertingCollector` directly, without `AssertingIndexSearcher`, to do that 
additional check. Maybe there's a more elegant way than exposing 
`hasFinishedCollectingPreviousLeaf` directly, but I'm not sure what it is.









Re: [PR] Ability to compute vector similarity scores with DoubleValuesSource [lucene]

2023-10-12 Thread via GitHub


benwtrent commented on PR #12548:
URL: https://github.com/apache/lucene/pull/12548#issuecomment-1760058838

   These are good ideas @msokolov , @shubhamvishu a follow up PR is most 
welcome.





Re: [I] Add the ability to compute vector similarity scores with the new ValuesSource API [lucene]

2023-10-12 Thread via GitHub


benwtrent closed issue #12394: Add the ability to compute vector similarity 
scores with the new ValuesSource API
URL: https://github.com/apache/lucene/issues/12394





Re: [PR] Ability to compute vector similarity scores with DoubleValuesSource [lucene]

2023-10-12 Thread via GitHub


benwtrent merged PR #12548:
URL: https://github.com/apache/lucene/pull/12548





Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-12 Thread via GitHub


mayya-sharipova commented on code in PR #12582:
URL: https://github.com/apache/lucene/pull/12582#discussion_r1357182135


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsFormat.java:
##
@@ -0,0 +1,209 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene99;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.KnnVectorsFormat;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.hnsw.HnswGraph;
+
+/**
+ * Lucene 9.5 vector format, which encodes numeric vector values and an 
optional associated graph
+ * connecting the documents having values. The graph is used to power HNSW 
search. The format
+ * consists of three files:
+ *
+ * .vec (vector data) file
+ *
+ * For each field:
+ *
+ * 
+ *   Vector data ordered by field, document ordinal, and vector dimension. 
When the
+ *   vectorEncoding is BYTE, each sample is stored as a single byte. When 
it is FLOAT32, each
+ *   sample is stored as an IEEE float in little-endian byte order.
+ *   DocIds encoded by {@link IndexedDISI#writeBitSet(DocIdSetIterator, 
IndexOutput, byte)},
+ *   note that only in sparse case
+ *   OrdToDoc was encoded by {@link 
org.apache.lucene.util.packed.DirectMonotonicWriter}, note
+ *   that only in sparse case
+ * 
+ *
+ * .vex (vector index)
+ *
+ * Stores graphs connecting the documents for each field organized as a 
list of nodes' neighbours
+ * as following:
+ *
+ * 
+ *   For each level:
+ *   
+ * For each node:
+ * 
+ *   [vint] the number of neighbor nodes
+ *   array[vint] the delta encoded neighbor ordinals
+ * 
+ *   
+ *   After all levels are encoded memory offsets for each node's neighbor 
nodes encoded by
+ *   {@link org.apache.lucene.util.packed.DirectMonotonicWriter} are 
appened to the end of the
+ *   file.
+ * 
+ *
+ * .vem (vector metadata) file
+ *
+ * For each field:
+ *
+ * 
+ *   [int32] field number
+ *   [int32] vector similarity function ordinal
+ *   [vlong] offset to this field's vectors in the .vec file
+ *   [vlong] length of this field's vectors, in bytes
+ *   [vlong] offset to this field's index in the .vex file
+ *   [vlong] length of this field's index data, in bytes
+ *   [vint] dimension of this field's vectors
+ *   [int] the number of documents having values for this field
+ *   [int8] if equals to -1, dense – all documents have values for 
a field. If equals to
+ *   0, sparse – some documents missing values.
+ *   DocIds were encoded by {@link 
IndexedDISI#writeBitSet(DocIdSetIterator, IndexOutput, byte)}
+ *   OrdToDoc was encoded by {@link 
org.apache.lucene.util.packed.DirectMonotonicWriter}, note
+ *   that only in sparse case
+ *   [vint] the maximum number of connections (neigbours) that each 
node can have
+ *   [vint] number of levels in the graph
+ *   Graph nodes by level. For each level
+ *   
+ * [vint] the number of nodes on this level
+ * array[vint] for levels greater than 0 list of nodes on 
this level, stored as
+ * the level 0th delta encoded nodes' ordinals.
+ *   
+ * 
+ *
+ * @lucene.experimental
+ */
+public final class Lucene99HnswVectorsFormat extends KnnVectorsFormat {

Review Comment:
   This would be helpful for the review to get a general idea how the format is 
organized.




Re: [PR] Extract the hnsw graph merging from being part of the vector writer [lucene]

2023-10-12 Thread via GitHub


benwtrent commented on PR #12657:
URL: https://github.com/apache/lucene/pull/12657#issuecomment-1760088558

   @zhaih I updated the API a bit. This is more like I was thinking. Having a 
builder that accepts readers, doc maps, etc. And then can build with the final 
merge state. 





Re: [PR] Ability to compute vector similarity scores with DoubleValuesSource [lucene]

2023-10-12 Thread via GitHub


msokolov commented on code in PR #12548:
URL: https://github.com/apache/lucene/pull/12548#discussion_r1357190877


##
lucene/core/src/java/org/apache/lucene/search/DoubleValuesSource.java:
##
@@ -172,6 +173,52 @@ public LongValuesSource rewrite(IndexSearcher searcher) 
throws IOException {
 }
   }
 
+  /**
+   * Returns a DoubleValues instance for computing the vector similarity score 
per document against
+   * the byte query vector
+   *
+   * @param ctx the context for which to return the DoubleValues
+   * @param queryVector byte query vector
+   * @param vectorField knn byte field name
+   * @return DoubleValues instance
+   * @throws IOException if an {@link IOException} occurs
+   */
+  public static DoubleValues similarityToQueryVector(

Review Comment:
   well I don't think we should rename *all* the existing static methods here - 
that would just cause confusion. Maybe that's not what you were saying? 
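   For intuition, the per-document computation these similarity sources 
surface can be sketched standalone. This sketch assumes dot-product similarity 
and a plain in-memory array of stored vectors; the real API reads vectors from 
the index, and `SimilarityValuesSketch` / `dotProductScores` are hypothetical 
names, not part of Lucene:

```java
// Illustrative sketch only: per-document similarity against a fixed query
// vector, i.e. the computation a per-document DoubleValues would surface.
// The stored-vector lookup here is a plain array; Lucene reads it per segment.
final class SimilarityValuesSketch {
  static java.util.function.IntToDoubleFunction dotProductScores(
      float[][] storedVectors, float[] queryVector) {
    return doc -> {
      float[] v = storedVectors[doc];
      double sum = 0;
      for (int i = 0; i < v.length; i++) {
        sum += v[i] * queryVector[i]; // dot product, one term per dimension
      }
      return sum;
    };
  }
}
```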






Re: [PR] Add interface VectorValues to be implemented by [Float/Byte]VectorValues [lucene]

2023-10-12 Thread via GitHub


benwtrent commented on PR #12636:
URL: https://github.com/apache/lucene/pull/12636#issuecomment-1760104677

   > Which interface are you referring to RandomAccessVectorValues? as that 
uses generics to avoid duplication?
   
   During the byte/float refactoring I did a while back, one of the iterations 
used an interface similar to the one you have made here. It was shot down as 
unnecessarily complex without much gain.
   
   I am inclined to agree. It doesn't gain us much. I would rather a little 
code duplication than a bad abstraction.
   
   > You mean we could do something similar to RandomVectorScorer etc. what has 
been done for RandomAccessVectorValues(i.e. use generics)?
   
   I think there are places where we duplicate code unnecessarily because we 
are not using scorers, but instead creating a vector reader & scorer 
separately. 
   
   > So are you suggesting saying instead of making [Float/Byte]VectorValues 
use this VectorValues interface we should only try avoiding duplication at the 
places you pointed?
   
   I think so. I don't want us adding an abstraction we are not 100% sure of 
for very little gain.
   





Re: [PR] [DRAFT] Concurrent HNSW Merge [lucene]

2023-10-12 Thread via GitHub


msokolov commented on PR #12660:
URL: https://github.com/apache/lucene/pull/12660#issuecomment-1760118913

   This looks like a great start. I worked up a very similar PR but I think I 
have some concurrency bugs around the update/add of new entry points in the 
upper graph levels. I might post mine anyway because it has some syntax sugar I 
think would be nice to bring over here. One idea I had was to create a 
NeighborArrayIterator that would handle read-locking in the OnHeapGraph case 
with another implementation that just encapsulates the iteration with no 
locking for use by other (off-heap) HnswGraph. With that and a little 
refactoring we can use the same HnswGraphSearcher for both cases?
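   The NeighborArrayIterator idea above can be sketched as two iteration 
variants: one that holds a read lock for the duration of iteration (for an 
on-heap graph that is concurrently mutated) and one with no locking (for an 
immutable off-heap graph). All names below are hypothetical, not actual Lucene 
classes:

```java
// Illustrative sketch of the locking/non-locking neighbor iteration idea.
// Writers that add links would take the corresponding write lock elsewhere.
final class LockedNeighborIteration {
  interface NeighborVisitor {
    void visit(int neighborOrd);
  }

  // No-lock variant: safe when the neighbor array can no longer change.
  static void forEachUnlocked(int[] neighbors, int count, NeighborVisitor v) {
    for (int i = 0; i < count; i++) {
      v.visit(neighbors[i]);
    }
  }

  // Read-locking variant: holds the read lock across the whole iteration so a
  // concurrent writer cannot resize or reorder the array mid-scan.
  static void forEachReadLocked(
      java.util.concurrent.locks.ReadWriteLock lock,
      int[] neighbors, int count, NeighborVisitor v) {
    lock.readLock().lock();
    try {
      forEachUnlocked(neighbors, count, v);
    } finally {
      lock.readLock().unlock();
    }
  }
}
```

   With a shape like this, a single searcher can be written against the 
iterator and stay oblivious to whether locking happens underneath.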
   
   





Re: [PR] Add a merge policy wrapper that performs recursive graph bisection on merge. [lucene]

2023-10-12 Thread via GitHub


s1monw commented on code in PR #12622:
URL: https://github.com/apache/lucene/pull/12622#discussion_r1357159891


##
lucene/core/src/java/org/apache/lucene/index/MergePolicy.java:
##
@@ -288,11 +288,32 @@ final void close(
   }
 }
 
-/** Wrap the reader in order to add/remove information to the merged 
segment. */
+/**
+ * Wrap a reader prior to merging in order to add/remove fields or 
documents.
+ *
+ * NOTE: It is illegal to reorder doc IDs here, use {@link

Review Comment:
   Maybe a long shot, but can we somehow assert that with an asserting reader 
in tests, in case anything does reorder doc IDs here?



##
lucene/core/src/java/org/apache/lucene/index/IndexWriter.java:
##
@@ -5144,20 +5145,71 @@ public int length() {
 }
 mergeReaders.add(wrappedReader);
   }
+
+  MergeState.DocMap[] reorderDocMaps = null;
+  if (config.getIndexSort() == null) {
+// Create a merged view of the input segments. This effectively does 
the merge.
+CodecReader mergedView = 
SlowCompositeCodecReaderWrapper.wrap(mergeReaders);
+Sorter.DocMap docMap = merge.reorder(mergedView, directory);
+if (docMap != null) {
+  reorderDocMaps = new MergeState.DocMap[mergeReaders.size()];
+  int docBase = 0;
+  int i = 0;
+  for (CodecReader reader : mergeReaders) {
+final int finalDocBase = docBase;

Review Comment:
   nit: this confused me when I read the code. I read "final" as the eventual 
docBase, not the _docBase_ var being final. Maybe call it currentDocBase or so?



##
lucene/core/src/java/org/apache/lucene/index/IndexWriter.java:
##
@@ -5144,20 +5145,71 @@ public int length() {
 }
 mergeReaders.add(wrappedReader);
   }
+
+  MergeState.DocMap[] reorderDocMaps = null;
+  if (config.getIndexSort() == null) {
+// Create a merged view of the input segments. This effectively does 
the merge.
+CodecReader mergedView = 
SlowCompositeCodecReaderWrapper.wrap(mergeReaders);
+Sorter.DocMap docMap = merge.reorder(mergedView, directory);
+if (docMap != null) {
+  reorderDocMaps = new MergeState.DocMap[mergeReaders.size()];
+  int docBase = 0;
+  int i = 0;
+  for (CodecReader reader : mergeReaders) {
+final int finalDocBase = docBase;
+reorderDocMaps[i] =
+new MergeState.DocMap() {
+  @Override
+  public int get(int docID) {
+Objects.checkIndex(docID, reader.maxDoc());
+return docMap.oldToNew(finalDocBase + docID);
+  }
+};
+i++;
+docBase += reader.maxDoc();
+  }
+  // This makes merging more expensive as it disables some bulk 
merging optimizations, so
+  // only do this if a non-null DocMap is returned.
+  mergeReaders =
+  Collections.singletonList(SortingCodecReader.wrap(mergedView, 
docMap, null));
+}
+  }
+
   final SegmentMerger merger =
   new SegmentMerger(
   mergeReaders, merge.info.info, infoStream, dirWrapper, 
globalFieldNumberMap, context);
   merge.info.setSoftDelCount(Math.toIntExact(softDeleteCount.get()));
   merge.checkAborted();
 
+  MergeState mergeState = merger.mergeState;
+  MergeState.DocMap[] docMaps;
+  if (reorderDocMaps == null) {
+docMaps = mergeState.docMaps;
+  } else {
+assert mergeState.docMaps.length == 1;
+MergeState.DocMap compactionDocMap = mergeState.docMaps[0];
+docMaps = new MergeState.DocMap[reorderDocMaps.length];
+for (int i = 0; i < docMaps.length; ++i) {
+  MergeState.DocMap reorderDocMap = reorderDocMaps[i];
+  docMaps[i] =
+  new MergeState.DocMap() {
+@Override
+public int get(int docID) {
+  int reorderedDocId = reorderDocMap.get(docID);

Review Comment:
   maybe I am missing something but what do we need the _reorderDocMap_ for? 
Can't we do the calculation we do in the loop above in-line with this? 
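   For reference, the two-step mapping being discussed can be sketched with 
plain int-to-int functions: the final per-reader doc map is the compaction map 
applied after the docBase shift and the reorder map. This is a standalone 
illustration with hypothetical names, not Lucene's `MergeState.DocMap` types:

```java
// Sketch of the composed doc-ID mapping: per-reader docBase shift, then the
// global reorder, then the compaction map produced by the merge.
final class DocMapComposition {
  static java.util.function.IntUnaryOperator perReaderMap(
      int docBase,
      java.util.function.IntUnaryOperator reorder,
      java.util.function.IntUnaryOperator compaction) {
    // local docID -> global docID -> reordered docID -> compacted docID
    return docId -> compaction.applyAsInt(reorder.applyAsInt(docBase + docId));
  }
}
```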



##
lucene/core/src/java/org/apache/lucene/index/IndexWriter.java:
##
@@ -5144,20 +5145,71 @@ public int length() {
 }
 mergeReaders.add(wrappedReader);
   }
+
+  MergeState.DocMap[] reorderDocMaps = null;
+  if (config.getIndexSort() == null) {
+// Create a merged view of the input segments. This effectively does 
the merge.
+CodecReader mergedView = 
SlowCompositeCodecReaderWrapper.wrap(mergeReaders);
+Sorter.DocMap docMap = merge.reorder(mergedView, directory);
+if (docMap != null) {
+  reorderDocMaps = new MergeState.DocMap[mergeReaders.size()];
+  int docBase = 0;
+  int i = 0;
+  for (CodecReader reader : mergeReaders) {
+final int finalDocBase = docBase;

Re: [PR] Add a merge policy wrapper that performs recursive graph bisection on merge. [lucene]

2023-10-12 Thread via GitHub


s1monw commented on PR #12622:
URL: https://github.com/apache/lucene/pull/12622#issuecomment-1760126316

   Just for the record: I think we should record whether a segment was written 
containing blocks, and make that flag viral. That might be a good idea anyway. 





Re: [PR] Initial take at adding JMH micro-benchmarks [lucene]

2023-10-12 Thread via GitHub


dweiss merged PR #12663:
URL: https://github.com/apache/lucene/pull/12663





Re: [I] ability to run JMH benchmarks from gradle [lucene]

2023-10-12 Thread via GitHub


dweiss closed issue #12641: ability to run JMH benchmarks from gradle
URL: https://github.com/apache/lucene/issues/12641





Re: [I] ability to run JMH benchmarks from gradle [lucene]

2023-10-12 Thread via GitHub


dweiss commented on issue #12641:
URL: https://github.com/apache/lucene/issues/12641#issuecomment-1760149489

   I left just a single benchmark in, @rmuir - if you'd like to move all of 
them, please do so. If something doesn't work, let me know (it should though).





Re: [PR] Initial take at adding JMH micro-benchmarks [lucene]

2023-10-12 Thread via GitHub


dweiss commented on code in PR #12663:
URL: https://github.com/apache/lucene/pull/12663#discussion_r1357235935


##
lucene/benchmark-jmh/README.txt:
##
@@ -0,0 +1,21 @@
+The :lucene:benchmark-jmh module can be used to compile
+and execute JMH (https://github.com/openjdk/jmh) micro-benchmarks.
+
+Look at existing classes and JMH documentation for inspiration on how
+to write good micro-benchmarks.
+
+To compile the project and prepare JMH launcher, run:
+
+gradlew :lucene:benchmark-jmh:assemble
+
+The above target will display exact commands to execute JMH from
+command line, for example:
+
+java --module-path lucene\benchmark-jmh\build\benchmarks --module 
org.apache.lucene.benchmark.jmh
+
+You can pass any JMH options to the above command, for example:
+
+  -h  displays verbose help for all options
+  -l  list available benchmarks
+  -lp list benchmarks that pass the filter and their parameters
+  regexp  execute all benchmarks containing regexp

Review Comment:
   Oh, apologies - I thought you added that change to the branch, not as a 
GitHub suggested change. I added this manually.






Re: [PR] Optimize OnHeapHnswGraph's data structure [lucene]

2023-10-12 Thread via GitHub


zhaih commented on code in PR #12651:
URL: https://github.com/apache/lucene/pull/12651#discussion_r1357260551


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##
@@ -288,7 +294,6 @@ private void selectAndLinkDiverse(
   // only adding it if it is closer to the target than to any of the other 
selected neighbors
   int cNode = candidates.node[i];
   float cScore = candidates.score[i];
-  assert cNode < hnsw.size();

Review Comment:
   Yeah, `size()` is actually a little bit broken after the initialize-from-graph 
change: when we init from another graph, the node ids are not necessarily added 
in order, so since that change `size()` actually returns `(the max node id that 
is in the graph) + 1`. In this change I fixed the `size()` behavior back to 
`actual number of nodes inserted`, and this assertion broke.
   
   But you're right, it is still useful (especially on the searcher side), so I'll 
introduce a new method `capacity` to indicate `(the max node id that is in the 
graph) + 1` and use it here.
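   To make the distinction concrete, here is a stdlib-only toy sketch (assumed 
names, not Lucene's actual `OnHeapHnswGraph` API) of `size()` as "nodes 
inserted" versus `capacity()` as "max node id + 1":

```java
import java.util.HashSet;
import java.util.Set;

// Toy graph illustrating why `assert cNode < size()` breaks when node ids
// are inserted out of order (e.g. when initializing from another graph).
class ToyGraph {
    private final Set<Integer> nodes = new HashSet<>();
    private int maxNodeId = -1;

    void addNode(int nodeId) {
        nodes.add(nodeId);
        maxNodeId = Math.max(maxNodeId, nodeId);
    }

    int size() { return nodes.size(); }       // actual number of nodes inserted
    int capacity() { return maxNodeId + 1; }  // (max node id in the graph) + 1

    public static void main(String[] args) {
        ToyGraph g = new ToyGraph();
        g.addNode(0);
        g.addNode(5);  // out-of-order insertion: id 5 arrives before 1..4
        System.out.println(g.size());      // 2
        System.out.println(g.capacity());  // 6
    }
}
```

   Here node 5 is a valid node even though `5 < size()` is false, which is why 
a bound check on node ids has to use `capacity()` rather than `size()`.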






Re: [PR] Optimize OnHeapHnswGraph's data structure [lucene]

2023-10-12 Thread via GitHub


zhaih commented on code in PR #12651:
URL: https://github.com/apache/lucene/pull/12651#discussion_r1357261063


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
##
@@ -232,7 +231,6 @@ void searchLevel(
   graphSeek(graph, level, topCandidateNode);
   int friendOrd;
   while ((friendOrd = graphNextNeighbor(graph)) != NO_MORE_DOCS) {
-assert friendOrd < size : "friendOrd=" + friendOrd + "; size=" + size;

Review Comment:
   Same reason as above; I'll update the way the searcher calculates the size.






Re: [PR] Optimize OnHeapHnswGraph's data structure [lucene]

2023-10-12 Thread via GitHub


zhaih commented on code in PR #12651:
URL: https://github.com/apache/lucene/pull/12651#discussion_r1357264961


##
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java:
##
@@ -40,31 +41,29 @@ public final class OnHeapHnswGraph extends HnswGraph 
implements Accountable {
   // to vectors
   // added to HnswBuilder, and the node values are the ordinals of those 
vectors.
   // Thus, on all levels, neighbors expressed as the level 0's nodes' ordinals.
-  private final List graphLevel0;

Review Comment:
   > Hmm that is interesting but since it will usually be sparse we will still 
have a lot of mutations to do
   
   That's not the case, I think: we can always make the code insert the 
top-most level of a node first, so that we allocate the levels for the node up 
front and never need to resize the second dimension. And there'll be no wasted 
space on the second dimension, because we allocate it according to need. See 
`OnHeapHnswGraph#addNode`; I think that code explains itself.
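   The idea can be sketched with plain jagged arrays (a simplified illustration, 
not the actual `OnHeapHnswGraph` code): because each node is first inserted at 
its top-most level, its whole level column can be allocated once, sized exactly 
to need.

```java
import java.util.Arrays;

// Per-node two-dimensional structure whose second dimension never resizes:
// the first insertion of a node happens at the node's top-most level, so the
// full column of levels is allocated in one shot.
class ToyLevels {
    private Object[][] graph = new Object[0][];

    void addNode(int level, int node) {
        if (node >= graph.length) {
            // grow the first (node) dimension only
            graph = Arrays.copyOf(graph, node + 1);
        }
        if (graph[node] == null) {
            // first insertion is at the node's top-most level: allocate
            // exactly level + 1 slots, one neighbor list per level
            graph[node] = new Object[level + 1];
        }
        graph[node][level] = "neighbors";  // placeholder for a neighbor array
    }

    int numLevels(int node) { return graph[node].length; }

    public static void main(String[] args) {
        ToyLevels g = new ToyLevels();
        g.addNode(2, 7);  // node 7 lives on levels 0..2
        g.addNode(0, 7);  // lower-level insertion: no second-dimension resize
        System.out.println(g.numLevels(7));  // 3
    }
}
```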






[I] Enable recursive graph bisection out of the box? [lucene]

2023-10-12 Thread via GitHub


jpountz opened a new issue, #12665:
URL: https://github.com/apache/lucene/issues/12665

   ### Description
   
   It would be nice to enable recursive graph bisection out of the box, so that 
users don't even have to know that it exists or what it is to enjoy its 
search-time performance benefits.
   
   
   - [x] #12489
   - [x] #12622
   - [ ] Record segments that contain blocks of documents 
(`IndexWriter#addDocuments` with an `s`)
   - [ ] Figure out lightweight parameters of recursive graph bisection that 
would make the merging overhead reasonable
   - [ ] Enable recursive graph bisection automatically when no index sort is 
configured and only `addDocument` is used.





Re: [PR] Optimize OnHeapHnswGraph's data structure [lucene]

2023-10-12 Thread via GitHub


zhaih commented on PR #12651:
URL: https://github.com/apache/lucene/pull/12651#issuecomment-1760235738

   OK, I have incorporated all the learnings from #12660 and added several more 
assertions to make it safer. Please take another look when you have time 
@msokolov, thanks!





Re: [PR] Ability to compute vector similarity scores with DoubleValuesSource [lucene]

2023-10-12 Thread via GitHub


shubhamvishu commented on code in PR #12548:
URL: https://github.com/apache/lucene/pull/12548#discussion_r1357312134


##
lucene/core/src/java/org/apache/lucene/search/DoubleValuesSource.java:
##
@@ -172,6 +173,52 @@ public LongValuesSource rewrite(IndexSearcher searcher) 
throws IOException {
 }
   }
 
+  /**
+   * Returns a DoubleValues instance for computing the vector similarity score 
per document against
+   * the byte query vector
+   *
+   * @param ctx the context for which to return the DoubleValues
+   * @param queryVector byte query vector
+   * @param vectorField knn byte field name
+   * @return DoubleValues instance
+   * @throws IOException if an {@link IOException} occurs
+   */
+  public static DoubleValues similarityToQueryVector(

Review Comment:
   Yes, I didn't mean to move all the static methods here (sorry for the 
confusion). My idea was along the lines of your suggestion: should we also move 
out the private static classes in this class (a lot in number, for example 
[one](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/DoubleValuesSource.java#L309-L372),
 
[two](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/DoubleValuesSource.java#L396-L556)
 and more), which are bloating this class, into another package-private class, 
say `DoubleValuesSourceUtil`? (The only implication is that these private 
classes would become package-private, which is fine I think?) This way we could 
keep only the required API methods in the DoubleValuesSource class and move the 
rest out, making it lean and readable. If that makes sense, I'll put up these 
changes too with the follow-up CR (like below). Let me know what you think!
   
   Here is a short example commit I made to explain what I was saying - 
https://github.com/shubhamvishu/lucene/commit/604f1fdcad96a3c5e09356190645b6f2ec16bb2c
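   A minimal sketch of the proposed shape (hypothetical names: `ValuesSource` 
and `ValuesSourceUtil` stand in for `DoubleValuesSource` and the suggested 
`DoubleValuesSourceUtil`; this is an illustration of the pattern, not the 
linked commit's code):

```java
// The public API class keeps only the factory methods; the formerly private
// implementation classes become package-private members of a companion class.
abstract class ValuesSource {
    abstract double value(int doc);

    // API surface stays here; the implementation lives in the util class
    static ValuesSource constant(double v) {
        return new ValuesSourceUtil.ConstantValuesSource(v);
    }
}

// Package-private companion: implementation detail moved out of the API class.
final class ValuesSourceUtil {
    private ValuesSourceUtil() {}

    static final class ConstantValuesSource extends ValuesSource {
        private final double v;
        ConstantValuesSource(double v) { this.v = v; }
        @Override double value(int doc) { return v; }
    }
}

class ValuesSourceDemo {
    public static void main(String[] args) {
        System.out.println(ValuesSource.constant(2.5).value(0));  // 2.5
    }
}
```

   Callers see only `ValuesSource`; the companion class can grow without 
bloating the API class.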






Re: [I] ability to run JMH benchmarks from gradle [lucene]

2023-10-12 Thread via GitHub


rmuir commented on issue #12641:
URL: https://github.com/apache/lucene/issues/12641#issuecomment-1760262080

   @dweiss thank you, I will fix and cleanup all the benchmarks and archive 
https://github.com/rmuir/vectorbench as soon as I can :)





Re: [I] ability to run JMH benchmarks from gradle [lucene]

2023-10-12 Thread via GitHub


rmuir commented on issue #12641:
URL: https://github.com/apache/lucene/issues/12641#issuecomment-1760263361

   I can also add some stuff from the README there such as how to install the 
disassembler. It is stuff that you forget easily if you haven't done it in a 
while.





Re: [PR] [DRAFT] Concurrent HNSW Merge [lucene]

2023-10-12 Thread via GitHub


zhaih commented on PR #12660:
URL: https://github.com/apache/lucene/pull/12660#issuecomment-1760280351

   That sounds great, thanks Mike! Please post the PR and I will try to see
   how to put it together. Meanwhile, let's try to get the prerequisite PR
   reviewed and pushed first, so that this one can look cleaner.
   
   On Thu, Oct 12, 2023, 11:07 Michael Sokolov ***@***.***>
   wrote:
   
   > This looks like a great start. I worked up a very similar PR but I think I
   > have some concurrency bugs around the update/add of new entry points in the
   > upper graph levels. I might post mine anyway because it has some syntax
   > sugar I think would be nice to bring over here. One idea I had was to
   > create a NeighborArrayIterator that would handle read-locking in the
   > OnHeapGraph case with another implementation that just encapsulates the
   > iteration with no locking for use by other (off-heap) HnswGraph. With that
   > and a little refactoring we can use the same HnswGraphSearcher for both
   > cases?
   >
   > —
   > Reply to this email directly, view it on GitHub, or unsubscribe.
   > You are receiving this because you authored the thread.
   





[I] normalize() override provided in Simple example in Analyzer class doc is missing String fieldName parameter [lucene]

2023-10-12 Thread via GitHub


Bluetopia opened a new issue, #12666:
URL: https://github.com/apache/lucene/issues/12666

   ### Description
   
   Using: 
https://javadoc.io/doc/org.apache.lucene/lucene-core/latest/org/apache/lucene/analysis/Analyzer.html
 which should hopefully be up to date. The current state of the source file as 
of October 12, 2023 in the repository appears to match from a cursory glance.
   
   **Issue:** 
   The example for the createComponents() method has been updated to include 
the String fieldName parameter, but the normalize() example is missing this 
parameter, and as such the sample does not work.
   
   **Fix:**
   Update example to match the method signature for normalize() in question: 
   ```
   protected TokenStream normalize(String fieldName, TokenStream in) {
   ...
   }
   ```
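   A stdlib-only toy (not Lucene's real `Analyzer`; `ToyAnalyzer` and 
`MyAnalyzer` are made-up names) showing why the one-argument form cannot 
override: with `@Override`, the compiler rejects the signature that drops 
`fieldName`, which is exactly the error users hit when copying the old sample.

```java
abstract class ToyAnalyzer {
    // Mirrors the two-argument shape of Analyzer#normalize(String, TokenStream)
    protected abstract String normalize(String fieldName, String in);
}

class MyAnalyzer extends ToyAnalyzer {
    // Matches the superclass signature, so @Override compiles.
    @Override
    protected String normalize(String fieldName, String in) {
        return in.toLowerCase();
    }

    // The doc's old sample omitted fieldName; the equivalent here would be
    //   @Override protected String normalize(String in) { ... }
    // which fails to compile: "method does not override or implement a
    // method from a supertype".
}

class NormalizeDemo {
    public static void main(String[] args) {
        System.out.println(new MyAnalyzer().normalize("body", "HeLLo"));  // hello
    }
}
```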





Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-12 Thread via GitHub


mayya-sharipova commented on code in PR #12582:
URL: https://github.com/apache/lucene/pull/12582#discussion_r1357345297


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java:
##
@@ -0,0 +1,782 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene99;
+
+import static 
org.apache.lucene.codecs.lucene99.Lucene99ScalarQuantizedVectorsFormat.QUANTIZED_VECTOR_COMPONENT;
+import static 
org.apache.lucene.codecs.lucene99.Lucene99ScalarQuantizedVectorsFormat.calculateDefaultQuantile;
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+import static org.apache.lucene.util.RamUsageEstimator.shallowSizeOfInstance;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.util.ArrayList;
+import java.util.List;
+import org.apache.lucene.codecs.CodecUtil;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat;
+import org.apache.lucene.index.DocIDMerger;
+import org.apache.lucene.index.DocsWithFieldSet;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.FloatVectorValues;
+import org.apache.lucene.index.MergeState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.index.Sorter;
+import org.apache.lucene.index.VectorEncoding;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.Accountable;
+import org.apache.lucene.util.IOUtils;
+import org.apache.lucene.util.InfoStream;
+import org.apache.lucene.util.RamUsageEstimator;
+import org.apache.lucene.util.ScalarQuantizer;
+import org.apache.lucene.util.VectorUtil;
+import org.apache.lucene.util.hnsw.CloseableRandomVectorScorerSupplier;
+import org.apache.lucene.util.hnsw.RandomVectorScorer;
+
+/**
+ * Writes quantized vector values and metadata to index segments.
+ *
+ * @lucene.experimental
+ */
+public final class Lucene99ScalarQuantizedVectorsWriter implements Accountable 
{
+
+  private static final long BASE_RAM_BYTES_USED =
+  shallowSizeOfInstance(Lucene99ScalarQuantizedVectorsWriter.class);
+
+  private static final float QUANTIZATION_RECOMPUTE_LIMIT = 32;
+  private final IndexOutput quantizedVectorData;
+  private final Float quantile;
+  private boolean finished;
+
+  Lucene99ScalarQuantizedVectorsWriter(IndexOutput quantizedVectorData, Float 
quantile) {
+this.quantile = quantile;
+this.quantizedVectorData = quantizedVectorData;
+  }
+
+  QuantizationVectorWriter addField(FieldInfo fieldInfo, InfoStream 
infoStream) {
+if (fieldInfo.getVectorEncoding() != VectorEncoding.FLOAT32) {
+  throw new IllegalArgumentException(
+  "Only float32 vector fields are supported for quantization");
+}
+float quantile =
+this.quantile == null
+? calculateDefaultQuantile(fieldInfo.getVectorDimension())
+: this.quantile;
+if (infoStream.isEnabled(QUANTIZED_VECTOR_COMPONENT)) {
+  infoStream.message(
+  QUANTIZED_VECTOR_COMPONENT,
+  "quantizing field="
+  + fieldInfo.name
+  + " dimension="
+  + fieldInfo.getVectorDimension()
+  + " quantile="
+  + quantile);
+}
+return QuantizationVectorWriter.create(fieldInfo, quantile, infoStream);
+  }
+
+  long[] flush(
+  Sorter.DocMap sortMap, QuantizationVectorWriter field, DocsWithFieldSet 
docsWithField)
+  throws IOException {
+field.finish();
+return sortMap == null ? writeField(field) : writeSortingField(field, 
sortMap, docsWithField);
+  }
+
+  void finish() throws IOException {
+if (finished) {
+  throw new IllegalStateException("already finished");
+}
+finished = true;
+if (quantizedVectorData != null) {
+  CodecUtil.writeFooter(quantizedVectorData);
+}
+  }
+
+  private long[] writeField(QuantizationVectorWriter fieldData) throws 
IOException {
+   

Re: [PR] Extract the hnsw graph merging from being part of the vector writer [lucene]

2023-10-12 Thread via GitHub


benwtrent commented on code in PR #12657:
URL: https://github.com/apache/lucene/pull/12657#discussion_r1357376455


##
lucene/core/src/java/org/apache/lucene/codecs/lucene95/IncrementalHnswGraphMerger.java:
##
@@ -0,0 +1,209 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene95;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.codecs.HnswGraphProvider;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat;
+import org.apache.lucene.index.ByteVectorValues;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.FloatVectorValues;
+import org.apache.lucene.index.MergeState;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.hnsw.HnswGraph;
+import org.apache.lucene.util.hnsw.HnswGraphBuilder;
+import org.apache.lucene.util.hnsw.InitializedHnswGraphBuilder;
+import org.apache.lucene.util.hnsw.RandomVectorScorerSupplier;
+
+/**
+ * This selects the biggest Hnsw graph from the provided merge state and 
initializes a new
+ * HnswGraphBuilder with that graph as a starting point.
+ */
+public class IncrementalHnswGraphMerger {
+
+  private KnnVectorsReader curReader;
+  private MergeState.DocMap curDocMap;
+  private int curGraphSize;
+  private final FieldInfo fieldInfo;
+  private final RandomVectorScorerSupplier scorerSupplier;
+  private final int M;
+  private final int beamWidth;
+
+  /**
+   * @param fieldInfo FieldInfo for the field being merged
+   */
+  public IncrementalHnswGraphMerger(
+  FieldInfo fieldInfo, RandomVectorScorerSupplier scorerSupplier, int M, 
int beamWidth) {
+this.fieldInfo = fieldInfo;
+this.scorerSupplier = scorerSupplier;
+this.M = M;
+this.beamWidth = beamWidth;
+  }
+
+  /**
+   * Adds a reader to the graph merger if it meets the following criteria: 1. 
Does not contain any
+   * deleted docs 2. Is a HnswGraphProvider/PerFieldKnnVectorReader 3. Has the 
most docs of any
+   * previous reader that met the above criteria
+   *
+   * @param reader KnnVectorsReader to add to the merger
+   * @param docMap MergeState.DocMap for the reader
+   * @param liveDocs Bits representing live docs, can be null
+   * @return this
+   * @throws IOException If an error occurs while reading from the merge state
+   */
+  IncrementalHnswGraphMerger addReader(
+  KnnVectorsReader reader, MergeState.DocMap docMap, Bits liveDocs) throws 
IOException {
+KnnVectorsReader currKnnVectorsReader = reader;
+if (reader instanceof PerFieldKnnVectorsFormat.FieldsReader 
candidateReader) {
+  currKnnVectorsReader = candidateReader.getFieldReader(fieldInfo.name);
+}
+
+if (!(currKnnVectorsReader instanceof HnswGraphProvider) || 
!allMatch(liveDocs)) {
+  return this;
+}
+
+int candidateVectorCount = 0;
+switch (fieldInfo.getVectorEncoding()) {
+  case BYTE -> {
+ByteVectorValues byteVectorValues =
+currKnnVectorsReader.getByteVectorValues(fieldInfo.name);
+if (byteVectorValues == null) {
+  return this;
+}
+candidateVectorCount = byteVectorValues.size();
+  }
+  case FLOAT32 -> {
+FloatVectorValues vectorValues = 
currKnnVectorsReader.getFloatVectorValues(fieldInfo.name);
+if (vectorValues == null) {
+  return this;
+}
+candidateVectorCount = vectorValues.size();
+  }
+}
+if (candidateVectorCount > curGraphSize) {
+  curReader = currKnnVectorsReader;
+  curDocMap = docMap;
+  curGraphSize = candidateVectorCount;
+}
+return this;
+  }
+
+  /**
+   * Builds a new HnswGraphBuilder using the biggest graph from the merge 
state as a starting point.
+   * If no valid readers were added to the merge state, a new graph is created.
+   *
+   * @param mergeState MergeState for the merge
+   * @return HnswGraphBuilder
+   * @throws IOException If an error occurs w

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-12 Thread via GitHub


benwtrent commented on code in PR #12582:
URL: https://github.com/apache/lucene/pull/12582#discussion_r1357380984


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsFormat.java:
##
@@ -0,0 +1,209 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.lucene99;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.KnnVectorsFormat;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.hnsw.HnswGraph;
+
+/**
+ * Lucene 9.5 vector format, which encodes numeric vector values and an 
optional associated graph
+ * connecting the documents having values. The graph is used to power HNSW 
search. The format
+ * consists of three files:
+ *
+ * .vec (vector data) file
+ *
+ * For each field:
+ *
+ * 
+ *   Vector data ordered by field, document ordinal, and vector dimension. 
When the
+ *   vectorEncoding is BYTE, each sample is stored as a single byte. When 
it is FLOAT32, each
+ *   sample is stored as an IEEE float in little-endian byte order.
+ *   DocIds encoded by {@link IndexedDISI#writeBitSet(DocIdSetIterator, 
IndexOutput, byte)},
+ *   note that only in sparse case
+ *   OrdToDoc was encoded by {@link 
org.apache.lucene.util.packed.DirectMonotonicWriter}, note
+ *   that only in sparse case
+ * 
+ *
+ * .vex (vector index)
+ *
+ * Stores graphs connecting the documents for each field organized as a 
list of nodes' neighbours
+ * as following:
+ *
+ * 
+ *   For each level:
+ *   
+ * For each node:
+ * 
+ *   [vint] the number of neighbor nodes
+ *   array[vint] the delta encoded neighbor ordinals
+ * 
+ *   
+ *   After all levels are encoded memory offsets for each node's neighbor 
nodes encoded by
+ *   {@link org.apache.lucene.util.packed.DirectMonotonicWriter} are 
appened to the end of the
+ *   file.
+ * 
+ *
+ * .vem (vector metadata) file
+ *
+ * For each field:
+ *
+ * 
+ *   [int32] field number
+ *   [int32] vector similarity function ordinal
+ *   [vlong] offset to this field's vectors in the .vec file
+ *   [vlong] length of this field's vectors, in bytes
+ *   [vlong] offset to this field's index in the .vex file
+ *   [vlong] length of this field's index data, in bytes
+ *   [vint] dimension of this field's vectors
+ *   [int] the number of documents having values for this field
+ *   [int8] if equals to -1, dense – all documents have values for 
a field. If equals to
+ *   0, sparse – some documents missing values.
+ *   DocIds were encoded by {@link 
IndexedDISI#writeBitSet(DocIdSetIterator, IndexOutput, byte)}
+ *   OrdToDoc was encoded by {@link 
org.apache.lucene.util.packed.DirectMonotonicWriter}, note
+ *   that only in sparse case
+ *   [vint] the maximum number of connections (neigbours) that each 
node can have
+ *   [vint] number of levels in the graph
+ *   Graph nodes by level. For each level
+ *   
+ * [vint] the number of nodes on this level
+ * array[vint] for levels greater than 0 list of nodes on 
this level, stored as
+ * the level 0th delta encoded nodes' ordinals.
+ *   
+ * 
+ *
+ * @lucene.experimental
+ */
+public final class Lucene99HnswVectorsFormat extends KnnVectorsFormat {

Review Comment:
   @mayya-sharipova The format comment should contain information about how 
quantized vectors are stored, their sparse corrections, etc. 




[PR] migrate all vectorbench methods to lucene [lucene]

2023-10-12 Thread via GitHub


rmuir opened a new pull request, #12667:
URL: https://github.com/apache/lucene/pull/12667

   Following up on @dweiss's work, this gives us the same benchmarks as 
https://github.com/rmuir/vectorbench, just without the code duplication and 
maintenance hassle.
   
   Each method is simply invoked with and without 
`--add-modules=jdk.incubator.vector`, so we can easily compare the performance 
of both the scalar and vector implementations.
   
   Same results we were getting before, with less code. I also like that it 
tests the "real" codepath through Lucene: we just call `VectorUtil` methods 
with different JVM args.
   
   ```
   Benchmark                                       (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.binaryCosineScalar            1024  thrpt    5   0.763 ± 0.012  ops/us
   VectorUtilBenchmark.binaryCosineVector            1024  thrpt    5   3.487 ± 0.080  ops/us
   VectorUtilBenchmark.binaryDotProductScalar        1024  thrpt    5   1.404 ± 0.032  ops/us
   VectorUtilBenchmark.binaryDotProductVector        1024  thrpt    5   7.120 ± 0.133  ops/us
   VectorUtilBenchmark.binarySquareScalar            1024  thrpt    5   1.109 ± 0.898  ops/us
   VectorUtilBenchmark.binarySquareVector            1024  thrpt    5   5.938 ± 1.490  ops/us
   VectorUtilBenchmark.floatCosineScalar             1024  thrpt    5   0.612 ± 0.071  ops/us
   VectorUtilBenchmark.floatCosineVector             1024  thrpt    5   5.953 ± 0.118  ops/us
   VectorUtilBenchmark.floatDotProductScalar         1024  thrpt    5   1.926 ± 0.060  ops/us
   VectorUtilBenchmark.floatDotProductVector         1024  thrpt    5  11.865 ± 1.057  ops/us
   VectorUtilBenchmark.floatSquareScalar             1024  thrpt    5   1.386 ± 0.065  ops/us
   VectorUtilBenchmark.floatSquareVector             1024  thrpt    5   9.689 ± 0.102  ops/us
   ```
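   The scalar-vs-vector split in these runs can be observed from inside the JVM: 
the incubator module is only present on the boot layer when the process was 
started with `--add-modules jdk.incubator.vector`. A stdlib-only sketch (an 
illustration, not Lucene code; the log strings echo the ones quoted below):

```java
class VectorModuleCheck {
    public static void main(String[] args) {
        // Present only when the JVM was started with
        // --add-modules jdk.incubator.vector
        boolean enabled = ModuleLayer.boot()
                .findModule("jdk.incubator.vector")
                .isPresent();
        System.out.println(enabled
                ? "INFO: Java vector incubator API enabled"
                : "WARNING: Java vector incubator module is not readable.");
    }
}
```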
   





Re: [PR] migrate all vectorbench methods to lucene [lucene]

2023-10-12 Thread via GitHub


rmuir commented on PR #12667:
URL: https://github.com/apache/lucene/pull/12667#issuecomment-1760526957

   You can still tell what's happening thanks to the log messages. When each 
benchmark runs, you see a single message:
   
   xxxScalar() methods:
   ```
   WARNING: Java vector incubator module is not readable. For optimal vector 
performance, pass '--add-modules jdk.incubator.vector' to enable Vector API.
   ```
   
   xxxVector() methods:
   ```
   INFO: Java vector incubator API enabled; uses preferredBitSize=256
   ```
   





Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-10-12 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1353826324


##
lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java:
##
@@ -21,19 +21,18 @@
 import java.util.List;
 import org.apache.lucene.store.DataInput;
 import org.apache.lucene.store.DataOutput;
-import org.apache.lucene.util.Accountable;
 import org.apache.lucene.util.RamUsageEstimator;
 
 // TODO: merge with PagedBytes, except PagedBytes doesn't
 // let you read while writing which FST needs
 
-class BytesStore extends DataOutput implements Accountable {
+class BytesStore extends FSTWriter {

Review Comment:
   > Once FST.addNode completes, those written bytes are never altered?
   
   ~~More precisely, those bytes are never altered after `FSTCompiler.add` 
completes. It seems we need to write all nodes of the input in reverse as a 
whole.~~ It seems it's possible to flush the scratch bytes after every 
`addNode`. Yes, the BytesStore uses the head of the buffer as a scratch area.
   
   > Hmm getReverseBytesReader is indeed a problem. I wonder how the [Tantivy 
FST impl](https://blog.burntsushi.net/transducers/) deals with this?
   
   It seems Tantivy separates building and traversing the FST into two 
different entities. The FST Builder just writes the FST to a DataOutput and 
does not allow it to be read directly. I was thinking of this too, as 
currently we are mixing up writing and reading:
   - Load a previously saved FST from a DataInput. This is read-only and is 
fine, and it's how a Tantivy FST is created as well.
   - Construct an FST on the fly and use it right away. This is both read and 
write, and it uses BytesStore.
   
   I'm leaning toward the way Tantivy does it; it's cleaner and more of a 
facade pattern. Maybe we could first refactor so that an FST created on the 
fly is written directly to a DataOutput, and then, instead of using it 
directly, we construct an FST from that DataOutput? From the Builder's 
point of view it will still create the FST eventually, but this could pave 
the way for segregating building and reading later (i.e. the Builder will 
only write to the DataOutput, and it's up to users to create the 
corresponding DataInput and construct an FST from it).
   
   UPDATE: The above seems to be a breaking change and might not be a small, 
incremental one, as NodeHash needs the FST to be initialized first so that it 
can search over the previously written nodes.
   
   If we do that, we can get rid of `getReverseBytesReader`. However, one 
issue remains: we still need `getReverseBytesReaderForSuffixSharing` for 
NodeHash, or at least some operation supporting random access. I think 
Tantivy uses an LRU cache for this in write-through mode: write the node 
into both the DataOutput and the LRU at the same time. This means we don't 
even need to read from the DataOutput, but the FST won't be perfectly 
minimal (as there will be false-negative cache misses). I understand the 
trade-off, but we might also need to support the current minimal mode.
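The write-through LRU idea described above could look roughly like this (a hypothetical sketch, not Tantivy's or Lucene's code; `NodeLruCache`, `lookup`, and `put` are invented names):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical write-through cache for node suffix sharing: each frozen
// node is appended to the output AND recorded here, keyed by a content
// hash. A miss may be a false negative (the entry was evicted), so the
// resulting FST is not guaranteed to be minimal.
class NodeLruCache {
    private final Map<Long, Long> hashToAddress; // content hash -> node address

    NodeLruCache(int maxEntries) {
        // accessOrder=true makes this an LRU; removeEldestEntry bounds it.
        this.hashToAddress = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, Long> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the address of an identical previously written node, or -1. */
    long lookup(long contentHash) {
        Long addr = hashToAddress.get(contentHash);
        return addr == null ? -1 : addr;
    }

    /** Record a freshly written node (write-through, alongside the DataOutput). */
    void put(long contentHash, long address) {
        hashToAddress.put(contentHash, address);
    }
}
```

With this scheme, NodeHash would never read the output back, at the cost of occasionally re-writing a node that was evicted from the cache.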






Re: [PR] Optimize OnHeapHnswGraph's data structure [lucene]

2023-10-12 Thread via GitHub


zhaih commented on code in PR #12651:
URL: https://github.com/apache/lucene/pull/12651#discussion_r1357264961


##
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java:
##
@@ -40,31 +41,29 @@ public final class OnHeapHnswGraph extends HnswGraph 
implements Accountable {
   // to vectors
   // added to HnswBuilder, and the node values are the ordinals of those 
vectors.
   // Thus, on all levels, neighbors expressed as the level 0's nodes' ordinals.
-  private final List graphLevel0;

Review Comment:
   > Hmm that is interesting but since it will usually be sparse we will still 
have a lot of mutations to do
   
   That's not the case, I think: we can always have the code insert the 
top-most level of a node first, so that we allocate that node's levels once 
and never need to resize the second dimension. And there's no wasted space 
in the second dimension, because we allocate exactly what each node needs. 
See `OnHeapHnswGraph#addNode`; I think that code explains itself.
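A hypothetical sketch of the ragged allocation described here (invented names, not the actual `OnHeapHnswGraph` code): inserting a node at its top level first allocates its second dimension exactly once.

```java
import java.util.Arrays;

// Sketch of a ragged 2-D layout: levels[node] is allocated lazily, sized
// by the node's top-most level, so no per-node resize ever happens.
class RaggedLevels {
    private int[][] levels = new int[16][];

    void addNode(int level, int node) {
        if (node >= levels.length) {
            // Only the first dimension ever grows.
            levels = Arrays.copyOf(levels, Math.max(node + 1, levels.length * 2));
        }
        if (levels[node] == null) {
            // First insertion must be at the node's top level: allocate
            // slots for all lower levels in one shot.
            levels[node] = new int[level + 1];
        }
        levels[node][level] = 1; // mark the node present on this level
    }

    int allocatedLevels(int node) {
        return levels[node] == null ? 0 : levels[node].length;
    }
}
```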





