Re: [PR] Specialize arc store for continuous label in FST [lucene]

2023-11-03 Thread via GitHub


easyice commented on PR #12748:
URL: https://github.com/apache/lucene/pull/12748#issuecomment-1791978766

   
   I ran this with wikimedium10m and wikimediumall; no significant performance 
improvement or regression was found. The total size of the `.tip` files is 
slightly reduced:
   
   |   | baseline | candidate  |
   | --- | --- | --- |
   | wikimedium10m  | 10280673 | 10275716 |
   | wikimediumall | 28530090 | 28496270 | 
   
   
   I counted the different `nodeFlags` for wikimedium10m:
   
   | strategies | count | percent |
   | --- | --- | --- |
   | ARCS_FOR_DIRECT_ADDRESSING | 558555 | 50.23% |
   | ARCS_FOR_CONTINUOUS | 25215 | 2.26% |
   | ARCS_FOR_BINARY_SEARCH | 9 | 0.00% |
   | Linear search(bytesPerArc:0) | 528100 | 47.49% |
   
   It seems the percentage of nodes hitting this optimization is small here, but 
the optimization targets nodes whose arc labels are dense, so I generated 10 
million random long values as terms:
   
   
   ```
   Random rand = new Random();
   for (int i = 0; i < 10_000_000; i++) {
     Document doc = new Document();
     doc.add(new StringField("f1", String.valueOf(rand.nextLong()), Store.NO));
     indexWriter.addDocument(doc);
   }
   ```
   
   With these random terms, this optimization is hit in most cases:
   
   | strategies | count | percent |
   | --- | --- | --- |
   | ARCS_FOR_DIRECT_ADDRESSING | 2469 | 2.58% |
   | ARCS_FOR_CONTINUOUS | 78732 | 82.45% |
   | ARCS_FOR_BINARY_SEARCH | 0 | 0.00% |
   | Linear search(bytesPerArc:0) | 14280 | 14.95% |
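The "continuous" case above can be characterized simply: a node qualifies when its sorted arc labels have no gaps, i.e. the label range equals the arc count, which is why digit-heavy terms (random longs printed as strings) hit it so often. A minimal sketch of the condition, mirroring the `labelRange == nodeIn.numArcs` check in `FSTCompiler` (the class and method names below are illustrative, not Lucene's actual API):

```java
// Sketch: a node's arcs can use the CONTINUOUS encoding when the labels form
// a gap-free ascending range, so only the first label and arc count matter.
public class ContinuousArcsDemo {
  static boolean isContinuous(int[] sortedLabels) {
    int labelRange = sortedLabels[sortedLabels.length - 1] - sortedLabels[0] + 1;
    return labelRange == sortedLabels.length; // no gaps between labels
  }

  public static void main(String[] args) {
    System.out.println(isContinuous(new int[] {'0', '1', '2', '3'})); // true
    System.out.println(isContinuous(new int[] {'a', 'c'}));           // false: 'b' missing
  }
}
```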
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add a specialized bulk scorer for regular conjunctions. [lucene]

2023-11-03 Thread via GitHub


jpountz commented on PR #12719:
URL: https://github.com/apache/lucene/pull/12719#issuecomment-1792001309

   Interestingly, it seems to also help with facets: 
http://people.apache.org/~mikemccand/lucenebench/AndHighHighDayTaxoFacets.html.





Re: [PR] Add a specialized bulk scorer for regular conjunctions. [lucene]

2023-11-03 Thread via GitHub


jpountz commented on PR #12719:
URL: https://github.com/apache/lucene/pull/12719#issuecomment-1792000689

   This yielded a good speedup on [nightly 
benchmarks](http://people.apache.org/~mikemccand/lucenebench/CountAndHighHigh.html).
 I pushed an annotation. 





Re: [PR] LUCENE-10560: Faster merging of TermsEnum [lucene]

2023-11-03 Thread via GitHub


jpountz merged PR #1052:
URL: https://github.com/apache/lucene/pull/1052





Re: [I] Can we speed up OrdinalMap construction? [LUCENE-10560] [lucene]

2023-11-03 Thread via GitHub


jpountz closed issue #11596: Can we speed up OrdinalMap construction? 
[LUCENE-10560]
URL: https://github.com/apache/lucene/issues/11596





Re: [PR] LUCENE-10560: Faster merging of TermsEnum [lucene]

2023-11-03 Thread via GitHub


mikemccand commented on PR #1052:
URL: https://github.com/apache/lucene/pull/1052#issuecomment-1792099793

   > +1 I fell a bit into a trap by trying to make long shared prefixes less 
adversarial. Let's do progress over perfection: start with a simple approach 
and look into whether/how we can better handle long sequences of shared 
prefixes.
   
   +1
   
   Thanks @jpountz!





Re: [I] TestIndexWriterOnVMError.testUnknownError timesout [lucene]

2023-11-03 Thread via GitHub


s1monw commented on issue #12654:
URL: https://github.com/apache/lucene/issues/12654#issuecomment-1792101597

   @dweiss I agree this is the problem. We should execute that 
`IOUtils.closeWhileHandlingException(readerPool, deleter, writeLock);` call in 
a try/finally block. I can open a PR for that unless you want to do it?





Re: [PR] Specialize arc store for continuous label in FST [lucene]

2023-11-03 Thread via GitHub


mikemccand commented on PR #12748:
URL: https://github.com/apache/lucene/pull/12748#issuecomment-1792198963

   I tested this PR using `IndexToFST` from `luceneutil`.  This just tests 
construction time and final FST size, on all `wikimediumall` unique terms, 
allowing up to 64 MB RAM while building the FST:
   
   `main`:
   
   ```
 saved FST to "fst.bin": 382070800 bytes; 44.146 sec
 saved FST to "fst.bin": 382070800 bytes; 43.478 sec
   ```
   
   
   PR:
   
   ```
 saved FST to "fst.bin": 381705016 bytes; 42.616 sec
 saved FST to "fst.bin": 381705016 bytes; 42.832 sec
   ```
   
   FST size is a wee bit smaller (~0.1%), and curiously the construction time 
seems to be faster too.





Re: [PR] Specialize arc store for continuous label in FST [lucene]

2023-11-03 Thread via GitHub


mikemccand commented on PR #12748:
URL: https://github.com/apache/lucene/pull/12748#issuecomment-1792202492

   I'll run `Test2BFST` too ... takes a few hours!
   





Re: [I] TestIndexWriterOnVMError.testUnknownError timesout [lucene]

2023-11-03 Thread via GitHub


dweiss commented on issue #12654:
URL: https://github.com/apache/lucene/issues/12654#issuecomment-1792205263

   Thanks, Simon. I'll open up a PR.





Re: [PR] Specialize arc store for continuous label in FST [lucene]

2023-11-03 Thread via GitHub


mikemccand commented on code in PR #12748:
URL: https://github.com/apache/lucene/pull/12748#discussion_r1381468008


##
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##
@@ -444,9 +446,15 @@ long addNode(FSTCompiler.UnCompiledNode nodeIn) throws 
IOException {
 
   int labelRange = nodeIn.arcs[nodeIn.numArcs - 1].label - 
nodeIn.arcs[0].label + 1;
   assert labelRange > 0;
-  if (shouldExpandNodeWithDirectAddressing(
+  boolean continuousLable = labelRange == nodeIn.numArcs;

Review Comment:
   Hmm can we please use the `continuousLabel` spelling instead :)  To be 
consistent...



##
lucene/core/src/java/org/apache/lucene/util/fst/FST.java:
##
@@ -96,6 +96,8 @@ public enum INPUT_TYPE {
*/
   static final byte ARCS_FOR_DIRECT_ADDRESSING = 1 << 6;
 
+  static final byte ARCS_FOR_CONTINUOUS = ARCS_FOR_DIRECT_ADDRESSING + 
ARCS_FOR_BINARY_SEARCH;

Review Comment:
   Could you add a comment explaining this arc optimization case?
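The diff above defines the new flag as the sum of the two existing ones. Assuming the values in `FST.java` at the time of this PR (`ARCS_FOR_BINARY_SEARCH == 1 << 5`, `ARCS_FOR_DIRECT_ADDRESSING == 1 << 6`; hedged, check the actual file), the sum sets both bits, so the combined flag is distinct from either component:

```java
// Sketch of the flag arithmetic; the assumed values are not guaranteed to
// match every Lucene version.
public class ArcFlagsDemo {
  static final byte ARCS_FOR_BINARY_SEARCH = 1 << 5;      // 32
  static final byte ARCS_FOR_DIRECT_ADDRESSING = 1 << 6;  // 64
  // Both bits set: a third, unambiguous node-flags value.
  static final byte ARCS_FOR_CONTINUOUS =
      ARCS_FOR_DIRECT_ADDRESSING + ARCS_FOR_BINARY_SEARCH; // 96

  public static void main(String[] args) {
    System.out.println(ARCS_FOR_CONTINUOUS); // prints 96
  }
}
```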






[PR] TestIndexWriterOnVMError.testUnknownError times out (fixes potential IW deadlock on tragic exceptions). [lucene]

2023-11-03 Thread via GitHub


dweiss opened a new pull request, #12751:
URL: https://github.com/apache/lucene/pull/12751

   Fixes #12654.





[PR] The default tests.multiplier passed from gradle was 1, but [lucene]

2023-11-03 Thread via GitHub


dweiss opened a new pull request, #12752:
URL: https://github.com/apache/lucene/pull/12752

   LuceneTestCase tried to compute its default value from TESTS_NIGHTLY. This 
could lead to subtle errors: nightly-mode failures would not report 
tests.multiplier=1, and when started from the IDE, tests.multiplier would be 
set to 2 (leading to different randomness).
   
   LuceneTestCase now compares the actual value of the multiplier to the 
"default" computed based on TESTS_NIGHTLY. I also opted to change gradle's 
test defaults to _not_ pass tests.multiplier at all, unless specified 
explicitly.
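The logic described above can be sketched as follows. This is a hedged model of the described behavior, not the actual `LuceneTestCase` code; the assumption that nightly runs default to a multiplier of 2 comes from the comment itself:

```java
// Model: derive the default multiplier from the nightly flag, and treat a
// passed-in multiplier as an explicit override only if it differs from that
// computed default.
public class MultiplierDemo {
  static int defaultMultiplier(boolean nightly) {
    return nightly ? 2 : 1; // assumption: nightly defaults to 2, otherwise 1
  }

  static boolean isExplicitOverride(Integer passed, boolean nightly) {
    // null means gradle did not pass tests.multiplier at all
    return passed != null && passed != defaultMultiplier(nightly);
  }

  public static void main(String[] args) {
    System.out.println(isExplicitOverride(null, true));  // false: nothing passed
    System.out.println(isExplicitOverride(2, true));     // false: equals nightly default
    System.out.println(isExplicitOverride(2, false));    // true: differs from default 1
  }
}
```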





Re: [PR] Use value-based LRU cache in NodeHash [lucene]

2023-11-03 Thread via GitHub


mikemccand commented on code in PR #12738:
URL: https://github.com/apache/lucene/pull/12738#discussion_r1381520513


##
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##
@@ -328,7 +298,100 @@ private void rehash(long lastNodeAddress) throws 
IOException {
   }
 
   mask = newMask;
-  entries = newEntries;
+  fstNodeAddress = newEntries;
+  copiedNodeAddress = newCopiedOffsets;
+}
+
+// hash code for a frozen node.  this must precisely match the hash 
computation of an unfrozen
+// node!
+private long hash(long node, long pos) throws IOException {
+  FST.BytesReader in = getBytesReader(node, pos);

Review Comment:
   Hmm that's a great point.  Let's leave it as is then (caller does not pass 
the reader).  Thanks @dungba88.






Re: [PR] Specialize arc store for continuous label in FST [lucene]

2023-11-03 Thread via GitHub


mikemccand commented on PR #12748:
URL: https://github.com/apache/lucene/pull/12748#issuecomment-1792271552

   `Test2BFST` is happy:
   
   ```
   BUILD SUCCESSFUL in 50m 6s
   ```
   





Re: [PR] Specialize arc store for continuous label in FST [lucene]

2023-11-03 Thread via GitHub


gf2121 commented on code in PR #12748:
URL: https://github.com/apache/lucene/pull/12748#discussion_r1381513302


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsReader.java:
##
@@ -86,8 +86,11 @@ public final class Lucene90BlockTreeTermsReader extends 
FieldsProducer {
*/
   public static final int VERSION_MSB_VLONG_OUTPUT = 1;
 
+  /** Version that store continuous arcs label as range in FST. */
+  public static final int VERSION_ARCS_CONTINUOUS = 2;
+
   /** Current terms format. */
-  public static final int VERSION_CURRENT = VERSION_MSB_VLONG_OUTPUT;
+  public static final int VERSION_CURRENT = VERSION_ARCS_CONTINUOUS;

Review Comment:
   I'm a bit confused: why do we need to add a new version for the block tree 
index?






Re: [PR] TestIndexWriterOnVMError.testUnknownError times out (fixes potential IW deadlock on tragic exceptions). [lucene]

2023-11-03 Thread via GitHub


s1monw commented on code in PR #12751:
URL: https://github.com/apache/lucene/pull/12751#discussion_r1381543043


##
lucene/core/src/java/org/apache/lucene/index/IndexWriter.java:
##
@@ -2560,10 +2560,15 @@ private void rollbackInternalNoCommit() throws 
IOException {
 
   // close all the closeables we can (but important is readerPool and 
writeLock to prevent
   // leaks)
-  IOUtils.closeWhileHandlingException(readerPool, deleter, writeLock);
-  writeLock = null;
-  closed = true;
-  closing = false;
+  try {
+IOUtils.closeWhileHandlingException(readerPool, deleter, 
writeLock);
+  } catch (Throwable t) {
+throwable.addSuppressed(t);
+  } finally {
+writeLock = null;
+closed = true;
+closing = false;
+  }
 
   // So any "concurrently closing" threads wake up and see that the 
close has now completed:
   notifyAll();

Review Comment:
   IMO we should also notify all threads waiting on the monitor in the case of 
an exception.
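The pattern being discussed can be sketched as follows. This is an illustrative model, not the actual `IndexWriter` code: `closeQuietly` stands in for `IOUtils.closeWhileHandlingException(...)`, and the state fields are simplified:

```java
// Model: record cleanup failures as suppressed exceptions on the original
// tragic throwable, and always clear state and wake waiters in finally,
// even when cleanup itself throws.
public class RollbackDemo {
  static final Object monitor = new Object();
  static volatile boolean closed, closing;

  static void rollbackCleanup(Throwable tragic, Runnable closeAll) {
    synchronized (monitor) {
      try {
        closeAll.run(); // stand-in for IOUtils.closeWhileHandlingException(...)
      } catch (Throwable t) {
        tragic.addSuppressed(t);
      } finally {
        closed = true;
        closing = false;
        monitor.notifyAll(); // wake "concurrently closing" threads in all cases
      }
    }
  }

  public static void main(String[] args) {
    Throwable tragic = new RuntimeException("tragedy");
    rollbackCleanup(tragic, () -> { throw new RuntimeException("cleanup failed"); });
    System.out.println(closed);                        // true
    System.out.println(tragic.getSuppressed().length); // 1
  }
}
```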



##
lucene/core/src/java/org/apache/lucene/index/IndexWriter.java:
##
@@ -2560,10 +2560,15 @@ private void rollbackInternalNoCommit() throws 
IOException {
 
   // close all the closeables we can (but important is readerPool and 
writeLock to prevent
   // leaks)
-  IOUtils.closeWhileHandlingException(readerPool, deleter, writeLock);
-  writeLock = null;
-  closed = true;
-  closing = false;
+  try {
+IOUtils.closeWhileHandlingException(readerPool, deleter, 
writeLock);
+  } catch (Throwable t) {

Review Comment:
   I don't think we need that; won't it be handled in the block below?






Re: [PR] Use value-based LRU cache in NodeHash [lucene]

2023-11-03 Thread via GitHub


mikemccand commented on code in PR #12738:
URL: https://github.com/apache/lucene/pull/12738#discussion_r1381538494


##
lucene/core/src/java/org/apache/lucene/util/fst/ByteBlockPoolReverseBytesReader.java:
##
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.fst;
+
+import java.io.IOException;
+import org.apache.lucene.util.ByteBlockPool;
+
+/** Reads in reverse from a ByteBlockPool. */
+final class ByteBlockPoolReverseBytesReader extends FST.BytesReader {
+
+  private final ByteBlockPool buf;
+  // the difference between the FST node address and the hash table copied 
node address
+  private long posDelta;
+  private long pos;
+
+  public ByteBlockPoolReverseBytesReader(ByteBlockPool buf) {
+this.buf = buf;
+  }
+
+  @Override
+  public byte readByte() {
+return buf.readByte(pos--);
+  }
+
+  @Override
+  public void readBytes(byte[] b, int offset, int len) {
+for (int i = 0; i < len; i++) {
+  b[offset + i] = buf.readByte(pos--);
+}
+  }
+
+  @Override
+  public void skipBytes(long numBytes) throws IOException {
+pos -= numBytes;
+  }
+
+  @Override
+  public long getPosition() {
+return pos + posDelta;
+  }
+
+  @Override
+  public void setPosition(long pos) {
+this.pos = pos - posDelta;
+  }
+
+  @Override
+  public boolean reversed() {

Review Comment:
   I wonder why `FST.BytesReader` even has this method?  It might be a holdover 
(now dead?) from the `pack` days (long ago removed).  But we should not try to 
fix it here ... this change is awesome enough already!
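The reverse-reading behavior of the class above can be modeled in a few lines: `readByte()` returns the byte at `pos` and then steps `pos` backwards, so a node that was written in reverse reads out in its original order (illustrative sketch, not the Lucene class):

```java
// Tiny model of ByteBlockPoolReverseBytesReader's read direction.
public class ReverseReaderDemo {
  static byte[] buf = {10, 20, 30};
  static int pos = 2; // start at the last written byte

  static byte readByte() {
    return buf[pos--]; // consume, then move backwards
  }

  public static void main(String[] args) {
    System.out.println(readByte()); // 30
    System.out.println(readByte()); // 20
    System.out.println(readByte()); // 10
  }
}
```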



##
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##
@@ -186,149 +218,228 @@ private long hash(FSTCompiler.UnCompiledNode node) {
 return h;
   }
 
-  // hash code for a frozen node.  this must precisely match the hash 
computation of an unfrozen
-  // node!
-  private long hash(long node) throws IOException {
-final int PRIME = 31;
-
-long h = 0;
-fstCompiler.fst.readFirstRealTargetArc(node, scratchArc, in);
-while (true) {
-  h = PRIME * h + scratchArc.label();
-  h = PRIME * h + (int) (scratchArc.target() ^ (scratchArc.target() >> 
32));
-  h = PRIME * h + scratchArc.output().hashCode();
-  h = PRIME * h + scratchArc.nextFinalOutput().hashCode();
-  if (scratchArc.isFinal()) {
-h += 17;
-  }
-  if (scratchArc.isLast()) {
-break;
-  }
-  fstCompiler.fst.readNextRealArc(scratchArc, in);
-}
-
-return h;
-  }
-
-  /**
-   * Compares an unfrozen node (UnCompiledNode) with a frozen node at byte 
location address (long),
-   * returning true if they are equal.
-   */
-  private boolean nodesEqual(FSTCompiler.UnCompiledNode node, long address) 
throws IOException {
-fstCompiler.fst.readFirstRealTargetArc(address, scratchArc, in);
-
-// fail fast for a node with fixed length arcs
-if (scratchArc.bytesPerArc() != 0) {
-  assert node.numArcs > 0;
-  // the frozen node uses fixed-with arc encoding (same number of bytes 
per arc), but may be
-  // sparse or dense
-  switch (scratchArc.nodeFlags()) {
-case FST.ARCS_FOR_BINARY_SEARCH:
-  // sparse
-  if (node.numArcs != scratchArc.numArcs()) {
-return false;
-  }
-  break;
-case FST.ARCS_FOR_DIRECT_ADDRESSING:
-  // dense -- compare both the number of labels allocated in the array 
(some of which may
-  // not actually be arcs), and the number of arcs
-  if ((node.arcs[node.numArcs - 1].label - node.arcs[0].label + 1) != 
scratchArc.numArcs()
-  || node.numArcs != FST.Arc.BitTable.countBits(scratchArc, in)) {
-return false;
-  }
-  break;
-default:
-  throw new AssertionError("unhandled scratchArc.nodeFlag() " + 
scratchArc.nodeFlags());
-  }
-}
-
-// compare arc by arc to see if there is a difference
-for (int arcUpto = 0; arcUpto < node.numArcs; arcUpto++) {
-  final FSTCompiler.Arc arc = node.arcs[arcUpto];
-  if (arc.label != scratchArc.label()
-  || arc.output.equals(scratchArc.output()) == false
-  || ((FSTCompiler.CompiledNode) a

Re: [PR] Use value-based LRU cache in NodeHash [lucene]

2023-11-03 Thread via GitHub


mikemccand commented on PR #12738:
URL: https://github.com/apache/lucene/pull/12738#issuecomment-1792293648

   Thanks @dungba88!
   
   I confirmed that `IndexToFST` now works again and, when given "up to" `inf` 
RAM, it produces the same-sized minimal `fst.bin` as `main`, at `367244208 
bytes`.
   
   It's a bit slower than `main`, but that's to be expected and I think OK.  We 
can optimize later ... being able to fully cap RAM usage, no matter how big an 
FST you produce, is worth this tradeoff.  (Note: we still need to fix FST 
writing to spool to disk to achieve RAM capping, the other PR @dungba88 is 
tackling -- thank you!).
   
   `main`:
   
   ```
 saved FST to "fst.bin": 367244208 bytes; 47.758 sec
 saved FST to "fst.bin": 367244208 bytes; 48.134 sec
   ```
   
   This PR:
   
   ```
 saved FST to "fst.bin": 367244208 bytes; 54.765 sec
 saved FST to "fst.bin": 367244208 bytes; 58.616 sec
   ```
   





Re: [PR] Use value-based LRU cache in NodeHash [lucene]

2023-11-03 Thread via GitHub


dungba88 commented on code in PR #12738:
URL: https://github.com/apache/lucene/pull/12738#discussion_r1381559347


##
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##
@@ -186,149 +218,228 @@ private long hash(FSTCompiler.UnCompiledNode node) {
 return h;
   }
 
-  // hash code for a frozen node.  this must precisely match the hash 
computation of an unfrozen
-  // node!
-  private long hash(long node) throws IOException {
-final int PRIME = 31;
-
-long h = 0;
-fstCompiler.fst.readFirstRealTargetArc(node, scratchArc, in);
-while (true) {
-  h = PRIME * h + scratchArc.label();
-  h = PRIME * h + (int) (scratchArc.target() ^ (scratchArc.target() >> 
32));
-  h = PRIME * h + scratchArc.output().hashCode();
-  h = PRIME * h + scratchArc.nextFinalOutput().hashCode();
-  if (scratchArc.isFinal()) {
-h += 17;
-  }
-  if (scratchArc.isLast()) {
-break;
-  }
-  fstCompiler.fst.readNextRealArc(scratchArc, in);
-}
-
-return h;
-  }
-
-  /**
-   * Compares an unfrozen node (UnCompiledNode) with a frozen node at byte 
location address (long),
-   * returning true if they are equal.
-   */
-  private boolean nodesEqual(FSTCompiler.UnCompiledNode node, long address) 
throws IOException {
-fstCompiler.fst.readFirstRealTargetArc(address, scratchArc, in);
-
-// fail fast for a node with fixed length arcs
-if (scratchArc.bytesPerArc() != 0) {
-  assert node.numArcs > 0;
-  // the frozen node uses fixed-with arc encoding (same number of bytes 
per arc), but may be
-  // sparse or dense
-  switch (scratchArc.nodeFlags()) {
-case FST.ARCS_FOR_BINARY_SEARCH:
-  // sparse
-  if (node.numArcs != scratchArc.numArcs()) {
-return false;
-  }
-  break;
-case FST.ARCS_FOR_DIRECT_ADDRESSING:
-  // dense -- compare both the number of labels allocated in the array 
(some of which may
-  // not actually be arcs), and the number of arcs
-  if ((node.arcs[node.numArcs - 1].label - node.arcs[0].label + 1) != 
scratchArc.numArcs()
-  || node.numArcs != FST.Arc.BitTable.countBits(scratchArc, in)) {
-return false;
-  }
-  break;
-default:
-  throw new AssertionError("unhandled scratchArc.nodeFlag() " + 
scratchArc.nodeFlags());
-  }
-}
-
-// compare arc by arc to see if there is a difference
-for (int arcUpto = 0; arcUpto < node.numArcs; arcUpto++) {
-  final FSTCompiler.Arc arc = node.arcs[arcUpto];
-  if (arc.label != scratchArc.label()
-  || arc.output.equals(scratchArc.output()) == false
-  || ((FSTCompiler.CompiledNode) arc.target).node != 
scratchArc.target()
-  || arc.nextFinalOutput.equals(scratchArc.nextFinalOutput()) == false
-  || arc.isFinal != scratchArc.isFinal()) {
-return false;
-  }
-
-  if (scratchArc.isLast()) {
-if (arcUpto == node.numArcs - 1) {
-  return true;
-} else {
-  return false;
-}
-  }
-
-  fstCompiler.fst.readNextRealArc(scratchArc, in);
-}
-
-// unfrozen node has fewer arcs than frozen node
-
-return false;
-  }
-
   /** Inner class because it needs access to hash function and FST bytes. */
   private class PagedGrowableHash {
-private PagedGrowableWriter entries;
+// total bytes copied out of the FST byte[] into this hash table 
ByteBlockPool for suffix nodes
+private long copiedBytes;
+// storing the FST node address where the position is the masked hash of 
the node arcs
+private PagedGrowableWriter fstNodeAddress;
+// storing the local copiedNodes address in the same position as 
fstNodeAddress
+// here we are effectively storing a Map from the FST node 
address to copiedNodes
+// address
+private PagedGrowableWriter copiedNodeAddress;
 private long count;
 private long mask;
+// storing the byte slice from the FST for nodes we added to the hash so 
that we don't need to
+// look up from the FST itself, so the FST bytes can stream directly to 
disk as append-only
+// writes.
+// each node will be written subsequently
+private final ByteBlockPool copiedNodes;
+// the {@link FST.BytesReader} to read from copiedNodes. we use this when 
computing a frozen
+// node hash
+// or comparing if a frozen and unfrozen nodes are equal
+private final ByteBlockPoolReverseBytesReader bytesReader;
 
 // 256K blocks, but note that the final block is sized only as needed so 
it won't use the full
 // block size when just a few elements were written to it
 private static final int BLOCK_SIZE_BYTES = 1 << 18;
 
 public PagedGrowableHash() {
-  entries = new PagedGrowableWriter(16, BLOCK_SIZE_BYTES, 8, 
PackedInts.COMPACT);
+  fstNodeAddress = new PagedGrowableWriter(16, BLOCK_SIZE_BYTES, 8, 
PackedInts.COMPACT);
+  copiedNodeAddress = new PagedGrowableWriter(16, BLOCK_SIZE_BYTES, 8, 
PackedInts.COMPACT);

Re: [PR] Use value-based LRU cache in NodeHash [lucene]

2023-11-03 Thread via GitHub


mikemccand commented on code in PR #12738:
URL: https://github.com/apache/lucene/pull/12738#discussion_r1381564163


##
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##
@@ -110,25 +117,39 @@ public long add(FSTCompiler.UnCompiledNode nodeIn) 
throws IOException {
 node = getFallback(nodeIn, hash);
 if (node != 0) {
   // it was already in fallback -- promote to primary
-  primaryTable.set(pos, node);
+  // TODO: Copy directly between 2 ByteBlockPool to avoid double-copy
+  primaryTable.set(pos, node, fallbackTable.getBytes(pos, 
lastFallbackNodeLength));
 } else {
   // not in fallback either -- freeze & add the incoming node
 
+  long startAddress = fstCompiler.bytes.getPosition();
   // freeze & add
   node = fstCompiler.addNode(nodeIn);
 
+  // TODO: Write the bytes directly from BytesStore
   // we use 0 as empty marker in hash table, so it better be 
impossible to get a frozen node
   // at 0:
-  assert node != 0;
+  assert node != FST.FINAL_END_NODE && node != FST.NON_FINAL_END_NODE;
+  byte[] buf = new byte[Math.toIntExact(node - startAddress + 1)];

Review Comment:
   Brain hurts!  Reverse off-by-one error in my brain :)
   
   What confuses me here is if `fstCompiler.addNode` writes 3 bytes, won't we 
compute a length of 4?
   
   Oh, I see!  `FSTCompiler#addNode` has this at the end:
   
   ```
   final long thisNodeAddress = bytes.getPosition() - 1;
   bytes.reverse(startAddress, thisNodeAddress);
   nodeCount++;
   return thisNodeAddress;
   ```
   
   So the last byte address it returns is inclusive, since it did the `-1`, so 
we have to +1 to undo that. OK, I think it makes sense ;)
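A worked example of the inclusive-address arithmetic described above, with illustrative numbers (not taken from the PR): `addNode` returns the inclusive address of the last byte written, so the copied length is `node - startAddress + 1`.

```java
// If 3 bytes are written starting at address 100, they occupy 100..102 and
// addNode returns 102 (inclusive); the +1 undoes addNode's internal -1.
public class NodeLengthDemo {
  static int nodeLength(long startAddress, long node) {
    return Math.toIntExact(node - startAddress + 1);
  }

  public static void main(String[] args) {
    long startAddress = 100; // bytes.getPosition() before the write
    long node = 102;         // inclusive last address returned by addNode
    System.out.println(nodeLength(startAddress, node)); // 3
  }
}
```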
   






Re: [PR] Use value-based LRU cache in NodeHash [lucene]

2023-11-03 Thread via GitHub


dungba88 commented on code in PR #12738:
URL: https://github.com/apache/lucene/pull/12738#discussion_r1381565556


##
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##
@@ -186,149 +218,228 @@ private long hash(FSTCompiler.UnCompiledNode node) {
 return h;
   }
 
-  // hash code for a frozen node.  this must precisely match the hash 
computation of an unfrozen
-  // node!
-  private long hash(long node) throws IOException {
-final int PRIME = 31;
-
-long h = 0;
-fstCompiler.fst.readFirstRealTargetArc(node, scratchArc, in);
-while (true) {
-  h = PRIME * h + scratchArc.label();
-  h = PRIME * h + (int) (scratchArc.target() ^ (scratchArc.target() >> 
32));
-  h = PRIME * h + scratchArc.output().hashCode();
-  h = PRIME * h + scratchArc.nextFinalOutput().hashCode();
-  if (scratchArc.isFinal()) {
-h += 17;
-  }
-  if (scratchArc.isLast()) {
-break;
-  }
-  fstCompiler.fst.readNextRealArc(scratchArc, in);
-}
-
-return h;
-  }
-
-  /**
-   * Compares an unfrozen node (UnCompiledNode) with a frozen node at byte 
location address (long),
-   * returning true if they are equal.
-   */
-  private boolean nodesEqual(FSTCompiler.UnCompiledNode node, long address) 
throws IOException {
-fstCompiler.fst.readFirstRealTargetArc(address, scratchArc, in);
-
-// fail fast for a node with fixed length arcs
-if (scratchArc.bytesPerArc() != 0) {
-  assert node.numArcs > 0;
-  // the frozen node uses fixed-with arc encoding (same number of bytes 
per arc), but may be
-  // sparse or dense
-  switch (scratchArc.nodeFlags()) {
-case FST.ARCS_FOR_BINARY_SEARCH:
-  // sparse
-  if (node.numArcs != scratchArc.numArcs()) {
-return false;
-  }
-  break;
-case FST.ARCS_FOR_DIRECT_ADDRESSING:
-  // dense -- compare both the number of labels allocated in the array 
(some of which may
-  // not actually be arcs), and the number of arcs
-  if ((node.arcs[node.numArcs - 1].label - node.arcs[0].label + 1) != 
scratchArc.numArcs()
-  || node.numArcs != FST.Arc.BitTable.countBits(scratchArc, in)) {
-return false;
-  }
-  break;
-default:
-  throw new AssertionError("unhandled scratchArc.nodeFlag() " + 
scratchArc.nodeFlags());
-  }
-}
-
-// compare arc by arc to see if there is a difference
-for (int arcUpto = 0; arcUpto < node.numArcs; arcUpto++) {
-  final FSTCompiler.Arc arc = node.arcs[arcUpto];
-  if (arc.label != scratchArc.label()
-  || arc.output.equals(scratchArc.output()) == false
-  || ((FSTCompiler.CompiledNode) arc.target).node != 
scratchArc.target()
-  || arc.nextFinalOutput.equals(scratchArc.nextFinalOutput()) == false
-  || arc.isFinal != scratchArc.isFinal()) {
-return false;
-  }
-
-  if (scratchArc.isLast()) {
-if (arcUpto == node.numArcs - 1) {
-  return true;
-} else {
-  return false;
-}
-  }
-
-  fstCompiler.fst.readNextRealArc(scratchArc, in);
-}
-
-// unfrozen node has fewer arcs than frozen node
-
-return false;
-  }
-
   /** Inner class because it needs access to hash function and FST bytes. */
   private class PagedGrowableHash {
-private PagedGrowableWriter entries;
+// total bytes copied out of the FST byte[] into this hash table 
ByteBlockPool for suffix nodes
+private long copiedBytes;
+// storing the FST node address where the position is the masked hash of 
the node arcs
+private PagedGrowableWriter fstNodeAddress;
+// storing the local copiedNodes address in the same position as 
fstNodeAddress
+// here we are effectively storing a Map from the FST node 
address to copiedNodes
+// address
+private PagedGrowableWriter copiedNodeAddress;
 private long count;
 private long mask;
+// storing the byte slice from the FST for nodes we added to the hash so 
that we don't need to
+// look up from the FST itself, so the FST bytes can stream directly to 
disk as append-only
+// writes.
+// each node will be written subsequently
+private final ByteBlockPool copiedNodes;
+// the {@link FST.BytesReader} to read from copiedNodes. we use this when 
computing a frozen
+// node hash
+// or comparing if a frozen and unfrozen nodes are equal
+private final ByteBlockPoolReverseBytesReader bytesReader;
 
 // 256K blocks, but note that the final block is sized only as needed so 
it won't use the full
 // block size when just a few elements were written to it
 private static final int BLOCK_SIZE_BYTES = 1 << 18;
 
 public PagedGrowableHash() {
-  entries = new PagedGrowableWriter(16, BLOCK_SIZE_BYTES, 8, 
PackedInts.COMPACT);
+  fstNodeAddress = new PagedGrowableWriter(16, BLOCK_SIZE_BYTES, 8, 
PackedInts.COMPACT);
+  co

Re: [PR] Use value-based LRU cache in NodeHash [lucene]

2023-11-03 Thread via GitHub


mikemccand commented on code in PR #12738:
URL: https://github.com/apache/lucene/pull/12738#discussion_r1381570983


##
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##
@@ -186,149 +218,228 @@ private long hash(FSTCompiler.UnCompiledNode node) {
 return h;
   }
 
-  // hash code for a frozen node.  this must precisely match the hash 
computation of an unfrozen
-  // node!
-  private long hash(long node) throws IOException {
-final int PRIME = 31;
-
-long h = 0;
-fstCompiler.fst.readFirstRealTargetArc(node, scratchArc, in);
-while (true) {
-  h = PRIME * h + scratchArc.label();
-  h = PRIME * h + (int) (scratchArc.target() ^ (scratchArc.target() >> 
32));
-  h = PRIME * h + scratchArc.output().hashCode();
-  h = PRIME * h + scratchArc.nextFinalOutput().hashCode();
-  if (scratchArc.isFinal()) {
-h += 17;
-  }
-  if (scratchArc.isLast()) {
-break;
-  }
-  fstCompiler.fst.readNextRealArc(scratchArc, in);
-}
-
-return h;
-  }
-
-  /**
-   * Compares an unfrozen node (UnCompiledNode) with a frozen node at byte 
location address (long),
-   * returning true if they are equal.
-   */
-  private boolean nodesEqual(FSTCompiler.UnCompiledNode node, long address) 
throws IOException {
-fstCompiler.fst.readFirstRealTargetArc(address, scratchArc, in);
-
-// fail fast for a node with fixed length arcs
-if (scratchArc.bytesPerArc() != 0) {
-  assert node.numArcs > 0;
-  // the frozen node uses fixed-with arc encoding (same number of bytes 
per arc), but may be
-  // sparse or dense
-  switch (scratchArc.nodeFlags()) {
-case FST.ARCS_FOR_BINARY_SEARCH:
-  // sparse
-  if (node.numArcs != scratchArc.numArcs()) {
-return false;
-  }
-  break;
-case FST.ARCS_FOR_DIRECT_ADDRESSING:
-  // dense -- compare both the number of labels allocated in the array 
(some of which may
-  // not actually be arcs), and the number of arcs
-  if ((node.arcs[node.numArcs - 1].label - node.arcs[0].label + 1) != 
scratchArc.numArcs()
-  || node.numArcs != FST.Arc.BitTable.countBits(scratchArc, in)) {
-return false;
-  }
-  break;
-default:
-  throw new AssertionError("unhandled scratchArc.nodeFlag() " + 
scratchArc.nodeFlags());
-  }
-}
-
-// compare arc by arc to see if there is a difference
-for (int arcUpto = 0; arcUpto < node.numArcs; arcUpto++) {
-  final FSTCompiler.Arc arc = node.arcs[arcUpto];
-  if (arc.label != scratchArc.label()
-  || arc.output.equals(scratchArc.output()) == false
-  || ((FSTCompiler.CompiledNode) arc.target).node != 
scratchArc.target()
-  || arc.nextFinalOutput.equals(scratchArc.nextFinalOutput()) == false
-  || arc.isFinal != scratchArc.isFinal()) {
-return false;
-  }
-
-  if (scratchArc.isLast()) {
-if (arcUpto == node.numArcs - 1) {
-  return true;
-} else {
-  return false;
-}
-  }
-
-  fstCompiler.fst.readNextRealArc(scratchArc, in);
-}
-
-// unfrozen node has fewer arcs than frozen node
-
-return false;
-  }
-
   /** Inner class because it needs access to hash function and FST bytes. */
   private class PagedGrowableHash {
-private PagedGrowableWriter entries;
+// total bytes copied out of the FST byte[] into this hash table 
ByteBlockPool for suffix nodes
+private long copiedBytes;
+// storing the FST node address where the position is the masked hash of 
the node arcs
+private PagedGrowableWriter fstNodeAddress;
+// storing the local copiedNodes address in the same position as 
fstNodeAddress
+// here we are effectively storing a Map from the FST node 
address to copiedNodes
+// address
+private PagedGrowableWriter copiedNodeAddress;
 private long count;
 private long mask;
+// storing the byte slice from the FST for nodes we added to the hash so 
that we don't need to
+// look up from the FST itself, so the FST bytes can stream directly to 
disk as append-only
+// writes.
+// each node will be written subsequently
+private final ByteBlockPool copiedNodes;
+// the {@link FST.BytesReader} to read from copiedNodes. we use this when 
computing a frozen
+// node hash
+// or comparing if a frozen and unfrozen nodes are equal
+private final ByteBlockPoolReverseBytesReader bytesReader;
 
 // 256K blocks, but note that the final block is sized only as needed so 
it won't use the full
 // block size when just a few elements were written to it
 private static final int BLOCK_SIZE_BYTES = 1 << 18;
 
 public PagedGrowableHash() {
-  entries = new PagedGrowableWriter(16, BLOCK_SIZE_BYTES, 8, 
PackedInts.COMPACT);
+  fstNodeAddress = new PagedGrowableWriter(16, BLOCK_SIZE_BYTES, 8, 
PackedInts.COMPACT);
+  

Re: [PR] Specialize arc store for continuous label in FST [lucene]

2023-11-03 Thread via GitHub


easyice commented on PR #12748:
URL: https://github.com/apache/lucene/pull/12748#issuecomment-1792322904

   @mikemccand Thanks for the benchmarking, I also wrote 10 million docs of 
random long values, then used `TermInSetQuery` for benchmarking. Here is the 
result:
   
   The `.tip` file size is reduced by ~2%:
   
   | | size |
   | --- | --- |
   | main | 1807149 |
   | PR | 1770259 |
   
   Query latency is reduced by ~7%. `termsCount` is the number of terms in 
`TermInSetQuery`, `hitRatio` is the percentage of the terms that will be hit. 
There is a bit of variance across runs, but the results look good overall.
   
   | hitRatio  | termsCount | tookMs(main) | tookMs(PR) | diff |
   | --- |  --- | --- | --- | --- |
   | 1%  | 64 | 177 | 164 | 92.66% |
   | 1%  | 512 | 1380 | 1312 | 95.07% |
   | 1%  | 2048 | 5225 | 5022 | 96.11% |
   | 25%  | 64 | 222 | 212 | 95.50% |
   | 25%  | 512 | 1462 | 1391 | 95.14% |
   | 25%  | 2048 | 5602 | 5533 | 98.77% |
   | 50%  | 64 | 216 | 204 | 94.44% |
   | 50%  | 512 | 1600 | 1513 | 94.56% |
   | 50%  | 2048 | 6193 | 5883 | 94.99% |
   | 75%  | 64 | 224 | 213 | 95.09% |
   | 75%  | 512 | 1702 | 1598 | 93.89% |
   | 75%  | 2048 | 6565 | 6289 | 95.80% |
   | 100%  | 64 | 233 | 218 | 93.56% |
   | 100%  | 512 | 1752 | 1736 | 99.09% |
   | 100%  | 2048 | 7057 | 6621 | 93.82% |
   
   Crude benchmark code:
   
   ```
   public static long doSearch(int termCount, int hitRatio) throws IOException {
   Directory directory = 
FSDirectory.open(Paths.get("/Volumes/RamDisk/longdata"));
   IndexReader indexReader = DirectoryReader.open(directory);
   IndexSearcher searcher = new IndexSearcher(indexReader);
   searcher.setQueryCachingPolicy(
   new QueryCachingPolicy() {
   @Override
   public void onUse(Query query) {
   }
   
   @Override
   public boolean shouldCache(Query query) throws 
IOException {
   return false;
   }
   });
   
   long total = 0;
   Query query = getQuery(termCount, hitRatio);
   for (int i = 0; i < 1000; i++) {
   long start = System.currentTimeMillis();
   doQuery(searcher, query);
   long end = System.currentTimeMillis();
   total += end - start;
   }
   //System.out.println("term count: " + termCount + ", took(ms): " + 
total);
   indexReader.close();
   directory.close();
   return total;
   }
   
   private static Query getQuery(int termCount, int hitRatio) {
   int hitCount = termCount * hitRatio / 100;
   int notHitCount = termCount - hitCount;
   List<BytesRef> terms = new ArrayList<>();
   for (int i = 0; i < hitCount; i++) {
   terms.add(new BytesRef(Long.toString(longs.get(RANDOM.nextInt(longs.size() - 1)))));
   }
   
   Random r = new Random();
   for (int i = 0; i < notHitCount; i++) {
   long v = r.nextLong();
   while (uniqueLongs.contains(v)) {
   v = r.nextLong();
   }
   terms.add(new BytesRef(Long.toString(v)));
   }
   return new TermInSetQuery(FIELD, terms);
   }
   
   private static void doQuery(IndexSearcher searcher, Query query) throws 
IOException {
   searcher.search(
   query,
   new Collector() {
   @Override
   public LeafCollector getLeafCollector(LeafReaderContext 
context) throws IOException {
   return new LeafCollector() {
   @Override
   public void setScorer(Scorable scorer) throws 
IOException {
   }
   
   @Override
   public void collect(int doc) throws IOException {
   throw new CollectionTerminatedException();
   }
   };
   }
   
   @Override
   public ScoreMode scoreMode() {
   return ScoreMode.COMPLETE_NO_SCORES;
   }
   });
   }
   ```





Re: [PR] Use value-based LRU cache in NodeHash [lucene]

2023-11-03 Thread via GitHub


mikemccand commented on PR #12738:
URL: https://github.com/apache/lucene/pull/12738#issuecomment-1792327869

   `Test2BFST` is happy, yay!
   
   ```
   BUILD SUCCESSFUL in 56m 36s
   ```





Re: [PR] Random access term dictionary [lucene]

2023-11-03 Thread via GitHub


mikemccand commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1381580022


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermType.java:
##
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.util.Objects;
+import 
org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat.IntBlockTermState;
+
+/**
+ * TermType holds the classification of a term, based on how its postings are 
written.
+ *
+ * It captures -- 1) if a term has a singleton docid (i.e. only one doc 
contains this term). 2)
+ * if the term has skip data. 3) if the term as an VINT encoded position block.

Review Comment:
   `term as a` -> `term has a`?



##
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermStateCodecImpl.java:
##
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import 
org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat.IntBlockTermState;
+import 
org.apache.lucene.sandbox.codecs.lucene90.randomaccess.bitpacking.BitPacker;
+import 
org.apache.lucene.sandbox.codecs.lucene90.randomaccess.bitpacking.BitUnpacker;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.store.ByteArrayDataOutput;
+import org.apache.lucene.util.BytesRef;
+
+final class TermStateCodecImpl implements TermStateCodec {
+  private final TermStateCodecComponent[] components;
+  private final int metadataBytesLength;
+
+  private static int getMetadataLength(TermStateCodecComponent component) {
+// 1 byte for bitWidth; optionally 8 byte more for the reference value
+return 1 + (component.isMonotonicallyIncreasing() ? 8 : 0);
+  }
+
+  public TermStateCodecImpl(TermStateCodecComponent[] components) {
+assert components.length > 0;
+
+this.components = components;
+int metadataBytesLength = 0;
+for (var component : components) {
+  metadataBytesLength += getMetadataLength(component);
+}
+this.metadataBytesLength = metadataBytesLength;
+  }
+
+  @Override
+  public byte[] encodeBlock(IntBlockTermState[] inputs, BitPacker bitPacker) {
+Metadata[] metadataPerComponent = getMetadataPerComponent(inputs);
+byte[] metadataBytes = serializeMetadata(metadataPerComponent);
+
+// Encode inputs via the bitpacker
+for (var termState : inputs) {
+  encodeOne(bitPacker, termState, metadataPerComponent);
+}
+
+return metadataBytes;
+  }
+
+  private Metadata[] getMetadataPerComponent(IntBlockTermState[] inputs) {
+Metadata[] metadataPerComponent = new Metadata[components.length];
+for (int i = 0; i < components.length; i++) {
+  var component = components[i];
+  byte bitWidth = TermStateCodecComponent.getBitWidth(inputs, component);
+  long referenceValue =
+  component.isMonotonicallyIncreasing() ? 
component.getTargetValue(inputs[0]) : 0L;
+  metadataPerComponent[i] = new Metadata(bitWidth, referenceValue);
+}
+return metadataPerComponent;
+  }
+
+  private byte[] serializeMetadata(Metadata[] metadataPerComponent) {
+byte[] metadataBytes = new byte[this.metadataBytesLength];
+ByteArrayDataOutput dataOut = new ByteArrayDataOutput(metadataBytes);
+
+for (int i = 0; i < components.length; i++) {
+  var me

Re: [PR] TestIndexWriterOnVMError.testUnknownError times out (fixes potential IW deadlock on tragic exceptions). [lucene]

2023-11-03 Thread via GitHub


dweiss commented on code in PR #12751:
URL: https://github.com/apache/lucene/pull/12751#discussion_r1381608903


##
lucene/core/src/java/org/apache/lucene/index/IndexWriter.java:
##
@@ -2560,10 +2560,15 @@ private void rollbackInternalNoCommit() throws 
IOException {
 
   // close all the closeables we can (but important is readerPool and 
writeLock to prevent
   // leaks)
-  IOUtils.closeWhileHandlingException(readerPool, deleter, writeLock);
-  writeLock = null;
-  closed = true;
-  closing = false;
+  try {
+IOUtils.closeWhileHandlingException(readerPool, deleter, 
writeLock);
+  } catch (Throwable t) {

Review Comment:
   This was deliberate so that notifyAll would still be called, even if 
something happened. In fact, the finally block is not necessary - it's a 
drop-through with catch (Throwable); I just wanted to make it explicit that we 
need to set those fields.






Re: [PR] TestIndexWriterOnVMError.testUnknownError times out (fixes potential IW deadlock on tragic exceptions). [lucene]

2023-11-03 Thread via GitHub


dweiss commented on code in PR #12751:
URL: https://github.com/apache/lucene/pull/12751#discussion_r1381610595


##
lucene/core/src/java/org/apache/lucene/index/IndexWriter.java:
##
@@ -2560,10 +2560,15 @@ private void rollbackInternalNoCommit() throws 
IOException {
 
   // close all the closeables we can (but important is readerPool and 
writeLock to prevent
   // leaks)
-  IOUtils.closeWhileHandlingException(readerPool, deleter, writeLock);
-  writeLock = null;
-  closed = true;
-  closing = false;
+  try {
+IOUtils.closeWhileHandlingException(readerPool, deleter, 
writeLock);
+  } catch (Throwable t) {
+throwable.addSuppressed(t);
+  } finally {
+writeLock = null;
+closed = true;
+closing = false;
+  }
 
   // So any "concurrently closing" threads wake up and see that the 
close has now completed:
   notifyAll();

Review Comment:
   As commented above, it'll be called even if something happens. I didn't want 
to move that notify out of the synchronized block because that would mean 
re-acquiring the lock?






Re: [PR] Adding new flat vector format and refactoring HNSW [lucene]

2023-11-03 Thread via GitHub


jimczi commented on PR #12729:
URL: https://github.com/apache/lucene/pull/12729#issuecomment-1792391067

   I agree with Adrien that hardcoded formats with a clear strategy are better. 
   We want to avoid exposing a knn format that takes another abstract format; 
that would be cryptic and difficult to navigate for users.
   However, I think we can reuse an existing format as an implementation detail. 
The `Lucene99HnswScalarQuantizedVectorsFormat` could be hardcoded to use the 
`Lucene99ScalarQuantizedVectorsFormat` internally. The fact that we register 
the format does not mean that we expose the composability to users. In its 
current form, the `FlatVector...` lineage is redundant if it exists only to 
avoid registering the format explicitly?





Re: [PR] Specialize arc store for continuous label in FST [lucene]

2023-11-03 Thread via GitHub


easyice commented on code in PR #12748:
URL: https://github.com/apache/lucene/pull/12748#discussion_r1381647779


##
lucene/core/src/java/org/apache/lucene/util/fst/FST.java:
##
@@ -96,6 +96,8 @@ public enum INPUT_TYPE {
*/
   static final byte ARCS_FOR_DIRECT_ADDRESSING = 1 << 6;
 
+  static final byte ARCS_FOR_CONTINUOUS = ARCS_FOR_DIRECT_ADDRESSING + 
ARCS_FOR_BINARY_SEARCH;

Review Comment:
   +1, it is important.






Re: [I] HnwsGraph creates disconnected components [lucene]

2023-11-03 Thread via GitHub


msokolov commented on issue #12627:
URL: https://github.com/apache/lucene/issues/12627#issuecomment-1792404783

   > So we are removing this half of the undirected connection but I don't 
think we are removing the other half c ---> b anywhere. This will leave 
inconsistent Graph
   
   This is by design; the resultant graph is not expected to be symmetric (i.e. 
undirected, bidirectional). It is allowed to have `a -> b` with no 
corresponding `b -> a`.
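
   A toy sketch of that property (plain neighbor lists, not the actual `HnswGraph` API; the node names and the pruned edge are made up for illustration):

   ```java
   import java.util.ArrayList;
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;

   public class DirectedNeighbors {
       public static void main(String[] args) {
           // Neighbor lists of a small directed graph: a -> b, b -> {a, c}, c -> b.
           Map<String, List<String>> neighbors = new HashMap<>();
           neighbors.put("a", new ArrayList<>(List.of("b")));
           neighbors.put("b", new ArrayList<>(List.of("a", "c")));
           neighbors.put("c", new ArrayList<>(List.of("b")));

           // Suppose pruning drops b from c's neighbor list...
           neighbors.get("c").remove("b");

           // ...b still points at c: the graph is now asymmetric, and that is legal.
           System.out.println(neighbors.get("b").contains("c")); // true
           System.out.println(neighbors.get("c").contains("b")); // false
       }
   }
   ```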





Re: [PR] Specialize arc store for continuous label in FST [lucene]

2023-11-03 Thread via GitHub


easyice commented on code in PR #12748:
URL: https://github.com/apache/lucene/pull/12748#discussion_r1381655684


##
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##
@@ -444,9 +446,15 @@ long addNode(FSTCompiler.UnCompiledNode nodeIn) throws 
IOException {
 
   int labelRange = nodeIn.arcs[nodeIn.numArcs - 1].label - 
nodeIn.arcs[0].label + 1;
   assert labelRange > 0;
-  if (shouldExpandNodeWithDirectAddressing(
+  boolean continuousLable = labelRange == nodeIn.numArcs;

Review Comment:
   sorry for the typo :)






Re: [PR] LUCENE-10572: Add support for varhandles in native byte order (still randomized during tests) [lucene]

2023-11-03 Thread via GitHub


uschindler commented on PR #888:
URL: https://github.com/apache/lucene/pull/888#issuecomment-1792528482

   Hi @mikemccand,
   I reset the branch to the initial commit (without the BytesRefHash & Co. 
changes). Then I merged and pushed.
   I will now try to redo the changes.
   
   In fact, on x86 machines it makes no sense to benchmark this, as the LE byte 
order is already native :-) This PR only helps architectures like s390x that 
are big endian; the internals of BytesRefHash would never make it into a file 
format, so they can encode their "private data" in native endianness. We still 
randomize the endianness during testing, so we make sure both variants work.
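
   For reference, a native-order `VarHandle` view over a `byte[]` looks like this (a minimal sketch using the plain JDK API, not the PR's actual changes):

   ```java
   import java.lang.invoke.MethodHandles;
   import java.lang.invoke.VarHandle;
   import java.nio.ByteOrder;

   public class NativeOrderDemo {
       // View a byte[] as longs in whatever byte order the CPU uses natively.
       private static final VarHandle NATIVE_LONG =
           MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());

       public static void main(String[] args) {
           byte[] scratch = new byte[8];
           NATIVE_LONG.set(scratch, 0, 0x1122334455667788L);
           long roundTripped = (long) NATIVE_LONG.get(scratch, 0);
           // Round-trips on any architecture; only the byte layout in scratch differs,
           // which is why such arrays must never be persisted to an index file.
           System.out.println(roundTripped == 0x1122334455667788L); // true
       }
   }
   ```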





Re: [PR] LUCENE-10572: Add support for varhandles in native byte order (still randomized during tests) [lucene]

2023-11-03 Thread via GitHub


uschindler commented on PR #888:
URL: https://github.com/apache/lucene/pull/888#issuecomment-1792537252

   @mikemccand,
   I checked the main branch; it no longer uses any varhandles in BytesRefHash 
and ByteBlockPool. No idea where the code moved to.
   
   It now uses BytesRefBlockPool, but that one uses BIG ENDIAN byte order (for 
whatever reason). As I no longer know which of those ByteFoobar classes in util 
are used only internally (never serialized to disk) and which ones are 
serialized to disk, I won't change anything for now.
   
   So I restored the PR to the "known state" (it adds native varhandles) and it 
changes LZ4 compression to use the native order (which LZ4 documents as not 
mattering). So this one only improves LZ4.
   
   I have no time to look into the changes in BytesRefHash, so I leave it to 
you to figure out where it is "ok" to change from fixed BE or LE order to 
native order, but care must be taken that those byte arrays are never 
persisted/serialized into index file formats.





Re: [PR] LUCENE-10572: Add support for varhandles in native byte order (still randomized during tests) [lucene]

2023-11-03 Thread via GitHub


uschindler commented on PR #888:
URL: https://github.com/apache/lucene/pull/888#issuecomment-1792545702

   @mikemccand: If you want to see the changes I reverted, see the above 
comparison: 
https://github.com/apache/lucene/compare/c1b626c0636821f4d7c085895359489e7dfa330f..36de2bb7fa7a0587a102cf5c4d35ac8f94976bbd





Re: [PR] Adding new flat vector format and refactoring HNSW [lucene]

2023-11-03 Thread via GitHub


benwtrent commented on PR #12729:
URL: https://github.com/apache/lucene/pull/12729#issuecomment-1792546852

   @jimczi what do you mean "existing format as implementation detail"?
   
   The flat format is an implementation detail. Folks using the quantized hnsw 
do not have to supply a flat format.
   
   And by "register format" do you mean we should register the flat formats? 
Right now they are just classes that encapsulate vector storage and merging.





Re: [PR] Specialize arc store for continuous label in FST [lucene]

2023-11-03 Thread via GitHub


easyice commented on code in PR #12748:
URL: https://github.com/apache/lucene/pull/12748#discussion_r1381794470


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsReader.java:
##
@@ -86,8 +86,11 @@ public final class Lucene90BlockTreeTermsReader extends 
FieldsProducer {
*/
   public static final int VERSION_MSB_VLONG_OUTPUT = 1;
 
+  /** Version that stores continuous arc labels as a range in the FST. */
+  public static final int VERSION_ARCS_CONTINUOUS = 2;
+
   /** Current terms format. */
-  public static final int VERSION_CURRENT = VERSION_MSB_VLONG_OUTPUT;
+  public static final int VERSION_CURRENT = VERSION_ARCS_CONTINUOUS;

Review Comment:
   You are right, we are backward compatible






Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) [lucene]

2023-11-03 Thread via GitHub


javanna commented on PR #240:
URL: https://github.com/apache/lucene/pull/240#issuecomment-1792564301

   heya @zacharymorn I worked quite a bit on this last year. I should have 
addressed all of this little by little, although we are still not very close 
to deprecating search(Query, Collector). There is still quite a bit of work 
left to remove usages of it.





Re: [PR] LUCENE-10572: Add support for varhandles in native byte order (still randomized during tests) [lucene]

2023-11-03 Thread via GitHub


uschindler commented on PR #888:
URL: https://github.com/apache/lucene/pull/888#issuecomment-1792570482

   > @mikemccand: If you want to see the changes I reverted, see the above 
comparison: 
https://github.com/apache/lucene/compare/36de2bb7fa7a0587a102cf5c4d35ac8f94976bbd..c1b626c0636821f4d7c085895359489e7dfa330f
   > 
   > Those changes need to be re-applied to the repo in correct files (not sure 
where this code now lives, looks like BytesRefBlockPool, but no idea, sorry)
   
   I think I know, after looking into those changes, what the problem was. 
Internally BytesRefHash uses BIG ENDIAN, because some parts in the byte array 
are "UTF-8 like" encoded (if the highest bit is set, another byte follows). As 
this encoding is needless and dropping it requires only a few bytes more 
storage, I removed it and always use shorts instead of "byte or BE short". When 
the encoding no longer matters and need not be "UTF-8 encoding like", it can 
use native order. But for safety you could also use LE encoding to match actual 
CPUs (ARM is also LE now).
   
   So we have 2 possibilities:
   - Change the internal encoding of BytesRefHash: remove the big-endian 
"UTF-8 like" encoding (or call it vShort) and switch to little-endian shorts
   - Use native encoding, which also helps CPUs like s390 (and also works). 
This PR supports this, but it is questionable for the reasons Robert said.





Re: [PR] TestIndexWriterOnVMError.testUnknownError times out (fixes potential IW deadlock on tragic exceptions). [lucene]

2023-11-03 Thread via GitHub


s1monw commented on code in PR #12751:
URL: https://github.com/apache/lucene/pull/12751#discussion_r1381800115


##
lucene/core/src/java/org/apache/lucene/index/IndexWriter.java:
##
@@ -2560,10 +2560,15 @@ private void rollbackInternalNoCommit() throws 
IOException {
 
   // close all the closeables we can (but important is readerPool and 
writeLock to prevent
   // leaks)
-  IOUtils.closeWhileHandlingException(readerPool, deleter, writeLock);
-  writeLock = null;
-  closed = true;
-  closing = false;
+  try {
+IOUtils.closeWhileHandlingException(readerPool, deleter, 
writeLock);
+  } catch (Throwable t) {
+throwable.addSuppressed(t);
+  } finally {
+writeLock = null;
+closed = true;
+closing = false;
+  }
 
   // So any "concurrently closing" threads wake up and see that the 
close has now completed:
   notifyAll();

Review Comment:
   not sure we are talking about the same thing. maybe do it like this:
   
   ```diff
   --- a/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
   +++ b/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
   @@ -2560,13 +2560,15 @@ public class IndexWriter

  // close all the closeables we can (but important is readerPool 
and writeLock to prevent
  // leaks)
   -  IOUtils.closeWhileHandlingException(readerPool, deleter, 
writeLock);
   -  writeLock = null;
   -  closed = true;
   -  closing = false;
   -
   -  // So any "concurrently closing" threads wake up and see that the 
close has now completed:
   -  notifyAll();
   +  try {
   +IOUtils.closeWhileHandlingException(readerPool, deleter, 
writeLock);
   +  } finally {
   +writeLock = null;
   +closed = true;
   +closing = false;
   +// So any "concurrently closing" threads wake up and see that 
the close has now completed:
   +notifyAll();
   +  }
}
  } catch (Throwable t) {
throwable.addSuppressed(t);
   ```






Re: [PR] Use value-based LRU cache in NodeHash [lucene]

2023-11-03 Thread via GitHub


jpountz commented on code in PR #12738:
URL: https://github.com/apache/lucene/pull/12738#discussion_r1381802489


##
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##
@@ -186,149 +218,228 @@ private long hash(FSTCompiler.UnCompiledNode node) {
 return h;
   }
 
-  // hash code for a frozen node.  this must precisely match the hash 
computation of an unfrozen
-  // node!
-  private long hash(long node) throws IOException {
-final int PRIME = 31;
-
-long h = 0;
-fstCompiler.fst.readFirstRealTargetArc(node, scratchArc, in);
-while (true) {
-  h = PRIME * h + scratchArc.label();
-  h = PRIME * h + (int) (scratchArc.target() ^ (scratchArc.target() >> 
32));
-  h = PRIME * h + scratchArc.output().hashCode();
-  h = PRIME * h + scratchArc.nextFinalOutput().hashCode();
-  if (scratchArc.isFinal()) {
-h += 17;
-  }
-  if (scratchArc.isLast()) {
-break;
-  }
-  fstCompiler.fst.readNextRealArc(scratchArc, in);
-}
-
-return h;
-  }
-
-  /**
-   * Compares an unfrozen node (UnCompiledNode) with a frozen node at byte 
location address (long),
-   * returning true if they are equal.
-   */
-  private boolean nodesEqual(FSTCompiler.UnCompiledNode node, long address) 
throws IOException {
-fstCompiler.fst.readFirstRealTargetArc(address, scratchArc, in);
-
-// fail fast for a node with fixed length arcs
-if (scratchArc.bytesPerArc() != 0) {
-  assert node.numArcs > 0;
-  // the frozen node uses fixed-with arc encoding (same number of bytes 
per arc), but may be
-  // sparse or dense
-  switch (scratchArc.nodeFlags()) {
-case FST.ARCS_FOR_BINARY_SEARCH:
-  // sparse
-  if (node.numArcs != scratchArc.numArcs()) {
-return false;
-  }
-  break;
-case FST.ARCS_FOR_DIRECT_ADDRESSING:
-  // dense -- compare both the number of labels allocated in the array 
(some of which may
-  // not actually be arcs), and the number of arcs
-  if ((node.arcs[node.numArcs - 1].label - node.arcs[0].label + 1) != 
scratchArc.numArcs()
-  || node.numArcs != FST.Arc.BitTable.countBits(scratchArc, in)) {
-return false;
-  }
-  break;
-default:
-  throw new AssertionError("unhandled scratchArc.nodeFlag() " + 
scratchArc.nodeFlags());
-  }
-}
-
-// compare arc by arc to see if there is a difference
-for (int arcUpto = 0; arcUpto < node.numArcs; arcUpto++) {
-  final FSTCompiler.Arc arc = node.arcs[arcUpto];
-  if (arc.label != scratchArc.label()
-  || arc.output.equals(scratchArc.output()) == false
-  || ((FSTCompiler.CompiledNode) arc.target).node != 
scratchArc.target()
-  || arc.nextFinalOutput.equals(scratchArc.nextFinalOutput()) == false
-  || arc.isFinal != scratchArc.isFinal()) {
-return false;
-  }
-
-  if (scratchArc.isLast()) {
-if (arcUpto == node.numArcs - 1) {
-  return true;
-} else {
-  return false;
-}
-  }
-
-  fstCompiler.fst.readNextRealArc(scratchArc, in);
-}
-
-// unfrozen node has fewer arcs than frozen node
-
-return false;
-  }
-
   /** Inner class because it needs access to hash function and FST bytes. */
   private class PagedGrowableHash {
-private PagedGrowableWriter entries;
+// total bytes copied out of the FST byte[] into this hash table 
ByteBlockPool for suffix nodes
+private long copiedBytes;
+// storing the FST node address where the position is the masked hash of 
the node arcs
+private PagedGrowableWriter fstNodeAddress;
+// storing the local copiedNodes address in the same position as 
fstNodeAddress
+// here we are effectively storing a Map from the FST node 
address to copiedNodes
+// address
+private PagedGrowableWriter copiedNodeAddress;
 private long count;
 private long mask;
+// storing the byte slice from the FST for nodes we added to the hash so 
that we don't need to
+// look up from the FST itself, so the FST bytes can stream directly to 
disk as append-only
+// writes.
+// each node will be written subsequently
+private final ByteBlockPool copiedNodes;
+// the {@link FST.BytesReader} to read from copiedNodes. we use this when 
computing a frozen
+// node hash
+// or comparing if a frozen and unfrozen nodes are equal
+private final ByteBlockPoolReverseBytesReader bytesReader;
 
 // 256K blocks, but note that the final block is sized only as needed so 
it won't use the full
 // block size when just a few elements were written to it
 private static final int BLOCK_SIZE_BYTES = 1 << 18;
 
 public PagedGrowableHash() {
-  entries = new PagedGrowableWriter(16, BLOCK_SIZE_BYTES, 8, 
PackedInts.COMPACT);
+  fstNodeAddress = new PagedGrowableWriter(16, BLOCK_SIZE_BYTES, 8, 
PackedInts.COMPACT);
+  cop

Re: [PR] Remove patching for doc blocks. [lucene]

2023-11-03 Thread via GitHub


jpountz commented on PR #12741:
URL: https://github.com/apache/lucene/pull/12741#issuecomment-1792586157

   For reference, I'm interested in taking advantage of the fact that we're 
changing the codec anyway to look into other smaller changes, like switching 
tail postings from vints to group-varint, or better aligning blocks and skip 
lists so that `BlockDocsEnum#advance` doesn't need to check whether 
`docBufferUpto == BLOCK_SIZE` to decode a new block and could do it directly 
under the `target > nextSkipDoc` check.





Re: [PR] Use value-based LRU cache in NodeHash [lucene]

2023-11-03 Thread via GitHub


mikemccand commented on code in PR #12738:
URL: https://github.com/apache/lucene/pull/12738#discussion_r1381811085


##
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##

Re: [PR] Use value-based LRU cache in NodeHash [lucene]

2023-11-03 Thread via GitHub


mikemccand commented on code in PR #12738:
URL: https://github.com/apache/lucene/pull/12738#discussion_r1381812079


##
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##

Re: [PR] Adding new flat vector format and refactoring HNSW [lucene]

2023-11-03 Thread via GitHub


jimczi commented on PR #12729:
URL: https://github.com/apache/lucene/pull/12729#issuecomment-1792590687

   > The flat format is an implementation detail. Folks using the quantized 
hnsw do not have to supply a flat format.
   
   We can register the flat format for direct usage (outside of HNSW) and use 
it in the HNSW formats as an implementation detail? My point is that we don't 
need the `FlatVector...` classes. We can expose the quantized HNSW format with 
the quantiles option and use the scalar flat format internally. Users won't 
have to supply the flat format explicitly, but it will be used internally. For 
instance, if we introduce the half-float format at a later stage as 
`Float16FlatVectorsFormat`, we would need to expose a 
`HNSWFloat16VectorsFormat` to make it available with HNSW. In other words, we 
don't allow composable formats explicitly, but each high-level format can reuse 
other formats internally (as an implementation detail).





Re: [I] Explore partially decoding blocks (within-block skipping) [lucene]

2023-11-03 Thread via GitHub


jpountz commented on issue #12749:
URL: https://github.com/apache/lucene/issues/12749#issuecomment-1792592185

   How would it work? Since blocks are delta-coded, you can't know the value at 
a given index without decoding all previous values and computing their sum. Or 
you would need to store some checkpoints separately, but then it may be 
easier/better to simply go with smaller blocks (e.g. 64 doc IDs instead of 128)?





[PR] [DRAFT] Bump release to Java 21 [lucene]

2023-11-03 Thread via GitHub


ChrisHegarty opened a new pull request, #12753:
URL: https://github.com/apache/lucene/pull/12753

   [ There is no intent to merge this PR ]
   
   This PR is intended to help tease out potential issues that may arise from 
compiling with JDK 21. We can use it to identify and pick out the individual 
issues. So far we know:
   
   1. we use deprecated java.net URL constructors - usage should likely be 
replaced with URI::toURL
   Can proceed independent of the version bump
   
   2. we use deprecated java.util Locale constructors - usage should likely be 
replaced with Locale::of
   The Locale::of factories were added in Java 19, so this kind of ties the 
change to the version bump. Maybe evaluate using Locale.Builder
   
   3. we need a new, yet-to-be-released ecj jdt - but it is good to validate the 
unreleased snapshot 
   Await release of 3.36.0
   
   





Re: [PR] LUCENE-10572: Add support for varhandles in native byte order (still randomized during tests) [lucene]

2023-11-03 Thread via GitHub


mikemccand commented on PR #888:
URL: https://github.com/apache/lucene/pull/888#issuecomment-1792595250

   > @uschindler  pushed 0 commits.
   
   Huh, how do you do that?
   
   Mike McCandless
   
   http://blog.mikemccandless.com
   
   





Re: [PR] speedup arm int functions? [lucene]

2023-11-03 Thread via GitHub


rmuir commented on PR #12743:
URL: https://github.com/apache/lucene/pull/12743#issuecomment-1792596894

   @ChrisHegarty I did some investigation, looked at the assembly on ARM 
machines, did some experiments, etc. I didn't mess around with Intel, but I 
think the situation is the same. My thoughts:
   * lots of overhead here for each loop iteration other than just `mul`/`madd` 
for what we do here, so there's some room left on the table on some CPUs.
   * does not help the other functions, as they have more going on already
   * does not help vector functions; specific to this one
   * cannot find a way to get more parallelism
   
   e.g. on ARM it is more than just `ldrsb` to load from the byte array, 
convert to int, sign-extend, etc. There is the overhead of the loop too, so we 
just get a little more parallelism out of this simple function. For the other 
functions here this trick is not helpful: they already have a "meatier" loop 
body, if that makes sense.
   
   So I think we should commit the little trick for now. The way to get more 
parallelization long-term for this method is to use the correct CPU 
instructions (e.g. `sdot`): 
https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-
 . As the documentation states, with 128-bit vectors it can do 64 dot products 
per cycle. I assume possibly twice that on 256-bit vectors.
   
   On Intel there is a similar feature. I don't even know where to 
begin/suggest, when looking at the Vector API, how such instructions would be 
exposed. It is like a VectorOperator which is both ternary and conversion :). 
The best idea I have would be to expose it as a ternary operator on ByteVector 
that returns ByteVector and force the user to reinterpret the underlying bits 
as IntVector.
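   For context, the kind of scalar trick under discussion can be sketched as follows (an illustration with made-up names, not the actual Lucene code): a plain int8 dot product next to a 2x-unrolled variant, where two independent accumulators give the CPU a little more instruction-level parallelism than one serial `sum`.

```java
// Illustrative scalar int8 dot products.
public class DotDemo {

  // Straightforward loop: every iteration depends on the previous sum.
  static int dotSimple(byte[] a, byte[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i]; // bytes promote to int before multiply
    }
    return sum;
  }

  // 2x unrolled: the two accumulators have no dependency on each other,
  // so the CPU can overlap their multiply-adds.
  static int dotUnrolled(byte[] a, byte[] b) {
    int s0 = 0, s1 = 0;
    int i = 0;
    for (; i + 1 < a.length; i += 2) {
      s0 += a[i] * b[i];         // independent accumulator 1
      s1 += a[i + 1] * b[i + 1]; // independent accumulator 2
    }
    for (; i < a.length; i++) {  // odd-length tail
      s0 += a[i] * b[i];
    }
    return s0 + s1;
  }

  public static void main(String[] args) {
    byte[] a = {1, 2, 3, 4};
    byte[] b = {5, 6, 7, 8};
    System.out.println(dotSimple(a, b));   // prints 70
    System.out.println(dotUnrolled(a, b)); // prints 70
  }
}
```

   Dedicated instructions like `sdot`/`vpdpbusd` fuse the widening multiply and accumulate across whole vectors, which is why they beat any amount of scalar unrolling.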





Re: [PR] Skip docs with Docvalues in NumericLeafComparator [lucene]

2023-11-03 Thread via GitHub


jpountz commented on PR #12405:
URL: https://github.com/apache/lucene/pull/12405#issuecomment-1792600276

   Sorry, since I had approved the PR, I had not understood it was still 
waiting on me. It's a great change, let's see how to get it in.





Re: [PR] LUCENE-10572: Add support for varhandles in native byte order (still randomized during tests) [lucene]

2023-11-03 Thread via GitHub


mikemccand commented on PR #888:
URL: https://github.com/apache/lucene/pull/888#issuecomment-1792606649

   Thanks @uschindler!  Removing vShort and switching to LE (or native -- I 
didn't understand the problem with that -- this is never (directly) serialized 
to a Lucene index) short seems good?  I guess we lose a bit of RAM efficiency, 
sometimes taking two bytes instead of one.  But we get faster CPU decode.





Re: [PR] speedup arm int functions? [lucene]

2023-11-03 Thread via GitHub


rmuir commented on PR #12743:
URL: https://github.com/apache/lucene/pull/12743#issuecomment-1792630498

   Equivalent on Intel Ice Lake: https://www.felixcloutier.com/x86/vpdpbusd
   IMO, we should figure out a path to using these, to get the best performance 
from the binary vectors. It isn't useful to spend time trying many other hacks 
around the situation when the "problem has been solved"; we just have to make 
use of the hardware.
   
   But I think this little patch won't hurt anyone and improves the scalar 
baseline.
   





Re: [PR] [DRAFT] Bump release to Java 21 [lucene]

2023-11-03 Thread via GitHub


rmuir commented on PR #12753:
URL: https://github.com/apache/lucene/pull/12753#issuecomment-1792647332

   > 2\. we use deprecated java.util Locale constructors - usage should likely 
be replaced with Locale::of
   > The Locale::of factories were added in Java 19, so this kind of ties 
the change to the version bump. Maybe evaluate using Locale.Builder
   
   Or just consider `forLanguageTag`, which has been around forever (Java 7). 
AFAIK the Locale.of factories are just sugar over that. 
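   Concretely, the non-deprecated path being suggested looks like this (standard JDK API, shown as a small sketch):

```java
import java.util.Locale;

// Locale.forLanguageTag takes an IETF BCP 47 tag and has been available
// since Java 7, so it works without bumping the minimum Java version.
public class LocaleDemo {
  public static void main(String[] args) {
    // instead of the deprecated: new Locale("de", "DE")
    Locale deLocale = Locale.forLanguageTag("de-DE");
    System.out.println(deLocale.getLanguage()); // prints "de"
    System.out.println(deLocale.getCountry());  // prints "DE"
  }
}
```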





Re: [PR] Skip docs with Docvalues in NumericLeafComparator [lucene]

2023-11-03 Thread via GitHub


jpountz commented on PR #12405:
URL: https://github.com/apache/lucene/pull/12405#issuecomment-1792654011

   I did my best at fixing conflicts; @LuXugang, are you able to check the 
changes?





Re: [PR] [DRAFT] Bump release to Java 21 [lucene]

2023-11-03 Thread via GitHub


rmuir commented on PR #12753:
URL: https://github.com/apache/lucene/pull/12753#issuecomment-1792669164

   I'll try to cut over your branch; seems some stuff here should be using it 
already. For example, the Luke GUI already expects a field to be a language tag, 
so we shouldn't be using this Locale.of(). Let's ban Locale.of with forbidden 
apis :)





Re: [PR] [DRAFT] Bump release to Java 21 [lucene]

2023-11-03 Thread via GitHub


ChrisHegarty commented on PR #12753:
URL: https://github.com/apache/lucene/pull/12753#issuecomment-1792685739

   Thanks @rmuir 





Re: [PR] Adding new flat vector format and refactoring HNSW [lucene]

2023-11-03 Thread via GitHub


benwtrent commented on PR #12729:
URL: https://github.com/apache/lucene/pull/12729#issuecomment-1792686970

   @jimczi the HNSWWriter and Readers need the passed flat vector readers and 
writers to provide specific functions, like the `mergeOneField` that returns 
closeable scorers.
   
   I am not sure all knnVectorReaders/Writers should have that method. 
   
   
   I think we should expose the flat formats in the codec. But the required new 
functions for indexing the vectors seem to justify a new abstraction.
   
   Or do you think they are just KnnReader/Writers and we update that base 
abstraction?





Re: [PR] [DRAFT] Bump release to Java 21 [lucene]

2023-11-03 Thread via GitHub


rmuir commented on PR #12753:
URL: https://github.com/apache/lucene/pull/12753#issuecomment-1792687728

   the only ugly one was the benchmark locale task, because it's got a method 
shaped just like the deprecated java-ism:
   ```
   static Locale createLocale(String language, String country, String variant)
   ```
   
   If we want to make it prettier, the fix to that is to change the signature of 
`createLocale` (separate issue)
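
   One possible shape for that helper, rebuilt on Locale.Builder instead of the 
deprecated constructor (a sketch only; the task was ultimately cut over to 
language tags instead):

   ```java
   import java.util.Locale;

   public class BenchLocale {
       // 3-arg helper on Locale.Builder; the empty-variant guard avoids
       // IllformedLocaleException for the common "no variant" case.
       static Locale createLocale(String language, String country, String variant) {
           Locale.Builder builder = new Locale.Builder()
               .setLanguage(language)
               .setRegion(country);
           if (!variant.isEmpty()) {
               builder.setVariant(variant);
           }
           return builder.build();
       }

       public static void main(String[] args) {
           System.out.println(createLocale("en", "US", "").toLanguageTag()); // en-US
       }
   }
   ```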
   
   





Re: [PR] TestIndexWriterOnVMError.testUnknownError times out (fixes potential IW deadlock on tragic exceptions). [lucene]

2023-11-03 Thread via GitHub


dweiss commented on code in PR #12751:
URL: https://github.com/apache/lucene/pull/12751#discussion_r1381916122


##
lucene/core/src/java/org/apache/lucene/index/IndexWriter.java:
##
@@ -2560,10 +2560,15 @@ private void rollbackInternalNoCommit() throws 
IOException {
 
   // close all the closeables we can (but important is readerPool and 
writeLock to prevent
   // leaks)
-  IOUtils.closeWhileHandlingException(readerPool, deleter, writeLock);
-  writeLock = null;
-  closed = true;
-  closing = false;
+  try {
+IOUtils.closeWhileHandlingException(readerPool, deleter, 
writeLock);
+  } catch (Throwable t) {
+throwable.addSuppressed(t);
+  } finally {
+writeLock = null;
+closed = true;
+closing = false;
+  }
 
   // So any "concurrently closing" threads wake up and see that the 
close has now completed:
   notifyAll();

Review Comment:
   Ah, I see it now. Sure, I'll move it.  Out of curiosity - what's going to 
happen if an exception is thrown from here (it's before the synchronization 
block)?
   ```
   // Must not hold IW's lock while closing
   // mergeScheduler: this can lead to deadlock,
   // e.g. TestIW.testThreadInterruptDeadlock
   IOUtils.closeWhileHandlingException(mergeScheduler);
   ```






Re: [PR] [DRAFT] Bump release to Java 21 [lucene]

2023-11-03 Thread via GitHub


rmuir commented on PR #12753:
URL: https://github.com/apache/lucene/pull/12753#issuecomment-1792702231

   I'd say let's factor out the cleanups and commit those without the java-21 
stuff? It would make the java-21 PR smaller, and these are really just tech-debt 
type fixes that should be addressed?





Re: [PR] [DRAFT] Bump release to Java 21 [lucene]

2023-11-03 Thread via GitHub


rmuir commented on PR #12753:
URL: https://github.com/apache/lucene/pull/12753#issuecomment-1792700032

   oops, sorry i missed some, thanks.





Re: [PR] tests.multiplier could be omitted in failed test reproduce line [lucene]

2023-11-03 Thread via GitHub


dweiss merged PR #12752:
URL: https://github.com/apache/lucene/pull/12752





Re: [PR] Skip docs with Docvalues in NumericLeafComparator [lucene]

2023-11-03 Thread via GitHub


jpountz commented on PR #12405:
URL: https://github.com/apache/lucene/pull/12405#issuecomment-1792713630

   Tests fail because the optimization kicks in in more cases than the test 
expects; it's not clear to me yet whether it's a bug or not.





Re: [PR] [DRAFT] Bump release to Java 21 [lucene]

2023-11-03 Thread via GitHub


ChrisHegarty commented on PR #12753:
URL: https://github.com/apache/lucene/pull/12753#issuecomment-1792714766

   > I'd say lets factor out the cleanups and commit those without the java-21 
stuff? it would make the java-21 PR smaller and these are really just tech-debt 
type fixes that should be addressed?
   
   ++ I'll start with the URL/URI changes. Will leave the locale changes to 
tomorrow, unless someone else picks it up. :-) 





Re: [PR] [DRAFT] Bump release to Java 21 [lucene]

2023-11-03 Thread via GitHub


rmuir commented on PR #12753:
URL: https://github.com/apache/lucene/pull/12753#issuecomment-1792737633

   LOL forbidden apis caught us with the language tags: yes let's use error 
handling:
   
   > If the specified language tag contains any ill-formed subtags, the first 
such subtag and all following subtags are ignored. Compare to 
[Locale.Builder.setLanguageTag(java.lang.String)](https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/util/Locale.Builder.html#setLanguageTag(java.lang.String))
 which throws an exception in this case.
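
   A quick sketch of the two behaviors for an ill-formed tag such as `en,US`:

   ```java
   import java.util.IllformedLocaleException;
   import java.util.Locale;

   public class StrictLocaleParsing {
       public static void main(String[] args) {
           // forLanguageTag() silently drops ill-formed subtags; here nothing
           // parses, so we get the undetermined locale:
           Locale lenient = Locale.forLanguageTag("en,US");
           System.out.println(lenient.toLanguageTag()); // und

           // Locale.Builder.setLanguageTag() throws instead, which is the
           // error handling this comment argues for:
           try {
               new Locale.Builder().setLanguageTag("en,US").build();
           } catch (IllformedLocaleException e) {
               System.out.println("rejected: " + e.getMessage());
           }
       }
   }
   ```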
   





Re: [PR] [DRAFT] Bump release to Java 21 [lucene]

2023-11-03 Thread via GitHub


ChrisHegarty commented on PR #12753:
URL: https://github.com/apache/lucene/pull/12753#issuecomment-1792745161

   I just noticed that too! Easy fix! ;-) (This PR is marked non-draft, just to 
get the CI building/testing, which helps spot such issues without warming my 
home!)





Re: [PR] [DRAFT] Bump release to Java 21 [lucene]

2023-11-03 Thread via GitHub


rmuir commented on PR #12753:
URL: https://github.com/apache/lucene/pull/12753#issuecomment-1792750934

   I am working on it





[PR] Refactor access to VM options [lucene]

2023-11-03 Thread via GitHub


uschindler opened a new pull request, #12754:
URL: https://github.com/apache/lucene/pull/12754

   This code was previously in `RamUsageEstimator` and also in 
`PanamaVectorUtilSupport`.
   
   In addition this moves detection of Client VM and fast FMA support to 
`Constants` class (in preparation for #12737).





Re: [PR] Refactor access to VM options [lucene]

2023-11-03 Thread via GitHub


uschindler commented on code in PR #12754:
URL: https://github.com/apache/lucene/pull/12754#discussion_r1381959455


##
lucene/core/src/java/org/apache/lucene/util/Constants.java:
##
@@ -53,28 +53,33 @@ private Constants() {} // can't construct
   /** The value of System.getProperty("java.vendor"). */
   public static final String JAVA_VENDOR = System.getProperty("java.vendor");
 
+  /** True iff the Java runtime is a client runtime and C2 compiler is not 
enabled */
+  public static final boolean IS_CLIENT_VM =

Review Comment:
   The Constants class always ignored the security manager, so this is not new 
here






Re: [PR] Refactor access to VM options [lucene]

2023-11-03 Thread via GitHub


uschindler commented on code in PR #12754:
URL: https://github.com/apache/lucene/pull/12754#discussion_r1381961561


##
lucene/core/src/java/org/apache/lucene/util/Constants.java:
##
@@ -53,28 +53,33 @@ private Constants() {} // can't construct
   /** The value of System.getProperty("java.vendor"). */
   public static final String JAVA_VENDOR = System.getProperty("java.vendor");
 
+  /** True iff the Java runtime is a client runtime and C2 compiler is not 
enabled */
+  public static final boolean IS_CLIENT_VM =
+  System.getProperty("java.vm.info", "").contains("emulated-client");
+
   /** True iff running on a 64bit JVM */
-  public static final boolean JRE_IS_64BIT;
+  public static final boolean JRE_IS_64BIT = is64Bit();
+
+  /** true iff we know fast FMA is supported, to deliver less error */
+  public static final boolean HAS_FAST_FMA =
+  (IS_CLIENT_VM == false)
+  && Constants.OS_ARCH.equals("amd64")

Review Comment:
   Not sure if this may still be null (see below). Maybe be safe here, as the 
Constants class is initialized early.






Re: [PR] Refactor access to VM options [lucene]

2023-11-03 Thread via GitHub


uschindler commented on code in PR #12754:
URL: https://github.com/apache/lucene/pull/12754#discussion_r1381963481


##
lucene/core/src/java/org/apache/lucene/util/Constants.java:
##
@@ -53,28 +53,33 @@ private Constants() {} // can't construct
   /** The value of System.getProperty("java.vendor"). */
   public static final String JAVA_VENDOR = System.getProperty("java.vendor");
 
+  /** True iff the Java runtime is a client runtime and C2 compiler is not 
enabled */
+  public static final boolean IS_CLIENT_VM =
+  System.getProperty("java.vm.info", "").contains("emulated-client");
+
   /** True iff running on a 64bit JVM */
-  public static final boolean JRE_IS_64BIT;
+  public static final boolean JRE_IS_64BIT = is64Bit();
+
+  /** true iff we know fast FMA is supported, to deliver less error */
+  public static final boolean HAS_FAST_FMA =
+  (IS_CLIENT_VM == false)
+  && Constants.OS_ARCH.equals("amd64")
+  && 
HotspotVMOptions.get("UseFMA").map(Boolean::valueOf).orElse(false);

Review Comment:
   I am fine with both. I just used the code here as it is on the current main 
branch.






Re: [PR] Refactor access to VM options [lucene]

2023-11-03 Thread via GitHub


rmuir commented on code in PR #12754:
URL: https://github.com/apache/lucene/pull/12754#discussion_r1381962437


##
lucene/core/src/java/org/apache/lucene/util/Constants.java:
##
@@ -53,28 +53,33 @@ private Constants() {} // can't construct
   /** The value of System.getProperty("java.vendor"). */
   public static final String JAVA_VENDOR = System.getProperty("java.vendor");
 
+  /** True iff the Java runtime is a client runtime and C2 compiler is not 
enabled */
+  public static final boolean IS_CLIENT_VM =
+  System.getProperty("java.vm.info", "").contains("emulated-client");
+
   /** True iff running on a 64bit JVM */
-  public static final boolean JRE_IS_64BIT;
+  public static final boolean JRE_IS_64BIT = is64Bit();
+
+  /** true iff we know fast FMA is supported, to deliver less error */
+  public static final boolean HAS_FAST_FMA =
+  (IS_CLIENT_VM == false)
+  && Constants.OS_ARCH.equals("amd64")
+  && 
HotspotVMOptions.get("UseFMA").map(Boolean::valueOf).orElse(false);

Review Comment:
   Can we update this with the current logic in the other branch? With the 
newer Neoverse cores that support SVE, they "fixed" their FMA to be fast, so we 
turn it on there too. It is a good exercise as well since it must check an 
integer parameter.
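
   A self-contained sketch of what that integer check could look like. The 
stand-in option map and the structure of `hasFastFma` are illustrative 
assumptions, not the PR's actual code; `UseSVE` is HotSpot's integer option for 
SVE support on aarch64:

   ```java
   import java.util.Map;
   import java.util.Optional;

   public class FastFmaCheck {
       // Stand-in for HotSpot VM options; real code would query the VM
       // (illustrative values only):
       static final Map<String, String> VM_OPTIONS = Map.of("UseFMA", "true", "UseSVE", "1");

       static Optional<String> vmOption(String name) {
           return Optional.ofNullable(VM_OPTIONS.get(name));
       }

       static boolean hasFastFma(String osArch, boolean clientVm) {
           boolean useFma = vmOption("UseFMA").map(Boolean::valueOf).orElse(false);
           if (!useFma || clientVm) {
               return false;
           }
           if (osArch.equals("amd64")) {
               return true;
           }
           if (osArch.equals("aarch64")) {
               // UseSVE is an integer option, so Boolean.valueOf is not enough;
               // parse it as an int, treating 0 (or absence) as "no SVE":
               int useSve = vmOption("UseSVE").map(Integer::valueOf).orElse(0);
               return useSve > 0;
           }
           return false;
       }

       public static void main(String[] args) {
           System.out.println(hasFastFma("aarch64", false)); // true with the stand-in values
           System.out.println(hasFastFma("aarch64", true));  // false: client VM
       }
   }
   ```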






[PR] use URI where possible [lucene]

2023-11-03 Thread via GitHub


ChrisHegarty opened a new pull request, #12755:
URL: https://github.com/apache/lucene/pull/12755

   This commit replaces the usage of the deprecated `java.net.URL` constructor 
with `URI`, later converting via `toURL()` where necessary to interoperate with 
the URLConnection API. The usage is mostly in tools, which I verified with 
`gradlew regenerate`.
   
   The motivation for this change is to address tech debt identified when 
experimenting with bumping to Java 21. 
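
   The pattern, in miniature (a sketch of the migration, not the PR's exact 
diff):

   ```java
   import java.net.URI;
   import java.net.URL;

   public class UriMigration {
       public static void main(String[] args) throws Exception {
           // Before: new URL("https://lucene.apache.org/"); the URL
           // constructors are deprecated as of Java 20.
           // After: parse and validate as a URI first...
           URI uri = URI.create("https://lucene.apache.org/");
           // ...and convert only at the boundary that needs URLConnection:
           URL url = uri.toURL();
           System.out.println(url.getHost()); // lucene.apache.org
       }
   }
   ```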





Re: [PR] [DRAFT] Bump release to Java 21 [lucene]

2023-11-03 Thread via GitHub


rmuir commented on PR #12753:
URL: https://github.com/apache/lucene/pull/12753#issuecomment-1792805460

   I will fix the Norwegian problems in the tests. Not sure what this `NY` 
stuff is. There is `no`, `nn`, `nb` for Norwegian, Nynorsk, and Bokmål. I 
assume the test wants `nn`.





Re: [PR] Random access term dictionary [lucene]

2023-11-03 Thread via GitHub


Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1381984403


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermStateCodecImpl.java:
##
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import 
org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat.IntBlockTermState;
+import 
org.apache.lucene.sandbox.codecs.lucene90.randomaccess.bitpacking.BitPacker;
+import 
org.apache.lucene.sandbox.codecs.lucene90.randomaccess.bitpacking.BitUnpacker;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.store.ByteArrayDataOutput;
+import org.apache.lucene.util.BytesRef;
+
+final class TermStateCodecImpl implements TermStateCodec {
+  private final TermStateCodecComponent[] components;
+  private final int metadataBytesLength;
+
+  private static int getMetadataLength(TermStateCodecComponent component) {
+// 1 byte for bitWidth; optionally 8 byte more for the reference value
+return 1 + (component.isMonotonicallyIncreasing() ? 8 : 0);
+  }
+
+  public TermStateCodecImpl(TermStateCodecComponent[] components) {
+assert components.length > 0;
+
+this.components = components;
+int metadataBytesLength = 0;
+for (var component : components) {
+  metadataBytesLength += getMetadataLength(component);
+}
+this.metadataBytesLength = metadataBytesLength;
+  }
+
+  @Override
+  public byte[] encodeBlock(IntBlockTermState[] inputs, BitPacker bitPacker) {
+Metadata[] metadataPerComponent = getMetadataPerComponent(inputs);
+byte[] metadataBytes = serializeMetadata(metadataPerComponent);
+
+// Encode inputs via the bitpacker
+for (var termState : inputs) {
+  encodeOne(bitPacker, termState, metadataPerComponent);
+}
+
+return metadataBytes;
+  }
+
+  private Metadata[] getMetadataPerComponent(IntBlockTermState[] inputs) {
+Metadata[] metadataPerComponent = new Metadata[components.length];
+for (int i = 0; i < components.length; i++) {
+  var component = components[i];
+  byte bitWidth = TermStateCodecComponent.getBitWidth(inputs, component);
+  long referenceValue =
+  component.isMonotonicallyIncreasing() ? 
component.getTargetValue(inputs[0]) : 0L;
+  metadataPerComponent[i] = new Metadata(bitWidth, referenceValue);
+}
+return metadataPerComponent;
+  }
+
+  private byte[] serializeMetadata(Metadata[] metadataPerComponent) {
+byte[] metadataBytes = new byte[this.metadataBytesLength];
+ByteArrayDataOutput dataOut = new ByteArrayDataOutput(metadataBytes);
+
+for (int i = 0; i < components.length; i++) {
+  var metadata = metadataPerComponent[i];
+  dataOut.writeByte(metadata.bitWidth);
+  if (components[i].isMonotonicallyIncreasing()) {
+dataOut.writeLong(metadata.referenceValue);
+  }
+}
+return metadataBytes;
+  }
+
+  private void encodeOne(

Review Comment:
   Yes, since the common access pattern is to grab all interesting states for 
one term, as opposed to "I want to get all posting start FPs for many terms". 





Re: [PR] [DRAFT] Bump release to Java 21 [lucene]

2023-11-03 Thread via GitHub


rmuir commented on PR #12753:
URL: https://github.com/apache/lucene/pull/12753#issuecomment-1792821264

   ok, i think the best fix is to just cut over this benchmark task to take a 
tag. The parsing is strict, so if someone has a .alg file with `en,US` or 
whatever, they will get a nice error message and just need to change it to a tag 
(e.g. `en-US`):
   ```
  > java.util.IllformedLocaleException: Invalid subtag: en,US [at index 
0]
  > at 
__randomizedtesting.SeedInfo.seed([93469BC41AA902E:8C022240A6E1B490]:0)
  > at 
java.base/java.util.Locale$Builder.setLanguageTag(Locale.java:2749)
  > at 
org.apache.lucene.benchmark.byTask.tasks.NewLocaleTask.createLocale(NewLocaleTask.java:53)
   ```





Re: [PR] Random access term dictionary [lucene]

2023-11-03 Thread via GitHub


Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1381990655


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermType.java:
##
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.util.Objects;
+import 
org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat.IntBlockTermState;
+
+/**
+ * TermType holds the classification of a term, based on how its postings are 
written.
+ *
+ * It captures -- 1) if a term has a singleton docid (i.e. only one doc 
contains this term). 2)
+ * if the term has skip data. 3) if the term has a VINT-encoded position block.
+ */
+final class TermType {
+  private static final byte SINGLETON_DOC_MASK = (byte) 1;
+
+  private static final byte HAS_SKIP_DATA_MASK = (byte) 1 << 1;
+
+  private static final byte HAS_VINT_POSITION_BLOCK_MASK = (byte) 1 << 2;
+
+  public static final int NUM_TOTAL_TYPES = 8;

Review Comment:
   Hmm, 8 because even for the most demanding indexing options, which ask for 
everything (doc, freq, position, payload, and offset), there are three optional 
states: singleton docid, skip offset, and the last VINT position block offset.
   
   For lighter indexing options, it is possible that the terms produced will 
have fewer than 8 types. For example, if position is not needed then we know we 
never have a last VINT position block offset, so there can only be up to 4 types. 
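
   The arithmetic in miniature: the three flag bits from the quoted `TermType` 
masks multiply out to 2^3 = 8 ids (a sketch that enumerates ids, not the PR's 
actual construction):

   ```java
   public class TermTypeCount {
       // Masks copied from the quoted TermType above:
       static final byte SINGLETON_DOC_MASK = (byte) 1;
       static final byte HAS_SKIP_DATA_MASK = (byte) (1 << 1);
       static final byte HAS_VINT_POSITION_BLOCK_MASK = (byte) (1 << 2);

       public static void main(String[] args) {
           // Three independent bits give 2^3 = 8 possible term types:
           for (int id = 0; id < 8; id++) {
               System.out.printf("id=%d singleton=%b skipData=%b vintPositions=%b%n",
                   id,
                   (id & SINGLETON_DOC_MASK) != 0,
                   (id & HAS_SKIP_DATA_MASK) != 0,
                   (id & HAS_VINT_POSITION_BLOCK_MASK) != 0);
           }
           // If positions are not indexed, the VINT bit is always clear,
           // leaving only the 2^2 = 4 ids below 4.
       }
   }
   ```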






Re: [PR] Random access term dictionary [lucene]

2023-11-03 Thread via GitHub


Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1381991598


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> 
(type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+  private static long MAX_ORD = (1L << 60) - 1;

Review Comment:
   Yes, and we need to set bit 0 to 1 to indicate it is not the `NO_OUTPUT` 
value for the FST. 



##
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> 
(type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+  private static long MAX_ORD = (1L << 60) - 1;
+
+  private final long[] countPerType = new long[TermType.NUM_TOTAL_TYPES];
+  private final FSTCompiler<Long> fstCompiler =
+      new FSTCompiler<>(FST.INPUT_TYPE.BYTE1, PositiveIntOutputs.getSingleton());
+
+  TermsIndexBuilder() {
+Arrays.fill(countPerType, -1);
+  }
+
+  public void addTerm(BytesRef term, TermType termType) throws IOException {
+IntsRefBuilder scratchInts = new IntsRefBuilder();
+long ord = ++countPerType[termType.getId()];
+fstCompiler.add(Util.toIntsRef(term, scratchInts), encode(ord, termType));
+  }
+
+  public TermsIndex build() throws IOException {
+return new TermsIndex(fstCompiler.compile());
+  }
+
+  private long encode(long ord, TermType termType) {
+// use a single long to encode `ord` and `termType`
+// also consider the special value of `PositiveIntOutputs.NO_OUTPUT == 0`
+// so it looks like this |...  ord ...| termType| ... hasOutput  ...|
+// where termType takes 3 bit and hasOutput takes the lowest bit. The rest 
is taken by ord
+if (ord < 0) {
+  throw new IllegalArgumentException("can't encode negative ord");
+}
+if (ord > MAX_ORD) {
+  throw new IllegalArgumentException(
+  "Input ord is too large for TermType: "

Review Comment:
   +1



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.



Re: [PR] Random access term dictionary [lucene]

2023-11-03 Thread via GitHub


Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1381991598


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> 
(type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+  private static long MAX_ORD = (1L << 60) - 1;

Review Comment:
   Yes, and we need to set the 0 bit to 1 to indicate it is not the `NO_OUTPUT` value for the FST, so we peel off 4 bits in total.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [DRAFT] Bump release to Java 21 [lucene]

2023-11-03 Thread via GitHub


rmuir commented on PR #12753:
URL: https://github.com/apache/lucene/pull/12753#issuecomment-1792837540

   primary purpose of this task is to benchmark collation, where you really 
want to use the tag anyway, e.g. `de-DE-u-co-phonebk`
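   A small standalone illustration of why the tag form matters here (not part of the benchmark code): the phonebook collation variant can only be requested through a BCP 47 Unicode extension, which the legacy `new Locale("de", "DE")` constructor cannot express.

   ```java
   import java.text.Collator;
   import java.util.Locale;

   class PhonebookCollationDemo {
       public static void main(String[] args) {
           // The -u-co-phonebk extension requests German phonebook collation.
           Locale phonebook = Locale.forLanguageTag("de-DE-u-co-phonebk");
           System.out.println(phonebook.getUnicodeLocaleType("co")); // phonebk

           // Collator.getInstance picks the collation keyword up from the tag
           // (whether it is honored depends on the JDK's CLDR data).
           Collator collator = Collator.getInstance(phonebook);
           System.out.println(collator.getClass().getSimpleName());
       }
   }
   ```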


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Refactor access to VM options [lucene]

2023-11-03 Thread via GitHub


rmuir commented on code in PR #12754:
URL: https://github.com/apache/lucene/pull/12754#discussion_r1382004593


##
lucene/core/src/java/org/apache/lucene/util/Constants.java:
##
@@ -53,28 +53,33 @@ private Constants() {} // can't construct
   /** The value of System.getProperty("java.vendor"). */
   public static final String JAVA_VENDOR = System.getProperty("java.vendor");
 
+  /** True iff the Java runtime is a client runtime and C2 compiler is not 
enabled */
+  public static final boolean IS_CLIENT_VM =
+  System.getProperty("java.vm.info", "").contains("emulated-client");
+
   /** True iff running on a 64bit JVM */
-  public static final boolean JRE_IS_64BIT;
+  public static final boolean JRE_IS_64BIT = is64Bit();
+
+  /** true iff we know fast FMA is supported, to deliver less error */
+  public static final boolean HAS_FAST_FMA =
+  (IS_CLIENT_VM == false)
+  && Constants.OS_ARCH.equals("amd64")
+  && 
HotspotVMOptions.get("UseFMA").map(Boolean::valueOf).orElse(false);

Review Comment:
   ok, that makes sense. sorry i've been working off the other branch as a 
baseline for a while.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [DRAFT] Bump release to Java 21 [lucene]

2023-11-03 Thread via GitHub


rmuir commented on PR #12753:
URL: https://github.com/apache/lucene/pull/12753#issuecomment-1792850561

   build is green


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Adding new flat vector format and refactoring HNSW [lucene]

2023-11-03 Thread via GitHub


jimczi commented on PR #12729:
URL: https://github.com/apache/lucene/pull/12729#issuecomment-1792854167

   > I think we should expose the flat formats in the codec. But the required 
new functions for indexing the vectors seem to justify a new abstraction.
   
   Can we add the abstraction as an extension of the existing 
knnVectorFormat/Readers/Writers? So instead of adding the functions to the base 
class, we introduce the flat format as an abstract class that extends the base 
one. 
   That would allow the HNSW format to reference it directly, and all flat formats would inherit from it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Replace usage of deprecated java.net.URL constructor with URI [lucene]

2023-11-03 Thread via GitHub


rmuir commented on PR #12755:
URL: https://github.com/apache/lucene/pull/12755#issuecomment-1792858675

   needs @uschindler to review. Only germans understand the difference between 
URI and URL. Probably not great usability-wise for java to deprecate URL and 
force everyone to deal with this stuff...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Replace usage of deprecated java.net.URL constructor with URI [lucene]

2023-11-03 Thread via GitHub


rmuir commented on PR #12755:
URL: https://github.com/apache/lucene/pull/12755#issuecomment-1792862514

   oh yeah, this is also the same class that does DNS lookups in its `equals()` 
method :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [DRAFT] Bump release to Java 21 [lucene]

2023-11-03 Thread via GitHub


ChrisHegarty commented on code in PR #12753:
URL: https://github.com/apache/lucene/pull/12753#discussion_r1382020346


##
lucene/analysis/common/src/java/org/apache/lucene/analysis/de/GermanStemmer.java:
##
@@ -33,7 +33,7 @@ class GermanStemmer {
   /** Amount of characters that are removed with substitute() 
while stemming. */
   private int substCount = 0;
 
-  private static final Locale locale = new Locale("de", "DE");
+  private static final Locale locale = new 
Locale.Builder().setLanguageTag("de-DE").build();

Review Comment:
   Trivially, we could use `Locale.GERMANY`.
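   On current JDKs the two spellings do produce equal `Locale` instances, which is what makes the suggestion viable (a quick check, not a guarantee from the javadoc, as discussed below):

   ```java
   import java.util.Locale;

   class LocaleEquivalenceDemo {
       public static void main(String[] args) {
           // The builder form from the PR and the JDK constant compare equal.
           Locale viaTag = new Locale.Builder().setLanguageTag("de-DE").build();
           System.out.println(viaTag.equals(Locale.GERMANY)); // true
           System.out.println(Locale.GERMANY.toLanguageTag()); // de-DE
       }
   }
   ```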



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Refactor access to VM options and move some VM options to oal.util.Constants [lucene]

2023-11-03 Thread via GitHub


uschindler commented on code in PR #12754:
URL: https://github.com/apache/lucene/pull/12754#discussion_r1382022330


##
lucene/core/src/java/org/apache/lucene/util/Constants.java:
##
@@ -53,28 +53,33 @@ private Constants() {} // can't construct
   /** The value of System.getProperty("java.vendor"). */
   public static final String JAVA_VENDOR = System.getProperty("java.vendor");
 
+  /** True iff the Java runtime is a client runtime and C2 compiler is not 
enabled */
+  public static final boolean IS_CLIENT_VM =
+  System.getProperty("java.vm.info", "").contains("emulated-client");
+
   /** True iff running on a 64bit JVM */
-  public static final boolean JRE_IS_64BIT;
+  public static final boolean JRE_IS_64BIT = is64Bit();
+
+  /** true iff we know fast FMA is supported, to deliver less error */
+  public static final boolean HAS_FAST_FMA =
+  (IS_CLIENT_VM == false)
+  && Constants.OS_ARCH.equals("amd64")
+  && 
HotspotVMOptions.get("UseFMA").map(Boolean::valueOf).orElse(false);

Review Comment:
   I will merge it to your branch once merged to main/9.x



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [DRAFT] Bump release to Java 21 [lucene]

2023-11-03 Thread via GitHub


ChrisHegarty commented on code in PR #12753:
URL: https://github.com/apache/lucene/pull/12753#discussion_r1382022708


##
lucene/benchmark/src/test/org/apache/lucene/benchmark/byTask/TestPerfTasksLogic.java:
##
@@ -793,19 +793,17 @@ public void testLocale() throws Exception {
 
 // ROOT locale
 benchmark = execBenchmark(getLocaleConfig("ROOT"));
-assertEquals(new Locale(""), benchmark.getRunData().getLocale());
+assertEquals(Locale.ROOT, benchmark.getRunData().getLocale());
 
 // specify just a language
 benchmark = execBenchmark(getLocaleConfig("de"));
-assertEquals(new Locale("de"), benchmark.getRunData().getLocale());
+assertEquals(
+new Locale.Builder().setLanguageTag("de").build(), 
benchmark.getRunData().getLocale());

Review Comment:
   Trivially we could use `Locale.GERMAN`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [DRAFT] Bump release to Java 21 [lucene]

2023-11-03 Thread via GitHub


ChrisHegarty commented on PR #12753:
URL: https://github.com/apache/lucene/pull/12753#issuecomment-1792871213

   >build is green 
   
   Woot! Thanks @rmuir 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Refactor access to VM options and move some VM options to oal.util.Constants [lucene]

2023-11-03 Thread via GitHub


uschindler commented on PR #12754:
URL: https://github.com/apache/lucene/pull/12754#issuecomment-1792879954

   Hi @rmuir ,
   I also fixed the broken security manager and NULL property handling in 
Constants.java, so we won't crash.
   That's an improvement, but long overdue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [DRAFT] Bump release to Java 21 [lucene]

2023-11-03 Thread via GitHub


rmuir commented on code in PR #12753:
URL: https://github.com/apache/lucene/pull/12753#discussion_r1382037193


##
lucene/analysis/common/src/java/org/apache/lucene/analysis/de/GermanStemmer.java:
##
@@ -33,7 +33,7 @@ class GermanStemmer {
   /** Amount of characters that are removed with substitute() 
while stemming. */
   private int substCount = 0;
 
-  private static final Locale locale = new Locale("de", "DE");
+  private static final Locale locale = new 
Locale.Builder().setLanguageTag("de-DE").build();

Review Comment:
   until germany changes their mind about something politically and the 
constant no longer translates to "de-DE"? :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [DRAFT] Bump release to Java 21 [lucene]

2023-11-03 Thread via GitHub


rmuir commented on code in PR #12753:
URL: https://github.com/apache/lucene/pull/12753#discussion_r1382039175


##
lucene/analysis/common/src/java/org/apache/lucene/analysis/de/GermanStemmer.java:
##
@@ -33,7 +33,7 @@ class GermanStemmer {
   /** Amount of characters that are removed with substitute() 
while stemming. */
   private int substCount = 0;
 
-  private static final Locale locale = new Locale("de", "DE");
+  private static final Locale locale = new 
Locale.Builder().setLanguageTag("de-DE").build();

Review Comment:
   Sorry, I'm just nitpicky, but i see no such guarantee in the javadocs for 
what this constant does. The javadoc says "Useful constant for country."



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [DRAFT] Bump release to Java 21 [lucene]

2023-11-03 Thread via GitHub


ChrisHegarty commented on code in PR #12753:
URL: https://github.com/apache/lucene/pull/12753#discussion_r1382043830


##
lucene/analysis/common/src/java/org/apache/lucene/analysis/de/GermanStemmer.java:
##
@@ -33,7 +33,7 @@ class GermanStemmer {
   /** Amount of characters that are removed with substitute() 
while stemming. */
   private int substCount = 0;
 
-  private static final Locale locale = new Locale("de", "DE");
+  private static final Locale locale = new 
Locale.Builder().setLanguageTag("de-DE").build();

Review Comment:
   Yeah, the javadoc should be improved. I’ll file a jdk issue for that.



##
lucene/analysis/common/src/java/org/apache/lucene/analysis/de/GermanStemmer.java:
##
@@ -33,7 +33,7 @@ class GermanStemmer {
   /** Amount of characters that are removed with substitute() 
while stemming. */
   private int substCount = 0;
 
-  private static final Locale locale = new Locale("de", "DE");
+  private static final Locale locale = new 
Locale.Builder().setLanguageTag("de-DE").build();

Review Comment:
   Yeah, the javadoc should be improved. I’ll file a jdk issue for that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Random access term dictionary [lucene]

2023-11-03 Thread via GitHub


Tony-X commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1381994228


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/TermsIndexBuilder.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRefBuilder;
+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.FSTCompiler;
+import org.apache.lucene.util.fst.PositiveIntOutputs;
+import org.apache.lucene.util.fst.Util;
+
+/**
+ * Builds a term index for a given field. Logically this is a map: term -> 
(type, ord) where the
+ * ordinals are scoped to type (not global).
+ */
+final class TermsIndexBuilder {
+  private static long MAX_ORD = (1L << 60) - 1;
+
+  private final long[] countPerType = new long[TermType.NUM_TOTAL_TYPES];
+  private final FSTCompiler<Long> fstCompiler =
+      new FSTCompiler<>(FST.INPUT_TYPE.BYTE1, PositiveIntOutputs.getSingleton());
+
+  TermsIndexBuilder() {
+Arrays.fill(countPerType, -1);
+  }
+
+  public void addTerm(BytesRef term, TermType termType) throws IOException {
+IntsRefBuilder scratchInts = new IntsRefBuilder();
+long ord = ++countPerType[termType.getId()];
+fstCompiler.add(Util.toIntsRef(term, scratchInts), encode(ord, termType));
+  }
+
+  public TermsIndex build() throws IOException {
+return new TermsIndex(fstCompiler.compile());
+  }
+
+  private long encode(long ord, TermType termType) {
+// use a single long to encode `ord` and `termType`
+// also consider the special value of `PositiveIntOutputs.NO_OUTPUT == 0`
+// so it looks like this |...  ord ...| termType| ... hasOutput  ...|
+// where termType takes 3 bit and hasOutput takes the lowest bit. The rest 
is taken by ord
+if (ord < 0) {
+  throw new IllegalArgumentException("can't encode negative ord");
+}
+if (ord > MAX_ORD) {
+  throw new IllegalArgumentException(
+  "Input ord is too large for TermType: "

Review Comment:
   Did you mean to spell out the input ord? 



##
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/Lucene90RandomAccessDictionaryPostingsFormat.java:
##
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.codecs.lucene90.randomaccess;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.FieldsConsumer;
+import org.apache.lucene.codecs.FieldsProducer;
+import org.apache.lucene.codecs.PostingsFormat;
+import org.apache.lucene.codecs.PostingsReaderBase;
+import org.apache.lucene.codecs.PostingsWriterBase;
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat;
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsReader;
+import org.apache.lucene.codecs.lucene90.Lucene90PostingsWriter;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.util.IOUtils;
+
+/**
+ * Similar to {@link Lucene90PostingsFormat} but with a different term 
dictionary implementation.

Review Comment:
   > BTW I think FST will pack better if you put the 3 "term type" bits at the 
end (lsb) of the long instead of msb, because then prefixes can be shared 
across all term types, instead of ju

Re: [PR] [DRAFT] Bump release to Java 21 [lucene]

2023-11-03 Thread via GitHub


ChrisHegarty commented on code in PR #12753:
URL: https://github.com/apache/lucene/pull/12753#discussion_r1382046377


##
lucene/analysis/common/src/java/org/apache/lucene/analysis/de/GermanStemmer.java:
##
@@ -33,7 +33,7 @@ class GermanStemmer {
   /** Amount of characters that are removed with substitute() 
while stemming. */
   private int substCount = 0;
 
-  private static final Locale locale = new Locale("de", "DE");
+  private static final Locale locale = new 
Locale.Builder().setLanguageTag("de-DE").build();

Review Comment:
   Anyway, my comments were trivial, what you have is fine.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Random access term dictionary [lucene]

2023-11-03 Thread via GitHub


Tony-X commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1792903331

   > It seems like you have the low level encode/decode working? So all that 
remains is to hook that up with the Codec components that read/write the terms 
dict ... then you can test the Codec by passing -Dtests.codec= and Lucene 
will run all tests cases using your Codec.
   
   Thanks for the tips! Yes, almost there. I'm working on the real compact 
bitpacker and unpacker. I still need to implement the PostingsFormat afterwards. 
Do you think I need to implement a new Codec? 





Re: [PR] Random access term dictionary [lucene]

2023-11-03 Thread via GitHub


nknize commented on code in PR #12688:
URL: https://github.com/apache/lucene/pull/12688#discussion_r1382078005


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/lucene90/randomaccess/bitpacking/BitPacker.java:
##


Review Comment:
   > I'm in the process of building the real compact bit packer.
   
   +1. I'm assuming that will be added to this PR?
   
   
   
   > The goal here is to pack a sequence of values that have different 
bitwidths... We can't use VLong either since we aim to write fixed-size records, 
so that we can do random access
   
   Hmm.. I'll have to look deeper at this. The reason I ask is because I did a 
similar bit packing w/ "random access" when serializing the [ShapeDocValues 
binary 
tree](https://github.com/apache/lucene/blob/d6836d3d0e5d33a98b35c0885b9787f46c4be47e/lucene/core/src/java/org/apache/lucene/document/ShapeDocValues.java#L462C26-L462C26)
 and it feels like we often re-implement this logic in different forms for 
different use cases. Can we generalize this and lean out our code base to make 
it more usable and readable?
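As a hedged illustration of the fixed-width idea being discussed (the class and layout here are hypothetical, not the PR's actual implementation): when every record is packed at the same bit width, record i starts at bit `i * bitsPerValue`, which is exactly what allows random access without VLong-style sequential decoding:

```java
// Hypothetical sketch: fixed bits-per-value packing into a long[] with random access.
final class FixedBitPacker {
    private final long[] blocks;
    private final int bitsPerValue;

    FixedBitPacker(int count, int bitsPerValue) {
        this.bitsPerValue = bitsPerValue;
        this.blocks = new long[(count * bitsPerValue + 63) / 64];
    }

    /** Writes a value (assumed to fit in bitsPerValue bits) at the given index. */
    void set(int index, long value) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        blocks[block] |= value << shift;
        int spill = shift + bitsPerValue - 64; // bits that cross into the next long
        if (spill > 0) {
            blocks[block + 1] |= value >>> (bitsPerValue - spill);
        }
    }

    /** Reads the value at the given index in O(1), independent of neighbors. */
    long get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long mask = bitsPerValue == 64 ? -1L : (1L << bitsPerValue) - 1;
        long value = blocks[block] >>> shift;
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            value |= blocks[block + 1] << (bitsPerValue - spill);
        }
        return value & mask;
    }
}
```

This is roughly the shape of what Lucene's packed-ints utilities already provide, which may be a starting point for the generalization suggested above.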






Re: [PR] Replace usage of deprecated java.net.URL constructor with URI [lucene]

2023-11-03 Thread via GitHub


rmuir commented on PR #12755:
URL: https://github.com/apache/lucene/pull/12755#issuecomment-1792956804

   > The German explanation: one is a location, the other is just an opaque 
name. Every URL is a URI, but not the other way around.
   
   If every URL is a URI, then how come `URL.equals()` does a DNS lookup and 
`URI.equals()` does not?





Re: [PR] Replace usage of deprecated java.net.URL constructor with URI [lucene]

2023-11-03 Thread via GitHub


uschindler commented on PR #12755:
URL: https://github.com/apache/lucene/pull/12755#issuecomment-1792961270

   > > The German explanation: one is a location, the other is just an opaque 
name. Every URL is a URI, but not the other way around.
   > 
   > If every URL is a URI, then how come `URL.equals()` does a DNS lookup and 
`URI.equals()` does not?
   
   That's the buggy Java implementation.
   
   The question was about the difference between the terms URI and URL according 
to the RFCs and standards.
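A minimal sketch of the migration pattern this PR applies (the URL string is just an example): build a `URI` and convert with `toURL()` instead of calling the deprecated `URL` constructor. A side benefit relevant to the discussion above is that `URI.equals()` compares components textually, so no network I/O is involved:

```java
import java.net.URI;
import java.net.URL;

public class UrlViaUri {
    public static void main(String[] args) throws Exception {
        // Deprecated since Java 20: new URL("https://lucene.apache.org/")
        URL url = URI.create("https://lucene.apache.org/").toURL();

        // URI.equals() is purely syntactic -- no DNS lookup
        URI a = URI.create("https://lucene.apache.org/");
        URI b = URI.create("https://lucene.apache.org/docs");
        System.out.println(url.getHost()); // lucene.apache.org
        System.out.println(a.equals(b));   // false
    }
}
```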




