[GitHub] [lucene] benwtrent commented on pull request #12529: Introduce a random vector scorer in HNSW builder/searcher

2023-09-08 Thread via GitHub


benwtrent commented on PR #12529:
URL: https://github.com/apache/lucene/pull/12529#issuecomment-1711551199

   @msokolov what say you? It seems like encapsulating random vector seeking & 
scoring into one thing makes the code simpler.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #12529: Introduce a random vector scorer in HNSW builder/searcher

2023-09-08 Thread via GitHub


jpountz commented on code in PR #12529:
URL: https://github.com/apache/lucene/pull/12529#discussion_r1319777844


##
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswVectorsReader.java:
##
@@ -423,8 +422,12 @@ public RandomAccessVectorValues copy() {
 
 @Override
 public float[] vectorValue(int targetOrd) throws IOException {
+  if (lastOrd == targetOrd) {
+return value;
+  }

Review Comment:
   I'm curious, is it common to read the same ord multiple times in a row?



##
lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Word2VecSynonymProvider.java:
##
@@ -23,14 +23,11 @@
 import java.io.IOException;
 import java.util.LinkedList;
 import java.util.List;
-import org.apache.lucene.index.VectorEncoding;
 import org.apache.lucene.index.VectorSimilarityFunction;
 import org.apache.lucene.search.KnnCollector;
 import org.apache.lucene.search.TopDocs;
 import org.apache.lucene.util.BytesRef;
-import org.apache.lucene.util.hnsw.HnswGraphBuilder;
-import org.apache.lucene.util.hnsw.HnswGraphSearcher;
-import org.apache.lucene.util.hnsw.OnHeapHnswGraph;
+import org.apache.lucene.util.hnsw.*;

Review Comment:
   Nit: We prefer to avoid star imports.



##
lucene/core/src/java/org/apache/lucene/util/hnsw/RandomVectorScorerProvider.java:
##
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.hnsw;
+
+import java.io.IOException;
+import org.apache.lucene.index.VectorSimilarityFunction;
+
+/** A provider that creates {@link RandomVectorScorer} from an ordinal. */
+public interface RandomVectorScorerProvider {

Review Comment:
   Naming nit: the naming convention seems to be to call things "Supplier"s 
rather than "Provider"s.



##
lucene/core/src/java/org/apache/lucene/codecs/lucene95/OffHeapByteVectorValues.java:
##
@@ -60,13 +61,17 @@ public int size() {
 
   @Override
   public byte[] vectorValue(int targetOrd) throws IOException {
-readValue(targetOrd);
+if (lastOrd != targetOrd) {
+  readValue(targetOrd);
+  lastOrd = targetOrd;
+}
 return binaryValue;
   }
 
   private void readValue(int targetOrd) throws IOException {
 slice.seek((long) targetOrd * byteSize);
 slice.readBytes(byteBuffer.array(), byteBuffer.arrayOffset(), byteSize);
+lastOrd = targetOrd;

Review Comment:
   we already set `lastOrd` in `vectorValue()`, so it's not needed here?



##
lucene/backward-codecs/src/test/org/apache/lucene/backward_codecs/lucene94/Lucene94HnswVectorsWriter.java:
##
@@ -47,12 +47,8 @@
 import org.apache.lucene.util.IOUtils;
 import org.apache.lucene.util.InfoStream;
 import org.apache.lucene.util.RamUsageEstimator;
-import org.apache.lucene.util.hnsw.HnswGraph;
+import org.apache.lucene.util.hnsw.*;

Review Comment:
   nit: star import



##
lucene/backward-codecs/src/test/org/apache/lucene/backward_codecs/lucene92/Lucene92HnswVectorsWriter.java:
##
@@ -34,16 +34,12 @@
 import org.apache.lucene.index.FloatVectorValues;
 import org.apache.lucene.index.IndexFileNames;
 import org.apache.lucene.index.SegmentWriteState;
-import org.apache.lucene.index.VectorEncoding;
 import org.apache.lucene.index.VectorSimilarityFunction;
 import org.apache.lucene.search.DocIdSetIterator;
 import org.apache.lucene.store.IndexInput;
 import org.apache.lucene.store.IndexOutput;
 import org.apache.lucene.util.IOUtils;
-import org.apache.lucene.util.hnsw.HnswGraphBuilder;
-import org.apache.lucene.util.hnsw.NeighborArray;
-import org.apache.lucene.util.hnsw.OnHeapHnswGraph;
-import org.apache.lucene.util.hnsw.RandomAccessVectorValues;
+import org.apache.lucene.util.hnsw.*;

Review Comment:
   nit: star import



##
lucene/core/src/java/org/apache/lucene/util/hnsw/RandomVectorScorerProvider.java:
##
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * 

[GitHub] [lucene] mikemccand commented on issue #12542: Lucene's FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality

2023-09-08 Thread via GitHub


mikemccand commented on issue #12542:
URL: https://github.com/apache/lucene/issues/12542#issuecomment-1711608900

   Digging into this a bit, I think I found some silly performance bugs in our 
current FST impl:
 * We seem to create a `PagedGrowableWriter` with [page size 128 MB 
here](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java#L34),
 meaning even when building a small FST, we are allocating at least 128 MB 
pages?
 * When we rehash, we create a new `PagedGrowableWriter`, with too small 
estimated `bitsRequired` [since we pass `count` (the number of nodes in the 
hash) instead of the most recently added `long 
node`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java#L182).
  We are actually storing the `long node` values, so it really should be `node` 
not `count`.  The effect of this is we make `PagedGrowableWriter` work harder 
than necessary to reallocate when we store the next `node` that doesn't fit in 
that `bitsRequired`.
   
   I'll try to get the LRU hash working, but if that takes too long, we should 
separately fix these performance bugs (if I'm right that these are really 
bugs!).
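   To make the second point concrete, here is a small standalone sketch (plain Java, not Lucene's actual `PackedInts` code) of why sizing by `count` underestimates the bits needed to store `node` values:

```java
// Standalone sketch of the bitsRequired mismatch: the hash stores long
// node addresses, whose values can be far larger than the number of
// entries (count), so sizing by count underestimates the bits per value.
public class BitsRequiredDemo {
    // Same computation as bits-required-to-represent for non-negative values.
    static int bitsRequired(long v) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(v));
    }

    public static void main(String[] args) {
        long count = 1_000_000L; // number of nodes in the hash
        long node = 1L << 40;    // a recently written node address
        System.out.println(bitsRequired(count)); // prints 20
        System.out.println(bitsRequired(node));  // prints 41
        // A writer sized for 20 bits per value must regrow as soon as a
        // node address needing 41 bits is stored.
    }
}
```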





[GitHub] [lucene] javanna commented on pull request #12544: Close index readers in tests

2023-09-08 Thread via GitHub


javanna commented on PR #12544:
URL: https://github.com/apache/lucene/pull/12544#issuecomment-1711628290

   thanks @jpountz !





[GitHub] [lucene] javanna merged pull request #12544: Close index readers in tests

2023-09-08 Thread via GitHub


javanna merged PR #12544:
URL: https://github.com/apache/lucene/pull/12544





[GitHub] [lucene] dweiss commented on issue #12542: Lucene's FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality

2023-09-08 Thread via GitHub


dweiss commented on issue #12542:
URL: https://github.com/apache/lucene/issues/12542#issuecomment-1711706053

   With regard to automata/ FSTs - they're nearly the same thing, conceptually. 
Automata are logically transducers producing a constant epsilon value (no 
value). This knowledge can be used to make them smaller but they're the same 
animal, really.
   
   The root of the automaton/fst difference in Lucene is historically the 
source of code: brics package for constructing arbitrary automata, Daciuk/Mihov 
algorithm implementing minimal (at least at first) FST construction directly 
from sorted data (no intermediate "unoptimized" transitions).
   
   > Non-minimal FST means the index is a wee bit bigger, and perhaps lookups through the FST are a bit slower since we must have more bytes hot / fewer bytes cache local. But it's likely these effects are minuscule relative to the RAM savings during construction.
   
   If we allow non-optimal "transition graphs" then in theory we could also build FSTs in parallel: just build them independently from different prefix blocks. Yes, it wouldn't be optimal, but if we don't care then so be it.
   





[GitHub] [lucene] jimczi commented on a diff in pull request #12529: Introduce a random vector scorer in HNSW builder/searcher

2023-09-08 Thread via GitHub


jimczi commented on code in PR #12529:
URL: https://github.com/apache/lucene/pull/12529#discussion_r1319990727


##
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswVectorsReader.java:
##
@@ -423,8 +422,12 @@ public RandomAccessVectorValues copy() {
 
 @Override
 public float[] vectorValue(int targetOrd) throws IOException {
+  if (lastOrd == targetOrd) {
+return value;
+  }

Review Comment:
   Not during normal search, where only the query vector is used multiple times, but it happens frequently when we build the index, during the diversification phase. There, we compare single ordinals with all of their neighbors sequentially, so instead of having the optimisation in the graph I moved it to the reader, since it's straightforward to add and the check shouldn't add any overhead.
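   The caching pattern under discussion can be sketched in isolation (hypothetical `CachingVectorReader`; the real readers seek into an `IndexInput` slice rather than calling a loader function):

```java
// Minimal sketch of the lastOrd memoization discussed above: repeated
// calls with the same ordinal skip the (simulated) disk read entirely.
import java.util.function.IntFunction;

class CachingVectorReader {
    private final IntFunction<float[]> loader; // stands in for the disk read
    private int lastOrd = -1;
    private float[] value;
    int loads = 0; // counts actual reads, for illustration only

    CachingVectorReader(IntFunction<float[]> loader) {
        this.loader = loader;
    }

    float[] vectorValue(int targetOrd) {
        if (lastOrd == targetOrd) {
            return value; // repeated ord: skip the seek + read
        }
        value = loader.apply(targetOrd);
        lastOrd = targetOrd;
        loads++;
        return value;
    }
}
```

   During diversification, where one ordinal is scored against all of its neighbors back to back, the repeated calls with the same ord all hit the cached value.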






[GitHub] [lucene] jimczi commented on a diff in pull request #12529: Introduce a random vector scorer in HNSW builder/searcher

2023-09-08 Thread via GitHub


jimczi commented on code in PR #12529:
URL: https://github.com/apache/lucene/pull/12529#discussion_r132915


##
lucene/core/src/java/org/apache/lucene/util/hnsw/RandomVectorScorerProvider.java:
##
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.hnsw;
+
+import java.io.IOException;
+import org.apache.lucene.index.VectorSimilarityFunction;
+
+/** A provider that creates {@link RandomVectorScorer} from an ordinal. */
+public interface RandomVectorScorerProvider {
+  /**
+   * This creates a {@link RandomVectorScorer} for scoring random nodes in 
batches against the given
+   * ordinal.
+   *
+   * @param ord the ordinal of the node to compare
+   * @return a new {@link RandomVectorScorer}
+   */
+  RandomVectorScorer scorer(int ord) throws IOException;
+
+  /**
+   * Creates a {@link RandomVectorScorerProvider} to compare float vectors.
+   *
+   * WARNING: The {@link RandomAccessVectorValues} given can contain 
stateful buffers. Avoid
+   * using it after calling this function. If you plan to use it again outside 
the returned {@link
+   * RandomVectorScorer}, think about passing a copied version ({@link
+   * RandomAccessVectorValues#copy}).
+   *
+   * @param vectors the underlying storage for vectors
+   * @param similarityFunction the similarity function to score vectors
+   */
+  static RandomVectorScorerProvider createFloats(
+  final RandomAccessVectorValues vectors,
+  final VectorSimilarityFunction similarityFunction)
+  throws IOException {
+final RandomAccessVectorValues vectorsCopy = vectors.copy();
+return queryOrd ->
+(RandomVectorScorer)
+cand ->
+similarityFunction.compare(
+vectors.vectorValue(queryOrd), 
vectorsCopy.vectorValue(cand));

Review Comment:
   I'm trying to limit the number of copies we make. In this model, we only make one copy for the entire supplier. We depend on the fact that when we call vectors.vectorValue twice in a row with the same ord, it will reuse the previous value.
   
   Your suggestion would result in making a copy every time the supplier creates a scorer. Since we create a scorer every time we diversify a node, I believe this would have a noticeable impact.
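   A simplified sketch of this one-copy-per-supplier design (stubbed `Vectors`/`Scorer` interfaces; the real Lucene types and signatures differ):

```java
// Simplified sketch of the "one copy per supplier" design discussed above.
// Vectors stands in for RandomAccessVectorValues; real Lucene types differ.
interface Vectors {
    float[] vectorValue(int ord); // may reuse an internal buffer across calls
    Vectors copy();               // fresh instance with its own buffer
}

interface Scorer {
    float score(int candidateOrd);
}

interface ScorerSupplier {
    Scorer scorer(int queryOrd);

    // One copy for the whole supplier: `vectors` serves the query ord (the
    // same ord repeatedly, so a lastOrd cache keeps hitting) while `copy`
    // serves the candidates, so the two buffers never clobber each other.
    static ScorerSupplier create(Vectors vectors) {
        Vectors copy = vectors.copy();
        return queryOrd -> cand ->
            dotProduct(vectors.vectorValue(queryOrd), copy.vectorValue(cand));
    }

    static float dotProduct(float[] a, float[] b) {
        float sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```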






[GitHub] [lucene] jimczi commented on a diff in pull request #12529: Introduce a random vector scorer in HNSW builder/searcher

2023-09-08 Thread via GitHub


jimczi commented on code in PR #12529:
URL: https://github.com/apache/lucene/pull/12529#discussion_r1320004511


##
lucene/core/src/java/org/apache/lucene/codecs/lucene95/OffHeapByteVectorValues.java:
##
@@ -60,13 +61,17 @@ public int size() {
 
   @Override
   public byte[] vectorValue(int targetOrd) throws IOException {
-readValue(targetOrd);
+if (lastOrd != targetOrd) {
+  readValue(targetOrd);
+  lastOrd = targetOrd;
+}
 return binaryValue;
   }
 
   private void readValue(int targetOrd) throws IOException {
 slice.seek((long) targetOrd * byteSize);
 slice.readBytes(byteBuffer.array(), byteBuffer.arrayOffset(), byteSize);
+lastOrd = targetOrd;

Review Comment:
   thanks, addressed in 
https://github.com/apache/lucene/pull/12529/commits/31d8b549e62a288358c58cd666767d09bf17f65c






[GitHub] [lucene] jimczi commented on a diff in pull request #12529: Introduce a random vector scorer in HNSW builder/searcher

2023-09-08 Thread via GitHub


jimczi commented on code in PR #12529:
URL: https://github.com/apache/lucene/pull/12529#discussion_r1320004724


##
lucene/core/src/java/org/apache/lucene/util/hnsw/RandomVectorScorerProvider.java:
##
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.hnsw;
+
+import java.io.IOException;
+import org.apache.lucene.index.VectorSimilarityFunction;
+
+/** A provider that creates {@link RandomVectorScorer} from an ordinal. */
+public interface RandomVectorScorerProvider {

Review Comment:
   Addressed in 
https://github.com/apache/lucene/pull/12529/commits/31d8b549e62a288358c58cd666767d09bf17f65c






[GitHub] [lucene] jimczi commented on a diff in pull request #12529: Introduce a random vector scorer in HNSW builder/searcher

2023-09-08 Thread via GitHub


jimczi commented on code in PR #12529:
URL: https://github.com/apache/lucene/pull/12529#discussion_r1320004018


##
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene91/Lucene91HnswVectorsReader.java:
##
@@ -42,9 +41,7 @@
 import org.apache.lucene.util.Bits;
 import org.apache.lucene.util.IOUtils;
 import org.apache.lucene.util.RamUsageEstimator;
-import org.apache.lucene.util.hnsw.HnswGraph;
-import org.apache.lucene.util.hnsw.HnswGraphSearcher;
-import org.apache.lucene.util.hnsw.RandomAccessVectorValues;
+import org.apache.lucene.util.hnsw.*;

Review Comment:
   I have to remove this automatic rule from my IDE. I upgraded to a new 
version and forgot to copy my previous settings. Thanks for spotting!






[GitHub] [lucene] mikemccand commented on pull request #12489: Add support for recursive graph bisection.

2023-09-08 Thread via GitHub


mikemccand commented on PR #12489:
URL: https://github.com/apache/lucene/pull/12489#issuecomment-1712097368

   @jpountz did you measure any change to index size with the reordered docids?





[GitHub] [lucene] jpountz commented on pull request #12489: Add support for recursive graph bisection.

2023-09-08 Thread via GitHub


jpountz commented on PR #12489:
URL: https://github.com/apache/lucene/pull/12489#issuecomment-1712166542

   I did. My wikimedium file is sorted by title, which already gives some compression compared to random ordering. Disappointingly, recursive graph bisection only improved compression of postings (doc) by 1.5%. It significantly hurts stored fields though; I suspect it's because the `title` field is stored, and stored fields take advantage of splits of the same article being next to one another.
   
   | File | before (MB) | after (MB) |
   | - | - | - |
   | terms (tim) | 307 | 315 |
   | postings (doc) | 1706 | 1685 |
   | positions (pos) | 2563 | 2540 |
   | points (kdd) | 122 | 126 |
   | doc values (dvd) | 686 | 693 |
   | stored fields (fdt) | 255 | 364 |
   | norms (nvd) | 20 | 20 |
   | total | 5664 | 5747 |
   
   At first this gave me doubts about whether the algorithm was correctly implemented, but the query speedups suggest it is not completely wrong.
   
   I should run on wikibigall too.





[GitHub] [lucene] mikemccand commented on issue #12542: Lucene's FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality

2023-09-09 Thread via GitHub


mikemccand commented on issue #12542:
URL: https://github.com/apache/lucene/issues/12542#issuecomment-1712472912

   > We seem to create a PagedGrowableWriter with [page size 128 MB 
here](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java#L34),
 meaning even when building a small FST, we are allocating at least 128 MB 
pages?
   
   OK this was really freaking me out overnight (allocating a 128 MB array even 
for building the tiniest of FSTs), so I dug deeper, and it is a false alarm!
   
   It turns out that 
[`PagedGrowableWriter`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/packed/PagedGrowableWriter.java),
 via its [parent class `AbstractPagedMutable`, will allocate a "just big 
enough" final 
page](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/packed/AbstractPagedMutable.java#L57),
 instead of the full 128 MB page size.  And it will reallocate whenever the 
`NodeHash` resizes to a larger array.  There is also some sneaky power-of-2 mod 
trickery that ensures that that final page, even on indefinite rehashing, is 
always sized to exactly a power of 2.  And a [real if statement to enforce 
it](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/packed/PackedInts.java#L867-L869).
  Phew!
   
   I'll open a separate tiny PR to address the wrong `bitsRequired` during 
rehash -- that's just a smallish performance bug when building biggish FSTs.
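   The "just big enough final page" behavior can be sketched as follows (illustrative only; the real `AbstractPagedMutable` packs values into bits rather than storing raw longs):

```java
// Illustrative sketch of paged allocation with a "just big enough" final
// page: small structures never allocate a full 2^27-entry page.
class PagedArray {
    static final int PAGE_SHIFT = 27;             // 2^27 entries per full page
    static final int PAGE_SIZE = 1 << PAGE_SHIFT;
    static final int PAGE_MASK = PAGE_SIZE - 1;

    final long[][] pages;

    PagedArray(long size) {
        int numPages = (int) ((size + PAGE_SIZE - 1) >>> PAGE_SHIFT);
        pages = new long[numPages][];
        for (int i = 0; i < numPages; i++) {
            // The final page is sized to exactly what remains, not PAGE_SIZE.
            int pageSize =
                (int) Math.min(PAGE_SIZE, size - ((long) i << PAGE_SHIFT));
            pages[i] = new long[pageSize];
        }
    }

    long get(long index) {
        // A power-of-2 page size lets the mod be a cheap bit mask.
        return pages[(int) (index >>> PAGE_SHIFT)][(int) (index & PAGE_MASK)];
    }
}
```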





[GitHub] [lucene] mikemccand opened a new pull request, #12545: Fix minor (excess reallocation) performance bug when building FSTs

2023-09-09 Thread via GitHub


mikemccand opened a new pull request, #12545:
URL: https://github.com/apache/lucene/pull/12545

   The bitsRequired passed during NodeHash rehash (when building an FST) was 
too small, causing excess/wasted reallocations.  This is just a performance 
bug, especially impacting larger FSTs, but likely a small overall impact even 
then.
   
   I also reduced the page size during rehashing from 1 GB (1 << 30) back down 
to the initial 128 MB (1 << 27) created on init.
   
   Relates #12542
   





[GitHub] [lucene] mikemccand commented on pull request #12545: Fix minor (excess reallocation) performance bug when building FSTs

2023-09-09 Thread via GitHub


mikemccand commented on PR #12545:
URL: https://github.com/apache/lucene/pull/12545#issuecomment-1712474813

   Tests and precommit passed locally (once) for me ... I'll make sure 
`Test2BFST` passes once too.





[GitHub] [lucene] mikemccand commented on pull request #12545: Fix minor (excess reallocation) performance bug when building FSTs

2023-09-09 Thread via GitHub


mikemccand commented on PR #12545:
URL: https://github.com/apache/lucene/pull/12545#issuecomment-1712476087

   For the record, this command seems to at least kick off `Test2BFST`:
   
   `./gradlew test --max-workers=1 --tests org.apache.lucene.util.fst.Test2BFST 
-Dtests.nightly=true -Dtests.monster=true -Dtests.verbose=true 
-Dtests.heapsize=40G`
   
   Once we finish #12542 it'll be fun to see how much smaller a heap is 
needed...





[GitHub] [lucene] mikemccand commented on pull request #12545: Fix minor (excess reallocation) performance bug when building FSTs

2023-09-09 Thread via GitHub


mikemccand commented on PR #12545:
URL: https://github.com/apache/lucene/pull/12545#issuecomment-1712497190

   OK `Test2BFST` is happy:
   
   ```
   BUILD SUCCESSFUL in 54m 15s   
   ```





[GitHub] [lucene] stefanvodita opened a new issue, #12546: Compute multiple aggregations in one iteration of the match-set

2023-09-09 Thread via GitHub


stefanvodita opened a new issue, #12546:
URL: https://github.com/apache/lucene/issues/12546

   ### Description
   
   When a user knows that they want multiple different aggregations, they have 
to iterate the match-set once for each aggregation, which [is 
inefficient](https://lists.apache.org/thread/dfw1knm6xqx03wmzq7c9wcxv4loqdg6p). 
We can remedy this by enabling users to request more than one aggregation in a 
faceting call and then by handling all the computations in one go.





[GitHub] [lucene] stefanvodita opened a new pull request, #12547: Compute multiple float aggregations in one go

2023-09-09 Thread via GitHub


stefanvodita opened a new pull request, #12547:
URL: https://github.com/apache/lucene/pull/12547

   Usually facets maintain a one-dimensional array indexed by ordinal which 
keeps the values they're supposed to compute.
   The change here is simple in principle - use a two-dimensional array, 
indexed by aggregation and ordinal, so we can do multiple aggregations at once.
   
   For the methods that get top children / all children / specific values / 
dims, the default is to get the values corresponding to the first aggregation, 
but the aggregation can be specified.
   
   There is one tricky bit when we aggregate using provided values sources. In 
this case, we advance the values sources and get the values for each 
aggregation before iterating through the ordinals in each doc. When we iterate 
through the ordinals, we load each of the values we've already retrieved and 
update the corresponding accumulator.
   
   Addresses #12546 
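   The layout change can be sketched as (hypothetical `MultiAggregator`; names and signatures are not from the PR):

```java
// Hypothetical sketch of the change described above: instead of one
// float[] per faceting call, keep float[aggregation][ordinal] and update
// every aggregation while the match-set is iterated once.
class MultiAggregator {
    final float[][] values; // indexed by [aggregationIndex][ordinal]

    MultiAggregator(int numAggregations, int numOrdinals) {
        values = new float[numAggregations][numOrdinals];
    }

    // One pass over a doc's ordinals updates all aggregations at once.
    void collect(int ord, float[] perAggregationValues) {
        for (int agg = 0; agg < values.length; agg++) {
            values[agg][ord] += perAggregationValues[agg];
        }
    }

    float value(int aggregation, int ord) {
        return values[aggregation][ord];
    }
}
```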
   





[GitHub] [lucene] mikemccand merged pull request #12545: Fix minor (excess reallocation) performance bug when building FSTs

2023-09-09 Thread via GitHub


mikemccand merged PR #12545:
URL: https://github.com/apache/lucene/pull/12545





[GitHub] [lucene] mikemccand commented on pull request #12545: Fix minor (excess reallocation) performance bug when building FSTs

2023-09-09 Thread via GitHub


mikemccand commented on PR #12545:
URL: https://github.com/apache/lucene/pull/12545#issuecomment-1712668955

   I backported to 9.x as well: 
https://github.com/apache/lucene/commit/d70c91134726ff5768c0bcdc7bce51f3fbfcac56





[GitHub] [lucene] jpountz commented on pull request #12489: Add support for recursive graph bisection.

2023-09-10 Thread via GitHub


jpountz commented on PR #12489:
URL: https://github.com/apache/lucene/pull/12489#issuecomment-1712779097

   Wikibigall. Less space is spent on doc values this time since I did not enable 
indexing of facets. There is a more significant size reduction of postings this 
time (-10.5%). This is broadly consistent with the reproducibility paper, which 
observed size reductions of 18% with partitioned Elias-Fano and 5% with SVByte 
on the Wikipedia dataset. I would expect PFor to be somewhere in between, as it 
is better able to take advantage of small gaps between docs than SVByte, but 
less so than partitioned Elias-Fano.
   
   | File | before (MB) | after (MB) |
   | - | - | - |
   | terms (tim) | 767 | 766 |
   | postings (doc) | 2779 | 2489 |
   | positions (pos) | 11356 | 10569 |
   | points (kdd) | 100 | 99 |
   | doc values (dvd) | 456 | 461 |
   | stored fields (fdt) | 249 | 257 |
   | norms (nvd) | 13 | 13 |
   | total | 15734 | 14669 |
   
   Benchmarks still show slowdowns on phrase queries and speedups on 
conjunctions, though it's less spectacular than on wikimedium10m.
   
   ```
   Task                     QPS baseline  StdDev  QPS my_modified_version  StdDev              Pct diff  p-value
   MedTerm                        652.41  (7.5%)                   493.97  (2.6%)  -24.3% ( -31% - -15%)    0.000
   HighPhrase                      30.86  (3.5%)                    23.85  (2.6%)  -22.7% ( -27% - -17%)    0.000
   LowPhrase                       51.09  (3.1%)                    42.38  (2.2%)  -17.1% ( -21% - -12%)    0.000
   LowTerm                       1057.76  (5.4%)                   881.22  (2.5%)  -16.7% ( -23% -  -9%)    0.000
   MedPhrase                       82.18  (3.0%)                    71.88  (1.7%)  -12.5% ( -16% -  -8%)    0.000
   HighTermMonthSort             6482.52  (4.5%)                  5739.50  (3.5%)  -11.5% ( -18% -  -3%)    0.000
   PKLookup                       293.95  (3.2%)                   276.15  (3.7%)   -6.1% ( -12% -   0%)    0.000
   MedSloppyPhrase                  8.68  (2.7%)                     8.20  (2.9%)   -5.5% ( -10% -   0%)    0.000
   OrHighLow                      578.06  (4.4%)                   550.49  (4.0%)   -4.8% ( -12% -   3%)    0.016
   HighSloppyPhrase                 7.43  (2.2%)                     7.10  (4.0%)   -4.4% ( -10% -   1%)    0.003
   Fuzzy1                         244.70  (2.9%)                   238.49  (3.3%)   -2.5% (  -8% -   3%)    0.080
   OrHighHigh                      39.76  (9.5%)                    39.21  (6.1%)   -1.4% ( -15% -  15%)    0.717
   HighTerm                       370.57  (8.5%)                   367.09  (4.4%)   -0.9% ( -12% -  13%)    0.768
   LowSloppyPhrase                 13.68  (2.3%)                    13.71  (3.3%)    0.2% (  -5% -   5%)    0.868
   Respell                        204.23  (1.8%)                   204.98  (2.0%)    0.4% (  -3% -   4%)    0.679
   Prefix3                        225.23  (5.1%)                   226.74  (5.5%)    0.7% (  -9% -  11%)    0.786
   Wildcard                       170.34  (4.0%)                   171.63  (3.4%)    0.8% (  -6% -   8%)    0.665
   IntNRQ                          92.30 (11.9%)                    95.15 (10.2%)    3.1% ( -17% -  28%)    0.555
   MedSpanNear                      5.79  (6.8%)                     5.99  (9.3%)    3.4% ( -11% -  20%)    0.378
   OrHighMed                      104.41  (7.3%)                   107.99  (5.3%)    3.4% (  -8% -  17%)    0.253
   HighSpanNear                     2.47  (4.2%)                     2.56  (4.1%)    3.7% (  -4% -  12%)    0.059
   Fuzzy2                         139.96  (2.8%)                   146.77  (2.6%)    4.9% (   0% -  10%)    0.000
   LowSpanNear                     42.96  (3.6%)                    45.21  (2.5%)    5.2% (   0% -  11%)    0.000
   AndHighHigh                     33.24  (6.2%)                    36.20  (4.3%)    8.9% (  -1% -  20%)    0.000
   AndHighMed                     131.84  (5.2%)                   144.31  (3.2%)    9.5% (   0% -  18%)    0.000
   HighTermDayOfYearSort          186.67  (2.9%)                   208.78  (3.2%)   11.8% (   5% -  18%)    0.000
   AndHighLow                     590.69  (3.2%)                   677.22  (2.2%)   14.6% (   9% -  20%)    0.000
   ```





[GitHub] [lucene] jpountz commented on pull request #12489: Add support for recursive graph bisection.

2023-09-10 Thread via GitHub


jpountz commented on PR #12489:
URL: https://github.com/apache/lucene/pull/12489#issuecomment-1712923358

   > I wonder why stored fields index size wasn't really hurt nearly as much 
for wikibigall but was for wikimediumall?
   
   This is because wikimedium uses chunks of articles as documents, and every 
chunk has the title of the Wikipedia article, so there are often ten or more 
adjacent docs that have the same title. This is a best case for stored fields 
compression, as only the first title is actually stored and other occurrences of 
the same title are replaced with a reference to the first occurrence. With 
reordering, these duplicate titles are no longer in the same block, so it goes 
back to just deduplicating bits of title strings, instead of entire titles. 
wikibig doesn't have this best case scenario for stored fields compression. 
Ordering only helps a bit because articles are in title order, so there are 
more duplicate strings in a block of stored fields (shared prefixes) compared 
to the reordered index.
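   The stored-fields effect described above is easy to reproduce with any LZ-style compressor. The sketch below uses java.util.zip's DEFLATE as a stand-in for Lucene's stored-fields compression; the class and method names are hypothetical:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

class StoredFieldsDedupDemo {
    // Compress with DEFLATE and return the compressed size in bytes.
    static int compressedSize(String s) {
        byte[] input = s.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] out = new byte[input.length + 64];
        int n = 0;
        while (!deflater.finished()) {
            n += deflater.deflate(out, n, out.length - n);
        }
        deflater.end();
        return n;
    }

    // wikimedium-style block: ten chunks of one article share a title, so the
    // repeats become cheap back-references to the first occurrence.
    static String sameTitleBlock() {
        return "Abraham Lincoln | chunk body ".repeat(10);
    }

    // After reordering, a block mixes ten different titles, each of which must
    // be stored (mostly) literally.
    static String distinctTitleBlock() {
        String[] titles = {
            "Anarchism", "Autism", "Albedo", "Achilles", "Abraham Lincoln",
            "Aristotle", "Academy Award", "Algeria", "Anthropology", "Alchemy"
        };
        StringBuilder sb = new StringBuilder();
        for (String title : titles) {
            sb.append(title).append(" | chunk body ");
        }
        return sb.toString();
    }
}
```

   A block of duplicate titles compresses to noticeably fewer bytes than a block of distinct ones, which is the gap between wikimedium and wikibig described above.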





[GitHub] [lucene] jpountz commented on pull request #12489: Add support for recursive graph bisection.

2023-09-10 Thread via GitHub


jpountz commented on PR #12489:
URL: https://github.com/apache/lucene/pull/12489#issuecomment-1712928445

   Regarding positions, the reproducibility paper noted that the algorithm 
helped term frequencies a bit, though not as much as docs. It doesn't say 
anything about positions, though I suspect that if it tends to group together 
docs that have the same freq for the same term, then gaps in positions also 
tend to be more regular.
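   To make the gap intuition concrete, here is a minimal, hypothetical delta-encoding sketch (not Lucene's actual postings codec): smaller gaps between clustered doc ids directly lower the bit width a FOR-style encoding needs.

```java
class DocIdDeltas {
    // Store gaps between consecutive sorted doc ids instead of absolute ids.
    static int[] deltas(int[] sortedDocIds) {
        int[] gaps = new int[sortedDocIds.length];
        int prev = 0;
        for (int i = 0; i < sortedDocIds.length; i++) {
            gaps[i] = sortedDocIds[i] - prev;
            prev = sortedDocIds[i];
        }
        return gaps;
    }

    // Bits per value if the whole block is packed at the width of the largest
    // gap: the simplification behind FOR-style encodings like PFor.
    static int bitsPerValue(int[] gaps) {
        int max = 1;
        for (int g : gaps) {
            max = Math.max(max, g);
        }
        return 32 - Integer.numberOfLeadingZeros(max);
    }
}
```

   Reordering so that a term's matching docs are clustered shrinks the gaps, and with them the bits per value.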





[GitHub] [lucene] shubhamvishu opened a new pull request, #12548: Add API to compute vector similarity in DoubleValuesSource

2023-09-10 Thread via GitHub


shubhamvishu opened a new pull request, #12548:
URL: https://github.com/apache/lucene/pull/12548

   ### Description
   
   This PR addresses the issue #12394. It adds an API 
**`similarityToQueryVector`** to `DoubleValuesSource` to compute vector 
similarity scores between the query vector and the 
`KnnByteVectorField`/`KnnFloatVectorField` for documents using the 2 new DVS 
implementations (`ByteVectorSimilarityValuesSource` for byte vectors and 
`FloatVectorSimilarityValuesSource` for float vectors). Below are the method 
signatures added to DVS in this PR:
   
   - `DoubleValues similarityToQueryVector(LeafReaderContext ctx, float[] 
queryVector, String vectorField)` (uses `FloatVectorSimilarityValuesSource`)
   - `DoubleValues similarityToQueryVector(LeafReaderContext ctx, byte[] 
queryVector, String vectorField)` (uses `ByteVectorSimilarityValuesSource`)
   
   Closes #12394
   
   
   





[GitHub] [lucene] jpountz opened a new pull request, #12549: Run merge-on-full-flush even though no changes got flushed.

2023-09-11 Thread via GitHub


jpountz opened a new pull request, #12549:
URL: https://github.com/apache/lucene/pull/12549

   Currently, merge-on-full-flush only checks whether merges need to run if 
changes have been flushed to disk. This prevents us from having different 
merging logic for refreshes and commits, since the merge policy would not be 
checked upon commit if no new documents got indexed since the previous refresh.
   
   





[GitHub] [lucene] mikemccand commented on a diff in pull request #12337: Index arbitrary fields in taxonomy docs

2023-09-11 Thread via GitHub


mikemccand commented on code in PR #12337:
URL: https://github.com/apache/lucene/pull/12337#discussion_r1321799297


##
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyIndexReader.java:
##
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.taxonomy.directory;
+
+import java.io.IOException;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.store.Directory;
+
+/**
+ * This is like a {@link DirectoryTaxonomyReader}, except it provides access 
to the underlying
+ * {@link DirectoryReader} and full path field name.
+ */
+public class DirectoryTaxonomyIndexReader extends DirectoryTaxonomyReader {

Review Comment:
   Can we open a separate PR to address this issue?  It seems like a crab (a 
spinoff issue discovered while creating this PR that's fundamentally not 
otherwise related to this PR)?






[GitHub] [lucene] mikemccand commented on a diff in pull request #12337: Index arbitrary fields in taxonomy docs

2023-09-11 Thread via GitHub


mikemccand commented on code in PR #12337:
URL: https://github.com/apache/lucene/pull/12337#discussion_r1321802426


##
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/ReindexingEnrichedDirectoryTaxonomyWriter.java:
##
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.taxonomy.directory;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.function.BiConsumer;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.facet.taxonomy.FacetLabel;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.LeafReader;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.IOUtils;
+
+/**
+ * Use this {@link org.apache.lucene.facet.taxonomy.TaxonomyWriter} to append 
arbitrary fields to
+ * the ordinal documents in the taxonomy. To update the custom data added to 
the docs, it is
+ * required to {@link #reindexWithNewOrdinalData(BiConsumer)}.
+ */
+public class ReindexingEnrichedDirectoryTaxonomyWriter extends 
DirectoryTaxonomyWriter {

Review Comment:
   Note that Lucene has updatable doc values, so in theory one could update/add 
a new doc values field into the taxonomy index without shifting ordinals.  This 
might be a nice path to update fields associated with ordinals?



##
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java:
##
@@ -877,6 +886,26 @@ public synchronized void replaceTaxonomy(Directory 
taxoDir) throws IOException {
 ++indexEpoch;
   }
 
+  /** Delete all taxonomy data and start over. */
+  public synchronized void reset() throws IOException {

Review Comment:
   Maybe rename to `deleteAll` (matching `IndexWriter.deleteAll`).
   
   And could you add some javadoc warnings / when this might be used?  If you 
fully delete your taxo index, you must also reindex any document index 
referencing these same facets?  Or, you must carefully regenerate the ordinals 
in the right order?
   
   And perhaps it should be package private, if the intention is to only use it 
via `ReindexingEnrichedDirectoryTaxonomyWriter`?






[GitHub] [lucene] mikemccand commented on pull request #12337: Index arbitrary fields in taxonomy docs

2023-09-11 Thread via GitHub


mikemccand commented on PR #12337:
URL: https://github.com/apache/lucene/pull/12337#issuecomment-1714232934

   > But as I think about this feature and how do I see it mature over time, I 
DO think the payload should be given when ingesting the documents
   
   Hmm -- I don't think that's great because we are forcing denormalization 
onto the user?  This is fundamentally nicely normalized content (values per 
`FacetLabel` not `Document`), so we really should enable indexing it in a 
normalized manner.  If the user really wants to (inefficiently) denormalize 
they can use `AssociationFacets` already?
   
   > On the other hand, once a facet was added, e.g. "Author/Bob", its value 
can't change (yet)
   
   Actually, I remember someone 😉 adding this nice feature long ago to Lucene 
to be able to update doc values in-place -- it seems like that could be a great 
mechanism for updating these denormalized `FacetLabel` values.  But we should 
do that as a followon issue -- let's leave this PR to focus on getting these 
initial values into the taxo index.
   
   We could also (later, separate PR) consider more radical changes to how the 
taxo index is stored to make it more efficient to update `FacetLabel` values.





[GitHub] [lucene] mikemccand commented on issue #12190: Add "Expression" Facets Implementation

2023-09-11 Thread via GitHub


mikemccand commented on issue #12190:
URL: https://github.com/apache/lucene/issues/12190#issuecomment-1714240627

   I like this idea -- it's an "aggregation level expression", which computes 
an expression in "aggregation space", instead of the existing (already 
supported) document level expression.





[GitHub] [lucene] onyxmaster commented on issue #4549: ShingleFilter should handle positionIncrement of zero, e.g. synonyms [LUCENE-3475]

2023-09-11 Thread via GitHub


onyxmaster commented on issue #4549:
URL: https://github.com/apache/lucene/issues/4549#issuecomment-1714290760

   Hi. Got bitten by this today after a lemmatizer filter produced two variants 
of a base word at the same position, and ShingleFilter produced a "shingle" from 
these variants while failing to produce a variant of the first shingle:
   `" " -> " " -> [" ", 
" "]`, where the correct result would be `[" 
", " "]`.





[GitHub] [lucene] jpountz commented on pull request #12490: Reduce the overhead of ImpactsDISI.

2023-09-11 Thread via GitHub


jpountz commented on PR #12490:
URL: https://github.com/apache/lucene/pull/12490#issuecomment-1714465729

   I plan on merging soon if there are no objections.





[GitHub] [lucene] jpountz commented on pull request #12526: Speed up disjunctions by computing estimations of the score of the k-th top hit up-front.

2023-09-11 Thread via GitHub


jpountz commented on PR #12526:
URL: https://github.com/apache/lucene/pull/12526#issuecomment-1714471318

   We could. These tasks are a bit adversarial, as the doc freq is slightly 
greater than the value of `k=100`, so it takes lots of collected matches to find 
k documents that have this term. I suspect that another interesting value for 
the document frequency is one a bit less than `k`.
   
   I still need to figure out a way to avoid referencing readers in the Weight; 
I think we had issues with that in the past, though I can't remember exactly 
what the issue was.





[GitHub] [lucene] gokaai commented on a diff in pull request #12530: Fix CheckIndex to detect major corruption with old (not the latest) commit point

2023-09-11 Thread via GitHub


gokaai commented on code in PR #12530:
URL: https://github.com/apache/lucene/pull/12530#discussion_r1322006478


##
lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:
##
@@ -610,6 +610,39 @@ public Status checkIndex(List onlySegments, 
ExecutorService executorServ
   return result;
 }
 
+// https://github.com/apache/lucene/issues/7820: also attempt to open any 
older commit
+// points (segments_N), which will catch certain corruption like missing 
_N.si files
+// for segments not also referenced by the newest commit point (which was 
already
+// loaded, successfully, above).  Note that we do not do a deeper check of 
segments
+// referenced ONLY by these older commit points, because such corruption 
would not
+// prevent a new IndexWriter from opening on the newest commit point.  but 
it is still
+// corruption, e.g. a reader opened on those old commit points can hit 
corruption
+// exceptions which we (still) will not detect here.  progress not 
perfection!
+
+for (String fileName : files) {
+  if (fileName.startsWith(IndexFileNames.SEGMENTS)
+  && fileName.equals(lastSegmentsFile) == false) {

Review Comment:
   We can avoid having a [separate block specifically for validating the 
`lastSegmentsFile`](https://github.com/apache/lucene/blob/f96b84321806a3f6aac79dd17fd656636250fb78/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java#L597-L611)
 by using this conditional to determine 
   1. [The error 
message](https://github.com/apache/lucene/blob/f96b84321806a3f6aac79dd17fd656636250fb78/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java#L634-L638)
 (i.e., whether it's the latest segment or an older one which could not be 
read) and
   2. [Whether or not the commit read should be set as the value for `sis`, 
which will be checked 
deeper](https://github.com/apache/lucene/blob/f96b84321806a3f6aac79dd17fd656636250fb78/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java#L600)






[GitHub] [lucene] jainankitk commented on issue #12527: Optimize readInts24 performance for DocIdsWriter

2023-09-11 Thread via GitHub


jainankitk commented on issue #12527:
URL: https://github.com/apache/lucene/issues/12527#issuecomment-1714517103

   > Maybe next we should try 4 readLong() for readInts32? Though I wonder how 
often in this benchy are we really needing 32 bits to encode the docid deltas 
in a BKD leaf block?
   
   Why 4? We can just do a single long read for 2 ints. Although, I also doubt that we 
really need 32 bits to encode the docid deltas in a BKD leaf block in the 
benchmark.
   
   The following patch should achieve this:
   
   ```
   diff --git 
a/lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java 
b/lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java
   index 40db4c0069d..830e4ba43c9 100644
   --- a/lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java
   +++ b/lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java
   @@ -345,9 +345,14 @@ final class DocIdsWriter {
  }

  private void readInts32(IndexInput in, int count, IntersectVisitor 
visitor) throws IOException {
   -in.readInts(scratch, 0, count);
   -for (int i = 0; i < count; i++) {
   -  visitor.visit(scratch[i]);
   +int i;
   +for (i = 0; i < count - 1; i += 2) {
   +  long l = in.readLong();
   +  visitor.visit((int) (l >>> 32));
   +  visitor.visit((int) l);
   +}
   +for (; i < count; ++i) {
   +  visitor.visit(in.readInt());
}
  }
}
   ```
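   The bit manipulation at the core of the patch, pulling two 32-bit ids out of one readLong(), can be sanity-checked in isolation. Whether the high or low half should be visited first depends on the byte order the ints were written with, which the real patch would need to match; the helper below is hypothetical, not Lucene code, and only shows that the round-trip is lossless:

```java
class TwoIntsFromLong {
    // Pack two ints into one long, first int in the high 32 bits.
    static long pack(int first, int second) {
        return ((long) first << 32) | (second & 0xFFFFFFFFL);
    }

    // Unpack in visit order, mirroring the patch:
    // visitor.visit((int) (l >>> 32)); visitor.visit((int) l);
    static int[] unpack(long l) {
        return new int[] {(int) (l >>> 32), (int) l};
    }
}
```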





[GitHub] [lucene] Tony-X closed pull request #12541: Document why we need `lastPosBlockOffset`

2023-09-11 Thread via GitHub


Tony-X closed pull request #12541: Document why we need `lastPosBlockOffset`
URL: https://github.com/apache/lucene/pull/12541





[GitHub] [lucene] zhaih commented on issue #11537: StackOverflow when RegExp encounters a very large string [LUCENE-10501]

2023-09-11 Thread via GitHub


zhaih commented on issue #11537:
URL: https://github.com/apache/lucene/issues/11537#issuecomment-1715016712

   I checked the CHANGES list since the last release and it seems we have a good
   amount of commits already; let me start a thread about releasing the next version.
   
   On Wed, Sep 6, 2023 at 1:08 PM Lily ***@***.***> wrote:
   
   > Given that the version would include a fix that is tracked as a security
   > vulnerability, can there be some commitment to a timeline? Or possibly
   > cherry-pick to a minor release 9.7.1?
   >
   >
   >
   > 
   > From: Patrick Zhai ***@***.***>
   > Sent: Wednesday, September 6, 2023, 9:05 PM
   > To: apache/lucene ***@***.***>
   > Cc: Lily Warner ***@***.***>; Comment ***@***.***>
   > Subject: Re: [apache/lucene] StackOverflow when RegExp encounters a very
   > large string [LUCENE-10501] (Issue #11537)
   >
   >
   > [External Email]
   >
   > 
   > There's no fixed schedule, we usually release after some amount of changes
   > are checked in. IIRC the last release was in July
   >
   > On Wed, Sep 6, 2023, 13:02 thomas-warner ***@***.***> wrote:
   >
   > > @zhaih > when is
   > lucene 9.8 expected to be
   > > released?
   





[GitHub] [lucene] jpountz commented on pull request #12460: Allow reading binary doc values as a DataInput

2023-09-12 Thread via GitHub


jpountz commented on PR #12460:
URL: https://github.com/apache/lucene/pull/12460#issuecomment-1715126194

   The more I think of this change, the more I like it: most of the time, you 
would need to read data out of binary doc values, e.g. (variable-length) 
integers, strings, etc., and exposing binary doc values as a data input not only 
makes this easier to do but also more efficient, by saving one memory copy.
   
   Thoughts:
- We are losing the symmetry between the index-time API 
(BinaryDocValuesField), which takes a byte[] while the search-time API produces 
a `DataInput`. Not a deal breaker, but users have to be careful about 
endianness.
- Today it's possible to consume values multiple times, but the data input 
is stateful, so if one consumer reads data from it, and then another consumer 
tries to read data again it will fail unless it resets the input via a call to 
`advanceExact`. Again not a deal breaker but something that could be a surprise 
to some callers.
   
   In terms of backward compatibility, I'm contemplating not introducing a new 
`DataInputDocValues` class, and instead having a `dataInput()` method on 
`BinaryDocValues`, as well as a default implementation of `binaryValue()` that 
reads all bytes from the data input. What do you think? I'm also curious about 
the opinion of our sophisticated backward compatibility policeman @uschindler.
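   The backward-compat shape being proposed can be sketched with stdlib streams standing in for Lucene's DataInput. All names here are hypothetical; the point is only that the old byte[] accessor can survive as a default method built on the new streaming one:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch: the streaming accessor is the primary API, and
// binaryValue() is a default method that materializes all bytes from it.
interface BinaryValueSource {
    DataInputStream dataInput();  // new, stateful streaming accessor
    int length();                 // byte length of the current value

    // Old API, kept working for existing callers.
    default byte[] binaryValue() {
        try {
            byte[] buf = new byte[length()];
            dataInput().readFully(buf);
            return buf;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class InMemoryBinaryValue implements BinaryValueSource {
    private final byte[] bytes;

    InMemoryBinaryValue(byte[] bytes) {
        this.bytes = bytes;
    }

    @Override
    public DataInputStream dataInput() {
        // A fresh stream per call; a real impl would reset on advanceExact().
        return new DataInputStream(new ByteArrayInputStream(bytes));
    }

    @Override
    public int length() {
        return bytes.length;
    }
}
```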
   
   To @stefanvodita 's point, it would be nice to migrate at least one or two 
call sites that would get simpler with a `DataInput` to further validate this 
change.





[GitHub] [lucene] jpountz commented on a diff in pull request #12549: Run merge-on-full-flush even though no changes got flushed.

2023-09-12 Thread via GitHub


jpountz commented on code in PR #12549:
URL: https://github.com/apache/lucene/pull/12549#discussion_r1322592471


##
lucene/core/src/test/org/apache/lucene/index/TestIndexWriter.java:
##
@@ -518,11 +518,10 @@ public void testFlushWithNoMerging() throws IOException {
 doc.add(newField("field", "aaa", customType));
 for (int i = 0; i < 19; i++) writer.addDocument(doc);
 writer.flush(false, true);
-writer.close();
-SegmentInfos sis = SegmentInfos.readLatestCommit(dir);
 // Since we flushed w/o allowing merging we should now
 // have 10 segments
-assertEquals(10, sis.size());
+assertEquals(10, writer.getSegmentCount());
+writer.close();

Review Comment:
   Note to reviewers: this test started failing because `writer.close()` now 
triggers merges because of the change. So I changed the test to check the 
number of segments after the `flush` call but before the `close` call.






[GitHub] [lucene] jpountz commented on a diff in pull request #12549: Run merge-on-full-flush even though no changes got flushed.

2023-09-12 Thread via GitHub


jpountz commented on code in PR #12549:
URL: https://github.com/apache/lucene/pull/12549#discussion_r1322599113


##
lucene/core/src/test/org/apache/lucene/index/TestIndexWriterDelete.java:
##
@@ -1315,7 +1315,8 @@ public void testTryDeleteDocument() throws Exception {
 w.addDocument(doc);
 w.close();
 
-iwc = new IndexWriterConfig(new MockAnalyzer(random()));
+iwc = new IndexWriterConfig(new MockAnalyzer(random()))
+.setMergePolicy(NoMergePolicy.INSTANCE);

Review Comment:
   Note to reviewers: this test now triggers a merge because of deletes. Since 
the test is only about checking how deletes get applied, I simply disabled 
merging.






[GitHub] [lucene] iverase commented on pull request #12460: Allow reading binary doc values as a DataInput

2023-09-12 Thread via GitHub


iverase commented on PR #12460:
URL: https://github.com/apache/lucene/pull/12460#issuecomment-1715224914

   > I'm contemplating not introducing a new DataInputDocValues class, and 
instead have a dataInput() method on BinaryDocValues
   
   I think this approach defeats one of the main purposes of this change, which is to avoid allocating a byte array when reading doc values. I don't think we want `BinaryDocValues` to do that lazily:
   
   ```
   when one of the doc values is big, in the order of a few megabytes, it can cause issues
   with small heaps (or even big heaps if big enough). This is due to the allocation of a big
   byte array upfront, which can be considered a humongous allocation by the G1 garbage
   collector and can cause heap issues under high load.
   ```
   
   In my own use case, getting a DataInput is not enough, as I need random access via get/set position, in a similar fashion to what I am doing now via ByteArrayDataInput.





[GitHub] [lucene] jpountz commented on pull request #12460: Allow reading binary doc values as a DataInput

2023-09-12 Thread via GitHub


jpountz commented on PR #12460:
URL: https://github.com/apache/lucene/pull/12460#issuecomment-1715238722

   > I think this approach defeats one of the main purposes of this change, which is to avoid allocating a byte array when reading doc values. I don't think we want BinaryDocValues to do that lazily
   
   What is the problem with allocating lazily? It wouldn't make sense with the current API, where `binaryValue()` is the only way to retrieve the data, but if that method were to remain only for backward compatibility, it would make sense to incur the byte[] overhead only when the legacy `binaryValue()` API is used?
   
   > On my own use case, getting a DataInput is not enough as I need random 
access via get/set position, in a similar fashion to what I am doing now via 
ByteArrayDataInput.
   
   This has been a challenge so many times in the past; maybe it's time to add `seek()` support to `DataInput`?





[GitHub] [lucene] stefanvodita opened a new pull request, #12550: [Demo] Per label association facet fields

2023-09-12 Thread via GitHub


stefanvodita opened a new pull request, #12550:
URL: https://github.com/apache/lucene/pull/12550

   ### Description
   
   A user could have data about facet labels. In the demo here, we record an 
author's popularity score, with authors being facet labels in an index of books.
   
   Today, users can model this by using an `AssociationFacetField` and 
assigning a value to the label, but this value is not constant across 
documents. This PR introduces `PerLabelAssociationFacetField`, which does have 
this property. Once an association for a facet label has been made, it is final.
   
   The problem that remains is that we end up storing the same value in the 
index every time a facet label is mentioned. This is why I prefer the idea of 
[storing data about facet labels in the 
taxonomy](https://github.com/apache/lucene/issues/12336). It ends up using less 
space and it makes more sense to me conceptually, but I thought I'd code both 
to better illustrate the difference.
   
   Related to #12336





[GitHub] [lucene] stefanvodita commented on pull request #12550: [Demo] Per label association facet fields

2023-09-12 Thread via GitHub


stefanvodita commented on PR #12550:
URL: https://github.com/apache/lucene/pull/12550#issuecomment-1715245714

   Cancelling right away, this is not meant to be merged.





[GitHub] [lucene] stefanvodita closed pull request #12550: [Demo] Per label association facet fields

2023-09-12 Thread via GitHub


stefanvodita closed pull request #12550: [Demo] Per label association facet 
fields
URL: https://github.com/apache/lucene/pull/12550





[GitHub] [lucene] jpountz commented on pull request #12490: Reduce the overhead of ImpactsDISI.

2023-09-12 Thread via GitHub


jpountz commented on PR #12490:
URL: https://github.com/apache/lucene/pull/12490#issuecomment-1715453502

   Another benchmark run on the last commit to make sure it still works as 
expected, and wikibigall this time instead of wikimedium10m:
   
   ```
   Task                   QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff              p-value
   HighTermMonthSort           4813.12  (2.6%)                   4760.12  (2.7%)     -1.1% ( -6% -  4%)    0.193
   Prefix3                      286.82  (5.0%)                    285.78  (4.4%)     -0.4% ( -9% -  9%)    0.809
   HighSpanNear                   8.01  (2.1%)                      7.99  (2.8%)     -0.3% ( -5% -  4%)    0.692
   MedSpanNear                    7.14  (1.5%)                      7.14  (2.4%)     -0.1% ( -3% -  3%)    0.854
   IntNRQ                       109.71 (15.0%)                    109.82 (15.0%)      0.1% (-25% - 35%)    0.983
   HighSloppyPhrase               3.52  (5.2%)                      3.53  (3.8%)      0.1% ( -8% -  9%)    0.942
   LowSpanNear                   43.43  (1.4%)                     43.49  (2.1%)      0.1% ( -3% -  3%)    0.808
   PKLookup                     224.68  (1.9%)                    225.84  (2.9%)      0.5% ( -4% -  5%)    0.500
   LowSloppyPhrase               17.55  (3.6%)                     17.65  (3.0%)      0.6% ( -5% -  7%)    0.556
   HighTermDayOfYearSort        244.66  (1.7%)                    246.26  (2.0%)      0.7% ( -2% -  4%)    0.264
   Respell                       75.37  (1.5%)                     75.92  (2.0%)      0.7% ( -2% -  4%)    0.195
   MedSloppyPhrase               13.14  (3.9%)                     13.24  (2.9%)      0.8% ( -5% -  7%)    0.482
   Wildcard                      65.52  (3.5%)                     66.43  (3.5%)      1.4% ( -5% -  8%)    0.213
   Fuzzy1                       139.90  (1.6%)                    145.17  (1.5%)      3.8% (  0% -  6%)    0.000
   Fuzzy2                       104.47  (1.6%)                    108.44  (1.5%)      3.8% (  0% -  6%)    0.000
   LowPhrase                     45.82  (3.3%)                     48.13  (3.4%)      5.0% ( -1% - 12%)    0.000
   MedPhrase                     60.75  (3.7%)                     64.16  (3.6%)      5.6% ( -1% - 13%)    0.000
   HighPhrase                    49.20  (3.9%)                     52.11  (3.9%)      5.9% ( -1% - 14%)    0.000
   AndHighLow                   682.56  (3.5%)                    727.45  (3.2%)      6.6% (  0% - 13%)    0.000
   HighTerm                     440.00  (9.1%)                    490.39  (7.1%)     11.5% ( -4% - 30%)    0.000
   LowTerm                      917.37  (8.4%)                   1033.11  (5.4%)     12.6% ( -1% - 28%)    0.000
   MedTerm                      459.23  (9.4%)                    519.72  (6.6%)     13.2% ( -2% - 32%)    0.000
   OrHighLow                    605.42  (5.7%)                    694.24  (3.7%)     14.7% (  5% - 25%)    0.000
   AndHighHigh                   36.06  (5.7%)                     44.64  (6.8%)     23.8% ( 10% - 38%)    0.000
   AndHighMed                    86.74  (5.1%)                    107.82  (4.5%)     24.3% ( 13% - 35%)    0.000
   OrHighMed                    190.57  (6.5%)                    251.30  (5.5%)     31.9% ( 18% - 46%)    0.000
   OrHighHigh                    39.38  (9.4%)                     55.12  (8.7%)     40.0% ( 19% - 64%)    0.000
   ```





[GitHub] [lucene] jimczi commented on pull request #12529: Introduce a random vector scorer in HNSW builder/searcher

2023-09-12 Thread via GitHub


jimczi commented on PR #12529:
URL: https://github.com/apache/lucene/pull/12529#issuecomment-1715484871

   Given that no further concerns have been raised, I am intending to merge 
this change soon.





[GitHub] [lucene] stefanvodita commented on a diff in pull request #12337: Index arbitrary fields in taxonomy docs

2023-09-12 Thread via GitHub


stefanvodita commented on code in PR #12337:
URL: https://github.com/apache/lucene/pull/12337#discussion_r1322872602


##
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyIndexReader.java:
##
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.taxonomy.directory;
+
+import java.io.IOException;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.store.Directory;
+
+/**
+ * This is like a {@link DirectoryTaxonomyReader}, except it provides access 
to the underlying
+ * {@link DirectoryReader} and full path field name.
+ */
+public class DirectoryTaxonomyIndexReader extends DirectoryTaxonomyReader {

Review Comment:
   This was more complicated in the first draft, but right now the only change 
is to make `getInternalIndexReader` public instead of protected. I think we 
have to have that in this PR, otherwise we’re offering a way to put data in the 
taxonomy, but no way to get it back out.



##
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyIndexReader.java:
##
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.taxonomy.directory;
+
+import java.io.IOException;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.store.Directory;
+
+/**
+ * This is like a {@link DirectoryTaxonomyReader}, except it provides access 
to the underlying
+ * {@link DirectoryReader} and full path field name.
+ */
+public class DirectoryTaxonomyIndexReader extends DirectoryTaxonomyReader {

Review Comment:
   This was more complicated in the first draft, but right now the only change 
is to make `getInternalIndexReader` public instead of protected. I think we 
have to have that in this PR, otherwise we’re offering a way to put data in the 
taxonomy, but no way to get it back out.



##
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java:
##
@@ -877,6 +886,26 @@ public synchronized void replaceTaxonomy(Directory 
taxoDir) throws IOException {
 ++indexEpoch;
   }
 
+  /** Delete all taxonomy data and start over. */
+  public synchronized void reset() throws IOException {

Review Comment:
   Good suggestions! I’ve done this in the new commit.



##
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java:
##
@@ -877,6 +886,26 @@ public synchronized void replaceTaxonomy(Directory 
taxoDir) throws IOException {
 ++indexEpoch;
   }
 
+  /** Delete all taxonomy data and start over. */
+  public synchronized void reset() throws IOException {

Review Comment:
   Good suggestions! I’ve done this in the new commit.






[GitHub] [lucene] stefanvodita commented on pull request #12337: Index arbitrary fields in taxonomy docs

2023-09-12 Thread via GitHub


stefanvodita commented on PR #12337:
URL: https://github.com/apache/lucene/pull/12337#issuecomment-1715512722

   Thank you for the review @mikemccand! I’ve integrated your feedback. 
Updatable doc values are definitely something to consider.
   For comparison, I coded up an [association facet field that is constant for a facet label across documents](https://github.com/apache/lucene/pull/12550). I think it helps highlight the advantages of the enriched taxonomy solution.





[GitHub] [lucene] uschindler commented on pull request #12460: Allow reading binary doc values as a DataInput

2023-09-12 Thread via GitHub


uschindler commented on PR #12460:
URL: https://github.com/apache/lucene/pull/12460#issuecomment-1715514900

   > This has been a challenge so many times in the past, maybe it's time to 
add `seek()` support to `DataInput`?
   
   We have full random access (positional reads) if you extend the interface `RandomAccessInput`.
   
   Adding seek support to `DataInput` is not a good idea, because it would rule out `DataInput`s based on `InputStream` (e.g. `InputStreamDataInput`, which is used in various places).
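As a rough illustration of why positional reads sidestep the statefulness problem, a JDK-only sketch (the nested `RandomAccess` interface only imitates the shape of Lucene's `RandomAccessInput`; all names are invented for the example):

```java
import java.nio.ByteBuffer;

public class PositionalReadDemo {

    // A tiny interface that mirrors only the *shape* of Lucene's
    // RandomAccessInput (every read takes an absolute position);
    // it is not the real interface.
    interface RandomAccess {
        int readInt(long pos);
    }

    static RandomAccess wrap(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        // Absolute getInt(index) never touches the buffer's position, so
        // readers share no mutable state and need no reset between consumers.
        return pos -> buf.getInt((int) pos);
    }

    public static void main(String[] args) {
        RandomAccess in = wrap(new byte[] {0, 0, 0, 42, 0, 0, 0, 7});
        // Reads can happen in any order and be repeated freely.
        System.out.println(in.readInt(4) + " " + in.readInt(0) + " " + in.readInt(0));
    }
}
```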





[GitHub] [lucene] jpountz commented on a diff in pull request #12529: Introduce a random vector scorer in HNSW builder/searcher

2023-09-12 Thread via GitHub


jpountz commented on code in PR #12529:
URL: https://github.com/apache/lucene/pull/12529#discussion_r1322897603


##
lucene/core/src/java/org/apache/lucene/util/hnsw/RandomVectorScorerProvider.java:
##
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.hnsw;
+
+import java.io.IOException;
+import org.apache.lucene.index.VectorSimilarityFunction;
+
+/** A provider that creates {@link RandomVectorScorer} from an ordinal. */
+public interface RandomVectorScorerProvider {
+  /**
+   * This creates a {@link RandomVectorScorer} for scoring random nodes in 
batches against the given
+   * ordinal.
+   *
+   * @param ord the ordinal of the node to compare
+   * @return a new {@link RandomVectorScorer}
+   */
+  RandomVectorScorer scorer(int ord) throws IOException;
+
+  /**
+   * Creates a {@link RandomVectorScorerProvider} to compare float vectors.
+   *
+   * WARNING: The {@link RandomAccessVectorValues} given can contain 
stateful buffers. Avoid
+   * using it after calling this function. If you plan to use it again outside 
the returned {@link
+   * RandomVectorScorer}, think about passing a copied version ({@link
+   * RandomAccessVectorValues#copy}).
+   *
+   * @param vectors the underlying storage for vectors
+   * @param similarityFunction the similarity function to score vectors
+   */
+  static RandomVectorScorerProvider createFloats(
+  final RandomAccessVectorValues vectors,
+  final VectorSimilarityFunction similarityFunction)
+  throws IOException {
+final RandomAccessVectorValues vectorsCopy = vectors.copy();
+return queryOrd ->
+(RandomVectorScorer)
+cand ->
+similarityFunction.compare(
+vectors.vectorValue(queryOrd), 
vectorsCopy.vectorValue(cand));

Review Comment:
   OK, thanks for explaining. Maybe leave a small comment about this since it 
was not obvious to me from looking at the code?






[GitHub] [lucene] uschindler commented on pull request #12460: Allow reading binary doc values as a DataInput

2023-09-12 Thread via GitHub


uschindler commented on PR #12460:
URL: https://github.com/apache/lucene/pull/12460#issuecomment-1715550666

   To save more memory copies, the codec may use a slice of the underlying IndexInput directly to support both access APIs. All file-pointer checks would then be performed by the low-level JVM intrinsics used by MemorySegmentIndexInput's slices. If you use those views and don't let them escape the current method, the GC pressure isn't there (be careful: profilers show them as allocations, since escape analysis no longer works when the profiler prevents inlining).
   
   At some point we must anyway move away from reusing instances and move Lucene to the short-lived immutable instances used by modern JDK APIs. Escape analysis works fine if you have simple APIs.





[GitHub] [lucene] mikemccand merged pull request #12541: Document why we need `lastPosBlockOffset`

2023-09-12 Thread via GitHub


mikemccand merged PR #12541:
URL: https://github.com/apache/lucene/pull/12541





[GitHub] [lucene] mikemccand commented on pull request #12541: Document why we need `lastPosBlockOffset`

2023-09-12 Thread via GitHub


mikemccand commented on PR #12541:
URL: https://github.com/apache/lucene/pull/12541#issuecomment-1715559983

   I backported to 9.x as well ... annoying that GitHub doesn't state in the summary that the above push was to 9.x (it's only reflected here because it referenced this PR).  It does reflect that the first push was to `apache:main`.
   
   Is there a quick way via the GitHub UI to merge back to 9.x after all tests pass, for simple (non-conflict) merges?  I fall back to the command line, but maybe I'm "holding it wrong", thus losing the 9.x backport transparency here?





[GitHub] [lucene] jimczi merged pull request #12529: Introduce a random vector scorer in HNSW builder/searcher

2023-09-12 Thread via GitHub


jimczi merged PR #12529:
URL: https://github.com/apache/lucene/pull/12529





[GitHub] [lucene] jpountz merged pull request #12490: Reduce the overhead of ImpactsDISI.

2023-09-12 Thread via GitHub


jpountz merged PR #12490:
URL: https://github.com/apache/lucene/pull/12490





[GitHub] [lucene] jimczi opened a new pull request, #12551: Introduce dynamic segment efSearch to Knn{Byte|Float}VectorQuery

2023-09-12 Thread via GitHub


jimczi opened a new pull request, #12551:
URL: https://github.com/apache/lucene/pull/12551

   This PR introduces a new parameter known as 'efSearch' to the knn vector 
query. 'efSearch' governs the maximum size of the priority queue employed for 
nearest neighbor searches. As each segment may contain a varying number of 
elements, 'efSearch' is dynamically adjusted on a per-segment basis, taking 
into account the segment's size.
   
   This addition is valuable for improving the precision of a knn query while 
accommodating variations in segment sizes within a single search. For instance, 
if an index comprises 2 segments, with one holding 90% of the total documents 
and the other 10%, setting 'efSearch' to 100 and 'k' to 10 will result in a 
priority queue size of 90 for the first segment and 10 for the other.
   
   I have initiated this PR to solicit feedback on the heuristic used to 
determine the 'ef' size for each segment. Meanwhile, I will be conducting tests 
to assess the recall and performance differences between single segments and 
multiple segments using different 'k' and 'efSearch' values.
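The 90/10 example above can be sketched as simple arithmetic (the method name and the max(k, ·) clamp are assumptions for illustration, not the PR's actual heuristic):

```java
public class PerSegmentEf {

    // Hypothetical proportional split: scale efSearch by the segment's share of
    // the index, clamped below by k so every segment can still return k hits.
    // This mirrors the 90/10 example in the description, not the PR's code.
    static int perSegmentEf(int efSearch, int k, long segmentDocs, long totalDocs) {
        long scaled = (efSearch * segmentDocs) / totalDocs;
        return (int) Math.max(k, scaled);
    }

    public static void main(String[] args) {
        // Two segments holding 90% and 10% of 1000 docs, efSearch=100, k=10:
        // the priority queues get sizes 90 and 10, as in the example above.
        System.out.println(perSegmentEf(100, 10, 900, 1000)); // 90
        System.out.println(perSegmentEf(100, 10, 100, 1000)); // 10
    }
}
```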





[GitHub] [lucene] Tony-X opened a new pull request, #12552: Make FSTPostingsFormat load FSTs off-heap

2023-09-12 Thread via GitHub


Tony-X opened a new pull request, #12552:
URL: https://github.com/apache/lucene/pull/12552

   ### Description
   
   FSTs have supported off-heap loading for a while. As we were trying to use `FSTPostingsFormat` for some fields, we noticed that heap usage increased.
   
   Upon further investigation we realized that `FSTPostingsFormat` does not load FSTs off-heap. This PR addresses that.
   
   
   





[GitHub] [lucene] msokolov commented on a diff in pull request #12552: Make FSTPostingsFormat load FSTs off-heap

2023-09-12 Thread via GitHub


msokolov commented on code in PR #12552:
URL: https://github.com/apache/lucene/pull/12552#discussion_r1323494538


##
lucene/codecs/src/java/org/apache/lucene/codecs/memory/FSTTermsReader.java:
##
@@ -191,7 +193,9 @@ final class TermsReader extends Terms {
   this.sumTotalTermFreq = sumTotalTermFreq;
   this.sumDocFreq = sumDocFreq;
   this.docCount = docCount;
-  this.dict = new FST<>(in, in, new FSTTermOutputs(fieldInfo));
+  OffHeapFSTStore offHeapFSTStore = new OffHeapFSTStore();
+  this.dict = new FST<>(in, in, new FSTTermOutputs(fieldInfo), 
offHeapFSTStore);
+  in.skipBytes(offHeapFSTStore.size());

Review Comment:
   just curious - why do we need to skip to the end?






[GitHub] [lucene] Tony-X commented on a diff in pull request #12552: Make FSTPostingsFormat load FSTs off-heap

2023-09-12 Thread via GitHub


Tony-X commented on code in PR #12552:
URL: https://github.com/apache/lucene/pull/12552#discussion_r1323531587


##
lucene/codecs/src/java/org/apache/lucene/codecs/memory/FSTTermsReader.java:
##
@@ -191,7 +193,9 @@ final class TermsReader extends Terms {
   this.sumTotalTermFreq = sumTotalTermFreq;
   this.sumDocFreq = sumDocFreq;
   this.docCount = docCount;
-  this.dict = new FST<>(in, in, new FSTTermOutputs(fieldInfo));
+  OffHeapFSTStore offHeapFSTStore = new OffHeapFSTStore();
+  this.dict = new FST<>(in, in, new FSTTermOutputs(fieldInfo), 
offHeapFSTStore);
+  in.skipBytes(offHeapFSTStore.size());

Review Comment:
   The OffHeapFSTStore doesn't advance the input (but the on-heap one does), so we need to seek it manually, since the input contains multiple FSTs.
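The same bookkeeping can be illustrated with a toy length-prefixed layout (pure JDK; the names are invented and this is not the FST file format):

```java
import java.nio.ByteBuffer;

public class ManualSkipDemo {

    // Toy stand-in for the off-heap path: each "FST" is a length-prefixed blob.
    // The blob body is read via absolute access (position untouched), so the
    // caller must advance past it by hand -- analogous to the manual skipBytes.
    static int loadBlobOffHeapStyle(ByteBuffer in) {
        int size = in.getInt();               // relative read: advances position
        int start = in.position();
        int firstByte = in.get(start) & 0xFF; // absolute read: position unchanged
        in.position(start + size);            // manual skip past the blob
        return firstByte;
    }

    public static void main(String[] args) {
        // Two length-prefixed blobs back to back: {7, 8} and {9}.
        ByteBuffer in = ByteBuffer.wrap(new byte[] {0, 0, 0, 2, 7, 8, 0, 0, 0, 1, 9});
        System.out.println(loadBlobOffHeapStyle(in)); // 7
        System.out.println(loadBlobOffHeapStyle(in)); // 9, reachable only because of the skip
    }
}
```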






[GitHub] [lucene] Tony-X commented on issue #12536: Remove `lastPosBlockOffset` from term metadata for Lucene90PostingsFormat

2023-09-12 Thread via GitHub


Tony-X commented on issue #12536:
URL: https://github.com/apache/lucene/issues/12536#issuecomment-1716406470

   https://github.com/apache/lucene/pull/12541 is merged and I'll close this 
one 





[GitHub] [lucene] Tony-X closed issue #12536: Remove `lastPosBlockOffset` from term metadata for Lucene90PostingsFormat

2023-09-12 Thread via GitHub


Tony-X closed issue #12536: Remove `lastPosBlockOffset` from term metadata for 
Lucene90PostingsFormat
URL: https://github.com/apache/lucene/issues/12536





[GitHub] [lucene] shubhamvishu commented on pull request #12183: Make some heavy query rewrites concurrent

2023-09-12 Thread via GitHub


shubhamvishu commented on PR #12183:
URL: https://github.com/apache/lucene/pull/12183#issuecomment-1716957965

   @jpountz I have made some changes to `TermStates#build` to unblock this PR and 
avoid the deadlock caused by the executor forking into itself. The code now 
checks whether the executor is a `ThreadPoolExecutor`, inspects its state (the 
number of actively executing threads), and only submits a task when there are 
enough free threads to do so. In effect, out of a pool with a fixed `X` threads 
we use at most `(X-1)`, always keeping/leaving at least **one thread** free, so 
there is always a thread available to run the "problematic" tasks that fork back 
into the executor and no deadlock can occur. This passes all the tests, 
including the previously failing seeds. Instead of keeping this PR blocked until 
we address #12438, maybe we could consider this as a short-term fix to unblock 
this PR; in the long term we could have a recommended concurrent executor, as 
you mentioned in issue #12438, to solve these issues. Let me know what you 
think?
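   The headroom check described above can be sketched as follows (a minimal
sketch with a hypothetical `HeadroomExecutor` helper, not the actual patch):
run the task inline whenever offloading could leave no free worker to make
progress.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.ThreadPoolExecutor;

public final class HeadroomExecutor {

    // Offload to the pool only if at least one core worker would stay free;
    // otherwise run on the caller thread, so a task that forks back into the
    // same executor can never exhaust the pool and deadlock.
    static void execute(Executor executor, Runnable task) {
        if (executor instanceof ThreadPoolExecutor pool
                && pool.getCorePoolSize() - pool.getActiveCount() <= 1) {
            task.run();
        } else {
            executor.execute(task);
        }
    }
}
```

   With a single-threaded pool the headroom is never more than one, so every
task runs on the caller thread, which is exactly the degenerate case the check
is meant to protect.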





[GitHub] [lucene] javanna commented on a diff in pull request #12183: Make some heavy query rewrites concurrent

2023-09-13 Thread via GitHub


javanna commented on code in PR #12183:
URL: https://github.com/apache/lucene/pull/12183#discussion_r1324086202


##
lucene/core/src/java/org/apache/lucene/index/TermStates.java:
##
@@ -86,19 +93,58 @@ public TermStates(
* @param needsStats if {@code true} then all leaf contexts will be visited 
up-front to collect
* term statistics. Otherwise, the {@link TermState} objects will be 
built only when requested
*/
-  public static TermStates build(IndexReaderContext context, Term term, 
boolean needsStats)
+  public static TermStates build(IndexSearcher indexSearcher, Term term, 
boolean needsStats)
   throws IOException {
-assert context != null && context.isTopLevel;
+IndexReaderContext context = indexSearcher.getTopReaderContext();
+assert context != null;
 final TermStates perReaderTermState = new TermStates(needsStats ? null : 
term, context);
 if (needsStats) {
-  for (final LeafReaderContext ctx : context.leaves()) {
-// if (DEBUG) System.out.println("  r=" + leaves[i].reader);
-TermsEnum termsEnum = loadTermsEnum(ctx, term);
-if (termsEnum != null) {
-  final TermState termState = termsEnum.termState();
-  // if (DEBUG) System.out.println("found");
-  perReaderTermState.register(
-  termState, ctx.ord, termsEnum.docFreq(), 
termsEnum.totalTermFreq());
+  Executor executor = indexSearcher.getExecutor();
+  if (executor == null) {
+executor = Runnable::run;
+  }
+  List<FutureTask<TermStateInfo>> tasks =
+  context.leaves().stream()
+  .map(
+  ctx ->
+  new FutureTask<>(
+  () -> {
+TermsEnum termsEnum = loadTermsEnum(ctx, term);
+if (termsEnum != null) {
+  return new TermStateInfo(
+  termsEnum.termState(),
+  ctx.ord,
+  termsEnum.docFreq(),
+  termsEnum.totalTermFreq());
+}
+return null;
+  }))
+  .toList();
+  for (FutureTask<TermStateInfo> task : tasks) {
+if (executor instanceof ThreadPoolExecutor pool) {
+  if ((pool.getCorePoolSize() - pool.getActiveCount()) <= 1) {
+task.run();
+  } else {
+executor.execute(task);
+  }
+} else {
+  executor.execute(task);
+}
+  }
+  for (FutureTask<TermStateInfo> task : tasks) {
+try {
+  TermStateInfo wrapper = task.get();
+  if (wrapper != null) {
+perReaderTermState.register(
+wrapper.getState(),
+wrapper.getOrdinal(),
+wrapper.getDocFreq(),
+wrapper.getTotalTermFreq());
+  }
+} catch (InterruptedException e) {
+  throw new ThreadInterruptedException(e);
+} catch (ExecutionException e) {
+  throw new RuntimeException(e.getCause());

Review Comment:
   TaskExecutor handles exceptions for both concurrent query rewrite and 
sliced search; I think we should reuse that instead.






[GitHub] [lucene] javanna commented on a diff in pull request #12183: Make some heavy query rewrites concurrent

2023-09-13 Thread via GitHub


javanna commented on code in PR #12183:
URL: https://github.com/apache/lucene/pull/12183#discussion_r1324085210


##
lucene/core/src/java/org/apache/lucene/index/TermStates.java:
##
@@ -86,19 +93,58 @@ public TermStates(
* @param needsStats if {@code true} then all leaf contexts will be visited 
up-front to collect
* term statistics. Otherwise, the {@link TermState} objects will be 
built only when requested
*/
-  public static TermStates build(IndexReaderContext context, Term term, 
boolean needsStats)
+  public static TermStates build(IndexSearcher indexSearcher, Term term, 
boolean needsStats)
   throws IOException {
-assert context != null && context.isTopLevel;
+IndexReaderContext context = indexSearcher.getTopReaderContext();
+assert context != null;
 final TermStates perReaderTermState = new TermStates(needsStats ? null : 
term, context);
 if (needsStats) {
-  for (final LeafReaderContext ctx : context.leaves()) {
-// if (DEBUG) System.out.println("  r=" + leaves[i].reader);
-TermsEnum termsEnum = loadTermsEnum(ctx, term);
-if (termsEnum != null) {
-  final TermState termState = termsEnum.termState();
-  // if (DEBUG) System.out.println("found");
-  perReaderTermState.register(
-  termState, ctx.ord, termsEnum.docFreq(), 
termsEnum.totalTermFreq());
+  Executor executor = indexSearcher.getExecutor();
+  if (executor == null) {
+executor = Runnable::run;
+  }
+  List<FutureTask<TermStateInfo>> tasks =
+  context.leaves().stream()
+  .map(
+  ctx ->
+  new FutureTask<>(
+  () -> {
+TermsEnum termsEnum = loadTermsEnum(ctx, term);
+if (termsEnum != null) {
+  return new TermStateInfo(
+  termsEnum.termState(),
+  ctx.ord,
+  termsEnum.docFreq(),
+  termsEnum.totalTermFreq());
+}
+return null;
+  }))
+  .toList();
+  for (FutureTask<TermStateInfo> task : tasks) {
+if (executor instanceof ThreadPoolExecutor pool) {
+  if ((pool.getCorePoolSize() - pool.getActiveCount()) <= 1) {
+task.run();

Review Comment:
   I recently removed another instanceof check done against the executor, as 
well as the conditional offloading to the executor. I am thinking that we 
should try not to go back to a similar mechanism, and instead figure out a 
long-term solution. 






[GitHub] [lucene] javanna commented on a diff in pull request #12183: Make some heavy query rewrites concurrent

2023-09-13 Thread via GitHub


javanna commented on code in PR #12183:
URL: https://github.com/apache/lucene/pull/12183#discussion_r1324087271


##
lucene/CHANGES.txt:
##
@@ -232,11 +172,6 @@ Other
 * GITHUB#12410: Refactor vectorization support (split provider from 
implementation classes).
   (Uwe Schindler, Chris Hegarty)
 
-* GITHUB#12428: Replace consecutive close() calls and close() calls with null 
checks with IOUtils.close().
-  (Shubham Chaudhary)
-
-* GITHUB#12512: Remove unused variable in BKDWriter. (Chao Zhang)
-

Review Comment:
   something went wrong merging from upstream I think, because a bunch of 
changes were removed. Can you address that please?






[GitHub] [lucene] javanna commented on a diff in pull request #12183: Make some heavy query rewrites concurrent

2023-09-13 Thread via GitHub


javanna commented on code in PR #12183:
URL: https://github.com/apache/lucene/pull/12183#discussion_r1324093739


##
lucene/core/src/java/org/apache/lucene/index/TermStates.java:
##
@@ -86,19 +93,58 @@ public TermStates(
* @param needsStats if {@code true} then all leaf contexts will be visited 
up-front to collect
* term statistics. Otherwise, the {@link TermState} objects will be 
built only when requested
*/
-  public static TermStates build(IndexReaderContext context, Term term, 
boolean needsStats)
+  public static TermStates build(IndexSearcher indexSearcher, Term term, 
boolean needsStats)
   throws IOException {
-assert context != null && context.isTopLevel;
+IndexReaderContext context = indexSearcher.getTopReaderContext();
+assert context != null;
 final TermStates perReaderTermState = new TermStates(needsStats ? null : 
term, context);
 if (needsStats) {
-  for (final LeafReaderContext ctx : context.leaves()) {
-// if (DEBUG) System.out.println("  r=" + leaves[i].reader);
-TermsEnum termsEnum = loadTermsEnum(ctx, term);
-if (termsEnum != null) {
-  final TermState termState = termsEnum.termState();
-  // if (DEBUG) System.out.println("found");
-  perReaderTermState.register(
-  termState, ctx.ord, termsEnum.docFreq(), 
termsEnum.totalTermFreq());
+  Executor executor = indexSearcher.getExecutor();
+  if (executor == null) {
+executor = Runnable::run;
+  }
+  List<FutureTask<TermStateInfo>> tasks =
+  context.leaves().stream()
+  .map(
+  ctx ->
+  new FutureTask<>(
+  () -> {
+TermsEnum termsEnum = loadTermsEnum(ctx, term);
+if (termsEnum != null) {
+  return new TermStateInfo(
+  termsEnum.termState(),
+  ctx.ord,
+  termsEnum.docFreq(),
+  termsEnum.totalTermFreq());
+}
+return null;
+  }))
+  .toList();
+  for (FutureTask<TermStateInfo> task : tasks) {
+if (executor instanceof ThreadPoolExecutor pool) {
+  if ((pool.getCorePoolSize() - pool.getActiveCount()) <= 1) {
+task.run();

Review Comment:
   I caught up with previous discussions and I believe the suggestion that was 
made was to "run tasks in the current thread if called from a thread of the 
pool". I don't think this conditional achieves that; rather, it conditionally 
executes on the caller thread depending on the size of the executor, if it is 
a ThreadPoolExecutor. 
   
   I think it would be good to detect that we are already in an executor thread 
and execute on the caller thread instead of forking in that case, and that 
logic would fit well in TaskExecutor.
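   One way to detect "already in an executor thread" is to mark pool workers
with a dedicated Thread subclass. This is an illustrative sketch only (the
`PoolAware` and `PoolThread` names are hypothetical, not Lucene or TaskExecutor
API):

```java
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public final class PoolAware {

    // Pool workers are created as a dedicated Thread subclass so that code
    // running on them can recognize "I am already inside the pool".
    static final class PoolThread extends Thread {
        PoolThread(Runnable r) { super(r); }
    }

    static ExecutorService newPool(int nThreads) {
        return Executors.newFixedThreadPool(nThreads, PoolThread::new);
    }

    // Fork only when called from outside the pool; run inline otherwise,
    // so pool threads never block waiting on a task that cannot be scheduled.
    static void executeOrRunInline(Executor executor, Runnable task) {
        if (Thread.currentThread() instanceof PoolThread) {
            task.run();
        } else {
            executor.execute(task);
        }
    }
}
```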






[GitHub] [lucene] shubhamvishu commented on a diff in pull request #12183: Make some heavy query rewrites concurrent

2023-09-13 Thread via GitHub


shubhamvishu commented on code in PR #12183:
URL: https://github.com/apache/lucene/pull/12183#discussion_r1324225960


##
lucene/core/src/java/org/apache/lucene/index/TermStates.java:
##
@@ -86,19 +93,58 @@ public TermStates(
* @param needsStats if {@code true} then all leaf contexts will be visited 
up-front to collect
* term statistics. Otherwise, the {@link TermState} objects will be 
built only when requested
*/
-  public static TermStates build(IndexReaderContext context, Term term, 
boolean needsStats)
+  public static TermStates build(IndexSearcher indexSearcher, Term term, 
boolean needsStats)
   throws IOException {
-assert context != null && context.isTopLevel;
+IndexReaderContext context = indexSearcher.getTopReaderContext();
+assert context != null;
 final TermStates perReaderTermState = new TermStates(needsStats ? null : 
term, context);
 if (needsStats) {
-  for (final LeafReaderContext ctx : context.leaves()) {
-// if (DEBUG) System.out.println("  r=" + leaves[i].reader);
-TermsEnum termsEnum = loadTermsEnum(ctx, term);
-if (termsEnum != null) {
-  final TermState termState = termsEnum.termState();
-  // if (DEBUG) System.out.println("found");
-  perReaderTermState.register(
-  termState, ctx.ord, termsEnum.docFreq(), 
termsEnum.totalTermFreq());
+  Executor executor = indexSearcher.getExecutor();
+  if (executor == null) {
+executor = Runnable::run;
+  }
+  List<FutureTask<TermStateInfo>> tasks =
+  context.leaves().stream()
+  .map(
+  ctx ->
+  new FutureTask<>(
+  () -> {
+TermsEnum termsEnum = loadTermsEnum(ctx, term);
+if (termsEnum != null) {
+  return new TermStateInfo(
+  termsEnum.termState(),
+  ctx.ord,
+  termsEnum.docFreq(),
+  termsEnum.totalTermFreq());
+}
+return null;
+  }))
+  .toList();
+  for (FutureTask<TermStateInfo> task : tasks) {
+if (executor instanceof ThreadPoolExecutor pool) {
+  if ((pool.getCorePoolSize() - pool.getActiveCount()) <= 1) {
+task.run();

Review Comment:
   > I recently removed another instanceof check done against the executor, as 
well as the conditional offloading to the executor. I am thinking that we 
should try and not to go back to a similar mechanism and figure out a long-term 
solution.
   
   Yes, I'm aware of that change and completely agree it's not the ideal way, 
but maybe it's something that could unblock this PR until there is a concrete 
solution?
   
   > I caught up with previous discussions and I believe that the suggestion 
that was made was to " run tasks in the current thread if called from a thread 
of the pool". I don't think this conditional achieves that
   
   Yes, I did try that, but there seems to be no straightforward way or API to 
know whether we are running on an executor thread. Maybe we could use 
`Thread.currentThread().getStackTrace()` to check whether we earlier had a call 
to `run` from the executor?



##
lucene/core/src/java/org/apache/lucene/index/TermStates.java:
##
@@ -86,19 +93,58 @@ public TermStates(
* @param needsStats if {@code true} then all leaf contexts will be visited 
up-front to collect
* term statistics. Otherwise, the {@link TermState} objects will be 
built only when requested
*/
-  public static TermStates build(IndexReaderContext context, Term term, 
boolean needsStats)
+  public static TermStates build(IndexSearcher indexSearcher, Term term, 
boolean needsStats)
   throws IOException {
-assert context != null && context.isTopLevel;
+IndexReaderContext context = indexSearcher.getTopReaderContext();
+assert context != null;
 final TermStates perReaderTermState = new TermStates(needsStats ? null : 
term, context);
 if (needsStats) {
-  for (final LeafReaderContext ctx : context.leaves()) {
-// if (DEBUG) System.out.println("  r=" + leaves[i].reader);
-TermsEnum termsEnum = loadTermsEnum(ctx, term);
-if (termsEnum != null) {
-  final TermState termState = termsEnum.termState();
-  // if (DEBUG) System.out.println("found");
-  perReaderTermState.register(
-  termState, ctx.ord, termsEnum.docFreq(), 
termsEnum.totalTermFreq());
+  Executor executor = indexSearcher.getExecutor();
+  if (executor == null) {
+executor = Runnable::run;
+  }
+  List<FutureTask<TermStateInfo>> tasks =
+  context.leaves().stream()
+  .map(
+  ctx ->
+  new FutureTask<>(
+  () -> {
+TermsEn

[GitHub] [lucene] uschindler commented on a diff in pull request #12183: Make TermStates#build concurrent

2023-09-13 Thread via GitHub


uschindler commented on code in PR #12183:
URL: https://github.com/apache/lucene/pull/12183#discussion_r1324259855


##
lucene/core/src/java/org/apache/lucene/index/TermStates.java:
##
@@ -86,19 +93,58 @@ public TermStates(
* @param needsStats if {@code true} then all leaf contexts will be visited 
up-front to collect
* term statistics. Otherwise, the {@link TermState} objects will be 
built only when requested
*/
-  public static TermStates build(IndexReaderContext context, Term term, 
boolean needsStats)
+  public static TermStates build(IndexSearcher indexSearcher, Term term, 
boolean needsStats)
   throws IOException {
-assert context != null && context.isTopLevel;
+IndexReaderContext context = indexSearcher.getTopReaderContext();
+assert context != null;
 final TermStates perReaderTermState = new TermStates(needsStats ? null : 
term, context);
 if (needsStats) {
-  for (final LeafReaderContext ctx : context.leaves()) {
-// if (DEBUG) System.out.println("  r=" + leaves[i].reader);
-TermsEnum termsEnum = loadTermsEnum(ctx, term);
-if (termsEnum != null) {
-  final TermState termState = termsEnum.termState();
-  // if (DEBUG) System.out.println("found");
-  perReaderTermState.register(
-  termState, ctx.ord, termsEnum.docFreq(), 
termsEnum.totalTermFreq());
+  Executor executor = indexSearcher.getExecutor();
+  if (executor == null) {
+executor = Runnable::run;
+  }
+  List<FutureTask<TermStateInfo>> tasks =
+  context.leaves().stream()
+  .map(
+  ctx ->
+  new FutureTask<>(
+  () -> {
+TermsEnum termsEnum = loadTermsEnum(ctx, term);
+if (termsEnum != null) {
+  return new TermStateInfo(
+  termsEnum.termState(),
+  ctx.ord,
+  termsEnum.docFreq(),
+  termsEnum.totalTermFreq());
+}
+return null;
+  }))
+  .toList();
+  for (FutureTask<TermStateInfo> task : tasks) {
+if (executor instanceof ThreadPoolExecutor pool) {
+  if ((pool.getCorePoolSize() - pool.getActiveCount()) <= 1) {
+task.run();

Review Comment:
   If you really want to do this, use StackWalker. But take care: it might 
require additional privileges if used in the wrong way.
   
   The main problem with instanceof checks is that there are many other 
implementations of Executor that may deadlock in similar ways. Think also about 
virtual threads in Java 21!
   
   I think the only way is to move away from the stupid old Executor abstraction 
and use the JDK's ForkJoin framework. It is capable of forking from inside 
threads which were already forked and can handle that without deadlocks. 
   
   We need to change IndexSearcher to use ForkJoinPool instead of a plain 
Executor: 
https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ForkJoinPool.html
   Then all tasks should be not plain Runnables but ForkJoinTasks: 
https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ForkJoinTask.html
   
   I think rewriting the concurrency in IndexSearcher to use Fork/Join should be 
a separate PR.
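   A minimal sketch of the Fork/Join property Uwe describes (illustrative only,
not an IndexSearcher change): a `RecursiveTask` may fork and join from inside a
worker thread without deadlocking, because idle workers steal the forked
subtasks instead of sitting blocked.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ForkJoinDemo {

    // Sums the range [lo, hi) by splitting recursively. fork()/join() are
    // called from inside worker threads; work-stealing keeps even a tiny
    // pool from deadlocking on the nested joins.
    static final class SumTask extends RecursiveTask<Long> {
        final long lo, hi;

        SumTask(long lo, long hi) {
            this.lo = lo;
            this.hi = hi;
        }

        @Override
        protected Long compute() {
            if (hi - lo <= 1_000) {
                long sum = 0;
                for (long i = lo; i < hi; i++) sum += i;
                return sum;
            }
            long mid = (lo + hi) >>> 1;
            SumTask left = new SumTask(lo, mid);
            left.fork();                               // forked from a worker
            long right = new SumTask(mid, hi).compute();
            return right + left.join();                // joined from a worker
        }
    }

    static long sum(long n) {
        return new ForkJoinPool(2).invoke(new SumTask(0, n));
    }
}
```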
   






[GitHub] [lucene] jpountz commented on pull request #12489: Add support for recursive graph bisection.

2023-09-13 Thread via GitHub


jpountz commented on PR #12489:
URL: https://github.com/apache/lucene/pull/12489#issuecomment-1717341776

   I just found a bug that in practice made BP run only one iteration per 
level; fixing it makes performance better (wikibigall):
   
   ```
   Task                      QPS baseline (StdDev)   QPS my_modified_version (StdDev)       Pct diff          p-value
   IntNRQ                        122.77  (15.4%)             114.15   (0.7%)        -7.0% ( -20% -   10%)      0.363
   PKLookup                      294.84   (2.9%)             282.06   (2.7%)        -4.3% (  -9% -    1%)      0.030
   OrHighLow                     713.73   (3.5%)             688.95   (3.7%)        -3.5% ( -10% -    3%)      0.170
   Wildcard                       78.71   (4.2%)              78.01   (1.1%)        -0.9% (  -6% -    4%)      0.682
   Prefix3                       131.65   (9.1%)             132.63   (7.3%)         0.7% ( -14% -   18%)      0.898
   Respell                       203.56   (0.3%)             205.74   (1.1%)         1.1% (   0% -    2%)      0.051
   HighTermMonthSort            6065.88   (2.1%)            6162.98   (1.5%)         1.6% (  -1% -    5%)      0.208
   HighSpanNear                    5.21   (1.7%)               5.40   (2.6%)         3.6% (   0% -    7%)      0.021
   MedSloppyPhrase                 5.78   (3.5%)               6.15   (5.3%)         6.3% (  -2% -   15%)      0.047
   MedSpanNear                     9.40   (0.8%)              10.05   (1.1%)         6.9% (   4% -    8%)      0.000
   LowSpanNear                    13.99   (1.0%)              15.28   (1.2%)         9.2% (   6% -   11%)      0.000
   HighSloppyPhrase                1.26   (4.9%)               1.38   (8.3%)         9.9% (  -3% -   24%)      0.039
   OrHighHigh                     46.12   (8.9%)              55.13   (6.8%)        19.5% (   3% -   38%)      0.001
   Fuzzy2                        163.38   (0.8%)             199.07   (0.7%)        21.8% (  20% -   23%)      0.000
   LowSloppyPhrase                28.75   (2.2%)              35.28   (3.1%)        22.7% (  17% -   28%)      0.000
   HighPhrase                      7.58   (2.1%)               9.35   (1.7%)        23.4% (  19% -   27%)      0.000
   OrHighMed                     146.19   (6.5%)             183.57   (5.2%)        25.6% (  12% -   39%)      0.000
   HighTermDayOfYearSort         153.45   (2.5%)             194.38   (1.9%)        26.7% (  21% -   31%)      0.000
   Fuzzy1                        259.92   (2.4%)             345.09   (2.5%)        32.8% (  27% -   38%)      0.000
   HighTerm                      478.18   (9.8%)             670.01   (9.2%)        40.1% (  19% -   65%)      0.000
   MedTerm                       577.98   (9.0%)             845.32  (10.0%)        46.3% (  25% -   71%)      0.000
   AndHighMed                    157.39   (4.5%)             243.75   (7.3%)        54.9% (  41% -   69%)      0.000
   LowTerm                      1016.15   (7.6%)            1671.11   (9.8%)        64.5% (  43% -   88%)      0.000
   AndHighLow                    746.14   (1.7%)            1227.66   (4.2%)        64.5% (  57% -   71%)      0.000
   MedPhrase                      41.72   (2.0%)              71.95   (3.4%)        72.4% (  65% -   79%)      0.000
   AndHighHigh                    31.03   (7.0%)              56.59  (13.4%)        82.4% (  57% -  110%)      0.000
   LowPhrase                      69.04   (1.5%)             126.15   (3.4%)        82.7% (  76% -   88%)      0.000
   ```
   
   Space savings are also bigger on postings:
   
   | File | before (MB) | after (MB) |
   | - | - | - |
   | terms (tim) | 767 |763 |
   | postings (doc) | 2779 | 2260 |
   | positions (pos) | 11356 | 10522 |
   | points (kdd) | 100 | 99 |
   | doc values (dvd) | 456 | 462 |
   | stored fields (fdt) | 249 | 226 |
   | norms (nvd) | 13 | 13 |
   | total | 15734 |14360 |





[GitHub] [lucene] javanna commented on a diff in pull request #12183: Make TermStates#build concurrent

2023-09-13 Thread via GitHub


javanna commented on code in PR #12183:
URL: https://github.com/apache/lucene/pull/12183#discussion_r1324466373


##
lucene/core/src/java/org/apache/lucene/index/TermStates.java:
##
@@ -86,19 +93,58 @@ public TermStates(
* @param needsStats if {@code true} then all leaf contexts will be visited 
up-front to collect
* term statistics. Otherwise, the {@link TermState} objects will be 
built only when requested
*/
-  public static TermStates build(IndexReaderContext context, Term term, 
boolean needsStats)
+  public static TermStates build(IndexSearcher indexSearcher, Term term, 
boolean needsStats)
   throws IOException {
-assert context != null && context.isTopLevel;
+IndexReaderContext context = indexSearcher.getTopReaderContext();
+assert context != null;
 final TermStates perReaderTermState = new TermStates(needsStats ? null : 
term, context);
 if (needsStats) {
-  for (final LeafReaderContext ctx : context.leaves()) {
-// if (DEBUG) System.out.println("  r=" + leaves[i].reader);
-TermsEnum termsEnum = loadTermsEnum(ctx, term);
-if (termsEnum != null) {
-  final TermState termState = termsEnum.termState();
-  // if (DEBUG) System.out.println("found");
-  perReaderTermState.register(
-  termState, ctx.ord, termsEnum.docFreq(), 
termsEnum.totalTermFreq());
+  Executor executor = indexSearcher.getExecutor();
+  if (executor == null) {
+executor = Runnable::run;
+  }
+  List<FutureTask<TermStateInfo>> tasks =
+  context.leaves().stream()
+  .map(
+  ctx ->
+  new FutureTask<>(
+  () -> {
+TermsEnum termsEnum = loadTermsEnum(ctx, term);
+if (termsEnum != null) {
+  return new TermStateInfo(
+  termsEnum.termState(),
+  ctx.ord,
+  termsEnum.docFreq(),
+  termsEnum.totalTermFreq());
+}
+return null;
+  }))
+  .toList();
+  for (FutureTask<TermStateInfo> task : tasks) {
+if (executor instanceof ThreadPoolExecutor pool) {
+  if ((pool.getCorePoolSize() - pool.getActiveCount()) <= 1) {
+task.run();

Review Comment:
   Thanks for bringing this up Uwe, that sounds promising. I need to get 
familiar with ForkJoinPool and see how we can use it in IndexSearcher.






[GitHub] [lucene] Shradha26 opened a new issue, #12553: [DISCUSS] Identifying Gaps in Lucene’s Faceting

2023-09-13 Thread via GitHub


Shradha26 opened a new issue, #12553:
URL: https://github.com/apache/lucene/issues/12553

   I’d like to gather a list of areas where Lucene’s support for aggregations 
can be improved and discuss if faceting can be augmented to offer that support 
or if it would need to be separate functionality. Please suggest more ideas or 
challenge the ones listed!
   
   ## Description
   
   Information Retrieval platforms built on top of Lucene, like Solr, 
Elasticsearch, and OpenSearch, have rich aggregation engines that are different 
from what Lucene natively supports. Lucene has some unique ideas to make aggregation 
computation efficient. Some examples are -
   
   * side car taxonomy index that does some processing at index time to make 
search time computation of aggregations faster, but adds the hassle of 
maintaining another index, risk of it being corrupt, etc.
   * rolling up values from children to parents (post aggregation) to compute 
aggregations in some cases - reducing the overall number of values computed per 
aggregation. 
   
   
   Here are some ideas [@stefanvodita](https://github.com/stefanvodita) and I 
encountered in our work and through exploration of what the other platforms 
support -
   
   ### New features
   
   * Support dynamic group definitions: In Solr,  defining an aggregation group 
is as simple as providing an iterator over documents. In Lucene, we can’t make 
arbitrary group definitions to that extent.
   * Support adding data for groups (ordinals). Association Facet Fields can do 
this; but one problem is that there’s no single authority for the data. For 
example, if we have an index of books where the Author is a Facet Field on the 
book document and we want to store the Author’s popularity, with Association 
Facet Fields, we’ll need to denormalize this value once per each book document. 
This introduces the possibility of some document changing/overwriting the 
intended value, inconsistently. For Taxonomy Facets, we could use the side car 
taxonomy index to efficiently add data about aggregation groups in normalized 
form: https://github.com/apache/lucene/pull/12337
   * Aggregation Expression Facets: Users may want to define expressions that 
reference other aggregation functions: 
https://github.com/apache/lucene/pull/12184. Note that this is different from 
the “Expression Facets” in the current Lucene demos - those are “document 
expression facets” and are expressions defined using fields on the document and 
do not use aggregations in the definition.
   * Nested aggregations: This is similar to the idea of Expression Facets in 
that they are an aggregation over other aggregations, but this time the parent 
aggregation references aggregations from lower levels in a hierarchy. For 
example: if we have an index of books with a hierarchical Facet Field for 
Author like /, we want to be able to answer queries like 
“What is the nationality of the author with the most books”? To do this, we’ll 
need to compute an aggregation A1 = count(books) per Author (level 1 in the 
hierarchy) and then do a max(A1) per Nationality (level 2 in the hierarchy).
   * [Cascading aggregation 
groups](https://github.com/apache/lucene/issues/4195): I think this feature 
corresponds to nested facets in Solr. A clothing store could have products 
categorized by size and color, as different dimensions. They might want to give 
customers a navigation experience that breaks down sizes by color. The customer 
might like blue, but see that the store is in short supply of blue items of the 
right size, and select a different color instead.
   * API to associate aggregation groups with the aggregation functions: For 
single valued hierarchical facet fields, Lucene uses roll-ups to compute 
aggregation across the different levels. However, for all other cases, it uses 
an approach I’ll refer to as jump-ups - for each document, we iterate through 
all possible unique groups (ordinals) relevant to the document (irrespective of 
their relationship via a hierarchy), and update an aggregation accumulator 
corresponding to the group (ordinals). (This is the values array indexed by 
ordinal in TaxonomyFacet implementations). Right now, even the jump-up approach 
ends up computing the aggregation function across all groups. However, it has 
the potential to compute exactly the number of aggregations needed. By 
associating aggregation groups with the aggregation functions users are 
interested in, Lucene can compute the exact number of aggregations. It could 
also be beneficial in cases where users are interested in a variety of 
aggregations for different groups.
   * Enable deep hierarchies: Very deep ordinal hierarchies can have trouble 
scaling. Storing every ordinal in the path up to a label on a doc may not be 
feasible. Computing aggregations would be slower than it has to be if they are 
only needed for portions of the hierarchy. Short-circuiting roll-ups, or 
aggregating directly into targeted ordinals, could help here.
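
The jump-up approach described above can be sketched as follows (a minimal
illustration, not the actual TaxonomyFacet code; the class and method names
here are hypothetical):

```java
// Minimal illustration of the "jump-up" approach: for each document, iterate
// all ordinals (groups) attached to it, irrespective of their position in the
// hierarchy, and update an accumulator indexed by ordinal (the counterpart of
// the values array in TaxonomyFacet implementations).
class JumpUpAggregator {
  private final long[] values; // one accumulator per ordinal

  JumpUpAggregator(int numOrdinals) {
    this.values = new long[numOrdinals];
  }

  // ordsPerDoc[doc] lists every ordinal relevant to that doc.
  void countAll(int[][] ordsPerDoc) {
    for (int[] ords : ordsPerDoc) {
      for (int ord : ords) {
        values[ord]++; // the aggregation function here is a simple count
      }
    }
  }

  long value(int ord) {
    return values[ord];
  }
}
```

Note how the accumulation happens across all groups; associating groups with
the aggregation functions users care about would let this loop skip ordinals
no one asked for.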

[GitHub] [lucene] jmazanec15 commented on issue #12533: Init HNSW merge with graph containing deleted documents

2023-09-13 Thread via GitHub


jmazanec15 commented on issue #12533:
URL: https://github.com/apache/lucene/issues/12533#issuecomment-1717981259

   Additionally, the [FreshDiskANN](https://arxiv.org/pdf/2105.09613.pdf) paper 
did some work in this space. They ran a test for NSG where they iteratively 
repeat the following process a certain number of cycles and track the recall: 
   1. delete 5% of the index
   2. patch the incident nodes that were impacted via local neighborhoods 
(similar to @zhaih (1))
   3. reinsert the deleted nodes
   4. measure recall
   
   They ran a similar one for HNSW where they do not patch the edges. In both 
cases, they saw some degradation:
   
   ![Screenshot 2023-09-13 at 9 40 10 
AM](https://github.com/apache/lucene/assets/19438237/49f2eff7-7388-4e9d-a877-9421b2c9f790).
   
   Their intuition is that this happens because the graphs become sparser as 
this process repeats, leading to less navigability. The graphs become sparser 
because the pruning policy is more strict.
   
   In their system, they do employ an algorithm similar to @zhaih (1), where 
they connect the incident nodes and prune based on some criteria, which shows 
promise.
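
The edge-patching step (2) can be sketched roughly like this (a toy
adjacency-list model, not Lucene's HNSW code; the distance-based pruning the
papers apply afterwards is intentionally omitted):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy sketch of step (2): when a node is deleted, link its incident nodes to
// each other (the deleted node's local neighborhood) so the graph stays
// navigable. Real systems (FreshDiskANN, proposal (1) above) then prune these
// candidate edges with a distance-based criterion; that step is omitted here.
class GraphPatcher {
  final Map<Integer, Set<Integer>> adj = new HashMap<>();

  void addEdge(int a, int b) {
    adj.computeIfAbsent(a, k -> new HashSet<>()).add(b);
    adj.computeIfAbsent(b, k -> new HashSet<>()).add(a);
  }

  void deleteAndPatch(int node) {
    Set<Integer> neighbors = adj.remove(node);
    if (neighbors == null) {
      return;
    }
    for (int n : neighbors) {
      Set<Integer> edges = adj.get(n);
      edges.remove(node); // drop the edge to the deleted node
      for (int m : neighbors) {
        if (m != n) {
          edges.add(m); // symmetric: m's own loop iteration adds the back edge
        }
      }
    }
  }
}
```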
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler commented on a diff in pull request #12183: Make TermStates#build concurrent

2023-09-13 Thread via GitHub


uschindler commented on code in PR #12183:
URL: https://github.com/apache/lucene/pull/12183#discussion_r1324955170


##
lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java:
##
@@ -64,4 +68,57 @@ final <T> List<T> invokeAll(Collection<RunnableFuture<T>> tasks) throws IOExcept
 }
 return results;
   }
+
+  /**
+   * Execute all the tasks provided as an argument concurrently only if it is 
a {@link
+   * ThreadPoolExecutor} and the current thread invoking this is not a {@link 
ThreadPoolExecutor}
+   * thread, else run all the tasks sequentially, wait for them to complete 
and return the obtained
+   * results.
+   *
+   * @param tasks the tasks to execute
+   * @return a list containing the results from the tasks execution
+   * @param <T> the return type of the task execution
+   */
+  public final <T> List<T> invokeAllWithThreadPoolExecutor(Collection<RunnableFuture<T>> tasks)
+  throws IOException {
+boolean executeOnCallerThread;
+long frames;
+try {
+  frames =
+  StackWalker.getInstance(StackWalker.Option.RETAIN_CLASS_REFERENCE)

Review Comment:
   This is exactly the problematic option. Retain class reference requires 
additional privileges (as it is security relevant).
   Please avoid this where possible.
   Actually in your code this option is not required, because you only need the 
string class name. So remove it.






[GitHub] [lucene] shubhamvishu commented on a diff in pull request #12183: Make TermStates#build concurrent

2023-09-13 Thread via GitHub


shubhamvishu commented on code in PR #12183:
URL: https://github.com/apache/lucene/pull/12183#discussion_r1325027898


##
lucene/core/src/java/org/apache/lucene/index/TermStates.java:
##
@@ -86,19 +93,58 @@ public TermStates(
* @param needsStats if {@code true} then all leaf contexts will be visited 
up-front to collect
* term statistics. Otherwise, the {@link TermState} objects will be 
built only when requested
*/
-  public static TermStates build(IndexReaderContext context, Term term, 
boolean needsStats)
+  public static TermStates build(IndexSearcher indexSearcher, Term term, 
boolean needsStats)
   throws IOException {
-assert context != null && context.isTopLevel;
+IndexReaderContext context = indexSearcher.getTopReaderContext();
+assert context != null;
 final TermStates perReaderTermState = new TermStates(needsStats ? null : 
term, context);
 if (needsStats) {
-  for (final LeafReaderContext ctx : context.leaves()) {
-// if (DEBUG) System.out.println("  r=" + leaves[i].reader);
-TermsEnum termsEnum = loadTermsEnum(ctx, term);
-if (termsEnum != null) {
-  final TermState termState = termsEnum.termState();
-  // if (DEBUG) System.out.println("found");
-  perReaderTermState.register(
-  termState, ctx.ord, termsEnum.docFreq(), 
termsEnum.totalTermFreq());
+  Executor executor = indexSearcher.getExecutor();
+  if (executor == null) {
+executor = Runnable::run;
+  }
+  List<FutureTask<TermStateInfo>> tasks =
+  context.leaves().stream()
+  .map(
+  ctx ->
+  new FutureTask<>(
+  () -> {
+TermsEnum termsEnum = loadTermsEnum(ctx, term);
+if (termsEnum != null) {
+  return new TermStateInfo(
+  termsEnum.termState(),
+  ctx.ord,
+  termsEnum.docFreq(),
+  termsEnum.totalTermFreq());
+}
+return null;
+  }))
+  .toList();
+  for (FutureTask<TermStateInfo> task : tasks) {
+if (executor instanceof ThreadPoolExecutor pool) {
+  if ((pool.getCorePoolSize() - pool.getActiveCount()) <= 1) {
+task.run();

Review Comment:
   > If you really want to do this use StackWalker. But take care that it might 
require additional privileges if used in a wrong way.
   
   I did give this a shot in the new revision(s). I tried to use 
`Thread.currentThread().getStackTrace()` first and also `StackWalker` using 
`StackWalker.Option.SHOW_HIDDEN_FRAMES` option. I tried using different 
`StackWalker.Option`s and observed the tests failing while using 
`StackWalker.Option.RETAIN_CLASS_REFERENCE` due to missing privileges. Hence, I 
tried to keep a fail-safe/fallback approach, i.e. in case of permission issues 
we would fall back to running the tasks on the caller thread sequentially. 
Since I'm using `StackWalker.Option.SHOW_HIDDEN_FRAMES` here, maybe we don't 
need the fail-safe approach and could throw the exception directly instead, but 
I reckon we would not want to fail here due to permission issues? 
   PS: I'm not very familiar with the privileges `StackWalker` requires or the 
scenarios where it could cause issues.
   
   > The main problem with instance of checks is that there are many other 
implementations of Executor that may deadlock in similar ways. Think also about 
virtual threads in java 21!
   
   Totally agreed 💯! In the new revision I have tried to only go with the 
concurrent approach if it's a `ThreadPoolExecutor` and it's not already running 
on an executor thread; if not, the tasks run sequentially on the caller thread. 
This means for any other executor implementations we would not make use of 
concurrency and stick to the current non-concurrent behaviour of 
`TermStates#build`, at least until we have a proper permanent fix.
   
   > I think the only way is to go away with the stupid old Executor 
abstraction and use the ForkJoin framework of JDK. It is capable of forking 
from inside threads which were already forked and can handle that without 
deadlocks.
   
   Nice! I really liked this idea! It's time to bid goodbye to these old 
executor implementations. I'll try to dig deeper into this Fork/Join strategy.
   
   -
   
   @javanna @uschindler In the new revision I have made changes to 
`TaskExecutor` to work around this specific deadlock situation, going 
concurrent only if 1) it's a `ThreadPoolExecutor` and 2) we are not already 
running on an executor thread. This solves the deadlock issue, but we only run 
`TermStates#build` concurrently when both conditions hold. This seems to be a 
fix to unblock this PR, but indeed we need to get rid of it and make 
use of Fork/Join s

[GitHub] [lucene] shubhamvishu commented on a diff in pull request #12183: Make TermStates#build concurrent

2023-09-13 Thread via GitHub


shubhamvishu commented on code in PR #12183:
URL: https://github.com/apache/lucene/pull/12183#discussion_r1325031509


##
lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java:
##
@@ -64,4 +68,57 @@ final <T> List<T> invokeAll(Collection<RunnableFuture<T>> tasks) throws IOExcept
 }
 return results;
   }
+
+  /**
+   * Execute all the tasks provided as an argument concurrently only if it is 
a {@link
+   * ThreadPoolExecutor} and the current thread invoking this is not a {@link 
ThreadPoolExecutor}
+   * thread, else run all the tasks sequentially, wait for them to complete 
and return the obtained
+   * results.
+   *
+   * @param tasks the tasks to execute
+   * @return a list containing the results from the tasks execution
+   * @param <T> the return type of the task execution
+   */
+  public final <T> List<T> invokeAllWithThreadPoolExecutor(Collection<RunnableFuture<T>> tasks)
+  throws IOException {
+boolean executeOnCallerThread;
+long frames;
+try {
+  frames =
+  StackWalker.getInstance(StackWalker.Option.RETAIN_CLASS_REFERENCE)

Review Comment:
   @uschindler Indeed! I discovered this when I tried the other options (which 
seem to require no/fewer privileges?). But this led me to a fail-safe approach: 
not throwing an exception in this case and falling back to running these tasks 
sequentially. Let me know if that makes sense.






[GitHub] [lucene] uschindler commented on a diff in pull request #12183: Make TermStates#build concurrent

2023-09-13 Thread via GitHub


uschindler commented on code in PR #12183:
URL: https://github.com/apache/lucene/pull/12183#discussion_r1325065806


##
lucene/core/src/java/org/apache/lucene/index/TermStates.java:
##
@@ -86,19 +93,58 @@ public TermStates(
* @param needsStats if {@code true} then all leaf contexts will be visited 
up-front to collect
* term statistics. Otherwise, the {@link TermState} objects will be 
built only when requested
*/
-  public static TermStates build(IndexReaderContext context, Term term, 
boolean needsStats)
+  public static TermStates build(IndexSearcher indexSearcher, Term term, 
boolean needsStats)
   throws IOException {
-assert context != null && context.isTopLevel;
+IndexReaderContext context = indexSearcher.getTopReaderContext();
+assert context != null;
 final TermStates perReaderTermState = new TermStates(needsStats ? null : 
term, context);
 if (needsStats) {
-  for (final LeafReaderContext ctx : context.leaves()) {
-// if (DEBUG) System.out.println("  r=" + leaves[i].reader);
-TermsEnum termsEnum = loadTermsEnum(ctx, term);
-if (termsEnum != null) {
-  final TermState termState = termsEnum.termState();
-  // if (DEBUG) System.out.println("found");
-  perReaderTermState.register(
-  termState, ctx.ord, termsEnum.docFreq(), 
termsEnum.totalTermFreq());
+  Executor executor = indexSearcher.getExecutor();
+  if (executor == null) {
+executor = Runnable::run;
+  }
+  List<FutureTask<TermStateInfo>> tasks =
+  context.leaves().stream()
+  .map(
+  ctx ->
+  new FutureTask<>(
+  () -> {
+TermsEnum termsEnum = loadTermsEnum(ctx, term);
+if (termsEnum != null) {
+  return new TermStateInfo(
+  termsEnum.termState(),
+  ctx.ord,
+  termsEnum.docFreq(),
+  termsEnum.totalTermFreq());
+}
+return null;
+  }))
+  .toList();
+  for (FutureTask<TermStateInfo> task : tasks) {
+if (executor instanceof ThreadPoolExecutor pool) {
+  if ((pool.getCorePoolSize() - pool.getActiveCount()) <= 1) {
+task.run();

Review Comment:
   Instead of counting all possible frames (it's too expensive), use 
`anyMatch(frame -> 
frame.getClassName().contains(THREAD_POOL_EXECUTOR_WORKER_CLASS) && 
frame.getMethodName().contains(RUN_METHOD))` to just find the first one.
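
For illustration, the short-circuiting walk suggested here could look like the
following (a hedged sketch, not the PR's code; the constants are assumptions
based on JDK internals, where pool worker threads run through
`ThreadPoolExecutor$Worker#run`):

```java
// Sketch of the anyMatch idea: stop at the first frame showing we are already
// on a ThreadPoolExecutor worker thread, instead of counting all frames.
// The default StackWalker suffices: no RETAIN_CLASS_REFERENCE (and hence no
// extra permission) is needed, because only class/method names are inspected.
class ExecutorThreadCheck {
  private static final String THREAD_POOL_EXECUTOR_WORKER_CLASS =
      "java.util.concurrent.ThreadPoolExecutor";
  private static final String RUN_METHOD = "run";

  static boolean onThreadPoolExecutorThread() {
    return StackWalker.getInstance()
        .walk(
            frames ->
                frames.anyMatch(
                    frame ->
                        frame.getClassName().contains(THREAD_POOL_EXECUTOR_WORKER_CLASS)
                            && frame.getMethodName().contains(RUN_METHOD)));
  }
}
```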






[GitHub] [lucene] uschindler commented on a diff in pull request #12183: Make TermStates#build concurrent

2023-09-13 Thread via GitHub


uschindler commented on code in PR #12183:
URL: https://github.com/apache/lucene/pull/12183#discussion_r1325066260


##
lucene/core/src/java/org/apache/lucene/index/TermStates.java:
##
@@ -86,19 +93,58 @@ public TermStates(
* @param needsStats if {@code true} then all leaf contexts will be visited 
up-front to collect
* term statistics. Otherwise, the {@link TermState} objects will be 
built only when requested
*/
-  public static TermStates build(IndexReaderContext context, Term term, 
boolean needsStats)
+  public static TermStates build(IndexSearcher indexSearcher, Term term, 
boolean needsStats)
   throws IOException {
-assert context != null && context.isTopLevel;
+IndexReaderContext context = indexSearcher.getTopReaderContext();
+assert context != null;
 final TermStates perReaderTermState = new TermStates(needsStats ? null : 
term, context);
 if (needsStats) {
-  for (final LeafReaderContext ctx : context.leaves()) {
-// if (DEBUG) System.out.println("  r=" + leaves[i].reader);
-TermsEnum termsEnum = loadTermsEnum(ctx, term);
-if (termsEnum != null) {
-  final TermState termState = termsEnum.termState();
-  // if (DEBUG) System.out.println("found");
-  perReaderTermState.register(
-  termState, ctx.ord, termsEnum.docFreq(), 
termsEnum.totalTermFreq());
+  Executor executor = indexSearcher.getExecutor();
+  if (executor == null) {
+executor = Runnable::run;
+  }
+  List<FutureTask<TermStateInfo>> tasks =
+  context.leaves().stream()
+  .map(
+  ctx ->
+  new FutureTask<>(
+  () -> {
+TermsEnum termsEnum = loadTermsEnum(ctx, term);
+if (termsEnum != null) {
+  return new TermStateInfo(
+  termsEnum.termState(),
+  ctx.ord,
+  termsEnum.docFreq(),
+  termsEnum.totalTermFreq());
+}
+return null;
+  }))
+  .toList();
+  for (FutureTask<TermStateInfo> task : tasks) {
+if (executor instanceof ThreadPoolExecutor pool) {
+  if ((pool.getCorePoolSize() - pool.getActiveCount()) <= 1) {
+task.run();

Review Comment:
   Why do you need to check hidden frames?






[GitHub] [lucene] shubhamvishu commented on a diff in pull request #12183: Make TermStates#build concurrent

2023-09-13 Thread via GitHub


shubhamvishu commented on code in PR #12183:
URL: https://github.com/apache/lucene/pull/12183#discussion_r1325085377


##
lucene/core/src/java/org/apache/lucene/index/TermStates.java:
##
@@ -86,19 +93,58 @@ public TermStates(
* @param needsStats if {@code true} then all leaf contexts will be visited 
up-front to collect
* term statistics. Otherwise, the {@link TermState} objects will be 
built only when requested
*/
-  public static TermStates build(IndexReaderContext context, Term term, 
boolean needsStats)
+  public static TermStates build(IndexSearcher indexSearcher, Term term, 
boolean needsStats)
   throws IOException {
-assert context != null && context.isTopLevel;
+IndexReaderContext context = indexSearcher.getTopReaderContext();
+assert context != null;
 final TermStates perReaderTermState = new TermStates(needsStats ? null : 
term, context);
 if (needsStats) {
-  for (final LeafReaderContext ctx : context.leaves()) {
-// if (DEBUG) System.out.println("  r=" + leaves[i].reader);
-TermsEnum termsEnum = loadTermsEnum(ctx, term);
-if (termsEnum != null) {
-  final TermState termState = termsEnum.termState();
-  // if (DEBUG) System.out.println("found");
-  perReaderTermState.register(
-  termState, ctx.ord, termsEnum.docFreq(), 
termsEnum.totalTermFreq());
+  Executor executor = indexSearcher.getExecutor();
+  if (executor == null) {
+executor = Runnable::run;
+  }
+  List<FutureTask<TermStateInfo>> tasks =
+  context.leaves().stream()
+  .map(
+  ctx ->
+  new FutureTask<>(
+  () -> {
+TermsEnum termsEnum = loadTermsEnum(ctx, term);
+if (termsEnum != null) {
+  return new TermStateInfo(
+  termsEnum.termState(),
+  ctx.ord,
+  termsEnum.docFreq(),
+  termsEnum.totalTermFreq());
+}
+return null;
+  }))
+  .toList();
+  for (FutureTask<TermStateInfo> task : tasks) {
+if (executor instanceof ThreadPoolExecutor pool) {
+  if ((pool.getCorePoolSize() - pool.getActiveCount()) <= 1) {
+task.run();

Review Comment:
   > Instead of counting all possible frames (it's too expensive), use 
anyMatch(frame -> 
frame.getClassName().contains(THREAD_POOL_EXECUTOR_WORKER_CLASS) && 
frame.getMethodName().contains(RUN_METHOD)) to just find the first one.
   
   Good point. I'll change it in next revision.
   
   > Why do you need to check hidden frames?
   
   You mean we don't need `StackWalker.Option.SHOW_HIDDEN_FRAMES` and can 
instead use `StackWalker.Option.SHOW_REFLECT_FRAMES`, because the call we are 
trying to find won't be a hidden frame, so the other option would suffice for 
our use case? 






[GitHub] [lucene] uschindler commented on a diff in pull request #12183: Make TermStates#build concurrent

2023-09-13 Thread via GitHub


uschindler commented on code in PR #12183:
URL: https://github.com/apache/lucene/pull/12183#discussion_r1325092791


##
lucene/core/src/java/org/apache/lucene/index/TermStates.java:
##
@@ -86,19 +93,58 @@ public TermStates(
* @param needsStats if {@code true} then all leaf contexts will be visited 
up-front to collect
* term statistics. Otherwise, the {@link TermState} objects will be 
built only when requested
*/
-  public static TermStates build(IndexReaderContext context, Term term, 
boolean needsStats)
+  public static TermStates build(IndexSearcher indexSearcher, Term term, 
boolean needsStats)
   throws IOException {
-assert context != null && context.isTopLevel;
+IndexReaderContext context = indexSearcher.getTopReaderContext();
+assert context != null;
 final TermStates perReaderTermState = new TermStates(needsStats ? null : 
term, context);
 if (needsStats) {
-  for (final LeafReaderContext ctx : context.leaves()) {
-// if (DEBUG) System.out.println("  r=" + leaves[i].reader);
-TermsEnum termsEnum = loadTermsEnum(ctx, term);
-if (termsEnum != null) {
-  final TermState termState = termsEnum.termState();
-  // if (DEBUG) System.out.println("found");
-  perReaderTermState.register(
-  termState, ctx.ord, termsEnum.docFreq(), 
termsEnum.totalTermFreq());
+  Executor executor = indexSearcher.getExecutor();
+  if (executor == null) {
+executor = Runnable::run;
+  }
+  List<FutureTask<TermStateInfo>> tasks =
+  context.leaves().stream()
+  .map(
+  ctx ->
+  new FutureTask<>(
+  () -> {
+TermsEnum termsEnum = loadTermsEnum(ctx, term);
+if (termsEnum != null) {
+  return new TermStateInfo(
+  termsEnum.termState(),
+  ctx.ord,
+  termsEnum.docFreq(),
+  termsEnum.totalTermFreq());
+}
+return null;
+  }))
+  .toList();
+  for (FutureTask<TermStateInfo> task : tasks) {
+if (executor instanceof ThreadPoolExecutor pool) {
+  if ((pool.getCorePoolSize() - pool.getActiveCount()) <= 1) {
+task.run();

Review Comment:
   hidden frames are fine, I just wanted to know why you want to see hidden 
frames at all? Is it because you would otherwise not see frames like lambdas 
generated to implement runnables?






[GitHub] [lucene] uschindler commented on a diff in pull request #12183: Make TermStates#build concurrent

2023-09-13 Thread via GitHub


uschindler commented on code in PR #12183:
URL: https://github.com/apache/lucene/pull/12183#discussion_r1325097710


##
lucene/core/src/java/org/apache/lucene/index/TermStates.java:
##
@@ -86,19 +93,58 @@ public TermStates(
* @param needsStats if {@code true} then all leaf contexts will be visited 
up-front to collect
* term statistics. Otherwise, the {@link TermState} objects will be 
built only when requested
*/
-  public static TermStates build(IndexReaderContext context, Term term, 
boolean needsStats)
+  public static TermStates build(IndexSearcher indexSearcher, Term term, 
boolean needsStats)
   throws IOException {
-assert context != null && context.isTopLevel;
+IndexReaderContext context = indexSearcher.getTopReaderContext();
+assert context != null;
 final TermStates perReaderTermState = new TermStates(needsStats ? null : 
term, context);
 if (needsStats) {
-  for (final LeafReaderContext ctx : context.leaves()) {
-// if (DEBUG) System.out.println("  r=" + leaves[i].reader);
-TermsEnum termsEnum = loadTermsEnum(ctx, term);
-if (termsEnum != null) {
-  final TermState termState = termsEnum.termState();
-  // if (DEBUG) System.out.println("found");
-  perReaderTermState.register(
-  termState, ctx.ord, termsEnum.docFreq(), 
termsEnum.totalTermFreq());
+  Executor executor = indexSearcher.getExecutor();
+  if (executor == null) {
+executor = Runnable::run;
+  }
+  List<FutureTask<TermStateInfo>> tasks =
+  context.leaves().stream()
+  .map(
+  ctx ->
+  new FutureTask<>(
+  () -> {
+TermsEnum termsEnum = loadTermsEnum(ctx, term);
+if (termsEnum != null) {
+  return new TermStateInfo(
+  termsEnum.termState(),
+  ctx.ord,
+  termsEnum.docFreq(),
+  termsEnum.totalTermFreq());
+}
+return null;
+  }))
+  .toList();
+  for (FutureTask<TermStateInfo> task : tasks) {
+if (executor instanceof ThreadPoolExecutor pool) {
+  if ((pool.getCorePoolSize() - pool.getActiveCount()) <= 1) {
+task.run();

Review Comment:
   As you do not retain class references, the SecurityException can never happen: 
   
   > Returns a StackWalker instance with the given option specifying the stack 
frame information it can access.
   > If a security manager is present and the given option is 
[Option.RETAIN_CLASS_REFERENCE](https://docs.oracle.com/javase%2F9%2Fdocs%2Fapi%2F%2F/java/lang/StackWalker.Option.html#RETAIN_CLASS_REFERENCE),
 it calls its 
[checkPermission](https://docs.oracle.com/javase%2F9%2Fdocs%2Fapi%2F%2F/java/lang/SecurityManager.html#checkPermission-java.security.Permission-)
 method for RuntimePermission("getStackWalkerWithClassReference").






[GitHub] [lucene] shubhamvishu commented on a diff in pull request #12183: Make TermStates#build concurrent

2023-09-13 Thread via GitHub


shubhamvishu commented on code in PR #12183:
URL: https://github.com/apache/lucene/pull/12183#discussion_r1325111790


##
lucene/core/src/java/org/apache/lucene/index/TermStates.java:
##
@@ -86,19 +93,58 @@ public TermStates(
* @param needsStats if {@code true} then all leaf contexts will be visited 
up-front to collect
* term statistics. Otherwise, the {@link TermState} objects will be 
built only when requested
*/
-  public static TermStates build(IndexReaderContext context, Term term, 
boolean needsStats)
+  public static TermStates build(IndexSearcher indexSearcher, Term term, 
boolean needsStats)
   throws IOException {
-assert context != null && context.isTopLevel;
+IndexReaderContext context = indexSearcher.getTopReaderContext();
+assert context != null;
 final TermStates perReaderTermState = new TermStates(needsStats ? null : 
term, context);
 if (needsStats) {
-  for (final LeafReaderContext ctx : context.leaves()) {
-// if (DEBUG) System.out.println("  r=" + leaves[i].reader);
-TermsEnum termsEnum = loadTermsEnum(ctx, term);
-if (termsEnum != null) {
-  final TermState termState = termsEnum.termState();
-  // if (DEBUG) System.out.println("found");
-  perReaderTermState.register(
-  termState, ctx.ord, termsEnum.docFreq(), 
termsEnum.totalTermFreq());
+  Executor executor = indexSearcher.getExecutor();
+  if (executor == null) {
+executor = Runnable::run;
+  }
+  List<FutureTask<TermStateInfo>> tasks =
+  context.leaves().stream()
+  .map(
+  ctx ->
+  new FutureTask<>(
+  () -> {
+TermsEnum termsEnum = loadTermsEnum(ctx, term);
+if (termsEnum != null) {
+  return new TermStateInfo(
+  termsEnum.termState(),
+  ctx.ord,
+  termsEnum.docFreq(),
+  termsEnum.totalTermFreq());
+}
+return null;
+  }))
+  .toList();
+  for (FutureTask<TermStateInfo> task : tasks) {
+if (executor instanceof ThreadPoolExecutor pool) {
+  if ((pool.getCorePoolSize() - pool.getActiveCount()) <= 1) {
+task.run();

Review Comment:
   > hidden frames are fine, I just wanted to know why you want to see hidden 
frames at all? Is it because you would otherwise not see frames like lambdas 
generated to implement runnables?
   
   Yes, I chose that thinking there might be such hidden calls in the stack as 
well. I have retained it now (I had removed it thinking you were recommending 
the other one).
   
   > As you not retain class references, the SecuirtyException can never happen:
   
   Got it. Thanks for pointing Uwe! 😀



##
lucene/core/src/java/org/apache/lucene/index/TermStates.java:
##
@@ -86,19 +93,58 @@ public TermStates(
* @param needsStats if {@code true} then all leaf contexts will be visited 
up-front to collect
* term statistics. Otherwise, the {@link TermState} objects will be 
built only when requested
*/
-  public static TermStates build(IndexReaderContext context, Term term, 
boolean needsStats)
+  public static TermStates build(IndexSearcher indexSearcher, Term term, 
boolean needsStats)
   throws IOException {
-assert context != null && context.isTopLevel;
+IndexReaderContext context = indexSearcher.getTopReaderContext();
+assert context != null;
 final TermStates perReaderTermState = new TermStates(needsStats ? null : 
term, context);
 if (needsStats) {
-  for (final LeafReaderContext ctx : context.leaves()) {
-// if (DEBUG) System.out.println("  r=" + leaves[i].reader);
-TermsEnum termsEnum = loadTermsEnum(ctx, term);
-if (termsEnum != null) {
-  final TermState termState = termsEnum.termState();
-  // if (DEBUG) System.out.println("found");
-  perReaderTermState.register(
-  termState, ctx.ord, termsEnum.docFreq(), 
termsEnum.totalTermFreq());
+  Executor executor = indexSearcher.getExecutor();
+  if (executor == null) {
+executor = Runnable::run;
+  }
+  List<FutureTask<TermStateInfo>> tasks =
+  context.leaves().stream()
+  .map(
+  ctx ->
+  new FutureTask<>(
+  () -> {
+TermsEnum termsEnum = loadTermsEnum(ctx, term);
+if (termsEnum != null) {
+  return new TermStateInfo(
+  termsEnum.termState(),
+  ctx.ord,
+  termsEnum.docFreq(),
+  termsEnum.totalTermFreq());
+   

[GitHub] [lucene] jpountz commented on pull request #12526: Speed up disjunctions by computing estimations of the score of the k-th top hit up-front.

2023-09-14 Thread via GitHub


jpountz commented on PR #12526:
URL: https://github.com/apache/lucene/pull/12526#issuecomment-1718893926

   FYI there was an interesting observation on another benchmark that took 
advantage of recursive graph bisection: 
https://jpountz.github.io/lucene-9.7-vs-9.8/. One query (`the incredibles`) 
became more than 7x (!) slower because recursive graph bisection had moved 
matches of the term with the highest score weight towards the end of the doc ID 
space. This should get addressed by a change like this PR.





[GitHub] [lucene] gokaai opened a new pull request, #12554: Allow FilteredDocIdSetIterator.match(doc) to throw IOException

2023-09-14 Thread via GitHub


gokaai opened a new pull request, #12554:
URL: https://github.com/apache/lucene/pull/12554

   ### Description
   
   Allows `org.apache.lucene.search.FilteredDocIdSetIterator#match(doc)` to 
throw an IOException so that users don't have to explicitly catch it
   
   Closes #12492 





[GitHub] [lucene] mikemccand commented on a diff in pull request #12552: Make FSTPostingsFormat load FSTs off-heap

2023-09-14 Thread via GitHub


mikemccand commented on code in PR #12552:
URL: https://github.com/apache/lucene/pull/12552#discussion_r1325827523


##
lucene/codecs/src/java/org/apache/lucene/codecs/memory/FSTTermsReader.java:
##
@@ -191,7 +193,9 @@ final class TermsReader extends Terms {
   this.sumTotalTermFreq = sumTotalTermFreq;
   this.sumDocFreq = sumDocFreq;
   this.docCount = docCount;
-  this.dict = new FST<>(in, in, new FSTTermOutputs(fieldInfo));
+  OffHeapFSTStore offHeapFSTStore = new OffHeapFSTStore();
+  this.dict = new FST<>(in, in, new FSTTermOutputs(fieldInfo), 
offHeapFSTStore);
+  in.skipBytes(offHeapFSTStore.size());

Review Comment:
   I was worried at first that we are not cloning this `IndexInput` anywhere 
and that this would cause concurrency bugs when two queries pull a `TermsEnum` 
here, but we are OK because `OffHeapFSTStore` does this cloning when it [pulls 
a random access slice from the 
`IndexInput`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/fst/OffHeapFSTStore.java#L66).






[GitHub] [lucene] mikemccand commented on pull request #12552: Make FSTPostingsFormat load FSTs off-heap

2023-09-14 Thread via GitHub


mikemccand commented on PR #12552:
URL: https://github.com/apache/lucene/pull/12552#issuecomment-1719297920

   @Tony-X have you tried passing all Lucene unit tests using this Codec?  I 
think you can add `-Dtests.codec=...` to force all tests to use it.





[GitHub] [lucene] jpountz commented on pull request #12554: Allow FilteredDocIdSetIterator.match(doc) to throw IOException

2023-09-14 Thread via GitHub


jpountz commented on PR #12554:
URL: https://github.com/apache/lucene/pull/12554#issuecomment-1719334101

   Looks great, can you add a CHANGES entry under "Lucene 9.8.0" / "API 
Changes"?





[GitHub] [lucene] jimczi commented on pull request #12551: Introduce dynamic segment efSearch to Knn{Byte|Float}VectorQuery

2023-09-14 Thread via GitHub


jimczi commented on PR #12551:
URL: https://github.com/apache/lucene/pull/12551#issuecomment-1719529457

   I made some adjustments to the formula to consider the logarithmic 
complexity of the greedy search. I conducted tests on two datasets: 
   
   1. The standard SIFT dataset, which has 128 dimensions and 1 million 
documents.
   2. The Cohere Wikipedia dataset, where I used the first 1 million documents 
and randomly selected 100 queries from the rest of the dataset.
   
   I compared the performance between two different indexing strategies:
   
   - One using forced merging with a single segment.
   - Another using the default merging strategy, starting with a writer buffer 
size of 1000 documents.
   
   SIFT results:
   
   | Method | k | ef | recall | # comparisons |
   | - | - | - | - | - |
   | Single Segment | 10 | 100 | 0.973000 | 1361 |
   | Multiple Segments | 10 | 100 | 0.999000 | 27278 |
   | Multiple Segments with Dynamic Strategy | 10 | 100 | 0.987000 | 10137 |
   | Single Segment | 100 | 1000 | 0.987700 | 7926 |
   | Multiple Segments | 100 | 1000 | 0.999500 | 146868 |
   | Multiple Segments with Dynamic Strategy | 100 | 1000 | 0.999200 | 58434 |
   | Single Segment | 1000 | 10000 | 0.977250 | 63692 |
   | Multiple Segments | 1000 | 10000 | 0.999080 | 526198 |
   | Multiple Segments with Dynamic Strategy | 1000 | 10000 | 0.998770 | 292739 |
   
   Cohere results:
   
   | Method | k | ef | recall | # comparisons |
   | - | - | - | - | - |
   | Single Segment | 10 | 100 | 0.947000 | 1920 |
   | Multiple Segments | 10 | 100 | 0.996000 | 45096 |
   | Multiple Segments with Dynamic Strategy | 10 | 100 | 0.980000 | 18565 |
   | Single Segment | 100 | 1000 | 0.977800 | 12848 |
   | Multiple Segments | 100 | 1000 | 0.999200 | 199685 |
   | Multiple Segments with Dynamic Strategy | 100 | 1000 | 0.997900 | 96321 |
   | Single Segment | 1000 | 10000 | 0.992530 | 81033 |
   | Multiple Segments | 1000 | 10000 | 0.999900 | 584438 |
   | Multiple Segments with Dynamic Strategy | 1000 | 10000 | 0.999700 | 407804 |
   
   We already knew that searching across multiple segments improves recall but 
comes with a higher computational cost. For example, with 37 segments, 
searching with a queue size of 100 requires 23 times more vector comparisons 
than the single-segment version on the Cohere dataset. While the recall is 
significantly better, it may be challenging for users to strike the right 
balance between performance and recall since the number of segments can vary 
considerably during indexing.
   
   This is where using the segment size to determine the optimal queue size 
seems like a reasonable compromise. For the Cohere dataset, it reduced the 
number of comparisons by a factor of 2.5 (for the 10-100 case) while still 
maintaining better recall than the single-segment case. Similar improvements 
were observed with the SIFT dataset.
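The exact formula is not shown in the thread, but the idea of deriving each segment's queue size from the logarithmic complexity of the greedy search can be sketched roughly as follows (the function name, clamping, and scaling are assumptions for illustration, not the PR's actual code):

```python
import math


def dynamic_ef(k: int, ef_search: int, segment_size: int, total_size: int) -> int:
    """Illustrative only: scale the per-segment queue size between k and
    ef_search by the log-proportional share of the greedy search cost,
    so small segments use a smaller queue than the largest ones."""
    if segment_size <= 0 or total_size <= 0:
        return k
    ratio = math.log(1 + segment_size) / math.log(1 + total_size)
    return max(k, min(ef_search, round(ef_search * ratio)))


# A 1M-doc index: tiny segments fall back toward k, while the biggest
# segment keeps the full queue size.
for size in (1_000, 50_000, 500_000, 1_000_000):
    print(size, dynamic_ef(k=10, ef_search=100, segment_size=size,
                           total_size=1_000_000))
```

With this kind of scaling, the total work grows much more slowly with the segment count than applying the full efSearch to every segment, which is consistent with the comparison counts reported above.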
   
   
   
   





[GitHub] [lucene] epotyom opened a new pull request, #12555: Fix: Lucene90DocValuesProducer.TermsDict.seekCeil doesn't always position bytes correctly (#12167)

2023-09-14 Thread via GitHub


epotyom opened a new pull request, #12555:
URL: https://github.com/apache/lucene/pull/12555

   Fix: Lucene90DocValuesProducer.TermsDict.seekCeil doesn't always position 
bytes correctly (#12167)
   
   TermsDict `ord` and `bytes` can be out of sync after a call to seekCeil, 
which caused test failures in 
https://github.com/apache/lucene/issues/12167
   
   TODO: write a test





[GitHub] [lucene] jpountz merged pull request #12554: Allow FilteredDocIdSetIterator.match(doc) to throw IOException

2023-09-14 Thread via GitHub


jpountz merged PR #12554:
URL: https://github.com/apache/lucene/pull/12554





[GitHub] [lucene] jpountz closed issue #12492: Allow FilteredDocIdSetIterator.match(doc) to throw IOException

2023-09-14 Thread via GitHub


jpountz closed issue #12492: Allow FilteredDocIdSetIterator.match(doc) to throw 
IOException
URL: https://github.com/apache/lucene/issues/12492





[GitHub] [lucene] jpountz merged pull request #12489: Add support for recursive graph bisection.

2023-09-14 Thread via GitHub


jpountz merged PR #12489:
URL: https://github.com/apache/lucene/pull/12489





[GitHub] [lucene] jpountz commented on pull request #12489: Add support for recursive graph bisection.

2023-09-14 Thread via GitHub


jpountz commented on PR #12489:
URL: https://github.com/apache/lucene/pull/12489#issuecomment-1719763923

   Since it's fairly unintrusive to other functionality, I felt free to merge.








[GitHub] [lucene] Tony-X commented on pull request #12552: Make FSTPostingsFormat load FSTs off-heap

2023-09-14 Thread via GitHub


Tony-X commented on PR #12552:
URL: https://github.com/apache/lucene/pull/12552#issuecomment-1719878383

   @mikemccand hey Mike, I did not make a new Codec for this. IIRC, 
`FSTPostingsFormat` will be exercised by the RandomCodec. Also, there is 
`TestFSTPostingsFormat extends BasePostingsFormatTestCase`, which tests this 
PostingsFormat.
   
   Oh wait, I see that the test target supports `-Dtests.postingsformat` to 
override the postings format. So I ran `./gradlew test 
-Dtests.postingsformat=FST50` and all tests passed.
   
   Since the successful output doesn't provide randomization info, I also ran 
with a non-existent postings format `./gradlew test 
-Dtests.postingsformat=FST5x` and it failed, which means the override is 
happening.





[GitHub] [lucene] epotyom commented on pull request #12555: Fix: Lucene90DocValuesProducer.TermsDict.seekCeil doesn't always position bytes correctly (#12167)

2023-09-14 Thread via GitHub


epotyom commented on PR #12555:
URL: https://github.com/apache/lucene/pull/12555#issuecomment-1719935323

   Extended existing nightly random tests to catch the issue most of the time. 
Would that be enough or do we need a test that catches it every single time?





[GitHub] [lucene] benwtrent commented on pull request #12551: Introduce dynamic segment efSearch to Knn{Byte|Float}VectorQuery

2023-09-14 Thread via GitHub


benwtrent commented on PR #12551:
URL: https://github.com/apache/lucene/pull/12551#issuecomment-1720048714

   @jimczi I like this idea at first glance, but I have one major concern. 
   
   What about data that is indexed according to a specific order? Two tests to 
verify how this behaves would be:
   
- Index the cohere data sorted by magnitude
- Cluster the data and index sorted by each assigned cluster
   
   While typical datasets should be randomly distributed, any option we provide 
here should be escapable or at least able to handle those situations 
(potentially by just increasing efSearch as a whole).





[GitHub] [lucene] jimczi commented on pull request #12551: Introduce dynamic segment efSearch to Knn{Byte|Float}VectorQuery

2023-09-14 Thread via GitHub


jimczi commented on PR #12551:
URL: https://github.com/apache/lucene/pull/12551#issuecomment-1720078533

   Adding some charts together to compare how effective it is to use a dynamic 
efSearch.
   
   The first chart shows how well different efSearch values work on one 
segment, on multiple segments with a fixed efSearch, and on multiple segments 
with a dynamic efSearch that changes based on segment size.
   
   The second chart compares how many comparisons are needed for each efSearch 
value, with the single-segment case as the baseline. For example, at efSearch 
10, using the same value for all segments requires more than 16 times as many 
comparisons as searching a single segment.
   
   ## SIFT (10-200)
   
   
![sift_10_200](https://github.com/apache/lucene/assets/15977469/6a9c14b3-3afa-4b5e-b0b4-bdd48e74bcab)
   
   ## SIFT (100-2000)
   
   
![sift_100_2000](https://github.com/apache/lucene/assets/15977469/01d2f678-17ae-4d04-8201-3127bd71df69)
   
   ## Cohere (10-200)
   
   
![cohere_10_200](https://github.com/apache/lucene/assets/15977469/94acb49b-30df-4f35-8024-0614ba63d177)
   
   ## Cohere (100-2000)
   
![cohere_100_2000](https://github.com/apache/lucene/assets/15977469/232bac74-c3f4-40c4-8628-40e2b26b277b)
   
   
   





[GitHub] [lucene] zhaih commented on a diff in pull request #12555: Fix: Lucene90DocValuesProducer.TermsDict.seekCeil doesn't always position bytes correctly (#12167)

2023-09-14 Thread via GitHub


zhaih commented on code in PR #12555:
URL: https://github.com/apache/lucene/pull/12555#discussion_r1326538550


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer.java:
##
@@ -1205,7 +1205,15 @@ public SeekStatus seekCeil(BytesRef text) throws 
IOException {
   ord = 0;
   return SeekStatus.END;
 } else {
-  seekExact(0L);
+  // seekBlock doesn't update ord and it repositions bytes when calls 
getFirstTermFromBlock

Review Comment:
   So, I think the ultimate reason might be that previously, when we called 
`seekExact(0)`, this if condition was not true, and thus `bytes` and `ord` 
were not reset correctly.
   ```
 if (ord < this.ord || blockIndex != currentBlockIndex) {
   // The looked up ord is before the current ord or belongs to a 
different block, seek again
   final long blockAddress = blockAddresses.get(blockIndex);
   bytes.seek(blockAddress);
   this.ord = (blockIndex << TERMS_DICT_BLOCK_LZ4_SHIFT) - 1;
 }
   ```
   So I would prefer we do
   ```
   ord = 1; // Probably need to add some comments on why we do this weird thing
   seekExact(0L);
   ```
   That way we're not duplicating the seek logic manually here, in case we 
change the seek behavior in the future.
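The effect of that `ord = 1` trick can be seen in a toy model of the guarded reseek (plain Python invented for illustration, not the Lucene code):

```python
class ToyTermsDict:
    """Models the guarded reseek: `ord` is the cached term ordinal and
    `pos` models the underlying `bytes` position."""

    def __init__(self):
        self.ord = 0
        self.pos = 0

    def seek_block(self, block_pos: int) -> None:
        # Like seekBlock(): repositions bytes but does NOT update ord.
        self.pos = block_pos

    def seek_exact(self, target_ord: int) -> None:
        # Simplified guard: only reseek bytes when the target is before
        # the cached ord (the block-index check is elided in this toy).
        if target_ord < self.ord:
            self.pos = target_ord
        self.ord = target_ord


# The bug: after seek_block, the cached ord (0) is stale, so seek_exact(0)
# skips the reseek and leaves pos out of sync with ord.
d = ToyTermsDict()
d.seek_block(40)
d.seek_exact(0)
assert d.pos == 40  # bytes never repositioned

# The fix: poison the cached ord first so the guard must fire.
d2 = ToyTermsDict()
d2.seek_block(40)
d2.ord = 1
d2.seek_exact(0)
assert d2.pos == 0 and d2.ord == 0  # back in sync
```

Setting the cached ord to a value past the target forces the "target is before current" branch, so the reseek happens regardless of how the bytes position was moved behind the cache's back.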





