Re: [PR] TestIndexWriterOnVMError.testUnknownError times out (fixes potential IW deadlock on tragic exceptions). [lucene]

2023-11-08 Thread via GitHub


dweiss merged PR #12751:
URL: https://github.com/apache/lucene/pull/12751


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] TestIndexWriterOnVMError.testUnknownError times out (potential IndexWriter deadlock with tragic exceptions) [lucene]

2023-11-08 Thread via GitHub


dweiss closed issue #12654: TestIndexWriterOnVMError.testUnknownError times out 
(potential IndexWriter deadlock with tragic exceptions)
URL: https://github.com/apache/lucene/issues/12654





Re: [PR] Introduce the similarity as boost functionality to the Word2VecSynonyFilter [lucene]

2023-11-08 Thread via GitHub


dantuzi commented on PR #12433:
URL: https://github.com/apache/lucene/pull/12433#issuecomment-1801394824

   Thanks @mikemccand for your feedback.
   I had to address some comments from @alessandrobenedetti; that's why this PR is still WIP.
   At the moment I have other priorities at work, but I'll resolve all conflicts and address your comments in the next few days.





Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]

2023-11-08 Thread via GitHub


mikemccand commented on code in PR #240:
URL: https://github.com/apache/lucene/pull/240#discussion_r1386407230


##
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##
@@ -602,34 +579,10 @@ private TopFieldDocs searchAfter(
 final Sort rewrittenSort = sort.rewrite(this);
 final LeafSlice[] leafSlices = getSlices();
 
+final boolean supportsConcurrency = leafSlices.length > 1;
 final CollectorManager manager =
-new CollectorManager<>() {
-
-  private final HitsThresholdChecker hitsThresholdChecker =
-  leafSlices.length <= 1
-  ? HitsThresholdChecker.create(Math.max(TOTAL_HITS_THRESHOLD, numHits))
-  : HitsThresholdChecker.createShared(Math.max(TOTAL_HITS_THRESHOLD, numHits));
-
-  private final MaxScoreAccumulator minScoreAcc =
-  leafSlices.length <= 1 ? null : new MaxScoreAccumulator();
-
-  @Override
-  public TopFieldCollector newCollector() throws IOException {
-// TODO: don't pay the price for accurate hit counts by default
-return TopFieldCollector.create(
-rewrittenSort, cappedNumHits, after, hitsThresholdChecker, minScoreAcc);
-  }
-
-  @Override
-  public TopFieldDocs reduce(Collection collectors) throws IOException {
-final TopFieldDocs[] topDocs = new TopFieldDocs[collectors.size()];
-int i = 0;
-for (TopFieldCollector collector : collectors) {
-  topDocs[i++] = collector.topDocs();
-}
-return TopDocs.merge(rewrittenSort, 0, cappedNumHits, topDocs);
-  }
-};
+new TopFieldCollectorManager(

Review Comment:
   Wonderful to see all this redundant code go away!



##
lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java:
##
@@ -429,106 +432,29 @@ public static TopFieldCollector create(Sort sort, int numHits, int totalHitsThre
* field is indexed both with doc values and points. In this case, there is an assumption that
* the same data is stored in these points and doc values.
* @return a {@link TopFieldCollector} instance which will sort the results by the sort criteria.
+   * @deprecated This method is being deprecated in favor of using the constructor of {@link

Review Comment:
   Remove `being` and `using`?  Makes the language more punchy (avoids passive 
voice).



##
lucene/core/src/java/org/apache/lucene/search/TopDocs.java:
##
@@ -232,8 +232,8 @@ public static TopDocs merge(
   /**
* Returns a new TopFieldDocs, containing topN results across the provided TopFieldDocs, sorting
* by the specified {@link Sort}. Each of the TopDocs must have been sorted by the same Sort, and
-   * sort field values must have been filled (ie, fillFields=true must be passed to
-   * {@link TopFieldCollector#create}).
+   * sort field values must have been filled (ie, fillFields=true must be passed to the
+   * constructor of {@link TopFieldCollectorManager}).

Review Comment:
   Hmm I don't think I see a `boolean fillFields` parameter to 
`TopFieldCollectorManager`?



##
lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java:
##
@@ -44,7 +43,7 @@ public void setScorer(Scorable scorer) throws IOException {
 }

Review Comment:
   Should we add a comment to the javadoc for this class that one should use 
`TopScoreDocCollectorManager` instead of the static factory methods here?  Is 
it too soon to deprecate this class (so we can make it package private 
eventually)?



##
lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/SearchWithCollectorTask.java:
##
@@ -45,20 +43,6 @@ public boolean withCollector() {
 return true;
   }
 
-  @Override
-  protected Collector createCollector() throws Exception {

Review Comment:
   Hmm, the purpose of this benchmark component is explicitly to test one's own `Collector` impl -- which I think is no longer possible with this diff (neutering it).  Maybe we can move this to a `SearchWithCollectorManagerTask` instead?  Or we could defer this to a later PR -- this one is already great.  Progress not perfection (PNP)!



##
lucene/core/src/test/org/apache/lucene/document/BaseSpatialTestCase.java:
##
@@ -695,8 +695,8 @@ protected void verifyRandomDistanceQueries(IndexReader 
reader, Object... shapes)
 }
   }
 
-  private FixedBitSet searchIndex(IndexSearcher s, Query query, int maxDoc) throws IOException {
-return s.search(query, FixedBitSetCollector.createManager(maxDoc));
+  private CollectorManager searchIndex(int maxDoc) {

Review Comment:
   Hmm rename method to `createCollectorManager` or so?  It's no longer 
searching.



##
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##
@@ -527,7 +503,10 @@ public TopDocs search(Query query, int n) throws IOException {
*
 

Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]

2023-11-08 Thread via GitHub


mikemccand commented on PR #240:
URL: https://github.com/apache/lucene/pull/240#issuecomment-1801638506

   > We are still a ways away
   
   For example: query concurrency is still tied to segment geometry, which is insane.  An "optimized" (`forceMerge` to 1 segment) index loses all of its concurrency!  Hideously, on my team (Amazon product search) we had to create a neutered merge policy that intentionally avoids "accidentally" making a lopsided index (where too many docs are in a single segment) because it harms our long-pole latencies, due to this silly Lucene limitation.  We have a longstanding issue open for this: https://github.com/apache/lucene/issues/9721, and it ought not be so difficult because Lucene already has strong support for one thread searching a "slice" of a big segment, i.e. we can make the per-thread work units slices of segments instead of whole segments or collections of segments (what IndexSearcher now calls slices, confusingly).
   





[PR] Re-adding the backward_codecs.lucene90 TestPForUtil + TestForUtil [lucene]

2023-11-08 Thread via GitHub


slow-J opened a new pull request, #12781:
URL: https://github.com/apache/lucene/pull/12781

   Clean-up from adding the Lucene99PostingsFormat in https://github.com/apache/lucene/pull/12741
   
   These test cases were moved to the Lucene99 test directory, and I forgot to copy the unmodified versions for backward_codecs.lucene90.





Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]

2023-11-08 Thread via GitHub


mikemccand commented on code in PR #12778:
URL: https://github.com/apache/lucene/pull/12778#discussion_r1386504215


##
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##
@@ -289,21 +273,34 @@ public long getNodeAddress(long hashSlot) {
 }
 
 /**
- * Set the node address and bytes from the provided hash slot
+ * Set the node address from the provided hash slot
  *
  * @param hashSlot the hash slot to write to
  * @param nodeAddress the node address
- * @param bytes the node bytes to be copied
  */
-public void setNode(long hashSlot, long nodeAddress, byte[] bytes) {
+public void setNodeAddress(long hashSlot, long nodeAddress) {
   assert fstNodeAddress.get(hashSlot) == 0;
   fstNodeAddress.set(hashSlot, nodeAddress);
   count++;
+}
 
+private void setBytes(long hashSlot, byte[] bytes) {

Review Comment:
   Can we rename to maybe `copyNodeBytes` or so?



##
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##
@@ -289,21 +273,34 @@ public long getNodeAddress(long hashSlot) {
 }
 
 /**
- * Set the node address and bytes from the provided hash slot
+ * Set the node address from the provided hash slot
  *
  * @param hashSlot the hash slot to write to
  * @param nodeAddress the node address
- * @param bytes the node bytes to be copied
  */
-public void setNode(long hashSlot, long nodeAddress, byte[] bytes) {
+public void setNodeAddress(long hashSlot, long nodeAddress) {
   assert fstNodeAddress.get(hashSlot) == 0;
   fstNodeAddress.set(hashSlot, nodeAddress);
   count++;
+}
 
+private void setBytes(long hashSlot, byte[] bytes) {
+  assert copiedNodeAddress.get(hashSlot) == 0;
   copiedNodes.append(bytes);
   // write the offset, which points to the last byte of the node we copied since we later read
   // this node in reverse
+  copiedNodeAddress.set(hashSlot, copiedNodes.getPosition() - 1);
+}
+
+/** promote the node bytes from the fallback table */
+private void setBytes(

Review Comment:
   Can we rename to `copyFallbackNodeBytes` or so?






Re: [PR] Speed up BytesRefHash#sort [lucene]

2023-11-08 Thread via GitHub


jpountz commented on PR #12775:
URL: https://github.com/apache/lucene/pull/12775#issuecomment-1801748121

   FWIW I think it's fine if this yields a speedup on some platforms but not 
other platforms. From your results, it doesn't seem to slow down flushing 
either?





Re: [PR] Speed up BytesRefHash#sort [lucene]

2023-11-08 Thread via GitHub


gf2121 commented on PR #12775:
URL: https://github.com/apache/lucene/pull/12775#issuecomment-1801756914

   > it's fine if this yields a speedup on some platforms but not other 
platforms
   
   +1
   
   > it doesn't seem to slow down flushing either?
   
   Yes!





Re: [PR] Deprecated public constructor of FSTCompiler in favor of the Builder. [lucene]

2023-11-08 Thread via GitHub


dungba88 commented on PR #12715:
URL: https://github.com/apache/lucene/pull/12715#issuecomment-1801792656

   @mikemccand could you help review this?





Re: [PR] Normalize written scalar quantized vectors when using cosine similarity [lucene]

2023-11-08 Thread via GitHub


benwtrent commented on code in PR #12780:
URL: https://github.com/apache/lucene/pull/12780#discussion_r1386577696


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java:
##
@@ -143,6 +143,13 @@ private void writeQuantizedVectors(QuantizationFieldVectorWriter fieldData) thro
 byte[] vector = new byte[fieldData.dim];
 final ByteBuffer offsetBuffer = ByteBuffer.allocate(Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
 for (float[] v : fieldData.floatVectors) {
+  if (fieldData.normalize) {
+float[] copy = new float[v.length];

Review Comment:
   we should only have one `copy` array created. Outside the loop check if we 
need it, if so, initialize and then you can use it here in the loop. Seems 
better to do that than create a new `float[dim]` for every vector.
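   The hoisting being suggested here can be sketched as follows. This is a hypothetical illustration only (the `l2normalize` helper and `writeAll` loop are stand-ins, not the actual Lucene writer internals): one scratch buffer is allocated before the loop and reused for every vector.

   ```java
   import java.util.List;

   public class NormalizeSketch {
     // In-place L2 normalization, analogous in spirit to VectorUtil.l2normalize
     static void l2normalize(float[] v) {
       double sum = 0;
       for (float x : v) sum += (double) x * x;
       float inv = (float) (1.0 / Math.sqrt(sum));
       for (int i = 0; i < v.length; i++) v[i] *= inv;
     }

     static void writeAll(List<float[]> vectors, int dim, boolean normalize) {
       // allocated once, outside the loop -- the point of the review comment
       float[] copy = normalize ? new float[dim] : null;
       for (float[] v : vectors) {
         if (normalize) {
           System.arraycopy(v, 0, copy, 0, dim);
           l2normalize(copy);
           v = copy; // quantize the normalized copy; the original stays untouched
         }
         // ... quantize and write v ...
       }
     }
   }
   ```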



##
lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java:
##
@@ -81,7 +81,8 @@ public void init() {
   protected void addRandomFields(Document doc) {
 switch (vectorEncoding) {
   case BYTE -> doc.add(new KnnByteVectorField("v2", randomVector8(30), similarityFunction));
-  case FLOAT32 -> doc.add(new KnnFloatVectorField("v2", randomVector(30), similarityFunction));
+  case FLOAT32 -> doc.add(
+  new KnnFloatVectorField("v2", randomNormalizedVector(30), similarityFunction));

Review Comment:
   Were you getting many failures when you passed non-normalized vectors with your change? The fix to normalize on cosine should be transparent; the caller should never be aware that it happens, as it only occurs for things that are then quantized.



##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java:
##
@@ -196,6 +203,13 @@ void writeSortedQuantizedVectors(QuantizationFieldVectorWriter fieldData, int[]
 final ByteBuffer offsetBuffer = ByteBuffer.allocate(Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
 for (int ordinal : ordMap) {
   float[] v = fieldData.floatVectors.get(ordinal);
+  if (fieldData.normalize) {
+float[] copy = new float[v.length];

Review Comment:
   Similar to above ^






[PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-08 Thread via GitHub


easyice opened a new pull request, #12782:
URL: https://github.com/apache/lucene/pull/12782

   As discussed in issue https://github.com/apache/lucene/issues/12717
   
   the read performance of group-varint is 14-30% faster than vint; the `(size)` values 16-248 are the number of ints that will be read.
   Feel free to close the PR if the performance improvement is not enough :)
   
   ```
   Benchmark(size)   Mode  Cnt   Score   Error   Units
   GroupVInt.readGroupVInt  16  thrpt5  30.743 ± 5.054  ops/us
   GroupVInt.readGroupVInt  32  thrpt5  14.495 ± 0.606  ops/us
   GroupVInt.readGroupVInt  64  thrpt5   6.930 ± 4.679  ops/us
   GroupVInt.readGroupVInt 128  thrpt5   3.593 ± 0.687  ops/us
   GroupVInt.readGroupVInt 248  thrpt5   2.356 ± 0.073  ops/us
   GroupVInt.readVInt   16  thrpt5  21.437 ± 1.102  ops/us
   GroupVInt.readVInt   32  thrpt5  10.482 ± 3.620  ops/us
   GroupVInt.readVInt   64  thrpt5   5.966 ± 0.707  ops/us
   GroupVInt.readVInt  128  thrpt5   2.750 ± 1.668  ops/us
   GroupVInt.readVInt  248  thrpt5   1.606 ± 0.042  ops/us
   ```
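   For context, group-varint avoids vint's per-byte continuation bits by packing four ints behind a single header byte whose four 2-bit fields give each value's byte length; decoding then reads exactly that many bytes per value. A minimal, self-contained sketch of the idea (not Lucene's actual implementation, which operates on DataOutput/DataInput):

   ```java
   public class GroupVIntSketch {
     // Encode 4 non-negative ints: 1 header byte holding four 2-bit length
     // fields (bytes-per-value minus 1), followed by the value bytes.
     static int encode(int[] values, byte[] out, int off) {
       int headerPos = off++;
       int header = 0;
       for (int i = 0; i < 4; i++) {
         int v = values[i];
         // number of bytes needed to hold v (at least 1)
         int numBytes = (32 - Integer.numberOfLeadingZeros(v | 1) + 7) / 8;
         header |= (numBytes - 1) << (i * 2);
         for (int b = 0; b < numBytes; b++) {
           out[off++] = (byte) (v >>> (b * 8));
         }
       }
       out[headerPos] = (byte) header;
       return off; // new write position
     }

     static int decode(byte[] in, int off, int[] values) {
       int header = in[off++] & 0xFF;
       for (int i = 0; i < 4; i++) {
         int numBytes = ((header >>> (i * 2)) & 0x3) + 1;
         int v = 0;
         for (int b = 0; b < numBytes; b++) {
           v |= (in[off++] & 0xFF) << (b * 8);
         }
         values[i] = v;
       }
       return off; // new read position
     }
   }
   ```

   The speedup in the benchmark above comes largely from the decoder knowing every value's length up front, rather than testing a continuation bit on each byte.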





Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]

2023-11-08 Thread via GitHub


dungba88 commented on code in PR #12778:
URL: https://github.com/apache/lucene/pull/12778#discussion_r1386637002


##
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##
@@ -289,21 +273,34 @@ public long getNodeAddress(long hashSlot) {
 }
 
 /**
- * Set the node address and bytes from the provided hash slot
+ * Set the node address from the provided hash slot
  *
  * @param hashSlot the hash slot to write to
  * @param nodeAddress the node address
- * @param bytes the node bytes to be copied
  */
-public void setNode(long hashSlot, long nodeAddress, byte[] bytes) {
+public void setNodeAddress(long hashSlot, long nodeAddress) {
   assert fstNodeAddress.get(hashSlot) == 0;
   fstNodeAddress.set(hashSlot, nodeAddress);
   count++;
+}
 
+private void setBytes(long hashSlot, byte[] bytes) {

Review Comment:
   Thank you, I have renamed these 2






Re: [PR] Re-adding the backward_codecs.lucene90 TestPForUtil + TestForUtil [lucene]

2023-11-08 Thread via GitHub


mikemccand merged PR #12781:
URL: https://github.com/apache/lucene/pull/12781





Re: [I] HnwsGraph creates disconnected components [lucene]

2023-11-08 Thread via GitHub


nitirajrathore commented on issue #12627:
URL: https://github.com/apache/lucene/issues/12627#issuecomment-1801982741

   I was able to conduct some perf-tests as well. I would like to propose the following changes, with the example of adding node `a` to 'b, c, d':
   
   | node | neighbours|
   |---|---|
   | b | c, d, e|
   | c | b, d|
   | e | e, x, y, z |
   
   ### Proposed heuristic 
   1. At the time of adding a node, always add all (up to maxConn) edges to the highest-scoring neighbours (disregarding the diversity check), e.g. adding edges a-->b, a-->c etc. Currently we also check for diverse connections at this time.
   2. At the time of adding reverse edges from neighbours to the node, if the given neighbour has more than maxConn connections, use the new diversity-check formula to decide which edge can be removed. For example, at the time of adding edge b --> a, if we see that `b` has more than maxConn connections, apply the new heuristic and try to remove one edge.
   3. Check for diversity only between the common neighbours of the candidate node and the node. For example, if `b` has more than maxConn connections, we iterate over b's neighbours (which now include 'a' as well), considering each one as a possible candidate for disconnection. When deciding whether to disconnect an edge, say b-->c, we consider only the common neighbours of c and b for the diversity check: we compare score(c,b) with only score(c,d) and not with score(c,e), because only `d` is a common neighbour of c and b. If we find a non-diverse node we return it; otherwise
   4. we return the node which has the highest number of common neighbours with `b`; otherwise
   5. we return the node with the maximum number of edges (e.g. `e` in the above case).
   6. If none of the above conditions holds, we return -1, meaning we should not remove any connection.
   7. If we receive a valid node for removal, we remove that connection and its other half, i.e. we remove both bi-directional connections. So if `c` is returned as a non-diverse node, we remove both the b--->c and c--->b connections.
   Performance numbers:
   
   | Branch | recall | avgCpuTime | numDocs | reindexTimeMsec | fanout | maxConn | beamWidth | totalVisited | selectivity | prefilter |
   |---|---|---|---|---|---|---|---|---|---|---|
   | Baseline | 0.451 | 17.78 | 100 | 370916 | 0 | 16 | 100 | 10 | 1.00 | post-filter |
   | Candidate | 0.599 (33% increase) | 21.63 (22% increase) | 100 | 624797 (68% increase) | 0 | 16 | 100 | 10 | 1.00 | post-filter |
   
   Since we are now increasing the number of connections per node, we also see a good increase in recall for the same max-conn. For the same reason, we do expect avgCpuTime to increase at query time. I can work a bit on improving the indexing time, as my current implementation is very naive: it finds common nodes with two nested for-loops and in some places recalculates scores where not needed. But overall the indexing time will increase for sure.
   
   Another data point I will post next is a histogram of the number of edges each node has. I expect it to increase from a lower number towards max-conn on average.
   
   I will upload my (rough) draft PR now.
   
    How I reached this heuristic:
   Before coming up with this heuristic I tried quite a few simpler ones, but in those cases I was either getting many more disconnections than mainline, or, even when there were very few (sometimes just one), some disconnection was still happening. I am open to removing or adding any other condition that is considered for edge removal.
   
   Initially, I tried just (1) and left the rest of the algorithm the same as in mainline. This "increased" the disconnections by a large factor instead of decreasing them. The issue was that if, say, nodes 0 to 32 are added in a star-like formation and the 33rd node comes in, it will form connections with nodes 1-32, and each may find 0 as the non-diverse node, so each one removes 0, leaving it disconnected from the graph. Other than this extreme case there were others too.
   That is where (3) came in handy: we don't check the diverseness of an edge unless we are sure there is an alternate path to the disconnecting node via a neighbouring node. This, coupled with (7), reduced the disconnectedness from a few thousand nodes to just a couple of nodes at level-1 and only an occasional disconnected node at level-0.
   Then, to achieve no disconnections at all, I initially tried (1, 2, 3) with (6) to remove even the slightest chance of creating disconnectedness, but with this we end up not removing any connection a large number of times, and the number of connections per node increased from a max-conn of 32 at level-0 to upwards of 250 for more than 5% of nod

Re: [I] HnwsGraph creates disconnected components [lucene]

2023-11-08 Thread via GitHub


benwtrent commented on issue #12627:
URL: https://github.com/apache/lucene/issues/12627#issuecomment-1802098274

   A `68% increase` in index time is untenable; I would be a hard no on a change that slows things down this much. Maybe we can find something better.
   
   @nitirajrathore I know https://github.com/nmslib/nmslib/blob/master/similarity_search/src/method/hnsw.cc has various heuristics; have you explored those to see how they can be modified or added here?





Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]

2023-11-08 Thread via GitHub


mikemccand merged PR #12778:
URL: https://github.com/apache/lucene/pull/12778





Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]

2023-11-08 Thread via GitHub


mikemccand commented on PR #12778:
URL: https://github.com/apache/lucene/pull/12778#issuecomment-1802213131

   Hmm running `./gradlew check` after pulling this change into my dev box I 
hit:
   
   ```
   org.apache.lucene.util.fst.TestFSTs > test suite's output saved to /s1/l/trunk/lucene/core/build/test-results/test/outputs/OUTPUT-org.apache.lucene.util.fst.TestFSTs.txt, copied below:
  > java.lang.AssertionError
  > at __randomizedtesting.SeedInfo.seed([CD812A59C5A455F0:D3A718D42274A0C2]:0)
  > at org.apache.lucene.util.fst.FST$Arc$BitTable.assertIsValid(FST.java:376)
  > at org.apache.lucene.util.fst.FST.readNextRealArc(FST.java:907)
  > at org.apache.lucene.util.fst.FST.readFirstRealTargetArc(FST.java:763)
  > at org.apache.lucene.util.fst.NodeHash$PagedGrowableHash.nodesEqual(NodeHash.java:387)
  > at org.apache.lucene.util.fst.NodeHash.getFallback(NodeHash.java:94)
  > at org.apache.lucene.util.fst.NodeHash.add(NodeHash.java:121)
  > at org.apache.lucene.util.fst.FSTCompiler.compileNode(FSTCompiler.java:295)
  > at org.apache.lucene.util.fst.FSTCompiler.freezeTail(FSTCompiler.java:703)
  > at org.apache.lucene.util.fst.FSTCompiler.add(FSTCompiler.java:777)
  > at org.apache.lucene.util.fst.TestFSTs.testRealTerms(TestFSTs.java:404)
  > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  > at java.base/java.lang.reflect.Method.invoke(Method.java:580)
  > at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  > at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  > at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  > at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  > at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  > at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  > at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  > at org.apache.lucene.tests.util.TestRuleIgnoreAfterM

Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]

2023-11-08 Thread via GitHub


mikemccand commented on PR #12778:
URL: https://github.com/apache/lucene/pull/12778#issuecomment-1802214263

   And it does reproduce for me.  I'll revert this change for now!





Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]

2023-11-08 Thread via GitHub


mikemccand commented on PR #12778:
URL: https://github.com/apache/lucene/pull/12778#issuecomment-1802219722

   (And does not reproduce once I revert).





Re: [PR] Normalize written scalar quantized vectors when using cosine similarity [lucene]

2023-11-08 Thread via GitHub


kevindrosendahl commented on code in PR #12780:
URL: https://github.com/apache/lucene/pull/12780#discussion_r1386980331


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java:
##
@@ -143,6 +143,13 @@ private void writeQuantizedVectors(QuantizationFieldVectorWriter fieldData) thro
 byte[] vector = new byte[fieldData.dim];
 final ByteBuffer offsetBuffer = ByteBuffer.allocate(Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
 for (float[] v : fieldData.floatVectors) {
+  if (fieldData.normalize) {
+float[] copy = new float[v.length];

Review Comment:
   Good catch, changed in 
https://github.com/apache/lucene/pull/12780/commits/1d9f68be208ad2b111640d370bee8f410203099f
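The fix discussed here can be illustrated with a small copy-then-normalize sketch (hypothetical names, not the actual `Lucene99ScalarQuantizedVectorsWriter` code): the caller's `float[]` must not be mutated, so a copy is normalized instead.

```java
import java.util.Arrays;

// Illustrative sketch only: normalize a copy so the caller's vector is
// left intact, as needed when cosine similarity normalizes at write time.
public class NormalizeDemo {
  static float[] normalizedCopy(float[] v) {
    float[] copy = Arrays.copyOf(v, v.length); // never touch the input
    double norm = 0;
    for (float x : copy) {
      norm += x * x;
    }
    float invNorm = (float) (1.0 / Math.sqrt(norm));
    for (int i = 0; i < copy.length; i++) {
      copy[i] *= invNorm;
    }
    return copy;
  }

  public static void main(String[] args) {
    float[] v = {3f, 4f};
    float[] unit = normalizedCopy(v);
    System.out.println(Arrays.toString(unit)); // unit-length copy
    System.out.println(Arrays.toString(v));    // original left intact
  }
}
```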






Re: [PR] Normalize written scalar quantized vectors when using cosine similarity [lucene]

2023-11-08 Thread via GitHub


kevindrosendahl commented on code in PR #12780:
URL: https://github.com/apache/lucene/pull/12780#discussion_r1386980454


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java:
##
@@ -196,6 +203,13 @@ void 
writeSortedQuantizedVectors(QuantizationFieldVectorWriter fieldData, int[]
 final ByteBuffer offsetBuffer = 
ByteBuffer.allocate(Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
 for (int ordinal : ordMap) {
   float[] v = fieldData.floatVectors.get(ordinal);
+  if (fieldData.normalize) {
+float[] copy = new float[v.length];

Review Comment:
   Also changed in 
https://github.com/apache/lucene/pull/12780/commits/1d9f68be208ad2b111640d370bee8f410203099f







Re: [PR] Normalize written scalar quantized vectors when using cosine similarity [lucene]

2023-11-08 Thread via GitHub


kevindrosendahl commented on code in PR #12780:
URL: https://github.com/apache/lucene/pull/12780#discussion_r1386992189


##
lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java:
##
@@ -81,7 +81,8 @@ public void init() {
   protected void addRandomFields(Document doc) {
 switch (vectorEncoding) {
   case BYTE -> doc.add(new KnnByteVectorField("v2", randomVector8(30), 
similarityFunction));
-  case FLOAT32 -> doc.add(new KnnFloatVectorField("v2", randomVector(30), 
similarityFunction));
+  case FLOAT32 -> doc.add(
+  new KnnFloatVectorField("v2", randomNormalizedVector(30), 
similarityFunction));

Review Comment:
   The issue I was having was that I couldn't reproduce the issue in 
`testQuantizedVectorsWriteAndRead()` because 
`BaseKnnVectorsFormatTestCase::randomVector()` [always normalizes the vectors 
it 
creates](https://github.com/apache/lucene/blob/20d5de448a739beb85e380d153fb13bf1817a8d6/lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java#L1236).
 That means that the persisted vectors matched the expected vectors even though 
the vectors were not explicitly normalized by the `Writer` (even after 
normalizing the expected vectors when using cosine).
   
   To isolate the effects of the test changes from normalized to potentially 
non-normalized vectors to just this test, I kept the existing behavior (i.e. 
existing tests use normalized vectors) by renaming the existing method from 
`randomVector()` to `randomNormalizedVector()`, then adding a new method 
`randomVector()` which does not normalize the vectors. This commit 
https://github.com/apache/lucene/pull/12780/commits/7cb2375e850f963630788028af34e3275fc5af09
 has those changes.
   
   Not sure if that really answers your question, I didn't try changing any 
other tests to use non-normalized vectors. I'm happy to structure all of these 
tests however makes sense to you.






Re: [PR] Normalize written scalar quantized vectors when using cosine similarity [lucene]

2023-11-08 Thread via GitHub


benwtrent commented on code in PR #12780:
URL: https://github.com/apache/lucene/pull/12780#discussion_r1386996018


##
lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java:
##
@@ -81,7 +81,8 @@ public void init() {
   protected void addRandomFields(Document doc) {
 switch (vectorEncoding) {
   case BYTE -> doc.add(new KnnByteVectorField("v2", randomVector8(30), 
similarityFunction));
-  case FLOAT32 -> doc.add(new KnnFloatVectorField("v2", randomVector(30), 
similarityFunction));
+  case FLOAT32 -> doc.add(
+  new KnnFloatVectorField("v2", randomNormalizedVector(30), 
similarityFunction));

Review Comment:
   @kevindrosendahl I see now! thank you!






Re: [PR] Normalize written scalar quantized vectors when using cosine similarity [lucene]

2023-11-08 Thread via GitHub


benwtrent commented on code in PR #12780:
URL: https://github.com/apache/lucene/pull/12780#discussion_r1386996238


##
lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java:
##
@@ -81,7 +81,8 @@ public void init() {
   protected void addRandomFields(Document doc) {
 switch (vectorEncoding) {
   case BYTE -> doc.add(new KnnByteVectorField("v2", randomVector8(30), 
similarityFunction));
-  case FLOAT32 -> doc.add(new KnnFloatVectorField("v2", randomVector(30), 
similarityFunction));
+  case FLOAT32 -> doc.add(
+  new KnnFloatVectorField("v2", randomNormalizedVector(30), 
similarityFunction));

Review Comment:
   I will merge and backport once this is green.






[PR] Updated heuristic to remove non diverse edges keeping overall graph c… [lucene]

2023-11-08 Thread via GitHub


nitirajrathore opened a new pull request, #12783:
URL: https://github.com/apache/lucene/pull/12783

   …onnected. No test cases, unoptimized, draft only version.
   
   ### Description
   
   Details in this comment: 
https://github.com/apache/lucene/issues/12627#issuecomment-1801982741
   I will first optimize the code for performance, then add/update test cases, 
and then create a PR for review. This is just draft code.
   I have some ideas commented inline for improving indexing performance.
   I also need to check whether this change works well with the parallel graph 
merge/construction that was added recently.





Re: [I] HnwsGraph creates disconnected components [lucene]

2023-11-08 Thread via GitHub


nitirajrathore commented on issue #12627:
URL: https://github.com/apache/lucene/issues/12627#issuecomment-1802424885

   @benwtrent: I have added a draft PR. The code is not at all optimized for 
performance right now; I am hoping to fix some obvious issues and will post 
perf results here. 
   
   > have you explored those to see how they can be modified or added here?
   
   No, I haven't checked those. Thanks for pointing them out. I will check and 
report back.





[PR] Speed up BytesRefHash#sort (another approach) [lucene]

2023-11-08 Thread via GitHub


gf2121 opened a new pull request, #12784:
URL: https://github.com/apache/lucene/pull/12784

   Following https://github.com/apache/lucene/pull/12775, this PR tries another 
approach to speed up `BytesRefHash#sort`:
   The idea is that since we have extra ints in this map, we can cache each 
entry's bucket while building the histograms and reuse it in `reorder`. I 
checked this approach on an Intel chip, showing a ~30% speed-up. I'll check an 
M2 chip and wikimedium data tomorrow.
   
   ```
   BASELINE:  sort 5169965 terms, build histogram took: 1968ms, reorder took: 
2132ms, total took: 5470ms.
   BASELINE:  sort 5169965 terms, build histogram took: 1975ms, reorder took: 
2133ms, total took: 5526ms.
   BASELINE:  sort 5169965 terms, build histogram took: 1999ms, reorder took: 
2157ms, total took: 5573ms.
   BASELINE:  sort 5169965 terms, build histogram took: 1955ms, reorder took: 
2138ms, total took: 5446ms.
   BASELINE:  sort 5169965 terms, build histogram took: 1990ms, reorder took: 
2161ms, total took: 5528ms.
   BASELINE:  sort 5169965 terms, build histogram took: 1997ms, reorder took: 
2175ms, total took: 5571ms.
   BASELINE:  sort 5169965 terms, build histogram took: 2004ms, reorder took: 
2119ms, total took: 5477ms.
   BASELINE:  sort 5169965 terms, build histogram took: 1978ms, reorder took: 
2155ms, total took: 5501ms.
   BASELINE:  sort 5169965 terms, build histogram took: 2015ms, reorder took: 
2169ms, total took: 5572ms.
   BASELINE:  sort 5169965 terms, build histogram took: 1941ms, reorder took: 
2138ms, total took: 5400ms.
   BASELINE:  sort 5169965 terms, build histogram took: 2000ms, reorder took: 
2155ms, total took: 5558ms.
   
   CANDIDATE:  sort 5169965 terms, build histogram took: 1996ms, reorder took: 
133ms, total took: 3734ms.
   CANDIDATE:  sort 5169965 terms, build histogram took: 1989ms, reorder took: 
142ms, total took: 3655ms.
   CANDIDATE:  sort 5169965 terms, build histogram took: 2031ms, reorder took: 
155ms, total took: 3762ms.
   CANDIDATE:  sort 5169965 terms, build histogram took: 2016ms, reorder took: 
145ms, total took: 3739ms.
   CANDIDATE:  sort 5169965 terms, build histogram took: 1994ms, reorder took: 
142ms, total took: 3667ms.
   CANDIDATE:  sort 5169965 terms, build histogram took: 2010ms, reorder took: 
140ms, total took: 3651ms.
   CANDIDATE:  sort 5169965 terms, build histogram took: 2021ms, reorder took: 
154ms, total took: 3731ms.
   CANDIDATE:  sort 5169965 terms, build histogram took: 2019ms, reorder took: 
144ms, total took: 3727ms.
   CANDIDATE:  sort 5169965 terms, build histogram took: 2064ms, reorder took: 
138ms, total took: 3784ms.
   CANDIDATE:  sort 5169965 terms, build histogram took: 2043ms, reorder took: 
142ms, total took: 3727ms.
   CANDIDATE:  sort 5169965 terms, build histogram took: 1964ms, reorder took: 
140ms, total took: 3630ms.
   ```
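The caching idea can be illustrated with a plain counting sort over ints (a hypothetical sketch, not the actual `BytesRefHash` radix-sort code): the bucket computed for each key during the histogram pass is stored in a side array and reused in the reorder pass instead of being re-derived from the bytes.

```java
import java.util.Arrays;

// Counting sort by one byte, caching each key's bucket while building
// the histogram so the reorder pass does not recompute it.
public class CachedBucketSortDemo {
  static int[] countingSortByByte(int[] keys, int shift) {
    int[] buckets = new int[keys.length]; // cached bucket per key
    int[] histogram = new int[256];
    for (int i = 0; i < keys.length; i++) {
      int b = (keys[i] >>> shift) & 0xFF;
      buckets[i] = b; // cache the bucket for the reorder pass
      histogram[b]++;
    }
    int[] offsets = new int[256]; // start offset of each bucket
    for (int b = 1; b < 256; b++) {
      offsets[b] = offsets[b - 1] + histogram[b - 1];
    }
    int[] sorted = new int[keys.length];
    for (int i = 0; i < keys.length; i++) {
      sorted[offsets[buckets[i]]++] = keys[i]; // reuse cached bucket
    }
    return sorted;
  }

  public static void main(String[] args) {
    int[] keys = {0x302, 0x101, 0x203, 0x100};
    System.out.println(Arrays.toString(countingSortByByte(keys, 0)));
  }
}
```

The PR stores the cached bucket in the map's extra ints rather than a fresh array, but the reuse pattern is the same idea.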





Re: [PR] Speed up BytesRefHash#sort [lucene]

2023-11-08 Thread via GitHub


gf2121 commented on PR #12775:
URL: https://github.com/apache/lucene/pull/12775#issuecomment-1802480151

   I came up with https://github.com/apache/lucene/pull/12784 as another idea 
to speed up `BytesRefHash#sort`, which has been shown to have performance 
improvements running on Intel chips.





Re: [PR] Normalize written scalar quantized vectors when using cosine similarity [lucene]

2023-11-08 Thread via GitHub


benwtrent merged PR #12780:
URL: https://github.com/apache/lucene/pull/12780





Re: [PR] Clean up ordinal map in default SSDV reader state [lucene]

2023-11-08 Thread via GitHub


gsmiller commented on PR #12454:
URL: https://github.com/apache/lucene/pull/12454#issuecomment-1802519348

   > @gsmiller, I think this PR is ready. Is there anything else you'd like to 
see changed?
   
   Gah! I'm sorry I missed this. I'll have a look here shortly. Apologies again.





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-08 Thread via GitHub


benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1802705393

   @kevindrosendahl if I am reading the code correctly, it does the following:
   
- Write int8 quantized vectors along side the vector ordinals in the graph 
(`.vex` or whatever has a copy of each vector).
- Continue to write vectors in `.vec`. I am guessing this is a stop-gap and 
you are thinking of removing this? Maybe not?
   
   I have a concern around index sorting. How does building the graph & 
subsequent `getFloatVectors()` play with index sorting? Usually when folks sort 
an index, they expect to be able to iterate values in that given order. Is it 
horrifically slow to iterate `.vex` in this scenario? 
   
   What do we think about always keeping `.vec` around? Probably should for 
re-ranking purposes once more extreme quantization measures are used.
   
   
   One more question: have you tested your implementation in a situation 
where `.vex` cannot all be paged into memory, and did it fare OK?





Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]

2023-11-08 Thread via GitHub


dungba88 commented on PR #12778:
URL: https://github.com/apache/lucene/pull/12778#issuecomment-1802753207

   Thank you for reproducing this! I found the bug; it's quite silly. The node 
address is the last address, so I should have done this
   
   ```
 copiedNodes.append(fallbackTable.copiedNodes, fallbackAddress - 
nodeLength + 1, nodeLength);
   ```
   
   instead of
   
   ```
 copiedNodes.append(fallbackTable.copiedNodes, fallbackAddress, 
nodeLength);
   ```
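The inclusive-end arithmetic can be illustrated with a small, self-contained sketch (hypothetical names, not the actual `NodeHash` code): since the stored address points at the last byte of the node, a forward copy of `nodeLength` bytes must start at `address - nodeLength + 1`.

```java
// Illustrative sketch: copy a node out of a byte pool given the address
// of its LAST byte (inclusive), as the FST node hash stores it.
public class NodeCopyDemo {
  /** First byte offset of a node whose last byte is at lastByteAddress. */
  static int startOffset(int lastByteAddress, int nodeLength) {
    return lastByteAddress - nodeLength + 1;
  }

  static byte[] copyNode(byte[] pool, int lastByteAddress, int nodeLength) {
    byte[] node = new byte[nodeLength];
    System.arraycopy(pool, startOffset(lastByteAddress, nodeLength), node, 0, nodeLength);
    return node;
  }

  public static void main(String[] args) {
    byte[] pool = {10, 11, 12, 13, 14, 15};
    // the node occupies pool[2..4]; its recorded address is 4 (last byte)
    byte[] node = copyNode(pool, 4, 3);
    System.out.println(java.util.Arrays.toString(node)); // [12, 13, 14]
  }
}
```

Copying from `lastByteAddress` directly (the original bug) would read past the node's bytes instead of its actual contents.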





[PR] Redo #12707: Do not rely on isAlive() status of MemorySegment#Scope [lucene]

2023-11-08 Thread via GitHub


uschindler opened a new pull request, #12785:
URL: https://github.com/apache/lucene/pull/12785

   Unfortunately the solution in #12707 did not work well under concurrency: 
the `isAlive()` status of `MemorySegment.Scope` may be stale. In that case the 
`IllegalStateException` was caught, but the confirmation using `isAlive()` did 
not work, so the original `IllegalStateException` was rethrown instead of the 
required `AlreadyClosedException`.

   This PR reverts #12707 and falls back to the naive approach of checking the 
exception message for "closed". I am in contact with @mcimadamore to find a 
better solution for confirming the closed status without relying on the 
exception message. 
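The naive fallback can be sketched as follows (hypothetical names; `AlreadyClosedException` here is a local stand-in for `org.apache.lucene.store.AlreadyClosedException`):

```java
// Illustrative sketch: translate a JDK IllegalStateException thrown by a
// closed MemorySegment into an AlreadyClosedException by inspecting the
// message, since the scope's isAlive() flag may be stale across threads.
public class ClosedCheckDemo {
  static class AlreadyClosedException extends RuntimeException {
    AlreadyClosedException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /** Translate a closed-scope IllegalStateException; pass others through. */
  static RuntimeException translate(IllegalStateException e) {
    String message = e.getMessage();
    if (message != null && message.contains("closed")) {
      return new AlreadyClosedException(
          "Already closed: this IndexInput cannot be used anymore", e);
    }
    return e;
  }

  public static void main(String[] args) {
    System.out.println(
        translate(new IllegalStateException("Session is closed")).getClass().getSimpleName());
    System.out.println(
        translate(new IllegalStateException("other problem")).getClass().getSimpleName());
  }
}
```

Unrelated `IllegalStateException`s pass through unchanged, which is exactly why relying on the message text is fragile and a better JDK-level signal is being sought.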





Re: [PR] MMapDirectory with MemorySegment: Confirm that scope/session is no longer alive before throwing AlreadyClosedException [lucene]

2023-11-08 Thread via GitHub


uschindler commented on PR #12707:
URL: https://github.com/apache/lucene/pull/12707#issuecomment-1802765955

   This code did not work well as the `isAlive` status may be stale in other 
threads. I reworked this one here: #12785





[PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]

2023-11-08 Thread via GitHub


dungba88 opened a new pull request, #12786:
URL: https://github.com/apache/lucene/pull/12786

   ### Description
   
   See the previous PR: https://github.com/apache/lucene/pull/12778
   
   There was a bug in that PR: `copiedNodeAddress` is the last address 
(inclusive) of the node, so the offset we want to copy from should be 
`fallbackAddress - nodeLength + 1`.





Re: [PR] Redo #12707: Do not rely on isAlive() status of MemorySegment#Scope [lucene]

2023-11-08 Thread via GitHub


uschindler commented on PR #12785:
URL: https://github.com/apache/lucene/pull/12785#issuecomment-1802841410

   The bug in JDK is here: https://bugs.openjdk.org/browse/JDK-8319756





Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]

2023-11-08 Thread via GitHub


dungba88 commented on PR #12786:
URL: https://github.com/apache/lucene/pull/12786#issuecomment-1802854503

   > Could not copy file 
'/home/runner/work/lucene/lucene/lucene/JRE_VERSION_MIGRATION.md' to 
'/home/runner/work/lucene/lucene/lucene/documentation/build/site/JRE_VERSION_MIGRATION.html'.
   
   This seems to be an intermittent error.





Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]

2023-11-08 Thread via GitHub


dungba88 commented on code in PR #12786:
URL: https://github.com/apache/lucene/pull/12786#discussion_r1387298308


##
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##
@@ -289,21 +273,38 @@ public long getNodeAddress(long hashSlot) {
 }
 
 /**
- * Set the node address and bytes from the provided hash slot
+ * Set the node address from the provided hash slot
  *
  * @param hashSlot the hash slot to write to
  * @param nodeAddress the node address
- * @param bytes the node bytes to be copied
  */
-public void setNode(long hashSlot, long nodeAddress, byte[] bytes) {
+public void setNodeAddress(long hashSlot, long nodeAddress) {
   assert fstNodeAddress.get(hashSlot) == 0;
   fstNodeAddress.set(hashSlot, nodeAddress);
   count++;
+}
 
+/** copy the node bytes from the FST */
+private void copyNodeBytes(long hashSlot, byte[] bytes) {
+  assert copiedNodeAddress.get(hashSlot) == 0;
   copiedNodes.append(bytes);
   // write the offset, which points to the last byte of the node we copied 
since we later read
   // this node in reverse
+  copiedNodeAddress.set(hashSlot, copiedNodes.getPosition() - 1);
+}
+
+/** promote the node bytes from the fallback table */
+private void copyFallbackNodeBytes(

Review Comment:
   I think we should add a test for this class to be safe, but that means we 
need to expose it (package-private, maybe).






Re: [PR] Clean up ordinal map in default SSDV reader state [lucene]

2023-11-08 Thread via GitHub


gsmiller merged PR #12454:
URL: https://github.com/apache/lucene/pull/12454





Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]

2023-11-08 Thread via GitHub


zacharymorn commented on code in PR #240:
URL: https://github.com/apache/lucene/pull/240#discussion_r1387436327


##
lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/SearchWithCollectorTask.java:
##
@@ -45,20 +43,6 @@ public boolean withCollector() {
 return true;
   }
 
-  @Override
-  protected Collector createCollector() throws Exception {

Review Comment:
   Hmm, after some more thought, I feel I should revert this change here for 
now, given that the task tests more than just the `TopScoreDocCollector`.






Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]

2023-11-08 Thread via GitHub


zacharymorn commented on code in PR #240:
URL: https://github.com/apache/lucene/pull/240#discussion_r1387438590


##
lucene/core/src/java/org/apache/lucene/search/TopDocs.java:
##
@@ -232,8 +232,8 @@ public static TopDocs merge(
   /**
* Returns a new TopFieldDocs, containing topN results across the provided 
TopFieldDocs, sorting
* by the specified {@link Sort}. Each of the TopDocs must have been sorted 
by the same Sort, and
-   * sort field values must have been filled (ie, fillFields=true 
must be passed to
-   * {@link TopFieldCollector#create}).
+   * sort field values must have been filled (ie, fillFields=true 
must be passed to the
+   * constructor of {@link TopFieldCollectorManager}).

Review Comment:
   Good catch! This most likely was an outdated doc, as 
`TopFieldCollector#create` also did not take a boolean `fillFields` argument. 
Removed. 






Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]

2023-11-08 Thread via GitHub


zacharymorn commented on code in PR #240:
URL: https://github.com/apache/lucene/pull/240#discussion_r1387443316


##
lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java:
##
@@ -174,7 +173,7 @@ private static boolean canEarlyTerminateOnPrefix(Sort 
searchSort, Sort indexSort
* Implements a TopFieldCollector over one SortField criteria, with tracking

Review Comment:
   I feel extending or instantiating collectors should still be allowed, as 
long as it is done inside collector managers, and this would become 
enforceable once we fully deprecate `IndexSearcher#search(Query, Collector)`.






Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]

2023-11-08 Thread via GitHub


zacharymorn commented on code in PR #240:
URL: https://github.com/apache/lucene/pull/240#discussion_r1387449164


##
lucene/core/src/java/org/apache/lucene/search/TopFieldCollectorManager.java:
##
@@ -0,0 +1,198 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.List;
+
+/**
+ * Create a TopFieldCollectorManager which uses a shared hit counter to 
maintain number of hits and
+ * a shared {@link MaxScoreAccumulator} to propagate the minimum score across 
segments if the
+ * primary sort is by relevancy.
+ *
+ * Note that a new collectorManager should be created for each search due 
to its internal states.
+ */
+public class TopFieldCollectorManager implements CollectorManager<TopFieldCollector, TopFieldDocs> {
+  private final Sort sort;
+  private final int numHits;
+  private final FieldDoc after;
+  private final HitsThresholdChecker hitsThresholdChecker;
+  private final MaxScoreAccumulator minScoreAcc;
+  private final List<TopFieldCollector> collectors;
+  private final boolean supportsConcurrency;
+  private boolean collectorCreated;
+
+  /**
+   * Creates a new {@link TopFieldCollectorManager} from the given arguments.
+   *
+   * NOTE: The instances returned by this method pre-allocate a full 
array of length
+   * numHits.
+   *
+   * @param sort the sort criteria (SortFields).
+   * @param numHits the number of results to collect.
+   * @param after the previous doc after which matching docs will be collected.
+   * @param totalHitsThreshold the number of docs to count accurately. If the 
query matches more
+   * than {@code totalHitsThreshold} hits then its hit count will be a 
lower bound. On the other
+   * hand if the query matches less than or exactly {@code 
totalHitsThreshold} hits then the hit
+   * count of the result will be accurate. {@link Integer#MAX_VALUE} may 
be used to make the hit
+   * count accurate, but this will also make query processing slower.
+   * @param supportsConcurrency to use thread-safe and slower internal states 
for count tracking.
+   */
+  public TopFieldCollectorManager(
+  Sort sort, int numHits, FieldDoc after, int totalHitsThreshold, boolean 
supportsConcurrency) {

Review Comment:
   Hmm that's a good question. I'll try to set up some benchmark later on to see 
if I can measure this. Technically speaking, the current expected usage would be 
`supportsConcurrency = searcher.getSlices().length > 1` (I didn't add doc for 
this though, as I'm not sure if it makes sense to tie the parameter to this 
condition this early on), so setting `supportsConcurrency` to `true` while 
using a single thread / slice seems to be a misuse (that we should maybe warn 
about)?






Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]

2023-11-08 Thread via GitHub


zacharymorn commented on code in PR #240:
URL: https://github.com/apache/lucene/pull/240#discussion_r1387450681


##
lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java:
##
@@ -429,106 +432,29 @@ public static TopFieldCollector create(Sort sort, int 
numHits, int totalHitsThre
* field is indexed both with doc values and points. In this case, there 
is an assumption that
* the same data is stored in these points and doc values.
* @return a {@link TopFieldCollector} instance which will sort the results 
by the sort criteria.
+   * @deprecated This method is being deprecated in favor of using the 
constructor of {@link

Review Comment:
   Done.



##
lucene/core/src/java/org/apache/lucene/search/TopFieldCollectorManager.java:
##
@@ -0,0 +1,198 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.List;
+
+/**
+ * Create a TopFieldCollectorManager which uses a shared hit counter to 
maintain number of hits and
+ * a shared {@link MaxScoreAccumulator} to propagate the minimum score across 
segments if the
+ * primary sort is by relevancy.
+ *
+ * Note that a new collectorManager should be created for each search due 
to its internal states.
+ */
+public class TopFieldCollectorManager implements CollectorManager<TopFieldCollector, TopFieldDocs> {
+  private final Sort sort;
+  private final int numHits;
+  private final FieldDoc after;
+  private final HitsThresholdChecker hitsThresholdChecker;
+  private final MaxScoreAccumulator minScoreAcc;
+  private final List<TopFieldCollector> collectors;
+  private final boolean supportsConcurrency;
+  private boolean collectorCreated;
+
+  /**
+   * Creates a new {@link TopFieldCollectorManager} from the given arguments.
+   *
+   * NOTE: The instances returned by this method pre-allocate a full 
array of length
+   * numHits.
+   *
+   * @param sort the sort criteria (SortFields).
+   * @param numHits the number of results to collect.
+   * @param after the previous doc after which matching docs will be collected.
+   * @param totalHitsThreshold the number of docs to count accurately. If the 
query matches more
+   * than {@code totalHitsThreshold} hits then its hit count will be a 
lower bound. On the other
+   * hand if the query matches less than or exactly {@code 
totalHitsThreshold} hits then the hit
+   * count of the result will be accurate. {@link Integer#MAX_VALUE} may 
be used to make the hit
+   * count accurate, but this will also make query processing slower.
+   * @param supportsConcurrency to use thread-safe and slower internal states 
for count tracking.
+   */
+  public TopFieldCollectorManager(
+  Sort sort, int numHits, FieldDoc after, int totalHitsThreshold, boolean 
supportsConcurrency) {
+if (totalHitsThreshold < 0) {
+  throw new IllegalArgumentException(
+  "totalHitsThreshold must be >= 0, got " + totalHitsThreshold);
+}
+
+if (numHits <= 0) {
+  throw new IllegalArgumentException(
+  "numHits must be > 0; please use TotalHitCountCollector if you just 
need the total hit count");
+}
+
+if (sort.getSort().length == 0) {
+  throw new IllegalArgumentException("Sort must contain at least one 
field");
+}
+
+if (after != null) {
+  if (after.fields == null) {
+throw new IllegalArgumentException(
+"after.fields wasn't set; you must pass fillFields=true for the 
previous search");
+  }
+
+  if (after.fields.length != sort.getSort().length) {
+throw new IllegalArgumentException(
+"after.fields has "
++ after.fields.length
++ " values but sort has "
++ sort.getSort().length);
+  }
+}
+
+this.sort = sort;
+this.numHits = numHits;
+this.after = after;
+this.supportsConcurrency = supportsConcurrency;
+this.hitsThresholdChecker =
+supportsConcurrency
+? HitsThresholdChecker.createShared(Math.max(totalHitsThreshold, 
numHits))
+: HitsThresholdChecker.create(Math.max(totalHitsThreshold, 
numHits));
+this.minScor

Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]

2023-11-08 Thread via GitHub


zacharymorn commented on code in PR #240:
URL: https://github.com/apache/lucene/pull/240#discussion_r1387453179


##
lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java:
##
@@ -44,7 +43,7 @@ public void setScorer(Scorable scorer) throws IOException {
 }

Review Comment:
   See https://github.com/apache/lucene/pull/240#discussion_r1387443316






Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]

2023-11-08 Thread via GitHub


zacharymorn commented on code in PR #240:
URL: https://github.com/apache/lucene/pull/240#discussion_r1387453543


##
lucene/core/src/test/org/apache/lucene/document/BaseSpatialTestCase.java:
##
@@ -695,8 +695,8 @@ protected void verifyRandomDistanceQueries(IndexReader 
reader, Object... shapes)
 }
   }
 
-  private FixedBitSet searchIndex(IndexSearcher s, Query query, int maxDoc) 
throws IOException {
-return s.search(query, FixedBitSetCollector.createManager(maxDoc));
+  private CollectorManager<FixedBitSetCollector, FixedBitSet> searchIndex(int maxDoc) {

Review Comment:
   Ah this change should have been reverted. Fixed. 






Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]

2023-11-08 Thread via GitHub


zacharymorn commented on PR #240:
URL: https://github.com/apache/lucene/pull/240#issuecomment-1803126121

   > > We are still a ways away (from seeing Lucene fully utilize available 
hardware concurrency available at search time to reduce query latencies)
   > 
   > For example: query concurrency is still tied to segment geometry, which is 
insane. An "optimized" (`forceMerge to 1 segment`) index loses all of its 
concurrency! Hideously, at my team (Amazon product search) we had to create a 
neutered merge policy that intentionally avoids "accidentally" making a 
lopsided index (where too many docs are in a single segment) because it harms 
our long pole query latencies (we run all queries concurrently, except close to 
red-line), due to this silly Lucene limitation.
   > 
   > We have a longstanding issue open for this: #9721, and it ought not be so 
difficult to fix because Lucene already has strong support for one thread to 
search a "slice" of a big segment, i.e. we can make the thread work units 
slices of segments instead of whole segments or collections of segments (what 
IndexSearcher now calls slices, confusingly).
   
   Thanks @mikemccand for sharing the additional insight and context! Looks 
like there's already a lot of good discussions in that ticket as well. Will 
definitely take a look at that and see what I can contribute, once most of the 
work for deprecating `IndexSearch#search(Query, Collector)` has been completed. 





Re: [PR] Speed up BytesRefHash#sort (another approach) [lucene]

2023-11-08 Thread via GitHub


gf2121 commented on PR #12784:
URL: https://github.com/apache/lucene/pull/12784#issuecomment-1803144990

   Even faster than the original approach on M2:
   ```
   BASELINE:  sort 5169965 terms, build histogram took: 489ms, reorder took: 
1359ms, total took: 2381ms.
   BASELINE:  sort 5169965 terms, build histogram took: 449ms, reorder took: 
1290ms, total took: 2249ms.
   BASELINE:  sort 5169965 terms, build histogram took: 458ms, reorder took: 
1279ms, total took: 2238ms.
   BASELINE:  sort 5169965 terms, build histogram took: 462ms, reorder took: 
1302ms, total took: 2260ms.
   BASELINE:  sort 5169965 terms, build histogram took: 455ms, reorder took: 
1282ms, total took: 2239ms.
   BASELINE:  sort 5169965 terms, build histogram took: 449ms, reorder took: 
1327ms, total took: 2282ms.
   BASELINE:  sort 5169965 terms, build histogram took: 474ms, reorder took: 
1320ms, total took: 2303ms.
   BASELINE:  sort 5169965 terms, build histogram took: 448ms, reorder took: 
1260ms, total took: 2227ms.
   BASELINE:  sort 5169965 terms, build histogram took: 448ms, reorder took: 
1307ms, total took: 2258ms.
   BASELINE:  sort 5169965 terms, build histogram took: 464ms, reorder took: 
1313ms, total took: 2280ms.
   BASELINE:  sort 5169965 terms, build histogram took: 444ms, reorder took: 
1294ms, total took: 2261ms.
   
   CANDIDATE:  sort 5169965 terms, build histogram took: 549ms, reorder took: 
84ms, total took: 1287ms.
   CANDIDATE:  sort 5169965 terms, build histogram took: 535ms, reorder took: 
83ms, total took: 1205ms.
   CANDIDATE:  sort 5169965 terms, build histogram took: 531ms, reorder took: 
65ms, total took: 1197ms.
   CANDIDATE:  sort 5169965 terms, build histogram took: 552ms, reorder took: 
70ms, total took: 1217ms.
   CANDIDATE:  sort 5169965 terms, build histogram took: 576ms, reorder took: 
73ms, total took: 1255ms.
   CANDIDATE:  sort 5169965 terms, build histogram took: 551ms, reorder took: 
66ms, total took: 1207ms.
   CANDIDATE:  sort 5169965 terms, build histogram took: 551ms, reorder took: 
68ms, total took: 1208ms.
   CANDIDATE:  sort 5169965 terms, build histogram took: 553ms, reorder took: 
85ms, total took: 1240ms.
   CANDIDATE:  sort 5169965 terms, build histogram took: 562ms, reorder took: 
66ms, total took: 1235ms.
   CANDIDATE:  sort 5169965 terms, build histogram took: 557ms, reorder took: 
76ms, total took: 1233ms.
   CANDIDATE:  sort 5169965 terms, build histogram took: 541ms, reorder took: 
66ms, total took: 1204ms.
   ```
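The "build histogram" and "reorder" phases timed above can be illustrated with a self-contained counting-sort kernel, the innermost step of an MSB radix sort. This is only a sketch for readers following the numbers; it is not Lucene's `BytesRefHash`/`StringSorter` code, and the class and method names are invented.

```java
// Counting sort over int keys in [0, 255]: the per-level building block of an
// MSB radix sort. Phase 1 builds a histogram, phase 2 turns it into bucket
// start offsets via prefix sums, phase 3 reorders the keys into their buckets.
public class MsbRadixSketch {
  public static int[] countingSort(int[] keys) {
    int[] histogram = new int[257];
    for (int k : keys) {
      histogram[k + 1]++; // build histogram
    }
    for (int i = 1; i < histogram.length; i++) {
      histogram[i] += histogram[i - 1]; // prefix sums -> bucket start offsets
    }
    int[] out = new int[keys.length];
    for (int k : keys) {
      out[histogram[k]++] = k; // reorder into buckets
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(java.util.Arrays.toString(
        countingSort(new int[] {3, 1, 2, 1, 0}))); // prints [0, 1, 1, 2, 3]
  }
}
```

A full MSB radix sort recurses into each bucket on the next byte and falls back to a comparison sort for small buckets, which is what the "fallback threshold" mentioned later in this thread refers to.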





Re: [PR] Skip docs with Docvalues in NumericLeafComparator [lucene]

2023-11-08 Thread via GitHub


LuXugang merged PR #12405:
URL: https://github.com/apache/lucene/pull/12405





Re: [I] Skip docs with Docvalues in NumericLeafComparator [lucene]

2023-11-08 Thread via GitHub


LuXugang closed issue #12401: Skip docs with Docvalues in NumericLeafComparator
URL: https://github.com/apache/lucene/issues/12401





Re: [I] Multiple ClassNotFoundExceptions in IntelliJ Fat Jar on ARM64 Java 20 [lucene]

2023-11-08 Thread via GitHub


davido commented on issue #12307:
URL: https://github.com/apache/lucene/issues/12307#issuecomment-1803192152

   @uschindler
   
   We are using [Bazel](https://bazel.build) build system, and merging the two 
JARs like this:
   
   
   # Merge jars so
   # META-INF/services/org.apache.lucene.codecs.Codec
   # contains the union of both Codec collections.
   java_binary(
   name = "lucene-core-and-backward-codecs-merged",
   data = ["//lib:LICENSE-Apache2.0"],
   main_class = "NotImportant",
   runtime_deps = [
   # in case of conflict, we want the implementation of backwards-codecs
   # first.
   "@backward-codecs//jar",
   "@lucene-core//jar",
   ],
   )
   
   java_import(
   name = "lucene-core-and-backward-codecs",
   jars = [
   ":lucene-core-and-backward-codecs-merged_deploy.jar",
   ],
   )
   
And it works with JDK11 and JDK17 as expected, but is failing on JDK21:
   
   7) 
offlineReindexForAllAvailableIndicesIsPossibleInSlaveMode(com.google.gerrit.acceptance.pgm.LuceneReindexIT)
   java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.lucene.store.MMapDirectory
at org.apache.lucene.store.FSDirectory.open(FSDirectory.java:161)
at org.apache.lucene.store.FSDirectory.open(FSDirectory.java:156)
at 
com.google.gerrit.lucene.LuceneAccountIndex.dir(LuceneAccountIndex.java:91)
at 
com.google.gerrit.lucene.LuceneAccountIndex.<init>(LuceneAccountIndex.java:105)
at 
com.google.gerrit.lucene.LuceneAccountIndex$$FastClassByGuice$$416172e5.GUICE$TRAMPOLINE()
at 
com.google.gerrit.lucene.LuceneAccountIndex$$FastClassByGuice$$416172e5.apply()
at 
com.google.inject.internal.DefaultConstructionProxyFactory$FastClassProxy.newInstance(DefaultConstructionProxyFactory.java:82)
at 
com.google.inject.internal.ConstructorInjector.provision(ConstructorInjector.java:114)
at 
com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:91)
at 
com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:300)
at com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1148)
at 
com.google.inject.assistedinject.FactoryProvider2.invoke(FactoryProvider2.java:907)
at jdk.proxy2/jdk.proxy2.$Proxy59.create(Unknown Source)
at 
com.google.gerrit.server.index.SingleVersionModule$SingleVersionListener.start(SingleVersionModule.java:98)
at 
com.google.gerrit.server.index.SingleVersionModule$SingleVersionListener.start(SingleVersionModule.java:79)
at 
com.google.gerrit.pgm.init.index.IndexManagerOnInit.start(IndexManagerOnInit.java:40)
at com.google.gerrit.pgm.init.BaseInit.run(BaseInit.java:121)
at 
com.google.gerrit.pgm.util.AbstractProgram.main(AbstractProgram.java:62)
at com.google.gerrit.acceptance.GerritServer.init(GerritServer.java:344)
at 
com.google.gerrit.acceptance.StandaloneSiteTest.beforeTest(StandaloneSiteTest.java:140)
at 
com.google.gerrit.acceptance.StandaloneSiteTest$1.evaluate(StandaloneSiteTest.java:117)
at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.junit.runners.Suite.runChild(Suite.java:128)
at org.junit.runners.Suite.runChild(Suite.java:27)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.junit.runners.Suite.runChild(Suite.java:128)
at org.junit.runners.Suite.runChild(Suite.java:27)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org
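The union merge that the `java_binary` comment above describes can be reproduced in isolation. The codec class names and file names below are examples only, not the exact entries from this report:

```shell
# Two jars each ship META-INF/services/org.apache.lucene.codecs.Codec; the
# merged (fat) jar must contain the union of their lines, or ServiceLoader
# will not see all codecs.
set -e
dir=$(mktemp -d)
printf 'org.apache.lucene.codecs.lucene95.Lucene95Codec\n' > "$dir/core.Codec"
printf 'org.apache.lucene.backward_codecs.lucene90.Lucene90Codec\n' > "$dir/bwc.Codec"
sort -u "$dir/bwc.Codec" "$dir/core.Codec" > "$dir/merged.Codec"
cat "$dir/merged.Codec" # prints both codec lines, backward_codecs first
```

Note that lucene-core is also a multi-release jar: a merge step that preserves the service files but drops or flattens its `META-INF/versions` entries could plausibly leave `MMapDirectory` unable to initialize on newer JDKs, matching the trace above.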

Re: [PR] Enable executing using NFA in RegexpQuery [lucene]

2023-11-08 Thread via GitHub


zhaih commented on code in PR #12767:
URL: https://github.com/apache/lucene/pull/12767#discussion_r1387529653


##
lucene/core/src/test/org/apache/lucene/search/TestRegexpQuery.java:
##
@@ -80,7 +80,10 @@ private long caseInsensitiveRegexQueryNrHits(String regex) 
throws IOException {
 newTerm(regex),
 RegExp.ALL,
 RegExp.ASCII_CASE_INSENSITIVE,
-Operations.DEFAULT_DETERMINIZE_WORK_LIMIT);
+RegexpQuery.DEFAULT_PROVIDER,
+Operations.DEFAULT_DETERMINIZE_WORK_LIMIT,
+MultiTermQuery.CONSTANT_SCORE_BLENDED_REWRITE,
+random().nextBoolean());

Review Comment:
   OK, I added the NFA version to `TestRegexpRandom2`






Re: [PR] Speed up BytesRefHash#sort (another approach) [lucene]

2023-11-08 Thread via GitHub


gf2121 commented on PR #12784:
URL: https://github.com/apache/lucene/pull/12784#issuecomment-1803223146

   As "reorder" gets faster, I'm considering lowering the fallback threshold 
and letting radix sort do more of the work.
   ### Benchmark result:
   
   MAC Intel
   
   ```
   BASELINE:  sort 5169965 terms, build histogram took: 1969ms, reorder took: 
2103ms, total took: 5524ms.
   BASELINE:  sort 5169965 terms, build histogram took: 1975ms, reorder took: 
2118ms, total took: 5435ms.
   BASELINE:  sort 5169965 terms, build histogram took: 1977ms, reorder took: 
2140ms, total took: 5441ms.
   BASELINE:  sort 5169965 terms, build histogram took: 1982ms, reorder took: 
2168ms, total took: 5452ms.
   BASELINE:  sort 5169965 terms, build histogram took: 1987ms, reorder took: 
2164ms, total took: 5475ms.
   BASELINE:  sort 5169965 terms, build histogram took: 1963ms, reorder took: 
2157ms, total took: 5456ms.
   BASELINE:  sort 5169965 terms, build histogram took: 1979ms, reorder took: 
2162ms, total took: 5452ms.
   BASELINE:  sort 5169965 terms, build histogram took: 1984ms, reorder took: 
2172ms, total took: 5493ms.
   BASELINE:  sort 5169965 terms, build histogram took: 1986ms, reorder took: 
2166ms, total took: 5476ms.
   BASELINE:  sort 5169965 terms, build histogram took: 2016ms, reorder took: 
2287ms, total took: 5660ms.
   BASELINE:  sort 5169965 terms, build histogram took: 1761ms, reorder took: 
2109ms, total took: 5183ms.
   
   CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram 
took: 2018ms, reorder took: 125ms, total took: 3760ms.
   CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram 
took: 1998ms, reorder took: 118ms, total took: 3630ms.
   CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram 
took: 1987ms, reorder took: 115ms, total took: 3596ms.
   CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram 
took: 2001ms, reorder took: 111ms, total took: 3622ms.
   CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram 
took: 2017ms, reorder took: 110ms, total took: 3610ms.
   CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram 
took: 2021ms, reorder took: 118ms, total took: 3712ms.
   CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram 
took: 2006ms, reorder took: 108ms, total took: 3634ms.
   CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram 
took: 2015ms, reorder took: 106ms, total took: 3620ms.
   CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram 
took: 2006ms, reorder took: 101ms, total took: 3611ms.
   CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram 
took: 1993ms, reorder took: 115ms, total took: 3609ms.
   CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram 
took: 2024ms, reorder took: 106ms, total took: 3633ms.
   
   CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram 
took: 2972ms, reorder took: 199ms, total took: 3288ms.
   CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram 
took: 2941ms, reorder took: 211ms, total took: 3279ms.
   CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram 
took: 2991ms, reorder took: 178ms, total took: 3283ms.
   CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram 
took: 2932ms, reorder took: 195ms, total took: 3233ms.
   CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram 
took: 2954ms, reorder took: 202ms, total took: 3259ms.
   CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram 
took: 2938ms, reorder took: 201ms, total took: 3249ms.
   CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram 
took: 2949ms, reorder took: 195ms, total took: 3295ms.
   CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram 
took: 2961ms, reorder took: 208ms, total took: 3274ms.
   CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram 
took: 2965ms, reorder took: 201ms, total took: 3257ms.
   CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram 
took: 2957ms, reorder took: 188ms, total took: 3285ms.
   CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram 
took: 2975ms, reorder took: 182ms, total took: 3291ms.
   ```
   
   
   
   MAC M2
   
   ```
   BASELINE:  sort 5169965 terms, build histogram took: 478ms, reorder took: 
1332ms, total took: 2334ms.
   BASELINE:  sort 5169965 terms, build histogram took: 483ms, reorder took: 
1351ms, total took: 2333ms.
   BASELINE:  sort 5169965 terms, build histogram took: 462ms, reorder took: 
1319ms, total took: 2284ms.
   BASELINE:  sort 5169965 terms, build histogram took: 463ms, reorder took: 
1272ms, total took: 2246ms.
   BASELINE:  sort 5169965 terms, build histogram took: 466ms, reorder took: 
1285ms, total took: 2257ms.
   

Re: [PR] speedup arm int functions? [lucene]

2023-11-08 Thread via GitHub


rmuir closed pull request #12743: speedup arm int functions?
URL: https://github.com/apache/lucene/pull/12743





Re: [PR] speedup arm int functions? [lucene]

2023-11-08 Thread via GitHub


rmuir commented on PR #12743:
URL: https://github.com/apache/lucene/pull/12743#issuecomment-1803265618

   speeds up as many machines as it slows down.
   
   cascadelake: `['0', 'GenuineIntel', 'Intel(R) Xeon(R) Platinum 8275CL CPU @ 
3.00GHz', '1', 'GenuineIntel', 'Intel(R) Xeon(R) Platinum 8275CL CPU @ 
3.00GHz']`
   
   main
   ```
   Benchmark   (size)   Mode  Cnt  Score   
Error   Units
   VectorUtilBenchmark.binaryDotProductScalar1024  thrpt   15  2.504 ± 
0.005  ops/us
   ```
   patch
   ```
   Benchmark   (size)   Mode  Cnt  Score   
Error   Units
   VectorUtilBenchmark.binaryDotProductScalar1024  thrpt   15  1.850 ± 
0.013  ops/us
   ```
   
   graviton2: `['0', '1']`
   
   main
   ```
   Benchmark   (size)   Mode  Cnt  Score   
Error   Units
   VectorUtilBenchmark.binaryDotProductScalar1024  thrpt   15  2.387 ± 
0.001  ops/us
   ```
   patch
   ```
   Benchmark   (size)   Mode  Cnt  Score   
Error   Units
   VectorUtilBenchmark.binaryDotProductScalar1024  thrpt   15  2.121 ± 
0.004  ops/us
   ```
   
   graviton3: `['0', '1']`
   
   main
   ```
   Benchmark   (size)   Mode  Cnt  Score   
Error   Units
   VectorUtilBenchmark.binaryDotProductScalar  1024  thrpt   15  2.370 ± 0.002  ops/us
   ```
   patch
   ```
   Benchmark   (size)   Mode  Cnt  Score   Error   Units
   VectorUtilBenchmark.binaryDotProductScalar  1024  thrpt   15  3.346 ± 0.025  ops/us
   ```
   
   haswell: `['0', 'GenuineIntel', 'Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz', '1', 'GenuineIntel', 'Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz']`
   
   main
   ```
   Benchmark   (size)   Mode  Cnt  Score   Error   Units
   VectorUtilBenchmark.binaryDotProductScalar  1024  thrpt   15  2.096 ± 0.016  ops/us
   ```
   patch
   ```
   Benchmark   (size)   Mode  Cnt  Score   Error   Units
   VectorUtilBenchmark.binaryDotProductScalar  1024  thrpt   15  1.678 ± 0.039  ops/us
   ```
   
   icelake: `['0', 'GenuineIntel', 'Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz', '1', 'GenuineIntel', 'Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz']`
   
   main
   ```
   Benchmark   (size)   Mode  Cnt  Score   Error   Units
   VectorUtilBenchmark.binaryDotProductScalar  1024  thrpt   15  2.586 ± 0.010  ops/us
   ```
   patch
   ```
   Benchmark   (size)   Mode  Cnt  Score   Error   Units
   VectorUtilBenchmark.binaryDotProductScalar  1024  thrpt   15  2.240 ± 0.006  ops/us
   ```
   
   sapphirerapids: `['0', 'GenuineIntel', 'Intel(R) Xeon(R) Platinum 8488C', '1', 'GenuineIntel', 'Intel(R) Xeon(R) Platinum 8488C']`
   
   main
   ```
   Benchmark   (size)   Mode  Cnt  Score   Error   Units
   VectorUtilBenchmark.binaryDotProductScalar  1024  thrpt   15  2.408 ± 0.009  ops/us
   ```
   patch
   ```
   Benchmark   (size)   Mode  Cnt  Score   Error   Units
   VectorUtilBenchmark.binaryDotProductScalar  1024  thrpt   15  2.729 ± 0.058  ops/us
   ```
   
   zen2: `['0', 'AuthenticAMD', 'AMD EPYC 7R32', '1', 'AuthenticAMD', 'AMD EPYC 7R32']`
   
   main
   ```
   Benchmark   (size)   Mode  Cnt  Score   Error   Units
   VectorUtilBenchmark.binaryDotProductScalar  1024  thrpt   15  1.544 ± 0.004  ops/us
   ```
   patch
   ```
   Benchmark   (size)   Mode  Cnt  Score   Error   Units
   VectorUtilBenchmark.binaryDotProductScalar  1024  thrpt   15  1.435 ± 0.004  ops/us
   ```
   
   zen3: `['0', 'AuthenticAMD', 'AMD EPYC 7R13 Processor', '1', 'AuthenticAMD', 'AMD EPYC 7R13 Processor']`
   
   main
   ```
   Benchmark   (size)   Mode  Cnt  Score   Error   Units
   VectorUtilBenchmark.binaryDotProductScalar  1024  thrpt   15  1.580 ± 0.002  ops/us
   ```
   patch
   ```
   Benchmark   (size)   Mode  Cnt  Score   Error   Units
   VectorUtilBenchmark.binaryDotProductScalar  1024  thrpt   15  1.791 ± 0.004  ops/us
   ```
   
   zen4: `['0', 'AuthenticAMD', 'AMD EPYC 9R14', '1', 'AuthenticAMD', 'AMD EPYC 9R14']`
   
   main
   ```
   Benchmark   (size)   Mode  Cnt  Score   Error   Units
   VectorUtilBenchmark.binaryDotProductScalar  1024  thrpt   15  1.856 ± 0.002  ops/us
   ```
   patch
   ```
   Benchmark   (size)   Mode  Cnt  Score   Error   Units
   VectorUtilBenchmark.binaryDotProductScalar  1024  thrpt   15  2.364 ± 0.007  ops/us
   ```
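
For readers skimming the numbers above: a minimal, illustrative sketch of the kind of scalar `byte[]` dot product this benchmark exercises. The class name `ScalarDot` is made up here; the actual Lucene implementation lives in `VectorUtil` and may differ.

```java
public class ScalarDot {
  // Plain scalar loop over signed bytes: this is the baseline shape that
  // the binaryDotProductScalar benchmark measures across CPU types.
  static int dotProduct(byte[] a, byte[] b) {
    int total = 0;
    for (int i = 0; i < a.length; i++) {
      total += a[i] * b[i];
    }
    return total;
  }
}
```

Small changes to a loop like this can auto-vectorize differently on each microarchitecture, which is why the same patch speeds up zen4 but slows down haswell above.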



[PR] script to run microbenchmarks across different ec2 instance types [lucene]

2023-11-08 Thread via GitHub


rmuir opened a new pull request, #12787:
URL: https://github.com/apache/lucene/pull/12787

   This saves me a lot of time and prevents making bad changes that help some CPUs and hurt others.
   
   Case in point: #12743
   
   You run a command such as:
   ```
   make PATCH_BRANCH=rmuir:some-speedup
   ```
   
   The benchmark procedure is run across different instance types in parallel. "Large" instances are used; although I try to keep it inexpensive, it is Java and needs RAM.
   
   At the end, you get a combined `build/report.txt` which can be pasted into a 
PR comment.
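
As a rough illustration of that last step — purely a sketch, assuming the per-instance JMH outputs have already been collected into strings; the actual script in this PR may assemble `build/report.txt` differently:

```java
import java.util.Map;
import java.util.TreeMap;

public class CombineReports {
  // Illustrative only: combine per-instance-type benchmark output into a
  // single report, one section per instance type, sorted by type name.
  static String combine(Map<String, String> perInstanceOutput) {
    StringBuilder report = new StringBuilder();
    for (Map.Entry<String, String> e : new TreeMap<>(perInstanceOutput).entrySet()) {
      report.append(e.getKey()).append(":\n");
      report.append(e.getValue()).append("\n\n");
    }
    return report.toString();
  }
}
```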





Re: [PR] Speed up BytesRefHash#sort (another approach) [lucene]

2023-11-08 Thread via GitHub


gf2121 commented on PR #12784:
URL: https://github.com/apache/lucene/pull/12784#issuecomment-1803319176

   I tried this approach with `wikimedium10m` on an M2 Mac; the "sort took" sum decreased by ~60%.
   
Details 
   
   
   flush round | sort field name | sort term count | baseline | candidate | diff
   -- | -- | -- | -- | -- | --
   1 | date | 45925 | 32 | 15 | -53.13%
   1 | groupend | 1 | 0 | 0 | #DIV/0!
   1 | titleTokenized | 33250 | 10 | 6 | -40.00%
   1 | id | 989620 | 356 | 167 | -53.09%
   1 | body | 3444798 | 1991 | 698 | -64.94%
   1 | title | 46261 | 10 | 7 | -30.00%
   2 | date | 84517 | 52 | 21 | -59.62%
   2 | groupend | 1 | 0 | 0 | #DIV/0!
   2 | titleTokenized | 50134 | 15 | 7 | -53.33%
   2 | id | 1059900 | 680 | 212 | -68.82%
   2 | body | 3210318 | 1730 | 660 | -61.85%
   2 | title | 85009 | 22 | 13 | -40.91%
   3 | date | 86188 | 37 | 21 | -43.24%
   3 | groupend | 1 | 0 | 0 | #DIV/0!
   3 | titleTokenized | 60502 | 14 | 9 | -35.71%
   3 | id | 1011210 | 397 | 185 | -53.40%
   3 | body | 3389689 | 1865 | 692 | -62.90%
   3 | title | 86756 | 24 | 15 | -37.50%
   4 | date | 123222 | 76 | 39 | -48.68%
   4 | groupend | 1 | 0 | 0 | #DIV/0!
   4 | titleTokenized | 84573 | 23 | 16 | -30.43%
   4 | id | 1030670 | 533 | 205 | -61.54%
   4 | body | 3442602 | 1985 | 717 | -63.88%
   4 | title | 124156 | 41 | 24 | -41.46%
   5 | date | 150957 | 85 | 38 | -55.29%
   5 | groupend | 1 | 0 | 0 | #DIV/0!
   5 | titleTokenized | 100910 | 37 | 16 | -56.76%
   5 | id | 1045550 | 595 | 209 | -64.87%
   5 | body | 3267300 | 2031 | 689 | -66.08%
   5 | title | 152061 | 58 | 32 | -44.83%
   6 | date | 162374 | 77 | 41 | -46.75%
   6 | groupend | 1 | 0 | 0 | #DIV/0!
   6 | titleTokenized | 107464 | 30 | 17 | -43.33%
   6 | id | 1057430 | 385 | 194 | -49.61%
   6 | body | 3140161 | 1834 | 633 | -65.49%
   6 | title | 163347 | 68 | 36 | -47.06%
   7 | date | 177647 | 116 | 45 | -61.21%
   7 | groupend | 1 | 0 | 0 | #DIV/0!
   7 | titleTokenized | 120037 | 59 | 18 | -69.49%
   7 | id | 1062920 | 652 | 211 | -67.64%
   7 | body | 3173313 | 2373 | 638 | -73.11%
   7 | title | 178886 | 79 | 37 | -53.16%
   8 | date | 199744 | 106 | 50 | -52.83%
   8 | groupend | 1 | 0 | 0 | #DIV/0!
   8 | titleTokenized | 133801 | 42 | 21 | -50.00%
   8 | id | 1072970 | 409 | 193 | -52.81%
   8 | body | 3090582 | 1817 | 641 | -64.72%
   8 | title | 201130 | 84 | 45 | -46.43%
   9 | date | 209298 | 104 | 62 | -40.38%
   9 | groupend | 1 | 0 | 0 | #DIV/0!
   9 | titleTokenized | 142697 | 50 | 25 | -50.00%
   9 | id | 1084580 | 571 | 225 | -60.60%
   9 | body | 3032046 | 1642 | 631 | -61.57%
   9 | title | 210958 | 92 | 46 | -50.00%
   10 | date | 119014 | 47 | 30 | -36.17%
   10 | groupend | 1 | 0 | 0 | #DIV/0!
   10 | titleTokenized | 93308 | 26 | 16 | -38.46%
   10 | id | 585150 | 208 | 111 | -46.63%
   10 | body | 1909153 | 918 | 380 | -58.61%
   10 | title | 119806 | 50 | 27 | -46.00%
   sort took sum |   |   | 24538 | 9086 | -62.97%
   
   
   
   





Re: [I] Multiple ClassNotFoundExceptions in IntelliJ Fat Jar on ARM64 Java 20 [lucene]

2023-11-08 Thread via GitHub


uschindler commented on issue #12307:
URL: https://github.com/apache/lucene/issues/12307#issuecomment-1803320694

   Hi,
   this has nothing to do with the minimum Java version of your Java build; it only has to do with the runtime Java version. If you use Java 19 or later, the classloader needs to be able to find the version-specific support classes for MemorySegment (foreign API) and the Vector API in the JAR file, which the error clearly says:
   
   ```
   Caused by: java.lang.ExceptionInInitializerError: Exception java.lang.LinkageError: MemorySegmentIndexInputProvider is missing in Lucene JAR file [in thread "main"]
   ```
   
   Lucene's JAR files are modularized (https://openjdk.org/jeps/261) MR-JAR (https://openjdk.org/jeps/238) files with service providers (SPI, https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/ServiceLoader.html).
   
   If you want to merge different JAR files and Lucene's JAR files are part of it, please read ALL three specs above and then ensure the following:
   - All files in `META-INF/services` _with the same name need to be merged together_ and placed in the resulting JAR (https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/ServiceLoader.html)
   - All files in `META-INF/versions` need to be copied to the resulting JAR with exactly the same directory structure. Those files are the version-specific classes and are only loaded depending on the runtime JDK (https://openjdk.org/jeps/238).
   - `META-INF/MANIFEST.MF` must contain `Multi-Release: true` (https://openjdk.org/jeps/238).
   - All class files named `module-info.class` should be removed, as the resulting JAR file is no longer modularized. Keeping them inside may fail to load the application in a modularized runtime (https://openjdk.org/jeps/261).
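
As a hedged sketch of the first point above (merging same-named `META-INF/services` files), assuming the files have already been read into lists of lines — the class `MergeServices` is invented for illustration and is not part of any build tool:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class MergeServices {
  // Merge the lines of same-named META-INF/services files coming from
  // several input JARs: skip comment and blank lines, drop duplicate
  // provider entries, and preserve first-seen order.
  static List<String> mergeServiceFiles(List<List<String>> perJarLines) {
    LinkedHashSet<String> providers = new LinkedHashSet<>();
    for (List<String> lines : perJarLines) {
      for (String line : lines) {
        String provider = line.trim();
        if (!provider.isEmpty() && !provider.startsWith("#")) {
          providers.add(provider);
        }
      }
    }
    return new ArrayList<>(providers);
  }
}
```

The merged lines would then be written to a single file of the same name under `META-INF/services` in the fat JAR.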

