[GitHub] [lucene] jpountz closed pull request #752: LUCENE-10474: Avoid throwing StackOverflowError when creating RegExp

2023-09-06 Thread via GitHub


jpountz closed pull request #752: LUCENE-10474: Avoid throwing 
StackOverflowError when creating RegExp
URL: https://github.com/apache/lucene/pull/752


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #752: LUCENE-10474: Avoid throwing StackOverflowError when creating RegExp

2023-09-06 Thread via GitHub


jpountz commented on PR #752:
URL: https://github.com/apache/lucene/pull/752#issuecomment-1708013295

   This was addressed in https://github.com/apache/lucene/pull/12462.





[GitHub] [lucene] cpoerschke opened a new pull request, #12540: clarify QueryVisitor.acceptField javadoc w.r.t. not being term-specific

2023-09-06 Thread via GitHub


cpoerschke opened a new pull request, #12540:
URL: https://github.com/apache/lucene/pull/12540

   ### Description
   
   Adjusting javadocs as suggested by @romseygeek in 
https://github.com/apache/lucene/issues/12538#issuecomment-1706747711 issue 
comment.





[GitHub] [lucene] cpoerschke closed pull request #12539: remove visitor.acceptField guard for visitor.visitLeaf calls

2023-09-06 Thread via GitHub


cpoerschke closed pull request #12539: remove visitor.acceptField guard for 
visitor.visitLeaf calls
URL: https://github.com/apache/lucene/pull/12539





[GitHub] [lucene] cpoerschke commented on pull request #12539: remove visitor.acceptField guard for visitor.visitLeaf calls

2023-09-06 Thread via GitHub


cpoerschke commented on PR #12539:
URL: https://github.com/apache/lucene/pull/12539#issuecomment-1708043810

   closing in favour of #12540





[GitHub] [lucene] cpoerschke merged pull request #12540: clarify QueryVisitor.acceptField javadoc w.r.t. not being term-specific

2023-09-06 Thread via GitHub


cpoerschke merged PR #12540:
URL: https://github.com/apache/lucene/pull/12540





[GitHub] [lucene] cpoerschke closed issue #12538: clarify QueryVisitor.visitLeaf interaction with QueryVisitor.acceptField

2023-09-06 Thread via GitHub


cpoerschke closed issue #12538: clarify QueryVisitor.visitLeaf interaction with 
QueryVisitor.acceptField
URL: https://github.com/apache/lucene/issues/12538





[GitHub] [lucene] benwtrent commented on issue #12440: Make HNSW merges faster

2023-09-06 Thread via GitHub


benwtrent commented on issue #12440:
URL: https://github.com/apache/lucene/issues/12440#issuecomment-1708370161

   > For example, if we have segment 1,2,3,4 wants to merge and form a new 
segment, can we just leave the HNSW graphs as-is, or we only merge the smaller 
ones and leave the big ones as-is. And when we do searching, we can do a greedy 
search (always go to the closest node, whichever graph), or we can just let 
user choose to use multithreading to exchange for the latency?
   
   This would be tricky. Search cost scales linearly with the number of graphs, 
and technically recall is higher when searching multiple graphs than a single 
one. It seems that the `efSearch` (lucene's `k`) per nearest neighbor search 
should be adjusted internally if this is the approach taken. Thinking further, 
it seems like a nice experiment to run regardless, to see how multi-segment 
search speed can be improved (potentially by carrying the minimum matching 
score or the candidates over between graph searches...)
   
   @zhaih In short, I think this has merit, but we could look at how it can 
improve multi-segment search by keeping track of min_score or dynamically 
adjusting `efSearch` when multiple segments are being searched.
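
   A minimal sketch of the "keep track of min_score" idea, under loud 
assumptions: each segment is reduced to a plain array of similarity scores 
instead of an HNSW graph, and `SharedMinScoreSketch` and `topK` are 
hypothetical names, not Lucene APIs. It only shows how a single bounded heap 
shared across segment searches lets later segments reject candidates that 
cannot beat the current k-th best score.

   ```java
   import java.util.PriorityQueue;

   public final class SharedMinScoreSketch {
     /** Returns the k best scores across all segments, in ascending order. */
     static float[] topK(float[][] segmentScores, int k) {
       if (k <= 0) {
         return new float[0];
       }
       // Min-heap: the head is the weakest hit kept so far (the shared min score).
       PriorityQueue<Float> heap = new PriorityQueue<>();
       for (float[] segment : segmentScores) {
         for (float score : segment) {
           if (heap.size() < k) {
             heap.add(score);
           } else if (score > heap.peek()) {
             // Threshold carried over from the segments searched earlier.
             heap.poll();
             heap.add(score);
           }
         }
       }
       float[] out = new float[heap.size()];
       for (int i = 0; i < out.length; i++) {
         out[i] = heap.poll();
       }
       return out;
     }
   }
   ```

   In a real graph search the shared threshold would prune graph traversal 
rather than a linear scan; how that interacts with `efSearch` is exactly the 
open question above.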
   
   > An idea, instead of trying to merge the subgraph, is to do a union of 
subgraphs:
   When we merge, we build a disconnected graph which is the union of all the 
segment graphs.
   
   @mbrette 
   
   🤔 I am not sure about this. It seems to me that allowing the graph to be 
disconnected like this (pure union) would be very difficult to keep track of 
within the Codec. We would have to track not only each node's doc_id, but also 
which graph it belongs to, etc. This would cause significant overhead and seems 
like it wouldn't be worth the effort.
   
   > It's worth exploring some variation of this in my opinion.
   
   I agree @mbrette, which is why I linked the paper :). I think if we do 
something like nnDescent while still adhering to neighborhood diversity, it 
could help. HNSW is fairly resilient and I wouldn't expect horrific recall 
changes.
   
   One tricky question there is when to promote or demote a node to a higher or 
lower level in the graph. I am not sure nodes should simply stay at their 
levels as graph size increases.
   
   I haven't thought about what the logic could be for determining which NSW 
get put on which layer. It might be as simple as re-running the probabilistic 
level assignment for every node. Since nodes at higher levels are also on the 
lower levels, we could still get a significant performance improvement, as we 
would only have to "start from scratch" on nodes that are promoted from a lower 
level to a higher level (which is very rare). Demoted nodes could still keep 
their previous connections (or whatever percentage of those connections our 
implementation allows).
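
   A minimal sketch of re-running the level assignment, assuming the usual 
HNSW draw `floor(-ln(u) * mL)` with `mL = 1 / ln(M)`. The class and method 
names (`LevelReassignmentSketch`, `levelFor`, `countPromotions`) and the 
`maxConn` parameter are hypothetical, not Lucene's API. It re-draws a level 
for every node and counts the promotions, i.e. the nodes that would need 
connections built from scratch.

   ```java
   import java.util.SplittableRandom;

   public final class LevelReassignmentSketch {
     /** Standard HNSW level draw for a uniform sample u in (0, 1]. */
     static int levelFor(double u, double ml) {
       return (int) (-Math.log(u) * ml);
     }

     /** Re-draws a level for every node and counts how many would be promoted. */
     static int countPromotions(int[] oldLevels, int maxConn, long seed) {
       double ml = 1.0 / Math.log(maxConn);
       SplittableRandom random = new SplittableRandom(seed);
       int promoted = 0;
       for (int oldLevel : oldLevels) {
         double u;
         do {
           u = random.nextDouble();
         } while (u == 0.0); // avoid ln(0)
         if (levelFor(u, ml) > oldLevel) {
           promoted++; // only these nodes would need new connections from scratch
         }
       }
       return promoted;
     }
   }
   ```

   Demotions are not counted here because, per the paragraph above, demoted 
nodes could simply keep (some of) their existing connections.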
   
   > What is the current way to measure Lucene knn recall/performance this days 
?
   @mbrette 
   
   I tried to reproduce your test from 
https://github.com/apache/lucene/issues/11354#issuecomment-1239961308, but was 
not able to (recall = 0.574).
   
   [Lucene Util](https://github.com/mikemccand/luceneutil) is the way to do 
this. What issues are you having? It can be sort of frustrating to get up and 
running, but once you have the flow down and it is built, it's useful. I will 
happily help with this or even run what benchmarking I can myself once you 
have some POC work.
   





[GitHub] [lucene] Tony-X opened a new pull request, #12541: Document why we need `lastPosBlockOffset`

2023-09-06 Thread via GitHub


Tony-X opened a new pull request, #12541:
URL: https://github.com/apache/lucene/pull/12541

   ### Description
   
   
   





[GitHub] [lucene] Tony-X commented on issue #12536: Remove `lastPosBlockOffset` from term metadata for Lucene90PostingsFormat

2023-09-06 Thread via GitHub


Tony-X commented on issue #12536:
URL: https://github.com/apache/lucene/issues/12536#issuecomment-1708839963

   https://github.com/apache/lucene/pull/12541 adds more comments about 
`lastPosBlockOffset`





[GitHub] [lucene] jainankitk commented on issue #12527: Optimize readInts24 performance for DocIdsWriter

2023-09-06 Thread via GitHub


jainankitk commented on issue #12527:
URL: https://github.com/apache/lucene/issues/12527#issuecomment-1708857931

   @mikemccand - Thanks for sharing the numbers. This is a truly surprising 
result. Even though the impact of this small change is not positive, it is 
significant enough to explore areas of improvement here. I am thinking of 
trying out a couple of things:
   
   - Update the patch to use a scratch array, as is done for int, and rerun the 
benchmark:
   ```
   diff --git a/lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java b/lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java
   index 40db4c0069d..64ed9b84084 100644
   --- a/lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java
   +++ b/lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java
   @@ -35,9 +35,11 @@ final class DocIdsWriter {
      private static final byte LEGACY_DELTA_VINT = (byte) 0;
    
      private final int[] scratch;
   +  private final long[] scratchLong;
    
      DocIdsWriter(int maxPointsInLeaf) {
        scratch = new int[maxPointsInLeaf];
   +    scratchLong = new long[(maxPointsInLeaf / 8) * 3];
      }
    
      void writeDocIds(int[] docIds, int start, int count, DataOutput out) throws IOException {
   @@ -236,12 +238,14 @@ final class DocIdsWriter {
        }
      }
    
   -  private static void readInts24(IndexInput in, int count, int[] docIDs) throws IOException {
   +  private void readInts24(IndexInput in, int count, int[] docIDs) throws IOException {
   +    in.readLongs(scratchLong, 0, (count/8) * 3);
        int i;
        for (i = 0; i < count - 7; i += 8) {
   -      long l1 = in.readLong();
   -      long l2 = in.readLong();
   -      long l3 = in.readLong();
   +      int li = (i/8) * 3;
   +      long l1 = scratchLong[li];
   +      long l2 = scratchLong[li+1];
   +      long l3 = scratchLong[li+2];
          docIDs[i] = (int) (l1 >>> 40);
          docIDs[i + 1] = (int) (l1 >>> 16) & 0xffffff;
          docIDs[i + 2] = (int) (((l1 & 0xffff) << 8) | (l2 >>> 56));
   @@ -323,13 +327,15 @@ final class DocIdsWriter {
        }
      }
    
   -  private static void readInts24(IndexInput in, int count, IntersectVisitor visitor)
   +  private void readInts24(IndexInput in, int count, IntersectVisitor visitor)
          throws IOException {
   +    in.readLongs(scratchLong, 0, (count/8) * 3);
        int i;
        for (i = 0; i < count - 7; i += 8) {
   -      long l1 = in.readLong();
   -      long l2 = in.readLong();
   -      long l3 = in.readLong();
   +      int li = (i/8) * 3;
   +      long l1 = scratchLong[li];
   +      long l2 = scratchLong[li+1];
   +      long l3 = scratchLong[li+2];
          visitor.visit((int) (l1 >>> 40));
          visitor.visit((int) (l1 >>> 16) & 0xffffff);
          visitor.visit((int) (((l1 & 0xffff) << 8) | (l2 >>> 56)));
   ```
   
   - If performance still regresses after this, we can try removing the scratch 
array even for `readInts32`. I am not sure, though, whether the benchmark has 
sufficient coverage for both types of docIds.
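
   For reference, a self-contained sketch of the 24-bit decoding that the 
patch keeps unchanged: eight doc IDs below 2^24 occupy three longs (8 x 24 = 
3 x 64 bits). `Ints24Sketch`, `pack8`, `unpack` and `roundTrip` are 
hypothetical names; the packing side is an assumption inferred from the 
decoding arithmetic, and only complete groups of 8 are handled (the remainder 
path is omitted).

   ```java
   public final class Ints24Sketch {
     /** Packs 8 values below 2^24 into 3 longs, in the layout the decoder expects. */
     static void pack8(int[] v, int off, long[] out, int outOff) {
       out[outOff] = ((long) v[off] << 40) | ((long) v[off + 1] << 16) | (v[off + 2] >>> 8);
       out[outOff + 1] =
           ((long) (v[off + 2] & 0xff) << 56)
               | ((long) v[off + 3] << 32)
               | ((long) v[off + 4] << 8)
               | (v[off + 5] >>> 16);
       out[outOff + 2] =
           ((long) (v[off + 5] & 0xffff) << 48) | ((long) v[off + 6] << 24) | v[off + 7];
     }

     /** Decodes complete groups of 8 from a pre-filled scratch array of longs. */
     static void unpack(long[] scratch, int count, int[] docIDs) {
       for (int i = 0; i < count - 7; i += 8) {
         int li = (i / 8) * 3;
         long l1 = scratch[li];
         long l2 = scratch[li + 1];
         long l3 = scratch[li + 2];
         docIDs[i] = (int) (l1 >>> 40);
         docIDs[i + 1] = (int) (l1 >>> 16) & 0xffffff;
         docIDs[i + 2] = (int) (((l1 & 0xffff) << 8) | (l2 >>> 56));
         docIDs[i + 3] = (int) (l2 >>> 32) & 0xffffff;
         docIDs[i + 4] = (int) (l2 >>> 8) & 0xffffff;
         docIDs[i + 5] = (int) (((l2 & 0xff) << 16) | (l3 >>> 48));
         docIDs[i + 6] = (int) (l3 >>> 24) & 0xffffff;
         docIDs[i + 7] = (int) l3 & 0xffffff;
       }
     }

     /** Packs and decodes every complete group of 8 in values. */
     static int[] roundTrip(int[] values) {
       int groups = values.length / 8;
       long[] scratch = new long[groups * 3];
       for (int g = 0; g < groups; g++) {
         pack8(values, g * 8, scratch, g * 3);
       }
       int[] decoded = new int[groups * 8];
       unpack(scratch, groups * 8, decoded);
       return decoded;
     }
   }
   ```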





[GitHub] [lucene] snow-lily-warner commented on issue #11537: StackOverflow when RegExp encounters a very large string [LUCENE-10501]

2023-09-06 Thread via GitHub


snow-lily-warner commented on issue #11537:
URL: https://github.com/apache/lucene/issues/11537#issuecomment-1709027590

   Given that the version would include a fix that is tracked as a security 
vulnerability, can there be some commitment to a timeline? Or possibly 
cherry-pick to a minor release 9.7.1?
   
   
   
   
   From: Patrick Zhai ***@***.***>
   Sent: Wednesday, September 6, 2023, 9:05 PM
   To: apache/lucene ***@***.***>
   Cc: Lily Warner ***@***.***>; Comment ***@***.***>
   Subject: Re: [apache/lucene] StackOverflow when RegExp encounters a very 
large string [LUCENE-10501] (Issue #11537)
   
   There's no fixed schedule; we usually release after some amount of changes 
are checked in. IIRC the last release was in July.
   
   On Wed, Sep 6, 2023, 13:02 thomas-warner ***@***.***> wrote:
   
   > @zhaih when is lucene 9.8 expected to be released?
   
   

