jainankitk commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2917542707
Thanks @benwtrent and @weizijun for seeing this through!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
U
weizijun commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2914720940
Here are the statistics of 100w hnsw graphs, with m = 16 and ef = 100:
Level count = 5:
```
level: 0, node count: 100
level: 1, node count: 62835
level: 2, node count:
benwtrent merged PR #14527:
URL: https://github.com/apache/lucene/pull/14527
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.a
weizijun commented on code in PR #14527:
URL: https://github.com/apache/lucene/pull/14527#discussion_r2109476565
##
.gitignore:
##
@@ -32,3 +32,10 @@ __pycache__
# SDKMAN
.sdkmanrc
+
+# Java class files
+*.class
+
+# Ignore bin directories
+bin/
+**/bin/
Review Comment:
benwtrent commented on code in PR #14527:
URL: https://github.com/apache/lucene/pull/14527#discussion_r2109471013
##
.gitignore:
##
@@ -32,3 +32,10 @@ __pycache__
# SDKMAN
.sdkmanrc
+
+# Java class files
+*.class
+
+# Ignore bin directories
+bin/
+**/bin/
Review Comment:
weizijun commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2863408186
> I don't understand this. How does the FloatArrayList bypass the
overallocation possible via `ArrayUtils.grow`? Its the same structure right?
and for every node there is a score.
benwtrent commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2863042916
> but the float array of the scores will not be passed out. Do I need to
change FloatArrayList to MaxSizedFloatArrayList?
I don't understand this. How does the FloatArrayList byp
weizijun commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2861178499
> You need one for scores as well. That `MaxSizedIntArrayList` looks good :)
I didn't add the MaxSizedFloatArrayList because the int array will be passed
out via the nodes() metho
benwtrent commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2859301884
> I add the MaxSizedIntArrayList class to solve oversize problem.
You need one for scores as well. That `MaxSizedIntArrayList` looks good :)
--
This is an automated message fro
weizijun commented on code in PR #14527:
URL: https://github.com/apache/lucene/pull/14527#discussion_r2076987925
##
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##
@@ -32,13 +33,15 @@
public class NeighborArray {
private final boolean scoresDescOrder;
weizijun commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2857434676
> The underlying structures utilize `ArrayUtil.grow` to ensure capacity.
This means its very easy to overshoot the maximum size. This is why I was
saying we should use a new structure d
benwtrent commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2854224239
> I use the maxSize variable to ensure that the elementsCount of the array
cannot exceed the maxSize value.
The underlying structures utilize `ArrayUtil.grow` to ensure capacity.
weizijun commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2853389550
hi, @benwtrent, I updated the ramBytesUsed method, the memory is correct now.
> Please also adjust the inner arrays to enforce their maximal length. This
way we never over-allocat
benwtrent commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2847706746
@weizijun
It should be fixed. Having an estimation that is more than 2x off is pretty
bad.
this estimation is used to determine how often flushes should occur, etc.
weizijun commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2841473318
The failed TestHnswFloatVectorGraph.testRamUsageEstimate detail:
The seed is `-Dtests.seed=FEF2FF50C6824DEF`
```
Expected :1723096.0
Actual :3785393.0
jav
weizijun commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2841464168
I added the change log and changed the initial size of the list to maxSize/8
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to Gi
msokolov commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2828649522
re: concurrency; in theory it should be safe since we lock a row before
inserting anything into it. Consider that even with fixed-size arrays we need
to track the occupancy (how full t
benwtrent commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2828219136
> Ok, and about the OnHeapHnswGraph.ramBytesUsed, I didn't change to code.
But it maybe need to change the logic.
I think utilizing the underlying array estimations is what we ne
benwtrent commented on code in PR #14527:
URL: https://github.com/apache/lucene/pull/14527#discussion_r2058763024
##
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##
@@ -32,13 +33,15 @@
public class NeighborArray {
private final boolean scoresDescOrder
weizijun commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2828149196
> I left a comment on the underlying structures used.
>
> Please update CHANGES.txt under 10.3 optimizations for this nice
optimization! It will be very nice to have better heap u
weizijun commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2827978234
@benwtrent I see the default parameters:
```
"numMergeWorker": (12,),
"numMergeThread": (4,),
```
Is the current merger effective with these parameters?
--
This
benwtrent commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2827970657
@weizijun I am running some benchmarking as well. But the key thing is to
update the parameters in `knnPerfTest.py` making sure `numMergeWorker` and
`numMergeThread` are greater than `
weizijun commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2827261485
> We need to make sure that there are no significant performance or
concurrency bugs introduced with this. Could you test with
https://github.com/mikemccand/luceneutil to verify recall,
jainankitk commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2822073026
> I am just being cautious. I know we provide locking, etc. for the
concurrent graph builder, but maybe its taking advantage of arrays never
growing by accident.
Thanks for ela
weizijun commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2821532658
Yes, maybe it's not suitable for concurrent graph builders. I tested this
case in elasticsearch and it worked fine.
Probably because in elasticsearch, the merge of graph builders is i
benwtrent commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2821413727
> Do you have anything specific in mind, that I might be missing?
I am just being cautious. I know we provide locking, etc. for the concurrent
graph builder, but maybe its taking
benwtrent commented on code in PR #14527:
URL: https://github.com/apache/lucene/pull/14527#discussion_r2054172393
##
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##
@@ -32,13 +33,15 @@
public class NeighborArray {
private final boolean scoresDescOrder
jainankitk commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2820142632
> We need to make sure that there are no significant performance or
concurrency bugs introduced with this. Could you test with
https://github.com/mikemccand/luceneutil to verify recal
jainankitk commented on code in PR #14527:
URL: https://github.com/apache/lucene/pull/14527#discussion_r2053369115
##
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##
@@ -32,13 +33,15 @@
public class NeighborArray {
private final boolean scoresDescOrde
benwtrent commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2818418948
We need to make sure that there are no significant performance or
concurrency bugs introduced with this. Could you test with
https://github.com/mikemccand/luceneutil to verify recall,
weizijun commented on PR #14527:
URL: https://github.com/apache/lucene/pull/14527#issuecomment-2817996406
The TestHnswFloatVectorGraph.testRamUsageEstimate maybe failed, because the
OnHeapHnswGraph.ramBytesUsed use the fixed array size to calculate the ram
value.
--
This is an automated
weizijun opened a new pull request, #14527:
URL: https://github.com/apache/lucene/pull/14527
When bbq is used with lucene, one datanode can contain more data.
So when more shards are merged concurrently, there will be a problem of very
high heap memory size.
I found that the NeighborAr
32 matches
Mail list logo