Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-05-28 Thread via GitHub
jainankitk commented on PR #14527: URL: https://github.com/apache/lucene/pull/14527#issuecomment-2917542707 Thanks @benwtrent and @weizijun for seeing this through! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the U

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-05-27 Thread via GitHub
weizijun commented on PR #14527: URL: https://github.com/apache/lucene/pull/14527#issuecomment-2914720940 Here are the statistics of 100w hnsw graphs, with m = 16 and ef = 100: Level count = 5: ``` level: 0, node count: 100 level: 1, node count: 62835 level: 2, node count:

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-05-27 Thread via GitHub
benwtrent merged PR #14527: URL: https://github.com/apache/lucene/pull/14527 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.a

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-05-27 Thread via GitHub
weizijun commented on code in PR #14527: URL: https://github.com/apache/lucene/pull/14527#discussion_r2109476565 ## .gitignore: ## @@ -32,3 +32,10 @@ __pycache__ # SDKMAN .sdkmanrc + +# Java class files +*.class + +# Ignore bin directories +bin/ +**/bin/ Review Comment:

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-05-27 Thread via GitHub
benwtrent commented on code in PR #14527: URL: https://github.com/apache/lucene/pull/14527#discussion_r2109471013 ## .gitignore: ## @@ -32,3 +32,10 @@ __pycache__ # SDKMAN .sdkmanrc + +# Java class files +*.class + +# Ignore bin directories +bin/ +**/bin/ Review Comment:

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-05-08 Thread via GitHub
weizijun commented on PR #14527: URL: https://github.com/apache/lucene/pull/14527#issuecomment-2863408186 > I don't understand this. How does the FloatArrayList bypass the overallocation possible via `ArrayUtils.grow`? Its the same structure right? and for every node there is a score.

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-05-08 Thread via GitHub
benwtrent commented on PR #14527: URL: https://github.com/apache/lucene/pull/14527#issuecomment-2863042916 > but the float array of the scores will not be passed out. Do I need to change FloatArrayList to MaxSizedFloatArrayList? I don't understand this. How does the FloatArrayList byp

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-05-07 Thread via GitHub
weizijun commented on PR #14527: URL: https://github.com/apache/lucene/pull/14527#issuecomment-2861178499 > You need one for scores as well. That `MaxSizedIntArrayList` looks good :) I didn't add the MaxSizedFloatArrayList because the int array will be passed out via the nodes() metho

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-05-07 Thread via GitHub
benwtrent commented on PR #14527: URL: https://github.com/apache/lucene/pull/14527#issuecomment-2859301884 > I add the MaxSizedIntArrayList class to solve oversize problem. You need one for scores as well. That `MaxSizedIntArrayList` looks good :) -- This is an automated message fro

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-05-07 Thread via GitHub
weizijun commented on code in PR #14527: URL: https://github.com/apache/lucene/pull/14527#discussion_r2076987925 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -32,13 +33,15 @@ public class NeighborArray { private final boolean scoresDescOrder;

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-05-07 Thread via GitHub
weizijun commented on PR #14527: URL: https://github.com/apache/lucene/pull/14527#issuecomment-2857434676 > The underlying structures utilize `ArrayUtil.grow` to ensure capacity. This means its very easy to overshoot the maximum size. This is why I was saying we should use a new structure d

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-05-06 Thread via GitHub
benwtrent commented on PR #14527: URL: https://github.com/apache/lucene/pull/14527#issuecomment-2854224239 > I use the maxSize variable to ensure that the elementsCount of the array cannot exceed the maxSize value. The underlying structures utilize `ArrayUtil.grow` to ensure capacity.

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-05-05 Thread via GitHub
weizijun commented on PR #14527: URL: https://github.com/apache/lucene/pull/14527#issuecomment-2853389550 hi, @benwtrent, I updated the ramBytesUsed method, the memory is correct now. > Please also adjust the inner arrays to enforce their maximal length. This way we never over-allocat

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-05-02 Thread via GitHub
benwtrent commented on PR #14527: URL: https://github.com/apache/lucene/pull/14527#issuecomment-2847706746 @weizijun It should be fixed. Having an estimation that is more than 2x off is pretty bad. this estimation is used to determine how often flushes should occur, etc.

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-04-30 Thread via GitHub
weizijun commented on PR #14527: URL: https://github.com/apache/lucene/pull/14527#issuecomment-2841473318 The failed TestHnswFloatVectorGraph.testRamUsageEstimate detail: The seed is `-Dtests.seed=FEF2FF50C6824DEF` ``` Expected :1723096.0 Actual :3785393.0 jav

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-04-30 Thread via GitHub
weizijun commented on PR #14527: URL: https://github.com/apache/lucene/pull/14527#issuecomment-2841464168 I added the change log and changed the initial size of the list to maxSize/8 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to Gi

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-04-24 Thread via GitHub
msokolov commented on PR #14527: URL: https://github.com/apache/lucene/pull/14527#issuecomment-2828649522 re: concurrency; in theory it should be safe since we lock a row before inserting anything into it. Consider that even with fixed-size arrays we need to track the occupancy (how full t

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-04-24 Thread via GitHub
benwtrent commented on PR #14527: URL: https://github.com/apache/lucene/pull/14527#issuecomment-2828219136 > Ok, and about the OnHeapHnswGraph.ramBytesUsed, I didn't change to code. But it maybe need to change the logic. I think utilizing the underlying array estimations is what we ne

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-04-24 Thread via GitHub
benwtrent commented on code in PR #14527: URL: https://github.com/apache/lucene/pull/14527#discussion_r2058763024 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -32,13 +33,15 @@ public class NeighborArray { private final boolean scoresDescOrder

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-04-24 Thread via GitHub
weizijun commented on PR #14527: URL: https://github.com/apache/lucene/pull/14527#issuecomment-2828149196 > I left a comment on the underlying structures used. > > Please update CHANGES.txt under 10.3 optimizations for this nice optimization! It will be very nice to have better heap u

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-04-24 Thread via GitHub
weizijun commented on PR #14527: URL: https://github.com/apache/lucene/pull/14527#issuecomment-2827978234 @benwtrent I see the default parameters: ``` "numMergeWorker": (12,), "numMergeThread": (4,), ``` Is the current merger effective with these parameters? -- This

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-04-24 Thread via GitHub
benwtrent commented on PR #14527: URL: https://github.com/apache/lucene/pull/14527#issuecomment-2827970657 @weizijun I am running some benchmarking as well. But the key thing is to update the parameters in `knnPerfTest.py` making sure `numMergeWorker` and `numMergeThread` are greater than `

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-04-24 Thread via GitHub
weizijun commented on PR #14527: URL: https://github.com/apache/lucene/pull/14527#issuecomment-2827261485 > We need to make sure that there are no significant performance or concurrency bugs introduced with this. Could you test with https://github.com/mikemccand/luceneutil to verify recall,

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-04-22 Thread via GitHub
jainankitk commented on PR #14527: URL: https://github.com/apache/lucene/pull/14527#issuecomment-2822073026 > I am just being cautious. I know we provide locking, etc. for the concurrent graph builder, but maybe its taking advantage of arrays never growing by accident. Thanks for ela

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-04-22 Thread via GitHub
weizijun commented on PR #14527: URL: https://github.com/apache/lucene/pull/14527#issuecomment-2821532658 Yes, maybe it's not suitable for concurrent graph builders. I tested this case in elasticsearch and it worked fine. Probably because in elasticsearch, the merge of graph builders is i

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-04-22 Thread via GitHub
benwtrent commented on PR #14527: URL: https://github.com/apache/lucene/pull/14527#issuecomment-2821413727 > Do you have anything specific in mind, that I might be missing? I am just being cautious. I know we provide locking, etc. for the concurrent graph builder, but maybe its taking

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-04-22 Thread via GitHub
benwtrent commented on code in PR #14527: URL: https://github.com/apache/lucene/pull/14527#discussion_r2054172393 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -32,13 +33,15 @@ public class NeighborArray { private final boolean scoresDescOrder

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-04-21 Thread via GitHub
jainankitk commented on PR #14527: URL: https://github.com/apache/lucene/pull/14527#issuecomment-2820142632 > We need to make sure that there are no significant performance or concurrency bugs introduced with this. Could you test with https://github.com/mikemccand/luceneutil to verify recal

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-04-21 Thread via GitHub
jainankitk commented on code in PR #14527: URL: https://github.com/apache/lucene/pull/14527#discussion_r2053369115 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -32,13 +33,15 @@ public class NeighborArray { private final boolean scoresDescOrde

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-04-21 Thread via GitHub
benwtrent commented on PR #14527: URL: https://github.com/apache/lucene/pull/14527#issuecomment-2818418948 We need to make sure that there are no significant performance or concurrency bugs introduced with this. Could you test with https://github.com/mikemccand/luceneutil to verify recall,

Re: [PR] Reduce NeighborArray heap memory [lucene]

2025-04-21 Thread via GitHub
weizijun commented on PR #14527: URL: https://github.com/apache/lucene/pull/14527#issuecomment-2817996406 The TestHnswFloatVectorGraph.testRamUsageEstimate maybe failed, because the OnHeapHnswGraph.ramBytesUsed use the fixed array size to calculate the ram value. -- This is an automated

[PR] Reduce NeighborArray heap memory [lucene]

2025-04-21 Thread via GitHub
weizijun opened a new pull request, #14527: URL: https://github.com/apache/lucene/pull/14527 When bbq is used with lucene, one datanode can contain more data. So when more shards are merged concurrently, there will be a problem of very high heap memory size. I found that the NeighborAr