dungba88 commented on code in PR #12844:
URL: https://github.com/apache/lucene/pull/12844#discussion_r1421782211
##########
lucene/core/src/java/org/apache/lucene/util/ArrayUtil.java:
##########
@@ -330,15 +330,36 @@ public static int[] growExact(int[] array, int newLength) {
     return copy;
   }
 
+  /**
+   * Returns an array whose size is at least {@code minLength}, generally over-allocating
+   * exponentially, but never allocating more than {@code maxLength} elements.
+   */
+  public static int[] growInRange(int[] array, int minLength, int maxLength) {
+    assert minLength >= 0
+        : "length must be positive (got " + minLength + "): likely integer overflow?";
+
+    if (minLength > maxLength) {

Review Comment:
   I think this can be an assert instead, since this class is marked as internal (a rough sketch of what I mean is at the end of this message).

##########
lucene/core/src/test/org/apache/lucene/util/hnsw/HnswGraphTestCase.java:
##########
@@ -757,6 +757,30 @@ public void testRamUsageEstimate() throws IOException {
     long estimated = RamUsageEstimator.sizeOfObject(hnsw);
     long actual = ramUsed(hnsw);
 
+    // The estimation assumes neighbor arrays are always max size.
+    // When this is not true, the estimate can be much larger than the actual value.
+    // In these cases, we compute how much we overestimated the neighbors arrays.
+    if (estimated > actual) {
+      long neighborsError = 0;
+      int numLevels = hnsw.numLevels();
+      for (int level = 0; level < numLevels; ++level) {
+        NodesIterator nodesOnLevel = hnsw.getNodesOnLevel(level);
+        while (nodesOnLevel.hasNext()) {
+          int node = nodesOnLevel.nextInt();
+          NeighborArray neighbors = hnsw.getNeighbors(level, node);
+          long maxNeighborsSize;
+          if (level == 0) {

Review Comment:
   Generally I think we'd rather avoid having the test duplicate assumptions or logic from the production path. This looks like a specific implementation decision that could change independently, though I couldn't think of a better way either.
   
   I'm also unsure we need this newly added code at all: it is only computed in this single test, and only to get a better estimation. The test verifies that our over-estimation is no more than 30% of the actual size, so if we provide a better estimation maybe we can lower the tolerance threshold? Still, it seems strange that we would truly need this better estimation yet only compute it in a test.
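   To make the "duplicated assumption" point concrete, here is a rough sketch (not the PR's actual code) of how the truncated loop above might finish computing the over-estimation. It hard-codes the same policy the production graph uses, namely that level 0 keeps up to 2*M neighbors while upper levels keep at most M; `M` here stands for the graph's max-conn setting and is only a placeholder, and whether the error is then subtracted from the estimate or added to the actual value before asserting is also a guess:

             // Hypothetical continuation of the loop above (not the PR's code):
             // mirror the production assumption that level 0 is built with up to
             // 2*M neighbors while the upper levels keep at most M.
             int maxConn = (level == 0) ? 2 * M : M;
             // NeighborArray keeps parallel node/score arrays, so each slot holds
             // a node id (int) plus a score (float).
             maxNeighborsSize = (long) maxConn * (Integer.BYTES + Float.BYTES);
             long actualNeighborsSize = (long) neighbors.size() * (Integer.BYTES + Float.BYTES);
             neighborsError += maxNeighborsSize - actualNeighborsSize;
           }
         }
         // Credit the over-estimated neighbor storage back before comparing.
         estimated -= neighborsError;
       }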
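   And for the ArrayUtil.growInRange comment above, this is roughly what I have in mind: since ArrayUtil is internal, the range check could become an assert just like the existing minLength check. The body below is only a sketch that reuses the existing oversize()/growExact() helpers in ArrayUtil; the actual method in the PR may look different:

     // Rough sketch only -- the real growInRange in the PR may differ. The point
     // is that the minLength > maxLength case becomes an assert, matching the
     // existing minLength >= 0 assert, instead of throwing at runtime.
     public static int[] growInRange(int[] array, int minLength, int maxLength) {
       assert minLength >= 0
           : "length must be positive (got " + minLength + "): likely integer overflow?";
       assert minLength <= maxLength
           : "minLength " + minLength + " must not be greater than maxLength " + maxLength;
 
       if (array.length >= minLength) {
         return array;
       }
       // Grow exponentially via oversize(), but never past maxLength.
       int potentialLength = oversize(minLength, Integer.BYTES);
       return growExact(array, Math.min(maxLength, potentialLength));
     }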