mikemccand commented on PR #12738: URL: https://github.com/apache/lucene/pull/12738#issuecomment-1788744551
> But I ended up using a `List<byte[]>` where each item is a node instead of ByteBlockPool due to the following reasons: Hmm -- this is sizable added RAM overhead per entry. Added array header (16 or 20 bytes), added pointers to these arrays (4 or 8 bytes), tiny objects for GC to crawl, though maybe they mostly die young if the RAM budget is low enough. Yes, we can share some of these `byte[]` with both hashes but I suspect that's not a net/net win. > We automatically get the length of the node without any traversing We are paying 4 bytes per entry for this :) And it seems a shame since FST's node encoding is already self-delimiting, and, as a side effect of looking up that node we will already have computed that length. > With ByteBlockPool we have 2 unavoidable double byte-copies: (1) when write from BytesStore to the primary table and (2) when promote an entry from the fallback table to the primary table. In both situations we need to first write into a temporary byte[]. I don't think this added cost is so much. We freeze the node into the true FST appending byte store, then, we copy those "last N bytes" over to primary hash. Eventually it's moved to fallback, and, maybe it never gets promoted back (single copy), or, maybe it does (+1 copy) but that's "worth it" since we achieve some minimization, i.e. the cost correlates nicely with the win (the whole point of this double LRU hash). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org