thecoop commented on PR #11847: URL: https://github.com/apache/lucene/pull/11847#issuecomment-1301893972
This is ready for review. I've removed the thread-local byte buffer. In testing with Elasticsearch, this reduced the duplicated strings in these fields by 90%. For the linked tickets, that would cut the memory used by these strings from 6GB to 600MB. I can see no significant performance difference when running this in lucenebench.

When the new method is not called, this just adds an empty `HashMap` instance to each `DataInput` instance, with the same lifetime as the container. When `getCanonicalString` is called, it adds a single hashmap lookup to the call compared to `getString`. It also requires memory for the backing `HashMap`, which has the same lifecycle as the container and scales with the number of distinct strings returned by that method.

So this uses slightly more memory while deserializing, in exchange for drastically less memory once everything is deserialized and the data input and hashmap have been GC'd.
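
For illustration only, the trade-off described above can be sketched roughly as follows. This is not the code in the PR: `CanonicalizingInput`, its constructor, and `distinctCount` are hypothetical names, and the string array stands in for bytes that a real `DataInput` would decode. Only the method names `getString` and `getCanonicalString` come from the comment.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a data input; NOT Lucene's DataInput.
final class CanonicalizingInput {
  // Assumption: values that a real input would decode from bytes.
  private final String[] values;
  private int pos = 0;

  // Per-instance cache with the same lifetime as this input.
  // It stays empty if getCanonicalString() is never called.
  private final Map<String, String> canonical = new HashMap<>();

  CanonicalizingInput(String... values) {
    this.values = values;
  }

  // Returns a freshly allocated String on every call, as decoding would.
  String getString() {
    return new String(values[pos++].toCharArray());
  }

  // One extra map lookup per call; equal strings share one instance.
  String getCanonicalString() {
    String s = getString();
    String prior = canonical.putIfAbsent(s, s);
    return prior != null ? prior : s;
  }

  // The map grows with the number of distinct strings returned.
  int distinctCount() {
    return canonical.size();
  }
}
```

In this sketch the duplicate created by `getString` becomes garbage as soon as an equal string is already in the map, and the map itself becomes collectable together with the input that owns it.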