[GitHub] [lucene] thecoop commented on pull request #11847: Add a method allowing canonical strings to be returned from DataInput

GitBox Thu, 03 Nov 2022 03:23:03 -0700


thecoop commented on PR #11847:
URL: https://github.com/apache/lucene/pull/11847#issuecomment-1301893972


   This is ready to review. I've removed the threadlocal byte buffer.
   
   In testing with Elasticsearch, this reduced the duplicated strings in these 
fields by 90%. For the linked tickets, this would reduce the memory used by 
these strings from 6GB to 600MB. I can see no significant performance 
difference from running this in lucenebench.
   
   In the cases where the new method is not called, this would just add an 
empty `HashMap` instance to each `DataInput` instance, with the same lifetime 
as the container.
   When `getCanonicalString` is called, this adds a single hashmap lookup to 
the call compared to `getString`. It also requires memory for the backing 
`HashMap`, with the same lifecycle as the container, scaling with the number of 
distinct strings returned by that method.
   
   So this uses slightly more memory while deserializing, with the benefit of 
using drastically less memory once everything is deserialized and the data 
input & hashmap have been GCd.
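
   To illustrate the trade-off described above, here is a minimal sketch of 
the canonicalization idea: one map lookup per string, with the map growing 
with the number of distinct strings. It is not the code from this PR; the 
class `CanonicalStringCache` and its methods `canonicalize` and 
`distinctCount` are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not the code from the PR. */
final class CanonicalStringCache {
  // One entry per distinct string seen so far. This map is the extra memory
  // that lives as long as its owner does.
  private final Map<String, String> canonical = new HashMap<>();

  /** Returns the first instance seen that equals {@code s}. */
  String canonicalize(String s) {
    // putIfAbsent returns the existing value, or null if s was newly added.
    String existing = canonical.putIfAbsent(s, s);
    return existing != null ? existing : s;
  }

  /** Number of distinct strings currently held by the map. */
  int distinctCount() {
    return canonical.size();
  }

  public static void main(String[] args) {
    CanonicalStringCache cache = new CanonicalStringCache();
    // new String(...) forces two distinct instances with equal contents,
    // standing in for the same value being deserialized twice.
    String a = cache.canonicalize(new String("title"));
    String b = cache.canonicalize(new String("title"));
    System.out.println(a == b);                // same instance is returned
    System.out.println(cache.distinctCount()); // only one entry is stored
  }
}
```

   Once the cache itself is unreachable, only the canonical instances that 
callers still reference remain, which is where the saving after 
deserialization would come from.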


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


