[ https://issues.apache.org/jira/browse/LUCENE-9816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292524#comment-17292524 ]

Adrien Grand commented on LUCENE-9816:
--------------------------------------

+1 to the patch. [~mikemccand] This is due to how the algorithm looks for duplicates: it stores a large hash table that maps 4-byte sequences to offsets in the input. The high-speed variant uses packed ints to lower memory usage on short inputs, but maybe we could be smarter on the high-compression variant: since it records not only the last offset for every hash (like the high-speed variant does) but also the previous ones within a 64k window, a stronger hash function only makes compression faster, not more effective. So maybe we could adapt the number of bits of the hash function to the size of the input, in order to reduce memory usage without hurting compression ratios and hopefully without hurting compression speed.

> lazy-init LZ4-HC hashtable in blocktreewriter
> ---------------------------------------------
>
>                 Key: LUCENE-9816
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9816
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Major
>         Attachments: LUCENE-9816.patch
>
>
> Based upon the data for a field, blocktree may compress with LZ4-HC (or with simple lowercase compression, or none at all).
> But we currently eagerly initialize the HC hashtable (132k) for each field regardless of whether it will even be "tried". This shows up as a top CPU and heap hotspot when profiling tests. It creates unnecessary overhead for small flushes.
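
A minimal sketch of the lazy-initialization idea from the issue description, assuming Lucene's LZ4.HighCompressionHashTable and LZ4.compress(byte[], int, int, DataOutput, HashTable) from org.apache.lucene.util.compress. The FieldBlockWriter class and the shouldTryHighCompression heuristic are made-up stand-ins for illustration, not the actual blocktree writer code or the attached patch.

{code:java}
import java.io.IOException;

import org.apache.lucene.store.DataOutput;
import org.apache.lucene.util.compress.LZ4;

class FieldBlockWriter {

  // Was allocated eagerly per field (~132k); here it is created on first use only.
  private LZ4.HighCompressionHashTable hcTable;

  private LZ4.HighCompressionHashTable hcTable() {
    if (hcTable == null) {
      hcTable = new LZ4.HighCompressionHashTable();
    }
    return hcTable;
  }

  // Illustrative placeholder for whatever heuristic decides to try LZ4-HC at all.
  private boolean shouldTryHighCompression(int len) {
    return len >= 128;
  }

  void writeSuffixBytes(byte[] bytes, int len, DataOutput out) throws IOException {
    if (shouldTryHighCompression(len)) {
      // The large table is only ever allocated if we actually reach this branch.
      LZ4.compress(bytes, 0, len, out, hcTable());
    } else {
      // Small flushes and fields that are never "tried" skip the allocation entirely.
      out.writeBytes(bytes, 0, len);
    }
  }
}
{code}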
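And a rough sketch of the adaptive hash-size idea from the comment above: derive the number of hash bits from the input length so that short inputs get a proportionally small table. The MIN_HASH_LOG/MAX_HASH_LOG constants and the one-slot-per-byte sizing are assumptions for illustration, not the existing LZ4-HC implementation.

{code:java}
final class AdaptiveHashBits {

  static final int MIN_HASH_LOG = 8;   // 256-entry table for tiny inputs (assumed floor)
  static final int MAX_HASH_LOG = 15;  // 32k-entry table for large inputs (assumed cap)

  // Roughly one hash slot per input byte, clamped to [MIN_HASH_LOG, MAX_HASH_LOG].
  static int hashLogForInput(int inputLength) {
    int log = 32 - Integer.numberOfLeadingZeros(Math.max(1, inputLength - 1)); // ~ceil(log2(inputLength))
    return Math.min(MAX_HASH_LOG, Math.max(MIN_HASH_LOG, log));
  }

  // Multiplicative hash of a 4-byte sequence, keeping the top hashLog bits.
  static int hash(int fourBytes, int hashLog) {
    return (fourBytes * 0x9E3779B1) >>> (32 - hashLog);
  }
}
{code}

Since the high-compression variant still walks the 64k chain of previous offsets, a smaller table on short inputs should only add a few extra candidate checks rather than change what matches are found.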