[ https://issues.apache.org/jira/browse/LUCENE-9816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17293141#comment-17293141 ]
Michael McCandless commented on LUCENE-9816:
--------------------------------------------

{quote}[~mikemccand] This is due to how the algorithm looks for duplicates: it stores a large hash table that maps 4-byte sequences to offsets in the input.
{quote}

+1, thanks for the explanation and musings about how we might further optimize it [~jpountz]!

> lazy-init LZ4-HC hashtable in blocktreewriter
> ---------------------------------------------
>
>                 Key: LUCENE-9816
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9816
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>          Priority: Major
>             Fix For: master (9.0)
>
>         Attachments: LUCENE-9816.patch
>
>
> Based upon the data for a field, blocktree may compress with LZ4-HC (or with
> simple lowercase compression, or none at all).
> But we currently eagerly initialize the HC hash table (132k) for each field,
> regardless of whether it will even be "tried". This shows up as a top CPU and
> heap hotspot when profiling tests, and creates unnecessary overhead for small
> flushes.
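The lazy-init idea in the issue can be sketched roughly as follows. This is a hypothetical illustration, not Lucene's actual code: class and method names are invented, and the table size is merely representative of the ~132k allocation the issue mentions. The point is only that the hash table is allocated on first real use rather than once per field up front.

```java
// Hypothetical sketch of lazily initializing an LZ4-HC-style hash table.
// Names and sizes are illustrative, not Lucene's real implementation.
final class LazyHighCompressor {
    // Maps hashes of 4-byte input sequences to offsets in the input.
    // Allocated lazily because many fields never try high compression.
    private int[] hashTable;

    byte[] compress(byte[] input, boolean tryHighCompression) {
        if (!tryHighCompression) {
            // Small flushes / incompressible fields skip the table entirely.
            return input;
        }
        if (hashTable == null) {
            // Allocated once, only when a field actually needs HC matching.
            hashTable = new int[32 * 1024];
        }
        // ... run the LZ4-HC match search against hashTable here ...
        return input;
    }

    boolean tableAllocated() {
        return hashTable != null;
    }
}
```

With this shape, a flush that never attempts HC compression pays nothing for the table, which is the overhead the profiling in this issue identified.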