[ https://issues.apache.org/jira/browse/LUCENE-9816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292524#comment-17292524 ]

Adrien Grand commented on LUCENE-9816:
--------------------------------------

+1 to the patch. [~mikemccand] This is due to how the algorithm looks for duplicates: it stores a large hash table that maps 4-byte sequences to offsets in the input. The high-speed variant uses packed ints to lower memory usage on short inputs, but maybe we could be smarter on the high-compression variant: since it records not only the last offset for every hash (like the high-speed variant does) but also the previous ones within a 64k window, a stronger hash function only makes compression faster, not more effective. So maybe we could adapt the number of bits of the hash function to the size of the input, in order to reduce memory usage without hurting compression ratios and hopefully without hurting compression speed.

> lazy-init LZ4-HC hashtable in blocktreewriter
> ---------------------------------------------
>
>                 Key: LUCENE-9816
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9816
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Major
>         Attachments: LUCENE-9816.patch
>
>
> Based upon the data for a field, blocktree may compress with LZ4-HC (or with simple lowercase compression, or none at all).
> But we currently eagerly initialize the HC hashtable (132k) for each field regardless of whether it will even be "tried". This shows up as a top CPU and heap hotspot when profiling tests. It creates unnecessary overhead for small flushes.
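
A minimal sketch of the lazy-initialization idea from the issue description, assuming Lucene's LZ4.HighCompressionHashTable and LZ4.compress(byte[], int, int, DataOutput, HashTable) from org.apache.lucene.util.compress. The FieldBlockWriter class and the shouldTryHighCompression heuristic are made-up stand-ins for illustration, not the actual blocktree writer code or the attached patch.

{code:java}
import java.io.IOException;

import org.apache.lucene.store.DataOutput;
import org.apache.lucene.util.compress.LZ4;

class FieldBlockWriter {

  // Was allocated eagerly per field (~132k); here it is created on first use only.
  private LZ4.HighCompressionHashTable hcTable;

  private LZ4.HighCompressionHashTable hcTable() {
    if (hcTable == null) {
      hcTable = new LZ4.HighCompressionHashTable();
    }
    return hcTable;
  }

  // Illustrative placeholder for whatever heuristic decides to try LZ4-HC at all.
  private boolean shouldTryHighCompression(int len) {
    return len >= 128;
  }

  void writeSuffixBytes(byte[] bytes, int len, DataOutput out) throws IOException {
    if (shouldTryHighCompression(len)) {
      // The large table is only ever allocated if we actually reach this branch.
      LZ4.compress(bytes, 0, len, out, hcTable());
    } else {
      // Small flushes and fields that are never "tried" skip the allocation entirely.
      out.writeBytes(bytes, 0, len);
    }
  }
}
{code}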
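And a rough sketch of the adaptive hash-size idea from the comment above: derive the number of hash bits from the input length so that short inputs get a proportionally small table. The MIN_HASH_LOG/MAX_HASH_LOG constants and the one-slot-per-byte sizing are assumptions for illustration, not the existing LZ4-HC implementation.

{code:java}
final class AdaptiveHashBits {

  static final int MIN_HASH_LOG = 8;   // 256-entry table for tiny inputs (assumed floor)
  static final int MAX_HASH_LOG = 15;  // 32k-entry table for large inputs (assumed cap)

  // Roughly one hash slot per input byte, clamped to [MIN_HASH_LOG, MAX_HASH_LOG].
  static int hashLogForInput(int inputLength) {
    int log = 32 - Integer.numberOfLeadingZeros(Math.max(1, inputLength - 1)); // ~ceil(log2(inputLength))
    return Math.min(MAX_HASH_LOG, Math.max(MIN_HASH_LOG, log));
  }

  // Multiplicative hash of a 4-byte sequence, keeping the top hashLog bits.
  static int hash(int fourBytes, int hashLog) {
    return (fourBytes * 0x9E3779B1) >>> (32 - hashLog);
  }
}
{code}

Since the high-compression variant still walks the 64k chain of previous offsets, a smaller table on short inputs should only add a few extra candidate checks rather than change what matches are found.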