[jira] [Commented] (LUCENE-10677) Duplicate strings in FieldInfo#attributes contribute significantly to heap usage at scale

Armin Braun (Jira) Tue, 09 Aug 2022 05:32:04 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577393#comment-17577393
 ]


Armin Braun commented on LUCENE-10677:
--------------------------------------

This wouldn't necessarily need string interning. Looking at the real-world 
examples I have of this, simply deduplicating a few known strings like 
"PerFieldPostingsFormat.format" would already be a huge memory saving for this 
map. Couldn't we just special-case some known strings when deserializing that 
map to deal with the biggest offenders?
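
To make the idea concrete, here is a minimal sketch of what special-casing 
known strings could look like. It is not Lucene code: the class name, the 
method names and the contents of the known-string table are hypothetical, and 
only "PerFieldPostingsFormat.format" is taken from the examples above. The 
point is just that a small fixed lookup table is enough to replace freshly 
deserialized copies with one shared instance, without String.intern().

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of Lucene: swaps known strings for shared instances. */
public final class KnownStringDeduplicator {

  // Small fixed table of strings expected to repeat across many FieldInfo
  // attribute maps. The contents are illustrative assumptions; a real list
  // would come from heap dump analysis.
  private static final Map<String, String> CANONICAL = new HashMap<>();

  static {
    for (String s : new String[] {
        "PerFieldPostingsFormat.format", // named in this thread
        "PerFieldPostingsFormat.suffix"  // assumed to be a similar offender
    }) {
      CANONICAL.put(s, s);
    }
  }

  private KnownStringDeduplicator() {}

  /** Returns the shared instance for a known string, otherwise the input unchanged. */
  public static String canonicalize(String s) {
    String shared = CANONICAL.get(s);
    return shared != null ? shared : s;
  }

  /** Copies an attributes map, replacing known keys and values with shared instances. */
  public static Map<String, String> dedupe(Map<String, String> attributes) {
    Map<String, String> result = new HashMap<>();
    for (Map.Entry<String, String> e : attributes.entrySet()) {
      result.put(canonicalize(e.getKey()), canonicalize(e.getValue()));
    }
    return result;
  }
}
```

The table is only written during class initialization, so concurrent readers 
can share it. Strings that are not in the table pass through untouched, which 
means this only helps with the biggest known offenders, not with arbitrary 
field-specific values.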

It's not just about RAM outright; sparing the GC these strings would be 
quite helpful as well, especially when a lot of them eventually become only 
weakly referenced through a chain from the segment readers, which makes it hard 
to collect them quickly under heap pressure (which is what caused trouble in 
the case that motivated this).

> Duplicate strings in FieldInfo#attributes contribute significantly to heap 
> usage at scale
> -----------------------------------------------------------------------------------------
>
>                 Key: LUCENE-10677
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10677
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/codecs
>    Affects Versions: 9.3
>            Reporter: Armin Braun
>            Priority: Minor
>              Labels: heap, scalability
>         Attachments: lucene_duplicate_fields.png
>
>
> This has the same origin as issue LUCENE-10676. Running a single process 
> with thousands of fields across many indexes will lead to a lot of duplicate 
> strings retained as keys and values in the `attributes` map. This can amount 
> to GBs of heap for thousands of fields across a few thousand segments. The 
> strings in the heap dump analysis below account for more than half (roughly 
> 2/3; the field names are somewhat unusually long in this example) of the 
> duplicate strings from `FieldInfo` instances.
> If we could deduplicate these obvious known strings when reading `FieldInfo`, 
> we could save GBs of heap for use cases like this.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
