[ https://issues.apache.org/jira/browse/LUCENE-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577393#comment-17577393 ]
Armin Braun commented on LUCENE-10677:
--------------------------------------
This wouldn't necessarily need string interning. Looking at the real-world
examples I have of this, simply deduplicating a few known strings like
"PerFieldPostingsFormat.format" would already be a huge memory saving for this
map. Couldn't we just special-case some known strings when deserializing that
map to deal with the biggest offenders?
It's not just about RAM outright; sparing the GC the work for these strings
would be quite helpful as well, especially since many of them eventually become
only weakly referenced through a chain from the segment readers, which makes
them hard to collect quickly under heap pressure (which is what caused trouble
in the case that motivated this).
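
For illustration, the special-casing suggested above could look roughly like
the sketch below. It is not Lucene code: the class name, the method names and
every entry in the known-strings table other than
"PerFieldPostingsFormat.format" are assumptions made up for this example, and
where such a hook would sit in the `FieldInfo` reading path is left open.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: swaps a few well-known attribute
// strings for one shared instance without interning anything else.
final class KnownAttributeStrings {

  // Canonical instances, keyed by value. Only the first entry is named in the
  // comment above; the remaining entries are assumed examples.
  private static final Map<String, String> CANONICAL = new HashMap<>();

  static {
    String[] known = {
      "PerFieldPostingsFormat.format",
      "PerFieldPostingsFormat.suffix", // assumption
      "PerFieldDocValuesFormat.format", // assumption
      "PerFieldDocValuesFormat.suffix" // assumption
    };
    for (String s : known) {
      CANONICAL.put(s, s);
    }
  }

  private KnownAttributeStrings() {}

  // Returns the shared instance for a known string, or the argument itself
  // when the string is not in the table.
  public static String canonical(String s) {
    String shared = CANONICAL.get(s);
    return shared != null ? shared : s;
  }

  // Copies a freshly deserialized attributes map, replacing known keys and
  // values with their shared instances. Unknown strings are kept as they are.
  public static Map<String, String> dedupe(Map<String, String> attributes) {
    Map<String, String> result = new HashMap<>();
    for (Map.Entry<String, String> e : attributes.entrySet()) {
      result.put(canonical(e.getKey()), canonical(e.getValue()));
    }
    return result;
  }
}
```

A fixed table like this only covers strings known ahead of time, so field
names and other per-index values would remain duplicated; the trade-off is
that the lookup is cheap and nothing is added to the JVM-wide intern pool.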
> Duplicate strings in FieldInfo#attributes contribute significantly to heap
> usage at scale
> -----------------------------------------------------------------------------------------
>
> Key: LUCENE-10677
> URL: https://issues.apache.org/jira/browse/LUCENE-10677
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/codecs
> Affects Versions: 9.3
> Reporter: Armin Braun
> Priority: Minor
> Labels: heap, scalability
> Attachments: lucene_duplicate_fields.png
>
>
> This has the same origin as issue LUCENE-10676. Running a single process
> with thousands of fields across many indexes will lead to a lot of duplicate
> strings retained as keys and values in the `attributes` map. This can amount
> to GBs of heap for thousands of fields across a few thousand segments. The
> strings in the heap dump analysis below account for more than half of the
> duplicate strings from `FieldInfo` instances (roughly 2/3; note that the
> field names are somewhat unusually long in this example).
> If we could deduplicate these obvious known strings when reading `FieldInfo`
> we could save GBs of heap for use cases like this.
>