Armin Braun created LUCENE-10677:
------------------------------------

             Summary: Duplicate strings in FieldInfo#attributes contribute 
significantly to heap usage at scale
                 Key: LUCENE-10677
                 URL: https://issues.apache.org/jira/browse/LUCENE-10677
             Project: Lucene - Core
          Issue Type: Bug
          Components: core/codecs
    Affects Versions: 9.3
            Reporter: Armin Braun
         Attachments: lucene_duplicate_fields.png

This has the same origin as issue LUCENE-10676. Running a single process with 
thousands of fields across many indexes leads to a large number of duplicate 
strings being retained as keys and values in the `attributes` map. This can 
amount to GBs of heap for thousands of fields across a few thousand segments. 
In the attached heap dump analysis (lucene_duplicate_fields.png), these strings 
account for more than half of the duplicate strings retained by `FieldInfo` 
instances (roughly 2/3; note that the field names are somewhat unusually long 
in this example).

If we could deduplicate these obviously well-known strings when reading 
`FieldInfo`, we could save GBs of heap for use cases like this.
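
To illustrate the idea, here is a minimal sketch of deduplicating attribute 
keys and values at read time. It is not the Lucene implementation: the class 
`AttributeDeduplicator` and its `dedup` methods are hypothetical names made up 
for this example, and using `String.intern()` as the canonicalization 
mechanism is an assumption (a dedicated pool would work as well).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: canonicalizes the strings of an
// attributes map so that equal keys/values share a single String instance.
final class AttributeDeduplicator {

  private AttributeDeduplicator() {}

  // Returns the canonical instance for s. String.intern() is used here as an
  // assumption; it relies on the JVM-wide string table.
  static String dedup(String s) {
    return s == null ? null : s.intern();
  }

  // Copies the given attributes into a new unmodifiable map whose keys and
  // values are canonical instances. The content of the map is unchanged.
  static Map<String, String> dedup(Map<String, String> attributes) {
    Map<String, String> out = new HashMap<>();
    for (Map.Entry<String, String> e : attributes.entrySet()) {
      out.put(dedup(e.getKey()), dedup(e.getValue()));
    }
    return Collections.unmodifiableMap(out);
  }
}
```

With something like this applied where the attributes are read, two segments 
that both carry e.g. `PerFieldPostingsFormat.format` as an attribute key would 
reference one shared string rather than one copy per field per segment.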

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
