David Turner created LUCENE-10676:
-------------------------------------

             Summary: FieldInfo#name contributes significantly to heap usage at scale
                 Key: LUCENE-10676
                 URL: https://issues.apache.org/jira/browse/LUCENE-10676
             Project: Lucene - Core
          Issue Type: Bug
          Components: core/codecs
    Affects Versions: 9.3
         Environment: Seen in Lucene 9.3.0 running on Linux with JDK 18, but appears to be independent of environment.
            Reporter: David Turner


We encountered an Elasticsearch user with high heap usage, a significant 
proportion of which was down to the contents of `FieldInfo#name`.

This user was certainly pushing some scalability boundaries: this single 
process had thousands of active Lucene indices, many with 10k+ fields, and many 
indices had hundreds of segments due to an excess of flushes, so in total they 
had an enormous number of `FieldInfo` instances. Still, the bulk of the heap 
usage was just field names, and the total number of distinct field names was 
fairly small. A small set of names repeated across many segments and indices is pretty common, especially for time-based data like logs. 
Some kind of interning or deduplication of these strings would have reduced 
their heap usage by many GBs.

Is there a way we could deduplicate these strings? Deduplicating them across 
segments within each index would already have helped, but ideally we'd like to 
deduplicate them across indices too.
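
To make the idea concrete, here is a minimal sketch of process-wide deduplication, which would cover both the per-index and the cross-index case. It is only an illustration: `FieldNameDeduplicator` and its `dedup` method are hypothetical names, not existing Lucene APIs, and where such a call would sit when `FieldInfo` instances are read is left open. The sketch holds strong references, so names are never released; a real change would probably want weak references or some other eviction scheme.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical helper, not part of Lucene: maps equal field names to one shared instance. */
final class FieldNameDeduplicator {
    // Canonical instance for each distinct name. Strong references: entries are
    // never evicted, which is an assumption made to keep the sketch short.
    private static final ConcurrentMap<String, String> CANONICAL = new ConcurrentHashMap<>();

    private FieldNameDeduplicator() {}

    /** Returns the canonical instance equal to {@code name}, registering it if it is new. */
    static String dedup(String name) {
        String existing = CANONICAL.putIfAbsent(name, name);
        return existing != null ? existing : name;
    }

    /** Number of distinct names seen so far. */
    static int distinctNames() {
        return CANONICAL.size();
    }
}
```

With something like this, every segment that declares a field called `message` would end up referencing the same `String` object, so the heap cost would scale with the number of distinct names rather than with the number of `FieldInfo` instances.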


