[ https://issues.apache.org/jira/browse/LUCENE-10676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Armin Braun updated LUCENE-10676:
---------------------------------
    Attachment: image-2022-08-08-13-23-37-050.png

> FieldInfo#name contributes significantly to heap usage at scale
> ---------------------------------------------------------------
>
>                 Key: LUCENE-10676
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10676
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/codecs
>    Affects Versions: 9.3
>         Environment: Seen in Lucene 9.3.0 running on Linux using JDK18 but
> seems independent of environment.
>            Reporter: David Turner
>            Priority: Minor
>              Labels: heap, scalability
>         Attachments: image-2022-08-08-13-23-37-050.png
>
>
> We encountered an Elasticsearch user with high heap usage, a significant
> proportion of which was down to the contents of `FieldInfo#name`.
> This user was certainly pushing some scalability boundaries: this single
> process had thousands of active Lucene indices, many with 10k+ fields, and
> many indices had hundreds of segments due to an excess of flushes, so in
> total they had an enormous number of `FieldInfo` instances. Still, the bulk
> of the heap usage was just field names, and the total number of distinct
> field names was fairly small. That's pretty common, especially for time-based
> data like logs. Some kind of interning or deduplication of these strings
> would have reduced their heap usage by many GBs.
> Is there a way we could deduplicate these strings? Deduplicating them across
> segments within each index would already have helped, but ideally we'd like
> to deduplicate them across indices too.
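
For illustration only, the kind of deduplication the report asks about could be sketched roughly as below. This is not Lucene code and not a proposed patch: `FieldNameDeduplicator` and its `dedup` method are hypothetical names, and the sketch assumes a single process-wide pool holding strong references, so entries are never evicted. A real change would have to decide where such a pool lives (per index or per process) and how names for deleted fields get released.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper, not part of Lucene: returns one canonical String
// instance per distinct field name so that equal names share heap space.
public final class FieldNameDeduplicator {
    // Strong references: names stay in the pool for the life of this object.
    private final ConcurrentMap<String, String> canonical = new ConcurrentHashMap<>();

    // Returns the first instance seen for this name; later equal strings
    // are dropped in favour of it and become garbage-collectable.
    public String dedup(String name) {
        String existing = canonical.putIfAbsent(name, name);
        return existing != null ? existing : name;
    }

    // Number of distinct names currently pooled.
    public int size() {
        return canonical.size();
    }

    public static void main(String[] args) {
        FieldNameDeduplicator names = new FieldNameDeduplicator();
        // Simulate the same field name being read from two different segments.
        String fromSegmentA = names.dedup(new String("host.name"));
        String fromSegmentB = names.dedup(new String("host.name"));
        System.out.println("same instance: " + (fromSegmentA == fromSegmentB));
        System.out.println("distinct names pooled: " + names.size());
    }
}
```

The JDK's own `String.intern()` is another way to get a canonical instance. The explicit map is used here only because it makes the scope and lifetime of the pool visible.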