David Turner created LUCENE-10676:
-------------------------------------

             Summary: FieldInfo#name contributes significantly to heap usage at scale
                 Key: LUCENE-10676
                 URL: https://issues.apache.org/jira/browse/LUCENE-10676
             Project: Lucene - Core
          Issue Type: Bug
          Components: core/codecs
    Affects Versions: 9.3
         Environment: Seen in Lucene 9.3.0 running on Linux using JDK18 but seems independent of environment.
            Reporter: David Turner

We encountered an Elasticsearch user with high heap usage, a significant proportion of which was down to the contents of `FieldInfo#name`. This user was certainly pushing some scalability boundaries: this single process had thousands of active Lucene indices, many with 10k+ fields, and many indices had hundreds of segments due to an excess of flushes, so in total they had an enormous number of `FieldInfo` instances.

Still, the bulk of the heap usage was just field names, and the total number of distinct field names was fairly small. That's pretty common, especially for time-based data like logs. Some kind of interning or deduplication of these strings would have reduced their heap usage by many GBs.

Is there a way we could deduplicate these strings? Deduplicating them across segments within each index would already have helped, but ideally we'd like to deduplicate them across indices too.
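
To make the request concrete, here is a minimal sketch of the kind of deduplication meant above. It is not Lucene code: `FieldNamePool` and its `dedupe` method are hypothetical names used only for illustration, and where such a pool would be consulted when `FieldInfo` instances are created is left open. The idea is that equal field names read from different segments or indices all resolve to one shared `String` instance.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Hypothetical helper, not part of Lucene: a process-wide pool that maps
 * every field name to a single canonical String instance.
 */
final class FieldNamePool {
    // Key and value are the same object: the canonical copy of the name.
    private final ConcurrentMap<String, String> pool = new ConcurrentHashMap<>();

    /** Returns the canonical instance for the given name, registering it if unseen. */
    String dedupe(String name) {
        // putIfAbsent returns the previously stored instance, or null if
        // this call stored the name for the first time.
        String existing = pool.putIfAbsent(name, name);
        return existing != null ? existing : name;
    }

    /** Number of distinct names currently held by the pool. */
    int size() {
        return pool.size();
    }
}
```

With a pool like this shared across indices, two separately allocated but equal names such as `"@timestamp"` come back from `dedupe` as the same object, so each distinct name is retained once regardless of how many segments mention it. One trade-off of this sketch is that the pool holds strong references and never releases a name; `String.intern()` or a weak-reference-based pool are alternatives with different costs.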