[ https://issues.apache.org/jira/browse/LUCENE-10676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576678#comment-17576678 ]
Michael McCandless commented on LUCENE-10676:
---------------------------------------------

Is each field name exotically long as well?

Lucene used to do just this -- intern {{FieldInfo.name}} and then use `==` to compare field names everywhere. But we decided long ago that this was dangerous and not an important optimization. Still, that decision was maybe pre Java 7 days when the intern'd pool was stored in {{PermGen}} instead of "ordinary" heap and was more likely to cause {{OutOfMemoryError}}? Maybe dig into those long ago issues / dev list thread to see the motivation to stop interning?

> FieldInfo#name contributes significantly to heap usage at scale
> ---------------------------------------------------------------
>
>                 Key: LUCENE-10676
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10676
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/codecs
>   Affects Versions: 9.3
>         Environment: Seen in Lucene 9.3.0 running on Linux using JDK18 but seems independent of environment.
>            Reporter: David Turner
>            Priority: Minor
>              Labels: heap, scalability
>
> We encountered an Elasticsearch user with high heap usage, a significant proportion of which was down to the contents of `FieldInfo#name`.
>
> This user was certainly pushing some scalability boundaries: this single process had thousands of active Lucene indices, many with 10k+ fields, and many indices had hundreds of segments due to an excess of flushes, so in total they had an enormous number of `FieldInfo` instances. Still, the bulk of the heap usage was just field names, and the total number of distinct field names was fairly small. That's pretty common, especially for time-based data like logs. Some kind of interning or deduplication of these strings would have reduced their heap usage by many GBs.
>
> Is there a way we could deduplicate these strings?
> Deduplicating them across segments within each index would already have helped, but ideally we'd like to deduplicate them across indices too.
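
For illustration only, here is a minimal sketch of the kind of deduplication the issue asks about, using an ordinary heap-backed map rather than `String.intern()`. The class `FieldNamePool` and its `dedup` method are hypothetical names invented for this example; nothing here is Lucene API, and it is not a proposal for how Lucene should implement this.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical canonicalizing pool for field names; not part of Lucene.
final class FieldNamePool {
    // Maps each distinct name to the single instance that callers should share.
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

    /** Returns the pooled instance equal to {@code name}, adding {@code name} if absent. */
    String dedup(String name) {
        String existing = pool.putIfAbsent(name, name);
        return existing != null ? existing : name;
    }

    int size() {
        return pool.size();
    }

    public static void main(String[] args) {
        FieldNamePool pool = new FieldNamePool();

        // new String(...) forces two distinct heap objects with equal contents,
        // standing in for the same field name loaded from two segments.
        String fromSegmentA = new String("host.name");
        String fromSegmentB = new String("host.name");

        String canonicalA = pool.dedup(fromSegmentA);
        String canonicalB = pool.dedup(fromSegmentB);

        System.out.println(fromSegmentA == fromSegmentB); // false: two copies on the heap
        System.out.println(canonicalA == canonicalB);     // true: one shared instance
        System.out.println(pool.size());                  // 1
    }
}
```

A pool like this holds strong references, so a real version would need some way to drop names that are no longer in use. It also only saves memory: callers would still compare names with `equals()`, which avoids the `==` hazard mentioned in the comment above.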