[ https://issues.apache.org/jira/browse/LUCENE-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577309#comment-17577309 ]
Robert Muir commented on LUCENE-10677:
--------------------------------------

I'm opposed to the use of String.intern by the Lucene library here. It is inappropriate for a library (as opposed to an app); there are plenty of discussions you can find about the problems it causes for apps. If someone insists on having thousands of indexes with tens of thousands of fields, they can buy more RAM.

> Duplicate strings in FieldInfo#attributes contribute significantly to heap
> usage at scale
> -----------------------------------------------------------------------------------------
>
>                 Key: LUCENE-10677
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10677
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/codecs
>    Affects Versions: 9.3
>            Reporter: Armin Braun
>            Priority: Minor
>              Labels: heap, scalability
>         Attachments: lucene_duplicate_fields.png
>
>
> This has the same origin as issue LUCENE-10676. Running a single process
> with thousands of fields across many indexes leads to a lot of duplicate
> strings retained as keys and values in the `attributes` map. This can amount
> to GBs of heap for thousands of fields across a few thousand segments. The
> strings in the heap dump analysis below account for more than half of the
> duplicate strings from `FieldInfo` instances (roughly 2/3; the field names
> in this example are somewhat unusually long).
> If we could deduplicate these obvious known strings when reading `FieldInfo`
> we could save GBs of heap for use cases like this.
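
For illustration only, deduplicating attribute strings on read does not have to go through String.intern. The sketch below keeps a caller-owned map of canonical instances, so nothing is added to the JVM-wide intern table. `StringDeduplicator`, `dedup` and `dedupAttributes` are hypothetical names, not Lucene APIs, and this is not necessarily how the issue was or would be resolved.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of Lucene): canonicalizes equal strings via a caller-owned map. */
final class StringDeduplicator {
  // Maps each distinct string value to the first instance seen with that value.
  private final Map<String, String> canonical = new HashMap<>();

  /** Returns the canonical instance equal to s, registering s if it is the first one seen. */
  String dedup(String s) {
    if (s == null) {
      return null;
    }
    String prior = canonical.putIfAbsent(s, s);
    return prior != null ? prior : s;
  }

  /** Returns a copy of the given attributes map whose keys and values are canonical instances. */
  Map<String, String> dedupAttributes(Map<String, String> attributes) {
    Map<String, String> out = new HashMap<>();
    for (Map.Entry<String, String> e : attributes.entrySet()) {
      out.put(dedup(e.getKey()), dedup(e.getValue()));
    }
    return out;
  }
}
```

Under these assumptions, a reader would create one `StringDeduplicator` per scope it controls (for example per process or per directory) and pass each freshly read attributes map through `dedupAttributes`. The map itself holds a reference to every distinct string, so its lifetime and size would need to be bounded by whoever owns it.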