[jira] [Commented] (LUCENE-10676) FieldInfo#name contributes significantly to heap usage at scale

Michael McCandless (Jira) Mon, 08 Aug 2022 03:13:24 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576678#comment-17576678
 ]


Michael McCandless commented on LUCENE-10676:
---------------------------------------------

Is each field name exotically long as well?

Lucene used to do just this -- intern {{FieldInfo.name}} and then use {{==}} to 
compare field names everywhere.  But we decided long ago that this was 
dangerous and not an important optimization.  Still, maybe that decision dates 
from pre-Java 7 days, when the interned string pool was stored in {{PermGen}} 
instead of the "ordinary" heap and was more likely to cause 
{{OutOfMemoryError}}?

Maybe dig into those long-ago issues / dev list threads to see the motivation 
for stopping interning?
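
For anyone unfamiliar with the pattern, here is a minimal sketch of what interning plus {{==}} comparison looks like in plain Java. It is not the old Lucene code; the class and method names are hypothetical and exist only for illustration. It also shows one plausible hazard (an assumption here, not something taken from the old discussions): an identity comparison silently returns false as soon as either side was not interned.

```java
// Hypothetical demo class for illustration only; not Lucene code.
final class FieldNameInternDemo {

    // Builds a String at runtime so that it is a distinct object
    // from any compile-time literal with the same contents.
    static String runtimeCopy(String s) {
        return new String(s.toCharArray());
    }

    public static void main(String[] args) {
        String a = runtimeCopy("timestamp");
        String b = runtimeCopy("timestamp");

        // Two equal but distinct String objects: identity comparison fails.
        System.out.println(a == b);                   // false
        System.out.println(a.equals(b));              // true

        // After interning, both refer to the single canonical instance.
        System.out.println(a.intern() == b.intern()); // true

        // The fragile part: if only one side is interned, == is false
        // even though the names are equal.
        System.out.println(a.intern() == b);          // false
    }
}
```

The last line is the case that makes identity comparison risky: every code path that produces a field name has to remember to intern it, and a missed call fails quietly instead of loudly.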

> FieldInfo#name contributes significantly to heap usage at scale
> ---------------------------------------------------------------
>
>                 Key: LUCENE-10676
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10676
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/codecs
>    Affects Versions: 9.3
>         Environment: Seen in Lucene 9.3.0 running on Linux using JDK18 but 
> seems independent of environment.
>            Reporter: David Turner
>            Priority: Minor
>              Labels: heap, scalability
>
> We encountered an Elasticsearch user with high heap usage, a significant 
> proportion of which was down to the contents of `FieldInfo#name`.
> This user was certainly pushing some scalability boundaries: this single 
> process had thousands of active Lucene indices, many with 10k+ fields, and 
> many indices had hundreds of segments due to an excess of flushes, so in 
> total they had an enormous number of `FieldInfo` instances. Still, the bulk 
> of the heap usage was just field names, and the total number of distinct 
> field names was fairly small. That's pretty common, especially for time-based 
> data like logs. Some kind of interning or deduplication of these strings 
> would have reduced their heap usage by many GBs.
> Is there a way we could deduplicate these strings? Deduplicating them across 
> segments within each index would already have helped, but ideally we'd like 
> to deduplicate them across indices too.
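
As a rough illustration of the deduplication asked about above, here is a minimal sketch of a canonicalizing pool that avoids {{String.intern()}} and keeps {{equals()}} as the comparison. The class {{FieldNamePool}} and its methods are hypothetical, not part of Lucene. The sketch holds strong references, so names are never released; a real implementation would probably need weak references or explicit lifecycle management.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper for illustration only; not part of Lucene.
final class FieldNamePool {

    // Maps each distinct name to its canonical instance.
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

    // Returns the canonical instance equal to the given name,
    // registering the given instance if the name was not seen before.
    String dedup(String name) {
        String existing = pool.putIfAbsent(name, name);
        return existing != null ? existing : name;
    }

    // Number of distinct names currently held.
    int size() {
        return pool.size();
    }
}
```

The scope of the pool determines how much is shared: one instance per index would deduplicate across segments within that index, while a single process-wide instance would also deduplicate across indices. Callers would still compare names with {{equals()}}, so a name that bypasses the pool costs memory but does not break correctness.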



