[ 
https://issues.apache.org/jira/browse/LUCENE-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577853#comment-17577853
 ] 

David Turner edited comment on LUCENE-10677 at 8/10/22 9:42 AM:
----------------------------------------------------------------

> I'm opposed to the use of string.intern by the lucene library here. It is 
> inappropriate for a library (versus an app)

I think that's reasonable; `String#intern` is a pretty blunt tool to be using 
here. And yet it does seem awfully wasteful to burn so much heap on these 
things. "Buy more RAM" is not a great answer (implicitly this means "... or go 
and find a cheaper alternative elsewhere", and folks are indeed willing to do 
that). The next scaling limit in this dimension appears to be quite far off, 
which is why we think this is worth addressing. (Edit to add: these strings 
appear to roughly double the heap needed for each `SegmentReader` object.)

Are there any other approaches you'd suggest? It looks like we might be able to 
intercept the relevant calls to `DataInput#readString` ourselves, although 
adding support for compound segments introduces an enormous amount of extra 
complexity to that approach. Would it work to introduce some simpler way for an 
application to hook in some kind of string deduplication mechanism even if it 
goes unused in pure Lucene by default?
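
To make the idea concrete, here is a rough sketch of the kind of 
application-side deduplication I have in mind, as an alternative to 
`String#intern`. Everything in it is hypothetical: `StringPool`, its `dedupe` 
method and the attribute key used in the usage note are made up for 
illustration, and none of it is an existing Lucene API. It only shows a pool 
that the application owns and can size or discard as it sees fit.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

// Hypothetical helper, NOT part of Lucene: an application-owned pool that maps
// equal strings to a single canonical instance without using String#intern.
final class StringPool implements UnaryOperator<String> {
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

    // Returns the canonical instance for s, registering s if it is new.
    @Override
    public String apply(String s) {
        if (s == null) {
            return null;
        }
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    // Number of distinct strings currently held by the pool.
    int size() {
        return pool.size();
    }

    // Returns a copy of the given map whose keys and values are canonical
    // instances; this is the shape of an attributes-style String-to-String map.
    Map<String, String> dedupe(Map<String, String> attributes) {
        Map<String, String> out = new HashMap<>(attributes.size() * 2);
        attributes.forEach((k, v) -> out.put(apply(k), apply(v)));
        return out;
    }
}
```

With something like this, two separately read copies of a key such as 
"example.attribute.key" would end up sharing one `String` instance after 
passing through `apply`, and the application decides how long the pool lives. 
Where such a hook would plug into the reading path is exactly the open 
question.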


> Duplicate strings in FieldInfo#attributes contribute significantly to heap 
> usage at scale
> -----------------------------------------------------------------------------------------
>
>                 Key: LUCENE-10677
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10677
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/codecs
>    Affects Versions: 9.3
>            Reporter: Armin Braun
>            Priority: Minor
>              Labels: heap, scalability
>         Attachments: lucene_duplicate_fields.png
>
>
> This has the same origin as issue LUCENE-10676. Running a single process 
> with thousands of fields across many indexes will lead to a lot of duplicate 
> strings retained as keys and values in the `attributes` map. This can amount 
> to GBs of heap for thousands of fields across a few thousand segments. The 
> strings in the heap dump analysis below account for more than half of the 
> duplicate strings from `FieldInfo` instances (roughly 2/3, though the field 
> names are somewhat unusually long in this example).
> If we could deduplicate these obvious, known strings when reading `FieldInfo` 
> we could save GBs of heap for use cases like this.
>  


