[
https://issues.apache.org/jira/browse/KAFKA-17424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876766#comment-17876766
]
Greg Harris commented on KAFKA-17424:
-------------------------------------
Hi [~ajit97], thanks for the ticket!
Can you provide some more supporting documentation? Perhaps some profiles with
evidence that this array-copy is the source of the problem?
As far as I can tell from reading the code, this change would avoid an extra
reservation of roughly 8*max.poll.records bytes on each batch. For example, for
that reservation to require an additional 1GB of heap, max.poll.records would
have to be >134217728. At that scale, the size of the SinkRecords themselves
becomes significant, and I would expect it to drown out any memory used by the
ArrayList itself.
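For concreteness, the copy under discussion duplicates only the element references (about 8 bytes each on a typical 64-bit JVM without compressed oops, hence roughly 2^27 records per extra GiB). A minimal sketch of the two hand-off strategies being compared, with hypothetical class and method names rather than the actual WorkerSinkTask code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: two ways of handing an accumulated batch to a sink.
class BatchDelivery<T> {
    private List<T> messageBatch = new ArrayList<>();

    void add(T record) {
        messageBatch.add(record);
    }

    // Current approach described in the ticket: hand the sink a defensive
    // copy, then clear the original. While both lists exist, the copy costs
    // ~8 bytes per element reference (e.g. 2^27 elements ~= 1 GiB extra).
    List<T> deliverByCopy() {
        List<T> copy = new ArrayList<>(messageBatch);
        messageBatch.clear();
        return copy;
    }

    // Proposed approach: hand over the original list and start a fresh one,
    // so no per-element reference copy is ever made. Assigning a new list
    // (rather than calling clear()) also stays safe if the sink nulls or
    // mutates the list it was given.
    List<T> deliverByHandoff() {
        List<T> batch = messageBatch;
        messageBatch = new ArrayList<>();
        return batch;
    }
}
```

Either way the sink receives a list it may freely mutate; the difference is only whether the element references are copied before the hand-off.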
> Memory optimisation for Kafka-connect
> -------------------------------------
>
> Key: KAFKA-17424
> URL: https://issues.apache.org/jira/browse/KAFKA-17424
> Project: Kafka
> Issue Type: Improvement
> Components: connect
> Affects Versions: 3.8.0
> Reporter: Ajit Singh
> Priority: Major
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> When Kafka Connect gives the sink task its own copy of the List<SinkRecord>,
> RAM utilisation shoots up: at that moment two lists exist in memory, and the
> original list is only cleared after the sink worker finishes the current
> batch.
>
> Currently the list is declared final and a copy of it is provided to the sink
> task, since sink tasks can be custom and the user is free to process the copy
> however they want without any risk. But one of the most popular uses of Kafka
> Connect is OLTP-to-OLAP replication, and during initial copying/snapshots a
> lot of data is generated rapidly, filling the list to its maximum batch size,
> which makes us prone to "Out of Memory" exceptions. The list's only lifecycle
> is: get filled -> cloned for the sink -> size read -> cleared -> repeat. So
> instead I record the size of the list before handing the original list to the
> sink task, and after the sink has performed its operations I set
> list = new ArrayList<>(). I did not use clear(), just in case the sink task
> has set our list to null.
> There is a time-vs-memory trade-off:
> in the original approach the JVM does not have to spend time allocating a new
> list, while in the new approach the JVM must allocate a fresh list each
> batch, but this leaves more memory free.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)