Declow opened a new issue, #2325:
URL: https://github.com/apache/iceberg-python/issues/2325
### Apache Iceberg version
None
### Please describe the bug 🐞
There appears to be a memory leak in pyiceberg/avro/reader.py.
I have a long-running service that keeps crashing. I tried to replicate the
issue locally, and the leak shows up there as well.
The following code creates an in-memory catalog and generates some sample data
for ingestion into Iceberg.
```
import tracemalloc
from datetime import datetime, timezone

import polars as pl
from pyiceberg.catalog.memory import InMemoryCatalog


def generate_df():
    """Build 1000 rows of sample event data, including a struct column."""
    df = pl.DataFrame(
        {
            "event_type": ["playback"] * 1000,
            "event_origin": ["origin1"] * 1000,
            "event_send_at": [datetime.now(timezone.utc)] * 1000,
            "event_saved_at": [datetime.now(timezone.utc)] * 1000,
            "data": [
                {
                    "calendarKey": "calendarKey",
                    "id": str(i),
                    "referenceId": f"ref-{i}",
                }
                for i in range(1000)
            ],
        }
    )
    return df


# Create the catalog, namespace, and table once, using the schema of a sample frame.
df = generate_df()
catalog = InMemoryCatalog("default", warehouse="/tmp/iceberg")
catalog.create_namespace("default")
table = catalog.create_table(
    "default.leak", schema=df.to_arrow().schema, location="/tmp/iceberg/leak"
)

tracemalloc.start()
for i in range(1000):
    df = generate_df()
    df.write_iceberg(table, mode="append")
    # Print the ten largest allocation sites after each append.
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics("lineno")
    for stat in top_stats[:10]:
        print(stat)
```
Slowly but steadily, the memory that tracemalloc attributes to the Avro reader increases:
> /Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:330: size=370 KiB, count=3782, average=100 B
> /Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:190: size=222 KiB, count=1891, average=120 B
> /Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:133: size=184 KiB, count=5673, average=33 B
After some more writes, the output looks like this:
> /Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:330: size=420 KiB, count=4290, average=100 B
> /Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:190: size=251 KiB, count=2145, average=120 B
> /Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:133: size=208 KiB, count=6435, average=33 B
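The growth between two outputs like these could also be measured directly instead of compared by eye. The following is a minimal sketch using only the standard-library `tracemalloc` API; the helper name `growth_for` is hypothetical and not part of PyIceberg.

```python
import tracemalloc


def growth_for(pattern, before, after, limit=10):
    """Compare two tracemalloc snapshots, restricted to files matching `pattern`.

    Returns a list of (location, size delta in bytes, block-count delta),
    ordered by the absolute size difference, largest first.
    """
    only = [tracemalloc.Filter(True, pattern)]  # inclusive filename filter
    diffs = after.filter_traces(only).compare_to(
        before.filter_traces(only), "lineno"
    )
    return [(str(d.traceback), d.size_diff, d.count_diff) for d in diffs[:limit]]
```

In the reproduction loop, one snapshot could be taken after a few warm-up appends and a second one after some more appends; `growth_for("*/pyiceberg/avro/reader.py", before, after)` would then list only the reader lines and how much each grew.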
If we take a look at the AvroFile class, it uses the __enter__ and __exit__
dunder methods. The __enter__ method assigns the reader to a variable on the
instance, but it seems like the various reader objects stick around afterwards.
https://github.com/apache/iceberg-python/blob/main/pyiceberg/avro/file.py#L192
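To check which reader classes actually stick around, live objects could be counted by class between writes. This is a rough sketch using the standard-library `gc` module; the helper name `live_instances_by_class` is hypothetical, and it only sees objects tracked by the garbage collector, so it may undercount.

```python
import gc
from collections import Counter


def live_instances_by_class(module_name):
    """Count GC-tracked objects whose class is defined in `module_name`."""
    gc.collect()  # free unreachable cycles first so only reachable objects remain
    counts = Counter()
    for obj in gc.get_objects():
        cls = type(obj)
        if getattr(cls, "__module__", None) == module_name:
            counts[cls.__name__] += 1
    return counts
```

Calling `live_instances_by_class("pyiceberg.avro.reader")` every few iterations of the append loop and printing the result should show whether the number of reader instances keeps rising, and for which classes.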
### Willingness to contribute
- [ ] I can contribute a fix for this bug independently
- [x] I would be willing to contribute a fix for this bug with guidance from
the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]