Fokko commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-2106799664
Fixed in #444
Fokko closed issue #428: Parallel Table.append
URL: https://github.com/apache/iceberg-python/issues/428
kevinjqliu commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1967795309
Here's the script I used to run `append()` with 8 threads, writing multiple
parquet files:
https://gist.github.com/kevinjqliu/e738641ec8f96de554c5ed39ead3f09a
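For reference, a minimal sketch of what such a benchmark can look like (the actual script is in the gist above; the catalog and table names here are made up, and `PYICEBERG_MAX_WORKERS` is assumed to control the size of the write thread pool):
```
import os

import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Assumption: PYICEBERG_MAX_WORKERS sizes pyiceberg's worker pool.
os.environ["PYICEBERG_MAX_WORKERS"] = "8"

# Hypothetical catalog/table; assumes a catalog named "default" is configured
# in ~/.pyiceberg.yaml and the table already exists with a matching schema.
catalog = load_catalog("default")
tbl = catalog.load_table("demo.parallel_append")

# Enough data that the write should be split across several parquet files.
n = 5_000_000
df = pa.table({
    "id": pa.array(range(n), type=pa.int64()),
    "payload": pa.array(["x" * 64] * n, type=pa.string()),
})

tbl.append(df)
```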
kevinjqliu commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1967788193
Thanks! I can see that it's using 8 threads with
```
SELECT * FROM duckdb_settings();
```
I also ran
```
SET threads TO 8;
```
just i
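The same check can be done from Python via the `duckdb` module; a small sketch (the thread count just mirrors the setting above):
```
import duckdb

con = duckdb.connect()
con.execute("SET threads TO 8;")

# duckdb_settings() exposes the effective configuration; the thread-related
# rows confirm the value actually in use.
for name, value in con.execute(
    "SELECT name, value FROM duckdb_settings() WHERE name LIKE '%threads%';"
).fetchall():
    print(name, value)
```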
bigluck commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1967730771
@kevinjqliu nice, duckdb should use
https://duckdb.org/docs/sql/configuration.html
kevinjqliu commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1967707048
As a way to benchmark multithreaded writes to multiple parquet files, I've
noticed that DuckDB's COPY command has `per_thread_output` and
`file_size_bytes` options.
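A rough sketch of that benchmark shape using the `duckdb` Python module; the source table, sizes, and output paths are made up, and the option spellings follow the DuckDB COPY documentation:
```
import duckdb

con = duckdb.connect()
con.execute("SET threads TO 8;")

# Hypothetical source table, for illustration only.
con.execute(
    "CREATE TABLE t AS "
    "SELECT range AS id, md5(range::VARCHAR) AS payload FROM range(5000000);"
)

# One output file per writer thread:
con.execute("COPY t TO 'out_per_thread' (FORMAT PARQUET, PER_THREAD_OUTPUT TRUE);")

# Roll over to a new file once it reaches roughly the size target:
con.execute("COPY t TO 'out_sized' (FORMAT PARQUET, FILE_SIZE_BYTES '128MB');")
```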
bigluck commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1963740242
@kevinjqliu, your latest changes are mind-blowing
(https://github.com/apache/iceberg-python/issues/428#issuecomment-1962460623
for reference)
I have tested your last changes
bigluck commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1962855117
Hey @kevinjqliu , we're currently debugging the issue on Slack, but I
thought it would be helpful to report our findings here as well. In my tests,
the pyarrow table is generated
kevinjqliu commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1962460623
hm. Looks like something weird is going on if the resulting parquet file is
1.6 GB. Each parquet file size should be at most 512 MB, if not less. See the
[bin packing
logic]
bigluck commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1962308872
Ciao @kevinjqliu, thanks!
I've tested it on the same `c5ad.16xlarge` machine, but the results are
pretty similar, 27s vs 28s for this table:
```
$ pip install git+h
```
kevinjqliu commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1962142370
@bigluck #444 should allow parallelized write. Do you mind giving it a try?
kevinjqliu commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1953026131
Here's how it's done in Java. [`BaseTaskWriter.java` checks
`targetFileSize`](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/io/BaseTaskWrit
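Roughly the same idea in Python, as a simplified sketch rather than the pyiceberg or Java implementation (target size and file naming are illustrative): keep writing batches to the current file and roll over to a new one once the bytes routed to it pass the target.
```
import pyarrow as pa
import pyarrow.parquet as pq

TARGET_FILE_SIZE = 512 * 1024 * 1024  # illustrative target, in bytes


def rolling_write(batches, path_prefix):
    """Write record batches, starting a new parquet file once the bytes
    routed to the current file exceed the target."""
    file_index, written, writer = 0, 0, None
    for batch in batches:
        if writer is None:
            writer = pq.ParquetWriter(f"{path_prefix}-{file_index}.parquet", batch.schema)
        writer.write_batch(batch)
        # Proxy: Arrow in-memory bytes, not the compressed size on disk.
        written += batch.nbytes
        if written >= TARGET_FILE_SIZE:
            writer.close()
            writer, written, file_index = None, 0, file_index + 1
    if writer is not None:
        writer.close()
```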
kevinjqliu commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1953010526
Oh interesting, the input is 1M records, 685.46 MB in memory. We bin-pack
the Arrow representation into 256MB chunks (`['224.61 MB', '236.23 MB', '224.62
MB']`), but writing
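For illustration, a naive greedy grouping by `nbytes` (not pyiceberg's actual bin-packing code, and with synthetic data standing in for the 1M-record input) that makes the in-memory chunk sizes easy to compare against what ends up on disk:
```
import pyarrow as pa
import pyarrow.parquet as pq

TARGET = 256 * 1024 * 1024  # bin target, in Arrow (in-memory) bytes


def naive_bins(table, target):
    """Greedy grouping of record batches into roughly target-sized bins."""
    bin_, size = [], 0
    for batch in table.to_batches():
        if size + batch.nbytes > target and bin_:
            yield bin_
            bin_, size = [], 0
        bin_.append(batch)
        size += batch.nbytes
    if bin_:
        yield bin_


# Synthetic stand-in for the ~685 MB input described above.
table = pa.table({"id": range(1_000_000), "payload": ["x" * 512] * 1_000_000})

for i, batches in enumerate(naive_bins(table, TARGET)):
    chunk = pa.Table.from_batches(batches)
    pq.write_table(chunk, f"chunk-{i}.parquet")
    print(i, f"arrow={chunk.nbytes / 2**20:.1f} MB")  # compare with on-disk size
```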
kevinjqliu commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-195299
#444 does something like this. It wrote out 3 files:
```
0-0-a61f9655-0d76-45ca-b85d-4d8dc8dbcbd9.parquet
0-1-a61f9655-0d76-45ca-b85d-4d8dc8dbcbd9.parquet
00
```
kevinjqliu commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1952969872
Integrating this with the write path, I have 2 approaches
1. refactoring `write_file` so that it can write multiple parquet files.
This means 1 `WriteTask` can produce
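A sketch of the first approach's shape (not the actual `write_file`/`WriteTask` code): split the batches into target-sized groups and hand each group to its own writer task on a thread pool.
```
import concurrent.futures

import pyarrow as pa
import pyarrow.parquet as pq


def write_one(batches, path):
    # Each task writes exactly one parquet file.
    pq.write_table(pa.Table.from_batches(batches), path)
    return path


def parallel_write(bins, path_prefix, max_workers=8):
    """bins: an iterable of lists of RecordBatches (e.g. from bin-packing)."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(write_one, batches, f"{path_prefix}-{i}.parquet")
            for i, batches in enumerate(bins)
        ]
        return [f.result() for f in futures]
```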
kevinjqliu commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1952966533
Thanks! I found it; I had to fuzzy search in VS Code :)
Here's an example of bin-packing an Arrow table.
https://colab.research.google.com/drive/1FM8mdr4j5KgsjBYmsp9_
Fokko commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1952962173
@kevinjqliu It is under
[utils/bin_packing.py](https://github.com/apache/iceberg-python/blob/main/pyiceberg/utils/bin_packing.py).
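Usage looks roughly like the following, assuming `PackingIterator` takes the items, a target weight, a lookback window, and a weight function as in the Java port; check the linked module for the exact signature:
```
import pyarrow as pa
from pyiceberg.utils.bin_packing import PackingIterator

table = pa.table({"id": range(1_000_000)})
batches = table.to_batches()

# Group record batches into bins of roughly 256 MB of Arrow memory.
bins = PackingIterator(
    items=batches,
    target_weight=256 * 1024 * 1024,
    lookback=len(batches),
    weight_func=lambda b: b.nbytes,
    largest_bin_first=False,
)

for batches_in_bin in bins:
    print(sum(b.nbytes for b in batches_in_bin) / 2**20, "MB")
```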
kevinjqliu commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1952912276
> we already have a bin-packing algorithm in the code
@Fokko can you point me to that? I couldn't find it
kevinjqliu commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1952910799
It was for data generation only. I can't seem to reproduce the parallelism
issue for `append`, probably due to MacBook's huge IO.
Fokko commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1952692768
Do we know if this is for data generation, or also when writing? In the end,
it would be good to be able to split the data into multiple files. The MacBooks
have huge IO, so it mi
bigluck commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1951883375
Thanks @kevinjqliu
Last week, I didn't test the code on my MBP; I did all the tests directly on
the EC2 instance.
BTW it seems to use all the cores on my M2 Max:
kevinjqliu commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1951397075
Also, @bigluck, while running the code to generate the data using faker, I
opened `htop` and saw that it was using 6 CPUs. I'm using an M1 Mac.
kevinjqliu commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1951395305
It seems like there's an upper bound to the size of the RecordBatch produced
by `to_batches`. I tried setting `max_chunksize` from `16 MB` to `256 MB`. All
the batches produc
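Worth noting: pyarrow's `Table.to_batches(max_chunksize=...)` caps the number of rows per batch, not bytes, and batches are further bounded by the table's existing chunk layout, which would explain a plateau in batch sizes. A quick way to see the effective sizes (synthetic data for illustration):
```
import pyarrow as pa

table = pa.table({"id": range(1_000_000), "payload": ["x" * 256] * 1_000_000})

# max_chunksize is a row count; the resulting byte size depends on the schema
# and on how the underlying columns are already chunked.
for rows in (64_000, 256_000, 1_000_000):
    batches = table.to_batches(max_chunksize=rows)
    sizes = [round(b.nbytes / 2**20, 1) for b in batches]
    print(rows, "rows/batch ->", sizes, "MB")
```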
kevinjqliu commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1951390652
I took the above code and did some investigation.
Here's the notebook to see it in action
https://colab.research.google.com/drive/12O4ARckCwJqP2U6L4WxREZbyPv_AmWC2?usp=sha
Fokko commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1943481392
@bigluck Thanks for raising this. This is on my list to look into!
Parallelizing this is always tricky since it is hard to know exactly how big
the Parquet file will be.