Re: [I] Parallel Table.append [iceberg-python]

2024-05-13 Thread via GitHub
Fokko commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-2106799664 Fixed in #444

Re: [I] Parallel Table.append [iceberg-python]

2024-05-13 Thread via GitHub
Fokko closed issue #428: Parallel Table.append URL: https://github.com/apache/iceberg-python/issues/428

Re: [I] Parallel Table.append [iceberg-python]

2024-02-27 Thread via GitHub
kevinjqliu commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1967795309 Here's the script I used to run the `append()` function with 8 threads, writing multiple parquet files: https://gist.github.com/kevinjqliu/e738641ec8f96de554c5ed39ead3f09a
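The gist above has the full script; a minimal sketch of that kind of benchmark (the catalog name `default` and table `demo.taxi` are made up here, and the worker count is assumed to be controlled via PyIceberg's `PYICEBERG_MAX_WORKERS` setting) could look like:

```python
# Hedged sketch of a threaded-append benchmark; not the gist's exact code.
import os
import time

import pyarrow as pa
from pyiceberg.catalog import load_catalog

os.environ["PYICEBERG_MAX_WORKERS"] = "8"  # size PyIceberg's internal thread pool

catalog = load_catalog("default")        # hypothetical catalog
table = catalog.load_table("demo.taxi")  # hypothetical table

# Toy 1M-row table; the real benchmark used a much larger generated dataset.
arrow_table = pa.table({"id": pa.array(range(1_000_000), type=pa.int64())})

start = time.perf_counter()
table.append(arrow_table)  # with #444, this can fan out across worker threads
print(f"append took {time.perf_counter() - start:.1f}s")
```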

Re: [I] Parallel Table.append [iceberg-python]

2024-02-27 Thread via GitHub
kevinjqliu commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1967788193 Thanks! I can see that it's using 8 threads with ``` SELECT * FROM duckdb_settings(); ``` I also ran ``` SET threads TO 8; ``` just in case.
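For reference, the same two statements wrapped in Python's duckdb client, to confirm the thread count programmatically:

```python
# Set and verify DuckDB's thread count; mirrors the SQL in the comment above.
import duckdb

con = duckdb.connect()
con.execute("SET threads TO 8;")
print(con.sql("SELECT name, value FROM duckdb_settings() WHERE name = 'threads';"))
```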

Re: [I] Parallel Table.append [iceberg-python]

2024-02-27 Thread via GitHub
bigluck commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1967730771 @kevinjqliu nice, duckdb should use https://duckdb.org/docs/sql/configuration.html

Re: [I] Parallel Table.append [iceberg-python]

2024-02-27 Thread via GitHub
kevinjqliu commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1967707048 As a way to benchmark multithreaded writes to multiple parquet files, I noticed that DuckDB's COPY command has the `per_thread_output` and `file_size_bytes` options.
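A sketch of that benchmark idea using the two options named above (the source table and output directory are made up):

```python
# Write one parquet file per thread, rolling files at ~512 MB each.
import duckdb

con = duckdb.connect()
con.execute("SET threads TO 8;")
con.execute("CREATE TABLE src AS SELECT * FROM range(10000000) t(i);")
con.execute("""
    COPY src TO 'out_dir'
    (FORMAT PARQUET, PER_THREAD_OUTPUT TRUE, FILE_SIZE_BYTES '512MB');
""")
```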

Re: [I] Parallel Table.append [iceberg-python]

2024-02-26 Thread via GitHub
bigluck commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1963740242 @kevinjqliu, your latest changes are mind-blowing (https://github.com/apache/iceberg-python/issues/428#issuecomment-1962460623 for reference). I have tested your last changes

Re: [I] Parallel Table.append [iceberg-python]

2024-02-25 Thread via GitHub
bigluck commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1962855117 Hey @kevinjqliu, we're currently debugging the issue on Slack, but I thought it would be helpful to report our findings here as well. In my tests, the pyarrow table is generated

Re: [I] Parallel Table.append [iceberg-python]

2024-02-24 Thread via GitHub
kevinjqliu commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1962460623 Hm, looks like something weird is going on if the resulting parquet file is 1.6 GB. Each parquet file should be at most 512 MB, if not less. See the [bin packing logic]
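One way to sanity-check what actually landed in the table is to list the data files a scan would read and print their sizes (catalog and table names below are hypothetical):

```python
# Print each referenced data file and its on-disk size, to check files
# stay near the bin-packing target.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # hypothetical catalog
table = catalog.load_table("demo.taxi")  # hypothetical table

for task in table.scan().plan_files():
    size_mb = task.file.file_size_in_bytes / 1024**2
    print(f"{task.file.file_path}: {size_mb:.1f} MB")
```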

Re: [I] Parallel Table.append [iceberg-python]

2024-02-24 Thread via GitHub
bigluck commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1962308872 Ciao @kevinjqliu, thanks! I've tested it on the same `c5ad.16xlarge` machine, but the results are pretty similar, 27s vs 28s for this table: ``` $ pip install git+h

Re: [I] Parallel Table.append [iceberg-python]

2024-02-23 Thread via GitHub
kevinjqliu commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1962142370 @bigluck #444 should allow parallelized write. Do you mind giving it a try?

Re: [I] Parallel Table.append [iceberg-python]

2024-02-19 Thread via GitHub
kevinjqliu commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1953026131 Here's how it's done in Java. [`BaseTaskWriter.java` checks `targetFileSize`](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/io/BaseTaskWrit
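The linked Java writer rolls to a new file whenever the bytes written so far cross `targetFileSize`. A rough Python rendering of that idea (a sketch only, not PyIceberg's actual write path):

```python
# Rolling-writer sketch: stream record batches into a parquet file and
# start a new file once the running size crosses the target.
import pyarrow as pa
import pyarrow.parquet as pq

TARGET_FILE_SIZE = 512 * 1024 * 1024  # 512 MB

def write_rolling(batches, schema: pa.Schema, path_fmt: str = "part-{:03d}.parquet"):
    file_index, written = 0, 0
    writer = pq.ParquetWriter(path_fmt.format(file_index), schema)
    for batch in batches:
        writer.write_batch(batch)
        written += batch.nbytes  # in-memory size, a rough proxy for on-disk size
        if written >= TARGET_FILE_SIZE:
            writer.close()
            file_index, written = file_index + 1, 0
            writer = pq.ParquetWriter(path_fmt.format(file_index), schema)
    writer.close()
```

Note that `nbytes` is the uncompressed in-memory size, only a proxy for the final parquet size — which is exactly why, as noted elsewhere in the thread, it is hard to know in advance how big the file will be.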

Re: [I] Parallel Table.append [iceberg-python]

2024-02-19 Thread via GitHub
kevinjqliu commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1953010526 Oh interesting, the input is 1M records, 685.46 MB in memory. We bin-pack the Arrow representation into 256MB chunks (`['224.61 MB', '236.23 MB', '224.62 MB']`), but writing
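PyIceberg ships a packing iterator (pointed out further down in the thread) that can group batches toward a byte target; a small sketch of using it for the 256 MB chunking described here (the `lookback` value is an arbitrary choice for illustration):

```python
# Bin-pack Arrow record batches toward a 256 MB target, weighting each
# batch by its in-memory size.
import pyarrow as pa
from pyiceberg.utils.bin_packing import PackingIterator

TARGET = 256 * 1024 * 1024  # 256 MB

arrow_table = pa.table({"x": pa.array(range(1_000_000))})
batches = arrow_table.to_batches()

bins = PackingIterator(batches, TARGET, lookback=20, weight_func=lambda b: b.nbytes)
for i, bin_batches in enumerate(bins):
    chunk = pa.Table.from_batches(bin_batches, schema=arrow_table.schema)
    print(f"bin {i}: {chunk.nbytes / 1024**2:.2f} MB")
```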

Re: [I] Parallel Table.append [iceberg-python]

2024-02-19 Thread via GitHub
kevinjqliu commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-195299 #444 does something like this; it wrote out 3 files: ``` 0-0-a61f9655-0d76-45ca-b85d-4d8dc8dbcbd9.parquet 0-1-a61f9655-0d76-45ca-b85d-4d8dc8dbcbd9.parquet 00

Re: [I] Parallel Table.append [iceberg-python]

2024-02-19 Thread via GitHub
kevinjqliu commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1952969872 Integrating this with the write path, I have 2 approaches: 1. refactoring `write_file` so that it can write multiple parquet files. This means 1 `WriteTask` can produce

Re: [I] Parallel Table.append [iceberg-python]

2024-02-19 Thread via GitHub
kevinjqliu commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1952966533 thanks! I found it, had to fuzzy search in vscode :) Here's an example of bin-packing an Arrow table. https://colab.research.google.com/drive/1FM8mdr4j5KgsjBYmsp9_

Re: [I] Parallel Table.append [iceberg-python]

2024-02-19 Thread via GitHub
Fokko commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1952962173 @kevinjqliu It is under [utils/bin_packing.py](https://github.com/apache/iceberg-python/blob/main/pyiceberg/utils/bin_packing.py).

Re: [I] Parallel Table.append [iceberg-python]

2024-02-19 Thread via GitHub
kevinjqliu commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1952912276 > we already have a bin-packing algorithm in the code @Fokko can you point me to that? I couldn't find it

Re: [I] Parallel Table.append [iceberg-python]

2024-02-19 Thread via GitHub
kevinjqliu commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1952910799 It was for data generation only. I can't seem to reproduce the parallelism issue for `append`, probably due to the MacBook's huge IO.

Re: [I] Parallel Table.append [iceberg-python]

2024-02-19 Thread via GitHub
Fokko commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1952692768 Do we know if this is for data generation, or also when writing? In the end, it would be good to be able to split the data into multiple files. The MacBooks have huge IO, so it mi

Re: [I] Parallel Table.append [iceberg-python]

2024-02-18 Thread via GitHub
bigluck commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1951883375 Thanks @kevinjqliu. Last week, I didn't test the code on my MBP; I did all the tests directly on the EC2 instance. BTW, it seems to use all the cores on my M2 Max:

Re: [I] Parallel Table.append [iceberg-python]

2024-02-18 Thread via GitHub
kevinjqliu commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1951397075 Also, @bigluck, while running the code to generate the data using faker, I opened `htop` and saw that it was using 6 CPUs. I'm using an M1 Mac.

Re: [I] Parallel Table.append [iceberg-python]

2024-02-18 Thread via GitHub
kevinjqliu commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1951395305 It seems like there's an upper bound to the size of the RecordBatch produced by `to_batches`. I tried setting `max_chunksize` from `16 MB` to `256 MB`. All the batches produc
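One wrinkle worth flagging: pyarrow's `Table.to_batches(max_chunksize=...)` takes a maximum row count, not bytes, and it only ever splits existing chunks, never merges them — so the table's underlying chunk layout imposes an upper bound on batch size regardless of the value passed, which would explain the observed cap. A sketch of converting a byte target into rows:

```python
# Convert a byte target into a row count for to_batches(); batches can
# still come out smaller, since max_chunksize only splits, never merges.
import pyarrow as pa

arrow_table = pa.table({"x": pa.array(range(1_000_000))})

target_bytes = 256 * 1024 * 1024
avg_row_size = arrow_table.nbytes / arrow_table.num_rows
rows_per_batch = max(1, int(target_bytes / avg_row_size))

for batch in arrow_table.to_batches(max_chunksize=rows_per_batch):
    print(f"{batch.num_rows} rows, {batch.nbytes / 1024**2:.2f} MB")
```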

Re: [I] Parallel Table.append [iceberg-python]

2024-02-18 Thread via GitHub
kevinjqliu commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1951390652 I took the above code and did some investigation. Here's the notebook to see it in action https://colab.research.google.com/drive/12O4ARckCwJqP2U6L4WxREZbyPv_AmWC2?usp=sha

Re: [I] Parallel Table.append [iceberg-python]

2024-02-14 Thread via GitHub
Fokko commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1943481392 @bigluck Thanks for raising this. This is on my list to look into! Parallelizing this is always tricky, since it is hard to know exactly how big the Parquet file will be.