Re: [PR] Flink: Custom partitioner for bucket partitions [iceberg]

2024-09-13 Thread via GitHub
stevenzwu commented on PR #7161: URL: https://github.com/apache/iceberg/pull/7161#issuecomment-2350416841 @binshuohu Currently, there is no plan to reapply this change to the main branch. We have a more general range distribution available now (guided by statistics collection): https://ice
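For readers landing here, a minimal sketch of what asking the Flink sink for the range distribution mentioned above could look like, assuming a recent Iceberg release where the sink accepts `DistributionMode.RANGE` and using a made-up Hadoop table location:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.DistributionMode;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

public class RangeDistributionSinkSketch {
  // Ask the Iceberg Flink sink for range distribution instead of wiring a
  // custom bucket partitioner in front of the writers.
  public static void appendWithRangeDistribution(DataStream<RowData> rows) {
    TableLoader tableLoader =
        TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/db/events"); // hypothetical table location
    FlinkSink.forRowData(rows)
        .tableLoader(tableLoader)
        .distributionMode(DistributionMode.RANGE) // statistics-guided shuffle on recent releases
        .append();
  }
}
```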

Re: [PR] Flink: Custom partitioner for bucket partitions [iceberg]

2024-09-12 Thread via GitHub
binshuohu commented on PR #7161: URL: https://github.com/apache/iceberg/pull/7161#issuecomment-2347381724 @stevenzwu Is there any plan to reapply this change to the main branch? Has there been any follow-up since https://github.com/apache/iceberg/pull/8848?

Re: [PR] Flink: Custom partitioner for bucket partitions [iceberg]

2023-12-18 Thread via GitHub
stevenzwu commented on PR #7161: URL: https://github.com/apache/iceberg/pull/7161#issuecomment-1862079413 @bendevera Here is our presentation: https://www.youtube.com/watch?v=GJplmOO7ULA&t=18s. Here is the design doc: https://docs.google.com/document/d/13N8cMqPi-ZPSKbkXGOBMPOzbv2Fua59j8bIj

Re: [PR] Flink: Custom partitioner for bucket partitions [iceberg]

2023-12-18 Thread via GitHub
bendevera commented on PR #7161: URL: https://github.com/apache/iceberg/pull/7161#issuecomment-1861920740 @stevenzwu thank you for the quick response! Okay, will run some `BucketPartitioner` tests for our use case by copying code manually. Smart shuffling sounds interesting and would
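A rough stand-in for what copying the partitioner by hand might look like, written against plain Flink APIs so it can be tested in isolation; the round-robin offset rotation below is a simplification for illustration, not necessarily the exact logic merged in this PR:

```java
import org.apache.flink.api.common.functions.Partitioner;

public class SimpleBucketPartitioner implements Partitioner<Integer> {
  private final int numBuckets;
  private int nextOffset = 0;

  public SimpleBucketPartitioner(int numBuckets) {
    this.numBuckets = numBuckets;
  }

  @Override
  public int partition(Integer bucketId, int numPartitions) {
    if (numPartitions <= numBuckets) {
      // Fewer writer subtasks than buckets: several buckets share a writer.
      return bucketId % numPartitions;
    }
    // More writers than buckets: rotate each bucket over its share of writers
    // so a single hot bucket is not pinned to one subtask.
    int writersPerBucket = numPartitions / numBuckets;
    int offset = nextOffset++ % writersPerBucket;
    return bucketId * writersPerBucket + offset;
  }
}
```

It would be wired in front of the writers with something like `rows.partitionCustom(new SimpleBucketPartitioner(8), keySelectorThatExtractsTheBucketId)`, where the key selector is whatever computes the bucket id for a row.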

Re: [PR] Flink: Custom partitioner for bucket partitions [iceberg]

2023-12-18 Thread via GitHub
stevenzwu commented on PR #7161: URL: https://github.com/apache/iceberg/pull/7161#issuecomment-1861850713 It was reverted because there are users depending on the previous behavior of keyBy on all partition columns: https://github.com/apache/iceberg/pull/7161#issuecomment-1761169778 We w

Re: [PR] Flink: Custom partitioner for bucket partitions [iceberg]

2023-12-18 Thread via GitHub
bendevera commented on PR #7161: URL: https://github.com/apache/iceberg/pull/7161#issuecomment-1861295520 @stevenzwu I see defaulting to `BucketPartitioner` was reverted here: https://github.com/apache/iceberg/pull/8848. We've found performance issues with `DistributionMode.HASH`, and

Re: [PR] Flink: Custom partitioner for bucket partitions [iceberg]

2023-10-16 Thread via GitHub
chenwyi2 commented on PR #7161: URL: https://github.com/apache/iceberg/pull/7161#issuecomment-1765530967 Under normal conditions, only the data for the current minute will be written. However, if the data is delayed, for example, data for 11:50 has not been written by 11:55, then at 11:56

Re: [PR] Flink: Custom partitioner for bucket partitions [iceberg]

2023-10-15 Thread via GitHub
stevenzwu commented on PR #7161: URL: https://github.com/apache/iceberg/pull/7161#issuecomment-1763706777 Is the partition time an event time or an ingestion/processing time? Or, asking in a different way, how many active minute partitions does the Flink writer job process in every commit cycle? I

Re: [PR] Flink: Custom partitioner for bucket partitions [iceberg]

2023-10-15 Thread via GitHub
chenwyi2 commented on PR #7161: URL: https://github.com/apache/iceberg/pull/7161#issuecomment-1763604926 Yes, I am creating very fine-grained partitions, because I want to query and compute some business metrics between minutes as fast as possible. As for the bucket number, I use a formula: QPS *

Re: [PR] Flink: Custom partitioner for bucket partitions [iceberg]

2023-10-13 Thread via GitHub
stevenzwu commented on PR #7161: URL: https://github.com/apache/iceberg/pull/7161#issuecomment-1762592732 @chenwyi2 Is your point that we shouldn't only consider the bucketing column (as this PR does)? You just want a plain keyBy in this case? That would be a fair point. Do you get balanced
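For comparison, a minimal sketch of the "plain keyBy" alternative being discussed: hash the full (dt, hour, minute, bucket) tuple so all rows of one partition land on a single writer subtask. Field positions and the modulo stand-in for bucket(id) are assumptions for illustration, not Iceberg's real bucket transform:

```java
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.data.RowData;

public class PartitionKeyByExample {
  // Key by the full partition tuple so every row of a given
  // (dt, hour, minute, bucket) partition goes to one writer subtask.
  public static DataStream<RowData> keyByAllPartitionColumns(
      DataStream<RowData> rows, int numBuckets) {
    KeySelector<RowData, String> partitionKey = row ->
        row.getString(0).toString()               // dt (assumed field position)
            + "|" + row.getInt(1)                  // hour
            + "|" + row.getInt(2)                  // minute
            + "|" + (row.getLong(3) % numBuckets); // simple stand-in for bucket(id)
    return rows.keyBy(partitionKey);
  }
}
```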

Re: [PR] Flink: Custom partitioner for bucket partitions [iceberg]

2023-10-13 Thread via GitHub
chenwyi2 commented on PR #7161: URL: https://github.com/apache/iceberg/pull/7161#issuecomment-1761169778 Hi @stevenzwu @kengtin, this PR can create too many small files when partitioning by dt, hour, minute and bucket(id): suppose parallelism is 120 and the bucket number is 8, then 15 writers can wri
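A back-of-the-envelope check of the numbers in this comment; the one-file-per-writer-per-partition assumption is illustrative:

```java
public class SmallFileEstimate {
  public static void main(String[] args) {
    int parallelism = 120; // writer subtasks
    int numBuckets = 8;    // width of bucket(id)

    // A bucket-only partitioner spreads each bucket over parallelism / numBuckets writers.
    int writersPerBucket = parallelism / numBuckets; // 15

    // If every writer that receives rows for a (dt, hour, minute, bucket) partition
    // rolls at least one data file per commit, each minute partition gets:
    int filesPerMinutePartition = writersPerBucket * numBuckets; // 120
    System.out.printf("writers per bucket = %d, files per minute partition >= %d%n",
        writersPerBucket, filesPerMinutePartition);
  }
}
```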