zuston commented on issue #378: URL: https://github.com/apache/incubator-uniffle/issues/378#issuecomment-1342637715
> In our production environment the disk is enough most of the time. And it is difficult to set the threshold of partition size, and this will result in reduced disk usage. So we prefer to write HDFS when the disk is full.

Yes, it is difficult to set the threshold of partition size. I still think the existing mechanism struggles to handle a huge single partition, especially in the scenario mentioned by @ChengbingLiu. I want to share some practices that tried to solve this problem but failed.

## First case

I enabled the storage type `MEMORY_LOCALFILE_HDFS` with `rss.server.flush.cold.storage.threshold.size` set to 100G, a buffer capacity of 10GB, and the storage fallback strategy `LocalStorageFallbackStrategy`.

When a huge single partition arrived and quickly pushed the disk to its high watermark, the server started flushing to HDFS directly. Unfortunately, because this huge partition occupied a large share of memory (6GB), the shuffle dataFlushEvent was large and took a long time to consume. As a result, other jobs were unable to write data normally. *(screenshot omitted)*

## Second case

After the above attempt, I set `rss.server.flush.cold.storage.threshold.size` to 128M, meaning that once a partition buffer reaches 128M it is flushed directly to HDFS. Unfortunately, this did not help either. Although the huge partition generates several dataFlushEvents waiting to be flushed to a single HDFS file, its writerHandler holds the write lock and writes them one by one.

1. This keeps the flush thread pool busy while the writing speed stays slow, which again prevents other jobs from writing data normally.
2. In particular, when a shuffle server hosts only a few partitions, the local disk stays almost empty, which is unreasonable.

*(screenshot omitted)*

## Possible solutions

The above problems are hard to avoid in a production environment.
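Before turning to proposed strategies, the write-lock bottleneck of the second case can be shown with a minimal, self-contained sketch. These are illustrative classes, not Uniffle's actual ones: even with a large flush thread pool, a single per-file lock means the HDFS writes never overlap, so pool threads sit occupied while throughput stays that of one writer.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class SerializedFlushDemo {
    static final Object writeLock = new Object();   // one lock per target HDFS file
    static final AtomicInteger concurrent = new AtomicInteger();
    static volatile int maxConcurrent = 0;

    // Stand-in for a writerHandler flushing one dataFlushEvent.
    static void flushToHdfs(int eventId) throws InterruptedException {
        synchronized (writeLock) {                  // the write lock held per event
            int now = concurrent.incrementAndGet();
            maxConcurrent = Math.max(maxConcurrent, now);
            Thread.sleep(50);                       // pretend HDFS write latency
            concurrent.decrementAndGet();
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8); // the flush pool
        for (int i = 0; i < 8; i++) {
            final int id = i;
            pool.submit(() -> {
                try { flushToHdfs(id); } catch (InterruptedException ignored) { }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        // All 8 pool threads were occupied, but writes never overlapped:
        System.out.println("max concurrent writes = " + maxConcurrent); // prints 1
    }
}
```

Raising the pool size here changes nothing, which matches the observation that the pool looks busy while the writing speed stays slow.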
I think we need to introduce a new strategy to solve this. The following solutions are just a draft:

1. Introduce a blacklist of apps based on their flushed size: once an app reaches the limit, the shuffle server only grants it limited memory and flushes its data to HDFS directly.
2. Let the client obtain multiple candidates (multiple shuffle servers for one partition, not multiple replicas). Once a write fails on one shuffle server, the client can write the data to another, distributing the huge data pressure across the Uniffle cluster. After all, the overall disk utilization rate of the cluster is relatively low.
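As a rough sketch of the first draft solution, the server could track cumulative flushed bytes per app and flip the app into a "limited memory, direct-to-HDFS" mode past a threshold. All names below are hypothetical, not existing Uniffle APIs:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class HugePartitionBlacklist {
    private final long thresholdBytes;
    private final Map<String, Long> flushedBytes = new ConcurrentHashMap<>();
    private final Set<String> blacklisted = ConcurrentHashMap.newKeySet();

    HugePartitionBlacklist(long thresholdBytes) {
        this.thresholdBytes = thresholdBytes;
    }

    // Called after every flush; once the app's total crosses the
    // threshold it is blacklisted: limited memory + direct HDFS flush.
    void recordFlush(String appId, long bytes) {
        long total = flushedBytes.merge(appId, bytes, Long::sum);
        if (total >= thresholdBytes) {
            blacklisted.add(appId);
        }
    }

    boolean shouldFlushDirectlyToHdfs(String appId) {
        return blacklisted.contains(appId);
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024 * 1024;
        HugePartitionBlacklist bl = new HugePartitionBlacklist(100 * gb);
        bl.recordFlush("app-1", 60 * gb);
        System.out.println(bl.shouldFlushDirectlyToHdfs("app-1")); // false, under 100GB
        bl.recordFlush("app-1", 50 * gb);
        System.out.println(bl.shouldFlushDirectlyToHdfs("app-1")); // true, 110GB flushed
    }
}
```

The key design point is that the decision is based on observed flushed size, so no partition-size threshold has to be guessed up front, which was the difficulty raised at the top of this thread.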
