zuston commented on issue #378: URL: https://github.com/apache/incubator-uniffle/issues/378#issuecomment-1342637715
> In our production environment the disk is enough most of the time. And it is difficult to set the threshold of partition size, and this will result in reduced disk usage. So we prefer to write HDFS when the disk is full.

Yes, it is difficult to set the threshold of partition size. I still think the existing mechanism struggles to handle a huge single partition, especially in the scenario mentioned by @ChengbingLiu. I want to share some practices that tried to solve this problem but failed.

## First case

I enabled the storage type `MEMORY_LOCALFILE_HDFS` with `rss.server.flush.cold.storage.threshold.size` set to 100G, a buffer capacity of 10GB, and the storage fallback strategy `LocalStorageFallbackStrategy`.

When a huge single partition arrived and quickly pushed the disk to its high watermark, the server started flushing to HDFS directly. Unfortunately, because this huge partition occupied a large share of memory (6GB), the shuffle dataFlushEvent was large and took a long time to consume. As a result, other jobs were unable to write data normally. *(screenshot omitted)*

## Second case

After the above attempt, I set `rss.server.flush.cold.storage.threshold.size` to 128M, meaning that once a partition buffer reaches 128M it is flushed directly to HDFS. Unfortunately, this did not help either. Although the huge partition generates several dataFlushEvents waiting to be flushed to a single HDFS file, its writerHandler holds the write lock and writes them one by one.

1. This keeps the flush thread pool busy while the writing speed stays slow, which again prevents other jobs from writing data normally.
2. In particular, when a shuffle server hosts only a few partitions, the local disk stays almost empty, which is unreasonable.

*(screenshot omitted)*

## Possible solutions

The above problems are hard to avoid in a production environment.
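Before turning to proposed strategies, the write-lock bottleneck of the second case can be shown with a minimal, self-contained sketch. These are illustrative classes, not Uniffle's actual ones: even with a large flush thread pool, a single per-file lock means the HDFS writes never overlap, so pool threads sit occupied while throughput stays that of one writer.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class SerializedFlushDemo {
    static final Object writeLock = new Object();   // one lock per target HDFS file
    static final AtomicInteger concurrent = new AtomicInteger();
    static volatile int maxConcurrent = 0;

    // Stand-in for a writerHandler flushing one dataFlushEvent.
    static void flushToHdfs(int eventId) throws InterruptedException {
        synchronized (writeLock) {                  // the write lock held per event
            int now = concurrent.incrementAndGet();
            maxConcurrent = Math.max(maxConcurrent, now);
            Thread.sleep(50);                       // pretend HDFS write latency
            concurrent.decrementAndGet();
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8); // the flush pool
        for (int i = 0; i < 8; i++) {
            final int id = i;
            pool.submit(() -> {
                try { flushToHdfs(id); } catch (InterruptedException ignored) { }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        // All 8 pool threads were occupied, but writes never overlapped:
        System.out.println("max concurrent writes = " + maxConcurrent); // prints 1
    }
}
```

Raising the pool size here changes nothing, which matches the observation that the pool looks busy while the writing speed stays slow.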
I think we need to introduce a new strategy to solve this. The following solutions are just a draft:

1. Introduce a blacklist of apps based on their flushed size: once an app reaches the limit, the shuffle server only grants it limited memory and flushes its data to HDFS directly.
2. Let the client obtain multiple candidates (multiple shuffle servers for one partition, not multiple replicas). Once a write fails on one shuffle server, the client can write the data to another, distributing the huge data pressure across the Uniffle cluster. After all, the overall disk utilization rate of the cluster is relatively low.
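As a rough sketch of the first draft solution, the server could track cumulative flushed bytes per app and flip the app into a "limited memory, direct-to-HDFS" mode past a threshold. All names below are hypothetical, not existing Uniffle APIs:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class HugePartitionBlacklist {
    private final long thresholdBytes;
    private final Map<String, Long> flushedBytes = new ConcurrentHashMap<>();
    private final Set<String> blacklisted = ConcurrentHashMap.newKeySet();

    HugePartitionBlacklist(long thresholdBytes) {
        this.thresholdBytes = thresholdBytes;
    }

    // Called after every flush; once the app's total crosses the
    // threshold it is blacklisted: limited memory + direct HDFS flush.
    void recordFlush(String appId, long bytes) {
        long total = flushedBytes.merge(appId, bytes, Long::sum);
        if (total >= thresholdBytes) {
            blacklisted.add(appId);
        }
    }

    boolean shouldFlushDirectlyToHdfs(String appId) {
        return blacklisted.contains(appId);
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024 * 1024;
        HugePartitionBlacklist bl = new HugePartitionBlacklist(100 * gb);
        bl.recordFlush("app-1", 60 * gb);
        System.out.println(bl.shouldFlushDirectlyToHdfs("app-1")); // false, under 100GB
        bl.recordFlush("app-1", 50 * gb);
        System.out.println(bl.shouldFlushDirectlyToHdfs("app-1")); // true, 110GB flushed
    }
}
```

The key design point is that the decision is based on observed flushed size, so no partition-size threshold has to be guessed up front, which was the difficulty raised at the top of this thread.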
