[Improvement] Optimize data flushing and memory usage for huge partitions to improve stability · Issue #378 · apache/uniffle · GitHub
Skip to content

[Improvement] Optimize data flushing and memory usage for huge partitions to improve stability  #378

Description

@zuston

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

What would you like to be improved?

Background

In our internal uniffle cluster, when using the storage type of MEMORY_LOCALFILE, the partial single huge partitions make disk reach high watermark, which make whole shuffle-server out-of-service due to full memory occupation.

Possible solutions

  1. Using the storage type of MEMORY_LOCALFILE_HDFS with the fallback strategy of LocalStorageManagerFallbackStrategy. [ISSUE-163][FEATURE] Write to hdfs when local disk can't be write #235

After testing this mechanism, I found some bugs which make fallback invalid.

  1. Introduce the new strategy of dynamic disk selection [Improvement] Optimize local disk selection strategy #373
  2. Introduce partition size based strategy to flush single huge partition data to HDFS

Currently, I prefer 3th solution. In which, we could use cold storage(HDFS) when cumulative size of a particular partition is above a specific threshold(like 50g?). Actually, if exclude the partial huge partitions, the disk free ratio of whole shuffle-server is 20%-30%.

Reasons

  1. We could avoid hotspot of a single shuffle server, which could use HDFS to distribute pressure.
  2. Especially useful when disk spaces of shuffle servers are limited, which cannot hold a super large shuffle partition.
  3. Compared with 1th solution, it is more restrained for HDFS use and will not cause greater pressure. (Actually we don't want to accept much shuffle-data to HDFS with 3 replicas)
  4. Compared with 2th solution, it is more effective and have better performance.
  5. Only downgrade storage of partial huge partition jobs to HDFS, which is effective, especially when the HDFS cluster performance is not good.

Final solution

The solution of handling huge partitions is to make it flush to HDFS directly and limit memory usage, all subtasks are as follows.

  1. Speed up flushing partition data to HDFS.
  2. Introduce the memory usage limitation for huge partitions

How should we improve?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions