zuston · 2023-01-11T10:40:12Z

What changes were proposed in this pull request?

Introduce a memory usage limit for huge partitions to keep writes to regular partitions stable.
Once a partition is marked as a huge partition, a single-buffer flush is triggered whenever its buffer size exceeds rss.server.single.buffer.flush.threshold, whether or not single-buffer flush is enabled.
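
For readers skimming the description, the flush rule above could be sketched roughly as follows. This is only an illustration, not the code in this PR: the class name, method name, and parameters are hypothetical, and it assumes regular partitions keep honoring the existing enable switch.

```java
// Hypothetical sketch; none of these names come from the Uniffle code base.
final class SingleBufferFlushSketch {

  /**
   * Decides whether one shuffle buffer should be flushed on its own.
   * Assumption: a regular partition only flushes when the feature switch is on,
   * while a huge partition ignores the switch and only checks the threshold.
   */
  static boolean shouldFlushSingleBuffer(
      long bufferSizeBytes,
      boolean isHugePartition,
      boolean singleBufferFlushEnabled,
      long singleBufferFlushThresholdBytes) {
    boolean flushAllowed = isHugePartition || singleBufferFlushEnabled;
    return flushAllowed && bufferSizeBytes > singleBufferFlushThresholdBytes;
  }

  public static void main(String[] args) {
    long threshold = 64L * 1024 * 1024; // illustrative threshold value
    long bigBuffer = threshold + 1;

    // Huge partition, switch off: still flushed once above the threshold.
    System.out.println(shouldFlushSingleBuffer(bigBuffer, true, false, threshold));
    // Regular partition, switch off: not flushed.
    System.out.println(shouldFlushSingleBuffer(bigBuffer, false, false, threshold));
  }
}
```

The point of the sketch is that the huge-partition flag only bypasses the enable switch; the threshold itself is shared with regular single-buffer flush.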

Why are the changes needed?

To solve the problems described in [Improvement] Optimize data flushing and memory usage for huge partitions to improve stability #378

Does this PR introduce any user-facing change?

Yes.

How was this patch tested?

UTs

advancedxy · 2023-01-12T02:20:38Z

advancedxy · 2023-01-12T02:22:36Z

I think these two similar configurations will bring more confusion to end users.

zuston · 2023-01-12T03:02:43Z

> I don't think we should introduce a new configuration to control buffer flush. rss.server.single.buffer.flush.enabled can be used for the huge partition buffer flush purpose. If the huge partition limit is enabled, the single buffer flush could be enabled automatically.

If single buffer flush is enabled automatically, how would the flush threshold size be set? I'd rather not enable rss.server.single.buffer.flush.enabled for regular partitions, which can be flushed to SSD/HDD directly instead of cold storage, while huge partitions could be flushed to HDFS.

> I think these two similar configurations will bring more confusion to end users.

Emm. Yes.

zuston · 2023-01-12T03:05:52Z

I think I get your point: rss.server.single.buffer.flush.threshold should apply to huge partitions whether or not single buffer flush is enabled. Right?

codecov-commenter · 2023-01-12T06:45:12Z

Codecov Report

Merging #471 (cfffd7a) into master (19a8bac) will increase coverage by 0.07%.
The diff coverage is 75.47%.

@@             Coverage Diff              @@
##             master     #471      +/-   ##
============================================
+ Coverage     58.78%   58.86%   +0.07%     
- Complexity     1704     1714      +10     
============================================
  Files           206      206              
  Lines         11471    11517      +46     
  Branches       1024     1033       +9     
============================================
+ Hits           6743     6779      +36     
- Misses         4317     4324       +7     
- Partials        411      414       +3

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...pache/uniffle/server/ShuffleServerGrpcService.java | `0.79% <0.00%> (-0.02%)` | ⬇️ |
| .../org/apache/uniffle/server/ShuffleTaskManager.java | `76.50% <76.92%> (-0.06%)` | ⬇️ |
| ...he/uniffle/server/buffer/ShuffleBufferManager.java | `83.21% <90.47%> (+0.45%)` | ⬆️ |
| ...a/org/apache/uniffle/server/ShuffleServerConf.java | `99.31% <100.00%> (+0.02%)` | ⬆️ |
| ...rg/apache/uniffle/server/ShuffleServerMetrics.java | `97.05% <0.00%> (-0.14%)` | ⬇️ |
| ...org/apache/uniffle/server/ShuffleFlushManager.java | `84.04% <0.00%> (+0.08%)` | ⬆️ |
| ...g/apache/uniffle/server/ShuffleDataFlushEvent.java | `83.67% <0.00%> (+0.34%)` | ⬆️ |
| ...ava/org/apache/uniffle/server/ShuffleTaskInfo.java | `100.00% <0.00%> (+5.55%)` | ⬆️ |

advancedxy · 2023-01-12T11:18:54Z

> > I don't think we should introduce a new configuration to control buffer flush. rss.server.single.buffer.flush.enabled can be used for the huge partition buffer flush purpose. If the huge partition limit is enabled, the single buffer flush could be enabled automatically.

> If single buffer flush is enabled automatically, how would the flush threshold size be set? I'd rather not enable rss.server.single.buffer.flush.enabled for regular partitions, which can be flushed to SSD/HDD directly instead of cold storage, while huge partitions could be flushed to HDFS.

> > I think these two similar configurations will bring more confusion to end users.

> Emm. Yes.

Sorry, I forgot to reply to this comment.

If I understand the code and design correctly, rss.server.single.buffer.flush.enabled was added to support flushing big buffers directly to cold storage such as HDFS, which I think perfectly matches your intention when serving huge partitions.

So I think we could reuse the rss.server.single.buffer.flush.threshold setting, which is 64MB by default. Then we don't have to introduce rss.server.huge-partition.memory.limit.ratio?

zuston · 2023-01-12T11:28:51Z

> So I think we could reuse the rss.server.single.buffer.flush.threshold setting, which is 64MB by default. Then we don't have to introduce rss.server.huge-partition.memory.limit.ratio?

A memory limit is required. A huge partition is usually written quickly, and flushing to HDFS is slower than writing. Without a memory limit, regular partitions will be affected.

By the way, this design is already in our internal version and works well. Before this PR, one huge partition would sometimes cause the buffer requirements of other regular partitions to fail.
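
To make the memory limit argument concrete, here is a rough sketch of how such a limit might gate buffer requirements. It is an assumption-laden illustration rather than the PR's implementation: the class, fields, and methods are hypothetical, and it assumes the limit is the server's buffer capacity multiplied by a ratio (in the spirit of rss.server.huge-partition.memory.limit.ratio) and that a partition counts as huge once its written size passes a threshold (in the spirit of rss.server.huge-partition.size.threshold).

```java
// Hypothetical sketch; names and structure are illustrative only.
final class HugePartitionMemoryLimitSketch {
  private final long hugePartitionSizeThresholdBytes;
  private final long memoryLimitBytes;

  HugePartitionMemoryLimitSketch(
      long bufferCapacityBytes, double memoryLimitRatio, long hugePartitionSizeThresholdBytes) {
    // Assumption: the per-partition limit is a fraction of the total buffer capacity.
    this.memoryLimitBytes = (long) (bufferCapacityBytes * memoryLimitRatio);
    this.hugePartitionSizeThresholdBytes = hugePartitionSizeThresholdBytes;
  }

  /** A partition is treated as huge once its total written size passes the threshold. */
  boolean isHugePartition(long totalWrittenBytes) {
    return totalWrittenBytes > hugePartitionSizeThresholdBytes;
  }

  /**
   * Regular partitions are never limited here. A huge partition may only require
   * more buffer while its in-memory (not yet flushed) data stays within the limit.
   */
  boolean canRequireBuffer(long totalWrittenBytes, long inMemoryBytes) {
    if (!isHugePartition(totalWrittenBytes)) {
      return true;
    }
    return inMemoryBytes <= memoryLimitBytes;
  }

  public static void main(String[] args) {
    // Illustrative numbers: 1024 bytes of capacity, ratio 0.25, huge above 100 bytes.
    HugePartitionMemoryLimitSketch limit = new HugePartitionMemoryLimitSketch(1024, 0.25, 100);
    System.out.println(limit.canRequireBuffer(50, 1000));  // regular partition
    System.out.println(limit.canRequireBuffer(500, 257));  // huge partition over the limit
  }
}
```

With a cap like this, a huge partition that writes faster than HDFS can absorb is rejected once it holds too much memory, which leaves the rest of the buffer pool available to regular partitions.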

advancedxy · 2023-01-12T12:16:05Z

> > So I think we could reuse the rss.server.single.buffer.flush.threshold setting, which is 64MB by default. Then we don't have to introduce rss.server.huge-partition.memory.limit.ratio?

> A memory limit is required. A huge partition is usually written quickly, and flushing to HDFS is slower than writing. Without a memory limit, regular partitions will be affected.

> By the way, this design is already in our internal version and works well. Before this PR, one huge partition would sometimes cause the buffer requirements of other regular partitions to fail.

I'm OK with adding the rss.server.huge-partition.size.threshold setting, which is used to apply the memory limit. I just think we should perhaps reuse rss.server.single.buffer.flush.threshold instead of adding another setting, rss.server.huge-partition.memory.limit.ratio?

zuston · 2023-01-12T12:32:22Z

> I'm OK with adding the rss.server.huge-partition.size.threshold setting, which is used to apply the memory limit. I just think we should perhaps reuse rss.server.single.buffer.flush.threshold instead of adding another setting, rss.server.huge-partition.memory.limit.ratio?

This suggestion has been accepted 😁 (Here: #471 (comment)).

Please review the latest code.

advancedxy · 2023-01-12T12:55:25Z

> > I'm OK with adding the rss.server.huge-partition.size.threshold setting, which is used to apply the memory limit. I just think we should perhaps reuse rss.server.single.buffer.flush.threshold instead of adding another setting, rss.server.huge-partition.memory.limit.ratio?

> This suggestion has been accepted 😁 (Here: #471 (comment)).

> Please review the latest code.

OK. I will take a detailed look by tomorrow. But at first glance, the rss.server.huge-partition.memory.limit.ratio setting still exists?

zuston · 2023-01-12T13:23:51Z

rss.server.huge-partition.memory.limit.ratio is the conf for the memory limit, not the buffer flush threshold. The original conf that controlled flushing for huge partitions has been removed: 49c2400

zuston · 2023-01-16T02:33:06Z

PTAL @advancedxy. After this PR is merged, I will introduce some metrics about huge partition.

advancedxy

LGTM.

zuston · 2023-01-17T02:02:52Z

[ISSUE-378][HugePartition][Part-2] Introduce memory usage limit

c7e3959

zuston force-pushed the memoryLimit branch from 0061178 to c7e3959 Compare January 11, 2023 10:42

Add data flush for huge partition

49c2400

zuston changed the title ~~[ISSUE-378][HugePartition][Part-2] Introduce memory usage limit~~ [ISSUE-378][HugePartition][Part-2] Introduce memory usage limit and data flush Jan 11, 2023

zuston requested a review from roryqi January 11, 2023 11:37

fix checkstyle

192ddae

reuse conf of single buffer flush

9907b6f

zuston requested a review from advancedxy January 12, 2023 06:26

zuston linked an issue Jan 12, 2023 that may be closed by this pull request

[Improvement] Optimize data flushing and memory usage for huge partitions to improve stability #378

Closed

8 tasks

zuston removed a link to an issue Jan 12, 2023

[Improvement] Optimize data flushing and memory usage for huge partitions to improve stability #378

Closed

8 tasks

zuston mentioned this pull request Jan 12, 2023

[Improvement] Optimize data flushing and memory usage for huge partitions to improve stability #378

Closed

8 tasks

zuston added 2 commits January 12, 2023 15:56

add tests for data flush

a64a30c

add doc

e412e59

advancedxy reviewed Jan 13, 2023

Comment thread server/src/main/java/org/apache/uniffle/server/ShuffleServerGrpcService.java Outdated

Comment thread server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java

Comment thread server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java Outdated

fix

cfffd7a

zuston requested a review from advancedxy January 16, 2023 02:33

advancedxy reviewed Jan 16, 2023

Comment thread server/src/main/java/org/apache/uniffle/server/ShuffleServerConf.java

Comment thread server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java

zuston requested a review from advancedxy January 16, 2023 06:18

advancedxy approved these changes Jan 16, 2023

zuston merged commit e802b93 into apache:master Jan 17, 2023

zuston deleted the memoryLimit branch January 17, 2023 02:02

leixm mentioned this pull request Mar 23, 2023

[Improvement] ShuffleBufferManager.flushSingleBufferIfNecessary unreasonable #756

Closed

3 tasks

zuston mentioned this pull request Jan 19, 2024

[#1356] feat(server): improve expired buffers metric and log #1469

Merged
