Introduce startup-silent-period mechanism to avoid partial assignments by zuston · Pull Request #247 · apache/uniffle · GitHub
Skip to content

Introduce startup-silent-period mechanism to avoid partial assignments#247

Merged
roryqi merged 3 commits into
apache:masterfrom
zuston:rejection
Oct 10, 2022
Merged

Introduce startup-silent-period mechanism to avoid partial assignments#247
roryqi merged 3 commits into
apache:masterfrom
zuston:rejection

Conversation

@zuston

@zuston zuston commented Sep 28, 2022

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

Introduce startup-silent-period mechanism to avoid partial assignments

Why are the changes needed?

When changing some coordinator's conf and then restart, coordinator will accept client getAssignment request immediately, but it will serve for jobs request based on the partial registered shuffle-servers, which will make some jobs gotten not enough required shuffle-servers and then slow the running speed.

I think we should make coordinator wait for more than one shuffle-server heartbeat interval before serving for client. During out-of-service, requests from client will fallback to slave coordinator.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

UTs

@zuston zuston requested a review from roryqi September 28, 2022 05:49
@zuston

zuston commented Sep 28, 2022

Copy link
Copy Markdown
Member Author

@codecov-commenter

codecov-commenter commented Sep 28, 2022

Copy link
Copy Markdown

Codecov Report

Merging #247 (63af987) into master (c89f95c) will increase coverage by 0.02%.
The diff coverage is 66.66%.

@@             Coverage Diff              @@
##             master     #247      +/-   ##
============================================
+ Coverage     59.16%   59.18%   +0.02%     
- Complexity     1340     1343       +3     
============================================
  Files           163      163              
  Lines          8810     8837      +27     
  Branches        833      835       +2     
============================================
+ Hits           5212     5230      +18     
- Misses         3332     3339       +7     
- Partials        266      268       +2     
Impacted Files Coverage Δ
...he/uniffle/coordinator/CoordinatorGrpcService.java 2.31% <0.00%> (-0.03%) ⬇️
...ache/uniffle/coordinator/SimpleClusterManager.java 86.61% <53.33%> (-4.46%) ⬇️
...rg/apache/uniffle/coordinator/CoordinatorConf.java 97.26% <100.00%> (+0.20%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@zuston zuston changed the title [FEATUER] Introduce startup-silent-period mechanism to avoid partial assignments Introduce startup-silent-period mechanism to avoid partial assignments Sep 28, 2022
@roryqi

roryqi commented Sep 29, 2022

Copy link
Copy Markdown
Contributor

Is startup-silent-period a good name? What's name of similar mechanism of Yarn?

@zuston

zuston commented Sep 29, 2022

Copy link
Copy Markdown
Member Author

Is startup-silent-period a good name? What's name of similar mechanism of Yarn?

No such config in Yarn. Safe mode is for HDFS. I think it's not proper in Uniffle scenarios

Do u have any ideas?

@roryqi

roryqi commented Oct 9, 2022

Copy link
Copy Markdown
Contributor

Is startup-silent-period a good name? What's name of similar mechanism of Yarn?

No such config in Yarn. Safe mode is for HDFS. I think it's not proper in Uniffle scenarios

Do u have any ideas?

It's ok for Uniffle. I don't have another good ideas.

@zuston

zuston commented Oct 9, 2022

Copy link
Copy Markdown
Member Author

It's ok for Uniffle. I don't have another good ideas.

Got it. I will rename to safemode.

@roryqi

roryqi commented Oct 9, 2022

Copy link
Copy Markdown
Contributor

It's ok for Uniffle. I don't have another good ideas.

Got it. I will rename to safemode.

I means that startup-silent-period is ok.

@zuston

zuston commented Oct 9, 2022

Copy link
Copy Markdown
Member Author

OK. Do u have any other advice? @jerqi

.key("rss.coordinator.startup-silent-period.enabled")
.booleanType()
.defaultValue(false)
.withDescription("Enable the startup-silent-period to reject the assignment requests "

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add more description to explain why we shouldn't use true as default value?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@zuston

zuston commented Oct 10, 2022

Copy link
Copy Markdown
Member Author

Updated.

  1. Add the doc of these configs
  2. Explain why make the startup-slient-period stop default.

@roryqi roryqi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @zuston , wait for CI

@zuston

zuston commented Oct 10, 2022

Copy link
Copy Markdown
Member Author
Error:  Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 17.998 s <<< FAILURE! - in org.apache.uniffle.test.CoordinatorGrpcTest
Error:  rpcMetricsTest  Time elapsed: 1.035 s  <<< FAILURE!
org.opentest4j.AssertionFailedError: expected: <0.0> but was: <1.0>

CI failed. https://github.com/apache/incubator-uniffle/actions/runs/3216468616/jobs/5258387024 #244

Rerun it.

@roryqi

roryqi commented Oct 10, 2022

Copy link
Copy Markdown
Contributor

@roryqi roryqi merged commit 5875eb3 into apache:master Oct 10, 2022
@zuston zuston deleted the rejection branch October 10, 2022 06:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants