[ISSUE-106][IMPROVEMENT] Set rpc timeout for all rpc interface by xianjingfeng · Pull Request #113 · apache/uniffle · GitHub
Skip to content

[ISSUE-106][IMPROVEMENT] Set rpc timeout for all rpc interface#113

Merged
roryqi merged 9 commits into
apache:masterfrom
xianjingfeng:issue_106
Aug 3, 2022
Merged

[ISSUE-106][IMPROVEMENT] Set rpc timeout for all rpc interface#113
roryqi merged 9 commits into
apache:masterfrom
xianjingfeng:issue_106

Conversation

@xianjingfeng

@xianjingfeng xianjingfeng commented Jul 30, 2022

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

Solve issue #106 Set rpc timeout for all rpc interface

Why are the changes needed?

If oom encountered in shuffle server, rpc client will blocked until shuffle server was killed

Does this PR introduce any user-facing change?

No

How was this patch tested?

No need

@xianjingfeng xianjingfeng changed the title Issue 106 [Improvement] Set rpc timeout for all rpc interface Jul 30, 2022
@xianjingfeng

Copy link
Copy Markdown
Member Author

@xianjingfeng

Copy link
Copy Markdown
Member Author

Should we modify the value of rss.server.app.expired.withoutHeartbeat in all ut? @jerqi

@roryqi

roryqi commented Aug 1, 2022

Copy link
Copy Markdown
Contributor

Should we modify the value of rss.server.app.expired.withoutHeartbeat in all ut? @jerqi

Could you just modify the rpc timeout configuration to solve this problem?

@xianjingfeng

Copy link
Copy Markdown
Member Author

Should we modify the value of rss.server.app.expired.withoutHeartbeat in all ut? @jerqi

Could you just modify the rpc timeout configuration to solve this problem?

This ut not fail every time, it is affected by host load. Maybe it will be normal to run again

@roryqi

roryqi commented Aug 1, 2022

Copy link
Copy Markdown
Contributor

Should we modify the value of rss.server.app.expired.withoutHeartbeat in all ut? @jerqi

Could you just modify the rpc timeout configuration to solve this problem?

This ut not fail every time, it is affected by host load. Maybe it will be normal to run again

You can trigger the test when you close the pr and reopen the pr again. According to my experience, the test won't be flaky test, if there are multiple failure pipelines.

@xianjingfeng xianjingfeng reopened this Aug 1, 2022
@roryqi

roryqi commented Aug 1, 2022

Copy link
Copy Markdown
Contributor

cc @colinmjj

@xianjingfeng

Copy link
Copy Markdown
Member Author

Error: writeTest Time elapsed: 51.138 s <<< FAILURE! org.opentest4j.AssertionFailedError: Unexpected flush process at org.apache.uniffle.server.ShuffleFlushManagerTest.waitForFlush(ShuffleFlushManagerTest.java:315) at org.apache.uniffle.server.ShuffleFlushManagerTest.writeTest(ShuffleFlushManagerTest.java:123)
Maybe something's wrong with CI ? @jerqi

@xianjingfeng xianjingfeng reopened this Aug 1, 2022
@roryqi

roryqi commented Aug 1, 2022

Copy link
Copy Markdown
Contributor

Error: writeTest Time elapsed: 51.138 s <<< FAILURE! org.opentest4j.AssertionFailedError: Unexpected flush process at org.apache.uniffle.server.ShuffleFlushManagerTest.waitForFlush(ShuffleFlushManagerTest.java:315) at org.apache.uniffle.server.ShuffleFlushManagerTest.writeTest(ShuffleFlushManagerTest.java:123) Maybe something's wrong with CI ? @jerqi

It's a flaky test. If you rerun, it will succeed.

@roryqi

roryqi commented Aug 2, 2022

Copy link
Copy Markdown
Contributor

If you want to see the log, you can download https://github.com/apache/incubator-uniffle/actions/runs/2779258464
企业微信截图_ddffa6a0-c008-4150-bfa7-3fa21a08d420

@xianjingfeng

Copy link
Copy Markdown
Member Author

I can't see the contents like this. I just can see simple maven log.

@roryqi

roryqi commented Aug 2, 2022

Copy link
Copy Markdown
Contributor

I can't see the contents like this. I just can see simple maven log.

Click the link that I give you, and search Artifacts

@xianjingfeng

Copy link
Copy Markdown
Member Author

Artifacts

I found it, thanks

@codecov-commenter

codecov-commenter commented Aug 2, 2022

Copy link
Copy Markdown

Codecov Report

Merging #113 (561d557) into master (deb7fe4) will increase coverage by 0.59%.
The diff coverage is n/a.

@@             Coverage Diff              @@
##             master     #113      +/-   ##
============================================
+ Coverage     56.52%   57.11%   +0.59%     
- Complexity     1183     1199      +16     
============================================
  Files           149      150       +1     
  Lines          8019     8176     +157     
  Branches        767      773       +6     
============================================
+ Hits           4533     4670     +137     
- Misses         3242     3259      +17     
- Partials        244      247       +3     
Impacted Files Coverage Δ
.../apache/uniffle/coordinator/ClientConfManager.java 91.54% <0.00%> (-1.41%) ⬇️
...he/uniffle/coordinator/CoordinatorGrpcService.java 2.33% <0.00%> (-0.03%) ⬇️
.../java/org/apache/hadoop/mapreduce/RssMRConfig.java 66.66% <0.00%> (ø)
...apache/hadoop/mapreduce/v2/app/RssMRAppMaster.java 0.00% <0.00%> (ø)
...org/apache/spark/shuffle/RssSparkShuffleUtils.java 49.20% <0.00%> (ø)
...ava/org/apache/uniffle/common/util/RetryUtils.java 71.42% <0.00%> (ø)
...e/uniffle/coordinator/BasicAssignmentStrategy.java 96.77% <0.00%> (+0.34%) ⬆️
...che/spark/shuffle/writer/BufferManagerOptions.java 75.67% <0.00%> (+1.48%) ⬆️
...oordinator/PartitionBalanceAssignmentStrategy.java 98.46% <0.00%> (+4.91%) ⬆️
.../java/org/apache/spark/shuffle/RssSparkConfig.java 96.49% <0.00%> (+21.49%) ⬆️
... and 1 more

Help us with your feedback. Take ten seconds to tell us how you rate us.

@roryqi

roryqi commented Aug 2, 2022

Copy link
Copy Markdown
Contributor

LGTM @colinmjj Do you have any other suggestion?

@roryqi roryqi changed the title [Improvement] Set rpc timeout for all rpc interface [ISSUE-106][IMPROVEMENT] Set rpc timeout for all rpc interface Aug 2, 2022
@colinmjj

colinmjj commented Aug 3, 2022

Copy link
Copy Markdown

@roryqi roryqi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution @xianjingfeng

@roryqi roryqi merged commit 503f796 into apache:master Aug 3, 2022
@xianjingfeng xianjingfeng deleted the issue_106 branch March 1, 2023 13:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants