smallzhongfeng · 2022-08-06T16:01:50Z

What changes were proposed in this pull request?

To solve issue #127

Why are the changes needed?

To cope with some memory-shortage situations by retrying the access request, so that tasks run in the RSS cluster as much as possible.

Does this PR introduce any user-facing change?

Two new client-side parameters are added: spark.rss.client.access.retry.times, the number of access retries, and spark.rss.client.access.retry.interval.ms, the interval between retries. Users can set these two parameters to fit the time they are willing to wait, so that the task runs in the RSS cluster as much as possible.
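To illustrate how the two parameters interact, here is a minimal, self-contained sketch of a retry loop. It is not the code from this PR: the class AccessRetrySketch, the method tryAccessWithRetry, and the BooleanSupplier standing in for one access request are all hypothetical, and the sketch assumes that the retry count means extra attempts after the first one, which may differ from the merged implementation.

```java
import java.util.function.BooleanSupplier;

public class AccessRetrySketch {

  /**
   * Calls accessOnce until it succeeds or the retries are used up.
   * Assumption: retryTimes counts the extra attempts after the first one,
   * so retryTimes = 0 means exactly one attempt and no waiting.
   */
  public static boolean tryAccessWithRetry(
      BooleanSupplier accessOnce, int retryTimes, long retryIntervalMs) {
    for (int attempt = 0; attempt <= retryTimes; attempt++) {
      if (accessOnce.getAsBoolean()) {
        return true; // access granted, stop retrying
      }
      if (attempt < retryTimes) {
        try {
          Thread.sleep(retryIntervalMs); // wait before the next attempt
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // keep the interrupt flag set
          return false;
        }
      }
    }
    return false; // every attempt was rejected
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Hypothetical access request that is rejected twice, then accepted.
    boolean accepted = tryAccessWithRetry(() -> ++calls[0] >= 3, 3, 10L);
    System.out.println("accepted=" + accepted + ", attempts=" + calls[0]);
  }
}
```

Under these assumptions the worst-case wait before giving up is roughly the retry count multiplied by the interval, which is the time budget a user would be choosing with the two settings.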

How was this patch tested?

No need.

codecov-commenter · 2022-08-06T16:18:56Z

roryqi · 2022-08-07T03:39:34Z

-
+
+  public static final ConfigEntry<Long> RSS_CLIENT_FALLBACK_RETRY_INTERVAL = createLongBuilder(
+      new ConfigBuilder("spark.rss.client.fallback.retry.interval")


Could we use spark.rss.client.access.retry.interval.ms? We are not trying to fall back; we are trying to access RSS. The name should also include the unit, which improves readability.

roryqi · 2022-08-07T03:49:17Z

Does this PR introduce any user-facing change?

No.

This is a user-facing change. It adds some config options, so you should add documentation explaining how to use them.

roryqi · 2022-08-07T03:46:04Z

          .createWithDefault(RssClientConfig.RSS_CLIENT_ASSIGNMENT_RETRY_TIMES_DEFAULT_VALUE);
-
+
+  public static final ConfigEntry<Long> RSS_CLIENT_FALLBACK_RETRY_INTERVAL = createLongBuilder(


Could we use spark.rss.client.access.retry.interval.ms? We want to access RSS rather than fall back, and including the unit in the name improves readability.

Could we change RSS_CLIENT_FALLBACK_RETRY_INTERVAL to RSS_CLIENT_ACCESS_RETRY_INTERVAL_MS as well?

roryqi · 2022-08-07T03:46:23Z

+          .doc("Interval between retries fallback to SortShuffleManager"))
+      .createWithDefault(20000L);
+
+  public static final ConfigEntry<Integer> RSS_CLIENT_FALLBACK_RETRY_TIMES = createIntegerBuilder(


roryqi · 2022-08-07T03:47:59Z

+  public static final ConfigEntry<Integer> RSS_CLIENT_FALLBACK_RETRY_TIMES = createIntegerBuilder(
+      new ConfigBuilder("spark.rss.client.fallback.retry.times")
+          .doc("Number of retries fallback to SortShuffleManager"))
+      .createWithDefault(3);


Could we use 0 as the default value? That keeps the default consistent with the previous behaviour.

Agree with you, users can change it on their own if necessary.

smallzhongfeng · 2022-08-07T07:41:06Z

Does this PR introduce any user-facing change?

No.

This is a user-facing change. It adds some config options, so you should add documentation explaining how to use them.

Added.

roryqi · 2022-08-07T08:18:20Z

          .createWithDefault(RssClientConfig.RSS_CLIENT_ASSIGNMENT_RETRY_TIMES_DEFAULT_VALUE);
-
+
+  public static final ConfigEntry<Long> RSS_CLIENT_FALLBACK_RETRY_INTERVAL = createLongBuilder(


Could we change RSS_CLIENT_FALLBACK_RETRY_INTERVAL to RSS_CLIENT_ACCESS_RETRY_INTERVAL_MS as well?

roryqi · 2022-08-07T08:18:57Z

+      .createWithDefault(20000L);
+
+  public static final ConfigEntry<Integer> RSS_CLIENT_FALLBACK_RETRY_TIMES = createIntegerBuilder(
+      new ConfigBuilder("spark.rss.client.access.retry.times")


Could we change RSS_CLIENT_FALLBACK_RETRY_TIMES to RSS_CLIENT_ACCESS_RETRY_TIMES as well?

I forgot to change it. Updated.

roryqi · 2022-08-07T08:20:39Z

Could we add some test cases for this pr? Could you add documents in https://github.com/apache/incubator-uniffle/blob/master/docs/client_guide.md?

smallzhongfeng · 2022-08-07T09:17:02Z

Could we add some test cases for this pr? Could you add documents in https://github.com/apache/incubator-uniffle/blob/master/docs/client_guide.md?

This PR mostly adds retry logic, so I think the test class RetryUtilsTest is enough. Or do you have a better suggestion?

roryqi · 2022-08-07T09:22:15Z

Could we add some test cases for this pr? Could you add documents in https://github.com/apache/incubator-uniffle/blob/master/docs/client_guide.md?

This PR mostly adds retry logic, so I think the test class RetryUtilsTest is enough. Or do you have a better suggestion?

Could we add a test case in DelegationRssShuffleManagerTest? If you change the logic, we'd better have a test case.

smallzhongfeng · 2022-08-07T14:46:57Z

Could we add some test cases for this pr? Could you add documents in https://github.com/apache/incubator-uniffle/blob/master/docs/client_guide.md?

This PR mostly adds retry logic, so I think the test class RetryUtilsTest is enough. Or do you have a better suggestion?

Could we add a test case in DelegationRssShuffleManagerTest? If you change the logic, we'd better have a test case.

I added a simple test. Do you have a better idea?

roryqi · 2022-08-07T15:11:57Z


+  @Test
+  public void testTryAccessCluster() {
+    SparkConf conf = new SparkConf();


Could we mock coordinator client? Could we use the method tryAccessCluster in this case?

I thought so at first, but tryAccessCluster is not called directly by the CoordinatorClient, and the tryAccessCluster method depends on the global list coordinatorClients, so I don't have a better way for the time being.

coordinatorClients is created by RssSparkShuffleUtils.createCoordinatorClients(sparkConf). Could we imitate https://github.com/apache/incubator-uniffle/blob/79804c544b560ae3e872964a428d328dd71489a7/client-spark/spark2/src/test/java/org/apache/spark/shuffle/DelegationRssShuffleManagerTest.java#L79

roryqi · 2022-08-08T03:43:20Z

+    conf.set(RssSparkConfig.RSS_ACCESS_ID.key(), "mockId");
+    conf.set(RssSparkConfig.RSS_COORDINATOR_QUORUM.key(), "m1:8001,m2:8002");
+    conf.set("spark.rss.storage.type", StorageType.LOCALFILE.name());
+    assertCreateRssShuffleManager(conf);


Could we add a case where the access fails 4 times and we need to create the sort shuffle manager?
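For illustration only, the shape of such a case can be sketched without Spark or a mocked coordinator client. Everything below is hypothetical: FallbackAfterRetriesSketch, ManagerKind, and chooseManager are stand-ins rather than names from the Uniffle code base, and the sketch assumes that 3 retries mean 4 attempts in total.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class FallbackAfterRetriesSketch {

  /** Hypothetical stand-in for the kind of shuffle manager that gets created. */
  public enum ManagerKind { RSS, SORT }

  /**
   * Picks RSS if any attempt is accepted, otherwise falls back to SORT.
   * Assumption: retryTimes counts extra attempts after the first one.
   */
  public static ManagerKind chooseManager(
      BooleanSupplier accessOnce, int retryTimes, long retryIntervalMs) {
    for (int attempt = 0; attempt <= retryTimes; attempt++) {
      if (accessOnce.getAsBoolean()) {
        return ManagerKind.RSS;
      }
      if (attempt < retryTimes) {
        try {
          Thread.sleep(retryIntervalMs); // pause between attempts
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // keep the interrupt flag set
          break; // stop retrying and fall back
        }
      }
    }
    return ManagerKind.SORT;
  }

  public static void main(String[] args) {
    AtomicInteger attempts = new AtomicInteger();
    // Fake access request that is always rejected and counts its calls.
    BooleanSupplier alwaysDenied = () -> {
      attempts.incrementAndGet();
      return false;
    };
    ManagerKind kind = chooseManager(alwaysDenied, 3, 1L);
    if (kind != ManagerKind.SORT || attempts.get() != 4) {
      throw new AssertionError("expected fallback to SORT after 4 failed attempts");
    }
    System.out.println("fell back to " + kind + " after " + attempts.get() + " attempts");
  }
}
```

Counting the calls on the fake request is what makes the case meaningful: it checks both that the fallback happens and that the retries were actually spent before it.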

roryqi

Thanks @smallzhongfeng LGTM

Add timeout reconnection when DelegationRssShuffleManager send the re…

b2aa545

…quest of AccessCluster

roryqi reviewed Aug 7, 2022

change params' name

eb99951

roryqi reviewed Aug 7, 2022

add doc

56b7470

smallzhongfeng added 3 commits August 7, 2022 22:13

add test for TryAccessCluster

4c95fd8

add test for spark3

9d7133e

fix sth

c483ead

roryqi reviewed Aug 7, 2022

add test

249b8eb

roryqi reviewed Aug 8, 2022

add test

2a84fdf

roryqi approved these changes Aug 8, 2022

roryqi merged commit 2c6705c into apache:master Aug 8, 2022

roryqi mentioned this pull request Aug 8, 2022

[Improvement] Add timeout reconnection in DelegationRssShuffleManager #127

Closed



