[Improvement] Add timeout reconnection when DelegationRssShuffleManager send the request of AccessCluster by smallzhongfeng · Pull Request #139 · apache/uniffle · GitHub
Skip to content

[Improvement] Add timeout reconnection when DelegationRssShuffleManager send the request of AccessCluster#139

Merged
roryqi merged 8 commits into
apache:masterfrom
smallzhongfeng:DelegationRssShuffleManager-retry
Aug 8, 2022
Merged

[Improvement] Add timeout reconnection when DelegationRssShuffleManager send the request of AccessCluster#139
roryqi merged 8 commits into
apache:masterfrom
smallzhongfeng:DelegationRssShuffleManager-retry

Conversation

@smallzhongfeng

@smallzhongfeng smallzhongfeng commented Aug 6, 2022

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

To solve issue #127

Why are the changes needed?

Avoid some memory shortage situations, and retry to ensure that the tasks run in the RSS cluster as much as possible.

Does this PR introduce any user-facing change?

Two new parameters are added on the client side, spark.rss.client.access.retry.times the number of retry reconnection and spark.rss.client.access.retry.interval.ms the reconnection interval. The user can set these two parameters within his expected time to make the task run in the RSS cluster as much as possible.

How was this patch tested?

No need.

@codecov-commenter

codecov-commenter commented Aug 6, 2022

Copy link
Copy Markdown



public static final ConfigEntry<Long> RSS_CLIENT_FALLBACK_RETRY_INTERVAL = createLongBuilder(
new ConfigBuilder("spark.rss.client.fallback.retry.interval")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use spark.rss.client.access.retry.interval.ms? Because we don't attempt to fallback, we want to access RSS.The variable's name should have unit, it improves readability.

@smallzhongfeng smallzhongfeng Aug 7, 2022

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

@roryqi

roryqi commented Aug 7, 2022

Copy link
Copy Markdown
Contributor

Does this PR introduce any user-facing change?

No.

This is a user-facing change. It add some config options, you should add some documents to explain how to use it.

.createWithDefault(RssClientConfig.RSS_CLIENT_ASSIGNMENT_RETRY_TIMES_DEFAULT_VALUE);


public static final ConfigEntry<Long> RSS_CLIENT_FALLBACK_RETRY_INTERVAL = createLongBuilder(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we usespark.rss.client.access.retry.interval.ms? Because we want to access RSS instead of fallback, the variable's name contains unit, it will improve readability.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we change RSS_CLIENT_FALLBACK_RETRY_INTERVAL to RSS_CLIENT_ACCESS_RETRY_INTERVAL_MS together?

.doc("Interval between retries fallback to SortShuffleManager"))
.createWithDefault(20000L);

public static final ConfigEntry<Integer> RSS_CLIENT_FALLBACK_RETRY_TIMES = createIntegerBuilder(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

public static final ConfigEntry<Integer> RSS_CLIENT_FALLBACK_RETRY_TIMES = createIntegerBuilder(
new ConfigBuilder("spark.rss.client.fallback.retry.times")
.doc("Number of retries fallback to SortShuffleManager"))
.createWithDefault(3);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use default value 0? Because we want to keep consistent with the previous behaviour.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with you, users can change it on their own if necessary.

@smallzhongfeng

smallzhongfeng commented Aug 7, 2022

Copy link
Copy Markdown
Contributor Author

Does this PR introduce any user-facing change?

No.

This is a user-facing change. It add some config options, you should add some documents to explain how to use it.

Added.

.createWithDefault(RssClientConfig.RSS_CLIENT_ASSIGNMENT_RETRY_TIMES_DEFAULT_VALUE);


public static final ConfigEntry<Long> RSS_CLIENT_FALLBACK_RETRY_INTERVAL = createLongBuilder(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we change RSS_CLIENT_FALLBACK_RETRY_INTERVAL to RSS_CLIENT_ACCESS_RETRY_INTERVAL_MS together?

.createWithDefault(20000L);

public static final ConfigEntry<Integer> RSS_CLIENT_FALLBACK_RETRY_TIMES = createIntegerBuilder(
new ConfigBuilder("spark.rss.client.access.retry.times")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we change RSS_CLIENT_FALLBACK_RETRY_TIMES to RSS_CLIENT_ACCESS_RETRY_TIMES together?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to change it. Updated.

@roryqi

roryqi commented Aug 7, 2022

Copy link
Copy Markdown
Contributor

Could we add some test cases for this pr? Could you add documents in https://github.com/apache/incubator-uniffle/blob/master/docs/client_guide.md?

@smallzhongfeng

Copy link
Copy Markdown
Contributor Author

Could we add some test cases for this pr? Could you add documents in https://github.com/apache/incubator-uniffle/blob/master/docs/client_guide.md?

I think this PR has more logic of retry, and I think it is enough to have the test class RetryUtilsTest, or do you have any better suggestions?

@roryqi

roryqi commented Aug 7, 2022

Copy link
Copy Markdown
Contributor

Could we add some test cases for this pr? Could you add documents in https://github.com/apache/incubator-uniffle/blob/master/docs/client_guide.md?

I think this PR has more logic of retry, and I think it is enough to have the test class RetryUtilsTest, or do you have any better suggestions?

Could we add a test case in DelegationRssShuffleManagerTest? If you change the logic, we'd better have test case.

@smallzhongfeng

Copy link
Copy Markdown
Contributor Author

Could we add some test cases for this pr? Could you add documents in https://github.com/apache/incubator-uniffle/blob/master/docs/client_guide.md?

I think this PR has more logic of retry, and I think it is enough to have the test class RetryUtilsTest, or do you have any better suggestions?

Could we add a test case in DelegationRssShuffleManagerTest? If you change the logic, we'd better have test case.

I added a simple test. Do you have a good idea?


@Test
public void testTryAccessCluster() {
SparkConf conf = new SparkConf();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we mock coordinator client? Could we use the method tryAccessCluster in this case?

@smallzhongfeng smallzhongfeng Aug 7, 2022

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought so at first, but tryAccessCluster is not directly called by the CoordinatorClient, and the tryAccessClustermethod depends on the list of global object coordinatorClients, so I don't have a better way for the time being.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a case that the the access fail 4 times and we need to create sort shuffle manager?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

@roryqi roryqi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @smallzhongfeng LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants