[Feature] Support user's app quota level limit by smallzhongfeng · Pull Request #311 · apache/uniffle · GitHub
Skip to content

[Feature] Support user's app quota level limit#311

Merged
roryqi merged 13 commits into
apache:masterfrom
smallzhongfeng:app-quota
Nov 22, 2022
Merged

[Feature] Support user's app quota level limit#311
roryqi merged 13 commits into
apache:masterfrom
smallzhongfeng:app-quota

Conversation

@smallzhongfeng

@smallzhongfeng smallzhongfeng commented Nov 11, 2022

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

For issue #211 and the design document https://docs.google.com/document/d/1MApSMFQgoS1VAoKbZjomqSRm0iTbSuKG1yvKNlWW65c/edit?usp=sharing

Why are the changes needed?

Better isolation of resources between different users.

Does this PR introduce any user-facing change?

Add config rss.coordinator.quota.default.app.num to set default app number each user and rss.coordinator.quota.default.path to set a path to record the number of apps that each user can run.

How was this patch tested?

Add uts.

@roryqi

roryqi commented Nov 11, 2022

Copy link
Copy Markdown
Contributor

Comment thread proto/src/main/proto/Rss.proto Outdated

message AppHeartBeatRequest {
string appId = 1;
string user = 2;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do the heartbeat need the user?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we register the shuffle, we put the user information to the shuffle server. Is it enough?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it isn't enough. Though user information to the shuffle server, we have to send server heartbeat to coordinator.So we use this app heartBeat to refresh each user' s app number which stored in coordinator.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For coordinator, could we pass the user information to the coordinator?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the heartbeat of the driver is used to send to the coordinator, the implementation is simple. If the heartbeat of the shuffleServer is used, you need to add a collection attribute in the heartbeat request of the shuffleServer, record the user and the corresponding app list, and then summarize them in the coordinator. Right ? And I haven't found the advantages of doing this for the time being. At present, this pr and our current production environment deployment have enough restrictions.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't get your point. I feel it's unnecessary and strange that we put the user information to the heartbeat. We have register the user information to shuffle server. Why do we need heartbeat? We hope we can reduce the repeated information.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got your point. The coordinator needs to rely on the user to update the life cycle of the app, so the user is added to the heartbeat of the app. As for the repetitive problem you mentioned, do you have a better solution?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the heartbeat of the driver is used to send to the coordinator, the implementation is simple. If the heartbeat of the shuffleServer is used, you need to add a collection attribute in the heartbeat request of the shuffleServer, record the user and the corresponding app list, and then summarize them in the coordinator. Right ? And I haven't found the advantages of doing this for the time being. At present, this pr and our current production environment deployment have enough restrictions.

What's your restrictions?

@smallzhongfeng smallzhongfeng Nov 12, 2022

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At present,restrictions is the number of apps that each user can submit.

@codecov-commenter

codecov-commenter commented Nov 11, 2022

Copy link
Copy Markdown

Codecov Report

Merging #311 (25b4b78) into master (79d2f54) will decrease coverage by 2.93%.
The diff coverage is 70.87%.

@@             Coverage Diff              @@
##             master     #311      +/-   ##
============================================
- Coverage     61.32%   58.39%   -2.94%     
- Complexity     1526     1552      +26     
============================================
  Files           186      193       +7     
  Lines          9441    10761    +1320     
  Branches        924      937      +13     
============================================
+ Hits           5790     6284     +494     
- Misses         3341     4103     +762     
- Partials        310      374      +64     
Impacted Files Coverage Δ
...apache/hadoop/mapreduce/v2/app/RssMRAppMaster.java 0.00% <0.00%> (ø)
...he/uniffle/client/impl/ShuffleWriteClientImpl.java 28.38% <0.00%> (-1.76%) ⬇️
...he/uniffle/coordinator/CoordinatorGrpcService.java 2.09% <0.00%> (-0.21%) ⬇️
.../org/apache/uniffle/coordinator/AccessManager.java 94.28% <75.00%> (-5.72%) ⬇️
...a/org/apache/uniffle/coordinator/QuotaManager.java 88.63% <88.63%> (ø)
...apache/uniffle/coordinator/ApplicationManager.java 84.93% <91.17%> (+1.13%) ⬆️
...apache/uniffle/coordinator/AccessQuotaChecker.java 95.83% <95.83%> (ø)
.../java/org/apache/uniffle/common/util/RssUtils.java 63.35% <100.00%> (+0.28%) ⬆️
.../apache/uniffle/coordinator/AccessCheckResult.java 92.30% <100.00%> (+8.97%) ⬆️
...ava/org/apache/uniffle/coordinator/AccessInfo.java 92.30% <100.00%> (+1.39%) ⬆️
... and 13 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@smallzhongfeng

Copy link
Copy Markdown
Contributor Author

Could you modify the description to attach your design document? Do the shuffle server have quota concept? cc @Gustfh

For now, shuffle server don't have quota concept.

smallzhongfeng added 2 commits November 12, 2022 16:23
public <K, V, C> ShuffleHandle registerShuffle(int shuffleId, int numMaps, ShuffleDependency<K, V, C> dependency) {
// If yarn enable retry ApplicationMaster, appId will be not unique and shuffle data will be incorrect,
// appId + timestamp can avoid such problem,
// appId + uuid can avoid such problem,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need uuid?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't get the appId when we try Access, because the appId is generated after the RssManager is created. In order to support push down, we maintain the uuid as a substitute for the appId, and replace the uuid with the appId after the app heartbeat is reported to the coordinator.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can put the appId to the AcessInfo when we try to access coordinator.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean to generate the uuid on the driver?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My fault. We ignore that I can't get the appId in the construct method. Let me think twice.

@roryqi

roryqi commented Nov 12, 2022

Copy link
Copy Markdown
Contributor

If we have shuffle server quota, shuffle server will report the some metrics to the coordinator. The metrics will contain user quota info. Maybe we can reuse the information. @Gustfh WDTY?

@roryqi

roryqi commented Nov 12, 2022

Copy link
Copy Markdown
Contributor

Maybe we could add a rpc method called registerApplicationInfo for Coordinator.

@smallzhongfeng

smallzhongfeng commented Nov 12, 2022

Copy link
Copy Markdown
Contributor Author

Maybe we could add a rpc method called registerApplicationInfo for Coordinator.

You mean to add a request different from RssAppHeartbeat, right?

@roryqi

roryqi commented Nov 13, 2022

Copy link
Copy Markdown
Contributor

Maybe we could add a rpc method called registerApplicationInfo for Coordinator.

You mean to add a request different from RssAppHeartbeat, right?

Yes.

@smallzhongfeng

Copy link
Copy Markdown
Contributor Author

Maybe we could add a rpc method called registerApplicationInfo for Coordinator.

PTAL. @jerqi

public void appHeartbeat(
AppHeartBeatRequest request,
StreamObserver<AppHeartBeatResponse> responseObserver) {
public void registerApplicationInfo(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we rename appHeartbeat to registerApplicationInfo? I mean that we can add a method registerApplicationInfo.

@smallzhongfeng smallzhongfeng Nov 18, 2022

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could add a rpc method called registerApplicationInfo for Coordinator.

Didn't it say to add a new rpc method? If not, the shuffle server still get appHeartbeat contained the user's attribute

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CoordinatorCrpcService should have heartbeat and registerApplicationInfo methods at the same time.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean to keep sendAppHeartbeat, and use registerApplicationInfo to send a request to the coordinator only once for recording. And still use sendAppHeartbeat later, right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean to keep sendAppHeartbeat, and use registerApplicationInfo to send a request to the coordinator only once for recording. And still use sendAppHeartbeat later, right?

Yes. we use the registerApplicationInfo to coordinator and then send heartbeat to coordinator.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return storageHost;
}

public void detectUserResource() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the application manager involve the quota logic?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we can configure rss.coordinator.quota.default.path to controls whether to execute the logic of detectUserResource. If not configured, the logic will not be executed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a quota manager? Or will application manager manage both quota and application?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe quota manager will be better.

Sets.newHashSet(request.getTagsList()),
request.getExtraPropertiesMap()
request.getExtraPropertiesMap(),
request.getUser()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a compatible feature? What will happen if a old client request a new coordinator service?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in 36fb720

Comment thread client-spark/spark3/pom.xml Outdated
<artifactId>spark-core_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
<exclusions>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to remove this.

long heartbeatInterval = conf.getLong(RssMRConfig.RSS_HEARTBEAT_INTERVAL,
RssMRConfig.RSS_HEARTBEAT_INTERVAL_DEFAULT_VALUE);
long heartbeatTimeout = conf.getLong(RssMRConfig.RSS_HEARTBEAT_TIMEOUT, heartbeatInterval / 2);
client.registerApplicationInfo(appId, heartbeatTimeout, "user");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we remove this if mr don't need to register application info?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this because applicationManager remove expired app need user. When our Spark does not use the AppQuota checker, it also needs register application info, so it will not have much impact here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that this isn't compatible feature. If we use old client to connect new service, something wrong will happen.

@smallzhongfeng smallzhongfeng Nov 21, 2022

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because the old client does not have user information or method of registerApplicationInfo, app can only be updated at refreshApp, which I am compatible with.

}

@Override
public RssApplicationInfoResponse sendApplicationInfo(RssApplicationInfoRequest request) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sendApplicationInfo -> registerApplicationInfo?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@roryqi roryqi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, wait for CI

@roryqi roryqi changed the title [Feature] User's app quota [Feature] Support user's app quota limit Nov 21, 2022
@roryqi roryqi changed the title [Feature] Support user's app quota limit [Feature] Support user's app quota level limit Nov 21, 2022
@smallzhongfeng

Copy link
Copy Markdown
Contributor Author

Because in the discussion of the last meeting, the shuffle server also needs to be limited, but for the number of apps, the number of apps on a sever may not directly determine the load, so I have not developed this place. WDYT? @jerqi

@roryqi

roryqi commented Nov 21, 2022

Copy link
Copy Markdown
Contributor

Because in the discussion of the last meeting, the shuffle server also needs to be limited, but for the number of apps, the number of apps on a sever may not directly determine the load, so I have not developed this place. WDYT? @jerqi

It's ok for me. Wait for a moment. Let me see @Gustfh whether have another suggestion. If he don't reply us, I'll merge this pr tomorrow.

@roryqi

roryqi commented Nov 22, 2022

Copy link
Copy Markdown
Contributor

merged. Thanks @smallzhongfeng

@roryqi roryqi merged commit 8ff41a5 into apache:master Nov 22, 2022
@xianjingfeng

Copy link
Copy Markdown
Member

@jerqi @smallzhongfeng I think this feature should be an option. WDYT?

@roryqi

roryqi commented Nov 23, 2022

Copy link
Copy Markdown
Contributor

@xianjingfeng

Copy link
Copy Markdown
Member

Yes, this is an option now. You can see

But QuotaManager always be created, and it will create a thread pool.

@roryqi

roryqi commented Nov 23, 2022

Copy link
Copy Markdown
Contributor

Yes, this is an option now. You can see

But QuotaManager always be created, and it will create a thread pool.

You can raise a pr to fix this issue.

@smallzhongfeng

Copy link
Copy Markdown
Contributor Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants