[Improvement] Add hdfs path health check to AppBalanceSelectStorageStrategy by smallzhongfeng · Pull Request #210 · apache/uniffle · GitHub
Skip to content

[Improvement] Add hdfs path health check to AppBalanceSelectStorageStrategy#210

Merged
roryqi merged 24 commits into
apache:masterfrom
smallzhongfeng:add-unhealthyPathCheck
Oct 9, 2022
Merged

[Improvement] Add hdfs path health check to AppBalanceSelectStorageStrategy#210
roryqi merged 24 commits into
apache:masterfrom
smallzhongfeng:add-unhealthyPathCheck

Conversation

@smallzhongfeng

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

The detection logic of abnormal paths is also added to the APP_BALANCE strategy.

Why are the changes needed?

Anomaly detection can be performed when selecting a remote path to avoid selecting a problematic path.

Does this PR introduce any user-facing change?

The parameters of the original IO_SAMPLE strategy are reused and the names are changed.

  1. rss.coordinator.remote.storage.select.strategy support APP_BALANCE and IO_SAMPLE, APP_BALANCE selection strategy based on the number of apps, IO_SAMPLE selection strategy based on time consumption of reading and writing files.
  2. rss.coordinator.remote.storage.schedule.time , if user choose IO_SAMPLE, file will be read and written at regular intervals.
  3. rss.coordinator.remote.storage.schedule.file.size , the size of each read / write HDFS file.
  4. rss.coordinator.remote.storage.schedule.access.times, number of times to read and write HDFS files.

How was this patch tested?

Passed uts.

@codecov-commenter

codecov-commenter commented Sep 11, 2022

Copy link
Copy Markdown

smallzhongfeng added 2 commits September 11, 2022 23:12
List<Map.Entry<String, RankValue>> sortPathByRankValue(String path, Path testPath, long time, boolean isHealthy);

Map<String, RankValue> getRemoteStoragePathRankValue();
void setFs(FileSystem fs);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would better not assume our remote storage is only FileSystem.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate more, or give me some other examples ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to support S3 , COS, Redis or some DBs, it's not suitable for us.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I can remove this method, thinking that if I want to support redis or other databases, I need to re implement the sortPathByRankValue method. It's a huge job.

@roryqi

roryqi commented Sep 14, 2022

Copy link
Copy Markdown
Contributor

Like discussion at dev mail list, we will freeze the code and cut 0.6 version branch in September 15, we will not merge this pr before I cut 0.6 version branch, are you ok?

1 similar comment
@roryqi

roryqi commented Sep 14, 2022

Copy link
Copy Markdown
Contributor

Like discussion at dev mail list, we will freeze the code and cut 0.6 version branch in September 15, we will not merge this pr before I cut 0.6 version branch, are you ok?

@smallzhongfeng

Copy link
Copy Markdown
Contributor Author

Like discussion at dev mail list, we will freeze the code and cut 0.6 version branch in September 15, we will not merge this pr before I cut 0.6 version branch, are you ok?

No problem, I got it.

Map<String, RemoteStorageInfo> getAppIdToRemoteStorageInfo();

Map<String, RankValue> getRemoteStoragePathRankValue();
List<Map.Entry<String, RankValue>> sortPathByRankValue(String path, Path testPath, long time, boolean isHealthy);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Path seems to be a concept of filesystem. It may be not general.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can support fileSystem at the beginning. If we need to support db, then there may be more things to consider, or we can add a parameter. If it is a file supported by fileSystem, we will turn on this switch.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For S3 , COS, it's OK, right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least We need consider the object Storage.

@smallzhongfeng smallzhongfeng Sep 16, 2022

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

�IIRC, object storage such as s3 also has the concept of Path.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But cos doesn't seem to have this concept.

@zuston zuston Sep 21, 2022

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I know, the object storage also can be accessed by Hadoop Filesystem Interface (HCFS), like s3/oss/cos. So this is not a problem

But if we want to involve more storages which are not implemented by HCFS or speed up specified api which is only compatible with original storage api, maybe we should implement uniffle's own filesystem, it can refer to Flink.

Please let me know if I'm wrong.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for your reminder. Although I am not very impressed with this, I have used a similar method to read files in s3.

@smallzhongfeng

Copy link
Copy Markdown
Contributor Author

For object storage, there may be more S3, but for cos, I don't know this very well. Do you have any good suggestions? @jerqi

@roryqi

roryqi commented Sep 20, 2022

Copy link
Copy Markdown
Contributor

For object storage, there may be more S3, but for cos, I don't know this very well. Do you have any good suggestions? @jerqi

Do the S3 have the concept of path? It only have the concept of bucket.

@smallzhongfeng

Copy link
Copy Markdown
Contributor Author

Could we support anomaly detection of hdfs files first?

@roryqi

roryqi commented Sep 21, 2022

Copy link
Copy Markdown
Contributor

Could we support anomaly detection of hdfs files first?

We should have more general interface. We can only implement part of them.

@zuston

zuston commented Sep 21, 2022

Copy link
Copy Markdown
Member

For object storage, there may be more S3, but for cos, I don't know this very well. Do you have any good suggestions? @jerqi

Do the S3 have the concept of path? It only have the concept of bucket.

Yes, the s3 compatible filesystem of Hadoop has implemented the abstraction of Path, but it's slower than the s3 original api when using list files/dirs. I think cos is the same with s3.

By the way, I implemented the hadoop filesystem api for internal object store in the past, so have some experience on it.

@roryqi

roryqi commented Sep 21, 2022

Copy link
Copy Markdown
Contributor

For object storage, there may be more S3, but for cos, I don't know this very well. Do you have any good suggestions? @jerqi

Do the S3 have the concept of path? It only have the concept of bucket.

Yes, the s3 compatible filesystem of Hadoop has implemented the abstraction of Path, but it's slower than the s3 original api when using list files/dirs. I think cos is the same with s3.

By the way, I implemented the hadoop filesystem api for internal object store in the past, so have some experience on it.

We shouldn't relay on the filesystem of object storage too much. Some operations of object storage is slow. For example, list operation, rename operation and append operation.

smallzhongfeng added 2 commits September 22, 2022 02:25
@smallzhongfeng

Copy link
Copy Markdown
Contributor Author

We should have more general interface. We can only implement part of them.

I think we only need to provide a path that includes the remote, and the rest of the methods can be implemented by judging different storage systems.

@roryqi

roryqi commented Sep 22, 2022

Copy link
Copy Markdown
Contributor

We should have more general interface. We can only implement part of them.

I think we only need to provide a path that includes the remote, and the rest of the methods can be implemented by judging different storage systems.

My doubt point is that the concept of path is not general one for the storage.

@smallzhongfeng

Copy link
Copy Markdown
Contributor Author

So I did not use the concept of path for the time being, but used a string to represent.

@smallzhongfeng

smallzhongfeng commented Sep 22, 2022

Copy link
Copy Markdown
Contributor Author

So I did not use the concept of path for the time being, but used a string to represent.

The concept of subsequent use of buckets or paths can be implemented through the readAndWrite interface.

@roryqi

roryqi commented Sep 22, 2022

Copy link
Copy Markdown
Contributor

So I did not use the concept of path for the time being, but used a string to represent.

The concept of subsequent use of buckets or paths can be implemented through the readAndWrite interface.

It's ok.

Map<String, RemoteStorageInfo> getAppIdToRemoteStorageInfo();

Map<String, RankValue> getRemoteStoragePathRankValue();
List<Map.Entry<String, RankValue>> readAndWrite(String path);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a little doubt, why do the readAndWrite become the only interface method for SelectStorageStrategy? I can rank by many methods, it's not necessary to read and write.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The meaning of the readAndWrite interface is to perform anomaly detection of remote paths. For hdfs, reading and writing files is a more direct way to detect, so I took this name, but the real comparison method comes from sortPathByRankValue, which I do not have. The reason for defining it as an interface is because the objects to be sorted may be different, and this is left to different storage methods for implementation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The meaning of the readAndWrite interface is to perform anomaly detection of remote paths. For hdfs, reading and writing files is a more direct way to detect, so I took this name, but the real comparison method comes from sortPathByRankValue, which I do not have. The reason for defining it as an interface is because the objects to be sorted may be different, and this is left to different storage methods for implementation.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

readAndWrite is not a good name. It's strange for the interface SelectStorageStrategy.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, updated.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a method pickStorage or selectStorage for this interface?

@smallzhongfeng smallzhongfeng Sep 23, 2022

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that might make more sense for this interface, I will add.

@smallzhongfeng

Copy link
Copy Markdown
Contributor Author

PTAL @jerqi

public interface SelectStorageStrategy {

RemoteStorageInfo pickRemoteStorage(String appId);
List<Map.Entry<String, RankValue>> detectStorage(String path);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

path -> uri?
Path is not general concept for storage. Maybe we should use uri.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, good suggestion.

Map<String, RemoteStorageInfo> getAppIdToRemoteStorageInfo();

Map<String, RankValue> getRemoteStoragePathRankValue();
List<Map.Entry<String, RankValue>> pickStorage(List<Map.Entry<String, RankValue>> pathList);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we give pathList a better name?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any good advice?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about uriRankings?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uriList or uris may be better?
Why do the method pickStorage return a list?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified it to let pickStorage return to a selected path.

Map<String, RemoteStorageInfo> getAppIdToRemoteStorageInfo();

Map<String, RankValue> getRemoteStoragePathRankValue();
String pickStorage(List<Map.Entry<String, RankValue>> uris, String appId);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we return RemoteStorageInfo?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need, because the return value of pickStorage only needs to be passed to incRemoteStorageCounter, incRemoteStorageCounter only needs String type, and String type is more general than RemoteStorageInfo.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need, because the return value of pickStorage only needs to be passed to incRemoteStorageCounter, incRemoteStorageCounter only needs String type, and String type is more general than RemoteStorageInfo.

String is too general to describe what we want to return. RemoteStorageInfo is more accurate.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, updated.

Path testPath = new Path(rssTest);
try {
FileSystem fs = HadoopFilesystemProvider.getFilesystem(remotePath, hdfsConf);
for (int j = 0; j < readAndWriteTimes; j++) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we just detect the health of storage, it seems unnecessary for us write and read them multiple times.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For appBalance, it can indeed be changed once, but for io sampling, the number of times can be increased, perhaps we can set the parameter to 1 by default.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For appBalance, it shouldn't be a configuration option.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

private final Configuration hdfsConf;
private final int fileSize;
private final int readAndWriteTimes;
private boolean remotePathIsHealthy = true;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this variable remotePathIsHealthy mean?
This class is a strategy. I can't get the meaning of this varibale.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable is used to determine whether the path is normal, and to pass the parameter into the method sortPathByRankValue.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable is used to determine whether the path is normal, and to pass the parameter into the method sortPathByRankValue.

One strategy should have multiple paths, every path is healthy, the variable is true?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that there are many paths, but this attribute is applied to each path. The healthy path is processed by sortPathByRankValue according to remotePathIsHealthy as true, and corresponding logic is processed when it is false.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can look at the logic of sortPathByRankValue. When we judge by this parameter, it is a normal path, and then deal with it accordingly.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the variable necessary to become member field?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

uris = uris.stream().filter(rv -> rv.getValue().getReadAndWriteTime().get() != Long.MAX_VALUE).sorted(
Comparator.comparingInt(entry -> entry.getValue().getAppNum().get())).collect(Collectors.toList());
} else {
// If all paths are unhealthy, assign paths according to the number of apps

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add some logs and metrics when all paths are unhealthy?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I will add.

conf.getLong(CoordinatorConf.COORDINATOR_REMOTE_STORAGE_SCHEDULE_TIME), TimeUnit.MILLISECONDS);
}

public void checkReadAndWrite() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we put this logic into the class SelectStorageStrategy?

@roryqi

roryqi commented Sep 27, 2022

Copy link
Copy Markdown
Contributor

Maybe we should have a abstract class to define the common behaviors of SelectStorageStrategy.

@smallzhongfeng

smallzhongfeng commented Oct 8, 2022

Copy link
Copy Markdown
Contributor Author

This modification has done three things

  1. An abstract class AbstractSelectStorageStrategy is added to provide a method for reading and writing hdfs paths.
  2. Added logs of the sorted storage list.
  3. Removed a global variable of isHealthy and changed it to a local variable.

@smallzhongfeng

This comment was marked as resolved.

@smallzhongfeng

Copy link
Copy Markdown
Contributor Author

PTAL @jerqi

}
}

@Override

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't implement the method detectStorage and pickStorage, we can remove them in the abstract class.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@roryqi roryqi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @smallzhongfeng , wait for CI

@roryqi roryqi merged commit c89f95c into apache:master Oct 9, 2022
@roryqi

roryqi commented Oct 9, 2022

Copy link
Copy Markdown
Contributor

@smallzhongfeng

Copy link
Copy Markdown
Contributor Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants