enhance download_and_extract by Jerome-Hsieh · Pull Request #8216 · Project-MONAI/MONAI · GitHub
Skip to content

enhance download_and_extract#8216

Merged
KumoLiu merged 12 commits into
Project-MONAI:devfrom
Jerome-Hsieh:issue5463
Dec 21, 2024
Merged

enhance download_and_extract#8216
KumoLiu merged 12 commits into
Project-MONAI:devfrom
Jerome-Hsieh:issue5463

Conversation

@Jerome-Hsieh

@Jerome-Hsieh Jerome-Hsieh commented Nov 16, 2024

Copy link
Copy Markdown
Contributor

Fixes #5463

Description

According to issue, the error messages are not very intuitive.
I think maybe we can check if the file name matches the downloaded file’s base name before starting the download.
If it doesn’t match, it will notify user.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

@ericspod

Copy link
Copy Markdown
Member

Comment thread monai/apps/utils.py Outdated
Comment thread monai/apps/utils.py Outdated
Comment thread monai/apps/utils.py Outdated
Comment thread monai/apps/utils.py Outdated
Signed-off-by: jerome_Hsieh <jerome910810@gmail.com>
Signed-off-by: jerome_Hsieh <jerome910810@gmail.com>
Signed-off-by: jerome_Hsieh <jerome910810@gmail.com>
Signed-off-by: jerome_Hsieh <jerome910810@gmail.com>
@Jerome-Hsieh

Copy link
Copy Markdown
Contributor Author

@ericspod @KumoLiu
Here's newest change. like we discussed before. I check the filepath every time at the very beginning.
If no extension is provided, the system will automatically add the appropriate file extension with a warning.
A warning will also appear if provided extension doesn't match the downloaded file's extension .

@KumoLiu

KumoLiu commented Dec 5, 2024

Copy link
Copy Markdown
Contributor

@ericspod @KumoLiu Here's newest change. like we discussed before. I check the filepath every time at the very beginning. If no extension is provided, the system will automatically add the appropriate file extension with a warning. A warning will also appear if provided extension doesn't match the downloaded file's extension .

Hi @Jerome-Hsieh, thank you for the quick update. I tested your latest changes, but neither "." nor "./test" as the file path seems to work. Is this the expected behavior? In my opinion, both should work. The logic could be adjusted so that if the file path is a directory and not empty, it automatically captures the name and downloads the file into the specified directory. What do you think?

url = "https://github.com/Project-MONAI/MONAI-extra-test-data/releases/download/0.8.1/MedNIST.tar.gz"
monai.apps.utils.download_and_extract(url, filepath="./test")

@Jerome-Hsieh

Copy link
Copy Markdown
Contributor Author

Hi @KumoLiu my test results:

>>> url = "https://github.com/Project-MONAI/MONAI-extra-test-data/releases/download/0.8.1/MedNIST.tar.gz"
>>> monai.apps.utils.download_and_extract(url, filepath=".")
2024-12-06 18:06:56,333 - INFO - Expected md5 is None, skip md5 check for file ..
2024-12-06 18:06:56,334 - INFO - File exists: ., skipped downloading.
2024-12-06 18:06:56,334 - INFO - Non-empty folder exists in ., skipped extracting.

and

>>> monai.apps.utils.download_and_extract(url, filepath="./test")
2024-12-06 18:08:41,246 - WARNING - filepath=./test, which missing file extension. Auto-appending extension to: ./test.tar.gz
test.tar.gz: 59.0MB [00:10, 6.07MB/s]
2024-12-06 18:08:51,443 - INFO - Downloaded: test.tar.gz
2024-12-06 18:08:51,443 - INFO - Expected md5 is None, skip md5 check for file test.tar.gz.
2024-12-06 18:08:51,443 - INFO - Writing into directory: ..

If the results like above, that's expected behavior.

Your opinion sounds like
url=example.com/file.tar.gz, filepath=./test
the program will work :./test/file.tar.gz.
If I don't misunderstanding your thoughts, this method although can automatically captures the name, not only capture the strange filename make user confused if the URL don't have appropriate naming, but also can't allow user set the filename they want.

@mingxin-zheng

Copy link
Copy Markdown
Collaborator

Regarding file downloading from URL, sometimes we can infer the name from the content deposition. Here is one example snippet: https://github.com/Project-MONAI/VLM/blob/7be688aa457a1806f908eb758f2f3ee816fea017/m3/demo/experts/utils.py#L135 @KumoLiu

@Jerome-Hsieh

Copy link
Copy Markdown
Contributor Author

Thank @mingxin-zheng provide information to find file name.
But I want to say that at first my initial approach to this issue that filepath should be a compressed file rather than a directory.
I wanted to focus on making filepath valid. Method provided by @KumoLiu it's great, but which somewhat deviates from my original intent.
Sorry for not providing a clear explanation.

Signed-off-by: jerome_Hsieh <jerome910810@gmail.com>
@Jerome-Hsieh

Copy link
Copy Markdown
Contributor Author

Hi @KumoLiu the newest commit I change the logic that if the file path is a directory and not empty, it will automatically capture the name from content deposition, if it doses't have any file name, will use url basename and download the file into the specified directory then gives a warning.
Besides, I also add if set the wrong file extension, it will raise the error and stop downloading.
Please let me know if you have any suggestions. Thx!

@KumoLiu

KumoLiu commented Dec 16, 2024

Copy link
Copy Markdown
Contributor

Hi @KumoLiu the newest commit I change the logic that if the file path is a directory and not empty, it will automatically capture the name from content deposition, if it doses't have any file name, will use url basename and download the file into the specified directory then gives a warning. Besides, I also add if set the wrong file extension, it will raise the error and stop downloading. Please let me know if you have any suggestions. Thx!

Hi @Jerome-Hsieh, thank you for the update. After our discussion in the development meeting, I realized that there is an output_dir argument in download_and_extract. I apologize for any confusion my previous comments may have caused. It would be great to proceed by considering only the extension as before. Sorry again for the misunderstanding.

Comment thread monai/apps/utils.py Outdated
@Jerome-Hsieh

Jerome-Hsieh commented Dec 16, 2024

Copy link
Copy Markdown
Contributor Author

Hi @KumoLiu the newest commit I change the logic that if the file path is a directory and not empty, it will automatically capture the name from content deposition, if it doses't have any file name, will use url basename and download the file into the specified directory then gives a warning. Besides, I also add if set the wrong file extension, it will raise the error and stop downloading. Please let me know if you have any suggestions. Thx!

Hi @Jerome-Hsieh, thank you for the update. After our discussion in the development meeting, I realized that there is an output_dir argument in download_and_extract. I apologize for any confusion my previous comments may have caused. It would be great to proceed by considering only the extension as before. Sorry again for the misunderstanding.

Hi @KumoLiu I see, and what do you think about previous version a9a0171 I commit ?
Could you please identify any areas that require improvement or could be further enhance ? Thx!

@KumoLiu

KumoLiu commented Dec 16, 2024

Copy link
Copy Markdown
Contributor

Hi @KumoLiu I see, and what do you think about previous version a9a0171 I commit ? Could you please identify any areas that require improvement or could be further enhance ? Thx!

Hi @Jerome-Hsieh, a9a0171 looks good. The only update needed is how to extract the file extension from the URL.

For example, given a URL like "https://drive.google.com/u/1/uc?id=1KntZge40tAHgyXmHYVqZZ5d2p_4Qr2l5&export=download", you can use get_filename_from_url to determine the exact extension based on the filename. What do you think? Please also include the tests, thanks!

@KumoLiu KumoLiu closed this Dec 16, 2024
@KumoLiu KumoLiu reopened this Dec 16, 2024
Signed-off-by: jerome_Hsieh <jerome910810@gmail.com>
@Jerome-Hsieh

Copy link
Copy Markdown
Contributor Author

Hi @Jerome-Hsieh, a9a0171 looks good. The only update needed is how to extract the file extension from the URL.

For example, given a URL like "https://drive.google.com/u/1/uc?id=1KntZge40tAHgyXmHYVqZZ5d2p_4Qr2l5&export=download", you can use get_filename_from_url to determine the exact extension based on the filename. What do you think? Please also include the tests, thanks!

@KumoLiu the URL you provide can download but can't extract, I extract manually still fail.
And I found that download large file from google drive there's no content deposition to get filename, so I use web parsing to find the filename, when download small file just use the same way to get filename.
Here's my test URL "https://drive.google.com/uc?id=18u7FlNtqn5K1H308bGZwp6YNAKxpJz8-"

Signed-off-by: jerome_Hsieh <jerome910810@gmail.com>
Signed-off-by: jerome_Hsieh <jerome910810@gmail.com>
Signed-off-by: jerome_Hsieh <jerome910810@gmail.com>
Comment thread monai/apps/utils.py
Signed-off-by: jerome_Hsieh <jerome910810@gmail.com>
@Jerome-Hsieh

Copy link
Copy Markdown
Contributor Author

@KumoLiu Thanks a lot for your assistance!It looks more cleaner than before!

@KumoLiu

KumoLiu commented Dec 19, 2024

Copy link
Copy Markdown
Contributor

/build

@KumoLiu KumoLiu requested a review from ericspod December 19, 2024 10:01
@KumoLiu

KumoLiu commented Dec 19, 2024

Copy link
Copy Markdown
Contributor

@KumoLiu KumoLiu enabled auto-merge (squash) December 21, 2024 14:18
@KumoLiu KumoLiu merged commit efff647 into Project-MONAI:dev Dec 21, 2024
@Jerome-Hsieh Jerome-Hsieh deleted the issue5463 branch December 21, 2024 14:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

parameter filepath of download_and_extract

4 participants