feat: Add micro-benchmarks for reads comparing standard (regional) vs rapid (zonal) buckets.#1697
Conversation
There was a problem hiding this comment.
Code Review
This pull request adds a collection of benchmarking and testing scripts. My review focuses on improving code quality, fixing bugs, and increasing efficiency. Key findings include a critical bug in a time calculation, several instances of inefficient client creation within loops, and various opportunities for code cleanup such as removing unused imports and commented-out code. I've also suggested improvements for better script structure and more informative benchmark reporting.
I am having trouble creating individual review comments. Click here to see my feedback.
parallel_downloader.py (60)
There is a critical bug in the time calculation due to operator precedence. start_time / 10**9 is calculated before the subtraction. You need to wrap end_time - start_time in parentheses to get the correct duration.
f"\nFinished all download attempts for {num_objects} objects: took - {(end_time - start_time) / 10**9}s"
parallel_downloader.py (16)
Creating a new AsyncGrpcClient for each object download is inefficient as it involves setting up a new connection each time. The client should be created once per worker process and reused for all downloads within that process. Consider creating the client in download_worker and passing it to this function.
threaded_downloader.py (14)
Creating a new AsyncGrpcClient for each download is inefficient. Since each download_worker runs in its own thread and creates its own event loop, the client should be created once per thread in download_worker and passed to download_object_async.
tests/perf/microbenchmarks/test_reads.py (66)
The output_buffer is initialized here and then immediately re-initialized on line 69. This first initialization is redundant and can be removed.
parallel_uploader.py (17)
Creating a new AsyncGrpcClient for each object upload is inefficient. This creates a new connection for every task. The client should be instantiated once per worker process in upload_worker and passed to this function to be reused.
parallel_downloader.py (23)
The object size is hardcoded here. It would be better to define it as a constant at the top of the file (e.g., OBJECT_SIZE = 100 * 1024 * 1024) for better readability and maintainability.
parallel_uploader.py (31)
The bucket name is hardcoded. It's better to define this as a constant at the top of the file for easier configuration.
async_tasks_downloader.py (65)
There is a typo in FINSHED. It should be FINISHED.
f"FINISHED: total bytes downloaded - {num_objects*OBJECT_SIZE} in time {(end_time - start_time) / (10**9)}s"
move_blob_test.py (3-9)
The script's logic is executed at the module level. It's a best practice to encapsulate the main logic within a main() function and call it under an if __name__ == "__main__": guard. This prevents the code from running when the module is imported elsewhere and improves reusability and testability.
def main():
gcs = storage.Client()
bucket = gcs.bucket("chandrasiri-us-west1-hns-soft-del")
# print(bucket.name)
blob = bucket.blob("test/blob.csv")
blob.upload_from_string("")
print("Uploaded blob:", blob.name)
bucket.move_blob(blob, new_name="test/blob2.csv")
if __name__ == "__main__":
main()
move_blob_test.py (4)
The bucket name is hardcoded. For better portability and reusability, consider defining it as a constant at the top of the file or passing it as a command-line argument.
parallel_downloader.py (3)
The os module is imported but never used. It should be removed.
async_tasks_downloader.py (4)
The ThreadPoolExecutor is imported but not used in this file. It should be removed to keep the code clean.
async_tasks_downloade_mp.py (4)
The ThreadPoolExecutor is imported but never used in this file. It's good practice to remove unused imports to keep the code clean.
parallel_downloader.py (34)
The bucket name is hardcoded. Consider defining this as a constant at the top of the file to make it easier to change.
async_tasks_downloade_mp.py (83-84)
There's a typo in 'Throuput'. It should be 'Throughput'.
Additionally, the throughput calculation uses 10**-6 to convert from bytes to megabytes, which assumes 1 MB = 1,000,000 bytes. However, OBJECT_SIZE is defined using binary prefixes (1024*1024). For consistency, you should use (1024*1024) for the conversion to MiB/s.
f"Throughput: {num_object*OBJECT_SIZE /((end_time_proc - start_time_proc) / (10**9))/(1024*1024)} MiB/s"
async_tasks_downloade_mp.py (76)
The results variable is assigned but its value is never used. It should be removed to avoid confusion.
pool.starmap(async_runner, args)
async_tasks_downloader.py (36)
This docstring is empty. Please add a description of what the function does.
"""Creates and runs asyncio tasks to download a range of objects."""
random_number.py (12-20)
The script's logic is executed at the module level. It's a best practice to encapsulate this logic within a main() function and call it under an if __name__ == "__main__": guard. This prevents the code from running when the module is imported elsewhere.
def main():
# Generate 1000 unique IDs
# A set is the easiest way to guarantee uniqueness in the batch.
request_ids = set()
while len(request_ids) < 1000:
request_ids.add(generate_random_64bit_int())
# You can convert it to a list if needed
id_list = list(request_ids)
print(f"Generated {len(id_list)} unique 64-bit IDs.")
print("First 5 IDs:", id_list[:5])
if __name__ == "__main__":
main()
tests/perf/microbenchmarks/test_reads.py (16-21)
These modules (math, google.api_core.exceptions, google.cloud.storage.blob.Blob) are imported but not used in the file. They should be removed to keep the code clean.
tests/perf/microbenchmarks/test_reads.py (42)
The comment indicates the object size is 1 GiB, but 100 * (1024 ** 2) is 100 MiB. The comment should be corrected to avoid confusion.
OBJECT_SIZE = 100 * (1024 ** 2) # 100 MiB
async_tasks_downloade_mp.py (71-73)
These lines appear to be leftover debugging code. They should be removed before merging to keep the codebase clean.
tests/perf/microbenchmarks/test_reads.py (102-109)
This docstring appears to be a list of implementation notes rather than a description of the function's purpose. It should be updated to be a proper docstring that explains what my_setup does, its parameters, and what it returns.
tests/perf/microbenchmarks/test_reads.py (210)
The benchmark summary prints 'N/A' for standard deviation. pytest-benchmark calculates this value and it's available in benchmark.stats['stddev']. It would be more informative to include this statistic in the report. Note that the other metrics are in MiB/s, so you may need to convert the time-based standard deviation or adjust the table header.
tests/perf/microbenchmarks/test_reads.py (219-234)
This large block of commented-out code should be removed to improve readability and reduce clutter.
async_tasks_downloade_mp.py (38)
This docstring is empty. Please add a meaningful description of what the download_objects_pool function does, its parameters, and what it returns, or remove the docstring if it's not needed.
"""Downloads a pool of objects asynchronously within a single process."""
threaded_downloader.py (32)
The bucket name is hardcoded. It's better to define this as a constant at the top of the file for easier configuration and maintenance.
…e analysis - Adjusted file size and rounds in config.yaml for better benchmarking accuracy. - Enhanced download object tests in test_reads.py to include random chunk downloading. - Refactored upload and download worker functions for improved clarity and performance.
…ersisted_size_sync to accept command-line arguments
…ve CRC32 performance measurement
…ads , writes folder
There was a problem hiding this comment.
Code Review
This pull request introduces a comprehensive micro-benchmarking suite to compare read performance between regional (standard storage, JSON API) and zonal (rapid storage, gRPC API) buckets. The implementation is well-structured, effectively using pytest parameterization, multiprocessing, and asyncio to cover various scenarios. The inclusion of a resource monitor and a script to convert results to CSV are excellent additions. I have a few minor suggestions to enhance code clarity and maintainability.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ameter calculation
PR created by the Librarian CLI to initialize a release. Merging this PR will auto trigger a release. Librarian Version: v1.0.2-0.20251119154421-36c3e21ad3ac Language Image: us-central1-docker.pkg.dev/cloud-sdk-librarian-prod/images-prod/python-librarian-generator@sha256:8e2c32496077054105bd06c54a59d6a6694287bc053588e24debe6da6920ad91 <details><summary>google-cloud-storage: 3.9.0</summary> ## [3.9.0](v3.8.0...v3.9.0) (2026-02-02) ### Features * update generation for MRD (#1730) ([08bc708](08bc7082)) * add get_object method for async grpc client (#1735) ([0e5ec29](0e5ec29b)) * Add micro-benchmarks for reads comparing standard (regional) vs rapid (zonal) buckets. (#1697) ([1917649](1917649f)) * Add support for opening via `write_handle` and fix `write_handle` type (#1715) ([2bc15fa](2bc15fa5)) * add samples for appendable objects writes and reads ([2e1a1eb](2e1a1eb5)) * add samples for appendable objects writes and reads (#1705) ([2e1a1eb](2e1a1eb5)) * add context manager to mrd (#1724) ([5ac2808](5ac2808a)) * Move Zonal Buckets features of `_experimental` (#1728) ([74c9ecc](74c9ecc5)) * add default user agent for grpc (#1726) ([7b31946](7b319469)) * expose finalized_time in blob.py applicable for GET_OBJECT in ZB (#1719) ([8e21a7f](8e21a7fe)) * expose `DELETE_OBJECT` in `AsyncGrpcClient` (#1718) ([c8dd7a0](c8dd7a0b)) * send `user_agent` to grpc channel (#1712) ([cdb2486](cdb2486b)) * integrate writes strategy and appendable object writer (#1695) ([dbd162b](dbd162b3)) * Add micro-benchmarks for writes comparing standard (regional) vs rapid (zonal) buckets. (#1707) ([dbe9d8b](dbe9d8b8)) * add support for `generation=0` to avoid overwriting existing objects and add `is_stream_open` support (#1709) ([ea0f5bf](ea0f5bf8)) * add support for `generation=0` to prevent overwriting existing objects ([ea0f5bf](ea0f5bf8)) * add `is_stream_open` property to AsyncAppendableObjectWriter for stream status check ([ea0f5bf](ea0f5bf8)) ### Bug Fixes * receive eof while closing reads stream (#1733) ([2ef6339](2ef63396)) * update write handle on every recv() (#1716) ([5d9fafe](5d9fafe1)) * implement requests_done method to signal end of requests in async streams. Gracefully close streams. (#1700) ([6c16079](6c160794)) * implement requests_done method to signal end of requests in async streams. Gracefully close streams. ([6c16079](6c160794)) * instance grpc client once per process in benchmarks (#1725) ([721ea2d](721ea2dd)) * Fix formatting in setup.py dependencies list (#1713) ([cc4831d](cc4831d7)) * Change contructors of MRD and AAOW AsyncGrpcClient.grpc_client to AsyncGrpcClient (#1727) ([e730bf5](e730bf50)) </details>

Uh oh!
There was an error while loading. Please reload this page.