feat: Add micro-benchmarks for reads comparing standard (regional) vs rapid (zonal) buckets. by chandra-siri · Pull Request #1697 · googleapis/python-storage · GitHub
Skip to content
This repository was archived by the owner on Mar 31, 2026. It is now read-only.

feat: Add micro-benchmarks for reads comparing standard (regional) vs rapid (zonal) buckets.#1697

Merged
chandra-siri merged 44 commits into
mainfrom
bench
Jan 21, 2026
Merged

feat: Add micro-benchmarks for reads comparing standard (regional) vs rapid (zonal) buckets.#1697
chandra-siri merged 44 commits into
mainfrom
bench

Conversation

@chandra-siri

@chandra-siri chandra-siri commented Dec 27, 2025

Copy link
Copy Markdown
Collaborator
  • Add performance microbenchmarking suite for Sequential and Random Reads
  • This compares Regional (Standard Storage) with Zonal (Rapid Storage) buckets.
  • Regional uses JSON wheres Zonal uses gRPC bidi

@product-auto-label product-auto-label Bot added size: l Pull request size is large. api: storage Issues related to the googleapis/python-storage API. labels Dec 27, 2025
@gemini-code-assist

Copy link
Copy Markdown
Contributor

@product-auto-label product-auto-label Bot added size: xl Pull request size is extra large. and removed size: l Pull request size is large. labels Dec 27, 2025

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds a collection of benchmarking and testing scripts. My review focuses on improving code quality, fixing bugs, and increasing efficiency. Key findings include a critical bug in a time calculation, several instances of inefficient client creation within loops, and various opportunities for code cleanup such as removing unused imports and commented-out code. I've also suggested improvements for better script structure and more informative benchmark reporting.

I am having trouble creating individual review comments. Click here to see my feedback.

parallel_downloader.py (60)

critical

There is a critical bug in the time calculation due to operator precedence. start_time / 10**9 is calculated before the subtraction. You need to wrap end_time - start_time in parentheses to get the correct duration.

        f"\nFinished all download attempts for {num_objects} objects: took - {(end_time - start_time) / 10**9}s"

parallel_downloader.py (16)

high

Creating a new AsyncGrpcClient for each object download is inefficient as it involves setting up a new connection each time. The client should be created once per worker process and reused for all downloads within that process. Consider creating the client in download_worker and passing it to this function.

threaded_downloader.py (14)

high

Creating a new AsyncGrpcClient for each download is inefficient. Since each download_worker runs in its own thread and creates its own event loop, the client should be created once per thread in download_worker and passed to download_object_async.

tests/perf/microbenchmarks/test_reads.py (66)

high

The output_buffer is initialized here and then immediately re-initialized on line 69. This first initialization is redundant and can be removed.

parallel_uploader.py (17)

high

Creating a new AsyncGrpcClient for each object upload is inefficient. This creates a new connection for every task. The client should be instantiated once per worker process in upload_worker and passed to this function to be reused.

parallel_downloader.py (23)

medium

The object size is hardcoded here. It would be better to define it as a constant at the top of the file (e.g., OBJECT_SIZE = 100 * 1024 * 1024) for better readability and maintainability.

parallel_uploader.py (31)

medium

The bucket name is hardcoded. It's better to define this as a constant at the top of the file for easier configuration.

async_tasks_downloader.py (65)

medium

There is a typo in FINSHED. It should be FINISHED.

        f"FINISHED: total bytes downloaded - {num_objects*OBJECT_SIZE} in time {(end_time - start_time) / (10**9)}s"

move_blob_test.py (3-9)

medium

The script's logic is executed at the module level. It's a best practice to encapsulate the main logic within a main() function and call it under an if __name__ == "__main__": guard. This prevents the code from running when the module is imported elsewhere and improves reusability and testability.

def main():
    gcs = storage.Client()
    bucket = gcs.bucket("chandrasiri-us-west1-hns-soft-del")
    # print(bucket.name)
    blob = bucket.blob("test/blob.csv")
    blob.upload_from_string("")
    print("Uploaded blob:", blob.name)
    bucket.move_blob(blob, new_name="test/blob2.csv")

if __name__ == "__main__":
    main()

move_blob_test.py (4)

medium

The bucket name is hardcoded. For better portability and reusability, consider defining it as a constant at the top of the file or passing it as a command-line argument.

parallel_downloader.py (3)

medium

The os module is imported but never used. It should be removed.

async_tasks_downloader.py (4)

medium

The ThreadPoolExecutor is imported but not used in this file. It should be removed to keep the code clean.

async_tasks_downloade_mp.py (4)

medium

The ThreadPoolExecutor is imported but never used in this file. It's good practice to remove unused imports to keep the code clean.

parallel_downloader.py (34)

medium

The bucket name is hardcoded. Consider defining this as a constant at the top of the file to make it easier to change.

async_tasks_downloade_mp.py (83-84)

medium

There's a typo in 'Throuput'. It should be 'Throughput'.

Additionally, the throughput calculation uses 10**-6 to convert from bytes to megabytes, which assumes 1 MB = 1,000,000 bytes. However, OBJECT_SIZE is defined using binary prefixes (1024*1024). For consistency, you should use (1024*1024) for the conversion to MiB/s.

        f"Throughput: {num_object*OBJECT_SIZE /((end_time_proc - start_time_proc) / (10**9))/(1024*1024)} MiB/s"

async_tasks_downloade_mp.py (76)

medium

The results variable is assigned but its value is never used. It should be removed to avoid confusion.

        pool.starmap(async_runner, args)

async_tasks_downloader.py (36)

medium

This docstring is empty. Please add a description of what the function does.

    """Creates and runs asyncio tasks to download a range of objects."""

random_number.py (12-20)

medium

The script's logic is executed at the module level. It's a best practice to encapsulate this logic within a main() function and call it under an if __name__ == "__main__": guard. This prevents the code from running when the module is imported elsewhere.

def main():
    # Generate 1000 unique IDs
    # A set is the easiest way to guarantee uniqueness in the batch.
    request_ids = set()
    while len(request_ids) < 1000:
        request_ids.add(generate_random_64bit_int())

    # You can convert it to a list if needed
    id_list = list(request_ids)

    print(f"Generated {len(id_list)} unique 64-bit IDs.")
    print("First 5 IDs:", id_list[:5])

if __name__ == "__main__":
    main()

tests/perf/microbenchmarks/test_reads.py (16-21)

medium

These modules (math, google.api_core.exceptions, google.cloud.storage.blob.Blob) are imported but not used in the file. They should be removed to keep the code clean.

tests/perf/microbenchmarks/test_reads.py (42)

medium

The comment indicates the object size is 1 GiB, but 100 * (1024 ** 2) is 100 MiB. The comment should be corrected to avoid confusion.

OBJECT_SIZE = 100 * (1024 ** 2)  # 100 MiB

async_tasks_downloade_mp.py (71-73)

medium

These lines appear to be leftover debugging code. They should be removed before merging to keep the codebase clean.

tests/perf/microbenchmarks/test_reads.py (102-109)

medium

This docstring appears to be a list of implementation notes rather than a description of the function's purpose. It should be updated to be a proper docstring that explains what my_setup does, its parameters, and what it returns.

tests/perf/microbenchmarks/test_reads.py (210)

medium

The benchmark summary prints 'N/A' for standard deviation. pytest-benchmark calculates this value and it's available in benchmark.stats['stddev']. It would be more informative to include this statistic in the report. Note that the other metrics are in MiB/s, so you may need to convert the time-based standard deviation or adjust the table header.

tests/perf/microbenchmarks/test_reads.py (219-234)

medium

This large block of commented-out code should be removed to improve readability and reduce clutter.

async_tasks_downloade_mp.py (38)

medium

This docstring is empty. Please add a meaningful description of what the download_objects_pool function does, its parameters, and what it returns, or remove the docstring if it's not needed.

    """Downloads a pool of objects asynchronously within a single process."""

threaded_downloader.py (32)

medium

The bucket name is hardcoded. It's better to define this as a constant at the top of the file for easier configuration and maintenance.

@chandra-siri chandra-siri changed the title Bench feat: Add Micro Benchmarks for reads and writes comparing standard (regional) vs rapid (zonal) buckets. Jan 11, 2026
@chandra-siri chandra-siri changed the title feat: Add Micro Benchmarks for reads and writes comparing standard (regional) vs rapid (zonal) buckets. feat: Add micro-benchmarks for reads and writes comparing standard (regional) vs rapid (zonal) buckets. Jan 11, 2026
@chandra-siri chandra-siri changed the title feat: Add micro-benchmarks for reads and writes comparing standard (regional) vs rapid (zonal) buckets. feat: Add micro-benchmarks for reads comparing standard (regional) vs rapid (zonal) buckets. Jan 15, 2026
@chandra-siri

Copy link
Copy Markdown
Collaborator Author

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive micro-benchmarking suite to compare read performance between regional (standard storage, JSON API) and zonal (rapid storage, gRPC API) buckets. The implementation is well-structured, effectively using pytest parameterization, multiprocessing, and asyncio to cover various scenarios. The inclusion of a resource monitor and a script to convert results to CSV are excellent additions. I have a few minor suggestions to enhance code clarity and maintainability.

Comment thread tests/perf/microbenchmarks/_utils.py
Comment thread tests/perf/microbenchmarks/_utils.py Outdated
Comment thread tests/perf/microbenchmarks/conftest.py Outdated
Comment thread tests/perf/microbenchmarks/reads/test_reads.py Outdated
Comment thread tests/perf/microbenchmarks/reads/test_reads.py Outdated
Comment thread tests/perf/microbenchmarks/reads/test_reads.py
jasha26
jasha26 previously approved these changes Jan 21, 2026
Comment thread tests/perf/microbenchmarks/__init__.py
Comment thread tests/perf/microbenchmarks/conftest.py
@chandra-siri chandra-siri enabled auto-merge (squash) January 21, 2026 11:06
@chandra-siri chandra-siri merged commit 1917649 into main Jan 21, 2026
20 of 21 checks passed
@chandra-siri chandra-siri deleted the bench branch January 21, 2026 11:21
chandra-siri added a commit that referenced this pull request Feb 2, 2026
PR created by the Librarian CLI to initialize a release. Merging this PR
will auto trigger a release.

Librarian Version: v1.0.2-0.20251119154421-36c3e21ad3ac
Language Image:
us-central1-docker.pkg.dev/cloud-sdk-librarian-prod/images-prod/python-librarian-generator@sha256:8e2c32496077054105bd06c54a59d6a6694287bc053588e24debe6da6920ad91
<details><summary>google-cloud-storage: 3.9.0</summary>

##
[3.9.0](v3.8.0...v3.9.0)
(2026-02-02)

### Features

* update generation for MRD (#1730)
([08bc708](08bc7082))

* add get_object method for async grpc client (#1735)
([0e5ec29](0e5ec29b))

* Add micro-benchmarks for reads comparing standard (regional) vs rapid
(zonal) buckets. (#1697)
([1917649](1917649f))

* Add support for opening via `write_handle` and fix `write_handle` type
(#1715)
([2bc15fa](2bc15fa5))

* add samples for appendable objects writes and reads
([2e1a1eb](2e1a1eb5))

* add samples for appendable objects writes and reads (#1705)
([2e1a1eb](2e1a1eb5))

* add context manager to mrd (#1724)
([5ac2808](5ac2808a))

* Move Zonal Buckets features of `_experimental` (#1728)
([74c9ecc](74c9ecc5))

* add default user agent for grpc (#1726)
([7b31946](7b319469))

* expose finalized_time in blob.py applicable for GET_OBJECT in ZB
(#1719)
([8e21a7f](8e21a7fe))

* expose `DELETE_OBJECT` in `AsyncGrpcClient` (#1718)
([c8dd7a0](c8dd7a0b))

* send `user_agent` to grpc channel (#1712)
([cdb2486](cdb2486b))

* integrate writes strategy and appendable object writer (#1695)
([dbd162b](dbd162b3))

* Add micro-benchmarks for writes comparing standard (regional) vs rapid
(zonal) buckets. (#1707)
([dbe9d8b](dbe9d8b8))

* add support for `generation=0` to avoid overwriting existing objects
and add `is_stream_open` support (#1709)
([ea0f5bf](ea0f5bf8))

* add support for `generation=0` to prevent overwriting existing objects
([ea0f5bf](ea0f5bf8))

* add `is_stream_open` property to AsyncAppendableObjectWriter for
stream status check
([ea0f5bf](ea0f5bf8))

### Bug Fixes

* receive eof while closing reads stream (#1733)
([2ef6339](2ef63396))

* update write handle on every recv() (#1716)
([5d9fafe](5d9fafe1))

* implement requests_done method to signal end of requests in async
streams. Gracefully close streams. (#1700)
([6c16079](6c160794))

* implement requests_done method to signal end of requests in async
streams. Gracefully close streams.
([6c16079](6c160794))

* instance grpc client once per process in benchmarks (#1725)
([721ea2d](721ea2dd))

* Fix formatting in setup.py dependencies list (#1713)
([cc4831d](cc4831d7))

* Change contructors of MRD and AAOW AsyncGrpcClient.grpc_client to
AsyncGrpcClient (#1727)
([e730bf5](e730bf50))

</details>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

api: storage Issues related to the googleapis/python-storage API. size: xl Pull request size is extra large.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants