Add s3_upload_checksum_algorithm setting by sebver · Pull Request #107318 · ClickHouse/ClickHouse · GitHub
Skip to content

Add s3_upload_checksum_algorithm setting#107318

Open
sebver wants to merge 35 commits into
ClickHouse:masterfrom
sebver:s3-upload-checksum-latest-master
Open

Add s3_upload_checksum_algorithm setting#107318
sebver wants to merge 35 commits into
ClickHouse:masterfrom
sebver:s3-upload-checksum-latest-master

Conversation

@sebver

@sebver sebver commented Jun 12, 2026

Copy link
Copy Markdown

Closes: #106106

This adds a s3_upload_checksum_algorithm setting to force a flexible checksum algorithm (CRC32 or SHA256) on S3 upload requests. Until now ClickHouse only forced flexible checksums for S3Express buckets; for everything else it relied on Content-MD5, or nothing if s3_disable_checksum was set.

That's a problem when Content-MD5 isn't an option. For example, on a FIPS image MD5 is unavailable, so the upload goes out with no integrity header at all. AWS S3 buckets with Object Lock / default retention then reject it:

Content-MD5 OR x-amz-checksum- HTTP header is required for Put Object requests with Object Lock parameters

With the setting, ClickHouse attaches the corresponding x-amz-checksum-* header and marks the checksum as required, so these uploads go through. It applies to all upload paths: PutObject, CreateMultipartUpload, UploadPart and CompleteMultipartUpload. The empty default keeps the current behavior. For non-S3Express buckets, s3_disable_checksum takes precedence and suppresses this setting.

Implementation-wise, the S3Express-only checksum code in Requests.h was generalized into a RequestChecksum::Algorithm enum, and the request hooks (RequestChecksumRequired, ShouldComputeContentMd5, GetChecksumAlgorithmName) now check a hasRequestChecksum() helper that's true for both S3Express and an explicitly requested algorithm. S3Express buckets keep their mandatory CRC32, and invalid values are rejected during settings validation.

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Added the s3_upload_checksum_algorithm setting to choose MD5, CRC32, or SHA256 for S3 uploads. CRC32 and SHA256 are sent as AWS flexible x-amz-checksum-* headers; MD5 uses the SDK Content-MD5 path. For non-S3Express buckets, s3_disable_checksum takes precedence and suppresses this setting. The empty (default) value keeps the SDK Content-MD5, except in FIPS mode where it now sends SHA256 instead of no checksum.

@CLAassistant

CLAassistant commented Jun 12, 2026

Copy link
Copy Markdown

@sebver sebver marked this pull request as ready for review June 12, 2026 12:43
@murphy-4o murphy-4o added the can be tested Allows running workflows for external contributors label Jun 16, 2026
@clickhouse-gh

clickhouse-gh Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [b9de102]

Summary:


AI Review

Summary

This PR adds s3_upload_checksum_algorithm, threads it through the upload paths, and special-cases S3Express/GCS. The remaining gap is the new GCS bypass: it avoids the header breakage, but it also sidesteps the runtime validation path when s3_validate_request_settings = 0, so invalid or unsupported explicit values can now be silently accepted on storage.googleapis.com uploads instead of being rejected.

PR Metadata

💡 Changelog category is correct (Improvement), but the current Changelog entry no longer matches the code for GCS endpoints: the setting is ignored there and the empty FIPS case does not switch to flexible SHA256. A closer replacement would be:
Added the s3_upload_checksum_algorithm setting to choose MD5, CRC32, or SHA256 for S3 uploads. CRC32 and SHA256 use AWS flexible x-amz-checksum-* headers, while MD5 uses the SDK Content-MD5 path. s3_disable_checksum suppresses upload checksums on non-S3Express buckets. In FIPS mode, the empty value switches to SHA256 for AWS S3/S3-compatible uploads; GCS endpoints keep their previous behavior because they do not accept AWS flexible checksum headers.

Findings

⚠️ Majors

  • [src/IO/WriteBufferFromS3.cpp:512, src/IO/S3/copyS3File.cpp:98] The new GCS short-circuit returns nullopt before calling RequestChecksum::getUploadChecksumAlgorithm, which is now the runtime validation layer for s3_validate_request_settings = 0. That reopens the original validation hole for storage.googleapis.com: upload_checksum_algorithm = 'MD4' is silently ignored, and in FIPS mode even an explicit MD5 is no longer rejected on the upload path. Parse and validate the non-empty setting first, then decide whether GCS should ignore or reject flexible algorithms.

💡 Nits

  • [src/Core/Settings.cpp:713] The public setting docs still describe the FIPS -> SHA256 fallback and explicit algorithm choice as universal, but the new isClientForGCS() branch makes GCS a documented exception. Please mention that here, and mirror the same caveat in the PR changelog entry.
Final Verdict

⚠️ Needs changes before merge: the GCS workaround fixed the header incompatibility, but it reintroduced a runtime-validation bypass for this setting.

@clickhouse-gh clickhouse-gh Bot added the pr-improvement Pull request with some product improvements label Jun 16, 2026
Comment thread src/IO/WriteBufferFromS3.cpp Outdated
Comment thread src/IO/S3/copyS3File.cpp Outdated
@sebver

sebver commented Jun 22, 2026

Copy link
Copy Markdown
Author

hi @nickitat (apologies if you're not the best person to tag, I tried to look for a maintainer with s3 knowledge),

I was wondering if this change makes sense from your PoV? Or if you think something needs to be done differently?
I am still working on getting the build green, but wanted to check in with a maintainer first.

@nickitat nickitat self-assigned this Jun 22, 2026
@nickitat

Copy link
Copy Markdown
Member
  For S3, the Content-MD5 header is computed by the bundled AWS SDK, which routes through aws-c-cal. On Linux that calls OpenSSL directly:

  // contrib/aws-c-cal/source/unix/opensslcrypto_hash.c:72
  if (!g_aws_openssl_evp_md_ctx_table->init_ex_fn(ctx, EVP_md5(), NULL)) {
      s_destroy(hash);
      aws_raise_error(AWS_ERROR_UNKNOWN);
      return NULL;
  }

  What FIPS-CH actually does on a normal bucket: silently sends no checksum

  The MD5 computation happens here:

  // contrib/aws/src/aws-cpp-sdk-core/source/client/AWSClient.cpp:902
  auto md5HashResult = Aws::Utils::Crypto::CreateMD5Implementation()->Calculate(*body);
  body->clear();
  if (md5HashResult.IsSuccess())   // <-- only set on success
      httpRequest->SetHeaderValue(Http::CONTENT_MD5_HEADER, HashingUtils::Base64Encode(md5HashResult.GetResult()));

  Under FIPS, aws_md5_default_new fails at init_ex_fn(ctx, EVP_md5(), NULL) (opensslcrypto_hash.c:72) and returns NULL. The CRT Hash wraps the null handle, ComputeOneShot
  fails, and CRTHash::Calculate returns false (CRTHash.cpp). So md5HashResult.IsSuccess() is false → the Content-MD5 header is simply omitted, no exception thrown.

It doesn't look right to silently skip checksumming if it is enabled. Also, it doesn't look right to deliberately fall back to CRC. So, I think yes, the change makes sense in general. I'd probably condition the default algorithm on isFIPSEnabled: for normal builds, use MD5 (we should have it as an option), and for FIPS, default to SHA.
I will take a closer look at the code later this week.

Comment thread src/Core/SettingsChangesHistory.cpp Outdated
Comment thread src/IO/tests/gtest_writebuffer_s3.cpp Outdated
@sebver

sebver commented Jun 22, 2026

Copy link
Copy Markdown
Author
  For S3, the Content-MD5 header is computed by the bundled AWS SDK, which routes through aws-c-cal. On Linux that calls OpenSSL directly:

  // contrib/aws-c-cal/source/unix/opensslcrypto_hash.c:72
  if (!g_aws_openssl_evp_md_ctx_table->init_ex_fn(ctx, EVP_md5(), NULL)) {
      s_destroy(hash);
      aws_raise_error(AWS_ERROR_UNKNOWN);
      return NULL;
  }

  What FIPS-CH actually does on a normal bucket: silently sends no checksum

  The MD5 computation happens here:

  // contrib/aws/src/aws-cpp-sdk-core/source/client/AWSClient.cpp:902
  auto md5HashResult = Aws::Utils::Crypto::CreateMD5Implementation()->Calculate(*body);
  body->clear();
  if (md5HashResult.IsSuccess())   // <-- only set on success
      httpRequest->SetHeaderValue(Http::CONTENT_MD5_HEADER, HashingUtils::Base64Encode(md5HashResult.GetResult()));

  Under FIPS, aws_md5_default_new fails at init_ex_fn(ctx, EVP_md5(), NULL) (opensslcrypto_hash.c:72) and returns NULL. The CRT Hash wraps the null handle, ComputeOneShot
  fails, and CRTHash::Calculate returns false (CRTHash.cpp). So md5HashResult.IsSuccess() is false → the Content-MD5 header is simply omitted, no exception thrown.

It doesn't look right to silently skip checksumming if it is enabled. Also, it doesn't look right to deliberately fall back to CRC. So, I think yes, the change makes sense in general. I'd probably condition the default algorithm on isFIPSEnabled: for normal builds, use MD5 (we should have it as an option), and for FIPS, default to SHA. I will take a closer look at the code later this week.

Thanks for checking! I will make some more changes and add MD5/the FIPS default like you suggested (I didn't know isFIPSEnabled was a thing), and also get it AI-approved and green.

Comment thread src/IO/S3/Requests.cpp

@nickitat nickitat left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, a couple of cosmetic changes that I think would be better explained in code: https://pastila.clickhouse.com/?005b9765/c3a0323f6ce7f3726842098165b72996#zJQf6bFvxVcJvJTJx/7PBw==GCM

Comment thread src/Core/Settings.cpp Outdated
Do not calculate a checksum when sending a file to S3. This speeds up writes by avoiding excessive processing passes on a file. It is mostly safe as the data of MergeTree tables is checksummed by ClickHouse anyway, and when S3 is accessed with HTTPS, the TLS layer already provides integrity while transferring through the network. While additional checksums on S3 give defense in depth.
)", 0) \
DECLARE(String, s3_upload_checksum_algorithm, "", R"(
The checksum algorithm used for `S3` upload requests. `CRC32` and `SHA256` are sent as flexible `x-amz-checksum-*` headers, while `MD5` is sent as a `Content-MD5` header. `CRC32` and `SHA256` take precedence over `s3_disable_checksum`.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, "disable" should imply no checksumming regardless of the s3_upload_checksum_algorithm value.

@sebver sebver Jun 25, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I changed it.

I did keep the "forced" checksum for s3 express. Although I am actually not entirely sure it's necessary. AFAICT s3 express doesn't support MD5, but it does support an empty checksum (except for DeleteObjects, where it requires one and also supports MD5). This felt a bit like a separate task, so might make sense to keep the original behaviour for s3 express?

Comment thread src/IO/S3/copyS3File.cpp
Comment thread src/IO/S3/copyS3File.cpp Outdated
Comment thread src/IO/S3/copyS3File.cpp Outdated
Comment thread src/IO/S3/copyS3File.cpp Outdated
Comment thread src/IO/S3/copyS3File.cpp Outdated
Comment thread src/IO/WriteBufferFromS3.cpp Outdated
Comment thread src/IO/tests/gtest_writebuffer_s3.cpp
Comment thread src/IO/WriteBufferFromS3.cpp Outdated
Comment thread src/Core/Settings.cpp
@sebver

sebver commented Jun 25, 2026

Copy link
Copy Markdown
Author

Also, a couple of cosmetic changes that I think would be better explained in code: https://pastila.clickhouse.com/?005b9765/c3a0323f6ce7f3726842098165b72996#zJQf6bFvxVcJvJTJx/7PBw==GCM

Unfortunately I can't access this page.

I did try to fix most of the other comments.

Comment thread src/IO/S3RequestSettings.cpp
@nickitat

Copy link
Copy Markdown
Member

Also, a couple of cosmetic changes that I think would be better explained in code: pastila.clickhouse.com?005b9765/c3a0323f6ce7f3726842098165b72996#zJQf6bFvxVcJvJTJx/7PBw==GCM

Unfortunately I can't access this page.

I did try to fix most of the other comments.

Sorry, my bad. Another link: https://pastila.nl/?00385e67/7687f6486c8860490b4e4c48ff4c16cd#KZaixV40HSjsxKbFU9nFOg==GCM

Comment thread src/IO/S3/copyS3File.cpp
Comment thread src/Core/SettingsChangesHistory.cpp Outdated
Comment thread src/IO/S3/copyS3File.cpp Outdated
Comment thread src/IO/S3/copyS3File.cpp
Comment thread src/IO/S3/Requests.h Outdated
Comment thread src/IO/S3/Requests.h Outdated
Comment thread src/IO/S3/Requests.cpp
Comment thread src/Core/Settings.cpp
The checksum algorithm used for `S3` upload requests. `CRC32` and `SHA256` are sent as flexible `x-amz-checksum-*` headers, while `MD5` is sent as a `Content-MD5` header. `s3_disable_checksum` suppresses this setting for non-`S3Express` buckets.

By default the value is empty and ClickHouse lets the AWS SDK compute `Content-MD5`. In FIPS mode `MD5` is unavailable, so an empty value uses `SHA256` instead and an explicit `MD5` is rejected. To send no checksum for non-`S3Express` buckets, set `s3_disable_checksum`.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GCS is now a deliberate exception to this setting contract, but this doc string still reads as if the explicit algorithm choice and the FIPS SHA256 fallback apply to every S3 endpoint. In the new isClientForGCS() short-circuit we return nullopt, so storage.googleapis.com uploads ignore s3_upload_checksum_algorithm and keep the old no-flexible-checksum behavior. Please mention that provider exception here (and in the PR changelog entry), otherwise the public docs promise behavior the code intentionally does not provide.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot pls address this suggestion

@clickhouse-gh

clickhouse-gh Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.50% 85.60% +0.10%
Functions 92.70% 92.70% +0.00%
Branches 77.70% 77.70% +0.00%

Changed lines: Changed C/C++ lines covered: 444/486 (91.36%) · Uncovered code

Full report · Diff report

@nickitat

nickitat commented Jul 4, 2026

Copy link
Copy Markdown
Member

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-improvement Pull request with some product improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow forcing x-amz-checksum-* headers for S3 uploads

5 participants