Add s3_upload_checksum_algorithm setting#107318
Conversation
|
Workflow [PR], commit [b9de102] Summary: ✅
AI ReviewSummaryThis PR adds PR Metadata💡 Findings
💡 Nits
Final Verdict
|
|
hi @nickitat (apologies if you're not the best person to tag, I tried to look for a maintainer with s3 knowledge), I was wondering if this change makes sense from your PoV? Or if you think something needs to be done differently? |
It doesn't look right to silently skip checksumming if it is enabled. Also, it doesn't look right to deliberately fall back to CRC. So, I think yes, the change makes sense in general. I'd probably condition the default algorithm on |
Thanks for checking! I will make some more changes and add MD5/the FIPS default like you suggested (I didn't know |
nickitat
left a comment
There was a problem hiding this comment.
Also, a couple of cosmetic changes that I think would be better explained in code: https://pastila.clickhouse.com/?005b9765/c3a0323f6ce7f3726842098165b72996#zJQf6bFvxVcJvJTJx/7PBw==GCM
| Do not calculate a checksum when sending a file to S3. This speeds up writes by avoiding excessive processing passes on a file. It is mostly safe as the data of MergeTree tables is checksummed by ClickHouse anyway, and when S3 is accessed with HTTPS, the TLS layer already provides integrity while transferring through the network. While additional checksums on S3 give defense in depth. | ||
| )", 0) \ | ||
| DECLARE(String, s3_upload_checksum_algorithm, "", R"( | ||
| The checksum algorithm used for `S3` upload requests. `CRC32` and `SHA256` are sent as flexible `x-amz-checksum-*` headers, while `MD5` is sent as a `Content-MD5` header. `CRC32` and `SHA256` take precedence over `s3_disable_checksum`. |
There was a problem hiding this comment.
IMO, "disable" should imply no checksumming regardless of the s3_upload_checksum_algorithm value.
There was a problem hiding this comment.
I agree, I changed it.
I did keep the "forced" checksum for s3 express. Although I am actually not entirely sure it's necessary. AFAICT s3 express doesn't support MD5, but it does support an empty checksum (except for DeleteObjects, where it requires one and also supports MD5). This felt a bit like a separate task, so might make sense to keep the original behaviour for s3 express?
Unfortunately I can't access this page. I did try to fix most of the other comments. |
Sorry, my bad. Another link: https://pastila.nl/?00385e67/7687f6486c8860490b4e4c48ff4c16cd#KZaixV40HSjsxKbFU9nFOg==GCM |
| The checksum algorithm used for `S3` upload requests. `CRC32` and `SHA256` are sent as flexible `x-amz-checksum-*` headers, while `MD5` is sent as a `Content-MD5` header. `s3_disable_checksum` suppresses this setting for non-`S3Express` buckets. | ||
|
|
||
| By default the value is empty and ClickHouse lets the AWS SDK compute `Content-MD5`. In FIPS mode `MD5` is unavailable, so an empty value uses `SHA256` instead and an explicit `MD5` is rejected. To send no checksum for non-`S3Express` buckets, set `s3_disable_checksum`. | ||
|
|
There was a problem hiding this comment.
GCS is now a deliberate exception to this setting contract, but this doc string still reads as if the explicit algorithm choice and the FIPS SHA256 fallback apply to every S3 endpoint. In the new isClientForGCS() short-circuit we return nullopt, so storage.googleapis.com uploads ignore s3_upload_checksum_algorithm and keep the old no-flexible-checksum behavior. Please mention that provider exception here (and in the PR changelog entry), otherwise the public docs promise behavior the code intentionally does not provide.
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered: 444/486 (91.36%) · Uncovered code |

Closes: #106106
This adds a
s3_upload_checksum_algorithmsetting to force a flexible checksum algorithm (CRC32orSHA256) on S3 upload requests. Until now ClickHouse only forced flexible checksums for S3Express buckets; for everything else it relied onContent-MD5, or nothing ifs3_disable_checksumwas set.That's a problem when
Content-MD5isn't an option. For example, on a FIPS image MD5 is unavailable, so the upload goes out with no integrity header at all. AWS S3 buckets with Object Lock / default retention then reject it:With the setting, ClickHouse attaches the corresponding
x-amz-checksum-*header and marks the checksum as required, so these uploads go through. It applies to all upload paths:PutObject,CreateMultipartUpload,UploadPartandCompleteMultipartUpload. The empty default keeps the current behavior. For non-S3Expressbuckets,s3_disable_checksumtakes precedence and suppresses this setting.Implementation-wise, the S3Express-only checksum code in
Requests.hwas generalized into aRequestChecksum::Algorithmenum, and the request hooks (RequestChecksumRequired,ShouldComputeContentMd5,GetChecksumAlgorithmName) now check ahasRequestChecksum()helper that's true for both S3Express and an explicitly requested algorithm. S3Express buckets keep their mandatoryCRC32, and invalid values are rejected during settings validation.Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Added the
s3_upload_checksum_algorithmsetting to chooseMD5,CRC32, orSHA256forS3uploads.CRC32andSHA256are sent as AWS flexiblex-amz-checksum-*headers;MD5uses the SDKContent-MD5path. For non-S3Expressbuckets,s3_disable_checksumtakes precedence and suppresses this setting. The empty (default) value keeps the SDKContent-MD5, except in FIPS mode where it now sendsSHA256instead of no checksum.