Fix flaky 01710_projections by pinning index_granularity#107698
Conversation
The test creates projection_test without an explicit index_granularity, so CI's --random-merge-tree-settings injects a small value (e.g. 3). With many small granules the projection part's pre-aggregated states are read across multiple concurrent streams (notably on s3 storage), so the float aggregates avg(block_count / duration) etc. are summed in a different order, producing a trailing-digit difference from the reference (1.2352100180287104 vs ...118). Pinning index_granularity = 8192 keeps the data in a single granule, making the read and merge order deterministic. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
cc @CurtizJ — could you review this? It pins |
|
Workflow [PR], commit [3c9ef2a] Summary: ❌
AI ReviewSummary
Final Verdict
|
Algunenano
left a comment
There was a problem hiding this comment.
@groeneai When linking failures, please provide full links. The one you shared is invalid: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=107634 does not resolve to a html page, and even if it did it would require finding the exact commit and check that triggered the fail:
Instead share https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=107634&sha=5a342b69fe28bf9c7cedb4b0a5ace9fa8160c469&name_0=PR&name_1=Stateless%20tests%20%28amd_tsan%2C%20s3%20storage%2C%20parallel%2C%202%2F2%29 This points to the exact failure.
| drop table if exists projection_test; | ||
|
|
||
| create table projection_test (`sum(block_count)` UInt64, domain_alias UInt64 alias length(domain), datetime DateTime, domain LowCardinality(String), x_id String, y_id String, block_count Int64, retry_count Int64, duration Int64, kbytes Int64, buffer_time Int64, first_time Int64, total_bytes Nullable(UInt64), valid_bytes Nullable(UInt64), completed_bytes Nullable(UInt64), fixed_bytes Nullable(UInt64), force_bytes Nullable(UInt64), projection p (select toStartOfMinute(datetime) dt_m, countIf(first_time = 0) / count(), avg((kbytes * 8) / duration), count(), sum(block_count) / sum(duration), avg(block_count / duration), sum(buffer_time) / sum(duration), avg(buffer_time / duration), sum(valid_bytes) / sum(total_bytes), sum(completed_bytes) / sum(total_bytes), sum(fixed_bytes) / sum(total_bytes), sum(force_bytes) / sum(total_bytes), sum(valid_bytes) / sum(total_bytes), sum(retry_count) / sum(duration), avg(retry_count / duration), countIf(block_count > 0) / count(), countIf(first_time = 0) / count(), uniqHLL12(x_id), uniqHLL12(y_id) group by dt_m, domain)) engine MergeTree partition by toDate(datetime) order by toStartOfTenMinutes(datetime) settings index_granularity_bytes = 10000000, add_minmax_index_for_numeric_columns=0; | ||
| create table projection_test (`sum(block_count)` UInt64, domain_alias UInt64 alias length(domain), datetime DateTime, domain LowCardinality(String), x_id String, y_id String, block_count Int64, retry_count Int64, duration Int64, kbytes Int64, buffer_time Int64, first_time Int64, total_bytes Nullable(UInt64), valid_bytes Nullable(UInt64), completed_bytes Nullable(UInt64), fixed_bytes Nullable(UInt64), force_bytes Nullable(UInt64), projection p (select toStartOfMinute(datetime) dt_m, countIf(first_time = 0) / count(), avg((kbytes * 8) / duration), count(), sum(block_count) / sum(duration), avg(block_count / duration), sum(buffer_time) / sum(duration), avg(buffer_time / duration), sum(valid_bytes) / sum(total_bytes), sum(completed_bytes) / sum(total_bytes), sum(fixed_bytes) / sum(total_bytes), sum(force_bytes) / sum(total_bytes), sum(valid_bytes) / sum(total_bytes), sum(retry_count) / sum(duration), avg(retry_count / duration), countIf(block_count > 0) / count(), countIf(first_time = 0) / count(), uniqHLL12(x_id), uniqHLL12(y_id) group by dt_m, domain)) engine MergeTree partition by toDate(datetime) order by toStartOfTenMinutes(datetime) settings index_granularity = 8192, index_granularity_bytes = 10000000, add_minmax_index_for_numeric_columns=0; |
There was a problem hiding this comment.
Please leave a comment when you add settings, explaining why you did it. See the example add_minmax_index_for_numeric_columns above
There was a problem hiding this comment.
Done in 3c9ef2a. Added an inline comment above the setting matching the add_minmax_index_for_numeric_columns style: index_granularity = 8192 is pinned because --random-merge-tree-settings otherwise injects a small value that splits the projection part across concurrent read streams, changing float accumulation order and making the result flaky.
|
@Algunenano Done. Updated the PR description to the full report URL pointing to the exact commit and check: |
Add an inline comment next to the index_granularity = 8192 setting, matching the existing add_minmax_index_for_numeric_columns note, per review feedback on PR ClickHouse#107698. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
LLVM Coverage Report
Changed lines: No C/C++ source files changed — skipping uncovered code analysis. Newly covered by added/modified tests: 415 line(s), 42 function(s) across 123 file(s) · Details Top files
|
229abc8

No open issue found specifically for
01710_projections. Related (different test, closed): #104219.Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
...
Description
Fixes flaky
01710_projectionsonStateless tests (amd_tsan, s3 storage, parallel). The table is created without an explicitindex_granularity, so--random-merge-tree-settingsinjects a small value (the CI minimizer isolated--index_granularity 3). Small granules let the projection part's pre-aggregated states be read across multiple concurrent streams (frequent on s3), changing the summation order of float aggregates likeavg(block_count / duration), which yields a trailing-digit difference from the reference (1.2352100180287104vs...118).Pinning
index_granularity = 8192keeps the projection in a single granule so the read/merge order is deterministic. The randomizer skips settings already present in the CREATE, so the pin wins.0 master hits in 30d; flaky across 5 unrelated PRs. CI report for the exact failure: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=107634&sha=5a342b69fe28bf9c7cedb4b0a5ace9fa8160c469&name_0=PR&name_1=Stateless%20tests%20%28amd_tsan%2C%20s3%20storage%2C%20parallel%2C%202%2F2%29
Version info
26.6.1.929