docs: document ClickBench setup details #23315
@kosiew FYI
The runner registers the parquet data as `hits_raw`, then creates a
`hits` view that casts `EventDate` through `INTEGER` to `DATE` for the
benchmark queries.
- The full and partitioned ClickBench datasets may store string columns
This only applies to the partitioned ClickBench dataset -- what happens is that the string columns don't have a "string" logical type annotation in the parquet files
Maybe a better description is:
The source partitioned ClickBench dataset has string columns without
the "string" Parquet logical type annotation. These must be treated as
strings to run the queries correctly, so the runner enables the Parquet `binary_as_string`
option.
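To make the suggested wording concrete, here is a minimal sketch of what enabling that option could look like from SQL. It assumes `datafusion-cli` as the client, the session-level key `datafusion.execution.parquet.binary_as_string`, and `benchmarks/data/hits_partitioned` as the download location; the scratch file is purely illustrative, and none of this is taken from the runner's code.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch file; the benchmark runner itself does not use one.
SQL_FILE="$(mktemp)"

cat > "$SQL_FILE" <<'SQL'
-- Read Parquet binary columns that lack the string annotation as strings.
SET datafusion.execution.parquet.binary_as_string = true;

-- Assumed location of the partitioned dataset.
CREATE EXTERNAL TABLE hits_raw
STORED AS PARQUET
LOCATION 'benchmarks/data/hits_partitioned';

-- With the option on, this should report a string type rather than a binary one.
SELECT arrow_typeof("URL") FROM hits_raw LIMIT 1;
SQL

echo "SQL written to $SQL_FILE; run it with: datafusion-cli -f $SQL_FILE"
```

Checking the type with `arrow_typeof` before and after the `SET` is a quick way to confirm whether a given dataset needs the option at all.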
```sql
CREATE EXTERNAL TABLE hits_raw
STORED AS PARQUET
LOCATION 'benchmarks/data/hits.parquet'
```
I'm pretty sure this is only necessary for hits_partitioned
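For context on the step under discussion, the `hits` view described in the diff could be sketched roughly as follows. This is an assumption about the shape of the DDL, not a copy of what the runner issues: it presumes `hits_raw` is already registered and that `EventDate` holds days since 1970-01-01, which is how an integer-to-`DATE` cast is interpreted. The scratch file name is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch file for the sketch.
VIEW_SQL="$(mktemp)"

cat > "$VIEW_SQL" <<'SQL'
-- Assumes hits_raw is already registered and EventDate is an integer day count.
-- Replace the raw column with one cast through INTEGER to DATE.
CREATE VIEW hits AS
SELECT * EXCEPT ("EventDate"),
       CAST(CAST("EventDate" AS INTEGER) AS DATE) AS "EventDate"
FROM hits_raw;
SQL

# Print the DDL so it can be pasted into a SQL client.
cat "$VIEW_SQL"
```

If the single-file dataset already stores `EventDate` with a date type, the view would be redundant there, which is the point the comment above raises.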
```shell
./benchmarks/bench.sh data clickbench_1
cargo run --release --bin dfbench -- clickbench \
```
I don't think anyone would run a command like that -- instead, if they want to run all the queries, they would use `bench.sh run`
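A minimal sketch of the workflow this comment points to, assuming `bench.sh run` accepts the same `clickbench_1` benchmark name as `bench.sh data`. The `run` wrapper and the `DRY_RUN` variable are hypothetical helpers added so the snippet only prints the commands by default; set `DRY_RUN=0` from the repository root to execute them.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

run ./benchmarks/bench.sh data clickbench_1   # download the single-file dataset
run ./benchmarks/bench.sh run clickbench_1    # run all ClickBench queries against it
```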

Which issue does this PR close?
Rationale for this change
Docs consolidation. Explained in the issue.
What changes are included in this PR?
Only documentation.
Are these changes tested?
N/A. No code changes.
Are there any user-facing changes?
None. Only documentation.
LLM-generated code disclosure
This PR includes LLM-generated content, all of which was manually reviewed.