Fix ClickBench EventDate handling by casting UInt16 days-since-epoch to DATE via hits view#19881
Conversation
…gister_hits - Added a CREATE VIEW statement that selects all columns except "EventDate" and casts "EventDate" from UInt16 to Date32. - Updated the external table creation logic to use 'hits_raw' instead of 'hits'. - Refactored the registration process to include the new view for EventDate encoding.
…t table/view definitions
- Changed `CAST(CAST("EventDate" AS Int32) AS Date32)` to `CAST(CAST("EventDate" AS INTEGER) AS DATE)` in `clickbench.rs`
- Updated SQL logic test file `clickbench.slt` to reflect the SQL syntax change for consistency.
…ifiers in clickbench.slt
- Added HITS_VIEW_DDL constant to define the SQL view creation for `hits` with proper `EventDate` casting from UInt16 to DATE. - Refactored the `register_hits` function to utilize a new `create_hits_view` function for better separation of concerns. - Added SQL logic tests in `clickbench.slt` to verify the transformation of `EventDate` from UInt16 to DATE and ensure raw values remain unaffected in `hits_raw`.
nuno-faria
left a comment
There was a problem hiding this comment.
Thanks @kosiew.
BTW @waynexia @Omega359 @pmcgleenon @alamb since you're working on updating the clickbench results.
|
🤖 |
|
run benchmark clickbench_partitioned |
alamb
left a comment
There was a problem hiding this comment.
Thank you @kosiew and @nuno-faria
FYI @Dandandan
I think this is very legit and it seems like DuckDB does a very similar thing here:
I will file a ticket to update the actual ClickBench runner scripts with this same change
| SELECT MIN("EventDate"), MAX("EventDate") FROM hits; | ||
| ---- | ||
| 15901 15901 | ||
| 2013-07-15 2013-07-15 |
| /// ClickBench stores EventDate as UInt16 (days since 1970-01-01) for | ||
| /// storage efficiency (2 bytes vs 4-8 bytes for date types). | ||
| /// This view transforms it to SQL DATE type for query compatibility. | ||
| const HITS_VIEW_DDL: &str = r#"CREATE VIEW hits AS |
There was a problem hiding this comment.
This only changes the DataFusion benchmark runner -- it won't change the actual clickbench results which are created with the scripts here: https://github.com/ClickHouse/ClickBench/tree/main/datafusion
In order to keep the reported benchmarks close to our actual runner I suggest we add the same thing to the clickbench runner scripts too. I will file a ticket
| STORED AS PARQUET | ||
| LOCATION '../core/tests/data/clickbench_hits_10.parquet'; | ||
|
|
||
| # ClickBench encodes EventDate as UInt16 days since epoch. |
There was a problem hiding this comment.
I worry that we are spreading the knowledge needed to run DataFusion on ClickBench effectively all over the place. For example, this view definition is now copied twice
Would it be possible to extend this documentation (either in this PR or another) here
https://github.com/apache/datafusion/blob/main/benchmarks/README.md#clickbench
To mention:
- The need to add the view / cast the integer column (basically the same great explanation in
HITS_VIEW_DDL) - The need to run with
binary_as_string(e.g. https://github.com/ClickHouse/ClickBench/blob/d88f3552bd928e86774e09901f05a6c1ea4e3ef4/datafusion/create.sql#L4)
Note this .slt test doesn't need the binary as string as the conversion aparently was done when creating the subset (the columns say Utf8View, not BinaryView)
andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs$ datafusion-cli
DataFusion CLI v52.0.0
> describe '/Users/andrewlamb/Software/datafusion/datafusion/core/tests/data/clickbench_hits_10.parquet';
+-----------------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-----------------------+-----------+-------------+
| WatchID | Int64 | YES |
| JavaEnable | Int16 | YES |
| Title | Utf8View | YES |
| GoodEvent | Int16 | YES |
| EventTime | Int64 | YES |
| EventDate | UInt16 | YES |
| CounterID | Int32 | YES |
| ClientIP | Int32 | YES |
| RegionID | Int32 | YES |
| UserID | Int64 | YES |
| CounterClass | Int16 | YES |
| OS | Int16 | YES |
| UserAgent | Int16 | YES |
| URL | Utf8View | YES |
| Referer | Utf8View | YES |
| IsRefresh | Int16 | YES |
| RefererCategoryID | Int16 | YES |
| RefererRegionID | Int32 | YES |
| URLCategoryID | Int16 | YES |
| URLRegionID | Int32 | YES |
| ResolutionWidth | Int16 | YES |
| ResolutionHeight | Int16 | YES |
| ResolutionDepth | Int16 | YES |
| FlashMajor | Int16 | YES |
| FlashMinor | Int16 | YES |
| FlashMinor2 | Utf8View | YES |
| NetMajor | Int16 | YES |
| NetMinor | Int16 | YES |
| UserAgentMajor | Int16 | YES |
| UserAgentMinor | Utf8View | YES |
| CookieEnable | Int16 | YES |
| JavascriptEnable | Int16 | YES |
| IsMobile | Int16 | YES |
| MobilePhone | Int16 | YES |
| MobilePhoneModel | Utf8View | YES |
| Params | Utf8View | YES |
| IPNetworkID | Int32 | YES |
| TraficSourceID | Int16 | YES |
| SearchEngineID | Int16 | YES |
| SearchPhrase | Utf8View | YES |
| AdvEngineID | Int16 | YES |
| IsArtifical | Int16 | YES |
| WindowClientWidth | Int16 | YES |
| WindowClientHeight | Int16 | YES |
| ClientTimeZone | Int16 | YES |
| ClientEventTime | Int64 | YES |
| SilverlightVersion1 | Int16 | YES |
| SilverlightVersion2 | Int16 | YES |
| SilverlightVersion3 | Int32 | YES |
| SilverlightVersion4 | Int16 | YES |
| PageCharset | Utf8View | YES |
| CodeVersion | Int32 | YES |
| IsLink | Int16 | YES |
| IsDownload | Int16 | YES |
| IsNotBounce | Int16 | YES |
| FUniqID | Int64 | YES |
| OriginalURL | Utf8View | YES |
| HID | Int32 | YES |
| IsOldCounter | Int16 | YES |
| IsEvent | Int16 | YES |
| IsParameter | Int16 | YES |
| DontCountHits | Int16 | YES |
| WithHash | Int16 | YES |
| HitColor | Utf8View | YES |
| LocalEventTime | Int64 | YES |
| Age | Int16 | YES |
| Sex | Int16 | YES |
| Income | Int16 | YES |
| Interests | Int16 | YES |
| Robotness | Int16 | YES |
| RemoteIP | Int32 | YES |
| WindowName | Int32 | YES |
| OpenerName | Int32 | YES |
| HistoryLength | Int16 | YES |
| BrowserLanguage | Utf8View | YES |
| BrowserCountry | Utf8View | YES |
| SocialNetwork | Utf8View | YES |
| SocialAction | Utf8View | YES |
| HTTPError | Int16 | YES |
| SendTiming | Int32 | YES |
| DNSTiming | Int32 | YES |
| ConnectTiming | Int32 | YES |
| ResponseStartTiming | Int32 | YES |
| ResponseEndTiming | Int32 | YES |
| FetchTiming | Int32 | YES |
| SocialSourceNetworkID | Int16 | YES |
| SocialSourcePage | Utf8View | YES |
| ParamPrice | Int64 | YES |
| ParamOrderID | Utf8View | YES |
| ParamCurrency | Utf8View | YES |
| ParamCurrencyID | Int16 | YES |
| OpenstatServiceName | Utf8View | YES |
| OpenstatCampaignID | Utf8View | YES |
| OpenstatAdID | Utf8View | YES |
| OpenstatSourceID | Utf8View | YES |
| UTMSource | Utf8View | YES |
| UTMMedium | Utf8View | YES |
| UTMCampaign | Utf8View | YES |
| UTMContent | Utf8View | YES |
| UTMTerm | Utf8View | YES |
| FromTag | Utf8View | YES |
| HasGCLID | Int16 | YES |
| RefererHash | Int64 | YES |
| URLHash | Int64 | YES |
| CLID | Int32 | YES |
+-----------------------+-----------+-------------+
105 row(s) fetched.
Elapsed 0.010 seconds.|
Filed a ticket to track updating clickbench scripts: |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤔 I think we need to look into why this is slower.... |
|
run benchmarks |
|
🤖 |
I think the aggregations were not being triggered before, since the filter did not match any rows. |
|
🤖: Benchmark completed Details
|
So that probably means the results are now correct because the queries are doing more work? Did we verify the new results are now correct by comparing with some other implementation (e.g. DuckDB?) |
I checked and the results now match with DuckDB. |
alamb
left a comment
There was a problem hiding this comment.
Thank you @kosiew and @nuno-faria
FYI @Dandandan and @waynexia -- this change will slow down some queries (though now the answers are correct!)
…to DATE via `hits` view (apache#19881) ## Which issue does this PR close? * Closes apache#18982. --- ## Rationale for this change ClickBench encodes `EventDate` as a `UInt16` representing **days since 1970-01-01**. When DataFusion registers the ClickBench parquet file directly as `hits`, `EventDate` ends up being compared as a string in some queries (notably ClickBench queries 36–42), which causes the date range predicates to filter out all rows. To make ClickBench queries behave as authored (and align with how other engines handle the dataset), we expose `hits` as a view that converts the raw `UInt16` encoding into a proper SQL `DATE`. --- ## What changes are included in this PR? * Register the underlying parquet table as **`hits_raw`** instead of `hits`. * Add a constant **`HITS_VIEW_DDL`** that defines a `hits` view which: * Removes the original `EventDate` column, and * Re-introduces it as `DATE` using `CAST(CAST("EventDate" AS INTEGER) AS DATE)`. * Factor view creation into a helper method (`create_hits_view`) and add error context for easier debugging. * Update the ClickBench sqllogictest file to: * Create `hits_raw` + `hits` view, * Add explicit assertions validating the transformation (`15901` ↔ `2013-07-15`), * Update expected result types where `EventDate` is now a `DATE`, and * Drop the view before dropping the raw table. --- ## Are these changes tested? Yes. * Updated `datafusion/sqllogictest/test_files/clickbench.slt` to cover: * Correct `EventDate` decoding in the `hits` view (returns `DATE`), * Raw `hits_raw.EventDate` remains the original integer encoding, and * Existing ClickBench queries that rely on date predicates (including the previously failing range-filter queries) now execute with the correct types. Script to test [q36-q42](apache#18982). `benchmarks/run_q36_q42.sh` ```bash #!/usr/bin/env bash # Script to run ClickBench queries 36-42 and display results set -e SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd ) BENCHMARK=${1:-"clickbench_1"} OUTPUT_FILE="${2:-results_q36_q42.txt}" echo "==========================================" echo "Running ClickBench Queries 36-42" echo "==========================================" echo "Benchmark: $BENCHMARK" echo "Output file: $OUTPUT_FILE" echo "" # Create results file > "$OUTPUT_FILE" # Run queries 36-42 for q in {36..42}; do echo "Running Query $q..." # Run the query and extract relevant info output=$($SCRIPT_DIR/bench.sh run $BENCHMARK $q 2>&1) # Extract timing and row count from the first iteration iteration_0=$(echo "$output" | grep "Query $q iteration 0" | head -1) avg_time=$(echo "$output" | grep "Query $q avg time" | head -1) echo "Q$q: $iteration_0" | tee -a "$OUTPUT_FILE" echo " $avg_time" | tee -a "$OUTPUT_FILE" echo "" | tee -a "$OUTPUT_FILE" done echo "==========================================" echo "Summary saved to: $OUTPUT_FILE" echo "==========================================" cat "$OUTPUT_FILE" ``` Run results on this branch: ``` Q36: Query 36 iteration 0 took 138.1 ms and returned 10 rows Query 36 avg time: 116.19 ms Q37: Query 37 iteration 0 took 66.4 ms and returned 10 rows Query 37 avg time: 50.57 ms Q38: Query 38 iteration 0 took 98.9 ms and returned 10 rows Query 38 avg time: 83.20 ms Q39: Query 39 iteration 0 took 237.3 ms and returned 10 rows Query 39 avg time: 223.62 ms Q40: Query 40 iteration 0 took 40.6 ms and returned 10 rows Query 40 avg time: 24.43 ms Q41: Query 41 iteration 0 took 36.6 ms and returned 10 rows Query 41 avg time: 22.90 ms Q42: Query 42 iteration 0 took 34.0 ms and returned 10 rows Query 42 avg time: 20.17 ms ``` On `main` branch, the queries return 0 rows. --- ## Are there any user-facing changes? Yes (benchmark/test behavior): * In ClickBench runs, the logical table name `hits` continues to exist, but it is now a **view** that exposes `EventDate` as a proper `DATE` rather than the raw `UInt16` encoding. * This fixes ClickBench queries 36–42 so they return rows without requiring manual casts in the SQL. No public API changes. --- ## LLM-generated code disclosure This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

Which issue does this PR close?
Rationale for this change
ClickBench encodes
EventDateas aUInt16representing days since 1970-01-01. When DataFusion registers the ClickBench parquet file directly ashits,EventDateends up being compared as a string in some queries (notably ClickBench queries 36–42), which causes the date range predicates to filter out all rows.To make ClickBench queries behave as authored (and align with how other engines handle the dataset), we expose
hitsas a view that converts the rawUInt16encoding into a proper SQLDATE.What changes are included in this PR?
Register the underlying parquet table as
hits_rawinstead ofhits.Add a constant
HITS_VIEW_DDLthat defines ahitsview which:EventDatecolumn, andDATEusingCAST(CAST("EventDate" AS INTEGER) AS DATE).Factor view creation into a helper method (
create_hits_view) and add error context for easier debugging.Update the ClickBench sqllogictest file to:
hits_raw+hitsview,15901↔2013-07-15),EventDateis now aDATE, andAre these changes tested?
Yes.
Updated
datafusion/sqllogictest/test_files/clickbench.sltto cover:EventDatedecoding in thehitsview (returnsDATE),hits_raw.EventDateremains the original integer encoding, andScript to test q36-q42.
benchmarks/run_q36_q42.shRun results on this branch:
On
mainbranch, the queries return 0 rows.Are there any user-facing changes?
Yes (benchmark/test behavior):
hitscontinues to exist, but it is now a view that exposesEventDateas a properDATErather than the rawUInt16encoding.No public API changes.
LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.