Fix ClickBench EventDate handling by casting UInt16 days-since-epoch to DATE via `hits` view by kosiew · Pull Request #19881 · apache/datafusion · GitHub
Skip to content

Fix ClickBench EventDate handling by casting UInt16 days-since-epoch to DATE via hits view#19881

Merged
kosiew merged 8 commits into
apache:mainfrom
kosiew:eventdate-handling-18982
Jan 27, 2026
Merged

Fix ClickBench EventDate handling by casting UInt16 days-since-epoch to DATE via hits view#19881
kosiew merged 8 commits into
apache:mainfrom
kosiew:eventdate-handling-18982

Conversation

@kosiew

@kosiew kosiew commented Jan 19, 2026

Copy link
Copy Markdown
Contributor

Which issue does this PR close?


Rationale for this change

ClickBench encodes EventDate as a UInt16 representing days since 1970-01-01. When DataFusion registers the ClickBench parquet file directly as hits, EventDate ends up being compared as a string in some queries (notably ClickBench queries 36–42), which causes the date range predicates to filter out all rows.

To make ClickBench queries behave as authored (and align with how other engines handle the dataset), we expose hits as a view that converts the raw UInt16 encoding into a proper SQL DATE.


What changes are included in this PR?

  • Register the underlying parquet table as hits_raw instead of hits.

  • Add a constant HITS_VIEW_DDL that defines a hits view which:

    • Removes the original EventDate column, and
    • Re-introduces it as DATE using CAST(CAST("EventDate" AS INTEGER) AS DATE).
  • Factor view creation into a helper method (create_hits_view) and add error context for easier debugging.

  • Update the ClickBench sqllogictest file to:

    • Create hits_raw + hits view,
    • Add explicit assertions validating the transformation (159012013-07-15),
    • Update expected result types where EventDate is now a DATE, and
    • Drop the view before dropping the raw table.

Are these changes tested?

Yes.

  • Updated datafusion/sqllogictest/test_files/clickbench.slt to cover:

    • Correct EventDate decoding in the hits view (returns DATE),
    • Raw hits_raw.EventDate remains the original integer encoding, and
    • Existing ClickBench queries that rely on date predicates (including the previously failing range-filter queries) now execute with the correct types.

Script to test q36-q42.

benchmarks/run_q36_q42.sh

#!/usr/bin/env bash
# Script to run ClickBench queries 36-42 and display results

set -e

SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
BENCHMARK=${1:-"clickbench_1"}
OUTPUT_FILE="${2:-results_q36_q42.txt}"

echo "=========================================="
echo "Running ClickBench Queries 36-42"
echo "=========================================="
echo "Benchmark: $BENCHMARK"
echo "Output file: $OUTPUT_FILE"
echo ""

# Create results file
> "$OUTPUT_FILE"

# Run queries 36-42
for q in {36..42}; do
    echo "Running Query $q..."
    
    # Run the query and extract relevant info
    output=$($SCRIPT_DIR/bench.sh run $BENCHMARK $q 2>&1)
    
    # Extract timing and row count from the first iteration
    iteration_0=$(echo "$output" | grep "Query $q iteration 0" | head -1)
    avg_time=$(echo "$output" | grep "Query $q avg time" | head -1)
    
    echo "Q$q: $iteration_0" | tee -a "$OUTPUT_FILE"
    echo "     $avg_time" | tee -a "$OUTPUT_FILE"
    echo "" | tee -a "$OUTPUT_FILE"
done

echo "=========================================="
echo "Summary saved to: $OUTPUT_FILE"
echo "=========================================="
cat "$OUTPUT_FILE"

Run results on this branch:

Q36: Query 36 iteration 0 took 138.1 ms and returned 10 rows
     Query 36 avg time: 116.19 ms

Q37: Query 37 iteration 0 took 66.4 ms and returned 10 rows
     Query 37 avg time: 50.57 ms

Q38: Query 38 iteration 0 took 98.9 ms and returned 10 rows
     Query 38 avg time: 83.20 ms

Q39: Query 39 iteration 0 took 237.3 ms and returned 10 rows
     Query 39 avg time: 223.62 ms

Q40: Query 40 iteration 0 took 40.6 ms and returned 10 rows
     Query 40 avg time: 24.43 ms

Q41: Query 41 iteration 0 took 36.6 ms and returned 10 rows
     Query 41 avg time: 22.90 ms

Q42: Query 42 iteration 0 took 34.0 ms and returned 10 rows
     Query 42 avg time: 20.17 ms

On main branch, the queries return 0 rows.


Are there any user-facing changes?

Yes (benchmark/test behavior):

  • In ClickBench runs, the logical table name hits continues to exist, but it is now a view that exposes EventDate as a proper DATE rather than the raw UInt16 encoding.
  • This fixes ClickBench queries 36–42 so they return rows without requiring manual casts in the SQL.

No public API changes.


LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

…gister_hits

- Added a CREATE VIEW statement that selects all columns except "EventDate" and casts "EventDate" from UInt16 to Date32.
- Updated the external table creation logic to use 'hits_raw' instead of 'hits'.
- Refactored the registration process to include the new view for EventDate encoding.
- Changed `CAST(CAST("EventDate" AS Int32) AS Date32)` to `CAST(CAST("EventDate" AS INTEGER) AS DATE)` in `clickbench.rs`
- Updated SQL logic test file `clickbench.slt` to reflect the SQL syntax change for consistency.
- Added HITS_VIEW_DDL constant to define the SQL view creation for `hits` with proper `EventDate` casting from UInt16 to DATE.
- Refactored the `register_hits` function to utilize a new `create_hits_view` function for better separation of concerns.
- Added SQL logic tests in `clickbench.slt` to verify the transformation of `EventDate` from UInt16 to DATE and ensure raw values remain unaffected in `hits_raw`.
@github-actions github-actions Bot added the sqllogictest SQL Logic Tests (.slt) label Jan 19, 2026
@kosiew kosiew marked this pull request as ready for review January 19, 2026 07:00

@nuno-faria nuno-faria left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kosiew.
BTW @waynexia @Omega359 @pmcgleenon @alamb since you're working on updating the clickbench results.

@alamb

alamb commented Jan 25, 2026

Copy link
Copy Markdown
Contributor

@alamb-ghbot

Copy link
Copy Markdown

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing eventdate-handling-18982 (673fad3) to eadbed5 diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb

alamb commented Jan 25, 2026

Copy link
Copy Markdown
Contributor

run benchmark clickbench_partitioned

alamb
alamb previously approved these changes Jan 25, 2026

@alamb alamb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @kosiew and @nuno-faria

FYI @Dandandan

I think this is very legit and it seems like DuckDB does a very similar thing here:

https://github.com/ClickHouse/ClickBench/blob/d88f3552bd928e86774e09901f05a6c1ea4e3ef4/duckdb-parquet/create.sql#L3C24-L3C33

I will file a ticket to update the actual ClickBench runner scripts with this same change

SELECT MIN("EventDate"), MAX("EventDate") FROM hits;
----
15901 15901
2013-07-15 2013-07-15

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

/// ClickBench stores EventDate as UInt16 (days since 1970-01-01) for
/// storage efficiency (2 bytes vs 4-8 bytes for date types).
/// This view transforms it to SQL DATE type for query compatibility.
const HITS_VIEW_DDL: &str = r#"CREATE VIEW hits AS

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only changes the DataFusion benchmark runner -- it won't change the actual clickbench results which are created with the scripts here: https://github.com/ClickHouse/ClickBench/tree/main/datafusion

In order to keep the reported benchmarks close to our actual runner I suggest we add the same thing to the clickbench runner scripts too. I will file a ticket

STORED AS PARQUET
LOCATION '../core/tests/data/clickbench_hits_10.parquet';

# ClickBench encodes EventDate as UInt16 days since epoch.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry that we are spreading the knowledge needed to run DataFusion on ClickBench effectively all over the place. For example, this view definition is now copied twice

Would it be possible to extend this documentation (either in this PR or another) here
https://github.com/apache/datafusion/blob/main/benchmarks/README.md#clickbench

To mention:

  1. The need to add the view / cast the integer column (basically the same great explanation in HITS_VIEW_DDL)
  2. The need to run with binary_as_string (e.g. https://github.com/ClickHouse/ClickBench/blob/d88f3552bd928e86774e09901f05a6c1ea4e3ef4/datafusion/create.sql#L4)

Note this .slt test doesn't need the binary as string as the conversion aparently was done when creating the subset (the columns say Utf8View, not BinaryView)

andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs$ datafusion-cli
DataFusion CLI v52.0.0
> describe '/Users/andrewlamb/Software/datafusion/datafusion/core/tests/data/clickbench_hits_10.parquet';
+-----------------------+-----------+-------------+
| column_name           | data_type | is_nullable |
+-----------------------+-----------+-------------+
| WatchID               | Int64     | YES         |
| JavaEnable            | Int16     | YES         |
| Title                 | Utf8View  | YES         |
| GoodEvent             | Int16     | YES         |
| EventTime             | Int64     | YES         |
| EventDate             | UInt16    | YES         |
| CounterID             | Int32     | YES         |
| ClientIP              | Int32     | YES         |
| RegionID              | Int32     | YES         |
| UserID                | Int64     | YES         |
| CounterClass          | Int16     | YES         |
| OS                    | Int16     | YES         |
| UserAgent             | Int16     | YES         |
| URL                   | Utf8View  | YES         |
| Referer               | Utf8View  | YES         |
| IsRefresh             | Int16     | YES         |
| RefererCategoryID     | Int16     | YES         |
| RefererRegionID       | Int32     | YES         |
| URLCategoryID         | Int16     | YES         |
| URLRegionID           | Int32     | YES         |
| ResolutionWidth       | Int16     | YES         |
| ResolutionHeight      | Int16     | YES         |
| ResolutionDepth       | Int16     | YES         |
| FlashMajor            | Int16     | YES         |
| FlashMinor            | Int16     | YES         |
| FlashMinor2           | Utf8View  | YES         |
| NetMajor              | Int16     | YES         |
| NetMinor              | Int16     | YES         |
| UserAgentMajor        | Int16     | YES         |
| UserAgentMinor        | Utf8View  | YES         |
| CookieEnable          | Int16     | YES         |
| JavascriptEnable      | Int16     | YES         |
| IsMobile              | Int16     | YES         |
| MobilePhone           | Int16     | YES         |
| MobilePhoneModel      | Utf8View  | YES         |
| Params                | Utf8View  | YES         |
| IPNetworkID           | Int32     | YES         |
| TraficSourceID        | Int16     | YES         |
| SearchEngineID        | Int16     | YES         |
| SearchPhrase          | Utf8View  | YES         |
| AdvEngineID           | Int16     | YES         |
| IsArtifical           | Int16     | YES         |
| WindowClientWidth     | Int16     | YES         |
| WindowClientHeight    | Int16     | YES         |
| ClientTimeZone        | Int16     | YES         |
| ClientEventTime       | Int64     | YES         |
| SilverlightVersion1   | Int16     | YES         |
| SilverlightVersion2   | Int16     | YES         |
| SilverlightVersion3   | Int32     | YES         |
| SilverlightVersion4   | Int16     | YES         |
| PageCharset           | Utf8View  | YES         |
| CodeVersion           | Int32     | YES         |
| IsLink                | Int16     | YES         |
| IsDownload            | Int16     | YES         |
| IsNotBounce           | Int16     | YES         |
| FUniqID               | Int64     | YES         |
| OriginalURL           | Utf8View  | YES         |
| HID                   | Int32     | YES         |
| IsOldCounter          | Int16     | YES         |
| IsEvent               | Int16     | YES         |
| IsParameter           | Int16     | YES         |
| DontCountHits         | Int16     | YES         |
| WithHash              | Int16     | YES         |
| HitColor              | Utf8View  | YES         |
| LocalEventTime        | Int64     | YES         |
| Age                   | Int16     | YES         |
| Sex                   | Int16     | YES         |
| Income                | Int16     | YES         |
| Interests             | Int16     | YES         |
| Robotness             | Int16     | YES         |
| RemoteIP              | Int32     | YES         |
| WindowName            | Int32     | YES         |
| OpenerName            | Int32     | YES         |
| HistoryLength         | Int16     | YES         |
| BrowserLanguage       | Utf8View  | YES         |
| BrowserCountry        | Utf8View  | YES         |
| SocialNetwork         | Utf8View  | YES         |
| SocialAction          | Utf8View  | YES         |
| HTTPError             | Int16     | YES         |
| SendTiming            | Int32     | YES         |
| DNSTiming             | Int32     | YES         |
| ConnectTiming         | Int32     | YES         |
| ResponseStartTiming   | Int32     | YES         |
| ResponseEndTiming     | Int32     | YES         |
| FetchTiming           | Int32     | YES         |
| SocialSourceNetworkID | Int16     | YES         |
| SocialSourcePage      | Utf8View  | YES         |
| ParamPrice            | Int64     | YES         |
| ParamOrderID          | Utf8View  | YES         |
| ParamCurrency         | Utf8View  | YES         |
| ParamCurrencyID       | Int16     | YES         |
| OpenstatServiceName   | Utf8View  | YES         |
| OpenstatCampaignID    | Utf8View  | YES         |
| OpenstatAdID          | Utf8View  | YES         |
| OpenstatSourceID      | Utf8View  | YES         |
| UTMSource             | Utf8View  | YES         |
| UTMMedium             | Utf8View  | YES         |
| UTMCampaign           | Utf8View  | YES         |
| UTMContent            | Utf8View  | YES         |
| UTMTerm               | Utf8View  | YES         |
| FromTag               | Utf8View  | YES         |
| HasGCLID              | Int16     | YES         |
| RefererHash           | Int64     | YES         |
| URLHash               | Int64     | YES         |
| CLID                  | Int32     | YES         |
+-----------------------+-----------+-------------+
105 row(s) fetched.
Elapsed 0.010 seconds.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracking this in #20007

@alamb

alamb commented Jan 25, 2026

Copy link
Copy Markdown
Contributor

Filed a ticket to track updating clickbench scripts:

@alamb-ghbot

Copy link
Copy Markdown

🤖: Benchmark completed

Details

Comparing HEAD and eventdate-handling-18982
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query    ┃        HEAD ┃ eventdate-handling-18982 ┃       Change ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0 │  2442.42 ms │               2360.88 ms │    no change │
│ QQuery 1 │   920.83 ms │                949.47 ms │    no change │
│ QQuery 2 │  1869.90 ms │               1843.34 ms │    no change │
│ QQuery 3 │  1075.94 ms │               1138.86 ms │ 1.06x slower │
│ QQuery 4 │  2233.90 ms │               2230.21 ms │    no change │
│ QQuery 5 │ 28296.63 ms │              28376.96 ms │    no change │
│ QQuery 6 │  3977.22 ms │               3842.62 ms │    no change │
│ QQuery 7 │  3001.85 ms │               3081.16 ms │    no change │
└──────────┴─────────────┴──────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                       ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                       │ 43818.70ms │
│ Total Time (eventdate-handling-18982)   │ 43823.50ms │
│ Average Time (HEAD)                     │  5477.34ms │
│ Average Time (eventdate-handling-18982) │  5477.94ms │
│ Queries Faster                          │          0 │
│ Queries Slower                          │          1 │
│ Queries with No Change                  │          7 │
│ Queries with Failure                    │          0 │
└─────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃        HEAD ┃ eventdate-handling-18982 ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │     1.95 ms │                  2.57 ms │  1.32x slower │
│ QQuery 1  │    48.80 ms │                 51.47 ms │  1.05x slower │
│ QQuery 2  │   131.27 ms │                136.28 ms │     no change │
│ QQuery 3  │   148.89 ms │                161.57 ms │  1.09x slower │
│ QQuery 4  │  1077.75 ms │               1080.27 ms │     no change │
│ QQuery 5  │  1348.89 ms │               1373.85 ms │     no change │
│ QQuery 6  │     1.87 ms │                  8.27 ms │  4.42x slower │
│ QQuery 7  │    54.06 ms │                 57.08 ms │  1.06x slower │
│ QQuery 8  │  1415.34 ms │               1434.43 ms │     no change │
│ QQuery 9  │  1722.33 ms │               1756.99 ms │     no change │
│ QQuery 10 │   338.34 ms │                361.67 ms │  1.07x slower │
│ QQuery 11 │   391.17 ms │                395.99 ms │     no change │
│ QQuery 12 │  1267.51 ms │               1270.24 ms │     no change │
│ QQuery 13 │  1975.37 ms │               1961.43 ms │     no change │
│ QQuery 14 │  1222.90 ms │               1231.14 ms │     no change │
│ QQuery 15 │  1217.10 ms │               1227.43 ms │     no change │
│ QQuery 16 │  2513.18 ms │               2560.74 ms │     no change │
│ QQuery 17 │  2518.92 ms │               2486.58 ms │     no change │
│ QQuery 18 │  5832.02 ms │               4951.94 ms │ +1.18x faster │
│ QQuery 19 │   121.50 ms │                119.90 ms │     no change │
│ QQuery 20 │  1981.52 ms │               1902.92 ms │     no change │
│ QQuery 21 │  2249.47 ms │               2207.58 ms │     no change │
│ QQuery 22 │  4799.22 ms │               3765.25 ms │ +1.27x faster │
│ QQuery 23 │ 16898.30 ms │              12252.48 ms │ +1.38x faster │
│ QQuery 24 │   213.35 ms │                211.34 ms │     no change │
│ QQuery 25 │   473.29 ms │                468.19 ms │     no change │
│ QQuery 26 │   217.29 ms │                226.98 ms │     no change │
│ QQuery 27 │  2735.27 ms │               2672.41 ms │     no change │
│ QQuery 28 │ 21971.24 ms │              23364.10 ms │  1.06x slower │
│ QQuery 29 │   963.15 ms │                959.61 ms │     no change │
│ QQuery 30 │  1258.19 ms │               1256.13 ms │     no change │
│ QQuery 31 │  1340.39 ms │               1317.84 ms │     no change │
│ QQuery 32 │  4795.01 ms │               4180.91 ms │ +1.15x faster │
│ QQuery 33 │  5733.68 ms │               5439.47 ms │ +1.05x faster │
│ QQuery 34 │  6207.88 ms │               5795.35 ms │ +1.07x faster │
│ QQuery 35 │  1937.40 ms │               1881.00 ms │     no change │
│ QQuery 36 │    67.38 ms │                191.26 ms │  2.84x slower │
│ QQuery 37 │    45.52 ms │                 78.40 ms │  1.72x slower │
│ QQuery 38 │    65.18 ms │                117.49 ms │  1.80x slower │
│ QQuery 39 │   100.42 ms │                347.79 ms │  3.46x slower │
│ QQuery 40 │    25.78 ms │                 43.76 ms │  1.70x slower │
│ QQuery 41 │    23.24 ms │                 39.50 ms │  1.70x slower │
│ QQuery 42 │    19.98 ms │                 34.44 ms │  1.72x slower │
└───────────┴─────────────┴──────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                       ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                       │ 97471.31ms │
│ Total Time (eventdate-handling-18982)   │ 91384.05ms │
│ Average Time (HEAD)                     │  2266.77ms │
│ Average Time (eventdate-handling-18982) │  2125.21ms │
│ Queries Faster                          │          6 │
│ Queries Slower                          │         14 │
│ Queries with No Change                  │         23 │
│ Queries with Failure                    │          0 │
└─────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃      HEAD ┃ eventdate-handling-18982 ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │ 103.77 ms │                105.43 ms │     no change │
│ QQuery 2  │  34.24 ms │                 33.06 ms │     no change │
│ QQuery 3  │  40.52 ms │                 39.95 ms │     no change │
│ QQuery 4  │  32.41 ms │                 30.81 ms │     no change │
│ QQuery 5  │  95.54 ms │                 92.00 ms │     no change │
│ QQuery 6  │  21.10 ms │                 20.85 ms │     no change │
│ QQuery 7  │ 156.44 ms │                159.92 ms │     no change │
│ QQuery 8  │  41.70 ms │                 39.33 ms │ +1.06x faster │
│ QQuery 9  │ 106.77 ms │                102.95 ms │     no change │
│ QQuery 10 │  68.54 ms │                 68.65 ms │     no change │
│ QQuery 11 │  19.21 ms │                 18.44 ms │     no change │
│ QQuery 12 │  52.97 ms │                 52.11 ms │     no change │
│ QQuery 13 │  49.43 ms │                 49.52 ms │     no change │
│ QQuery 14 │  14.98 ms │                 15.16 ms │     no change │
│ QQuery 15 │  30.33 ms │                 30.29 ms │     no change │
│ QQuery 16 │  28.97 ms │                 28.66 ms │     no change │
│ QQuery 17 │ 145.61 ms │                145.75 ms │     no change │
│ QQuery 18 │ 281.11 ms │                280.61 ms │     no change │
│ QQuery 19 │  42.12 ms │                 39.88 ms │ +1.06x faster │
│ QQuery 20 │  56.54 ms │                 57.74 ms │     no change │
│ QQuery 21 │ 188.16 ms │                187.08 ms │     no change │
│ QQuery 22 │  22.51 ms │                 22.82 ms │     no change │
└───────────┴───────────┴──────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                       ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                       │ 1632.94ms │
│ Total Time (eventdate-handling-18982)   │ 1620.99ms │
│ Average Time (HEAD)                     │   74.22ms │
│ Average Time (eventdate-handling-18982) │   73.68ms │
│ Queries Faster                          │         2 │
│ Queries Slower                          │         0 │
│ Queries with No Change                  │        20 │
│ Queries with Failure                    │         0 │
└─────────────────────────────────────────┴───────────┘

@alamb-ghbot

Copy link
Copy Markdown

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing eventdate-handling-18982 (673fad3) to eadbed5 diff using: clickbench_partitioned
Results will be posted here when complete

@alamb-ghbot

Copy link
Copy Markdown

🤖: Benchmark completed

Details

Comparing HEAD and eventdate-handling-18982
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃        HEAD ┃ eventdate-handling-18982 ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │     1.96 ms │                  2.72 ms │  1.39x slower │
│ QQuery 1  │    50.02 ms │                 53.33 ms │  1.07x slower │
│ QQuery 2  │   129.16 ms │                132.97 ms │     no change │
│ QQuery 3  │   156.28 ms │                156.47 ms │     no change │
│ QQuery 4  │  1111.41 ms │               1077.15 ms │     no change │
│ QQuery 5  │  1391.85 ms │               1381.75 ms │     no change │
│ QQuery 6  │     1.92 ms │                  8.79 ms │  4.59x slower │
│ QQuery 7  │    54.04 ms │                 55.64 ms │     no change │
│ QQuery 8  │  1449.45 ms │               1459.24 ms │     no change │
│ QQuery 9  │  1756.14 ms │               1797.21 ms │     no change │
│ QQuery 10 │   345.21 ms │                360.57 ms │     no change │
│ QQuery 11 │   396.34 ms │                407.30 ms │     no change │
│ QQuery 12 │  1282.46 ms │               1301.36 ms │     no change │
│ QQuery 13 │  1960.42 ms │               1976.79 ms │     no change │
│ QQuery 14 │  1240.77 ms │               1270.55 ms │     no change │
│ QQuery 15 │  1260.86 ms │               1266.89 ms │     no change │
│ QQuery 16 │  2537.42 ms │               2553.88 ms │     no change │
│ QQuery 17 │  2531.79 ms │               2527.57 ms │     no change │
│ QQuery 18 │  5168.80 ms │               4950.59 ms │     no change │
│ QQuery 19 │   122.33 ms │                121.87 ms │     no change │
│ QQuery 20 │  1918.85 ms │               1886.10 ms │     no change │
│ QQuery 21 │  2197.72 ms │               2155.07 ms │     no change │
│ QQuery 22 │  3711.64 ms │               3802.74 ms │     no change │
│ QQuery 23 │ 15029.95 ms │              12085.06 ms │ +1.24x faster │
│ QQuery 24 │   214.20 ms │                214.13 ms │     no change │
│ QQuery 25 │   473.46 ms │                467.33 ms │     no change │
│ QQuery 26 │   208.46 ms │                205.10 ms │     no change │
│ QQuery 27 │  2694.42 ms │               2673.56 ms │     no change │
│ QQuery 28 │ 21933.66 ms │              23219.59 ms │  1.06x slower │
│ QQuery 29 │   960.55 ms │                982.40 ms │     no change │
│ QQuery 30 │  1289.32 ms │               1269.53 ms │     no change │
│ QQuery 31 │  1364.92 ms │               1310.12 ms │     no change │
│ QQuery 32 │  4541.82 ms │               4174.19 ms │ +1.09x faster │
│ QQuery 33 │  5871.96 ms │               5444.97 ms │ +1.08x faster │
│ QQuery 34 │  5929.37 ms │               5637.21 ms │     no change │
│ QQuery 35 │  1952.50 ms │               1876.86 ms │     no change │
│ QQuery 36 │    71.63 ms │                188.21 ms │  2.63x slower │
│ QQuery 37 │    45.87 ms │                 75.74 ms │  1.65x slower │
│ QQuery 38 │    68.46 ms │                116.42 ms │  1.70x slower │
│ QQuery 39 │   103.53 ms │                340.03 ms │  3.28x slower │
│ QQuery 40 │    26.69 ms │                 43.54 ms │  1.63x slower │
│ QQuery 41 │    22.31 ms │                 38.35 ms │  1.72x slower │
│ QQuery 42 │    20.61 ms │                 34.47 ms │  1.67x slower │
└───────────┴─────────────┴──────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                       ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                       │ 93600.52ms │
│ Total Time (eventdate-handling-18982)   │ 91103.37ms │
│ Average Time (HEAD)                     │  2176.76ms │
│ Average Time (eventdate-handling-18982) │  2118.68ms │
│ Queries Faster                          │          3 │
│ Queries Slower                          │         11 │
│ Queries with No Change                  │         29 │
│ Queries with Failure                    │          0 │
└─────────────────────────────────────────┴────────────┘

@alamb

alamb commented Jan 25, 2026

Copy link
Copy Markdown
Contributor

🤔 I think we need to look into why this is slower....

@alamb alamb dismissed their stale review January 25, 2026 15:31

Performance benchmarks

@alamb

alamb commented Jan 25, 2026

Copy link
Copy Markdown
Contributor

run benchmarks

@alamb-ghbot

Copy link
Copy Markdown

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing eventdate-handling-18982 (673fad3) to eadbed5 diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@nuno-faria

Copy link
Copy Markdown
Contributor

🤔 I think we need to look into why this is slower....

I think the aggregations were not being triggered before, since the filter did not match any rows.

@alamb-ghbot

Copy link
Copy Markdown

🤖: Benchmark completed

Details

Comparing HEAD and eventdate-handling-18982
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query    ┃        HEAD ┃ eventdate-handling-18982 ┃    Change ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0 │  2428.87 ms │               2319.15 ms │ no change │
│ QQuery 1 │   922.34 ms │                959.59 ms │ no change │
│ QQuery 2 │  1821.40 ms │               1856.73 ms │ no change │
│ QQuery 3 │  1124.19 ms │               1118.62 ms │ no change │
│ QQuery 4 │  2208.08 ms │               2210.47 ms │ no change │
│ QQuery 5 │ 28235.08 ms │              28126.30 ms │ no change │
│ QQuery 6 │  4012.33 ms │               3895.99 ms │ no change │
│ QQuery 7 │  2862.17 ms │               2794.47 ms │ no change │
└──────────┴─────────────┴──────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                       ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                       │ 43614.46ms │
│ Total Time (eventdate-handling-18982)   │ 43281.33ms │
│ Average Time (HEAD)                     │  5451.81ms │
│ Average Time (eventdate-handling-18982) │  5410.17ms │
│ Queries Faster                          │          0 │
│ Queries Slower                          │          0 │
│ Queries with No Change                  │          8 │
│ Queries with Failure                    │          0 │
└─────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃        HEAD ┃ eventdate-handling-18982 ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │     1.91 ms │                  2.62 ms │  1.37x slower │
│ QQuery 1  │    50.54 ms │                 52.29 ms │     no change │
│ QQuery 2  │   132.44 ms │                137.70 ms │     no change │
│ QQuery 3  │   153.53 ms │                156.35 ms │     no change │
│ QQuery 4  │  1042.43 ms │               1100.51 ms │  1.06x slower │
│ QQuery 5  │  1340.16 ms │               1351.90 ms │     no change │
│ QQuery 6  │     1.89 ms │                  8.18 ms │  4.32x slower │
│ QQuery 7  │    54.23 ms │                 56.55 ms │     no change │
│ QQuery 8  │  1394.33 ms │               1461.30 ms │     no change │
│ QQuery 9  │  1690.38 ms │               1800.96 ms │  1.07x slower │
│ QQuery 10 │   345.32 ms │                358.34 ms │     no change │
│ QQuery 11 │   399.74 ms │                410.15 ms │     no change │
│ QQuery 12 │  1240.26 ms │               1264.48 ms │     no change │
│ QQuery 13 │  1929.54 ms │               1972.70 ms │     no change │
│ QQuery 14 │  1221.06 ms │               1258.73 ms │     no change │
│ QQuery 15 │  1192.83 ms │               1243.99 ms │     no change │
│ QQuery 16 │  2504.24 ms │               2538.10 ms │     no change │
│ QQuery 17 │  2469.17 ms │               2541.39 ms │     no change │
│ QQuery 18 │  5130.96 ms │               4847.94 ms │ +1.06x faster │
│ QQuery 19 │   122.86 ms │                121.96 ms │     no change │
│ QQuery 20 │  1893.48 ms │               1882.28 ms │     no change │
│ QQuery 21 │  2178.58 ms │               2201.09 ms │     no change │
│ QQuery 22 │  3704.18 ms │               3749.64 ms │     no change │
│ QQuery 23 │ 12120.87 ms │              12236.95 ms │     no change │
│ QQuery 24 │   218.42 ms │                219.88 ms │     no change │
│ QQuery 25 │   465.14 ms │                467.48 ms │     no change │
│ QQuery 26 │   205.01 ms │                218.69 ms │  1.07x slower │
│ QQuery 27 │  2696.05 ms │               2744.87 ms │     no change │
│ QQuery 28 │ 21738.62 ms │              23131.02 ms │  1.06x slower │
│ QQuery 29 │   956.90 ms │                968.34 ms │     no change │
│ QQuery 30 │  1244.21 ms │               1270.41 ms │     no change │
│ QQuery 31 │  1313.31 ms │               1319.60 ms │     no change │
│ QQuery 32 │  4166.11 ms │               4546.26 ms │  1.09x slower │
│ QQuery 33 │  5285.54 ms │               5354.80 ms │     no change │
│ QQuery 34 │  5753.60 ms │               5869.26 ms │     no change │
│ QQuery 35 │  1909.26 ms │               1911.21 ms │     no change │
│ QQuery 36 │    66.98 ms │                186.84 ms │  2.79x slower │
│ QQuery 37 │    46.45 ms │                 75.08 ms │  1.62x slower │
│ QQuery 38 │    65.69 ms │                116.57 ms │  1.77x slower │
│ QQuery 39 │   102.70 ms │                352.61 ms │  3.43x slower │
│ QQuery 40 │    27.79 ms │                 45.44 ms │  1.63x slower │
│ QQuery 41 │    22.89 ms │                 38.43 ms │  1.68x slower │
│ QQuery 42 │    19.69 ms │                 34.39 ms │  1.75x slower │
└───────────┴─────────────┴──────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                       ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                       │ 88619.30ms │
│ Total Time (eventdate-handling-18982)   │ 91627.29ms │
│ Average Time (HEAD)                     │  2060.91ms │
│ Average Time (eventdate-handling-18982) │  2130.87ms │
│ Queries Faster                          │          1 │
│ Queries Slower                          │         14 │
│ Queries with No Change                  │         28 │
│ Queries with Failure                    │          0 │
└─────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query     ┃      HEAD ┃ eventdate-handling-18982 ┃       Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1  │ 101.42 ms │                110.15 ms │ 1.09x slower │
│ QQuery 2  │  32.94 ms │                 33.25 ms │    no change │
│ QQuery 3  │  36.47 ms │                 40.18 ms │ 1.10x slower │
│ QQuery 4  │  32.07 ms │                 30.81 ms │    no change │
│ QQuery 5  │  93.80 ms │                 91.88 ms │    no change │
│ QQuery 6  │  20.88 ms │                 20.95 ms │    no change │
│ QQuery 7  │ 164.10 ms │                161.70 ms │    no change │
│ QQuery 8  │  40.74 ms │                 41.96 ms │    no change │
│ QQuery 9  │ 102.77 ms │                104.65 ms │    no change │
│ QQuery 10 │  70.87 ms │                 69.11 ms │    no change │
│ QQuery 11 │  18.82 ms │                 18.36 ms │    no change │
│ QQuery 12 │  52.21 ms │                 51.11 ms │    no change │
│ QQuery 13 │  48.75 ms │                 48.94 ms │    no change │
│ QQuery 14 │  15.16 ms │                 15.23 ms │    no change │
│ QQuery 15 │  30.73 ms │                 30.85 ms │    no change │
│ QQuery 16 │  29.65 ms │                 29.23 ms │    no change │
│ QQuery 17 │ 145.09 ms │                142.56 ms │    no change │
│ QQuery 18 │ 282.25 ms │                286.54 ms │    no change │
│ QQuery 19 │  41.57 ms │                 39.77 ms │    no change │
│ QQuery 20 │  58.67 ms │                 56.33 ms │    no change │
│ QQuery 21 │ 193.29 ms │                191.72 ms │    no change │
│ QQuery 22 │  22.77 ms │                 22.11 ms │    no change │
└───────────┴───────────┴──────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                       ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                       │ 1635.05ms │
│ Total Time (eventdate-handling-18982)   │ 1637.39ms │
│ Average Time (HEAD)                     │   74.32ms │
│ Average Time (eventdate-handling-18982) │   74.43ms │
│ Queries Faster                          │         0 │
│ Queries Slower                          │         2 │
│ Queries with No Change                  │        20 │
│ Queries with Failure                    │         0 │
└─────────────────────────────────────────┴───────────┘

@alamb

alamb commented Jan 25, 2026

Copy link
Copy Markdown
Contributor

🤔 I think we need to look into why this is slower....

I think the aggregations were not being triggered before, since the filter did not match any rows.

So that probably means the results are now correct because the queries are doing more work?

Did we verify the new results are now correct by comparing with some other implementation (e.g. DuckDB?)

@nuno-faria

Copy link
Copy Markdown
Contributor

Did we verify the new results are now correct by comparing with some other implementation (e.g. DuckDB?)

I checked and the results now match with DuckDB.

@alamb alamb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @kosiew and @nuno-faria

FYI @Dandandan and @waynexia -- this change will slow down some queries (though now the answers are correct!)

@kosiew kosiew added this pull request to the merge queue Jan 27, 2026
Merged via the queue into apache:main with commit 5c47d6d Jan 27, 2026
29 checks passed
@kosiew

kosiew commented Jan 27, 2026

Copy link
Copy Markdown
Contributor Author

de-bgunter pushed a commit to de-bgunter/datafusion that referenced this pull request Mar 24, 2026
…to DATE via `hits` view (apache#19881)

## Which issue does this PR close?

* Closes apache#18982.

---

## Rationale for this change

ClickBench encodes `EventDate` as a `UInt16` representing **days since
1970-01-01**. When DataFusion registers the ClickBench parquet file
directly as `hits`, `EventDate` ends up being compared as a string in
some queries (notably ClickBench queries 36–42), which causes the date
range predicates to filter out all rows.

To make ClickBench queries behave as authored (and align with how other
engines handle the dataset), we expose `hits` as a view that converts
the raw `UInt16` encoding into a proper SQL `DATE`.

---

## What changes are included in this PR?

* Register the underlying parquet table as **`hits_raw`** instead of
`hits`.
* Add a constant **`HITS_VIEW_DDL`** that defines a `hits` view which:

  * Removes the original `EventDate` column, and
* Re-introduces it as `DATE` using `CAST(CAST("EventDate" AS INTEGER) AS
DATE)`.
* Factor view creation into a helper method (`create_hits_view`) and add
error context for easier debugging.
* Update the ClickBench sqllogictest file to:

  * Create `hits_raw` + `hits` view,
* Add explicit assertions validating the transformation (`15901` ↔
`2013-07-15`),
  * Update expected result types where `EventDate` is now a `DATE`, and
  * Drop the view before dropping the raw table.

---

## Are these changes tested?

Yes.

* Updated `datafusion/sqllogictest/test_files/clickbench.slt` to cover:

  * Correct `EventDate` decoding in the `hits` view (returns `DATE`),
  * Raw `hits_raw.EventDate` remains the original integer encoding, and
* Existing ClickBench queries that rely on date predicates (including
the previously failing range-filter queries) now execute with the
correct types.

Script to test
[q36-q42](apache#18982).

`benchmarks/run_q36_q42.sh`
```bash
#!/usr/bin/env bash
# Script to run ClickBench queries 36-42 and display results

set -e

SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
BENCHMARK=${1:-"clickbench_1"}
OUTPUT_FILE="${2:-results_q36_q42.txt}"

echo "=========================================="
echo "Running ClickBench Queries 36-42"
echo "=========================================="
echo "Benchmark: $BENCHMARK"
echo "Output file: $OUTPUT_FILE"
echo ""

# Create results file
> "$OUTPUT_FILE"

# Run queries 36-42
for q in {36..42}; do
    echo "Running Query $q..."
    
    # Run the query and extract relevant info
    output=$($SCRIPT_DIR/bench.sh run $BENCHMARK $q 2>&1)
    
    # Extract timing and row count from the first iteration
    iteration_0=$(echo "$output" | grep "Query $q iteration 0" | head -1)
    avg_time=$(echo "$output" | grep "Query $q avg time" | head -1)
    
    echo "Q$q: $iteration_0" | tee -a "$OUTPUT_FILE"
    echo "     $avg_time" | tee -a "$OUTPUT_FILE"
    echo "" | tee -a "$OUTPUT_FILE"
done

echo "=========================================="
echo "Summary saved to: $OUTPUT_FILE"
echo "=========================================="
cat "$OUTPUT_FILE"
```

Run results on this branch:
```
Q36: Query 36 iteration 0 took 138.1 ms and returned 10 rows
     Query 36 avg time: 116.19 ms

Q37: Query 37 iteration 0 took 66.4 ms and returned 10 rows
     Query 37 avg time: 50.57 ms

Q38: Query 38 iteration 0 took 98.9 ms and returned 10 rows
     Query 38 avg time: 83.20 ms

Q39: Query 39 iteration 0 took 237.3 ms and returned 10 rows
     Query 39 avg time: 223.62 ms

Q40: Query 40 iteration 0 took 40.6 ms and returned 10 rows
     Query 40 avg time: 24.43 ms

Q41: Query 41 iteration 0 took 36.6 ms and returned 10 rows
     Query 41 avg time: 22.90 ms

Q42: Query 42 iteration 0 took 34.0 ms and returned 10 rows
     Query 42 avg time: 20.17 ms
```

On `main` branch, the queries return 0 rows.

---

## Are there any user-facing changes?

Yes (benchmark/test behavior):

* In ClickBench runs, the logical table name `hits` continues to exist,
but it is now a **view** that exposes `EventDate` as a proper `DATE`
rather than the raw `UInt16` encoding.
* This fixes ClickBench queries 36–42 so they return rows without
requiring manual casts in the SQL.

No public API changes.

---

## LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated
content has been manually reviewed and tested.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Potential invalid ClickBench results

4 participants