Unify arrow exports across all query result types by evertlammerts · Pull Request #495 · duckdb/duckdb-python · GitHub

Unify arrow exports across all query result types#495

Merged
evertlammerts merged 5 commits into duckdb:v1.5-variegata from evertlammerts:feat/arrow-promote-to-relation
Jun 16, 2026

evertlammerts (Member) commented Jun 12, 2026

This PR unifies arrow exports across query result types, and makes sure we always provide the schema from within a transaction.

We are dealing with 3 arrow export types:

  • Arrow Table
  • Arrow RecordBatch
  • Arrow C Stream

... across 3 result types:

  • MaterializedQueryResult
  • ArrowQueryResult
  • StreamingQueryResult

The MaterializedQueryResult paths are now unified. We re-feed the backing ColumnDataCollection to the engine for parallel conversion into an ArrowQueryResult, and then delegate to the corresponding ArrowQueryResult path.

The ArrowQueryResult paths deal with materialized data already, and we have no way to plug into the transaction that generated it. The actual fix for this is to cache the schema when creating the ArrowQueryResult, during Finalize. This is a core change that we will probably apply in v2.0. The workaround is to fetch the schema in a separate transaction. For all paths, since we are already dealing with materialized data, we create an arrow table. Then for the streaming paths we return the corresponding stream types directly from the table.

The StreamingQueryResult paths always have access to a valid transaction context, and can get the arrow schema on demand even when that requires catalog access.

As a side effect of this PR, consuming an arrow c stream (reading from con.sql(q).__arrow_c_stream__()) is now lazy, i.e. not materialized. This makes consumption slower (48 → 173 ms wall time in the benchmark table), but allows streaming much larger datasets (memory drops from 857 to 554 MB).

Overall, the materialized paths are a little faster, as are the streaming paths other than the c stream.

  ┌───────────────────────────────────────────────────┬────────────────────┬───────────────────┬───────────────────┐
  │               benchmark expression                │ wall base→now (ms) │ CPU base→now (ms) │ mem base→now (MB) │
  ├───────────────────────────────────────────────────┼────────────────────┼───────────────────┼───────────────────┤
  │ r=con.sql(q); r.execute(); r.to_arrow_table()     │ 159 → 161          │ 259 → 286         │ 847 → 875         │
  ├───────────────────────────────────────────────────┼────────────────────┼───────────────────┼───────────────────┤
  │ r=con.sql(q); r.execute(); r.to_arrow_reader()    │ 161 → 144          │ 255 → 263         │ 896 → 877         │
  ├───────────────────────────────────────────────────┼────────────────────┼───────────────────┼───────────────────┤
  │ r=con.sql(q); r.execute(); r.__arrow_c_stream__() │ 157 → 136          │ 282 → 235         │ 854 → 881         │
  ├───────────────────────────────────────────────────┼────────────────────┼───────────────────┼───────────────────┤
  │ con.sql(q).to_arrow_table()                       │ 52 → 35            │ 267 → 244         │ 855 → 854         │
  ├───────────────────────────────────────────────────┼────────────────────┼───────────────────┼───────────────────┤
  │ con.execute(q).to_arrow_table()                   │ 202 → 174          │ 212 → 193         │ 548 → 554         │
  ├───────────────────────────────────────────────────┼────────────────────┼───────────────────┼───────────────────┤
  │ con.sql(q).to_arrow_reader()                      │ 186 → 175          │ 199 → 187         │ 552 → 552         │
  ├───────────────────────────────────────────────────┼────────────────────┼───────────────────┼───────────────────┤
  │ con.sql(q).__arrow_c_stream__()                   │ 48 → 173           │ 250 → 189         │ 857 → 554         │
  └───────────────────────────────────────────────────┴────────────────────┴───────────────────┴───────────────────┘
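To make the trade-off in the table easier to read, the relative changes can be computed directly from the reported numbers. The sketch below copies the wall-time and memory values for the three `con.sql(q)` rows; the helper `pct_change` is a hypothetical name introduced only for this illustration.

```python
def pct_change(base: float, now: float) -> float:
    """Relative change from base to now, in percent (negative means lower)."""
    return (now - base) / base * 100.0


# (wall ms base, wall ms now, mem MB base, mem MB now), copied from the table.
rows = {
    "con.sql(q).to_arrow_table()": (52, 35, 855, 854),
    "con.sql(q).to_arrow_reader()": (186, 175, 552, 552),
    "con.sql(q).__arrow_c_stream__()": (48, 173, 857, 554),
}

for expr, (wall_base, wall_now, mem_base, mem_now) in rows.items():
    wall = pct_change(wall_base, wall_now)
    mem = pct_change(mem_base, mem_now)
    print(f"{expr:<34} wall {wall:+7.1f}%  mem {mem:+6.1f}%")
```

On these numbers the lazy c stream path takes roughly 3.6 times the wall time of the previous materialized behavior while using about 35% less memory.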

Fixes #475

evertlammerts changed the title Pull materialized CDCs through the engine again for arrow conversion … Pull materialized CDCs through the engine again for arrow conversion with a live connection / transaction Jun 12, 2026
evertlammerts marked this pull request as ready for review June 12, 2026 21:42
evertlammerts changed the title Pull materialized CDCs through the engine again for arrow conversion with a live connection / transaction Unify arrow exports across all query result types Jun 16, 2026
evertlammerts merged commit 6ac2daa into duckdb:v1.5-variegata Jun 16, 2026
15 checks passed
evertlammerts deleted the feat/arrow-promote-to-relation branch June 17, 2026 13:19
evertlammerts added a commit that referenced this pull request Jun 26, 2026
Periodic forward-merge of release-branch bugfixes into main. Notably brings
in #495 "Unify arrow exports across all query result types" (the materialized
slow-path lifetime / connection-GC fix and the test_arrow_refeed suite),
replacing main's older SchemaCachingStreamWrapper/ArrowQueryResultStreamWrapper
approach.

Submodule: external/duckdb is kept at main's pin 0361de441a (v1.5's submodule
bumps discarded; git fast-forwarded the gitlink to main's newer pin).

Conflict resolution:
- .github/workflows/packaging_wheels.yml: applied both intents — v1.5's
  windows-2025 -> windows-2022 (consistent with targeted_test.yml) and main's
  ARM64-comment removal.

Adaptation for main's newer core:
- pyresult.cpp: core's ColumnDataRef now takes vector<Identifier> (not
  vector<string>); promote the deduplicated scan names to Identifiers
  explicitly in MakeColumnDataScanStatement.

Verified: clean build; tests/fast/arrow + tests/fast/udf = 2436 passed,
0 failed (incl. test_capsule_slow_path_survives_connection_gc and the new
test_arrow_refeed suite).