Extend function tuple to return named tuple and add function tupleNames by amosbird · Pull Request #54881 · ClickHouse/ClickHouse · GitHub
Skip to content

Extend function tuple to return named tuple and add function tupleNames#54881

Merged
vitlibar merged 2 commits into
ClickHouse:masterfrom
amosbird:named-tuple
Jul 9, 2024
Merged

Extend function tuple to return named tuple and add function tupleNames#54881
vitlibar merged 2 commits into
ClickHouse:masterfrom
amosbird:named-tuple

Conversation

@amosbird

@amosbird amosbird commented Sep 21, 2023

Copy link
Copy Markdown
Collaborator

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Extend function tuple to construct named tuples in query. Introduce function tupleNames to extract names from tuples.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-feature Pull request with new product feature label Sep 21, 2023
@robot-ch-test-poll1

robot-ch-test-poll1 commented Sep 21, 2023

Copy link
Copy Markdown
Contributor

@amosbird amosbird mentioned this pull request Sep 21, 2023
@Avogar Avogar self-assigned this Sep 21, 2023
@alexey-milovidov

Copy link
Copy Markdown
Member

Interesting - I needed this feature just yesterday!

One thought - maybe we should make the tuple function to construct a named tuple instead of introducing another function?

But this is only if named tuples support both the .N and .name access.

@amosbird

Copy link
Copy Markdown
Collaborator Author

But this is only if named tuples support both the .N and .name access.

Yeah. It's reasonable to have tuple return named tuples only if we have full-fledged support. Once it's done, we can alias it to namedTuple.

@alexey-milovidov

Copy link
Copy Markdown
Member

It works:

milovidov-desktop :) SET allow_experimental_analyzer = 1

SET allow_experimental_analyzer = 1

Query id: 8eded140-2f65-4407-a926-93f869739640

Ok.

0 rows in set. Elapsed: 0.000 sec. 

milovidov-desktop :) SELECT CAST(('Hello', 'world') AS Tuple(x String, y String)) AS t, t.1, t.2, t.x, t.y

SELECT
    CAST(('Hello', 'world'), 'Tuple(x String, y String)') AS t,
    t.1,
    t.2,
    t.x,
    t.y

Query id: a26093cb-5b5c-4839-9212-b4f203f68437

┌─t─────────────────┬─tupleElement(t, 1)─┬─tupleElement(t, 2)─┬─t.x───┬─t.y───┐
│ ('Hello','world') │ Hello              │ world              │ Hello │ world │
└───────────────────┴────────────────────┴────────────────────┴───────┴───────┘

1 row in set. Elapsed: 0.011 sec.

Let's anticipate the analyzer enablement, and avoid adding a separate namedTuple function.

@Avogar Avogar removed their assignment Jan 29, 2024
@vpanfilov

Copy link
Copy Markdown

Casting to named tuples has been actually working without new analyzer for a long time, there is a test for that: https://github.com/ClickHouse/ClickHouse/blob/master/tests/queries/0_stateless/00547_named_tuples.sql.

But it is really inconvenient to hardcode datatypes during casting. So namedTuple function or new tuple constuctor will be really useful.

@alexey-milovidov

Copy link
Copy Markdown
Member

One thought - maybe we should make the tuple function to construct a named tuple instead of introducing another function?

But this is only if named tuples support both the .N and .name access.

I'd prefer this way of implementation.

@alexey-milovidov alexey-milovidov added the close in a month if not active This will be closed in case of no information label Apr 23, 2024
@amosbird amosbird changed the title Add function namedTuple and tupleNames Extend function tuple to return named tuple and add function tupleNames Apr 23, 2024
@amosbird

Copy link
Copy Markdown
Collaborator Author

I'd prefer this way of implementation.

It breaks INSERT SELECT query when trying to insert into predefined named tuples from constructed named tuple with different tuple column names. See 01533_multiple_nested, 02475_bson_each_row_format, etc.

It breaks geometry type construction when using named tuple. See 01306_polygons_intersection.

It breaks JOIN analysis because we use tuple types to compare left and right columns. See 02861_join_on_nullsafe_compare, 02911_join_on_nullsafe_optimization, etc.

These are found by fast tests. There might be other issues.

@vitlibar vitlibar self-assigned this Apr 25, 2024
Comment thread src/Functions/tuple.cpp Outdated
Returns a tuple by grouping input arguments.

For columns C1, C2, … with the types T1, T2, …, it returns a Tuple(C1 T1, C2 T2, …) type tuple containing these columns. There is no cost to execute the function.
If there are duplicate column names, subsequent names will be suffixed with `_N` starting from zero. For example, `tuple(C1, C1)` will return a Tuple(C1 T1, C1_0 T1).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are duplicate column names, subsequent names will be suffixed with _N starting from zero. For example, tuple(C1, C1) will return a Tuple(C1 T1, C1_0 T1).

This doesn't sound very good. I mean such name would look quite random if that element of a tuple is not just a column in some table. I think maybe it's better instead to give the ability to specify names explicitly:

tuple(1 AS x, 2 AS y)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be more ergonomic to allow named tuples from tuple(x, y) instead of tuple(x AS x, y AS y).

About identical column names - what about not assigning any names when there is ambiguity?

@vitlibar vitlibar May 13, 2024

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be more ergonomic to allow named tuples from tuple(x, y) instead of tuple(x AS x, y AS y)

I agree, but it seems not very useful to generate names like plus(number, 1) for the second element of tuple tuple(number, number + 1).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And also for SELECT tuple(3, 2, 1) AS t, t.1 there is ambiguity: should t.1 mean 1 or 3?

@vitlibar vitlibar May 13, 2024

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about not assigning any names when there is ambiguity?

But we can't assign some names of a tuple and not assign the other names of the same tuple.

Maybe it makes sense to assign names like element_1, element_2, ... and so on for each element of a tuple which is not a literal name, or if that element is a duplicate of another element. There would be still ambiguity with SELECT tuple(number, number + 1 AS element_1) AS t, t.element_1 (because both number and number + 1 should be element_1) but this case is tricky enough to just throw an exception that we can't deduct tuple names.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if a column is named as 1?

tuple(`1`, `2`)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amosbird, the tuple will be unnamed because 1 does not work as unquoted identifier.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Numbers are dangerous as the names of tuple's elements because there could be confusion with accessing by index:

> SELECT CAST(('Hello', 'world') AS Tuple(`2` String, `1` String)) AS t, t.1, t.2, t.`1`, t.`2`

   ┌─t─────────────────┬─tupleElement(t, 1)─┬─tupleElement(t, 2)─┬─t.1───┬─t.2───┐
1. │ ('Hello','world') │ Hello              │ world              │ world │ Hello │
   └───────────────────┴────────────────────┴────────────────────┴───────┴───────┘

So let's at least not make such named tuples implicitly.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess

tuple(a AS `42`, b)

will also be unnamed then.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

@alexey-milovidov

Copy link
Copy Markdown
Member

@amosbird We can make it under a setting, enable it by default, and put it under the compatibility setting.

@amosbird

Copy link
Copy Markdown
Collaborator Author

We can make it under a setting, enable it by default, and put it under the compatibility setting.

Sure. I believe I've covered everything. Let's see if isUnquotedIdentifier can help resolve those CI issues.

@robot-ch-test-poll robot-ch-test-poll added the submodule changed At least one submodule changed in this PR. label Jun 15, 2024
@amosbird

Copy link
Copy Markdown
Collaborator Author

contrib/orc submodule is changed (fixed) intentionally.

@amosbird amosbird force-pushed the named-tuple branch 2 times, most recently from 8360f1f to 012161b Compare June 15, 2024 08:55
@amosbird

Copy link
Copy Markdown
Collaborator Author

Stateless tests (release, old analyzer, s3, DatabaseReplicated) [2/4]

03036_dynamic_read_subcolumns #65319

Also need to sync with private repo.

Other than these two issues, the CI is green.

@vitlibar Could you please help take another look?

@amosbird

Copy link
Copy Markdown
Collaborator Author

Stateless tests (ubsan) [2/2] — fail: 1, passed: 3322, skipped: 30

01158_zookeeper_log_long #64790

@alexey-milovidov alexey-milovidov removed the close in a month if not active This will be closed in case of no information label Jun 18, 2024
@alexey-milovidov

Copy link
Copy Markdown
Member

Ok. It is almost ready to merge, but why the ORC submodule was updated? Let's reset it.

Merged via the queue into ClickHouse:master with commit 34516c1 Jul 9, 2024
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jul 9, 2024
baibaichen added a commit to Kyligence/gluten that referenced this pull request Jul 10, 2024
baibaichen added a commit to apache/gluten that referenced this pull request Jul 10, 2024
* [GLUTEN-1632][CH]Daily Update Clickhouse Version (20240710)

* Fix UT due to ClickHouse/ClickHouse#54881

* style

---------

Co-authored-by: kyligence-git <gluten@kyligence.io>
Co-authored-by: Chang Chen <baibaichen@gmail.com>
@Avogar

Avogar commented Jul 17, 2024

Copy link
Copy Markdown
Member

Seems like it breaks compatibility for case tuple(NULL):
#66636

Before:

:) select toTypeName(tuple(NULL))

┌─toTypeName(tuple(NULL))──┐
│ Tuple(Nullable(Nothing)) │
└──────────────────────────┘

Now:

:) select toTypeName(tuple(NULL))

┌─toTypeName((NULL))──────────────┐
│ Tuple(`NULL` Nullable(Nothing)) │
└─────────────────────────────────┘

@amosbird

Copy link
Copy Markdown
Collaborator Author

Yes. The isUnquotedIdentifier function needs enhancement to exclude NULL. I suppose that's the only exception.

@Avogar

Avogar commented Jul 17, 2024

Copy link
Copy Markdown
Member

Can you create a fix for it?

@amosbird

Copy link
Copy Markdown
Collaborator Author

Sure. Will do it now. And we also need to handle true/false.

max-vostrikov added a commit that referenced this pull request Jul 25, 2024
…ble_named_columns_in_function_tuple=1 (default value)
max-vostrikov added a commit that referenced this pull request Jul 25, 2024
…when enable_named_columns_in_function_tuple=1 (default value)
max-vostrikov added a commit that referenced this pull request Jul 26, 2024
…when enable_named_columns_in_function_tuple=1 (default value)
max-vostrikov added a commit that referenced this pull request Jul 26, 2024
…when enable_named_columns_in_function_tuple=1 (default value)
max-vostrikov added a commit that referenced this pull request Jul 29, 2024
…when enable_named_columns_in_function_tuple=1 (default value)
max-vostrikov added a commit that referenced this pull request Jul 29, 2024
…when enable_named_columns_in_function_tuple=1 (default value)
max-vostrikov added a commit that referenced this pull request Jul 30, 2024
…when enable_named_columns_in_function_tuple=1 (default value)
github-merge-queue Bot pushed a commit that referenced this pull request Aug 1, 2024
…ction_tuple_tests

Added some tests in relation with #54881
@Hubbitus

Copy link
Copy Markdown

Hello.
Thank you very much for this great addition. But could you please say from which version of ClickHouse it is present?

@max-vostrikov

Copy link
Copy Markdown
Contributor

@Hubbitus it is in 24.7

@clickgapai

Copy link
Copy Markdown
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-feature Pull request with new product feature pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.