For input formats with column names, search columns case-insensitively if there is no ambiguity by Manerone · Pull Request #99346 · ClickHouse/ClickHouse · GitHub
Skip to content

For input formats with column names, search columns case-insensitively if there is no ambiguity#99346

Merged
Avogar merged 65 commits into
ClickHouse:masterfrom
Manerone:matheus-92851
Apr 21, 2026
Merged

For input formats with column names, search columns case-insensitively if there is no ambiguity#99346
Avogar merged 65 commits into
ClickHouse:masterfrom
Manerone:matheus-92851

Conversation

@Manerone

@Manerone Manerone commented Mar 12, 2026

Copy link
Copy Markdown
Contributor

Changelog category

  • Improvement

Changelog entry

Adds a new setting input_format_column_name_matching_mode which allows different case sensitivities for input formats.

Documentation entry for user-facing changes

The possible options for the new setting are:

  • match_case: will do a case-sensitive search of the input columns.
  • ignore_case: will do a case-insensitive search of the input columns.
  • auto: will first try to match case-sensitive, if it fails tries to match case-insensitive.

NOTE: case-insensitive searches can have ambiguity (e.g. Name and namE are ambiguous). If any ambiguity is detected, an exception will be thrown.

Supported formats:

  • JSONEachRow
  • CSVWithNames
  • JSONColumns
  • BSONEachRow
  • RowBinaryWithNames

Examples

Match Case

CREATE TABLE test (id Int, name String);
SET input_format_column_name_matching_mode = 'match_case'
INSERT INTO test FORMAT JSONEachRow {"id": 0, "NAME": "aaa"};

SELECT * FROM test
┌─id─┬─name─┐
│  0 │      │
└────┴──────┘

Ignore Case

CREATE TABLE test (id Int, name String);
SET input_format_column_name_matching_mode = 'ignore_case'
INSERT INTO test FORMAT JSONEachRow {"id": 0, "NAME": "aaa"};

SELECT * FROM test
┌─id─┬─name─┐
│  0 │ aaa  │
└────┴──────┘

-- Ambiguity
CREATE TABLE test2 (id Int, ID Int);
SET input_format_column_name_matching_mode = 'ignore_case'
INSERT INTO test FORMAT JSONEachRow {"id": 0}

DB::Exception: Ambiguous field (`id`) when processing data: (at row 1)

-- Ambiguity is only detected if the columns are referenced
CREATE TABLE test(id Int, ID Int, name String)
SET input_format_column_name_matching_mode = 'ignore_case'
INSERT INTO test FORMAT JSONEachRow {"name": "aa"}

SELECT * FROM test

┌─id─┬─ID─┬─name─┐
│  0 │  0 │ aa   │
└────┴────┴──────┘


Auto Case

CREATE TABLE test (id Int, ID Int);
SET input_format_column_name_matching_mode = 'auto'
INSERT INTO test FORMAT JSONEachRow {"id": 1}

SELECT * FROM test

┌─id─┬─ID─┐
│  1 │  0 │
└────┴────┘

-- Ambiguity
CREATE TABLE test (id Int, ID Int);
SET input_format_column_name_matching_mode = 'auto'
INSERT INTO test FORMAT JSONEachRow {"Id": 1} -- Notice that `Id` doesn't match any column, so it is matched by a case insensitive search

DB::Exception: Ambiguous field (`id`) when processing data: (at row 1)

Dicussion - possible corner case

CREATE TABLE test(id Int);

SET input_format_column_name_matching_mode = 'auto'

INSERT INTO json_test FORMAT JSONEachRow {"ID": 444, "id": 123};

SELECT * FROM json_test

   ┌──id─┐
1. │ 444 │
   └─────┘

Influence of flaginput_format_json_ignore_unnecessary_fields. If the flag is set to true, we set the case match to auto, then the reader will match ID with the table field id and it will consider the rest of the columns irrelevant. We can, of course, still check the remaining columns, but it would go against a previous comment found in the code:

            // Keep parsing the remaining fields in case of the json is invalid.
            // But not look up the name in the name_map since the cost cannot be ignored

The ambiguity check works fine in the following scenario:

CREATE TABLE test(id Int, name String, age Int);

SET input_format_column_name_matching_mode = 'auto'

INSERT INTO json_test FORMAT JSONEachRow {"ID": 444, "id": 123};

SELECT * FROM json_test

DB::Exception: Ambiguous field (`id`) when processing data: (at row 1)

Here it will work because total_columns is 3, and the ambiguity happens before reaching the total_columns.


Note

Medium Risk
Changes column-name lookup behavior across multiple input formats and adds new ambiguity/duplicate-field errors, which could alter ingestion outcomes for existing pipelines relying on previous matching/skip behavior.

Overview
Adds a new setting, input_format_with_names_case_insensitive_column_matching, to control how input formats that carry column names match those names to table columns: case-sensitive, case-insensitive, or AUTO (exact match first, then case-insensitive).

Introduces CaseAwareBlockNameMap to implement this matching (including ambiguity detection when multiple table columns collide under case-insensitive comparison) and wires it into JSONEachRow, JSONColumns, BSONEachRow, and RowBinaryWithNames/CSVWithNames header mapping. Parsing now throws on ambiguous mappings and on repeated fields that resolve to the same target column, with new stateless tests covering normal and ambiguous cases.

Written by Cursor Bugbot for commit 77a9311. This will update automatically on new commits. Configure here.

Version info

  • Merged into: 26.4.1.1102

@Avogar Avogar changed the title [Proof of Concept #92851] For input formats with column names, search columns case-insensitively if there is no ambiguity For input formats with column names, search columns case-insensitively if there is no ambiguity Apr 7, 2026
Comment thread src/Core/CaseAwareBlockNameMap.h
Comment thread src/Core/CaseAwareBlockNameMap.cpp
@Manerone

Manerone commented Apr 9, 2026

Copy link
Copy Markdown
Contributor Author

@Avogar

Avogar commented Apr 10, 2026

Copy link
Copy Markdown
Member

@Manerone I downloaded the build from one of the commits, and this query just don't throw any client or server error:

avogar-dev :) CREATE TABLE json_test (age Int, AGE Int);

CREATE TABLE json_test
(
    `age` Int,
    `AGE` Int
)

Query id: acfecf28-12a4-4ef7-921b-f41534385db7

Ok.

0 rows in set. Elapsed: 0.003 sec.

avogar-dev :) INSERT INTO json_test FORMAT JSONEachRow {"AgE": 1, "aGe": 20};

INSERT INTO json_test FORMAT JSONEachRow

Query id: 2f9829fc-cef0-443f-b31d-1c5bf0568555

Ok.

1 row in set. Elapsed: 0.004 sec.

avogar-dev :) SELECT * FROM json_test

SELECT *
FROM json_test

Query id: 0ea65591-136b-4fb0-8462-c45247bb51be

   ┌─age─┬─AGE─┐
1. │   00 │
   └─────┴─────┘

1 row in set. Elapsed: 0.003 sec.

So this is the issue, as I understand, you expect it to throw. But it doesn't.

@Manerone

Copy link
Copy Markdown
Contributor Author

@Avogar just to be sure, did you set input_format_column_name_matching_mode to auto or ignore_case? I think if it stays in the default option (match_case) it will not find any matches and use the default value of the columns.

@Avogar

Avogar commented Apr 10, 2026

Copy link
Copy Markdown
Member

Ah, right sorry, forgot to change the setting 🤦‍♂️

The actual reason is randomized setting async_insert in the tests. With async inserts errors don't happen on client, but are collected somewhere on server. You need to use clientError and add -- Tags: no-async-insert as the first line in the test.

@Manerone

Copy link
Copy Markdown
Contributor Author

@Avogar thanks a lot for the help! I have applied the changes, lets see if it finally passes now.

$CLICKHOUSE_CLIENT -q "CREATE TABLE test (id Int, iD Int, name String, NAME String)"
$CLICKHOUSE_CLIENT -q "SET input_format_csv_detect_header=true;
SET input_format_column_name_matching_mode='auto';
INSERT INTO test FROM INFILE '$CURDIR/data_csv/csv_with_names.csv' FORMAT CSVWithNames; -- { clientError 117 }"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These -- { clientError 117 } annotations are inside the SQL string passed to clickhouse-client, so they are treated as SQL comments and are not validated by the shell test harness.

As written, this script does not assert that ambiguity queries actually fail; a regression to successful execution would still pass this test.

Please capture and check the exit code (e.g., via run_with_error from shell_config.sh) for the expected-failure cases.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see run_with_error being used anywhere in the tests. I actually see the shell tests using the same pattern as me. Example files:

  • tests/queries/0_stateless/02878_use_structure_from_insertion_table_with_explicit_insert_columns.sh
  • tests/queries/0_stateless/02932_refreshable_materialized_views_1.sh
  • tests/queries/0_stateless/03410_variant_bad_binary_data.sh

$CLICKHOUSE_CLIENT -q "DROP TABLE IF EXISTS test"
$CLICKHOUSE_CLIENT -q "CREATE TABLE test (id Int, iD Int, name String, NAME String)"
$CLICKHOUSE_CLIENT -q "SET input_format_column_name_matching_mode='auto';
INSERT INTO test FROM INFILE '$CURDIR/data_binary/row_binary_with_names.bin' FORMAT RowBinaryWithNames; -- { clientError 117 }"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue here as in the CSV shell test: -- { clientError 117 } appears inside the SQL text, so it is not interpreted by the test harness as an expected error check.

Please validate failure cases explicitly (e.g., call run_with_error "$CLICKHOUSE_CLIENT" -q "..." and assert the returned code is 117). Otherwise these negative cases can silently pass if the query unexpectedly succeeds.

$CLICKHOUSE_CLIENT -q "DROP TABLE IF EXISTS test"
$CLICKHOUSE_CLIENT -q "CREATE TABLE test (id Int, iD Int, name String, NAME String)"
$CLICKHOUSE_CLIENT -q "SET input_format_column_name_matching_mode='auto';
INSERT INTO test FROM INFILE '$CURDIR/data_bson/bson_with_names.bson' FORMAT BSONEachRow; -- { clientError 117 }"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shell test has the same false-negative pattern: expected-error markers like -- { clientError 117 } are embedded in the SQL query string and are not checked by the shell test framework.

Please assert the error code explicitly for these negative cases (for example with run_with_error + an exit-code check). Without that, regressions where ambiguity is no longer rejected can pass unnoticed.

@Avogar

Avogar commented Apr 13, 2026

Copy link
Copy Markdown
Member

The CI is not in a good shape right now. We are working on reducing flaky failures. I will postpone merging it for a few days until the PRs with CI fixes are merged.

@clickhouse-gh

clickhouse-gh Bot commented Apr 20, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.10% 84.00% -0.10%
Functions 91.10% 91.10% +0.00%
Branches 76.50% 76.50% +0.00%

Changed lines: 99.03% (204/206) · Uncovered code

Full report · Diff report

@Avogar Avogar added this pull request to the merge queue Apr 21, 2026
Merged via the queue into ClickHouse:master with commit 2dde042 Apr 21, 2026
159 of 161 checks passed
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-improvement Pull request with some product improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants