For input formats with column names, search columns case-insensitively if there is no ambiguity#99346
Conversation
- Re-enable the optimization at JSONEachRowRowInputFormat::columnIndex - Test the other modes - Implement other format readers - Write tests for each format reader - Fix the settings, to allow changing it during execution
- PIMPL idiom to avoid virtual calls and to hide implementation details, since they are only relevant within the context of that class
Change from clientError to serverError
|
@Manerone I downloaded the build from one of the commits, and this query just don't throw any client or server error: avogar-dev :) CREATE TABLE json_test (age Int, AGE Int);
CREATE TABLE json_test
(
`age` Int,
`AGE` Int
)
Query id: acfecf28-12a4-4ef7-921b-f41534385db7
Ok.
0 rows in set. Elapsed: 0.003 sec.
avogar-dev :) INSERT INTO json_test FORMAT JSONEachRow {"AgE": 1, "aGe": 20};
INSERT INTO json_test FORMAT JSONEachRow
Query id: 2f9829fc-cef0-443f-b31d-1c5bf0568555
Ok.
1 row in set. Elapsed: 0.004 sec.
avogar-dev :) SELECT * FROM json_test
SELECT *
FROM json_test
Query id: 0ea65591-136b-4fb0-8462-c45247bb51be
┌─age─┬─AGE─┐
1. │ 0 │ 0 │
└─────┴─────┘
1 row in set. Elapsed: 0.003 sec.So this is the issue, as I understand, you expect it to throw. But it doesn't. |
|
@Avogar just to be sure, did you set |
|
Ah, right sorry, forgot to change the setting 🤦♂️ The actual reason is randomized setting |
|
@Avogar thanks a lot for the help! I have applied the changes, lets see if it finally passes now. |
| $CLICKHOUSE_CLIENT -q "CREATE TABLE test (id Int, iD Int, name String, NAME String)" | ||
| $CLICKHOUSE_CLIENT -q "SET input_format_csv_detect_header=true; | ||
| SET input_format_column_name_matching_mode='auto'; | ||
| INSERT INTO test FROM INFILE '$CURDIR/data_csv/csv_with_names.csv' FORMAT CSVWithNames; -- { clientError 117 }" |
There was a problem hiding this comment.
These -- { clientError 117 } annotations are inside the SQL string passed to clickhouse-client, so they are treated as SQL comments and are not validated by the shell test harness.
As written, this script does not assert that ambiguity queries actually fail; a regression to successful execution would still pass this test.
Please capture and check the exit code (e.g., via run_with_error from shell_config.sh) for the expected-failure cases.
There was a problem hiding this comment.
I don't see run_with_error being used anywhere in the tests. I actually see the shell tests using the same pattern as me. Example files:
- tests/queries/0_stateless/02878_use_structure_from_insertion_table_with_explicit_insert_columns.sh
- tests/queries/0_stateless/02932_refreshable_materialized_views_1.sh
- tests/queries/0_stateless/03410_variant_bad_binary_data.sh
| $CLICKHOUSE_CLIENT -q "DROP TABLE IF EXISTS test" | ||
| $CLICKHOUSE_CLIENT -q "CREATE TABLE test (id Int, iD Int, name String, NAME String)" | ||
| $CLICKHOUSE_CLIENT -q "SET input_format_column_name_matching_mode='auto'; | ||
| INSERT INTO test FROM INFILE '$CURDIR/data_binary/row_binary_with_names.bin' FORMAT RowBinaryWithNames; -- { clientError 117 }" |
There was a problem hiding this comment.
Same issue here as in the CSV shell test: -- { clientError 117 } appears inside the SQL text, so it is not interpreted by the test harness as an expected error check.
Please validate failure cases explicitly (e.g., call run_with_error "$CLICKHOUSE_CLIENT" -q "..." and assert the returned code is 117). Otherwise these negative cases can silently pass if the query unexpectedly succeeds.
| $CLICKHOUSE_CLIENT -q "DROP TABLE IF EXISTS test" | ||
| $CLICKHOUSE_CLIENT -q "CREATE TABLE test (id Int, iD Int, name String, NAME String)" | ||
| $CLICKHOUSE_CLIENT -q "SET input_format_column_name_matching_mode='auto'; | ||
| INSERT INTO test FROM INFILE '$CURDIR/data_bson/bson_with_names.bson' FORMAT BSONEachRow; -- { clientError 117 }" |
There was a problem hiding this comment.
This shell test has the same false-negative pattern: expected-error markers like -- { clientError 117 } are embedded in the SQL query string and are not checked by the shell test framework.
Please assert the error code explicitly for these negative cases (for example with run_with_error + an exit-code check). Without that, regressions where ambiguity is no longer rejected can pass unnoticed.
|
The CI is not in a good shape right now. We are working on reducing flaky failures. I will postpone merging it for a few days until the PRs with CI fixes are merged. |
LLVM Coverage ReportChanged lines: 99.03% (204/206) · Uncovered code |
2dde042

Changelog category
Changelog entry
Adds a new setting
input_format_column_name_matching_modewhich allows different case sensitivities for input formats.Documentation entry for user-facing changes
The possible options for the new setting are:
match_case: will do a case-sensitive search of the input columns.ignore_case: will do a case-insensitive search of the input columns.auto: will first try to match case-sensitive, if it fails tries to match case-insensitive.NOTE: case-insensitive searches can have ambiguity (e.g. Name and namE are ambiguous). If any ambiguity is detected, an exception will be thrown.
Supported formats:
Examples
Match Case
Ignore Case
Auto Case
Dicussion - possible corner case
Influence of flag
input_format_json_ignore_unnecessary_fields. If the flag is set totrue, we set the case match toauto, then the reader will matchIDwith the table fieldidand it will consider the rest of the columns irrelevant. We can, of course, still check the remaining columns, but it would go against a previous comment found in the code:The ambiguity check works fine in the following scenario:
Here it will work because total_columns is 3, and the ambiguity happens before reaching the total_columns.
Note
Medium Risk
Changes column-name lookup behavior across multiple input formats and adds new ambiguity/duplicate-field errors, which could alter ingestion outcomes for existing pipelines relying on previous matching/skip behavior.
Overview
Adds a new setting,
input_format_with_names_case_insensitive_column_matching, to control how input formats that carry column names match those names to table columns: case-sensitive, case-insensitive, or AUTO (exact match first, then case-insensitive).Introduces
CaseAwareBlockNameMapto implement this matching (including ambiguity detection when multiple table columns collide under case-insensitive comparison) and wires it intoJSONEachRow,JSONColumns,BSONEachRow, andRowBinaryWithNames/CSVWithNamesheader mapping. Parsing now throws on ambiguous mappings and on repeated fields that resolve to the same target column, with new stateless tests covering normal and ambiguous cases.Written by Cursor Bugbot for commit 77a9311. This will update automatically on new commits. Configure here.
Version info
26.4.1.1102