Remove `stop_words` param for unicode tokenizer by george-larionov · Pull Request #99834 · ClickHouse/ClickHouse · GitHub
Skip to content

Remove stop_words param for unicode tokenizer#99834

Merged
george-larionov merged 3 commits into
masterfrom
remove-stop-words-param-from-unicode-tokenizer
Mar 19, 2026
Merged

Remove stop_words param for unicode tokenizer#99834
george-larionov merged 3 commits into
masterfrom
remove-stop-words-param-from-unicode-tokenizer

Conversation

@george-larionov

Copy link
Copy Markdown
Member

Removing stop words parameter of unicode tokenizer. It was decided that we should add it back in as a index-level parameter later. See the PR where it was added: #99357

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@clickhouse-gh

clickhouse-gh Bot commented Mar 18, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-not-for-changelog This PR should not be mentioned in the changelog label Mar 18, 2026
@george-larionov

Copy link
Copy Markdown
Member Author

@rschu1ze @Ergus @amosbird

@george-larionov george-larionov marked this pull request as ready for review March 18, 2026 02:42
@george-larionov george-larionov changed the title removing stop words param for unicode tokenizer Remove stop_words param for unicode tokenizer Mar 18, 2026
Comment thread src/Interpreters/TokenizerFactory.cpp

@Ergus Ergus left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just added a minor comment.

@Ergus Ergus self-assigned this Mar 18, 2026
Comment thread src/Functions/tokens.cpp
@rschu1ze rschu1ze changed the title Remove stop_words param for unicode tokenizer Remove stop_words param for unicode tokenizer Mar 18, 2026
@rschu1ze

rschu1ze commented Mar 18, 2026

Copy link
Copy Markdown
Member

@amosbird FYI ^^ The plan is to add stoppword support back as part of the postprocessor: #98939

if (escaped)
{
--pos;
escaped = false;

@rschu1ze rschu1ze Mar 19, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure it's safe to delete l. 663?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, escaped is local to the function, it was updated here because it was used a few lines later for a stop-words specific bit of logic. However now that code is gone and this line can be dropped safely because right after we return from the function.

The only potential issue is that whoever adds the stop_words functionality back in needs to add this line back in as well, but if they just revert all the changes in this PR it should be fine.

@clickhouse-gh

clickhouse-gh Bot commented Mar 19, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 83.80% 83.80% +0.00%
Functions 24.00% 24.00% +0.00%
Branches 76.30% 76.40% +0.10%

PR changed lines: PR changed-lines coverage: 96.23% (51/53, 0 noise lines excluded)
Diff coverage report
Uncovered code

@george-larionov george-larionov added this pull request to the merge queue Mar 19, 2026
Merged via the queue into master with commit cb53882 Mar 19, 2026
163 checks passed
@george-larionov george-larionov deleted the remove-stop-words-param-from-unicode-tokenizer branch March 19, 2026 22:04
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Mar 19, 2026
george-larionov pushed a commit to amosbird/ClickHouse that referenced this pull request Mar 20, 2026
…stop words removal

Revert accidental submodule pointer changes for libstemmer_c, NuRaft, and
rust_vendor that leaked in during rebase.

Update test references to reflect that stop words were removed from
unicodeWord tokenizer by PR ClickHouse#99834: fullwidth comma is now returned as a
token, and the stop-words-as-argument test case is removed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-not-for-changelog This PR should not be mentioned in the changelog pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants