fix: Query string not re-tokenized per field when using field-level token_separators/symbols_to_index #2828 by mdhaduk · Pull Request #2904 · typesense/typesense · GitHub

Open
mdhaduk wants to merge 1 commit into typesense:v31 from mdhaduk:feat/2828-query-string-not-retockenized-issue

@mdhaduk mdhaduk commented May 4, 2026


closes #2828

Problem

When query_by spans multiple fields with different symbols_to_index or token_separators configurations, Typesense tokenized the query once using the first field’s rules and reused those tokens for every subsequent field.

This produced the wrong token set for fields that index content differently.
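To make the mismatch concrete, here is a minimal sketch of how one query string can tokenize differently per field. `FieldRules` and `toy_tokenize` are hypothetical names, and the rules are a simplification assumed for illustration (whitespace and `token_separators` split, `symbols_to_index` characters are kept, other symbols are dropped). It is not Typesense's actual tokenizer, and it ignores locale and stemming.

```cpp
#include <cassert>
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a field's tokenization settings.
struct FieldRules {
    std::set<char> symbols_to_index;
    std::set<char> token_separators;
};

// Toy tokenizer (assumed rules): whitespace and token_separators split tokens,
// characters in symbols_to_index are kept, other non-alphanumerics are dropped.
std::vector<std::string> toy_tokenize(const std::string& query, const FieldRules& rules) {
    std::vector<std::string> tokens;
    std::string current;
    auto flush = [&]() {
        if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    };
    for (unsigned char c : query) {
        const char ch = static_cast<char>(c);
        if (std::isspace(c) || rules.token_separators.count(ch) > 0) {
            flush();  // separator: close the current token
        } else if (std::isalnum(c)) {
            current.push_back(static_cast<char>(std::tolower(c)));
        } else if (rules.symbols_to_index.count(ch) > 0) {
            current.push_back(ch);  // symbol explicitly kept for this field
        }
        // any other symbol is dropped
    }
    flush();
    return tokens;
}

// Usage sketch: the same query yields different token counts per field.
void demo() {
    FieldRules title{{}, {'-'}};  // '-' separates tokens
    FieldRules sku{{'-'}, {}};    // '-' is indexed as part of the token
    assert(toy_tokenize("wooden-desk", title).size() == 2);  // "wooden", "desk"
    assert(toy_tokenize("wooden-desk", sku).size() == 1);    // "wooden-desk"
}
```

Under these assumed rules, reusing the title tokens for sku (or the reverse) searches for tokens that the other field never indexed.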

Root Cause

In collection.cpp, parse_search_query was called once with field[0]’s:

  • locale
  • stemmer
  • symbols_to_index
  • token_separators

The result was then copied into every field_query_tokens[i] slot.

All downstream search paths used this single token set for all fields.

Fix

src/collection.cpp

Re-tokenize the query for each field individually.

Field 0 is unchanged. Fields i > 0 each get their own parse_search_query call using their own:

  • locale
  • stemmer
  • symbols_to_index
  • token_separators
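The difference between the old and new behavior can be sketched as follows. This is an illustration under assumptions, not the code in collection.cpp: `toy_parse` is a hypothetical stand-in for parse_search_query with no locale or stemmer handling, and `FieldRules`, `broadcast_tokens`, and `per_field_tokens` are names invented for the example.

```cpp
#include <cassert>
#include <cctype>
#include <set>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Hypothetical per-field settings; the real field also carries locale and stemmer.
struct FieldRules {
    std::set<char> symbols_to_index;
    std::set<char> token_separators;
};

// Toy stand-in for parse_search_query (assumed, simplified rules).
Tokens toy_parse(const std::string& q, const FieldRules& r) {
    Tokens out;
    std::string cur;
    for (unsigned char c : q) {
        const char ch = static_cast<char>(c);
        const bool split = std::isspace(c) || r.token_separators.count(ch) > 0;
        if (!split && (std::isalnum(c) || r.symbols_to_index.count(ch) > 0)) {
            cur.push_back(static_cast<char>(std::tolower(c)));
        } else if (split && !cur.empty()) {
            out.push_back(cur);
            cur.clear();
        }
    }
    if (!cur.empty()) out.push_back(cur);
    return out;
}

// Old behavior: tokenize with field 0's rules and copy the result into every slot.
std::vector<Tokens> broadcast_tokens(const std::string& q, const std::vector<FieldRules>& fields) {
    assert(!fields.empty());
    return std::vector<Tokens>(fields.size(), toy_parse(q, fields[0]));
}

// Fixed behavior: every field's slot is produced by that field's own rules.
std::vector<Tokens> per_field_tokens(const std::string& q, const std::vector<FieldRules>& fields) {
    std::vector<Tokens> field_query_tokens;
    field_query_tokens.reserve(fields.size());
    for (const FieldRules& f : fields) {
        field_query_tokens.push_back(toy_parse(q, f));
    }
    return field_query_tokens;
}
```

With a title field that separates on `-` and a sku field that indexes `-`, `broadcast_tokens` hands sku the tokens `wooden` and `desk`, while `per_field_tokens` hands it `wooden-desk`.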

src/index.cpp

Updated three search paths:

1. Primary path

Fields are grouped by their resulting token set using field_group_t.

Fields that produce identical tokens share a group and are passed together to fuzzy_search_fields, preserving cross-field matching.

Fields with different token sets get separate calls with their own tokens.

This ensures orig_num_tokens is always coherent with the fields being searched, so scoring is not artificially penalized.
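The grouping step can be sketched as below. The PR names field_group_t but this page does not show its definition, so `FieldGroup` and `group_fields_by_tokens` are hypothetical; the sketch assumes a group is simply a token list plus the indices of the fields that produced it.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Hypothetical shape; the PR's field_group_t may hold different members.
struct FieldGroup {
    Tokens tokens;                       // token list shared by every field in the group
    std::vector<std::size_t> field_ids;  // indices into the query_by field list
};

// Groups fields whose per-field token lists are identical, in first-seen order.
std::vector<FieldGroup> group_fields_by_tokens(const std::vector<Tokens>& field_query_tokens) {
    std::vector<FieldGroup> groups;
    std::map<Tokens, std::size_t> group_index;  // token list -> position in `groups`
    for (std::size_t i = 0; i < field_query_tokens.size(); ++i) {
        const Tokens& tokens = field_query_tokens[i];
        auto it = group_index.find(tokens);
        if (it == group_index.end()) {
            // First field with this token list: open a new group.
            it = group_index.emplace(tokens, groups.size()).first;
            groups.push_back(FieldGroup{tokens, {}});
        }
        assert(it->second < groups.size());
        groups[it->second].field_ids.push_back(i);
    }
    return groups;
}
```

Each resulting group would then get its own search call with `group.tokens`, so the token count used for scoring matches the fields in that call.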

2. Space-as-typos fallback

When the primary search returns zero results and split_join_tokens=fallback, the resolved token set is now searched per group rather than broadcast to all fields via a single call.

3. Drop-tokens path

Previously, all_queries was built from field_query_tokens[0] only. Its truncated subsets were then searched across all fields, so a field could be queried with tokens (for example, sku-style tokens) that it never indexed.

The fix builds group_all_queries per group, so each group drops tokens from its own N-token representation.
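A rough sketch of per-group token dropping follows. It assumes tokens are dropped from the right, one at a time; the actual drop order and thresholds in index.cpp are not shown on this page. `truncated_queries` and `build_group_all_queries` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Truncated variants of one group's tokens, dropping from the right:
// N-1 tokens, then N-2, down to 1. (Assumed drop order, for illustration.)
std::vector<Tokens> truncated_queries(const Tokens& tokens) {
    std::vector<Tokens> variants;
    for (std::size_t keep = tokens.size(); keep > 1;) {
        --keep;
        variants.emplace_back(tokens.begin(),
                              tokens.begin() + static_cast<std::ptrdiff_t>(keep));
    }
    return variants;
}

// One list of truncated queries per group, instead of a single list
// derived from field 0 and reused for every field.
std::vector<std::vector<Tokens>> build_group_all_queries(const std::vector<Tokens>& group_tokens) {
    std::vector<std::vector<Tokens>> group_all_queries;
    group_all_queries.reserve(group_tokens.size());
    for (const Tokens& tokens : group_tokens) {
        group_all_queries.push_back(truncated_queries(tokens));
    }
    assert(group_all_queries.size() == group_tokens.size());
    return group_all_queries;
}
```

For a hypothetical query tokenized as `red`, `wooden`, `desk` in one group and `red`, `wooden-desk` in another, the first group tries two truncated queries and the second tries one, each built only from tokens its own fields could have indexed.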

Why not just union all tokens and pass all fields once?

fuzzy_search_fields treats its token list as the single correct query representation for all fields given to it.

orig_num_tokens is set to the token count of that list.

With a union such as the following (taken from the example in the original issue):

["wooden-desk", "wooden", "desk"]

a document perfectly matching title scores 2/3, while a perfect sku match scores 1/3.

Both are penalized relative to what they should be.

Grouping ensures each call has a coherent token count for accurate scoring.
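The arithmetic behind the 2/3 and 1/3 figures can be reproduced with a toy score. `match_fraction` is a hypothetical helper that returns matched tokens divided by query token count, with the denominator playing the role of orig_num_tokens. Typesense's real scoring is more involved; this only illustrates the penalty.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Illustrative score only: fraction of query tokens found among a field's
// indexed tokens. query_tokens.size() stands in for orig_num_tokens.
double match_fraction(const Tokens& query_tokens, const std::set<std::string>& indexed_tokens) {
    assert(!query_tokens.empty());
    std::size_t matched = 0;
    for (const std::string& token : query_tokens) {
        matched += indexed_tokens.count(token);  // count() is 0 or 1 for a set
    }
    return static_cast<double>(matched) / static_cast<double>(query_tokens.size());
}
```

With the union `{"wooden-desk", "wooden", "desk"}`, a title index holding `wooden` and `desk` gets 2/3 and a sku index holding `wooden-desk` gets 1/3. With per-group tokens, the same documents get 2/2 and 1/1.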

Testing

Extended PerFieldQueryRetokenization to cover the regression, field-order invariance, cross-field correctness, and the three-field tokenization case.

…aware

  Space-as-typos fallback and drop-tokens loop were calling
  fuzzy_search_fields with all fields and field[0]'s tokens.
  Both paths now iterate field_groups so each group is searched
  with the token set matching how its fields were indexed.

mdhaduk commented May 5, 2026


Successfully merging this pull request may close these issues.

Query string not re-tokenized per field when using field-level token_separators/symbols_to_index with multiple query_by fields