
mdhaduk · 2026-05-04T05:52:21Z

Problem

When query_by spans multiple fields with different symbols_to_index or token_separators configurations, Typesense tokenized the query once using the first field’s rules and reused those tokens for every subsequent field.

This produced the wrong token set for fields that index content differently.

Root Cause

In collection.cpp, parse_search_query was called once with field[0]’s locale, stemmer, symbols_to_index, and token_separators.

The result was then copied into every field_query_tokens[i] slot.

All downstream search paths used this single token set for all fields.

Fix

`src/collection.cpp`

Re-tokenize the query for each field individually.

Field 0 is unchanged. Each field i > 0 gets its own parse_search_query call using its own locale, stemmer, symbols_to_index, and token_separators.
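To illustrate why per-field tokenization matters, here is a minimal sketch. The `FieldConfig` struct, `tokenize`, and `tokenize_per_field` are hypothetical names made up for this example. The toy tokenizer is not Typesense's tokenizer and ignores locale and stemming. It only models the two options discussed here, under the assumption that a `title` field splits on `-` while a `sku` field indexes `-` as part of the token.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical per-field settings; only the two options relevant to this PR
// are modeled (locale and stemmer are left out for brevity).
struct FieldConfig {
    std::string name;
    std::string symbols_to_index;  // symbols kept inside tokens
    std::string token_separators;  // extra characters that split tokens
};

// Toy tokenizer (an assumption, not Typesense's implementation): splits on
// spaces and token_separators, keeps alphanumerics and symbols_to_index,
// and silently drops every other character.
std::vector<std::string> tokenize(const std::string& query,
                                  const FieldConfig& cfg) {
    std::vector<std::string> tokens;
    std::string current;
    for (char c : query) {
        const bool is_separator =
            c == ' ' || cfg.token_separators.find(c) != std::string::npos;
        if (is_separator) {
            if (!current.empty()) {
                tokens.push_back(current);
                current.clear();
            }
        } else if (std::isalnum(static_cast<unsigned char>(c)) ||
                   cfg.symbols_to_index.find(c) != std::string::npos) {
            current.push_back(c);
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}

// One tokenization per query_by field, instead of copying field 0's tokens
// into every slot.
std::vector<std::vector<std::string>> tokenize_per_field(
        const std::string& query, const std::vector<FieldConfig>& fields) {
    std::vector<std::vector<std::string>> field_query_tokens;
    for (const auto& field : fields) {
        field_query_tokens.push_back(tokenize(query, field));
    }
    return field_query_tokens;
}
```

With fields `{"title", "", "-"}` and `{"sku", "-", ""}`, the query `wooden-desk` yields two tokens for `title` and a single token for `sku` in this sketch, which is the mismatch that reusing field 0's tokens would hide.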

`src/index.cpp`

Updated three search paths:

1. Primary path

Fields are grouped by their resulting token set using field_group_t.

Fields that produce identical tokens share a group and are passed together to fuzzy_search_fields, preserving cross-field matching.

Fields with different token sets get separate calls with their own tokens.

This ensures orig_num_tokens is always coherent with the fields being searched, so scoring is not artificially penalized.
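The grouping step can be sketched roughly as follows. The `FieldGroup` struct below is a hypothetical stand-in for field_group_t, whose actual members are not shown in this description, and `group_fields_by_tokens` is an illustrative name rather than a function from the patch.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Hypothetical stand-in for field_group_t: one token list plus the indices
// of the query_by fields that produced exactly that list.
struct FieldGroup {
    Tokens tokens;
    std::vector<std::size_t> field_indices;
};

// Groups fields whose per-field tokenization produced identical tokens.
// Groups are returned in order of first appearance.
std::vector<FieldGroup> group_fields_by_tokens(
        const std::vector<Tokens>& field_query_tokens) {
    std::vector<FieldGroup> groups;
    std::map<Tokens, std::size_t> group_of;  // token list -> index in groups
    for (std::size_t i = 0; i < field_query_tokens.size(); ++i) {
        const Tokens& tokens = field_query_tokens[i];
        auto it = group_of.find(tokens);
        if (it == group_of.end()) {
            it = group_of.emplace(tokens, groups.size()).first;
            groups.push_back(FieldGroup{tokens, {}});
        }
        groups[it->second].field_indices.push_back(i);
    }
    return groups;
}
```

Each resulting group would then correspond to one fuzzy_search_fields call, so fields that tokenize identically are still searched together.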

2. Space-as-typos fallback

When the primary search returns zero results and split_join_tokens=fallback, the resolved token set is now searched per group rather than broadcast to all fields via a single call.

3. Drop-tokens path

Previously, all_queries was built from field_query_tokens[0] only, so the truncated subsets searched across all fields could contain sku-style tokens that the other fields never indexed.

The fix builds group_all_queries per group, so each group drops tokens from its own N-token representation.
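A rough sketch of building truncated queries per group is shown below. The drop order (from the right, one token at a time, never dropping the last token) is an assumption chosen for illustration, and `truncated_queries` and `build_group_all_queries` are hypothetical helper names rather than functions from the patch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Truncated variants of a single token list: drops tokens from the right,
// one at a time, and always keeps at least one token.
std::vector<Tokens> truncated_queries(const Tokens& tokens) {
    std::vector<Tokens> variants;
    for (std::size_t keep = tokens.size(); keep > 1; --keep) {
        const auto end_offset = static_cast<std::ptrdiff_t>(keep - 1);
        variants.emplace_back(tokens.begin(), tokens.begin() + end_offset);
    }
    return variants;
}

// One list of truncated queries per group, each derived from that group's
// own tokens rather than from field 0's tokens.
std::vector<std::vector<Tokens>> build_group_all_queries(
        const std::vector<Tokens>& group_tokens) {
    std::vector<std::vector<Tokens>> group_all_queries;
    for (const auto& tokens : group_tokens) {
        group_all_queries.push_back(truncated_queries(tokens));
    }
    return group_all_queries;
}
```

In this sketch a two-token group gets one truncated variant, while a single-token group gets none, so no group is ever searched with a subset derived from another group's tokens.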

Why not just union all tokens and pass all fields once?

fuzzy_search_fields treats its token list as the single correct query representation for all fields given to it.

orig_num_tokens is set to the token count of that list.

With a union such as the following (taken from the example in the original issue):

["wooden-desk", "wooden", "desk"]

a document perfectly matching title scores 2/3, while a perfect sku match scores 1/3.

Both are penalized relative to what they should be.

Grouping ensures each call has a coherent token count for accurate scoring.
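The arithmetic can be checked with a small sketch. It assumes a deliberately simplified model in which a field's score is just the number of query tokens it indexed divided by orig_num_tokens. Real Typesense scoring involves more factors, and `MatchRatio` and `match_ratio` are hypothetical names used only here.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Simplified score: matched query tokens over the query's token count.
struct MatchRatio {
    std::size_t matched;
    std::size_t orig_num_tokens;
};

// Counts how many query tokens appear among the tokens a field indexed.
MatchRatio match_ratio(const Tokens& query_tokens,
                       const Tokens& indexed_tokens) {
    std::size_t matched = 0;
    for (const auto& token : query_tokens) {
        if (std::find(indexed_tokens.begin(), indexed_tokens.end(), token) !=
            indexed_tokens.end()) {
            ++matched;
        }
    }
    return MatchRatio{matched, query_tokens.size()};
}
```

Under this model, the union query gives title 2 of 3 and sku 1 of 3, whereas searching each field with its own tokens gives 2 of 2 and 1 of 1.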

Testing

Extended PerFieldQueryRetokenization to cover the regression, field-order invariance, cross-field correctness, and the three-field tokenization case.

…aware Space-as-typos fallback and drop-tokens loop were calling fuzzy_search_fields with all fields and field[0]'s tokens. Both paths now iterate field_groups so each group is searched with the token set matching how its fields were indexed.

mdhaduk · 2026-05-05T17:51:50Z

mdhaduk mentioned this pull request May 5, 2026

Query string not re-tokenized per field when using field-level token_separators/symbols_to_index with multiple query_by fields #2828

Open

fix: Query string not re-tokenized per field when using field-level token_separators/symbols_to_index #2828 #2904

mdhaduk wants to merge 1 commit into typesense:v31 from mdhaduk:feat/2828-query-string-not-retockenized-issue
