fix: Query string not re-tokenized per field when using field-level token_separators/symbols_to_index #2828 #2904
Open
mdhaduk wants to merge 1 commit into
…aware Space-as-typos fallback and drop-tokens loop were calling fuzzy_search_fields with all fields and field[0]'s tokens. Both paths now iterate field_groups so each group is searched with the token set matching how its fields were indexed.

closes #2828
Problem
When query_by spans multiple fields with different symbols_to_index or token_separators configurations, Typesense tokenized the query once using the first field's rules and reused those tokens for every subsequent field. This produced the wrong token set for fields that index content differently.
Root Cause
In collection.cpp, parse_search_query was called once with field[0]'s symbols_to_index and token_separators. The result was then copied into every field_query_tokens[i] slot. All downstream search paths used this single token set for all fields.
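To make the root cause concrete, here is a minimal sketch that contrasts tokenizing once with tokenizing per field. It is not the code from collection.cpp: FieldRules, toy_tokenize, tokenize_once, and tokenize_per_field are hypothetical names, and the tokenizer rules (split on spaces and token_separators, keep alphanumerics and symbols_to_index, drop everything else) are a simplifying assumption.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Hypothetical per-field settings; the member names mirror the schema options,
// but this struct is illustrative and not Typesense's own field type.
struct FieldRules {
    std::string symbols_to_index;  // symbols kept inside a token
    std::string token_separators;  // extra characters that split tokens
};

// Toy tokenizer (assumption): split on spaces and token_separators, keep
// alphanumerics and symbols_to_index, silently drop everything else.
Tokens toy_tokenize(const std::string& query, const FieldRules& rules) {
    Tokens tokens;
    std::string current;
    for (char ch : query) {
        const bool splits = ch == ' ' ||
            rules.token_separators.find(ch) != std::string::npos;
        const bool kept = std::isalnum(static_cast<unsigned char>(ch)) != 0 ||
            rules.symbols_to_index.find(ch) != std::string::npos;
        if (splits) {
            if (!current.empty()) {
                tokens.push_back(current);
                current.clear();
            }
        } else if (kept) {
            current.push_back(ch);
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}

// Behaviour described under Root Cause: tokenize with field 0's rules only
// and copy that result into every field's slot.
std::vector<Tokens> tokenize_once(const std::string& query,
                                  const std::vector<FieldRules>& fields) {
    assert(!fields.empty());
    return std::vector<Tokens>(fields.size(), toy_tokenize(query, fields[0]));
}

// Per-field behaviour: every field gets tokens produced by its own rules.
std::vector<Tokens> tokenize_per_field(const std::string& query,
                                       const std::vector<FieldRules>& fields) {
    std::vector<Tokens> result;
    result.reserve(fields.size());
    for (const FieldRules& rules : fields) {
        result.push_back(toy_tokenize(query, rules));
    }
    return result;
}
```

Assuming a title field that lists "-" in token_separators and a sku field that lists "-" in symbols_to_index, the query wooden-desk yields wooden and desk for both slots under tokenize_once, while tokenize_per_field gives sku the single token wooden-desk that it actually indexed.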
Fix
src/collection.cpp
Re-tokenize the query for each field individually. Field 0 is unchanged. Fields i > 0 each get their own parse_search_query call using their own symbols_to_index and token_separators.
src/index.cpp
Updated three search paths:
1. Primary path
Fields are grouped by their resulting token set using field_group_t. Fields that produce identical tokens share a group and are passed together to fuzzy_search_fields, preserving cross-field matching. Fields with different token sets get separate calls with their own tokens.
This ensures orig_num_tokens is always coherent with the fields being searched, so scoring is not artificially penalized.
2. Space-as-typos fallback
When the primary search returns zero results and split_join_tokens=fallback, the resolved token set is now searched per group rather than broadcast to all fields via a single call.
3. Drop-tokens path
all_queries used to be built from field_query_tokens[0] only. Truncated subsets were then searched across all fields with sku-style tokens that other fields never indexed.
The fix builds group_all_queries per group, so each group drops tokens from its own N-token representation.
Why not just union all tokens and pass all fields once?
fuzzy_search_fields treats its token list as the single correct query representation for all fields given to it. orig_num_tokens is set to the token count of that list.
With a union such as ["wooden-desk", "wooden", "desk"] (the example from the original issue), a document perfectly matching title scores 2/3, while a perfect sku match scores 1/3.
Both are penalized relative to what they should be.
Grouping ensures each call has a coherent token count for accurate scoring.
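The 2/3 and 1/3 figures above can be reproduced with a small sketch. It is an illustration only: FieldGroup is a hypothetical stand-in for field_group_t (whose real layout is not shown here), group_fields_by_tokens and matched_fraction are made-up helpers, and the score is a plain matched-token fraction rather than Typesense's actual ranking formula.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Hypothetical stand-in for field_group_t: one token set plus the indices
// of the query_by fields that produced exactly that token set.
struct FieldGroup {
    Tokens tokens;
    std::vector<std::size_t> fields;
};

// Group fields whose per-field tokenization produced identical tokens.
// Groups are returned in order of first appearance.
std::vector<FieldGroup> group_fields_by_tokens(
        const std::vector<Tokens>& field_query_tokens) {
    std::vector<FieldGroup> groups;
    std::map<Tokens, std::size_t> group_index;  // token set -> slot in groups
    for (std::size_t i = 0; i < field_query_tokens.size(); ++i) {
        auto it = group_index.find(field_query_tokens[i]);
        if (it == group_index.end()) {
            group_index.emplace(field_query_tokens[i], groups.size());
            groups.push_back({field_query_tokens[i], {i}});
        } else {
            groups[it->second].fields.push_back(i);
        }
    }
    return groups;
}

// Toy score (assumption): fraction of query tokens found among the tokens a
// field indexed. The denominator plays the role of orig_num_tokens.
double matched_fraction(const Tokens& query_tokens, const Tokens& indexed_tokens) {
    assert(!query_tokens.empty());
    std::size_t hits = 0;
    for (const std::string& token : query_tokens) {
        if (std::find(indexed_tokens.begin(), indexed_tokens.end(), token) !=
                indexed_tokens.end()) {
            ++hits;
        }
    }
    return static_cast<double>(hits) / static_cast<double>(query_tokens.size());
}
```

With the union as the query, a title that indexed wooden and desk matches two of three tokens and a sku that indexed wooden-desk matches one of three. When each group is scored against its own token set, both perfect matches reach a fraction of 1.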
Testing
Extended PerFieldQueryRetokenization to cover the regression, field-order invariance, cross-field correctness, and the three-field tokenization case.