`<locale>`: Improve `std::collate::do_transform()`'s handling of wrongly encoded input by muellerj2 · Pull Request #5431 · microsoft/STL · GitHub
Skip to content

<locale>: Improve std::collate::do_transform()'s handling of wrongly encoded input#5431

Merged
StephanTLavavej merged 3 commits into
microsoft:mainfrom
muellerj2:locale-collate-handling-invalid-input
Apr 29, 2025
Merged

<locale>: Improve std::collate::do_transform()'s handling of wrongly encoded input#5431
StephanTLavavej merged 3 commits into
microsoft:mainfrom
muellerj2:locale-collate-handling-invalid-input

Conversation

@muellerj2

Copy link
Copy Markdown

Resolves #5210.

Changes:

  • This fixes the comments in xstrxform.cpp and xwcsxfrm.cpp: The functions return SIZE_MAX on error, not INT_MAX like their UCRT counterparts.
  • In _Wcsxfrm(), the return value is now correctly set to static_cast<size_t>(-1) for one error case, bringing it in line with the other error case. Previously, it was set to INT_MAX, which is potentially a valid length for a sort key.
    • While this is clearly a bug in the code, I couldn't actually find any way to exert this error path. (I was a bit disappointed to find that my fix can't actually be observed to fix anything, but on the other hand, this means that we don't have to worry about any potential compatibility impact.)
  • I changed collate::do_transform() to return an empty string in the error case instead of throwing a confusing length_error("string too large").
    • The standard remains completely silent on what collate::do_transform() should do in case of error.
    • The empty string can only really be a valid return value for the empty string, so there isn't any potential for confusion whether the return value represents an actual sort key or an error.
    • Alternatively, we could consciously choose that the function throws an exception. I would prefer this exception to be more appropriate than length_error. Maybe runtime_error?
      • This would mean that the regex implementation would have to defend against this exception, because regex_match and regex_search really shouldn't throw just because the searched input string isn't valid UTF-8.

@muellerj2 muellerj2 requested a review from a team as a code owner April 24, 2025 18:27
@github-project-automation github-project-automation Bot moved this to Initial Review in STL Code Reviews Apr 24, 2025
@StephanTLavavej StephanTLavavej added the bug Something isn't working label Apr 24, 2025
@StephanTLavavej StephanTLavavej self-assigned this Apr 24, 2025
Comment thread tests/std/tests/GH_005236_collate_facet/test.cpp Outdated
@StephanTLavavej

Copy link
Copy Markdown
Member

@StephanTLavavej StephanTLavavej removed their assignment Apr 24, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Apr 24, 2025
@StephanTLavavej

Copy link
Copy Markdown
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej

Copy link
Copy Markdown
Member

@muellerj2 muellerj2 deleted the locale-collate-handling-invalid-input branch May 31, 2025 21:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

<locale>: std::collate<_Elem>::do_transform() should behave appropriately when _LStrxfrm() fails

3 participants