{{ message }}
<regex>: Revise parsing of escape sequences#5380
Merged
StephanTLavavej merged 15 commits intoApr 10, 2025
Merged
Conversation
muellerj2
commented
Mar 30, 2025
Member
StephanTLavavej
approved these changes
Apr 9, 2025
Member
|
Thanks for this amazing overhaul and extreme attention to detail! 😻 I pushed one significant change to the product code and a few to the test code - please meow if I messed anything up. |
Closed
Member
|
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed. |
Member
|
Resolved adjacent-edit conflicts with #5364 in
|
StephanTLavavej
approved these changes
Apr 10, 2025
Member
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Fixes #5244. There is still one remaining issue related to escape sequences in basic regular expressions. I created #5379 to track this.
Changes
[\b]now matches "backspace" (see the semantics of the production in the recent ECMAScript standard, unchanged since the 3rd edition). I repurposed and renamed one of the_L_flagsto enable this oddity in the parser for ECMAScript. This flag is only set for ECMAScript and had previously been used to represent that the backslash can escape itself in ECMAScript -- but this is superfluous because the flag_L_ident_ECMAimplies the same.[\z]and[\Z]no longer match a custom character class "z" provided by a custom regex traits class. Instead, they will match the characters "z" and "Z", as the escapes also do outside of squared character classes. (This change will also make it easier to solve <regex> mishandles locale-based character classes outside of the char range #992 in the future.)\0represents the literal NUL. The standard also demands that such an escape must not be followed by another decimal digit (which is still specified this way in recent ECMAScript standards), so we now throw anerror_escapein this case. As a consequence of this change, a backreference escape can no longer be written with leading zeros (see also the most recent ECMAScript standard).\|is no longer interpreted to mean the character|. Instead, it is rejected, since its meaning is undefined according to the POSIX standard version referenced by the C++ standard (Section 9.3.2). Users can write|instead.\|implementation-defined (in a not particularly good way, I think): It states that\|could either be interpreted as the character|or an actual pipe (Section 9.3.2 in the recent standard). The standard then continues that a future version may require\|to denote a pipe.In any case, the interpretation of
\|as the character|, which was implemented before this PR, is not the one recommended by the current version of the standard.]is accepted now, as the POSIX standard requires because it does not declare]to be a special character). (I noticed this issue when I added[\]]as a test case.)_L_flagsfor this, so I check one of the flags that are set for basic regular expressions but not for any of the other supported grammars. Setting_L_paren_balfor basic regular expressions would make the parser accept wrong usages of\}and\)as well.[\a]matches the backslash and a. This was how these escape sequences were interpreted before this PR.std::regexdoes in ECMAScript mode, too), but others stick closer to the underlying extended regular expressions semantics and treat them as backslash + character.\}and\]in extended regular expressionsAccording to the POSIX standard referenced by the C++ standard, the escape sequence
\}is undefined in extended regular expressions. However, more recent POSIX versions do define it and it also feels unbalanced to reject\}while\{is accepted, so I opted to keep supporting the escape sequence\}.\[but not\]. More recent POSIX versions define\]as well in basic and extended regular expressions. Should we add support for\]?