gh-95555: Add the CATEGORY_UCD opcode and the simple enumerated properties by serhiy-storchaka · Pull Request #153023 · python/cpython · GitHub

gh-95555: Add the CATEGORY_UCD opcode and the simple enumerated properties#153023

Open
serhiy-storchaka wants to merge 1 commit into python:main from serhiy-storchaka:re-prop-t1

@serhiy-storchaka serhiy-storchaka commented Jul 4, 2026

Member

First of four stacked PRs that complete \p{...} support with the properties backed by the Unicode Character Database, matched in C through a unicodedata capsule. This PR adds the machinery and the simplest tier: the enumerated properties stored as a single byte in the per-character record — Bidi_Class (bc), East_Asian_Width (ea), Grapheme_Cluster_Break (gcb) and Indic_Conjunct_Break (incb).
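To make the selected properties concrete: two of the four already have public accessors in unicodedata, `unicodedata.bidirectional()` for Bidi_Class and `unicodedata.east_asian_width()` for East_Asian_Width. The sketch below emulates what `\p{bc=AL}` and `\p{ea=W}` would select, in pure Python. The helper name `matches_property` is hypothetical and not part of this PR, and gcb/incb are left out because no public accessor for them is assumed here.

```python
import unicodedata

# Hypothetical mapping from a short property name to an existing
# unicodedata accessor. Only bc and ea are covered in this sketch.
_ACCESSORS = {
    "bc": unicodedata.bidirectional,
    "ea": unicodedata.east_asian_width,
}

def matches_property(ch, prop, value, negate=False):
    # Emulates a single property test on one character; negate=True
    # corresponds to the uppercase (negated) form of the escape.
    actual = _ACCESSORS[prop](ch)
    return (actual == value) != negate

# ARABIC LETTER ALEF has Bidi_Class AL; LATIN SMALL LETTER A does not.
print(matches_property("\u0627", "bc", "AL"))
print(matches_property("a", "bc", "AL", negate=True))
```

This only approximates the semantics in Python; the PR itself performs the comparison in C through the capsule.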

unicodedata exports the _ucd_re_CAPI capsule (following the \N{...} precedent) and _ucd_re_info(), which lists the property selectors and value names. The parser resolves a property name and value to a selector and a value index; the new CATEGORY_UCD opcode packs (negate, property, value) into one operand and is matched in C by sre_category_ucd(), which compares the index returned by the capsule against the packed value index. Because a negated single value is one charset item, \P{bc=AL} composes inside a set.
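The packing and matching steps can be sketched in Python as follows. The bit layout here is an assumption made purely for illustration (the description does not state the real one), and `pack_operand`, `unpack_operand`, `category_ucd` and the `lookup` callback are hypothetical names standing in for the parser, `sre_category_ucd()` and the capsule call.

```python
# Assumed layout, for illustration only:
#   bit 0      negate flag
#   bits 1-8   property selector
#   bits 9+    value index
NEGATE_BIT = 1
PROP_SHIFT = 1
PROP_MASK = 0xFF
VALUE_SHIFT = 9

def pack_operand(negate, prop, value):
    # Combine the three fields into a single integer operand.
    if not 0 <= prop <= PROP_MASK:
        raise ValueError("property selector out of range")
    if value < 0:
        raise ValueError("value index must be non-negative")
    return (value << VALUE_SHIFT) | (prop << PROP_SHIFT) | int(bool(negate))

def unpack_operand(operand):
    # Recover (negate, prop, value) from a packed operand.
    negate = bool(operand & NEGATE_BIT)
    prop = (operand >> PROP_SHIFT) & PROP_MASK
    value = operand >> VALUE_SHIFT
    return negate, prop, value

def category_ucd(operand, lookup, ch):
    # lookup(prop, ch) stands in for the capsule call and returns
    # the value index of property `prop` for character `ch`.
    negate, prop, value = unpack_operand(operand)
    return (lookup(prop, ch) == value) != negate
```

Keeping the negate flag inside the operand means a negated test is still one self-contained item, which is consistent with the statement that `\P{bc=AL}` can appear as a single member of a set.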

The follow-up PRs add, in order: the numeric and binary properties (ccc, Bidi_Mirrored, Extended_Pictographic), the computed properties (Block, Decomposition_Type, Numeric_Type) and the remaining General_Category values and groups (Ll, Lo, M/P/S) with POSIX punct.

… properties

Introduce the unicodedata capsule used to match \p{...} properties that need
the Unicode Character Database, starting with the simplest tier: the
enumerated properties stored as a single byte in the per-character record --
Bidi_Class (bc), East_Asian_Width (ea), Grapheme_Cluster_Break (gcb) and
Indic_Conjunct_Break (incb).

unicodedata exports the _ucd_re_CAPI capsule (the \N{...} precedent) and
_ucd_re_info(), which lists the property selectors and value names.  The new
CATEGORY_UCD opcode packs (negate, property, value) and is matched in C by
sre_category_ucd().  A negated single value is one charset item, so \P{bc=AL}
composes inside a set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
