Replace more alternate characters in format_skeleton #1122
akx merged 1 commit into python-babel:master
Replaces some additional characters which never appear in resource files before matching the skeleton. This replicates the behaviour of ICU.
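A minimal sketch of the replacement step described above, written as a standalone function so it runs without babel. The helper name `normalize_skeleton` and the sample option lists are made up for illustration; the character pairs are the ones this PR handles, and a character is only swapped when none of the available options uses it.

```python
# Hypothetical standalone helper for illustration; not part of babel's API.
ALTERNATE_CHARS = (('z', 'v'), ('k', 'H'), ('K', 'h'), ('a', ''), ('b', ''))


def normalize_skeleton(skeleton, options):
    """Swap alternate field characters for the forms the options use."""
    # Mirror match_skeleton's filtering of falsy options (e.g. a None key).
    options = [option for option in options if option]
    for char, replacement in ALTERNATE_CHARS:
        # Only rewrite when no available option contains the character itself.
        if char in skeleton and not any(char in option for option in options):
            skeleton = skeleton.replace(char, replacement)
    return skeleton


print(normalize_skeleton('Hmz', ['Hm', 'Hmv']))  # Hmv
print(normalize_skeleton('kmm', ['Hm', 'hm']))   # Hmm
print(normalize_skeleton('hma', ['hm', 'Hm']))   # hm
print(normalize_skeleton('Hmz', ['Hmz']))        # Hmz (an option already uses 'z')
```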
I smell a performance issue 😄 We'd now be doing 5 x O(n) lookups over skeleton.
I suggest something like:
- make the special skeleton character mapping a module-level dict (`{'z': 'v', 'k': 'H', 'K': 'h', 'a': '', 'b': ''}`)
- derive a module-level `frozenset` of the keys there
- do a first "early" check `if set(skeleton) & special_skeleton_frozen_set:` to see if we need to do any work regarding these
- ... and if yes, loop over the actual mapping and process the replacements as necessary.
This should probably be benchmarked, though.
I was also a bit worried about that, I'll run some benchmarks so we can compare
Ok here are some results. FTR I'm measuring the current PR against this diff, lmk if this is not what you meant :)
diff --git a/babel/dates.py b/babel/dates.py
index 25e66d5..442e805 100644
--- a/babel/dates.py
+++ b/babel/dates.py
@@ -1855,6 +1855,9 @@ def split_interval_pattern(pattern: str) -> list[str]:
     return [untokenize_pattern(tokens) for tokens in parts]


+alternate_chars = {'z': 'v', 'k': 'H', 'K': 'h', 'a': '', 'b': ''}
+alternate_chars_set = frozenset(alternate_chars.keys())
+
 def match_skeleton(skeleton: str, options: Iterable[str], allow_different_fields: bool = False) -> str | None:
     """
     Find the closest match for the given datetime skeleton among the options given.
@@ -1888,16 +1891,10 @@ def match_skeleton(skeleton: str, options: Iterable[str], allow_different_fields
     # Filter out falsy values and sort for stability; when `interval_formats` is passed in, there may be a None key.
     options = sorted(option for option in options if option)

-    if 'z' in skeleton and not any('z' in option for option in options):
-        skeleton = skeleton.replace('z', 'v')
-    if 'k' in skeleton and not any('k' in option for option in options):
-        skeleton = skeleton.replace('k', 'H')
-    if 'K' in skeleton and not any('K' in option for option in options):
-        skeleton = skeleton.replace('K', 'h')
-    if 'a' in skeleton and not any('a' in option for option in options):
-        skeleton = skeleton.replace('a', '')
-    if 'b' in skeleton and not any('b' in option for option in options):
-        skeleton = skeleton.replace('b', '')
+    if set(skeleton) & alternate_chars_set:
+        for char, replacement in alternate_chars.items():
+            if char in skeleton and not any(char in option for option in options):
+                skeleton = skeleton.replace(char, replacement)

     get_input_field_width = dict(t[1] for t in tokenize_pattern(skeleton) if t[0] == "field").get
     best_skeleton = None
I benchmarked two skeletons, one containing a replacement char, and one not: Hmz, yMMd.
This is the benchmark code:
# skeleton.py
from babel.dates import match_skeleton
from babel import Locale
locale = Locale.parse('en_GB')
# skeleton = 'Hmz'
skeleton = 'yMMd'
assert skeleton not in locale.datetime_skeletons
for _ in range(1000):
    match_skeleton(skeleton, locale.datetime_skeletons)

Results:
For yMMd
hyperfine 'git checkout format-skeleton && python skeleton.py' 'git checkout format-skeleton-optimized && python skeleton.py'
Benchmark 1: git checkout format-skeleton && python skeleton.py
Time (mean ± σ): 308.1 ms ± 7.2 ms [User: 290.7 ms, System: 17.3 ms]
Range (min … max): 292.9 ms … 318.0 ms 10 runs
Benchmark 2: git checkout format-skeleton-optimized && python skeleton.py
Time (mean ± σ): 314.3 ms ± 9.9 ms [User: 296.6 ms, System: 17.6 ms]
Range (min … max): 297.8 ms … 333.4 ms 10 runs
Summary
'git checkout format-skeleton && python skeleton.py' ran
1.02 ± 0.04 times faster than 'git checkout format-skeleton-optimized && python skeleton.py'

For Hmz:
hyperfine 'git checkout format-skeleton && python skeleton.py' 'git checkout format-skeleton-optimized && python skeleton.py'
Benchmark 1: git checkout format-skeleton && python skeleton.py
Time (mean ± σ): 204.5 ms ± 11.5 ms [User: 187.4 ms, System: 16.9 ms]
Range (min … max): 188.8 ms … 228.3 ms 13 runs
Benchmark 2: git checkout format-skeleton-optimized && python skeleton.py
Time (mean ± σ): 205.8 ms ± 9.2 ms [User: 189.3 ms, System: 16.4 ms]
Range (min … max): 192.3 ms … 220.0 ms 13 runs
Summary
'git checkout format-skeleton && python skeleton.py' ran
1.01 ± 0.07 times faster than 'git checkout format-skeleton-optimized && python skeleton.py'

Looks like the optimized version using the set check is actually a bit slower 🤔
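Note that each hyperfine run above also times `git checkout`, interpreter startup and the babel import, and both reported differences (1.02 ± 0.04 and 1.01 ± 0.07) fall within the measurement spread. As a rough cross-check, the two replacement strategies could be timed in isolation with the standard-library `timeit` module. The sketch below is self-contained and does not call babel: `OPTIONS` is a made-up stand-in for `locale.datetime_skeletons`, and the function names are hypothetical.

```python
import timeit

ALTERNATE_CHARS = {'z': 'v', 'k': 'H', 'K': 'h', 'a': '', 'b': ''}
ALTERNATE_CHARS_SET = frozenset(ALTERNATE_CHARS)

# Assumption: a small made-up option list; real locales carry more skeletons.
OPTIONS = sorted(['Hm', 'Hms', 'Hmv', 'hm', 'hms', 'yMd', 'yMMM', 'yMMMd', 'MMMd', 'Ed'])


def replace_simple(skeleton, options=OPTIONS):
    """The straightforward chain of checks, one per alternate character."""
    if 'z' in skeleton and not any('z' in option for option in options):
        skeleton = skeleton.replace('z', 'v')
    if 'k' in skeleton and not any('k' in option for option in options):
        skeleton = skeleton.replace('k', 'H')
    if 'K' in skeleton and not any('K' in option for option in options):
        skeleton = skeleton.replace('K', 'h')
    if 'a' in skeleton and not any('a' in option for option in options):
        skeleton = skeleton.replace('a', '')
    if 'b' in skeleton and not any('b' in option for option in options):
        skeleton = skeleton.replace('b', '')
    return skeleton


def replace_with_early_check(skeleton, options=OPTIONS):
    """The suggested variant: set intersection first, then loop over the mapping."""
    if set(skeleton) & ALTERNATE_CHARS_SET:
        for char, replacement in ALTERNATE_CHARS.items():
            if char in skeleton and not any(char in option for option in options):
                skeleton = skeleton.replace(char, replacement)
    return skeleton


def bench(func, skeleton, number=10_000, repeat=3):
    """Return the best total time in seconds for `number` calls."""
    return min(timeit.repeat(lambda: func(skeleton), number=number, repeat=repeat))


if __name__ == '__main__':
    for skeleton in ('Hmz', 'yMMd'):
        for func in (replace_simple, replace_with_early_check):
            print(f'{skeleton:5} {func.__name__:25} {bench(func, skeleton):.4f} s')
```

Timings from a micro-benchmark like this depend on the machine and Python version, so they only indicate whether the replacement step is a measurable share of `match_skeleton` at all.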
friendly reminder @akx :) Which version should we go with?
Let's keep it simple and use the obvious version now :) Thanks for measuring.

Related discussion: #1084
Replaces some additional characters which never appear in resource files before matching the skeleton. This replicates the behaviour of ICU: https://github.com/unicode-org/icu/blob/5e22f0076ec9b55056cd8a84e9ef370632f44174/icu4j/main/core/src/main/java/com/ibm/icu/text/DateIntervalInfo.java#L1090-L1102