BUG: handle non-ASCII flags and tofile format strings#31257
BUG: handle non-ASCII flags and tofile format strings#31257mjbommar wants to merge 5 commits intonumpy:mainfrom
Conversation
|
Very sorry @jorenham , codex sub agent ignored instructions not to execute gh CLI commands... I think the patch contents itself were already specified by @WarrenWeckesser in the original report that I made yesterday. I reviewed the code, ran the original MRE to confirm the patches fixed the issue, and checked that build/test are green after. If you re-open it, I will manually edit this to clean up the template. Otherwise |
| if (byteobj == NULL) { | ||
| Py_DECREF(strobj); | ||
| Py_DECREF(it); | ||
| if (n4 != 0) { |
There was a problem hiding this comment.
Note that the full NumPy testsuite passes for me locally on this branch if I simplify this conditional to always execute per diff below. If this is genuinely needed, it might be helpful to enforce the requirement with a test. If there's a good reason it is needed, but can't be tested, perhaps an explanatory comment might at least make sense.
--- a/numpy/_core/src/multiarray/convert.c
+++ b/numpy/_core/src/multiarray/convert.c
@@ -302,7 +302,7 @@ PyArray_ToFile(PyArrayObject *self, FILE *fp, char *sep, char *format)
if (byteobj == NULL) {
Py_DECREF(strobj);
Py_DECREF(it);
- if (n4 != 0) {
+ if (1) {
PyErr_SetString(PyExc_ValueError,
"The `format` parameter must contain only ASCII "
"characters.");There was a problem hiding this comment.
I'll add a test to catch this branch (e.g., str())
There was a problem hiding this comment.
@tylerjereddy - OK, should see coverage on that branch with the extra test I pushed
There was a problem hiding this comment.
I would be tempted to just stick with the unicode error either way. Sure, it's not necessary immediately clear it comes from the format itself, but I suspect it's clear enough?
There was a problem hiding this comment.
@seberg I'd defer to you and @WarrenWeckesser, who originally proposed it.
There was a problem hiding this comment.
It turns out it is probably better to stick with the unicode error as @seberg suggests. I hadn't considered Unicode string arrays in my initial comments in #31254. As it is now in the PR, if the array contains non-ASCII Unicode, and format="%s", the error message in the ValueError is not correct:
In [25]: np.__version__
Out[25]: '2.5.0.dev0+git20260416.dedf7c2'
In [26]: s = np.array(['λ123'], dtype='<U4')
In [27]: s.tofile('junk.dat', sep="; ", format="%s")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[27], line 1
----> 1 s.tofile('junk.dat', sep="; ", format="%s")
ValueError: The `format` parameter must contain only ASCII characters.
There was a problem hiding this comment.
@WarrenWeckesser @seberg OK, see note below, hopefully this is what you were thinking for the format case
There was a problem hiding this comment.
Can we please just use the error as-is? I don't think it is worth going through the trouble to validate the format string explicitly just to give it an explicit ValueError.
|
OK, hopefully this hits the mark now. Here's a summary of everything that's in the PR now:
If you want me to clean up the branch so it's clean without the commit history, just let me know and I'll push a single clean commit. Wasn't sure if that will erase the code review or break the GH UI now. I'm afraid to do anything these days with all the changes they made... |
ngoldbaum
left a comment
There was a problem hiding this comment.
I triggered CI but unfortunately github actions is having issues and there are lots of unrelated test failures. Someone is probably going to need to re-trigger.
I left a comment below about C API style and another nit about how you classified the release note. Please take or leave them as you see fit.
There was a problem hiding this comment.
total nit but this should probably be labeled as an "improvement" rather than a "change"
There was a problem hiding this comment.
Another way to handle the repeated cleanup code is to add a cleanup goto label below here that does Py_XDECREF on all the objects that need cleanup in error paths and then use goto cleanup in error paths above.
You can even define the return value like int ret = -1 in the function header and then set ret = 0 here and be sure then return ret in the cleanup code to unify the error paths and the happy path.
You also need to make sure that you initialize all the variables to NULL before the first use. The cost of this approach is some extra NULL checks which IMO is worth it for the improved maintainability and readability.
IMO that's better style for CPython C API code (despite the use of goto) because it avoids duplicate or almost-duplicate cleanup code which can sometimes be tricky to reason about.

PR summary
As described in #31254, non-ASCII strings are not handled
safely by
arrayflags_setitemandPyArray_ToFile.What to do with Unicode strings elsewhere upstream / downstream
is a complicated question that has design consequences, so
@WarrenWeckesser suggested reasonable error handling in the bug
report comments to provide the user with useful, non-segfaulting
execptions.
This PR simply implements those recommendations and adds coverage
to ensure
KeyError/ValueErrorare properly raised whenecnountered.
First-time contributor introduction
Hi there! Numeric/numarray user who has been on the long journey
after switching from Matlab and Fortran a long time ago...thank you
for all of your work - I know how much went into 2.0!
AI Disclosure
The original C issue was detected using Claude Agent SDK and a variety of
tools like semgrep / ast-grep / libclang.
Claude Code (Opus 4.6) and Codex CLI were used to prepare the PR branch,
although the credit for the proposed fixes themselves really go to
@WarrenWeckesser.
PS: sorry again for earlier auto-submission.