BUG: handle non-ASCII flags and tofile format strings by mjbommar · Pull Request #31257 · numpy/numpy · GitHub
Skip to content

BUG: handle non-ASCII flags and tofile format strings#31257

Open
mjbommar wants to merge 5 commits intonumpy:mainfrom
mjbommar:gh-31254-non-ascii-ascii-guards
Open

BUG: handle non-ASCII flags and tofile format strings#31257
mjbommar wants to merge 5 commits intonumpy:mainfrom
mjbommar:gh-31254-non-ascii-ascii-guards

Conversation

@mjbommar
Copy link
Copy Markdown

@mjbommar mjbommar commented Apr 16, 2026

PR summary

As described in #31254, non-ASCII strings are not handled
safely by arrayflags_setitem and PyArray_ToFile.

What to do with Unicode strings elsewhere upstream / downstream
is a complicated question that has design consequences, so
@WarrenWeckesser suggested reasonable error handling in the bug
report comments to provide the user with useful, non-segfaulting
execptions.

This PR simply implements those recommendations and adds coverage
to ensure KeyError / ValueError are properly raised when
ecnountered.

First-time contributor introduction

Hi there! Numeric/numarray user who has been on the long journey
after switching from Matlab and Fortran a long time ago...thank you
for all of your work - I know how much went into 2.0!

AI Disclosure

The original C issue was detected using Claude Agent SDK and a variety of
tools like semgrep / ast-grep / libclang.

Claude Code (Opus 4.6) and Codex CLI were used to prepare the PR branch,
although the credit for the proposed fixes themselves really go to
@WarrenWeckesser.

PS: sorry again for earlier auto-submission.

@jorenham
Copy link
Copy Markdown
Member

@jorenham jorenham closed this Apr 16, 2026
@mjbommar
Copy link
Copy Markdown
Author

Very sorry @jorenham , codex sub agent ignored instructions not to execute gh CLI commands...

I think the patch contents itself were already specified by @WarrenWeckesser in the original report that I made yesterday. I reviewed the code, ran the original MRE to confirm the patches fixed the issue, and checked that build/test are green after.

If you re-open it, I will manually edit this to clean up the template. Otherwise
I'm just going to create clutter opening another PR to fix as @WarrenWeckesser requested

Comment thread numpy/_core/src/multiarray/convert.c Outdated
if (byteobj == NULL) {
Py_DECREF(strobj);
Py_DECREF(it);
if (n4 != 0) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the full NumPy testsuite passes for me locally on this branch if I simplify this conditional to always execute per diff below. If this is genuinely needed, it might be helpful to enforce the requirement with a test. If there's a good reason it is needed, but can't be tested, perhaps an explanatory comment might at least make sense.

--- a/numpy/_core/src/multiarray/convert.c
+++ b/numpy/_core/src/multiarray/convert.c
@@ -302,7 +302,7 @@ PyArray_ToFile(PyArrayObject *self, FILE *fp, char *sep, char *format)
             if (byteobj == NULL) {
                 Py_DECREF(strobj);
                 Py_DECREF(it);
-                if (n4 != 0) {
+                if (1) {
                     PyErr_SetString(PyExc_ValueError,
                             "The `format` parameter must contain only ASCII "
                             "characters.");

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a test to catch this branch (e.g., str())

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tylerjereddy - OK, should see coverage on that branch with the extra test I pushed

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be tempted to just stick with the unicode error either way. Sure, it's not necessary immediately clear it comes from the format itself, but I suspect it's clear enough?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seberg I'd defer to you and @WarrenWeckesser, who originally proposed it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out it is probably better to stick with the unicode error as @seberg suggests. I hadn't considered Unicode string arrays in my initial comments in #31254. As it is now in the PR, if the array contains non-ASCII Unicode, and format="%s", the error message in the ValueError is not correct:

In [25]: np.__version__
Out[25]: '2.5.0.dev0+git20260416.dedf7c2'

In [26]: s = np.array(['λ123'], dtype='<U4')

In [27]: s.tofile('junk.dat', sep="; ", format="%s")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[27], line 1
----> 1 s.tofile('junk.dat', sep="; ", format="%s")

ValueError: The `format` parameter must contain only ASCII characters.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WarrenWeckesser @seberg OK, see note below, hopefully this is what you were thinking for the format case

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please just use the error as-is? I don't think it is worth going through the trouble to validate the format string explicitly just to give it an explicit ValueError.

@mjbommar
Copy link
Copy Markdown
Author

OK, hopefully this hits the mark now. Here's a summary of everything that's in the PR now:

  • Fixed arrayflags_setitem so non-ASCII flag keys no longer crash and instead follow the existing KeyError path.
  • Fixed ndarray.tofile() so non-ASCII format strings no longer crash and now raise ValueError.
  • Preserved UnicodeEncodeError for non-ASCII Unicode array element data in tofile(), instead of misreporting it as a format error.
  • Covered the reviewed format="%s" Unicode-array case so it now raises UnicodeEncodeError rather than a misleading ValueError.
  • Added regression tests for non-ASCII flag keys, non-ASCII format, non-ASCII element data without format, and non-ASCII element data with ASCII format="%s".
  • Added a release note describing the crash fixes and resulting exception behavior.

If you want me to clean up the branch so it's clean without the commit history, just let me know and I'll push a single clean commit. Wasn't sure if that will erase the code review or break the GH UI now. I'm afraid to do anything these days with all the changes they made...

Copy link
Copy Markdown
Member

@ngoldbaum ngoldbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I triggered CI but unfortunately github actions is having issues and there are lots of unrelated test failures. Someone is probably going to need to re-trigger.

I left a comment below about C API style and another nit about how you classified the release note. Please take or leave them as you see fit.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

total nit but this should probably be labeled as an "improvement" rather than a "change"

Copy link
Copy Markdown
Member

@ngoldbaum ngoldbaum Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way to handle the repeated cleanup code is to add a cleanup goto label below here that does Py_XDECREF on all the objects that need cleanup in error paths and then use goto cleanup in error paths above.

You can even define the return value like int ret = -1 in the function header and then set ret = 0 here and be sure then return ret in the cleanup code to unify the error paths and the happy path.

You also need to make sure that you initialize all the variables to NULL before the first use. The cost of this approach is some extra NULL checks which IMO is worth it for the improved maintainability and readability.

IMO that's better style for CPython C API code (despite the use of goto) because it avoids duplicate or almost-duplicate cleanup code which can sometimes be tricky to reason about.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants