Fix jsonable_encoder crash on invalid UTF‑8 bytes#15641
Conversation
Co-authored-by: Cursor <cursoragent@cursor.com>
t0ugh-sys
left a comment
There was a problem hiding this comment.
Code Review: Fix jsonable_encoder crash on invalid UTF-8 bytes
Summary
This PR addresses a real issue — jsonable_encoder raising UnicodeDecodeError when encoding bytes values containing invalid UTF-8. The fix changes the bytes encoder from o.decode() to o.decode(errors="replace").
Observations
What's good:
- The bug is real and the fix is straightforward. Passing arbitrary binary payloads through
jsonable_encoderis a legitimate use case, and an unhandledUnicodeDecodeErroris surprising behavior. - A regression test is included, which is great. The test uses
b"\xff"(a common invalid UTF-8 byte) and asserts the replacement character\ufffdappears in the output.
Potential concerns and suggestions:
-
Lossy encoding is a design choice. Using
errors="replace"silently replaces invalid bytes with�(\ufffd). This is safe in the sense that it never crashes, but it also means the round-trip is lossy — you cannot reconstruct the original bytes from the encoded string. An alternative approach like base64 encoding (base64.b64encode(o).decode()) would be lossless, but that would be a larger behavioral change. The"replace"strategy is reasonable as a minimal fix, but it's worth being aware that this could mask data corruption in downstream usage. -
Documentation. It would be good to note this behavior somewhere (or at least in the changelog/release notes) so users know that invalid UTF-8 bytes are silently replaced rather than raising an error. Previously, a
UnicodeDecodeErrorat least signaled something was wrong. -
Test coverage. The test is minimal but adequate for the regression case. You might also consider testing a mixed string like
b"hello\xffworld"to verify that valid bytes surrounding invalid ones are preserved correctly (e.g.,"hello\ufffdworld"). -
No API surface change. The PR correctly notes this is not an API change — only the error handling behavior changes. This is a good, focused fix.
Verdict
This is a reasonable and minimal fix. The "replace" strategy is the least-invasive option and prevents unexpected crashes. LGTM with the minor suggestion to add one more test case for mixed valid/invalid byte sequences.
YuriiMotov
left a comment
There was a problem hiding this comment.
Not sure it should be default behavior (it's lossy), but probably we can go this way.
Anyway, this behavior should be documented clearly.

jsonable_encoder currently decodes bytes using the default UTF‑8 settings, which can raise UnicodeDecodeError for arbitrary binary payloads.
This PR makes bytes encoding more robust by decoding with errors="replace", so encoding never crashes and instead produces a safe, JSON-serializable string.
Update bytes encoder to o.decode(errors="replace")
Add a regression test covering b"\xff" (invalid UTF‑8) to ensure it doesn’t raise and encodes to "\ufffd"
No API changes intended—this just prevents an unexpected exception during encoding.