gh-145234: Fix SystemError in parser when \r is introduced after code…#145276
Closed
gourijain029-del wants to merge 1 commit into python:main
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the
Member
Closing as per https://devguide.python.org/getting-started/generative-ai/
Please don't randomly open AI-generated PRs.
Description
This PR fixes a SystemError (Parser/string_parser.c:286: bad argument to internal function) that occurred when a Python source file used an encoding (such as UTF-7) whose decoded output contained \r characters that were not present as raw bytes in the file.
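To make the trigger concrete, here is a minimal sketch of how a codec can introduce a \r that is absent from the raw bytes. The sample line is invented for illustration and is not taken from the PR or the linked issue; it only relies on the standard utf-7 codec, where +AA0- is the encoded form of U+000D.

```python
# "+AA0-" is the UTF-7 spelling of U+000D (carriage return).
# The sample line is hypothetical, chosen only to illustrate the effect.
raw = b"# comment+AA0-'text'\n"

decoded = raw.decode("utf-7")

print(b"\r" in raw)      # False: no carriage-return byte in the file
print("\r" in decoded)   # True: the codec produced one
print(repr(decoded))
```

Anything that looks for line endings in the raw bytes sees a single line here, while the decoded text contains a \r in the middle of it.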
Root Cause
The crash was caused by a synchronization failure between the tokenizer, the lexer, and the string parser:
Tokenizer: When the file tokenizer recoded a line (e.g., from UTF-7 to UTF-8), it was not normalizing newlines. If the codec introduced a \r, it remained in the buffer.
Lexer: The lexer skipped \r characters but did not correctly trigger "beginning-of-line" (atbol) logic. This meant that if a \r followed a comment (#...), the lexer would remain in a state where it thought it was still on the same line, causing it to merge the comment and the subsequent string literal into a single, invalid token.
String Parser: When _PyPegen_parse_string received this broken token (which didn't start with a quote character), it raised a SystemError.
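The merging behaviour described above can be illustrated with a toy scanner. This is a simplified model of the claim in this description, not CPython's lexer: the function toy_tokens and its cr_ends_line flag are hypothetical names invented for this sketch.

```python
def toy_tokens(source, cr_ends_line):
    """Split source into COMMENT and CODE tokens (toy model, not CPython's lexer)."""
    # Characters that terminate a comment in this model.
    comment_end = "\n\r" if cr_ends_line else "\n"
    tokens, i = [], 0
    while i < len(source):
        ch = source[i]
        if ch in "\r\n":
            i += 1  # skip line terminators between tokens
        elif ch == "#":
            start = i
            while i < len(source) and source[i] not in comment_end:
                i += 1
            tokens.append(("COMMENT", source[start:i]))
        else:
            start = i
            while i < len(source) and source[i] not in "\r\n#":
                i += 1
            tokens.append(("CODE", source[start:i]))
    return tokens


source = "# comment\r'text'\n"
merged = toy_tokens(source, cr_ends_line=False)
split = toy_tokens(source, cr_ends_line=True)
print(merged)  # [('COMMENT', "# comment\r'text'")]
print(split)   # [('COMMENT', '# comment'), ('CODE', "'text'")]
```

When \r does not end the line, the string literal is swallowed into the comment token; when it does, the two are tokenized separately.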
Changes
Parser/lexer/lexer.c: Updated the lexer to treat a standalone \r as a full newline. It now correctly sets atbol = 1 and resets the current token start, preventing the "merging" of tokens across lines.
Parser/tokenizer/file_tokenizer.c:
Updated tok_readline_recode to explicitly call _PyTokenizer_translate_newlines on the UTF-8 decoded buffer.
Optimized tok_underflow_file to immediately discard and re-decode the buffer as soon as a coding spec is identified, preventing raw bytes from leaking into the parser.
Lib/test/test_parser_utf7_r.py: Added a new regression test that uses a UTF-7 encoded \r to reproduce the original crash.
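The idea behind normalizing newlines after recoding can be sketched in Python. This is an assumption-laden illustration of universal-newline translation in general, not a port of _PyTokenizer_translate_newlines; the helper normalize_newlines and the sample input are hypothetical.

```python
def normalize_newlines(text):
    """Map \\r\\n and lone \\r to \\n (illustrative helper, not the C function)."""
    # Replace \r\n first so a CRLF pair becomes one newline, not two.
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Hypothetical input: a UTF-7 encoded \r ("+AA0-") after a comment,
# followed by a line that ends in a raw CRLF.
decoded = b"x = 1  # comment+AA0-y = 'text'\r\n".decode("utf-7")
normalized = normalize_newlines(decoded)

print(repr(decoded))
print(repr(normalized))
print(normalized.split("\n"))
```

After normalization the comment and the assignment that follows it sit on separate \n-terminated lines, which is the state the later parsing stages expect.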
\rs introduced after codec decoding cause SystemError #145234