-
Notifications
You must be signed in to change notification settings - Fork 1.4k
fix pyexpat #6582
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix pyexpat #6582
Conversation
|
Warning Rate limit exceeded@youknowone has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 5 minutes and 48 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ⛔ Files ignored due to path filters (4)
📒 Files selected for processing (1)
📝 WalkthroughWalkthroughParse now accepts either string or bytes and converts to bytes before parsing. A new optional namespace_separator is stored and used by a make_name helper to produce names; intern is initialized from the provided argument. do_parse now returns Results, Parse/ParseFile surface integer status, and ParserCreate validates namespace_separator and forwards intern/namespace_separator to the parser constructor. Changes
Sequence Diagram(s)(omitted) Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
crates/stdlib/src/pyexpat.rs (1)
362-368: ParserCreate arguments are currently unused.The
encoding,namespace_separator, andinternparameters defined inParserCreateArgsare not being used in theparser_createfunction. This means XML encoding declarations and namespace handling are not currently supported. While this appears to be existing technical debt (indicated by the_argsprefix), users may expect these parameters to have an effect based on CPython's pyexpat API.Would you like me to open an issue to track implementing support for these ParserCreate parameters?
📜 Review details
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Lib/xmlrpc/client.pyis excluded by!Lib/**
📒 Files selected for processing (1)
crates/stdlib/src/pyexpat.rs
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.rs: Follow the default rustfmt code style by runningcargo fmtto format Rust code
Always run clippy to lint code (cargo clippy) before completing tasks and fix any warnings or lints introduced by changes
Follow Rust best practices for error handling and memory management
Use the macro system (pyclass,pymodule,pyfunction, etc.) when implementing Python functionality in Rust
Files:
crates/stdlib/src/pyexpat.rs
🧠 Learnings (1)
📚 Learning: 2025-12-27T14:03:49.011Z
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-12-27T14:03:49.011Z
Learning: Applies to **/*.rs : Use the macro system (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
Applied to files:
crates/stdlib/src/pyexpat.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Run rust tests (windows-latest)
- GitHub Check: Run snippets and cpython tests (windows-latest)
- GitHub Check: Run snippets and cpython tests (macos-latest)
🔇 Additional comments (2)
crates/stdlib/src/pyexpat.rs (2)
52-52: LGTM: Import additions support the new functionality.The addition of
PyBytesRefandEitherimports correctly enables the Parse method to accept both string and bytes inputs.Also applies to: 54-54
354-360: Option with #[pyarg(any, optional)] is a valid pattern in RustPython.The codebase uses both
OptionalArg<T>andOption<T>with#[pyarg(any, optional)]. WhileOptionalArgis more commonly used,Optionis also a supported and legitimate pattern, as evidenced by its use in multiple stdlib modules (codecs.rs,sys.rs,lzma.rs,openssl.rs) and elsewhere. The change is consistent with established patterns in the codebase.
| fn parse( | ||
| &self, | ||
| data: Either<PyStrRef, PyBytesRef>, | ||
| _isfinal: OptionalArg<bool>, | ||
| vm: &VirtualMachine, | ||
| ) { | ||
| let bytes = match data { | ||
| Either::A(s) => s.as_bytes().to_vec(), | ||
| Either::B(b) => b.as_bytes().to_vec(), | ||
| }; | ||
| let reader = Cursor::<Vec<u8>>::new(bytes); | ||
| let parser = self.create_config().create_reader(reader); | ||
| self.do_parse(vm, parser); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, let's examine the actual file and the Parse method
cd /tmp && find . -name "pyexpat.rs" -path "*/stdlib/src/*" 2>/dev/null | head -1Repository: RustPython/RustPython
Length of output: 47
🏁 Script executed:
#!/bin/bash
# List files in the stdlib crates directory
find . -path "*/stdlib/src*" -name "*.rs" | head -20Repository: RustPython/RustPython
Length of output: 665
🏁 Script executed:
#!/bin/bash
# Examine the pyexpat.rs file around the specified lines
cat -n crates/stdlib/src/pyexpat.rs | sed -n '310,350p'Repository: RustPython/RustPython
Length of output: 1799
🏁 Script executed:
#!/bin/bash
# Search for Parse method definition and all its usages
rg -n "fn parse|\.parse\(" crates/stdlib/src/pyexpat.rs -A 2Repository: RustPython/RustPython
Length of output: 468
🏁 Script executed:
#!/bin/bash
# Check if Parse is a public method and look for any breaking change documentation
rg -n "pub.*parse|parse.*Either|PyBytesRef" crates/stdlib/src/pyexpat.rs -B 2 -A 5Repository: RustPython/RustPython
Length of output: 719
🏁 Script executed:
#!/bin/bash
# Check the PyBytesRef definition to verify as_bytes() method exists
rg -n "impl.*PyBytesRef|fn as_bytes" --type=rust -B 2 -A 5 | head -50Repository: RustPython/RustPython
Length of output: 2637
🏁 Script executed:
#!/bin/bash
# Look for the definition of PyBytesRef and its methods
fd -e rs -x rg "pub.*as_bytes|PyBytesRef" {} + | grep -i "as_bytes" | head -20Repository: RustPython/RustPython
Length of output: 15647
🏁 Script executed:
#!/bin/bash
# Check builtins module where PyBytesRef is likely defined
find . -path "*/builtins*" -name "*.rs" | xargs rg -l "PyBytesRef" | head -5Repository: RustPython/RustPython
Length of output: 230
🏁 Script executed:
#!/bin/bash
# Look for PyBytesRef and its as_bytes method
rg "impl.*PyBytes|fn as_bytes" -A 3 | head -40Repository: RustPython/RustPython
Length of output: 1980
🏁 Script executed:
#!/bin/bash
# Check if there are tests for the parse method
rg -n "test.*parse|Parse" crates/stdlib/src/pyexpat.rsRepository: RustPython/RustPython
Length of output: 988
🏁 Script executed:
#!/bin/bash
# Check for any Python side behavior of Parse in tests
fd -e py -path "*/test*" -exec rg -l "Parse\|pyexpat" {} \; | head -5Repository: RustPython/RustPython
Length of output: 233
🏁 Script executed:
#!/bin/bash
# Verify PyBytesRef specifically has as_bytes() method
rg -n "impl.*PyBytes|pub struct PyBytes" crates/vm/src/builtins/bytes.rs -A 10 | head -40Repository: RustPython/RustPython
Length of output: 921
🏁 Script executed:
#!/bin/bash
# Check the PyBytesRef type definition to ensure it wraps PyBytesInner
rg -n "type PyBytesRef|pub.*PyBytesRef" crates/vm/src/builtins/bytes.rs -B 2 -A 2Repository: RustPython/RustPython
Length of output: 154
🏁 Script executed:
#!/bin/bash
# Check PyBytes as_bytes() method implementation
rg -n "impl.*PyBytes|fn as_bytes" crates/vm/src/builtins/bytes.rs -A 3 | head -50Repository: RustPython/RustPython
Length of output: 1518
🏁 Script executed:
#!/bin/bash
# Find the actual as_bytes() method definition in PyBytes
rg -n "pub.*fn as_bytes|fn as_bytes" crates/vm/src/builtins/bytes.rs -B 2 -A 3Repository: RustPython/RustPython
Length of output: 178
Implementation is correct, but the signature change is not actually breaking.
Both PyStrRef and PyBytesRef have as_bytes() methods, and the conversion to Vec<u8> properly handles both string (UTF-8) and bytes inputs. However, the signature change from accepting PyStrRef to Either<PyStrRef, PyBytesRef> is backward compatible—existing code passing strings continues to work. This is a feature expansion, not a breaking change.
Note: The method has no error handling; do_parse() is called without checking its result.
🤖 Prompt for AI Agents
In crates/stdlib/src/pyexpat.rs around lines 321 to 334, the parse method
currently calls self.do_parse(vm, parser) and ignores its return value or any
errors; update the method to capture the result of do_parse, handle possible
errors (map them to appropriate Python exceptions via the VirtualMachine or
return a Result), and propagate or log failures instead of dropping them; ensure
the function signature and callers are adjusted as needed so parsing failures
produce a clear error (e.g., convert Err into a VM exception or return a
PyResult) and preserve existing behavior for successful parses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/stdlib/src/pyexpat.rs (1)
114-152: Validate the intern parameter type.The
internparameter is accepted asPyObjectRefwithout type validation. If a non-dict object is passed, it could cause runtime errors when dict operations are attempted later. Consider validating thatinternis a dict type whenSomeis provided.🔎 Proposed validation
fn new( namespace_separator: Option<String>, intern: Option<PyObjectRef>, vm: &VirtualMachine, ) -> PyResult<PyExpatLikeXmlParserRef> { - let intern_dict = intern.unwrap_or_else(|| vm.ctx.new_dict().into()); + let intern_dict = match intern { + Some(obj) => { + // Verify it's a dict + if !obj.is(&vm.ctx.types.dict_type) { + return Err(vm.new_type_error("intern must be a dictionary".to_owned())); + } + obj + } + None => vm.ctx.new_dict().into(), + }; Ok(Self {
♻️ Duplicate comments (2)
crates/stdlib/src/pyexpat.rs (2)
342-362: Silently ignoring all parsing errors may hide legitimate issues.While the comment indicates this is for compatibility with existing Python code, silently ignoring all errors from
do_parse(line 360) means legitimate parsing errors are never surfaced to the caller. Consider at least logging errors or handling specific error types that should be propagated.This concern was previously raised in past review comments.
365-377: Same error handling concern as in Parse.Like the
parsemethod, this silently ignores all parsing errors (line 375). The same concerns about hiding legitimate issues apply here.
📜 Review details
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (4)
Lib/test/test_pyexpat.pyis excluded by!Lib/**Lib/test/test_xml_dom_xmlbuilder.pyis excluded by!Lib/**Lib/test/test_xmlrpc.pyis excluded by!Lib/**Lib/xmlrpc/client.pyis excluded by!Lib/**
📒 Files selected for processing (1)
crates/stdlib/src/pyexpat.rs
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.rs: Follow the default rustfmt code style by runningcargo fmtto format Rust code
Always run clippy to lint code (cargo clippy) before completing tasks and fix any warnings or lints introduced by changes
Follow Rust best practices for error handling and memory management
Use the macro system (pyclass,pymodule,pyfunction, etc.) when implementing Python functionality in Rust
Files:
crates/stdlib/src/pyexpat.rs
🧠 Learnings (4)
📚 Learning: 2025-12-24T23:47:38.444Z
Learnt from: youknowone
Repo: RustPython/RustPython PR: 0
File: :0-0
Timestamp: 2025-12-24T23:47:38.444Z
Learning: CPython's xml.parsers.expat xmlparser exposes a namespace_prefixes boolean attribute (implemented in the pyexpat C module) that controls namespace prefix reporting via XML_SetReturnNSTriplet. This attribute is used internally by xml.dom.expatbuilder and xml.etree.ElementTree.
Applied to files:
crates/stdlib/src/pyexpat.rs
📚 Learning: 2025-11-10T06:27:41.954Z
Learnt from: youknowone
Repo: RustPython/RustPython PR: 6243
File: vm/src/function/buffer.rs:123-129
Timestamp: 2025-11-10T06:27:41.954Z
Learning: In vm/src/function/buffer.rs, line 36 in the try_rw_bytes_like method intentionally uses TypeError (not BufferError) for the "buffer is not a read-write bytes-like object" error case, even though a similar error message in ArgMemoryBuffer::try_from_borrowed_object uses BufferError. The TypeError is the intended exception type for this specific code path.
Applied to files:
crates/stdlib/src/pyexpat.rs
📚 Learning: 2025-12-27T14:03:49.011Z
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-12-27T14:03:49.011Z
Learning: Applies to **/*.rs : Use the macro system (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
Applied to files:
crates/stdlib/src/pyexpat.rs
📚 Learning: 2025-12-24T23:47:38.444Z
Learnt from: youknowone
Repo: RustPython/RustPython PR: 0
File: :0-0
Timestamp: 2025-12-24T23:47:38.444Z
Learning: CPython's xml.parsers.expat xmlparser does not expose an intern attribute. String interning for XML parsing in CPython is handled through SAX's expatreader feature_string_interning mechanism, not as a direct xmlparser.intern attribute.
Applied to files:
crates/stdlib/src/pyexpat.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: Run snippets and cpython tests (windows-latest)
- GitHub Check: Run rust tests (windows-latest)
- GitHub Check: Ensure compilation on various targets
- GitHub Check: Check the WASM package and demo
- GitHub Check: Run snippets and cpython tests (ubuntu-latest)
- GitHub Check: Run rust tests (ubuntu-latest)
- GitHub Check: Run snippets and cpython tests (macos-latest)
- GitHub Check: Run snippets and cpython tests on wasm-wasi
- GitHub Check: Run tests under miri
- GitHub Check: Run rust tests (macos-latest)
- GitHub Check: Check Rust code with clippy
🔇 Additional comments (5)
crates/stdlib/src/pyexpat.rs (5)
52-54: LGTM!The new imports are necessary for accepting both string and bytes input in the parse method.
69-70: LGTM!The
namespace_separatorfield is properly added with the correct#[pytraverse(skip)]annotation for non-Python object data.
291-297: LGTM!The
make_namehelper correctly constructs element names with namespace prefixes when both separator and namespace are present.
299-339: LGTM!The error handling is now properly structured with
Resultreturn type, andmake_nameis consistently used for namespace-aware naming.
381-388: LGTM!The change from
OptionalArg<T>toOption<T>is appropriate here since these are parsed arguments with#[pyarg(any, optional)]. Theinterntype change toPyObjectRefis also correct since it accepts dict objects.
| fn parser_create( | ||
| _args: ParserCreateArgs, | ||
| args: ParserCreateArgs, | ||
| vm: &VirtualMachine, | ||
| ) -> PyResult<PyExpatLikeXmlParserRef> { | ||
| PyExpatLikeXmlParser::new(vm) | ||
| // Validate namespace_separator: must be at most one character | ||
| let ns_sep = match args.namespace_separator { | ||
| Some(ref s) => { | ||
| let chars: Vec<char> = s.as_str().chars().collect(); | ||
| if chars.len() > 1 { | ||
| return Err(vm.new_value_error( | ||
| "namespace_separator must be at most one character, omitted, or None" | ||
| .to_owned(), | ||
| )); | ||
| } | ||
| Some(s.as_str().to_owned()) | ||
| } | ||
| None => None, | ||
| }; | ||
|
|
||
| // encoding parameter is currently not used (xml-rs handles encoding from XML declaration) | ||
| let _ = args.encoding; | ||
|
|
||
| PyExpatLikeXmlParser::new(ns_sep, args.intern, vm) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
# Search for encoding parameter usage in tests and documentation
rg -n "encoding" crates/stdlib/src/pyexpat.rs -B2 -A2
# Search for ParserCreate tests
rg -n "ParserCreate|parser_create" crates/stdlib/src/pyexpat.rs -B1 -A1
# Check if there are any test files related to pyexpat
fd -e py -e rs "test" | grep -i "expat\|parser"Repository: RustPython/RustPython
Length of output: 736
🏁 Script executed:
# Find the PyExpatLikeXmlParser implementation
rg -n "impl.*PyExpatLikeXmlParser|fn new" crates/stdlib/src/pyexpat.rs -B2 -A5
# Search for test files related to expat or xml parsing
find . -type f -name "*.py" | xargs rg -l "pyexpat|xml.parsers.expat" 2>/dev/null | head -10
# Check the struct definition and new method
rg -n "struct PyExpatLikeXmlParser|pub fn new" crates/stdlib/src/pyexpat.rs -A10Repository: RustPython/RustPython
Length of output: 878
🏁 Script executed:
# Search for pyexpat test files
find . -type f -name "*.py" -path "*/test*" | xargs rg -l "expat|ParserCreate" 2>/dev/null | head -5
# Also search for any examples or tests that mention encoding with parsers
rg -n "ParserCreate.*encoding|encoding.*parser" --type=py 2>/dev/null
# Look in the test_pyexpat or similar if it exists
find . -type f -name "*expat*" | grep -E "\.(py|rs)$"Repository: RustPython/RustPython
Length of output: 95
🏁 Script executed:
# Search for tests within pyexpat.rs
rg -n "#\[test\]|#\[cfg.*test" crates/stdlib/src/pyexpat.rs -A10
# Search for any inline examples or documentation
rg -n "///" crates/stdlib/src/pyexpat.rs -B1 -A1
# Check the full file length to understand the scope
wc -l crates/stdlib/src/pyexpat.rs
# Search for any XML declaration or encoding-related processing
rg -n "xml-rs|encoding|XML|declaration" crates/stdlib/src/pyexpat.rs -C2Repository: RustPython/RustPython
Length of output: 1422
🏁 Script executed:
# Search for cargo dependencies to understand xml-rs version
rg -n "xml-rs|xml" Cargo.toml 2>/dev/null | head -20
# Look for any Cargo.toml in stdlib crate
cat crates/stdlib/Cargo.toml | grep -A5 -B5 "xml"
# Search for any comments about Python API compatibility
rg -n "python|API|compat" crates/stdlib/src/pyexpat.rs -i | head -15Repository: RustPython/RustPython
Length of output: 151
🏁 Script executed:
# Look at the stdlib Cargo.toml directly
head -100 crates/stdlib/Cargo.toml | grep -E "xml|dependencies" -A10
# Also check if there's information about xml-rs online or in comments
rg -n "xml-rs" crates/stdlib/src/pyexpat.rsRepository: RustPython/RustPython
Length of output: 1174
🌐 Web query:
Python xml.parsers.expat ParserCreate encoding parameter behavior documentation
💡 Result:
Short answer
- xml.parsers.expat.ParserCreate(encoding=None, namespace_separator=None) accepts an optional encoding string that, if given, overrides any encoding declared or detected in the XML document. [1]
- The encoding string must name an encoding supported by Expat; Expat only has a small built‑in set (US‑ASCII, UTF‑8, UTF‑16, ISO‑8859‑1). Any other name will cause Expat to invoke its UnknownEncodingHandler. [1][2]
Sources
- Python docs: xml.parsers.expat.ParserCreate description. [1]
- Expat reference: XML_ParserCreate behavior and built‑in encodings. [2]
🏁 Script executed:
# Search for TODO, FIXME, or other markers about encoding limitations
rg -n "TODO|FIXME|HACK|NOTE|BUG|encoding" crates/stdlib/src/pyexpat.rs | head -20
# Check if there are any comments about limitations
rg -n "limitation|not supported|unsupported|not implemented|doesn't support" crates/stdlib/src/pyexpat.rs -iRepository: RustPython/RustPython
Length of output: 242
Ignoring the encoding parameter breaks Python API compatibility.
The encoding parameter in ParserCreate is part of Python's documented xml.parsers.expat API and allows overriding the encoding detected in the XML document. RustPython silently ignores this parameter because xml-rs only auto-detects from the XML declaration. This breaks code that explicitly specifies encoding before parsing. The parameter should either be supported or raise an error if provided with a non-default value, rather than silently discarding it.
🤖 Prompt for AI Agents
In crates/stdlib/src/pyexpat.rs around lines 391 to 414, the function currently
ignores the ParserCreateArgs.encoding field which breaks Python API
compatibility; change the function to check args.encoding and if Some(value)
return an appropriate Python exception (e.g., vm.new_value_error) explaining
that specifying encoding is not supported by this backend instead of silently
discarding it, or alternatively implement support by passing the provided
encoding into the underlying parser initialization; update the code path so that
when encoding is present you either raise the error with a clear message or wire
the encoding into PyExpatLikeXmlParser::new (and adjust that constructor to
accept and honor the encoding).
Summary by CodeRabbit
New Features
Bug Fixes / Behavior Changes
✏️ Tip: You can customize this high-level summary in your review settings.