-
Notifications
You must be signed in to change notification settings - Fork 3.2k
Add pcre_u guard to functions using u flag #9724
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
Conversation
|
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the Core Committers: Use this line as a base for the props when committing in SVN: To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
Test using WordPress PlaygroundThe changes in this pull request can previewed and tested using a WordPress Playground instance. WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser. Some things to be aware of
For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation. |
| } else { | ||
| if ( preg_match( '/^[a-z0-9]+(-[a-z0-9]+)*$/m', $slug ) ) { | ||
| $sanitized_slugs[] = $slug; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this PCRE pattern is only looking at US-ASCII characters and doesn’t even need the UTF-8 flag. do you see any reason not to update this simply to remove the flag?
| $pattern = "#$word#iu"; | ||
| } else { | ||
| $pattern = "#$word#i"; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect this could be another case where it’s okay to remove the UTF-8 flag, because whatever $word is, it’s going to appear here as bytes, not as source code. that means it’s already matching sequences of the requested bytes/text.
it would be good to verify this. one setup would be to have PHP using an internal_encoding of latin1 (if that’s even possible, I can’t remember if changing the internal encoding has been removed) and then testing "b\xC3\xBCch" against "#b\xFCch#i. if these match then the PCRE functions are converting text before matching. if they don’t match, I think we can probably remove the flag.
| $pattern = "#$word#iu"; | ||
| } else { | ||
| $pattern = "#$word#i"; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above: is the u flag necessary here given that we’re injecting runtime bytes into the pattern and not attempting to translate source code?
| if ( 1 === @preg_match( '/^./us', $text ) ) { | ||
| if ( 1 === preg_match( '/^./us', $text ) ) { | ||
| return $text; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this whole function has been updated in trunk. these changes are no longer relevant.
| } else { | ||
| $words_array = array( str_split( $text ) ); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have this function slated for much bigger updates. I would recommend against updating the PCRE usage here because of that.
| $decline = preg_match( '#\b\d{1,2}\.? [^\d ]+\b#u', $date ); | ||
| } else { | ||
| $decline = preg_match( '#\b\d{1,2}\.? [^\d ]+\b#', $date ); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the difference between the \b with and without the UTF-8 flag?
| $chars = array( mb_str_split( $line, 1, 'UTF-8' ) ); | ||
| } else { | ||
| $chars = array( str_split( $line ) ); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this splitting of lines into characters and looking for a backslash is something we can probably do away with entirely: a streaming approach with strpos( '\\' ) would suffice because all of the escapes are US-ASCII. this means we don’t need to split the lines and we don’t need a million string concatenations.
| $text = preg_replace( "/[\x{00a0}\x{200b}]+/u", ' ', $text ); | ||
| } else { | ||
| $text = str_replace( array( "\xc2\xa0", "\xe2\x80\x8b" ), ' ', $text ); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the regex isn’t necessary, it would seem fine to replace it directly with the str_replace(), but two ideas:
- use
strtr( $text, array( … ) ) - use the Unicode string literals like
"\u{00A0}"and"\u{200B}"
although it would be good to verify that all supported versions of PHP support that Unicode syntax without any extensions. I think they do.
This PR adds
_wp_can_use_pcre_uguards to all the functions that usepcre_umodifier flag in regex.Currently WordPress assumes that
uflag is available by default but when thepcre_usupport isn't present this falls apart and functions likeparse_shortcodes_attslike breaks returning NULL.Trac ticket: https://core.trac.wordpress.org/ticket/63913
This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.