@USERSATOSHI

This PR adds _wp_can_use_pcre_u() guards to every function that uses the PCRE u (UTF-8) modifier in a regular expression.

Currently WordPress assumes that the u modifier is always available. When PCRE is built without UTF-8 support, that assumption fails and functions such as shortcode_parse_atts() break, returning NULL.
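
As a minimal sketch of the guard pattern described above: probe once for u-modifier support, then fall back to a byte-level replacement when it is missing. The names example_can_use_pcre_u() and example_normalize_spaces() are hypothetical stand-ins for illustration, not the core functions; the two branches reuse the preg_replace()/str_replace() pair that appears later in this PR.

```php
<?php
// Hypothetical probe (illustration only): returns true when PCRE accepts the u modifier.
function example_can_use_pcre_u(): bool {
	static $supported = null;

	if ( null === $supported ) {
		// Without UTF-8 support, compiling the pattern fails and preg_match() returns false.
		$supported = ( 1 === @preg_match( '/^./u', 'a' ) );
	}

	return $supported;
}

// Hypothetical guarded caller: replaces no-break and zero-width spaces with a plain space.
function example_normalize_spaces( string $text ): string {
	if ( example_can_use_pcre_u() ) {
		$result = preg_replace( '/[\x{00a0}\x{200b}]+/u', ' ', $text );

		// preg_replace() returns NULL on error (for example, invalid UTF-8 input).
		return null === $result ? $text : $result;
	}

	// Byte-level fallback: same two characters, written as their UTF-8 bytes.
	return str_replace( array( "\xc2\xa0", "\xe2\x80\x8b" ), ' ', $text );
}
```

Returning the input unchanged when preg_replace() yields NULL is a choice made for this sketch only; the real call sites may want different error handling.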

Trac ticket: https://core.trac.wordpress.org/ticket/63913


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

@github-actions

github-actions bot commented Sep 3, 2025

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props tusharbharti.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@github-actions

github-actions bot commented Sep 3, 2025

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

} else {
	if ( preg_match( '/^[a-z0-9]+(-[a-z0-9]+)*$/m', $slug ) ) {
		$sanitized_slugs[] = $slug;
	}

Member

this PCRE pattern is only looking at US-ASCII characters and doesn’t even need the UTF-8 flag. do you see any reason not to update this simply to remove the flag?
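
A small sketch of why the flag looks removable here: every atom in the pattern is a US-ASCII byte, so a byte-wise match accepts and rejects the same slugs. The sample slugs below are made up for illustration.

```php
<?php
// Pattern from the diff above, with no u modifier: it is matched byte by byte.
$pattern = '/^[a-z0-9]+(-[a-z0-9]+)*$/m';

// Illustrative inputs (not from the PR): two valid slugs and three invalid ones.
$slugs = array( 'my-slug', 'a1-b2', 'Bad', "caf\xC3\xA9", 'trailing-' );

$sanitized_slugs = array();
foreach ( $slugs as $slug ) {
	// Any non-ASCII byte falls outside [a-z0-9-], so it can never match.
	if ( preg_match( $pattern, $slug ) ) {
		$sanitized_slugs[] = $slug;
	}
}

var_dump( $sanitized_slugs );
```

One behavioral difference worth keeping in mind: with the u modifier, a subject containing invalid UTF-8 makes preg_match() return false instead of 0. Inside an if condition, both count as "no match".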

	$pattern = "#$word#iu";
} else {
	$pattern = "#$word#i";
}

Member

I suspect this could be another case where it’s okay to remove the UTF-8 flag, because whatever $word is, it’s going to appear here as bytes, not as source code. that means it’s already matching sequences of the requested bytes/text.

it would be good to verify this. one setup would be to have PHP using an internal_encoding of latin1 (if that’s even possible, I can’t remember if changing the internal encoding has been removed) and then testing "b\xC3\xBCch" against "#b\xFCch#i". if these match then the PCRE functions are converting text before matching. if they don’t match, I think we can probably remove the flag.
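
The check proposed above could be scripted roughly like this. It is only a sketch: whether the internal_encoding setting influences the PCRE functions at all is exactly what is being tested, and the variable names are made up for illustration.

```php
<?php
// Proposed setup: switch the internal encoding to Latin-1 before matching.
ini_set( 'internal_encoding', 'ISO-8859-1' );

$subject        = "b\xC3\xBCch";     // "büch" encoded as UTF-8.
$latin1_pattern = "#b\xFCch#i";      // "ü" as the single Latin-1 byte 0xFC.
$utf8_pattern   = "#b\xC3\xBCch#i";  // "ü" as the same two UTF-8 bytes as the subject.

// A match here would mean PCRE converted the text before comparing.
$converted = preg_match( $latin1_pattern, $subject );

// A match here only requires the bytes in pattern and subject to agree.
$bytewise = preg_match( $utf8_pattern, $subject );

var_dump( $converted, $bytewise );
```

If the PCRE functions compare raw bytes, the first call should report no match and the second a match, which would support dropping the flag for patterns built from runtime strings.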

	$pattern = "#$word#iu";
} else {
	$pattern = "#$word#i";
}

Member

same as above: is the u flag necessary here given that we’re injecting runtime bytes into the pattern and not attempting to translate source code?

if ( 1 === @preg_match( '/^./us', $text ) ) {
if ( 1 === preg_match( '/^./us', $text ) ) {
return $text;
}

Member

this whole function has been updated in trunk. these changes are no longer relevant.

	} else {
		$words_array = array( str_split( $text ) );
	}
}

Member

I have this function slated for much bigger updates. I would recommend against updating the PCRE usage here because of that.

	$decline = preg_match( '#\b\d{1,2}\.? [^\d ]+\b#u', $date );
} else {
	$decline = preg_match( '#\b\d{1,2}\.? [^\d ]+\b#', $date );
}

Member

what is the difference between the \b with and without the UTF-8 flag?
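
One way to probe that question is to run both patterns over a date whose month name is not ASCII. This sketch assumes that PHP's u modifier also switches \w, \d and \b to Unicode character properties, which is worth confirming on the PHP versions in question; the sample dates are made up for illustration.

```php
<?php
// "5 января 2025": a Russian date, used here only as a non-ASCII test input.
$date = "5 \u{044F}\u{043D}\u{0432}\u{0430}\u{0440}\u{044F} 2025";

// With u: Cyrillic letters are expected to count as word characters,
// so a boundary exists between the last letter and the following space.
$with_u = preg_match( '#\b\d{1,2}\.? [^\d ]+\b#u', $date );

// Without u: every byte of a Cyrillic letter is a non-word byte,
// so there is no \b between the month name and the space after it.
$without_u = preg_match( '#\b\d{1,2}\.? [^\d ]+\b#', $date );

// ASCII month names behave the same either way.
$ascii = preg_match( '#\b\d{1,2}\.? [^\d ]+\b#', '5 March 2025' );

var_dump( $with_u, $without_u, $ascii );
```

If the results differ as the comments suggest, the fallback pattern would silently stop matching month names written in non-Latin scripts, which is the difference the question is asking about.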

	$chars = array( mb_str_split( $line, 1, 'UTF-8' ) );
} else {
	$chars = array( str_split( $line ) );
}

Member

this splitting of lines into characters and looking for a backslash is something we can probably do away with entirely: a streaming approach with strpos( '\\' ) would suffice because all of the escapes are US-ASCII. this means we don’t need to split the lines and we don’t need a million string concatenations.
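
A rough sketch of that streaming idea, under stated assumptions: example_unescape_line() is a hypothetical helper, the escape map passed to it is illustrative, and dropping a lone trailing backslash is a choice made here rather than documented behavior. It is byte-safe because the backslash byte (0x5C) never occurs inside a multi-byte UTF-8 sequence.

```php
<?php
// Hypothetical helper: resolves backslash escapes by jumping between
// backslashes with strpos() instead of splitting the line into characters.
function example_unescape_line( string $line, array $escapes ): string {
	$output = '';
	$at     = 0;
	$end    = strlen( $line );

	while ( $at < $end ) {
		$slash_at = strpos( $line, '\\', $at );
		if ( false === $slash_at ) {
			break; // No more escapes; the remainder is copied after the loop.
		}

		// Copy the untouched run before the backslash in one step.
		$output .= substr( $line, $at, $slash_at - $at );

		if ( $slash_at + 1 >= $end ) {
			$at = $end; // Lone trailing backslash: dropped in this sketch.
			break;
		}

		// Translate known escapes; pass any other escaped byte through as-is.
		$escaped = $line[ $slash_at + 1 ];
		$output .= $escapes[ $escaped ] ?? $escaped;
		$at      = $slash_at + 2;
	}

	return $output . substr( $line, $at );
}
```

For example, calling it with the map array( 'n' => "\n", 't' => "\t" ) on a line containing a backslash followed by n yields a real newline, while non-ASCII text around the escape is copied through unchanged.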

	$text = preg_replace( "/[\x{00a0}\x{200b}]+/u", ' ', $text );
} else {
	$text = str_replace( array( "\xc2\xa0", "\xe2\x80\x8b" ), ' ', $text );
}

Member

if the regex isn’t necessary, it would seem fine to replace it directly with the str_replace(), but two ideas:

  • use strtr( $text, array( … ) )
  • use the Unicode string literals like "\u{00A0}" and "\u{200B}"

although it would be good to verify that all supported versions of PHP support that Unicode syntax without any extensions. I think they do.
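
Both ideas combined might look like the sketch below. example_replace_invisible_spaces() is a hypothetical name, and the "\u{…}" escape relies on the Unicode codepoint syntax that double-quoted strings have supported since PHP 7.0, with no extension required.

```php
<?php
// Hypothetical helper: maps NO-BREAK SPACE and ZERO WIDTH SPACE to a plain space.
function example_replace_invisible_spaces( string $text ): string {
	return strtr(
		$text,
		array(
			"\u{00A0}" => ' ', // Same bytes as "\xc2\xa0".
			"\u{200B}" => ' ', // Same bytes as "\xe2\x80\x8b".
		)
	);
}
```

One thing to check before swapping: the regex in the diff uses a + quantifier, so it collapses a run of these characters into a single space, whereas strtr() and str_replace() emit one space per character.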
