Skip to content

[Stdlib] Use SIMD-accelerated brace scanning in format string parsing#5919

Open
msaelices wants to merge 12 commits intomodular:mainfrom
msaelices:format-optimization
Open

[Stdlib] Use SIMD-accelerated brace scanning in format string parsing#5919
msaelices wants to merge 12 commits intomodular:mainfrom
msaelices:format-optimization

Conversation

@msaelices
Copy link
Contributor

@msaelices msaelices commented Feb 8, 2026

Summary

  • Generalize _memchr in string_slice.mojo into a single variadic function that accepts N needle characters via a VariadicPack, using SIMD to search for all of them simultaneously in a single pass
  • Inner loops use @parameter for with rebind[Scalar[dtype]] for compile-time unrolling
  • Add _find_next_brace() helper in format.mojo that uses _memchr to locate {/} characters
  • Replace byte-by-byte for loop in compile_entries_runtime() with while loop that jumps directly to the next brace
  • Add bench_format.mojo benchmark and SIMD boundary tests

Benchmark results (main vs this PR)

Median of 5 repetitions, 1000 format calls per iteration:

Benchmark main (ms) PR (ms) Speedup
literal[4B x 4] 0.46 0.39 1.2x
literal[64B x 4] 0.99 0.55 1.8x
literal[512B x 4] 4.46 0.67 6.7x
literal[4096B x 4] 28.59 2.20 13.0x
runtime_short 0.59 0.24 2.5x

@msaelices msaelices requested a review from a team as a code owner February 8, 2026 00:12
Copilot AI review requested due to automatic review settings February 8, 2026 00:12
Replace the byte-by-byte loop in `compile_entries_runtime()` with a
SIMD-vectorized `_find_next_brace()` helper that scans 16-64 bytes at
once to locate `{` and `}` characters, skipping over literal text in
chunks. Falls back to scalar at compile time.

Signed-off-by: Manuel Saelices <msaelices@gmail.com>
@msaelices msaelices force-pushed the format-optimization branch from 67ce5ad to bcf41f3 Compare February 8, 2026 00:13
@msaelices msaelices marked this pull request as draft February 8, 2026 00:15
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves runtime format-string parsing performance in std::collections::string by adding a SIMD-based scanner to locate {/} bytes more efficiently, reducing per-byte branching in the main parsing loop.

Changes:

  • Added _find_next_brace() helper that scans for {/} using SIMD masks at runtime, with scalar fallback for compile-time evaluation and short strings.
  • Updated compile_entries_runtime() to use a brace-jumping while loop driven by _find_next_brace() instead of byte-by-byte iteration.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +400 to 406
var i = 0
while True:
var next = _find_next_brace(fmt_ptr, i, fmt_len)
if next == -1:
break
i = next
if fmt_ptr[i] == `{`:
Copy link

Copilot AI Feb 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new SIMD-accelerated control flow is only taken when fmt_len is at least one SIMD vector wide, but existing format-string tests appear to primarily cover short inputs. Please add tests that exercise the SIMD path (e.g., long format strings with {/} located at and across vector boundaries, including escaped {{ / }} split across a boundary) to ensure correctness of the vectorized scan + scalar tail fallback.

Copilot uses AI. Check for mistakes.
continue
var i = 0
while True:
var next = _find_next_brace(fmt_ptr, i, fmt_len)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: This seems like a more general useful facility than just finding braces for this specific context. Something along the lines of StringSlice.find_one_of(codepoints...) or something like that.

Copy link
Contributor Author

@msaelices msaelices Feb 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ups, Just read. Actually, I found we can use the _memchr function and I did it here: 2270a30

But, it is slower than before, so I've created a new _memchr2 to find any of the two chars passed: msaelices@724458c

It could be generalized to support n-char searches, I agree, but I’m not sure whether that would be within the scope of this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a TODO line for generalizing the _memchr and _memchr2 functions: 7990e34



@always_inline
fn _find_next_brace(
Copy link
Contributor

@NathanSWard NathanSWard Feb 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment: This is a nice optimization for sure, however a few things to consider:

  1. For format-strings and when we eventually have f/t-strings this code will be run at compile time meaning the vectorized approach isn't saving us much as there is no runtime performance benefits and it may end up slowing down compile times.
  2. If you do add benchmarks for this, we need to really only be testing for small strings, where I'm not sure how much perf improvement we'll get. Since most format strings relatively small e.g. "Hello, {}! I am {}".format(...). So I'd be interested to see these cases as long format sting aren't really a very common thing 👀

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I did not know it. I guess we can parametrize the code to differentiate regular Strings vs f-ones, right?
  2. Added benchmarks.

Signed-off-by: Manuel Saelices <msaelices@gmail.com>
@msaelices msaelices force-pushed the format-optimization branch from 1d8cc74 to 3269532 Compare February 8, 2026 00:30
Replace custom SIMD scanning with two _memchr calls (one for each brace
character), reusing existing SIMD-optimized infrastructure instead of
duplicating it.

Signed-off-by: Manuel Saelices <msaelices@gmail.com>
Add _memchr2/_memchr2_impl to string_slice.mojo for finding the first
occurrence of either of two byte values in a single SIMD pass. Use it in
_find_next_brace instead of two separate _memchr calls.

Signed-off-by: Manuel Saelices <msaelices@gmail.com>
…mchr_any

Signed-off-by: Manuel Saelices <msaelices@gmail.com>
@msaelices msaelices requested a review from NathanSWard February 8, 2026 01:02
@msaelices msaelices marked this pull request as ready for review February 8, 2026 01:03
Comment on lines 2737 to 2749
# TODO: Generalize _memchr/_memchr2 into a single variadic _memchr_any that
# accepts N needle characters and builds the SIMD mask with a parameter loop,
# similar to Rust's memchr crate which provides memchr, memchr2, and memchr3.


@always_inline
fn _memchr2[
dtype: DType, //
](
source: Span[mut=False, Scalar[dtype], ...],
char1: Scalar[dtype],
char2: Scalar[dtype],
) -> source.UnsafePointerType:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is already generalizable nowadays. It should be just receiving a variadic pack and iterating over each element and oring a result variable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Done: 4582c8e

…t parsing

Add tests for format strings with braces at SIMD vector boundaries (16,
32, 64 bytes), escaped braces straddling boundaries, and braces in
scalar tail positions.

Add bench_format_runtime_short for typical short format strings like
"Hello, {}! I am {} years old and I like {}.".

Signed-off-by: Manuel Saelices <msaelices@gmail.com>
Signed-off-by: Manuel Saelices <msaelices@gmail.com>
@msaelices msaelices requested a review from martinvuyk February 10, 2026 17:37
Comment on lines 2712 to 2714
for j in range(len(chars)):
if ptr[i] == chars[j]:
return ptr + i
Copy link
Contributor

@martinvuyk martinvuyk Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO you should unroll the loop at compile time here. Similarly in other places

We might leave this more complicated version for another time, up to you:

I'm not 100% sure how well the compiler might optimize this but it's best if you give it a shove in the right direction. Not sure if this optimization is faster or not either, just eyeballing it

Suggested change
for j in range(len(chars)):
if ptr[i] == chars[j]:
return ptr + i
var data = SIMD[dtype, next_power_of_to(len(chars))](chars[0])
@parameter
for j in range(1, len(chars)):
data[j] = chars[j]
if data.eq(ptr[i]).reduce_or():
return ptr + i

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this does not work because @parameter for requires compile-time bounds, and VariadicList length is not known at compile-time in Mojo

Copy link
Contributor

@martinvuyk martinvuyk Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this does not work because @parameter for requires compile-time bounds, and VariadicList length is not known at compile-time in Mojo

Yeah that limitation has irked me for a while now (see #4144). It's why I oginally meant for this to use a VariadicPack but while I know of a way to do it currently, it will be ugly. We can polish this later on (i.e. just leave the runtime loop for now).

This actually gave me an idea of how we could allow some nicer version of it, I'll go play around and hopefully figure it out

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked that now the speed-up is even better than before 🚀

Will try the VariadicPack approach

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If variadic pack is too verbose and since variadic list doesn't have a comptime size, you could use InlineArray as an alternative too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it works like a charm: msaelices@c4a3796

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benchmarks are similar or slightly better, specially in the runtime_short bench

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! FYI I opened #5935 to see if we ever get a nicer way to implement this without the rebind workaround. If this were a public function we'd have to comptime assert that the type of each of the VariadicPack elements is Scalar[dtype], IMO we can leave it as is since this is private

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NathanSWard @martinvuyk Does this PR look good to go?

Signed-off-by: Manuel Saelices <msaelices@gmail.com>
@msaelices msaelices requested a review from martinvuyk February 10, 2026 22:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants