[Stdlib] Use SIMD-accelerated brace scanning in format string parsing#5919
[Stdlib] Use SIMD-accelerated brace scanning in format string parsing#5919msaelices wants to merge 12 commits intomodular:mainfrom
Conversation
Replace the byte-by-byte loop in `compile_entries_runtime()` with a
SIMD-vectorized `_find_next_brace()` helper that scans 16-64 bytes at
once to locate `{` and `}` characters, skipping over literal text in
chunks. Falls back to scalar at compile time.
Signed-off-by: Manuel Saelices <msaelices@gmail.com>
67ce5ad to
bcf41f3
Compare
There was a problem hiding this comment.
Pull request overview
This PR improves runtime format-string parsing performance in std::collections::string by adding a SIMD-based scanner to locate {/} bytes more efficiently, reducing per-byte branching in the main parsing loop.
Changes:
- Added
_find_next_brace()helper that scans for{/}using SIMD masks at runtime, with scalar fallback for compile-time evaluation and short strings. - Updated
compile_entries_runtime()to use a brace-jumpingwhileloop driven by_find_next_brace()instead of byte-by-byte iteration.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| var i = 0 | ||
| while True: | ||
| var next = _find_next_brace(fmt_ptr, i, fmt_len) | ||
| if next == -1: | ||
| break | ||
| i = next | ||
| if fmt_ptr[i] == `{`: |
There was a problem hiding this comment.
The new SIMD-accelerated control flow is only taken when fmt_len is at least one SIMD vector wide, but existing format-string tests appear to primarily cover short inputs. Please add tests that exercise the SIMD path (e.g., long format strings with {/} located at and across vector boundaries, including escaped {{ / }} split across a boundary) to ensure correctness of the vectorized scan + scalar tail fallback.
| continue | ||
| var i = 0 | ||
| while True: | ||
| var next = _find_next_brace(fmt_ptr, i, fmt_len) |
There was a problem hiding this comment.
question: This seems like a more general useful facility than just finding braces for this specific context. Something along the lines of StringSlice.find_one_of(codepoints...) or something like that.
There was a problem hiding this comment.
Ups, Just read. Actually, I found we can use the _memchr function and I did it here: 2270a30
But, it is slower than before, so I've created a new _memchr2 to find any of the two chars passed: msaelices@724458c
It could be generalized to support n-char searches, I agree, but I’m not sure whether that would be within the scope of this PR.
There was a problem hiding this comment.
I've added a TODO line for generalizing the _memchr and _memchr2 functions: 7990e34
|
|
||
|
|
||
| @always_inline | ||
| fn _find_next_brace( |
There was a problem hiding this comment.
comment: This is a nice optimization for sure, however a few things to consider:
- For format-strings and when we eventually have
f/t-stringsthis code will be run at compile time meaning the vectorized approach isn't saving us much as there is no runtime performance benefits and it may end up slowing down compile times. - If you do add benchmarks for this, we need to really only be testing for small strings, where I'm not sure how much perf improvement we'll get. Since most format strings relatively small e.g.
"Hello, {}! I am {}".format(...). So I'd be interested to see these cases as long format sting aren't really a very common thing 👀
There was a problem hiding this comment.
- I did not know it. I guess we can parametrize the code to differentiate regular Strings vs f-ones, right?
- Added benchmarks.
Signed-off-by: Manuel Saelices <msaelices@gmail.com>
1d8cc74 to
3269532
Compare
Replace custom SIMD scanning with two _memchr calls (one for each brace character), reusing existing SIMD-optimized infrastructure instead of duplicating it. Signed-off-by: Manuel Saelices <msaelices@gmail.com>
Add _memchr2/_memchr2_impl to string_slice.mojo for finding the first occurrence of either of two byte values in a single SIMD pass. Use it in _find_next_brace instead of two separate _memchr calls. Signed-off-by: Manuel Saelices <msaelices@gmail.com>
…mchr_any Signed-off-by: Manuel Saelices <msaelices@gmail.com>
| # TODO: Generalize _memchr/_memchr2 into a single variadic _memchr_any that | ||
| # accepts N needle characters and builds the SIMD mask with a parameter loop, | ||
| # similar to Rust's memchr crate which provides memchr, memchr2, and memchr3. | ||
|
|
||
|
|
||
| @always_inline | ||
| fn _memchr2[ | ||
| dtype: DType, // | ||
| ]( | ||
| source: Span[mut=False, Scalar[dtype], ...], | ||
| char1: Scalar[dtype], | ||
| char2: Scalar[dtype], | ||
| ) -> source.UnsafePointerType: |
There was a problem hiding this comment.
I think this is already generalizable nowadays. It should be just receiving a variadic pack and iterating over each element and oring a result variable
…t parsing
Add tests for format strings with braces at SIMD vector boundaries (16,
32, 64 bytes), escaped braces straddling boundaries, and braces in
scalar tail positions.
Add bench_format_runtime_short for typical short format strings like
"Hello, {}! I am {} years old and I like {}.".
Signed-off-by: Manuel Saelices <msaelices@gmail.com>
Signed-off-by: Manuel Saelices <msaelices@gmail.com>
| for j in range(len(chars)): | ||
| if ptr[i] == chars[j]: | ||
| return ptr + i |
There was a problem hiding this comment.
IMO you should unroll the loop at compile time here. Similarly in other places
We might leave this more complicated version for another time, up to you:
I'm not 100% sure how well the compiler might optimize this but it's best if you give it a shove in the right direction. Not sure if this optimization is faster or not either, just eyeballing it
| for j in range(len(chars)): | |
| if ptr[i] == chars[j]: | |
| return ptr + i | |
| var data = SIMD[dtype, next_power_of_to(len(chars))](chars[0]) | |
| @parameter | |
| for j in range(1, len(chars)): | |
| data[j] = chars[j] | |
| if data.eq(ptr[i]).reduce_or(): | |
| return ptr + i |
There was a problem hiding this comment.
I think this does not work because @parameter for requires compile-time bounds, and VariadicList length is not known at compile-time in Mojo
There was a problem hiding this comment.
I think this does not work because
@parameter forrequires compile-time bounds, andVariadicListlength is not known at compile-time in Mojo
Yeah that limitation has irked me for a while now (see #4144). It's why I oginally meant for this to use a VariadicPack but while I know of a way to do it currently, it will be ugly. We can polish this later on (i.e. just leave the runtime loop for now).
This actually gave me an idea of how we could allow some nicer version of it, I'll go play around and hopefully figure it out
There was a problem hiding this comment.
I've checked that now the speed-up is even better than before 🚀
Will try the VariadicPack approach
There was a problem hiding this comment.
If variadic pack is too verbose and since variadic list doesn't have a comptime size, you could use InlineArray as an alternative too.
There was a problem hiding this comment.
it works like a charm: msaelices@c4a3796
There was a problem hiding this comment.
The benchmarks are similar or slightly better, specially in the runtime_short bench
There was a problem hiding this comment.
Nice! FYI I opened #5935 to see if we ever get a nicer way to implement this without the rebind workaround. If this were a public function we'd have to comptime assert that the type of each of the VariadicPack elements is Scalar[dtype], IMO we can leave it as is since this is private
There was a problem hiding this comment.
@NathanSWard @martinvuyk Does this PR look good to go?
Signed-off-by: Manuel Saelices <msaelices@gmail.com>
Summary
_memchrinstring_slice.mojointo a single variadic function that accepts N needle characters via aVariadicPack, using SIMD to search for all of them simultaneously in a single pass@parameter forwithrebind[Scalar[dtype]]for compile-time unrolling_find_next_brace()helper informat.mojothat uses_memchrto locate{/}charactersforloop incompile_entries_runtime()withwhileloop that jumps directly to the next bracebench_format.mojobenchmark and SIMD boundary testsBenchmark results (
mainvs this PR)Median of 5 repetitions, 1000 format calls per iteration:
literal[4B x 4]literal[64B x 4]literal[512B x 4]literal[4096B x 4]runtime_short