[Stdlib] Use pre-scanned capacity in _split() for explicit separators#5912
msaelices wants to merge 4 commits into modular:main
Conversation
Replace the hardcoded preallocation of 32 items with exact capacity using `count(sep) + 1`. When `has_maxsplit` is true, allocate `maxsplit + 1` instead. This eliminates repeated dynamic reallocations when splitting strings with many separator occurrences. Signed-off-by: Manuel Saelices <msaelices@gmail.com>
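The exact-capacity relationship this commit relies on can be illustrated in Python (a sketch, not the Mojo implementation): splitting on an explicit separator always produces `count(sep) + 1` parts, including empty ones.

```python
def exact_split_capacity(s: str, sep: str) -> int:
    # A string with n occurrences of sep splits into exactly n + 1 parts
    # (empty parts included), so count(sep) + 1 is the exact capacity
    # needed and no reallocation occurs during the split loop.
    return s.count(sep) + 1

s = "a,b,,c,"  # 4 commas -> 5 parts: ['a', 'b', '', 'c', '']
assert exact_split_capacity(s, ",") == len(s.split(","))  # 5 == 5
```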
Add `bench_string_split_dense` to measure split performance on strings with many separator occurrences (10, 100, 1000, 10000 items). This complements the existing split benchmarks that use natural language text with infrequent separators. Signed-off-by: Manuel Saelices <msaelices@gmail.com>
Signed-off-by: Manuel Saelices <msaelices@gmail.com>
Pull request overview
Updates the stdlib StringSlice._split() preallocation strategy to reduce List reallocations when splitting strings with many explicit separator occurrences, and adds coverage/measurement for dense-split scenarios.
Changes:
- Replace the fixed `_split()` preallocation heuristic (32) with pre-scanned/exact capacity sizing for explicit separators.
- Add a regression test that splits a string into >32 items.
- Add a dense-separator split benchmark (`bench_string_split_dense`) with sizes from 10 to 10k items.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| mojo/stdlib/test/collections/string/test_string_slice.mojo | Adds a test case that splits into 50 items to exercise large-result behavior. |
| mojo/stdlib/std/collections/string/string_slice.mojo | Changes _split() list capacity initialization for explicit separators based on maxsplit or count(sep) + 1. |
| mojo/stdlib/benchmarks/collections/bench_string.mojo | Adds a dense-separator split benchmark and registers it in main(). |
```diff
 @parameter
 if has_maxsplit:
-    amnt = maxsplit + 1 if maxsplit < prealloc else prealloc
+    amnt = maxsplit + 1
     output = {capacity = amnt}
 else:
```
In the `has_maxsplit` branch, initializing `output` with `capacity = maxsplit + 1` can allocate an arbitrarily large list if a caller passes a huge `maxsplit`, even when the input contains few or no separators. This is a potential OOM/DoS vector and a regression from the previous capped `prealloc` behavior. Consider bounding the capacity to a limit derived from the input, e.g. `min(maxsplit + 1, src_str.count(sep) + 1)`, or at least `min(maxsplit + 1, str_byte_len // sep_len + 1)`.
```diff
 if has_maxsplit:
-    amnt = maxsplit + 1 if maxsplit < prealloc else prealloc
+    amnt = maxsplit + 1
     output = {capacity = amnt}
 else:
```
For `has_maxsplit=False`, `src_str.count(sep)` introduces an extra full scan (`count` uses repeated `find`) before the actual split loop performs its own sequence of `find` calls. This guarantees roughly 2x the substring-search work compared to the previous heuristic prealloc and may regress performance for long strings with few separators. If the intent is to optimize only dense-separator cases, consider a hybrid strategy: start with a small capacity and only fall back to counting/reserving once growth crosses a threshold, or count only when the string exceeds a size/expected-splits heuristic.
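A hybrid strategy along the lines described above might look like this (an illustrative Python sketch; the threshold value and the helper name `split_capacity_hint` are hypothetical, not part of the PR):

```python
def split_capacity_hint(s: str, sep: str, count_threshold: int = 4096) -> int:
    # Hypothetical hybrid: only pay for the extra count() scan when the
    # input is large enough that List reallocation cost likely dominates;
    # otherwise fall back to a cheap fixed guess (the old heuristic: 32).
    if len(s) >= count_threshold:
        return s.count(sep) + 1      # exact capacity, costs one extra scan
    return min(32, len(s) + 1)       # small fixed guess for short inputs

# Short input: no counting scan, small heuristic capacity.
assert split_capacity_hint("a,b", ",") == 4
# Long dense input: exact capacity from a single count() pass.
assert split_capacity_hint("," * 5000, ",") == 5001
```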
Summary
- Preallocate `_split()` with exact capacity: `count(sep) + 1` for unlimited splits, `maxsplit + 1` for limited splits. Eliminates repeated reallocations for strings with many separator occurrences.
- Add a `bench_string_split_dense` benchmark to measure split performance on dense separator strings (10 to 10k items).

Benchmarks