[Stdlib] Use pre-scanned capacity in _split() for explicit separators#5912
msaelices wants to merge 4 commits into modular:main
Conversation
Replace the hardcoded preallocation of 32 items with exact capacity using `count(sep) + 1`. When `has_maxsplit` is true, allocate `maxsplit + 1` instead. This eliminates repeated dynamic reallocations when splitting strings with many separator occurrences. Signed-off-by: Manuel Saelices <msaelices@gmail.com>
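The exact-capacity relationship this commit relies on can be illustrated in Python (a sketch, not the Mojo implementation): splitting on an explicit separator always produces `count(sep) + 1` parts, including empty ones.

```python
def exact_split_capacity(s: str, sep: str) -> int:
    # A string with n occurrences of sep splits into exactly n + 1 parts
    # (empty parts included), so count(sep) + 1 is the exact capacity
    # needed and no reallocation occurs during the split loop.
    return s.count(sep) + 1

s = "a,b,,c,"  # 4 commas -> 5 parts: ['a', 'b', '', 'c', '']
assert exact_split_capacity(s, ",") == len(s.split(","))  # 5 == 5
```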
Add `bench_string_split_dense` to measure split performance on strings with many separator occurrences (10, 100, 1000, 10000 items). This complements the existing split benchmarks that use natural language text with infrequent separators. Signed-off-by: Manuel Saelices <msaelices@gmail.com>
Signed-off-by: Manuel Saelices <msaelices@gmail.com>
Pull request overview
Updates the stdlib StringSlice._split() preallocation strategy to reduce List reallocations when splitting strings with many explicit separator occurrences, and adds coverage/measurement for dense-split scenarios.
Changes:
- Replace the fixed `_split()` preallocation heuristic (32) with pre-scanned/exact capacity sizing for explicit separators.
- Add a regression test that splits a string into >32 items.
- Add a dense-separator split benchmark (`bench_string_split_dense`) with sizes from 10 to 10k items.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| mojo/stdlib/test/collections/string/test_string_slice.mojo | Adds a test case that splits into 50 items to exercise large-result behavior. |
| mojo/stdlib/std/collections/string/string_slice.mojo | Changes _split() list capacity initialization for explicit separators based on maxsplit or count(sep) + 1. |
| mojo/stdlib/benchmarks/collections/bench_string.mojo | Adds a dense-separator split benchmark and registers it in main(). |
```diff
 @parameter
 if has_maxsplit:
-    amnt = maxsplit + 1 if maxsplit < prealloc else prealloc
+    amnt = maxsplit + 1
     output = {capacity = amnt}
 else:
```
In the `has_maxsplit` branch, initializing `output` with `capacity = maxsplit + 1` can allocate an arbitrarily large list if a caller passes a huge `maxsplit`, even when the input contains few or no separators. This is a potential OOM/DoS vector and a regression from the previous capped `prealloc` behavior. Consider bounding the capacity to a limit derived from the input, e.g. `min(maxsplit + 1, src_str.count(sep) + 1)`, or at least `min(maxsplit + 1, str_byte_len // sep_len + 1)`.
```diff
 if has_maxsplit:
-    amnt = maxsplit + 1 if maxsplit < prealloc else prealloc
+    amnt = maxsplit + 1
     output = {capacity = amnt}
 else:
```
For `has_maxsplit=False`, `src_str.count(sep)` introduces an extra full scan (`count` uses repeated `find`) before the actual split loop performs its own sequence of `find` calls. This guarantees roughly 2x the substring-search work compared to the previous heuristic prealloc and may regress performance for long strings with few separators. If the intent is to optimize only dense-separator cases, consider a hybrid strategy: start with a small capacity and only fall back to counting/reserving once growth crosses a threshold, or count only when the string exceeds a size/expected-splits heuristic.
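A hybrid strategy along the lines described above might look like this (an illustrative Python sketch; the threshold value and the helper name `split_capacity_hint` are hypothetical, not part of the PR):

```python
def split_capacity_hint(s: str, sep: str, count_threshold: int = 4096) -> int:
    # Hypothetical hybrid: only pay for the extra count() scan when the
    # input is large enough that List reallocation cost likely dominates;
    # otherwise fall back to a cheap fixed guess (the old heuristic: 32).
    if len(s) >= count_threshold:
        return s.count(sep) + 1      # exact capacity, costs one extra scan
    return min(32, len(s) + 1)       # small fixed guess for short inputs

# Short input: no counting scan, small heuristic capacity.
assert split_capacity_hint("a,b", ",") == 4
# Long dense input: exact capacity from a single count() pass.
assert split_capacity_hint("," * 5000, ",") == 5001
```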
Summary
- Preallocate `_split()` with exact capacity: `count(sep) + 1` for unlimited splits, `maxsplit + 1` for limited splits. Eliminates repeated reallocations for strings with many separator occurrences.
- Add a `bench_string_split_dense` benchmark to measure split performance on dense separator strings (10 to 10k items).

Benchmarks