Skip to content

[Stdlib] Use pre-scanned capacity in _split() for explicit separators#5912

Draft
msaelices wants to merge 4 commits intomodular:mainfrom
msaelices:split-preallocate
Draft

[Stdlib] Use pre-scanned capacity in _split() for explicit separators#5912
msaelices wants to merge 4 commits intomodular:mainfrom
msaelices:split-preallocate

Conversation

@msaelices
Copy link
Contributor

@msaelices msaelices commented Feb 7, 2026

Summary

  • Replace hardcoded preallocation of 32 items in _split() with exact capacity: count(sep) + 1 for unlimited splits, maxsplit + 1 for limited splits. Eliminates repeated reallocations for strings with many separator occurrences.
  • Add bench_string_split_dense benchmark to measure split performance on dense separator strings (10 to 10k items).

Benchmarks

COMPLETE

Replace the hardcoded preallocation of 32 items with exact capacity
using `count(sep) + 1`. When `has_maxsplit` is true, allocate
`maxsplit + 1` instead. This eliminates repeated dynamic reallocations
when splitting strings with many separator occurrences.

Signed-off-by: Manuel Saelices <msaelices@gmail.com>
Add `bench_string_split_dense` to measure split performance on strings
with many separator occurrences (10, 100, 1000, 10000 items). This
complements the existing split benchmarks that use natural language text
with infrequent separators.

Signed-off-by: Manuel Saelices <msaelices@gmail.com>
Copilot AI review requested due to automatic review settings February 7, 2026 00:08
@msaelices msaelices requested a review from a team as a code owner February 7, 2026 00:08
Signed-off-by: Manuel Saelices <msaelices@gmail.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the stdlib StringSlice._split() preallocation strategy to reduce List reallocations when splitting strings with many explicit separator occurrences, and adds coverage/measurement for dense-split scenarios.

Changes:

  • Replace fixed _split() preallocation heuristic (32) with pre-scanned/exact capacity sizing for explicit separators.
  • Add a regression test that splits a string into >32 items.
  • Add a dense-separator split benchmark (bench_string_split_dense) with sizes from 10 to 10k items.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
mojo/stdlib/test/collections/string/test_string_slice.mojo Adds a test case that splits into 50 items to exercise large-result behavior.
mojo/stdlib/std/collections/string/string_slice.mojo Changes _split() list capacity initialization for explicit separators based on maxsplit or count(sep) + 1.
mojo/stdlib/benchmarks/collections/bench_string.mojo Adds a dense-separator split benchmark and registers it in main().

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 2887 to +2892

@parameter
if has_maxsplit:
amnt = maxsplit + 1 if maxsplit < prealloc else prealloc
output = {capacity = amnt}
amnt = maxsplit + 1
output = {capacity = amnt}
else:
Copy link

Copilot AI Feb 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the has_maxsplit branch, initializing output with capacity = maxsplit + 1 can allocate an arbitrarily large list if callers pass a huge maxsplit (even if the input contains few/no separators). This is a potential OOM/DoS vector and a regression from the previous capped prealloc behavior. Consider bounding the capacity to a reasonable upper limit derived from the input (e.g., min(maxsplit + 1, src_str.count(sep) + 1) or at least min(maxsplit + 1, str_byte_len // sep_len + 1)).

Copilot uses AI. Check for mistakes.
Comment on lines 2889 to +2892
if has_maxsplit:
amnt = maxsplit + 1 if maxsplit < prealloc else prealloc
output = {capacity = amnt}
amnt = maxsplit + 1
output = {capacity = amnt}
else:
Copy link

Copilot AI Feb 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For has_maxsplit=False, src_str.count(sep) introduces an extra full scan (count uses repeated find) before the actual split loop performs another sequence of find calls. This guarantees at least ~2x substring search work compared to the previous heuristic prealloc and may regress performance for long strings with few separators. If the intent is to optimize only dense-separator cases, consider a hybrid strategy (e.g., start with a small capacity and only fall back to counting/reserving once growth crosses a threshold, or count only when the string is above a size/expected-splits heuristic).

Copilot uses AI. Check for mistakes.
@msaelices msaelices marked this pull request as draft February 7, 2026 11:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant