
[Stdlib] Add SIMD-optimized List.__contains__() for scalar types #5918

Open
msaelices wants to merge 13 commits into modular:main from msaelices:contains-optimization

Conversation

Contributor

@msaelices msaelices commented Feb 7, 2026

Summary

  • Delegate List.__contains__ for scalar types to Span.__contains__ instead of duplicating SIMD logic
  • Fix Span.__contains__ SIMD width cascade to include widths 4 and 2, enabling SIMD for 64-bit types on AVX2
  • Add benchmarks for Int, Float64, and UInt8 across sizes 10-100k (50 repetitions)
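For readers unfamiliar with the technique, here is a minimal Python sketch of the chunked-scan idea behind Span.__contains__: try the widest chunk size first, fall through to smaller widths, and finish any remainder element by element. The widths tuple and function name are illustrative, not the stdlib's actual code, and the `any(...)` stands in for a SIMD lane-wise compare plus reduce-or.

```python
def simd_style_contains(items, target, widths=(16, 8, 4, 2)):
    """Scan `items` in chunks, cascading down through smaller chunk
    widths, then check any leftover elements one at a time."""
    n = len(items)
    i = 0
    for w in widths:
        # Process as many full chunks of width w as still fit.
        while n - i >= w:
            chunk = items[i:i + w]
            # Stand-in for a vectorized lane compare + reduce_or.
            if any(x == target for x in chunk):
                return True
            i += w
    # Scalar tail for the last few elements.
    while i < n:
        if items[i] == target:
            return True
        i += 1
    return False
```

Including widths 4 and 2 in the cascade is what lets 64-bit element types use chunked scanning at all when the hardware only offers 4 lanes per register.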

Benchmark results (main vs this branch)

Machine: x86_64, Intel i7-12700H (AVX2, 256-bit SIMD), 50 repetitions

Benchmark    Size     main (ms)  branch (ms)  Speedup
Float64      10       0.000924   0.000404     2.3x
Float64      50       0.003033   0.001032     2.9x
Float64      100      0.005506   0.000785     7.0x
Float64      1,000    0.034644   0.010380     3.3x
Float64      10,000   0.355334   0.073229     4.9x
Float64      100,000  3.227314   1.003895     3.2x
UInt8        50       0.008404   0.001952     4.3x
UInt8        100      0.011563   0.000948     12.2x
UInt8        1,000    0.032597   0.001294     25.2x
UInt8        10,000   0.369522   0.012411     29.8x
UInt8        100,000  2.731753   0.098097     27.8x
Int (hit)    100      0.006290   0.003364     1.9x
Int (miss)   100      0.015014   0.006682     2.2x
Int (hit)    100,000  3.515771   3.039424     1.2x

Add a specialized `__contains__` overload for `List[Scalar[dtype]]` that
uses SIMD vectorized comparison instead of element-by-element scanning.
This yields 4-32x speedups depending on element size and SIMD width.

Signed-off-by: Manuel Saelices <msaelices@gmail.com>
Add benchmarks covering Int (hit/miss), Float64, and UInt8 element
types across sizes from 10 to 100,000 elements.

Signed-off-by: Manuel Saelices <msaelices@gmail.com>
@msaelices msaelices requested a review from a team as a code owner February 7, 2026 22:50
Copilot AI review requested due to automatic review settings February 7, 2026 22:50

Copilot AI left a comment


Pull request overview

This PR adds a SIMD-specialized List.__contains__ implementation for scalar element types to speed up membership checks, along with tests and a new benchmark to measure performance across common scalar types.

Changes:

  • Added a SIMD-vectorized __contains__ overload for List[Scalar[dtype]] in the standard library.
  • Extended test_list_contains to exercise scalar list membership behavior on larger inputs.
  • Introduced a new collections benchmark for List.__contains__ across Int, Float64, and UInt8 sizes.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
mojo/stdlib/std/collections/list.mojo Adds SIMD-optimized List.__contains__ overload for scalar element lists.
mojo/stdlib/test/collections/test_list.mojo Adds tests intended to cover the SIMD-optimized membership path for scalar types.
mojo/stdlib/benchmarks/collections/bench_list_contains.mojo Adds a new benchmark suite for list membership across types and sizes.


Comment on lines 909 to 917
# Test with UInt8.
var bytes = List[UInt8]()
for i in range(256):
    bytes.append(UInt8(i))

assert_true(UInt8(0) in bytes)
assert_true(UInt8(127) in bytes)
assert_true(UInt8(255) in bytes)


Copilot AI Feb 7, 2026


The UInt8 tests only exercise the hit path: bytes contains every value 0..255, so there’s no way for x in bytes to return False. Add a UInt8 miss case (e.g., build a list missing one value and assert it’s not contained) to cover the negative path of the SIMD overload.


Comment on lines 34 to 38
fn make_list_uint8[size: Int]() -> List[UInt8]:
    var l = List[UInt8](capacity=size)
    for i in range(size):
        l.append(UInt8(i % 256))
    return l^

Copilot AI Feb 7, 2026


make_list_uint8 repeats values for sizes > 256 (UInt8(i % 256)), so target in items will typically succeed very early (first occurrence within the first 256 elements) rather than scanning ~size/2 elements as intended. This makes the UInt8 benchmark results not representative; consider constructing data where the first match occurs near the middle (or add an explicit miss benchmark) to measure the real traversal cost.
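The early-match problem flagged here is easy to verify with a quick Python check mirroring the benchmark's `UInt8(i % 256)` construction (sizes and names are illustrative):

```python
size = 10_000
items = [i % 256 for i in range(size)]       # mirrors UInt8(i % 256)
target = (size // 2) % 256                   # value the benchmark searches for

# The target's first occurrence is inside the first 256 elements,
# nowhere near the middle, so a scan terminates almost immediately.
first = items.index(target)
print(first)  # → 136, not 5000
```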


Comment on lines 105 to 109
fn bench_list_contains_uint8[size: Int](mut b: Bencher) raises:
    """Search for a UInt8 element present in the middle of the list."""
    var items = make_list_uint8[size]()
    var target = UInt8((size // 2) % 256)

Copilot AI Feb 7, 2026


The UInt8 benchmark docstring says the element is “present in the middle of the list”, but with the current data generation (repeating 0..255), the target’s first occurrence is near the beginning for large sizes. Update the benchmark setup (and/or docstring) so the measured access pattern matches the description.


Signed-off-by: Manuel Saelices <msaelices@gmail.com>
Ensure value 0 first appears at position size // 2 so the benchmark
measures scanning half the list, not finding a match early due to
repeating values.

Signed-off-by: Manuel Saelices <msaelices@gmail.com>
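The midpoint construction described in the commit above can be sketched in Python. The helper name and exact fill pattern are hypothetical (the real helper is in the Mojo benchmark file); the point is that 0 never occurs before position size // 2, so the scan must traverse half the list:

```python
def make_uint8_data(size):
    """Fill with repeating nonzero bytes 1..255, then place the
    target value 0 at the midpoint."""
    items = [1 + (i % 255) for i in range(size)]  # values 1..255, never 0
    if size > 0:
        items[size // 2] = 0
    return items

data = make_uint8_data(10_000)
assert data.index(0) == 5_000       # first match is exactly at the middle
assert 0 not in data[:5_000]        # no accidental early hit
```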
Replace the SIMD-optimized List.__contains__ overload for scalar types
with a simple delegation to Span.__contains__, which already has an
equivalent (and more sophisticated) SIMD implementation. This eliminates
code duplication and addresses PR review feedback.

Signed-off-by: Manuel Saelices <msaelices@gmail.com>
Add widths 4 and 2 to the SIMD width cascade in Span.__contains__.
Previously the smallest width was 8, which meant no SIMD code was
generated for 64-bit types (Int, Float64) on AVX2 where
simd_width_of returns 4. This gives 2-3x speedups for Float64
containment checks.

Signed-off-by: Manuel Saelices <msaelices@gmail.com>
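To see why the extra widths matter: simd_width_of returns the lane count of the target's vector registers, and a cascade whose smallest width is 8 never fires when the native width is only 4. A small Python check of the arithmetic, assuming 256-bit AVX2 registers (the helper is illustrative, not a Mojo API):

```python
REGISTER_BITS = 256  # AVX2 vector register width

def simd_width(dtype_bits, register_bits=REGISTER_BITS):
    """Lane count for a given element size, i.e. what Mojo's
    simd_width_of reports on this target (illustrative helper)."""
    return register_bits // dtype_bits

# 64-bit types (Int, Float64) get only 4 lanes on AVX2, so a cascade
# bottoming out at width 8 fell back to scalar code for them.
assert simd_width(64) == 4
# 8-bit types already had plenty of lanes for the old width-8 minimum.
assert simd_width(8) == 32
```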
Increase num_repetitions from 5 to 50 for more stable and
representative benchmark results.

Signed-off-by: Manuel Saelices <msaelices@gmail.com>
@msaelices msaelices requested a review from NathanSWard February 8, 2026 22:26
@parameter
fn call_fn() raises:
    for _ in range(100):
        var res = target in items
Contributor


I'm not totally sure we need these benchmarks as this PR is not doing any performance improvements, it's merely delegating the call to Span.__contains__.
If we do keep the benchmarks updating them to Span feels like a better approach.
However, if you just want to remove this benchmark, I'd be okay with that too :)


Comment on lines 74 to 81
fn call_fn() raises:
    for _ in range(100):
        var res = target in items
        keep(res)

b.iter[call_fn]()
keep(Bool(items))
keep(target)
Contributor


A few style suggestions on the benchmarks if we do keep them (IMO it'll be easier to just remove them):

  1. Let's use black_box in the testing function
  2. Prefer the overload of .iter that works with unified closures - this allows you to remove the items/target at the end of this function as they were only needed because @parameter closures emit a false-positive warning about "unused variables".
  3. No need to mark the function raises unless it does actually raise an error.
Suggested change

Before:

fn call_fn() raises:
    for _ in range(100):
        var res = target in items
        keep(res)

b.iter[call_fn]()
keep(Bool(items))
keep(target)

After:

fn call_fn() unified {read}:
    for _ in range(100):
        var res = black_box(target) in items
        keep(res)

b.iter(call_fn)

The PR now just delegates to Span.__contains__, so separate
List-specific benchmarks are unnecessary.

Signed-off-by: Manuel Saelices <msaelices@gmail.com>