[Stdlib] Add SIMD-optimized List.__contains__() for scalar types #5918
msaelices wants to merge 13 commits into modular:main
Conversation
Add a specialized `__contains__` overload for `List[Scalar[dtype]]` that uses SIMD vectorized comparison instead of element-by-element scanning. This yields up to 4-32x speedups depending on element size and SIMD width. Signed-off-by: Manuel Saelices <msaelices@gmail.com>
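For intuition, here is a minimal sketch of the kind of vectorized scan the commit above describes, not the PR's actual code. It assumes `simd_width_of` from `sys.info`, `List.unsafe_ptr()`, and `UnsafePointer.load`; exact APIs in the stdlib may differ:

```mojo
from sys.info import simd_width_of

fn contains_scalar[dtype: DType](
    items: List[Scalar[dtype]], value: Scalar[dtype]
) -> Bool:
    alias width = simd_width_of[dtype]()
    var ptr = items.unsafe_ptr()
    var i = 0
    # Compare `width` elements per iteration against the splatted needle;
    # any matching lane means the value is present.
    while i + width <= len(items):
        if (ptr.load[width=width](i) == SIMD[dtype, width](value)).reduce_or():
            return True
        i += width
    # Scalar tail for the remaining (< width) elements.
    while i < len(items):
        if ptr[i] == value:
            return True
        i += 1
    return False
```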
Add benchmarks covering Int (hit/miss), Float64, and UInt8 element types across sizes from 10 to 100,000 elements. Signed-off-by: Manuel Saelices <msaelices@gmail.com>
Pull request overview
This PR adds a SIMD-specialized List.__contains__ implementation for scalar element types to speed up membership checks, along with tests and a new benchmark to measure performance across common scalar types.
Changes:
- Added a SIMD-vectorized `__contains__` overload for `List[Scalar[dtype]]` in the standard library.
- Extended `test_list_contains` to exercise scalar list membership behavior on larger inputs.
- Introduced a new collections benchmark for `List.__contains__` across `Int`, `Float64`, and `UInt8` sizes.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| mojo/stdlib/std/collections/list.mojo | Adds SIMD-optimized `List.__contains__` overload for scalar element lists. |
| mojo/stdlib/test/collections/test_list.mojo | Adds tests intended to cover the SIMD-optimized membership path for scalar types. |
| mojo/stdlib/benchmarks/collections/bench_list_contains.mojo | Adds a new benchmark suite for list membership across types and sizes. |
```mojo
# Test with UInt8.
var bytes = List[UInt8]()
for i in range(256):
    bytes.append(UInt8(i))

assert_true(UInt8(0) in bytes)
assert_true(UInt8(127) in bytes)
assert_true(UInt8(255) in bytes)
```
The UInt8 tests only exercise the hit path: bytes contains every value 0..255, so there’s no way for x in bytes to return False. Add a UInt8 miss case (e.g., build a list missing one value and assert it’s not contained) to cover the negative path of the SIMD overload.
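A minimal sketch of such a miss case (the omitted value 100 is an arbitrary choice, not from the PR):

```mojo
# Cover 0..255 except one value, then assert that value is a miss.
var partial = List[UInt8]()
for i in range(256):
    if i != 100:  # deliberately leave 100 out
        partial.append(UInt8(i))
assert_true(UInt8(100) not in partial)
```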
```mojo
fn make_list_uint8[size: Int]() -> List[UInt8]:
    var l = List[UInt8](capacity=size)
    for i in range(size):
        l.append(UInt8(i % 256))
    return l^
```
make_list_uint8 repeats values for sizes > 256 (UInt8(i % 256)), so target in items will typically succeed very early (first occurrence within the first 256 elements) rather than scanning ~size/2 elements as intended. This makes the UInt8 benchmark results not representative; consider constructing data where the first match occurs near the middle (or add an explicit miss benchmark) to measure the real traversal cost.
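One way to construct such data, sketched assuming the searched-for value is 0 (this mirrors the fix a later commit describes, but is not its literal code):

```mojo
fn make_list_uint8_mid_match[size: Int]() -> List[UInt8]:
    var l = List[UInt8](capacity=size)
    for i in range(size):
        if i < size // 2:
            l.append(UInt8(1 + i % 255))  # only values 1..255, never 0
        else:
            l.append(UInt8((i - size // 2) % 256))  # 0 first appears at size // 2
    return l^
```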
```mojo
fn bench_list_contains_uint8[size: Int](mut b: Bencher) raises:
    """Search for a UInt8 element present in the middle of the list."""
    var items = make_list_uint8[size]()
    var target = UInt8((size // 2) % 256)
```
The UInt8 benchmark docstring says the element is “present in the middle of the list”, but with the current data generation (repeating 0..255), the target’s first occurrence is near the beginning for large sizes. Update the benchmark setup (and/or docstring) so the measured access pattern matches the description.
Ensure value 0 first appears at position size // 2 so the benchmark measures scanning half the list, not finding a match early due to repeating values. Signed-off-by: Manuel Saelices <msaelices@gmail.com>
Replace the SIMD-optimized List.__contains__ overload for scalar types with a simple delegation to Span.__contains__, which already has an equivalent (and more sophisticated) SIMD implementation. This eliminates code duplication and addresses PR review feedback. Signed-off-by: Manuel Saelices <msaelices@gmail.com>
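Conceptually the delegation is tiny; a hedged sketch, assuming `Span` can be constructed from a `List` (the exact stdlib signature may differ):

```mojo
fn __contains__(self, value: Scalar[dtype]) -> Bool:
    # Reuse Span's existing SIMD-optimized membership check.
    return value in Span(self)
```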
Add widths 4 and 2 to the SIMD width cascade in Span.__contains__. Previously the smallest width was 8, which meant no SIMD code was generated for 64-bit types (Int, Float64) on AVX2 where simd_width_of returns 4. This gives 2-3x speedups for Float64 containment checks. Signed-off-by: Manuel Saelices <msaelices@gmail.com>
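Such a cascade can be pictured as a chain of compile-time branches selecting the widest width the target supports; `_contains_with_width` and `_contains_scalar` below are hypothetical helpers, not the stdlib's actual names:

```mojo
alias native_width = simd_width_of[dtype]()

@parameter
if native_width >= 16:
    return _contains_with_width[16](span, value)
elif native_width >= 8:
    return _contains_with_width[8](span, value)
elif native_width >= 4:
    return _contains_with_width[4](span, value)  # 64-bit types on AVX2 land here
elif native_width >= 2:
    return _contains_with_width[2](span, value)
else:
    return _contains_scalar(span, value)
```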
Increase num_repetitions from 5 to 50 for more stable and representative benchmark results. Signed-off-by: Manuel Saelices <msaelices@gmail.com>
```mojo
@parameter
fn call_fn() raises:
    for _ in range(100):
        var res = target in items
```
I'm not totally sure we need these benchmarks, as this PR is not doing any performance improvement; it merely delegates the call to `Span.__contains__`.
If we do keep the benchmarks, updating them to `Span` feels like a better approach.
However, if you just want to remove this benchmark, I'd be okay with that too :)
```mojo
fn call_fn() raises:
    for _ in range(100):
        var res = target in items
        keep(res)

b.iter[call_fn]()
keep(Bool(items))
keep(target)
```
A few style suggestions on benchmarks if we do keep them (IMO it'll be easier to just remove them):
- Let's use `black_box` in the testing function.
- Prefer the overload of `.iter` that works with unified closures; this allows you to remove the `keep` calls on items/target at the end of this function, as they were only needed because `@parameter` closures emit a false-positive warning about "unused variables".
- No need to mark the function `raises` unless it does actually raise an error.
Suggested change:

```diff
-fn call_fn() raises:
+fn call_fn() unified {read}:
     for _ in range(100):
-        var res = target in items
+        var res = black_box(target) in items
         keep(res)

-b.iter[call_fn]()
-keep(Bool(items))
-keep(target)
+b.iter(call_fn)
```
The PR now just delegates to Span.__contains__, so separate List-specific benchmarks are unnecessary. Signed-off-by: Manuel Saelices <msaelices@gmail.com>
Summary
- Delegated `List.__contains__` for scalar types to `Span.__contains__` instead of duplicating SIMD logic
- Extended the `Span.__contains__` SIMD width cascade to include widths 4 and 2, enabling SIMD for 64-bit types on AVX2

Benchmark results (main vs this branch)
Machine: x86_64, Intel i7-12700H (AVX2, 256-bit SIMD), 50 repetitions