gh-144015: Add portable SIMD optimization for bytes.hex() #143991
Open
gpshead wants to merge 16 commits into python:main from gpshead:opt-pystrhex
+258
−6
Conversation
Add AVX2-accelerated hexlify for the no-separator path when converting bytes to hexadecimal strings. This processes 32 bytes per iteration instead of 1, using:
- SIMD nibble extraction (shift + mask)
- Arithmetic nibble-to-hex conversion (branchless)
- Interleave operations for correct output ordering

Runtime CPU detection via CPUID ensures AVX2 is only used when available. Falls back to scalar code for inputs < 32 bytes or when AVX2 is not supported.

Performance improvement (bytes.hex(), no separator):
- 32 bytes: 1.3x faster
- 64 bytes: 1.7x faster
- 128 bytes: 3.0x faster
- 256 bytes: 4.0x faster
- 512 bytes: 4.9x faster
- 4096 bytes: 11.9x faster

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
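The branchless arithmetic is easiest to see in scalar form. A minimal sketch (illustrative only, not code from this PR): each nibble n in 0..15 maps to '0' + n, plus an extra 39 ('a' - '0' - 10) when n > 9; the SIMD versions compute the same thing per lane, with a compare mask standing in for the branch.

```c
#include <stdio.h>

/* Branchless nibble -> ASCII hex digit: 0..9 -> '0'..'9', 10..15 -> 'a'..'f'.
   -(n > 9) is 0 or all-ones, so the & adds 39 only for nibbles 10..15,
   mirroring what a vector compare-and-mask does per lane. */
static unsigned char nibble_to_hex(unsigned n)
{
    return (unsigned char)('0' + n + (-(int)(n > 9) & 39));
}

int main(void)
{
    for (unsigned n = 0; n < 16; n++) {
        putchar(nibble_to_hex(n));  /* prints 0123456789abcdef */
    }
    putchar('\n');
    return 0;
}
```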
Add AVX-512 accelerated hexlify for the no-separator path when available. This processes 64 bytes per iteration using:
- AVX-512F and AVX-512BW for 512-bit operations
- AVX-512VBMI for efficient byte-level permutation (permutex2var_epi8)
- Masked blend for branchless nibble-to-hex conversion

Runtime detection via CPUID checks for all three required extensions. Falls back to AVX2 for 32-63 byte remainders, then scalar for <32 bytes.

CPU hierarchy:
- AVX-512 (F+BW+VBMI): 64 bytes/iteration, used for inputs >= 64 bytes
- AVX2: 32 bytes/iteration, used for inputs >= 32 bytes
- Scalar: remaining bytes

Expected performance improvement over AVX2 for large inputs (4KB+) due to doubled throughput per iteration.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add NEON vectorized implementation for AArch64 that processes 16 bytes per iteration using 128-bit NEON registers. Uses the same nibble-to-hex arithmetic approach as the AVX2/AVX-512 versions.

NEON is always available on AArch64, so no runtime detection is needed. The implementation uses vzip1q_u8/vzip2q_u8 for interleaving high/low nibbles into the correct output order.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
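A self-contained sketch of the NEON approach (the helper name and exact structure are assumptions, not the code from this PR):

```c
#include <arm_neon.h>
#include <stdint.h>

/* Convert 16 input bytes into 32 hex characters with AArch64 NEON.
   Illustrative sketch only. */
static void hexlify16_neon(const uint8_t *in, uint8_t *out)
{
    uint8x16_t v  = vld1q_u8(in);
    uint8x16_t hi = vshrq_n_u8(v, 4);              /* high nibbles */
    uint8x16_t lo = vandq_u8(v, vdupq_n_u8(0x0F)); /* low nibbles */

    /* Branchless nibble -> ASCII: add '0', plus 39 more where nibble > 9
       (vcgtq_u8 yields 0xFF lanes for true). */
    uint8x16_t nine   = vdupq_n_u8(9);
    uint8x16_t hi_hex = vaddq_u8(vaddq_u8(hi, vdupq_n_u8('0')),
                                 vandq_u8(vcgtq_u8(hi, nine), vdupq_n_u8(39)));
    uint8x16_t lo_hex = vaddq_u8(vaddq_u8(lo, vdupq_n_u8('0')),
                                 vandq_u8(vcgtq_u8(lo, nine), vdupq_n_u8(39)));

    /* Interleave so each input byte becomes "high digit, low digit". */
    vst1q_u8(out,      vzip1q_u8(hi_hex, lo_hex));
    vst1q_u8(out + 16, vzip2q_u8(hi_hex, lo_hex));
}
```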
Add SSE2 vectorized implementation that processes 16 bytes per iteration. SSE2 is always available on x86-64 (part of the AMD64 baseline), so no runtime detection is needed. This provides SIMD acceleration for all x86-64 machines, even those without AVX2.

The dispatch now cascades: AVX-512 (64+ bytes) → AVX2 (32+ bytes) → SSE2 (16+ bytes) → scalar.

Benchmarks show ~5-6% improvement for 16-20 byte inputs, which is useful for common hash digest sizes (MD5=16 bytes, SHA1=20 bytes).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Benchmarks showed SSE2 performs nearly as well as AVX2 for most input sizes (within 5% up to 256 bytes, within 8% at 512+ bytes). Since SSE2 is always available on x86-64 (part of the baseline), this eliminates:
- Runtime CPU feature detection via CPUID
- ~200 lines of AVX2/AVX-512 intrinsics code
- Maintenance burden of multiple SIMD implementations

The simpler SSE2-only approach provides most of the performance benefit with significantly less code complexity.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace separate platform-specific SSE2 and NEON implementations with a single unified implementation using GCC/Clang vector extensions. The portable code uses __builtin_shufflevector for interleave operations, which compiles to native SIMD instructions:
- x86-64: punpcklbw/punpckhbw (SSE2)
- ARM64: zip1/zip2 (NEON)

This eliminates code duplication while maintaining SIMD performance. Requires GCC 12+ or Clang 3.0+ on x86-64 or ARM64.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
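For orientation, a self-contained sketch of the vector-extension approach (the types and helper name are illustrative; this is not the code as merged):

```c
#include <stdint.h>
#include <string.h>

/* 16-byte lanes via GCC/Clang vector extensions; the shuffles below lower
   to punpcklbw/punpckhbw on SSE2 and zip1/zip2 on NEON. */
typedef uint8_t v16u __attribute__((vector_size(16)));
typedef int8_t  v16s __attribute__((vector_size(16)));

/* Convert 16 input bytes into 32 hex characters.  Illustrative sketch. */
static void hexlify16_portable(const uint8_t *in, uint8_t *out)
{
    v16u v;
    memcpy(&v, in, 16);

    v16u hi = v >> 4;    /* high nibbles */
    v16u lo = v & 0x0F;  /* low nibbles */

    /* Nibbles are 0..15, so a signed compare is safe; true lanes are 0xFF,
       and `& 39` then adds 'a' - '0' - 10 only where the nibble is > 9. */
    const v16s nine = {9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9};
    v16u hi_hex = hi + '0' + ((v16u)((v16s)hi > nine) & 39);
    v16u lo_hex = lo + '0' + ((v16u)((v16s)lo > nine) & 39);

    /* Interleave: hi[0], lo[0], hi[1], lo[1], ... */
    v16u out_lo = __builtin_shufflevector(hi_hex, lo_hex,
        0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23);
    v16u out_hi = __builtin_shufflevector(hi_hex, lo_hex,
        8, 24, 9, 25, 10, 26, 11, 27, 12, 28, 13, 29, 14, 30, 15, 31);

    memcpy(out,      &out_lo, 16);
    memcpy(out + 16, &out_hi, 16);
}
```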
Extend the portable SIMD hexlify to handle separator cases where bytes_per_sep >= 16. Uses an in-place shuffle: SIMD hexlify into the output buffer, then work backwards inserting separators via memmove. For 4096 bytes with sep=32: ~3.3µs (vs ~7.3µs for sep=1 scalar). Useful for hex-dump-style output like bytes.hex('\n', 32).

Also adds a benchmark for a newline separator every 32 bytes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
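A rough sketch of that back-to-front shuffle, simplified to assume the input length is an exact multiple of the group size (the real code also has to handle partial groups and negative bytes_per_sep; the names here are illustrative):

```c
#include <stddef.h>
#include <string.h>

/* After hexlifying `nbytes` input bytes into the front of `out`
   (2*nbytes hex chars, no separators), spread the groups out toward the
   back, writing `sep` between every `group` input bytes.  Working
   backwards means each memmove reads hex chars that have not yet been
   overwritten.  Sketch: assumes nbytes is a multiple of group. */
static void spread_separators(unsigned char *out, size_t nbytes,
                              size_t group, unsigned char sep)
{
    size_t ngroups = nbytes / group;
    size_t src = 2 * nbytes;                  /* end of the packed hex */
    size_t dst = 2 * nbytes + (ngroups - 1);  /* end of the final output */

    for (size_t g = ngroups - 1; g > 0; g--) {
        src -= 2 * group;
        dst -= 2 * group;
        memmove(out + dst, out + src, 2 * group);
        out[--dst] = sep;
    }
    /* The first group is already in place at out[0 .. 2*group). */
}
```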
Lower the threshold from abs_bytes_per_sep >= 16 to >= 8 for the SIMD hexlify + memmove shuffle path. Benchmarks show this is worthwhile for sep=8 and above, but memmove overhead negates benefits for smaller values.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
GCC's vector extensions generate inefficient code for unsigned byte comparison (hi > nine): psubusb + pcmpeqb + pcmpeqb (3 instructions). By casting to signed bytes before comparison, GCC generates the efficient pcmpgtb instruction instead. This is safe because nibble values (0-15) are within signed byte range.

This reduces the SIMD loop from 29 to 25 instructions, matching the performance of explicit SSE2 intrinsics while keeping the portable vector-extensions approach.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
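The two forms side by side, using the same vector-extension types as the sketch above (illustrative only):

```c
#include <stdint.h>

typedef uint8_t v16u __attribute__((vector_size(16)));
typedef int8_t  v16s __attribute__((vector_size(16)));

static const v16s nine = {9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9};

/* Unsigned byte "greater than" has no single SSE2 instruction, so GCC
   synthesizes it from saturating-subtract and compare-equal sequences. */
static v16u gt9_unsigned(v16u hi) { return (v16u)(hi > (v16u)nine); }

/* Nibbles are 0..15, so reinterpreting as signed is lossless and the
   comparison maps directly onto pcmpgtb. */
static v16u gt9_signed(v16u hi) { return (v16u)((v16s)hi > nine); }
```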
Extract the scalar hexlify loop into _Py_hexlify_scalar(), which is shared between the SIMD fallback path and the main non-SIMD path. Uses table lookup via Py_hexdigits for consistency.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
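A sketch of what that shared scalar helper looks like (the _Py_hexlify_scalar name comes from the commit message; this exact signature is an assumption):

```c
#include <Python.h>  /* Py_hexdigits is "0123456789abcdef" */

/* Scalar fallback: two table lookups per input byte. */
static void hexlify_scalar(const unsigned char *in, Py_ssize_t n, char *out)
{
    for (Py_ssize_t i = 0; i < n; i++) {
        unsigned char b = in[i];
        out[2 * i]     = Py_hexdigits[b >> 4];
        out[2 * i + 1] = Py_hexdigits[b & 0x0F];
    }
}
```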
Extend portable SIMD support to ARM32 when NEON is available. The __builtin_shufflevector interleave compiles to vzip instructions on ARMv7 NEON, similar to zip1/zip2 on ARM64.

NEON is optional on 32-bit ARM (unlike ARM64, where it's mandatory), so we check for __ARM_NEON in addition to __arm__.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
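The resulting compile-time gate looks roughly like this (the macro name is hypothetical; the merged code may spell the condition differently, e.g. with a compiler-version check):

```c
/* Enable the vector-extension path only where __builtin_shufflevector is
   known to lower to real SIMD: x86-64 (SSE2 baseline), AArch64 (NEON is
   mandatory), or 32-bit ARM when NEON is enabled (__ARM_NEON). */
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__aarch64__) || \
     (defined(__arm__) && defined(__ARM_NEON)))
#  define HEXLIFY_USE_SIMD 1   /* hypothetical macro name */
#else
#  define HEXLIFY_USE_SIMD 0
#endif
```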
Add targeted tests for corner cases relevant to SIMD optimization:
- test_hex_simd_boundaries: Test lengths around the 16-byte SIMD threshold (14, 15, 16, 17, 31, 32, 33, 64, 65 bytes)
- test_hex_nibble_boundaries: Test the 9/10 nibble value boundary where digits become letters, verifying the signed comparison optimization works correctly
- test_hex_simd_separator: Test the SIMD separator insertion path (triggered when sep >= 8 and len >= 16) with various group sizes and both positive/negative bytes_per_sep

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 New build scheduled with the buildbot fleet by @gpshead for commit 5fc294c 🤖

Results will be shown at: https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F143991%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.
gpshead (Member, Author):

The buildbot failures are all unrelated: test_capi, test__interpreters, test_urllib2net, etc.
Add SIMD optimization for bytes.hex(), bytearray.hex(), and binascii.hexlify(), as well as hashlib .hexdigest() methods, using portable GCC/Clang vector extensions that compile to native SIMD instructions. Separated output (sep=) also benefits when bytes_per_sep >= 8.

Supported platforms:
- x86-64: SSE2 (always part of the baseline)
- ARM64: NEON (always available)
- 32-bit ARM with NEON (e.g. -march=native on a Raspberry Pi 3+)
- Other compilers/platforms lack __builtin_shufflevector, so the scalar path is used

This is compile-time detection of features that are always available on the target architectures. No need for runtime feature inspection.
Benchmarked using https://github.com/python/cpython/blob/0f94c061d49821a74096e57df8dff9617b80fad7/Tools/scripts/pystrhex_benchmark.py
Performance wins confirmed across the board on x86_64 (Zen 2), ARM64 (RPi4), ARM32 (RPi5 running 32-bit Raspbian, with compiler flags to enable it), and ARM64 Apple M4.
The commit history on this branch contains earlier experiments for reference.
Example benchmark results (M4):
And if you're curious about the path not taken by the end state of this PR (using AVX), here it is on a Zen 4: