
Conversation

@gpshead (Member) commented Jan 18, 2026

Add SIMD optimization for bytes.hex(), bytearray.hex(), and binascii.hexlify(), as well as hashlib's .hexdigest() methods, using portable GCC/Clang vector extensions that compile to native SIMD instructions.

  • Up to 11x faster for large data (1KB+)
  • 1.1-3x faster for common small data (16-64 bytes, covering md5 through sha512 digest sizes)
  • Separator insertion (sep=) also benefits when bytes_per_sep >= 8
  • Retains the existing scalar code for short inputs (<16 bytes) and for platforms lacking SIMD instructions; no observable performance regressions there.

Supported platforms:

  • x86-64: SSE2 is always available, no special flags needed
  • ARM64: NEON is always available, no special flags needed
  • ARM32: Requires NEON support and appropriate compiler flags (e.g., -march=native on a Raspberry Pi 3+)
  • Windows/MSVC: Not supported; MSVC lacks __builtin_shufflevector, so the scalar path is used

Feature detection happens at compile time, since these features are always available on the target architectures; no runtime feature inspection is needed.
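As an illustration, the gate can look something like the following sketch. The macro name HEXLIFY_USE_SIMD is made up here; real code would likely also check __has_builtin(__builtin_shufflevector) to enforce the GCC 12+/Clang requirement mentioned in the commit log below.

```c
/* Hypothetical compile-time gate matching the platform list above;
   the macro name is illustrative only. */
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__)  /* SSE2 is part of the AMD64 baseline */ || \
     defined(__aarch64__) /* NEON is mandatory on ARM64 */         || \
     (defined(__arm__) && defined(__ARM_NEON)) /* ARM32: opt-in */)
#  define HEXLIFY_USE_SIMD 1
#else
#  define HEXLIFY_USE_SIMD 0   /* e.g. MSVC: scalar path only */
#endif
```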

Benchmarked using https://github.com/python/cpython/blob/0f94c061d49821a74096e57df8dff9617b80fad7/Tools/scripts/pystrhex_benchmark.py

Performance wins confirmed across the board on x86-64 (Zen 2), ARM64 (RPi4), ARM32 (RPi5 running 32-bit Raspbian, with compiler flags to enable NEON), and ARM64 Apple M4.

The commit history on this branch contains earlier experiments for reference.

Example benchmark results (M4):

  1. bytes.hex() without separator: Scales extremely well - 1.02x at 16 bytes up to 9.8x at 4KB.
  2. bytes.hex() with sep=32: Good gains even with separators (1.38x-5x).
  3. hashlib hexdigest: Modest 7-15% improvement on the hex conversion portion; the hash computation dominates total time.
  bytes.hex() (no separator)
  ┌────────────┬───────────┬───────────┬─────────┐
  │    Size    │ Baseline  │ Optimized │ Speedup │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 16 bytes   │ 22.9 ns   │ 22.4 ns   │ 1.02x   │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 32 bytes   │ 28.4 ns   │ 22.7 ns   │ 1.25x   │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 64 bytes   │ 44.4 ns   │ 24.4 ns   │ 1.82x   │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 256 bytes  │ 154.9 ns  │ 47.6 ns   │ 3.25x   │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 4096 bytes │ 1969.2 ns │ 201.6 ns  │ 9.8x    │
  └────────────┴───────────┴───────────┴─────────┘
  bytes.hex('\n', 32) (separator every 32 bytes)
  ┌────────────┬───────────┬───────────┬─────────┐
  │    Size    │ Baseline  │ Optimized │ Speedup │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 32 bytes   │ 48.8 ns   │ 35.3 ns   │ 1.38x   │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 64 bytes   │ 63.4 ns   │ 38.8 ns   │ 1.63x   │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 256 bytes  │ 178.7 ns  │ 73.0 ns   │ 2.45x   │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 512 bytes  │ 293.3 ns  │ 89.6 ns   │ 3.27x   │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 4096 bytes │ 2074.2 ns │ 415.5 ns  │ 5.0x    │
  └────────────┴───────────┴───────────┴─────────┘
  hashlib hexdigest (hash + hex conversion)
  ┌───────────────────┬──────────┬───────────┬─────────┐
  │      Digest       │ Baseline │ Optimized │ Speedup │
  ├───────────────────┼──────────┼───────────┼─────────┤
  │ md5 (16 bytes)    │ 238.2 ns │ 231.7 ns  │ 1.03x   │
  ├───────────────────┼──────────┼───────────┼─────────┤
  │ sha1 (20 bytes)   │ 210.8 ns │ 197.3 ns  │ 1.07x   │
  ├───────────────────┼──────────┼───────────┼─────────┤
  │ sha256 (32 bytes) │ 214.6 ns │ 200.0 ns  │ 1.07x   │
  ├───────────────────┼──────────┼───────────┼─────────┤
  │ sha512 (64 bytes) │ 282.9 ns │ 255.9 ns  │ 1.11x   │
  └───────────────────┴──────────┴───────────┴─────────┘
And if you're curious about the path not taken (the AVX variants dropped from this PR's end state), here is how they fare on a Zen 4:
  bytes.hex() without separator
  ┌────────┬───────────┬─────────────────┬──────────────────┬──────────────────┐
  │  Size  │ Baseline  │     SIMD PR     │     AVX-512      │       AVX2       │
  ├────────┼───────────┼─────────────────┼──────────────────┼──────────────────┤
  │ 32 B   │ 44.7 ns   │ 27.4 ns (1.6x)  │ 29.2 ns (1.5x)   │ 29.0 ns (1.5x)   │
  ├────────┼───────────┼─────────────────┼──────────────────┼──────────────────┤
  │ 64 B   │ 64.5 ns   │ 28.3 ns (2.3x)  │ 29.2 ns (2.2x)   │ 29.4 ns (2.2x)   │
  ├────────┼───────────┼─────────────────┼──────────────────┼──────────────────┤
  │ 128 B  │ 104.8 ns  │ 31.7 ns (3.3x)  │ 29.0 ns (3.6x)   │ 30.8 ns (3.4x)   │
  ├────────┼───────────┼─────────────────┼──────────────────┼──────────────────┤
  │ 256 B  │ 185.8 ns  │ 45.0 ns (4.1x)  │ 35.9 ns (5.2x)   │ 40.4 ns (4.6x)   │
  ├────────┼───────────┼─────────────────┼──────────────────┼──────────────────┤
  │ 512 B  │ 361.1 ns  │ 75.3 ns (4.8x)  │ 55.0 ns (6.6x)   │ 61.4 ns (5.9x)   │
  ├────────┼───────────┼─────────────────┼──────────────────┼──────────────────┤
  │ 4096 B │ 2242.6 ns │ 278.1 ns (8.1x) │ 138.5 ns (16.2x) │ 174.0 ns (12.9x) │
  └────────┴───────────┴─────────────────┴──────────────────┴──────────────────┘
  The SIMD PR (SSE2/SSSE3) delivers strong speedups across the board, reaching 8x at 4KB.
  The AVX variants push further - AVX-512 hits 16x at 4KB, AVX2 achieves 13x.

gpshead and others added 16 commits January 18, 2026 02:04
Add AVX2-accelerated hexlify for the no-separator path when converting
bytes to hexadecimal strings. This processes 32 bytes per iteration
instead of 1, using:

- SIMD nibble extraction (shift + mask)
- Arithmetic nibble-to-hex conversion (branchless)
- Interleave operations for correct output ordering

Runtime CPU detection via CPUID ensures AVX2 is only used when
available. Falls back to scalar code for inputs < 32 bytes or when
AVX2 is not supported.

Performance improvement (bytes.hex() no separator):
- 32 bytes:   1.3x faster
- 64 bytes:   1.7x faster
- 128 bytes:  3.0x faster
- 256 bytes:  4.0x faster
- 512 bytes:  4.9x faster
- 4096 bytes: 11.9x faster

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
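For the curious, a minimal sketch of the 32-bytes-per-iteration chunk conversion this commit describes, written with AVX2 intrinsics. Names and structure are illustrative, not taken from the patch, and the CPUID dispatch is omitted.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical kernel: 32 input bytes -> 64 lowercase hex chars. */
static void hexlify32_avx2(const uint8_t *in, uint8_t *out)
{
    const __m256i mask0f = _mm256_set1_epi8(0x0f);
    const __m256i nine   = _mm256_set1_epi8(9);
    const __m256i gap    = _mm256_set1_epi8('a' - '0' - 10);

    __m256i v  = _mm256_loadu_si256((const __m256i *)in);
    __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), mask0f);
    __m256i lo = _mm256_and_si256(v, mask0f);

    /* Branchless nibble -> ASCII: add '0', plus 39 more when nibble > 9. */
    __m256i hi_hex = _mm256_add_epi8(hi, _mm256_set1_epi8('0'));
    hi_hex = _mm256_add_epi8(hi_hex,
                 _mm256_and_si256(_mm256_cmpgt_epi8(hi, nine), gap));
    __m256i lo_hex = _mm256_add_epi8(lo, _mm256_set1_epi8('0'));
    lo_hex = _mm256_add_epi8(lo_hex,
                 _mm256_and_si256(_mm256_cmpgt_epi8(lo, nine), gap));

    /* unpack interleaves within 128-bit lanes, so a cross-lane
       permute restores the correct output byte order. */
    __m256i t0 = _mm256_unpacklo_epi8(hi_hex, lo_hex);
    __m256i t1 = _mm256_unpackhi_epi8(hi_hex, lo_hex);
    _mm256_storeu_si256((__m256i *)out,
                        _mm256_permute2x128_si256(t0, t1, 0x20));
    _mm256_storeu_si256((__m256i *)(out + 32),
                        _mm256_permute2x128_si256(t0, t1, 0x31));
}
```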
Add AVX-512 accelerated hexlify for the no-separator path when
available. This processes 64 bytes per iteration using:

- AVX-512F, AVX-512BW for 512-bit operations
- AVX-512VBMI for efficient byte-level permutation (permutex2var_epi8)
- Masked blend for branchless nibble-to-hex conversion

Runtime detection via CPUID checks for all three required extensions.
Falls back to AVX2 for 32-63 byte remainders, then scalar for <32 bytes.

CPU hierarchy:
- AVX-512 (F+BW+VBMI): 64 bytes/iteration, used for inputs >= 64 bytes
- AVX2: 32 bytes/iteration, used for inputs >= 32 bytes
- Scalar: remaining bytes

Expected performance improvement over AVX2 for large inputs (4KB+)
due to doubled throughput per iteration.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
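A compact sketch of the masked-blend nibble conversion mentioned above, assuming AVX-512BW; nib_to_hex is a hypothetical helper name, not from the patch.

```c
#include <immintrin.h>

/* Illustrative only: branchless nibble (0..15) -> ASCII hex digit. */
static inline __m512i nib_to_hex(__m512i nib)
{
    const __m512i base_digit = _mm512_set1_epi8('0');
    const __m512i base_alpha = _mm512_set1_epi8('a' - 10);
    /* Per-byte mask: nibble > 9 selects the 'a'..'f' base. */
    __mmask64 alpha = _mm512_cmpgt_epi8_mask(nib, _mm512_set1_epi8(9));
    return _mm512_add_epi8(nib,
                _mm512_mask_blend_epi8(alpha, base_digit, base_alpha));
}
```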
Add NEON vectorized implementation for AArch64 that processes 16 bytes
per iteration using 128-bit NEON registers. Uses the same nibble-to-hex
arithmetic approach as AVX2/AVX-512 versions.

NEON is always available on AArch64, so no runtime detection is needed.
The implementation uses vzip1q_u8/vzip2q_u8 for interleaving high/low
nibbles into the correct output order.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
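A minimal sketch of such a 16-bytes-per-iteration NEON kernel (illustrative names, not the patch code):

```c
#include <arm_neon.h>
#include <stdint.h>

/* Hypothetical kernel: 16 input bytes -> 32 lowercase hex chars. */
static void hexlify16_neon(const uint8_t *in, uint8_t *out)
{
    uint8x16_t v  = vld1q_u8(in);
    uint8x16_t hi = vshrq_n_u8(v, 4);
    uint8x16_t lo = vandq_u8(v, vdupq_n_u8(0x0f));

    /* Branchless nibble -> ASCII via compare + masked add. */
    uint8x16_t gap = vdupq_n_u8('a' - '0' - 10);
    uint8x16_t hi_hex = vaddq_u8(vaddq_u8(hi, vdupq_n_u8('0')),
                                 vandq_u8(vcgtq_u8(hi, vdupq_n_u8(9)), gap));
    uint8x16_t lo_hex = vaddq_u8(vaddq_u8(lo, vdupq_n_u8('0')),
                                 vandq_u8(vcgtq_u8(lo, vdupq_n_u8(9)), gap));

    /* zip1/zip2 interleave high and low nibble chars in output order. */
    vst1q_u8(out,      vzip1q_u8(hi_hex, lo_hex));
    vst1q_u8(out + 16, vzip2q_u8(hi_hex, lo_hex));
}
```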
Add SSE2 vectorized implementation that processes 16 bytes per iteration.
SSE2 is always available on x86-64 (part of AMD64 baseline), so no runtime
detection is needed.

This provides SIMD acceleration for all x86-64 machines, even those without
AVX2. The dispatch now cascades: AVX-512 (64+ bytes) → AVX2 (32+ bytes) →
SSE2 (16+ bytes) → scalar.

Benchmarks show ~5-6% improvement for 16-20 byte inputs, which is useful
for common hash digest sizes (MD5=16 bytes, SHA1=20 bytes).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
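A self-contained sketch of that cascade (all function names are hypothetical; the real dispatch lives inside CPython's hexlify implementation, and this tiering was later simplified away):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-chunk kernels, named for illustration only. */
void hexlify64_avx512(const uint8_t *in, uint8_t *out);
void hexlify32_avx2(const uint8_t *in, uint8_t *out);
void hexlify16_sse2(const uint8_t *in, uint8_t *out);
void hexlify_scalar(const uint8_t *in, uint8_t *out, size_t n);

/* Each tier consumes what it can, then hands the tail down. */
static void hexlify(const uint8_t *in, uint8_t *out, size_t n,
                    int have_avx512, int have_avx2)
{
    if (have_avx512)
        for (; n >= 64; in += 64, out += 128, n -= 64)
            hexlify64_avx512(in, out);
    if (have_avx2)
        for (; n >= 32; in += 32, out += 64, n -= 32)
            hexlify32_avx2(in, out);
    for (; n >= 16; in += 16, out += 32, n -= 16)
        hexlify16_sse2(in, out);
    hexlify_scalar(in, out, n);   /* remaining tail bytes */
}
```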
Benchmarks showed SSE2 performs nearly as well as AVX2 for most input
sizes (within 5% up to 256 bytes, within 8% at 512+ bytes). Since SSE2
is always available on x86-64 (part of the baseline), this eliminates:

- Runtime CPU feature detection via CPUID
- ~200 lines of AVX2/AVX-512 intrinsics code
- Maintenance burden of multiple SIMD implementations

The simpler SSE2-only approach provides most of the performance benefit
with significantly less code complexity.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…sions

Replace separate platform-specific SSE2 and NEON implementations with a
single unified implementation using GCC/Clang vector extensions. The
portable code uses __builtin_shufflevector for interleave operations,
which compiles to native SIMD instructions:
- x86-64: punpcklbw/punpckhbw (SSE2)
- ARM64: zip1/zip2 (NEON)

This eliminates code duplication while maintaining SIMD performance.
Requires GCC 12+ or Clang 3.0+ on x86-64 or ARM64.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
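Putting the pieces together, a minimal sketch of what such a unified vector-extension kernel can look like, assuming a compiler with __builtin_shufflevector (GCC 12+/Clang). This is illustrative, not the actual CPython code.

```c
#include <stdint.h>
#include <string.h>

typedef uint8_t v16qu __attribute__((vector_size(16)));
typedef int8_t  v16qs __attribute__((vector_size(16)));

/* Hypothetical kernel: 16 input bytes -> 32 lowercase hex chars. */
static void hexlify16(const uint8_t *in, uint8_t *out)
{
    v16qu v;
    memcpy(&v, in, 16);                 /* unaligned load */
    v16qu hi = v >> 4;                  /* high nibbles */
    v16qu lo = v & 0x0f;                /* low nibbles */

    /* Branchless nibble -> ASCII. Nibbles are 0..15, so the signed
       compare is safe (see the pcmpgtb commit below). */
    const v16qs nine = {9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9};
    v16qu hi_hex = hi + '0'
                 + (((v16qu)((v16qs)hi > nine)) & ('a' - '0' - 10));
    v16qu lo_hex = lo + '0'
                 + (((v16qu)((v16qs)lo > nine)) & ('a' - '0' - 10));

    /* Interleave high/low chars: punpcklbw/punpckhbw on x86-64,
       zip1/zip2 on ARM64, vzip on ARMv7 NEON. */
    v16qu out0 = __builtin_shufflevector(hi_hex, lo_hex,
        0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23);
    v16qu out1 = __builtin_shufflevector(hi_hex, lo_hex,
        8, 24, 9, 25, 10, 26, 11, 27, 12, 28, 13, 29, 14, 30, 15, 31);
    memcpy(out, &out0, 16);
    memcpy(out + 16, &out1, 16);
}
```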
Extend the portable SIMD hexlify to handle separator cases where
bytes_per_sep >= 16. Uses an in-place shuffle: SIMD-hexlify into the
output buffer, then work backwards inserting separators via memmove.

For 4096 bytes with sep=32: ~3.3µs (vs ~7.3µs for sep=1 scalar).
Useful for hex dump style output like bytes.hex('\n', 32).

Also adds benchmark for newline separator every 32 bytes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
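A hypothetical sketch of that back-to-front shuffle, under simplifying assumptions: `out` already holds the 2*n packed hex chars and is sized for the final layout, the group size g divides n, and bytes.hex()'s right-aligned grouping for positive bytes_per_sep is ignored for brevity.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Expand packed hex chars in place: group i of 2*g chars moves from
   offset i*2*g to offset i*(2*g + 1), with a separator before it. */
static void insert_separators(uint8_t *out, size_t n, size_t g, uint8_t sep)
{
    size_t ngroups = n / g;
    /* Work from the last group backwards so no memmove reads bytes
       that an earlier iteration already overwrote. */
    for (size_t i = ngroups; i-- > 1; ) {
        memmove(out + i * (2*g + 1), out + i * 2*g, 2*g);
        out[i * (2*g + 1) - 1] = sep;
    }
}
```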
Lower the threshold from abs_bytes_per_sep >= 16 to >= 8 for the SIMD
hexlify + memmove shuffle path. Benchmarks show this is worthwhile for
sep=8 and above, but memmove overhead negates benefits for smaller values.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
GCC's vector extensions generate inefficient code for unsigned byte
comparison (hi > nine): psubusb + pcmpeqb + pcmpeqb (3 instructions).

By casting to signed bytes before comparison, GCC generates the
efficient pcmpgtb instruction instead. This is safe because nibble
values (0-15) are within signed byte range.

This reduces the SIMD loop from 29 to 25 instructions, matching the
performance of explicit SSE2 intrinsics while keeping the portable
vector extensions approach.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
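Side by side, the two comparison spellings (illustrative only; the all-ones result feeds the +39 adjustment that turns nibbles 10..15 into 'a'..'f'):

```c
#include <stdint.h>

typedef uint8_t v16qu __attribute__((vector_size(16)));
typedef int8_t  v16qs __attribute__((vector_size(16)));

/* Unsigned compare: x86 has no unsigned byte compare, so GCC emulates
   it with a multi-instruction sequence (per the commit message). */
static v16qu is_alpha_unsigned(v16qu hi)
{
    const v16qu nine = {9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9};
    return (v16qu)(hi > nine);
}

/* Signed compare: nibbles are 0..15, within signed byte range, so the
   reinterpret is safe and compiles to a single pcmpgtb. */
static v16qu is_alpha_signed(v16qu hi)
{
    const v16qs nine = {9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9};
    return (v16qu)((v16qs)hi > nine);
}
```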
Extract the scalar hexlify loop into _Py_hexlify_scalar(), which is
shared between the SIMD fallback path and the main non-SIMD path.
Uses table lookup via Py_hexdigits for consistency.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
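A minimal sketch of such a shared scalar loop; CPython's Py_hexdigits table is mirrored here as a local constant so the snippet stands alone.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for CPython's Py_hexdigits ("0123456789abcdef"). */
static const char hexdigits[] = "0123456789abcdef";

/* One table lookup per nibble, two output chars per input byte. */
static void hexlify_scalar(const uint8_t *in, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[2*i]     = hexdigits[in[i] >> 4];
        out[2*i + 1] = hexdigits[in[i] & 0x0f];
    }
}
```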
Extend portable SIMD support to ARM32 when NEON is available.
The __builtin_shufflevector interleave compiles to vzip instructions
on ARMv7 NEON, similar to zip1/zip2 on ARM64.

NEON is optional on 32-bit ARM (unlike ARM64 where it's mandatory),
so we check for __ARM_NEON in addition to __arm__.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add targeted tests for corner cases relevant to SIMD optimization:

- test_hex_simd_boundaries: Test lengths around the 16-byte SIMD
  threshold (14, 15, 16, 17, 31, 32, 33, 64, 65 bytes)

- test_hex_nibble_boundaries: Test the 9/10 nibble value boundary
  where digits become letters, verifying the signed comparison
  optimization works correctly

- test_hex_simd_separator: Test SIMD separator insertion path
  (triggered when sep >= 8 and len >= 16) with various group
  sizes and both positive/negative bytes_per_sep

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@gpshead gpshead self-assigned this Jan 18, 2026
@gpshead gpshead added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jan 18, 2026
@bedevere-bot commented
🤖 New build scheduled with the buildbot fleet by @gpshead for commit 5fc294c 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F143991%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jan 18, 2026
@gpshead (Member, Author) commented Jan 18, 2026

The buildbot failures are all unrelated (test_capi, test__interpreters, test_urllib2net, etc.).

@gpshead gpshead changed the title gh-XXXXXX: Add portable SIMD optimization for bytes.hex() gh-144015: Add portable SIMD optimization for bytes.hex() Jan 18, 2026
@gpshead gpshead marked this pull request as ready for review January 18, 2026 19:01