
Conversation

@gpshead (Member) commented Jan 18, 2026

Add SIMD optimization for bytes.hex(), bytearray.hex(), and binascii.hexlify(), as well as hashlib's .hexdigest() methods, using portable GCC/Clang vector extensions that compile to native SIMD instructions.

  • Up to 11x faster for large data (1KB+)
  • 1.1-3x faster for common small data (16-64 bytes, covering md5 through sha512 digest sizes)
  • Separator insertion (sep=) also benefits when bytes_per_sep >= 8
  • Retains the existing scalar code for short inputs (<16 bytes) and for platforms lacking SIMD instructions; no observable performance regressions there.

Supported platforms:

  • x86-64: SSE2 is always available, no special flags needed
  • ARM64: NEON is always available, no special flags needed
  • ARM32: Requires NEON support and appropriate compiler flags (e.g., -march=native on a Raspberry Pi 3+)
  • Windows/MSVC: Not supported; MSVC lacks __builtin_shufflevector, so the scalar path is used

Feature detection happens at compile time, since these features are always available on the target architectures; no runtime feature inspection is needed.
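As an illustration, the gate can look something like the following sketch. The macro name HEXLIFY_USE_SIMD is made up here; real code would likely also check __has_builtin(__builtin_shufflevector) to enforce the GCC 12+/Clang requirement mentioned in the commit log below.

```c
/* Hypothetical compile-time gate matching the platform list above;
   the macro name is illustrative only. */
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__)  /* SSE2 is part of the AMD64 baseline */ || \
     defined(__aarch64__) /* NEON is mandatory on ARM64 */         || \
     (defined(__arm__) && defined(__ARM_NEON)) /* ARM32: opt-in */)
#  define HEXLIFY_USE_SIMD 1
#else
#  define HEXLIFY_USE_SIMD 0   /* e.g. MSVC: scalar path only */
#endif
```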

Benchmarked using https://github.com/python/cpython/blob/0f94c061d49821a74096e57df8dff9617b80fad7/Tools/scripts/pystrhex_benchmark.py

Performance wins confirmed across the board on x86-64 (Zen 2), ARM64 (RPi4), ARM32 (RPi5 running 32-bit Raspbian, with compiler flags to enable NEON), and ARM64 Apple M4.

The commit history on this branch contains earlier experiments for reference.

Example benchmark results (M4):

  1. bytes.hex() without separator: Scales extremely well - 1.02x at 16 bytes up to 9.8x at 4KB.
  2. bytes.hex() with sep=32: Good gains even with separators (1.38x-5x).
  3. hashlib hexdigest: Modest 7-15% improvement on the hex conversion portion; the hash computation dominates total time.
  bytes.hex() (no separator)
  ┌────────────┬───────────┬───────────┬─────────┐
  │    Size    │ Baseline  │ Optimized │ Speedup │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 16 bytes   │ 22.9 ns   │ 22.4 ns   │ 1.02x   │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 32 bytes   │ 28.4 ns   │ 22.7 ns   │ 1.25x   │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 64 bytes   │ 44.4 ns   │ 24.4 ns   │ 1.82x   │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 256 bytes  │ 154.9 ns  │ 47.6 ns   │ 3.25x   │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 4096 bytes │ 1969.2 ns │ 201.6 ns  │ 9.8x    │
  └────────────┴───────────┴───────────┴─────────┘
  bytes.hex('\n', 32) (separator every 32 bytes)
  ┌────────────┬───────────┬───────────┬─────────┐
  │    Size    │ Baseline  │ Optimized │ Speedup │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 32 bytes   │ 48.8 ns   │ 35.3 ns   │ 1.38x   │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 64 bytes   │ 63.4 ns   │ 38.8 ns   │ 1.63x   │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 256 bytes  │ 178.7 ns  │ 73.0 ns   │ 2.45x   │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 512 bytes  │ 293.3 ns  │ 89.6 ns   │ 3.27x   │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 4096 bytes │ 2074.2 ns │ 415.5 ns  │ 5.0x    │
  └────────────┴───────────┴───────────┴─────────┘
  hashlib hexdigest (hash + hex conversion)
  ┌───────────────────┬──────────┬───────────┬─────────┐
  │      Digest       │ Baseline │ Optimized │ Speedup │
  ├───────────────────┼──────────┼───────────┼─────────┤
  │ md5 (16 bytes)    │ 238.2 ns │ 231.7 ns  │ 1.03x   │
  ├───────────────────┼──────────┼───────────┼─────────┤
  │ sha1 (20 bytes)   │ 210.8 ns │ 197.3 ns  │ 1.07x   │
  ├───────────────────┼──────────┼───────────┼─────────┤
  │ sha256 (32 bytes) │ 214.6 ns │ 200.0 ns  │ 1.07x   │
  ├───────────────────┼──────────┼───────────┼─────────┤
  │ sha512 (64 bytes) │ 282.9 ns │ 255.9 ns  │ 1.11x   │
  └───────────────────┴──────────┴───────────┴─────────┘
And if you're curious about the path not taken (the AVX variants dropped from this PR's end state), here is how they fare on a Zen 4:
  bytes.hex() without separator
  ┌────────┬───────────┬─────────────────┬──────────────────┬──────────────────┐
  │  Size  │ Baseline  │     SIMD PR     │     AVX-512      │       AVX2       │
  ├────────┼───────────┼─────────────────┼──────────────────┼──────────────────┤
  │ 32 B   │ 44.7 ns   │ 27.4 ns (1.6x)  │ 29.2 ns (1.5x)   │ 29.0 ns (1.5x)   │
  ├────────┼───────────┼─────────────────┼──────────────────┼──────────────────┤
  │ 64 B   │ 64.5 ns   │ 28.3 ns (2.3x)  │ 29.2 ns (2.2x)   │ 29.4 ns (2.2x)   │
  ├────────┼───────────┼─────────────────┼──────────────────┼──────────────────┤
  │ 128 B  │ 104.8 ns  │ 31.7 ns (3.3x)  │ 29.0 ns (3.6x)   │ 30.8 ns (3.4x)   │
  ├────────┼───────────┼─────────────────┼──────────────────┼──────────────────┤
  │ 256 B  │ 185.8 ns  │ 45.0 ns (4.1x)  │ 35.9 ns (5.2x)   │ 40.4 ns (4.6x)   │
  ├────────┼───────────┼─────────────────┼──────────────────┼──────────────────┤
  │ 512 B  │ 361.1 ns  │ 75.3 ns (4.8x)  │ 55.0 ns (6.6x)   │ 61.4 ns (5.9x)   │
  ├────────┼───────────┼─────────────────┼──────────────────┼──────────────────┤
  │ 4096 B │ 2242.6 ns │ 278.1 ns (8.1x) │ 138.5 ns (16.2x) │ 174.0 ns (12.9x) │
  └────────┴───────────┴─────────────────┴──────────────────┴──────────────────┘
  The SIMD PR (SSE2/SSSE3) delivers strong speedups across the board, reaching 8x at 4KB.
  The AVX variants push further - AVX-512 hits 16x at 4KB, AVX2 achieves 13x.

gpshead and others added 16 commits January 18, 2026 02:04
Add AVX2-accelerated hexlify for the no-separator path when converting
bytes to hexadecimal strings. This processes 32 bytes per iteration
instead of 1, using:

- SIMD nibble extraction (shift + mask)
- Arithmetic nibble-to-hex conversion (branchless)
- Interleave operations for correct output ordering

Runtime CPU detection via CPUID ensures AVX2 is only used when
available. Falls back to scalar code for inputs < 32 bytes or when
AVX2 is not supported.

Performance improvement (bytes.hex() no separator):
- 32 bytes:   1.3x faster
- 64 bytes:   1.7x faster
- 128 bytes:  3.0x faster
- 256 bytes:  4.0x faster
- 512 bytes:  4.9x faster
- 4096 bytes: 11.9x faster

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
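For the curious, a minimal sketch of the 32-bytes-per-iteration chunk conversion this commit describes, written with AVX2 intrinsics. Names and structure are illustrative, not taken from the patch, and the CPUID dispatch is omitted.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical kernel: 32 input bytes -> 64 lowercase hex chars. */
static void hexlify32_avx2(const uint8_t *in, uint8_t *out)
{
    const __m256i mask0f = _mm256_set1_epi8(0x0f);
    const __m256i nine   = _mm256_set1_epi8(9);
    const __m256i gap    = _mm256_set1_epi8('a' - '0' - 10);

    __m256i v  = _mm256_loadu_si256((const __m256i *)in);
    __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), mask0f);
    __m256i lo = _mm256_and_si256(v, mask0f);

    /* Branchless nibble -> ASCII: add '0', plus 39 more when nibble > 9. */
    __m256i hi_hex = _mm256_add_epi8(hi, _mm256_set1_epi8('0'));
    hi_hex = _mm256_add_epi8(hi_hex,
                 _mm256_and_si256(_mm256_cmpgt_epi8(hi, nine), gap));
    __m256i lo_hex = _mm256_add_epi8(lo, _mm256_set1_epi8('0'));
    lo_hex = _mm256_add_epi8(lo_hex,
                 _mm256_and_si256(_mm256_cmpgt_epi8(lo, nine), gap));

    /* unpack interleaves within 128-bit lanes, so a cross-lane
       permute restores the correct output byte order. */
    __m256i t0 = _mm256_unpacklo_epi8(hi_hex, lo_hex);
    __m256i t1 = _mm256_unpackhi_epi8(hi_hex, lo_hex);
    _mm256_storeu_si256((__m256i *)out,
                        _mm256_permute2x128_si256(t0, t1, 0x20));
    _mm256_storeu_si256((__m256i *)(out + 32),
                        _mm256_permute2x128_si256(t0, t1, 0x31));
}
```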
Add AVX-512 accelerated hexlify for the no-separator path when
available. This processes 64 bytes per iteration using:

- AVX-512F, AVX-512BW for 512-bit operations
- AVX-512VBMI for efficient byte-level permutation (permutex2var_epi8)
- Masked blend for branchless nibble-to-hex conversion

Runtime detection via CPUID checks for all three required extensions.
Falls back to AVX2 for 32-63 byte remainders, then scalar for <32 bytes.

CPU hierarchy:
- AVX-512 (F+BW+VBMI): 64 bytes/iteration, used for inputs >= 64 bytes
- AVX2: 32 bytes/iteration, used for inputs >= 32 bytes
- Scalar: remaining bytes

Expected performance improvement over AVX2 for large inputs (4KB+)
due to doubled throughput per iteration.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
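A compact sketch of the masked-blend nibble conversion mentioned above, assuming AVX-512BW; nib_to_hex is a hypothetical helper name, not from the patch.

```c
#include <immintrin.h>

/* Illustrative only: branchless nibble (0..15) -> ASCII hex digit. */
static inline __m512i nib_to_hex(__m512i nib)
{
    const __m512i base_digit = _mm512_set1_epi8('0');
    const __m512i base_alpha = _mm512_set1_epi8('a' - 10);
    /* Per-byte mask: nibble > 9 selects the 'a'..'f' base. */
    __mmask64 alpha = _mm512_cmpgt_epi8_mask(nib, _mm512_set1_epi8(9));
    return _mm512_add_epi8(nib,
                _mm512_mask_blend_epi8(alpha, base_digit, base_alpha));
}
```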
Add NEON vectorized implementation for AArch64 that processes 16 bytes
per iteration using 128-bit NEON registers. Uses the same nibble-to-hex
arithmetic approach as AVX2/AVX-512 versions.

NEON is always available on AArch64, so no runtime detection is needed.
The implementation uses vzip1q_u8/vzip2q_u8 for interleaving high/low
nibbles into the correct output order.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
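A minimal sketch of such a 16-bytes-per-iteration NEON kernel (illustrative names, not the patch code):

```c
#include <arm_neon.h>
#include <stdint.h>

/* Hypothetical kernel: 16 input bytes -> 32 lowercase hex chars. */
static void hexlify16_neon(const uint8_t *in, uint8_t *out)
{
    uint8x16_t v  = vld1q_u8(in);
    uint8x16_t hi = vshrq_n_u8(v, 4);
    uint8x16_t lo = vandq_u8(v, vdupq_n_u8(0x0f));

    /* Branchless nibble -> ASCII via compare + masked add. */
    uint8x16_t gap = vdupq_n_u8('a' - '0' - 10);
    uint8x16_t hi_hex = vaddq_u8(vaddq_u8(hi, vdupq_n_u8('0')),
                                 vandq_u8(vcgtq_u8(hi, vdupq_n_u8(9)), gap));
    uint8x16_t lo_hex = vaddq_u8(vaddq_u8(lo, vdupq_n_u8('0')),
                                 vandq_u8(vcgtq_u8(lo, vdupq_n_u8(9)), gap));

    /* zip1/zip2 interleave high and low nibble chars in output order. */
    vst1q_u8(out,      vzip1q_u8(hi_hex, lo_hex));
    vst1q_u8(out + 16, vzip2q_u8(hi_hex, lo_hex));
}
```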
Add SSE2 vectorized implementation that processes 16 bytes per iteration.
SSE2 is always available on x86-64 (part of AMD64 baseline), so no runtime
detection is needed.

This provides SIMD acceleration for all x86-64 machines, even those without
AVX2. The dispatch now cascades: AVX-512 (64+ bytes) → AVX2 (32+ bytes) →
SSE2 (16+ bytes) → scalar.

Benchmarks show ~5-6% improvement for 16-20 byte inputs, which is useful
for common hash digest sizes (MD5=16 bytes, SHA1=20 bytes).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
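A self-contained sketch of that cascade (all function names are hypothetical; the real dispatch lives inside CPython's hexlify implementation, and this tiering was later simplified away):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-chunk kernels, named for illustration only. */
void hexlify64_avx512(const uint8_t *in, uint8_t *out);
void hexlify32_avx2(const uint8_t *in, uint8_t *out);
void hexlify16_sse2(const uint8_t *in, uint8_t *out);
void hexlify_scalar(const uint8_t *in, uint8_t *out, size_t n);

/* Each tier consumes what it can, then hands the tail down. */
static void hexlify(const uint8_t *in, uint8_t *out, size_t n,
                    int have_avx512, int have_avx2)
{
    if (have_avx512)
        for (; n >= 64; in += 64, out += 128, n -= 64)
            hexlify64_avx512(in, out);
    if (have_avx2)
        for (; n >= 32; in += 32, out += 64, n -= 32)
            hexlify32_avx2(in, out);
    for (; n >= 16; in += 16, out += 32, n -= 16)
        hexlify16_sse2(in, out);
    hexlify_scalar(in, out, n);   /* remaining tail bytes */
}
```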
Benchmarks showed SSE2 performs nearly as well as AVX2 for most input
sizes (within 5% up to 256 bytes, within 8% at 512+ bytes). Since SSE2
is always available on x86-64 (part of the baseline), this eliminates:

- Runtime CPU feature detection via CPUID
- ~200 lines of AVX2/AVX-512 intrinsics code
- Maintenance burden of multiple SIMD implementations

The simpler SSE2-only approach provides most of the performance benefit
with significantly less code complexity.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…sions

Replace separate platform-specific SSE2 and NEON implementations with a
single unified implementation using GCC/Clang vector extensions. The
portable code uses __builtin_shufflevector for interleave operations,
which compiles to native SIMD instructions:
- x86-64: punpcklbw/punpckhbw (SSE2)
- ARM64: zip1/zip2 (NEON)

This eliminates code duplication while maintaining SIMD performance.
Requires GCC 12+ or Clang 3.0+ on x86-64 or ARM64.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
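Putting the pieces together, a minimal sketch of what such a unified vector-extension kernel can look like, assuming a compiler with __builtin_shufflevector (GCC 12+/Clang). This is illustrative, not the actual CPython code.

```c
#include <stdint.h>
#include <string.h>

typedef uint8_t v16qu __attribute__((vector_size(16)));
typedef int8_t  v16qs __attribute__((vector_size(16)));

/* Hypothetical kernel: 16 input bytes -> 32 lowercase hex chars. */
static void hexlify16(const uint8_t *in, uint8_t *out)
{
    v16qu v;
    memcpy(&v, in, 16);                 /* unaligned load */
    v16qu hi = v >> 4;                  /* high nibbles */
    v16qu lo = v & 0x0f;                /* low nibbles */

    /* Branchless nibble -> ASCII. Nibbles are 0..15, so the signed
       compare is safe (see the pcmpgtb commit below). */
    const v16qs nine = {9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9};
    v16qu hi_hex = hi + '0'
                 + (((v16qu)((v16qs)hi > nine)) & ('a' - '0' - 10));
    v16qu lo_hex = lo + '0'
                 + (((v16qu)((v16qs)lo > nine)) & ('a' - '0' - 10));

    /* Interleave high/low chars: punpcklbw/punpckhbw on x86-64,
       zip1/zip2 on ARM64, vzip on ARMv7 NEON. */
    v16qu out0 = __builtin_shufflevector(hi_hex, lo_hex,
        0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23);
    v16qu out1 = __builtin_shufflevector(hi_hex, lo_hex,
        8, 24, 9, 25, 10, 26, 11, 27, 12, 28, 13, 29, 14, 30, 15, 31);
    memcpy(out, &out0, 16);
    memcpy(out + 16, &out1, 16);
}
```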
Extend the portable SIMD hexlify to handle separator cases where
bytes_per_sep >= 16. Uses an in-place shuffle: SIMD-hexlify into the
output buffer, then work backwards inserting separators via memmove.

For 4096 bytes with sep=32: ~3.3µs (vs ~7.3µs for sep=1 scalar).
Useful for hex dump style output like bytes.hex('\n', 32).

Also adds benchmark for newline separator every 32 bytes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
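A hypothetical sketch of that back-to-front shuffle, under simplifying assumptions: `out` already holds the 2*n packed hex chars and is sized for the final layout, the group size g divides n, and bytes.hex()'s right-aligned grouping for positive bytes_per_sep is ignored for brevity.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Expand packed hex chars in place: group i of 2*g chars moves from
   offset i*2*g to offset i*(2*g + 1), with a separator before it. */
static void insert_separators(uint8_t *out, size_t n, size_t g, uint8_t sep)
{
    size_t ngroups = n / g;
    /* Work from the last group backwards so no memmove reads bytes
       that an earlier iteration already overwrote. */
    for (size_t i = ngroups; i-- > 1; ) {
        memmove(out + i * (2*g + 1), out + i * 2*g, 2*g);
        out[i * (2*g + 1) - 1] = sep;
    }
}
```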
Lower the threshold from abs_bytes_per_sep >= 16 to >= 8 for the SIMD
hexlify + memmove shuffle path. Benchmarks show this is worthwhile for
sep=8 and above, but memmove overhead negates benefits for smaller values.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
GCC's vector extensions generate inefficient code for unsigned byte
comparison (hi > nine): psubusb + pcmpeqb + pcmpeqb (3 instructions).

By casting to signed bytes before comparison, GCC generates the
efficient pcmpgtb instruction instead. This is safe because nibble
values (0-15) are within signed byte range.

This reduces the SIMD loop from 29 to 25 instructions, matching the
performance of explicit SSE2 intrinsics while keeping the portable
vector extensions approach.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
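Side by side, the two comparison spellings (illustrative only; the all-ones result feeds the +39 adjustment that turns nibbles 10..15 into 'a'..'f'):

```c
#include <stdint.h>

typedef uint8_t v16qu __attribute__((vector_size(16)));
typedef int8_t  v16qs __attribute__((vector_size(16)));

/* Unsigned compare: x86 has no unsigned byte compare, so GCC emulates
   it with a multi-instruction sequence (per the commit message). */
static v16qu is_alpha_unsigned(v16qu hi)
{
    const v16qu nine = {9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9};
    return (v16qu)(hi > nine);
}

/* Signed compare: nibbles are 0..15, within signed byte range, so the
   reinterpret is safe and compiles to a single pcmpgtb. */
static v16qu is_alpha_signed(v16qu hi)
{
    const v16qs nine = {9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9};
    return (v16qu)((v16qs)hi > nine);
}
```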
Extract the scalar hexlify loop into _Py_hexlify_scalar(), which is
shared between the SIMD fallback path and the main non-SIMD path.
Uses table lookup via Py_hexdigits for consistency.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
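A minimal sketch of such a shared scalar loop; CPython's Py_hexdigits table is mirrored here as a local constant so the snippet stands alone.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for CPython's Py_hexdigits ("0123456789abcdef"). */
static const char hexdigits[] = "0123456789abcdef";

/* One table lookup per nibble, two output chars per input byte. */
static void hexlify_scalar(const uint8_t *in, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[2*i]     = hexdigits[in[i] >> 4];
        out[2*i + 1] = hexdigits[in[i] & 0x0f];
    }
}
```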
Extend portable SIMD support to ARM32 when NEON is available.
The __builtin_shufflevector interleave compiles to vzip instructions
on ARMv7 NEON, similar to zip1/zip2 on ARM64.

NEON is optional on 32-bit ARM (unlike ARM64 where it's mandatory),
so we check for __ARM_NEON in addition to __arm__.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add targeted tests for corner cases relevant to SIMD optimization:

- test_hex_simd_boundaries: Test lengths around the 16-byte SIMD
  threshold (14, 15, 16, 17, 31, 32, 33, 64, 65 bytes)

- test_hex_nibble_boundaries: Test the 9/10 nibble value boundary
  where digits become letters, verifying the signed comparison
  optimization works correctly

- test_hex_simd_separator: Test SIMD separator insertion path
  (triggered when sep >= 8 and len >= 16) with various group
  sizes and both positive/negative bytes_per_sep

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@gpshead gpshead self-assigned this Jan 18, 2026
@gpshead gpshead added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jan 18, 2026
@bedevere-bot commented
🤖 New build scheduled with the buildbot fleet by @gpshead for commit 5fc294c 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F143991%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jan 18, 2026
@gpshead (Member, Author) commented Jan 18, 2026

The buildbot failures are all unrelated (test_capi, test__interpreters, test_urllib2net, etc.).

@gpshead gpshead changed the title gh-XXXXXX: Add portable SIMD optimization for bytes.hex() gh-144015: Add portable SIMD optimization for bytes.hex() Jan 18, 2026
@gpshead gpshead marked this pull request as ready for review January 18, 2026 19:01