gh-147991: Speed up tomllib import time by vstinner · Pull Request #147992 · python/cpython

vstinner · 2026-04-02T03:07:26Z

Defer the regular expressions import until the first datetime, localtime or non-trivial number (anything other than plain decimal digits) is encountered.

Issue: Improve tomllib startup time #147991
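The deferral idea in the description can be illustrated with a minimal sketch. It is not the PR's code: `number_pattern`, `parse_int` and the simplified regex are hypothetical names and a hypothetical pattern chosen for illustration. Plain decimal digits take a fast path, and `re` is imported and compiled only when a less trivial number shows up.

```python
import functools


@functools.cache
def number_pattern():
    """Compile the integer regex on first use only (hypothetical helper)."""
    import re  # deferred: not imported until a non-trivial number is parsed

    # Simplified stand-in pattern: optional sign, no leading zeros,
    # single underscores allowed between digits.
    return re.compile(r"[+-]?(?:0|[1-9](?:_?[0-9])*)")


def parse_int(text: str) -> int:
    """Parse a TOML-style decimal integer, skipping the regex when possible."""
    # Fast path: plain ASCII digits without a leading zero.
    if text.isascii() and text.isdigit() and (text == "0" or text[0] != "0"):
        return int(text)
    # Slow path: only now pay for importing and compiling the regex.
    if number_pattern().fullmatch(text) is None:
        raise ValueError(f"invalid integer: {text!r}")
    return int(text)
```

With this shape, `parse_int("42")` never touches the regex, while `parse_int("1_000")` triggers the one-time compile.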

vstinner · 2026-04-02T03:22:41Z

It might be interesting to replace from types import MappingProxyType with built-in frozendict. But currently, the GitHub Action CI runs mypy with Python 3.12 which doesn't have frozendict.
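For readers on interpreters without a built-in `frozendict`, the swap discussed here can be approximated with a fallback. This is only a sketch and an assumption on my part: the PR uses `frozendict` directly, and the names `FrozenMapping` and `ESCAPES` are hypothetical.

```python
from types import MappingProxyType

try:
    # Assumption taken from the discussion: newer Python versions expose
    # a built-in frozendict; older ones raise NameError here.
    FrozenMapping = frozendict  # type: ignore[name-defined]
except NameError:
    # Fallback for older interpreters, such as the 3.12 used by the mypy job.
    FrozenMapping = MappingProxyType

# A read-only mapping in both cases.
ESCAPES = FrozenMapping({"\\b": "\u0008", "\\t": "\u0009"})
```

Either way the resulting mapping rejects item assignment, which is the property the parser relies on for its constant tables.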

vstinner

I marked the added constants and functions as private by adding a _ prefix. I'm not sure if it's needed: all other _parser APIs are "public" (no underscore prefix).

Lib/tomllib/_parser.py

hugovk · 2026-04-02T09:53:15Z

> It might be interesting to replace from types import MappingProxyType with built-in frozendict. But currently, the GitHub Action CI runs mypy with Python 3.12 which doesn't have frozendict.

Adding # type: ignore[name-defined] is a quick fix.

This:

diff --git a/Lib/tomllib/_parser.py b/Lib/tomllib/_parser.py
index b59d0f7d54b..96f189537cf 100644
--- a/Lib/tomllib/_parser.py
+++ b/Lib/tomllib/_parser.py
@@ -4,7 +4,7 @@
 
 from __future__ import annotations
 
-from types import MappingProxyType
+__lazy_modules__ = ["tomllib._re"]
 
 from ._re import (
     RE_DATETIME,
@@ -42,7 +42,7 @@
 KEY_INITIAL_CHARS: Final = BARE_KEY_CHARS | frozenset("\"'")
 HEXDIGIT_CHARS: Final = frozenset("abcdef" "ABCDEF" "0123456789")
 
-BASIC_STR_ESCAPE_REPLACEMENTS: Final = MappingProxyType(
+BASIC_STR_ESCAPE_REPLACEMENTS: Final = frozendict(  # type: ignore[name-defined]
     {
         "\\b": "\u0008",  # backspace
         "\\t": "\u0009",  # tab

Gets us from 4 ms to ~0 ms.

Lib/tomllib/_parser.py

eendebakpt · 2026-04-02T10:15:52Z

Lib/tomllib/_parser.py

+        if pos >= end:
+            break
+    else:
+        if src[pos] != "\n":


Can this happen? We could just return None and fall back to the original path.

Yes, in many cases. See the added test_parse_simple_number(). Examples:

The test is true when parsing 1979-05-27: we cannot parse the date.

The test is false when parsing 1\n (e.g. value = 1\n) or 23, 24]\n (e.g. list = [23, 24]\n).

Lib/tomllib/_parser.py

vstinner · 2026-04-02T10:56:34Z

I updated the PR to replace types.MappingProxyType with the frozendict type, thanks to the # type: ignore[name-defined] annotation (to please the mypy gods).

I ran benchmarks on the latest PR using Python built in release mode (gcc -O3) on Fedora 43:

According to -X importtime, with this change, import tomllib takes 828 us instead of 9.0 ms on main (10.9x faster).
Using python -m pyperf command with ./python -S, with this change, import tomllib takes 0.98 ms instead of 9.8 ms (10x faster).
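A measurement like the `-X importtime` figure above can be scripted. The helper below is a rough sketch, not the method used for the numbers in this comment: `import_time_us` is a hypothetical name, and it simply runs a child interpreter and reads the cumulative column for the requested module from stderr.

```python
import subprocess
import sys


def import_time_us(module: str) -> int:
    """Return the cumulative import time in microseconds for one module."""
    proc = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", f"import {module}"],
        capture_output=True,
        text=True,
        check=True,
    )
    for line in proc.stderr.splitlines():
        # Data lines look like: "import time:  self | cumulative | name"
        if not line.startswith("import time:"):
            continue
        parts = line[len("import time:"):].split("|")
        if len(parts) == 3 and parts[2].strip() == module:
            return int(parts[1])
    raise LookupError(f"{module} not found in -X importtime output")


if __name__ == "__main__":
    print(import_time_us("tomllib"), "us")
```

A single run is noisy, so repeated runs (or pyperf, as used above) give more trustworthy numbers.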

vstinner · 2026-04-02T15:17:28Z

Ok, the PR is now ready for review. cc @hauntsaninja @encukou

I updated the PR to use public names. I also fixed tests for hex/oct/bin numbers.

hugovk · 2026-04-02T15:19:57Z

cc also Tomli maintainer @hukkin.

hukkin · 2026-04-02T21:35:34Z

Hi! 👋

I have an old Tomli branch where I've attempted to do very similar things, but it went into the discard pile, IIRC either because distlib's executable wrapper imports re module so trying to avoid the import didn't help (the situation today is very different because both pip and uv override the executable wrapper), or perhaps it was because the re module was faster at parsing integers than pure Python so the optimization seemed case dependent and controversial. Can't remember exactly 😄

The decimal parsing code I had was mostly as follows. (I've slightly simplified (the original parses underscored decimals too) and commented.)

# If one of these follows a "simple decimal" it could mean that
# the value is actually something else (float, datetime...) so
# optimized parsing should be abandoned.
ILLEGAL_AFTER_SIMPLE_DECIMAL: Final = frozenset(
    "eE."  # decimal
    "xbo"  # hex, bin, oct
    "-"  # datetime
    ":"  # localtime
    "_0123456789"  # complex decimal
)


def try_simple_decimal(src: str, pos: Pos) -> None | tuple[Pos, int]:
    """Parse a "simple" decimal integer.

    An optimization that tries to parse a simple decimal integer
    without underscores. Returns `None` if there's any uncertainty
    on correctness.
    """
    start_pos = pos

    if src.startswith(("+", "-"), pos):
        pos += 1

    if src.startswith("0", pos):
        pos += 1
    elif src.startswith(("1", "2", "3", "4", "5", "6", "7", "8", "9"), pos):
        pos = skip_chars(src, pos, "0123456789")
    else:
        return None

    try:
        next_char = src[pos]
    except IndexError:
        next_char = None
    if next_char in ILLEGAL_AFTER_SIMPLE_DECIMAL:
        return None

    return pos, int(src[start_pos:pos])


def parse_value(
    src: str, pos: Pos, parse_float: ParseFloat, nest_lvl: int
) -> tuple[Pos, Any]:
    ...
    simple_dec_result = try_simple_decimal(src, pos)
    if simple_dec_result is not None:
        return simple_dec_result
    ...

This should do similar things as what you have. One difference is that at least to me ILLEGAL_AFTER_SIMPLE_DECIMAL here seems easier to prove correct (by looking at tomllib's code) than NUMBER_END_CHARS.

I'd personally name the function something other than parse_something simply because no other parse_ function in tomllib returns a None. They all successfully parse or raise an error.
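To experiment with the behaviour of the snippet above outside of tomllib, here is a self-contained adaptation. It is a sketch under assumptions: `skip_chars` is replaced by an inline loop, `Pos` by `int`, and the name `try_simple_decimal_sketch` is hypothetical. The set of characters that abort the fast path is the one quoted above.

```python
from __future__ import annotations

DIGITS = frozenset("0123456789")
# Same characters as ILLEGAL_AFTER_SIMPLE_DECIMAL in the snippet above.
ILLEGAL_AFTER = frozenset("eE." "xbo" "-" ":" "_0123456789")


def try_simple_decimal_sketch(src: str, pos: int = 0) -> tuple[int, int] | None:
    """Return (new_pos, value) for a simple decimal integer, else None."""
    start = pos
    if src.startswith(("+", "-"), pos):
        pos += 1

    if src.startswith("0", pos):
        pos += 1  # a lone zero; a following digit is rejected below
    elif pos < len(src) and src[pos] in DIGITS:
        while pos < len(src) and src[pos] in DIGITS:
            pos += 1
    else:
        return None

    # Anything hinting at a float, date, time, prefix or underscore
    # means the value is not a simple decimal: give up.
    if pos < len(src) and src[pos] in ILLEGAL_AFTER:
        return None
    return pos, int(src[start:pos])
```

For example, `try_simple_decimal_sketch("1979-05-27")` returns `None` because of the `-` after the digits, so the caller would fall back to the regex-based path.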

Co-authored-by: Taneli Hukkinen <hukkinen@eurecom.fr>

vstinner · 2026-04-02T22:28:47Z

@hukkin: Hi! Oh, it's great that you already explored the "simple decimal number parser" strategy. I really like your implementation, it looks way better than mine! So I simply copy/pasted your code and I added you as a co-author.

I don't know how stdlib tomllib is maintained. Should I contribute this change to https://github.com/hukkin/tomli first? Or is it ok to land such a change in the stdlib tomllib module first?

vstinner · 2026-04-02T22:50:07Z

"Tests / CIFuzz / python3-libraries (address)" failed: it generated a TOML file of 1860 characters with 593 [ array opening characters and no ] array closing character. tomllib.load() fails with RecursionError. But if I use sys.setrecursionlimit(10_000), tomllib raises tomllib.TOMLDecodeError: Unclosed array (at line 1, column 1553) as expected. So it's a false alarm and I suggest ignoring it for now.

hauntsaninja

The lazy modules change looks good to me.

It would be good to benchmark try_simple_decimal. If it's slower, I'm not sure that part of the PR is worth it... you end up saving the one-time cost of an import at the time of first load and only in a fraction of TOML documents. (If PEP 829 is accepted and there is no use for numbers in site.toml, then we can reconsider)

vstinner · 2026-04-03T12:39:22Z

> It would be good to benchmark try_simple_decimal.

I wrote a benchmark on parsing TOML with integer keys, 1000 lines of:

small: small1 = 123
bigint: bigint1 = 9999999999999999999999999999999999999999 (40 digits)
sep: sep1 = 123_456_789

Results on Python built in release mode (gcc -O3) on Fedora 43:

| Benchmark | ref | lazy |
|---|---|---|
| parse small int | 3.88 ms | 2.70 ms: 1.44x faster |
| parse big int | 4.81 ms | 4.17 ms: 1.15x faster |
| parse sep int | 3.67 ms | 4.05 ms: 1.10x slower |
| Geometric mean | (ref) | 1.15x faster |
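The ratios in the table above can be re-derived from the raw timings. The snippet below is only an arithmetic check on the reported numbers; the variable names are mine.

```python
import math

# (ref_ms, lazy_ms) pairs copied from the table above.
timings = {
    "parse small int": (3.88, 2.70),
    "parse big int": (4.81, 4.17),
    "parse sep int": (3.67, 4.05),
}

# A ratio above 1 means the "lazy" build is faster than "ref".
ratios = {name: ref / lazy for name, (ref, lazy) in timings.items()}

# Geometric mean of the three speedups.
geo_mean = math.prod(ratios.values()) ** (1 / len(ratios))

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
print(f"geometric mean: {geo_mean:.2f}x")
```

The "sep int" ratio comes out below 1, which corresponds to the 1.10x slowdown in the table once inverted.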

Oh, I'm surprised that this change makes "small int" and "big int" cases actually faster. I expected a regex (RE_NUMBER) to be faster than a dummy loop in Python (skip_chars(src, pos, "0123456789")).

The "sep int" is expected to be slower since try_simple_decimal() parses 3 digits before giving up on _ separator, and then falls back on the regular RE_NUMBER parser.

Benchmark code:

import pyperf
import os.path
import tomllib

LINES = 10**3
SMALL_TOML = 'bench_small.toml'
BIGINT_TOML = 'bench_bigint.toml'
SEP_TOML = 'bench_sep.toml'

def create_smallint_toml(filename):
    with open(filename, 'w', encoding='utf8') as fp:
        print('[section]', file=fp)
        for i in range(1, LINES + 1):
            print(f'small{i} = 123', file=fp)

def create_bigint_toml(filename):
    bigint = 10 ** 40 - 1
    with open(filename, 'w', encoding='utf8') as fp:
        print('[section]', file=fp)
        for i in range(1, LINES + 1):
            print(f'bigint{i} = {bigint}', file=fp)

def create_sep_toml(filename):
    with open(filename, 'w', encoding='utf8') as fp:
        print('[section]', file=fp)
        for i in range(1, LINES + 1):
            print(f'sep{i} = 123_456_789', file=fp)

if not os.path.exists(SMALL_TOML):
    create_smallint_toml(SMALL_TOML)
if not os.path.exists(BIGINT_TOML):
    create_bigint_toml(BIGINT_TOML)
if not os.path.exists(SEP_TOML):
    create_sep_toml(SEP_TOML)

def parse_toml(filename):
    with open(filename, 'rb') as fp:
        tomllib.load(fp)

runner = pyperf.Runner()
runner.bench_func('parse small int', parse_toml, SMALL_TOML)
runner.bench_func('parse big int', parse_toml, BIGINT_TOML)
runner.bench_func('parse sep int', parse_toml, SEP_TOML)
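To look at the regex-versus-loop question in isolation, a micro-benchmark with `timeit` is one option. This is a rough sketch and not part of the PR: `INT_RE` is a hypothetical stand-in that is much simpler than tomllib's RE_NUMBER, and the loop only imitates what skip_chars() does, so absolute timings will not match the pyperf results above.

```python
import re
import timeit

DIGITS = frozenset("0123456789")
INT_RE = re.compile(r"[0-9]+")  # hypothetical, far simpler than RE_NUMBER


def with_regex(src: str, pos: int = 0) -> int:
    """Find the run of digits with a compiled regex."""
    match = INT_RE.match(src, pos)
    return int(match.group())


def with_loop(src: str, pos: int = 0) -> int:
    """Find the run of digits with a plain Python loop."""
    end = pos
    while end < len(src) and src[end] in DIGITS:
        end += 1
    return int(src[pos:end])


if __name__ == "__main__":
    for func in (with_regex, with_loop):
        seconds = timeit.timeit(lambda: func("123\n"), number=200_000)
        print(f"{func.__name__}: {seconds:.3f} s")
```

Both functions return the same value for the same input, so any difference in the printed timings comes from the scanning strategy alone.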

hukkin · 2026-04-03T13:06:33Z

> I don't know how stdlib tomllib is maintained. Should I contribute this change to https://github.com/hukkin/tomli first? Or is it ok to land such a change in the stdlib tomllib module first?

I'm not aware of any official guidance. I believe so far most changes have gone into Tomli first. One benefit is that Tomli runs all the tests in https://github.com/toml-lang/toml-test. Tomli also has tox -e benchmark and tox -e benchmark-import for performance and import time benchmarking.

> It would be good to benchmark try_simple_decimal. If it's slower, I'm not sure that part of the PR is worth it... you end up saving the one-time cost of an import at the time of first load and only in a fraction of TOML documents. (If PEP 829 is accepted and there is no use for numbers in site.toml, then we can reconsider)

I agree with this. If my memory (from past attempts to improve Tomli's performance) serves right, re based parsing is more comparable to mypyc compiled Python than interpreted Python, and regex based code is much more concise (and subjectively easier to read).

I'm not familiar with site.toml. Does it always have decimal integers? E.g. in pyproject.toml I feel like integers are surprisingly rarely used.

> I wrote a benchmark on parsing TOML with integer keys, 1000 lines of:

The results seem surprising to me too! Could this be because the parser now prioritizes decimal integers over floats, datetimes and local times? A document with mixed data types may yield different results? It could also be that interpreted Python is quite a bit faster now than a few years ago, which is great!

vstinner · 2026-04-03T14:01:31Z

@hukkin:

> I believe so far most changes have gone into Tomli first.

Ok, I created hukkin/tomli#292 which is based on this PR.

> I'm not familiar with site.toml. Does it always have decimal integers? E.g. in pyproject.toml I feel like integers are surprisingly rarely used.

<package>.site.toml comes from the draft PEP 829 "Structured Startup Configuration via .site.toml Files" which is still being discussed. The TOML example uses the optional schema_version = 1 line which is the only line containing an integer.

[metadata]
schema_version = 1

[paths]
dirs = ["../lib", "/opt/mylib", "{sitedir}/extra"]

[entrypoints]
init = ["foo.startup:initialize", "foo.plugins"]

I consider that making tomllib import time faster is interesting even if PEP 829 is not adopted.

pythongh-147991: Speed up tomllib import time

50d5739

Defer the regular expressions import until the first datetime, localtime or non-trivial number (anything other than plain decimal digits) is encountered.

bedevere-app bot mentioned this pull request Apr 2, 2026

Improve tomllib startup time #147991

Open


vstinner added 4 commits April 2, 2026 12:32

Use lazy import

1182135

Enhance _parse_simple_number() to handle more cases

4899c2e

Remove duplicated test

4d11aa9

Replace types.MappingProxyType with frozendict

ce57c76

vstinner added 2 commits April 2, 2026 17:11

Use public names

63fd2b1

Fix tests on hex/oct/bin numbers

ed127fa

vstinner marked this pull request as ready for review April 2, 2026 15:17

vstinner requested review from encukou and hauntsaninja as code owners April 2, 2026 15:17

bedevere-app bot added the awaiting core review label Apr 2, 2026

vstinner and others added 3 commits April 3, 2026 00:11

Rename to try_simple_decimal()

9be2a80

Use Taneli's implementation of try_simple_decimal()

a207fbb

Co-authored-by: Taneli Hukkinen <hukkinen@eurecom.fr>

Rename res to number

82d4e2b

hauntsaninja reviewed Apr 2, 2026

vstinner mentioned this pull request Apr 3, 2026

Use Python 3.15 lazy import hukkin/tomli#292

Open
