gh-147991: Speed up tomllib import time #147992
vstinner wants to merge 10 commits into python:main
Defer the regular expression import until the first datetime, localtime or non-trivial number (anything other than plain decimal digits) is encountered.
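A minimal sketch of the deferral idea described above, not the PR's actual code. The names `get_number_re`, `regex_loaded` and `parse_int` are hypothetical, and the pattern is only illustrative (it is not tomllib's number grammar). The point is that plain digits never touch `re`, so the import cost is paid only by documents that need it.

```python
_number_re = None  # hypothetical module-level cache, filled on first use


def get_number_re():
    """Import `re` and compile the pattern the first time it is needed."""
    global _number_re
    if _number_re is None:
        import re  # deferred import: skipped entirely for simple documents

        # Illustrative pattern for underscore-separated decimals only.
        _number_re = re.compile(r"[+-]?\d+(?:_\d+)*")
    return _number_re


def regex_loaded():
    """Report whether the slow path has been taken at least once."""
    return _number_re is not None


def parse_int(text):
    """Parse plain digits without `re`; fall back to the regex otherwise."""
    if text.isascii() and text.isdigit():
        return int(text)  # fast path: no regular expression involved
    if get_number_re().fullmatch(text) is None:
        raise ValueError(f"not a decimal integer: {text!r}")
    return int(text.replace("_", ""))
```

With this layout, `parse_int("123")` leaves `regex_loaded()` false, while `parse_int("1_000")` triggers the import and compilation once.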
|
It might be interesting to replace |
vstinner
left a comment
I marked the added constants and functions as private by adding a _ prefix. I'm not sure that's needed: all other _parser APIs are "public" (no underscore prefix).
Lib/tomllib/_parser.py (outdated)
if pos >= end:
break
else:
if src[pos] != "\n":
Can this happen? We could just return None and fall back to the original path.
Yes, in many cases. See the added test_parse_simple_number(). Examples:
- The test is true when parsing 1979-05-27: we cannot parse the date.
- The test is false when parsing 1\n (ex: value = 1\n) or 23, 24]\n (ex: list = [23, 24]\n)
|
I updated the PR to replace I ran benchmarks on the latest PR using Python built in release mode (
|
|
Ok, the PR is now ready for review. cc @hauntsaninja @encukou I updated the PR to use public names. I also fixed tests for hex/oct/bin numbers. |
|
cc also Tomli maintainer @hukkin. |
|
Hi! 👋 I have an old Tomli branch where I've attempted to do very similar things, but it went into the discard pile, IIRC either because distlib's executable wrapper imports
The decimal parsing code I had was mostly as follows. (I've slightly simplified (the original parses underscored decimals too) and commented.)

# If one of these follows a "simple decimal" it could mean that
# the value is actually something else (float, datetime...) so
# optimized parsing should be abandoned.
ILLEGAL_AFTER_SIMPLE_DECIMAL: Final = frozenset(
    "eE."  # decimal
    "xbo"  # hex, bin, oct
    "-"  # datetime
    ":"  # localtime
    "_0123456789"  # complex decimal
)

def try_simple_decimal(src: str, pos: Pos) -> None | tuple[Pos, int]:
    """Parse a "simple" decimal integer.

    An optimization that tries to parse a simple decimal integer
    without underscores. Returns `None` if there's any uncertainty
    on correctness.
    """
    start_pos = pos
    if src.startswith(("+", "-"), pos):
        pos += 1
    if src.startswith("0", pos):
        pos += 1
    elif src.startswith(("1", "2", "3", "4", "5", "6", "7", "8", "9"), pos):
        pos = skip_chars(src, pos, "0123456789")
    else:
        return None
    try:
        next_char = src[pos]
    except IndexError:
        next_char = None
    if next_char in ILLEGAL_AFTER_SIMPLE_DECIMAL:
        return None
    return pos, int(src[start_pos:pos])

def parse_value(
    src: str, pos: Pos, parse_float: ParseFloat, nest_lvl: int
) -> tuple[Pos, Any]:
    ...
    simple_dec_result = try_simple_decimal(src, pos)
    if simple_dec_result is not None:
        return simple_dec_result
    ...

This should do similar things as what you have. One difference is that at least to me I'd personally name the function something other than |
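For anyone who wants to poke at the fast path in isolation, here is a self-contained sketch adapted from the snippet above. Assumptions: `skip_chars` is a simplified stand-in written for this example (not Tomli's real helper), `Pos` is replaced by plain `int`, and the sample inputs are made up for illustration.

```python
from typing import Optional, Tuple

DIGITS = "0123456789"
# Characters that mean "this may not be a simple decimal after all".
ILLEGAL_AFTER_SIMPLE_DECIMAL = frozenset("eE." "xbo" "-" ":" "_0123456789")


def skip_chars(src: str, pos: int, chars: str) -> int:
    """Stand-in helper (assumption): advance while src[pos] is in chars."""
    while pos < len(src) and src[pos] in chars:
        pos += 1
    return pos


def try_simple_decimal(src: str, pos: int) -> Optional[Tuple[int, int]]:
    """Return (new_pos, value) for a simple decimal, else None."""
    start_pos = pos
    if src.startswith(("+", "-"), pos):
        pos += 1
    if src.startswith("0", pos):
        pos += 1
    elif src.startswith(tuple("123456789"), pos):
        pos = skip_chars(src, pos, DIGITS)
    else:
        return None
    next_char = src[pos] if pos < len(src) else None
    if next_char in ILLEGAL_AFTER_SIMPLE_DECIMAL:
        return None  # let the regular (regex-based) path decide
    return pos, int(src[start_pos:pos])


# Hypothetical sample inputs: value text as it appears after "key = ".
SAMPLES = ["1\n", "23, 24]\n", "-17 ", "1979-05-27", "0x1F", "1_000", "3.14"]
for sample in SAMPLES:
    print(repr(sample), "->", try_simple_decimal(sample, 0))
```

Only the first three samples take the fast path. The date, the hex literal, the underscored decimal and the float all return `None` and would fall back to the existing parser.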
Co-authored-by: Taneli Hukkinen <hukkinen@eurecom.fr>
|
@hukkin: Hi! Oh, it's great that you already explored the "simple decimal number parser" strategy. I really like your implementation, it looks way better than mine! So I simply copy/pasted your code and I added you as a co-author. I don't know how stdlib tomllib is maintained. Should I contribute this change to https://github.com/hukkin/tomli first? Or is it ok to land such a change in the stdlib tomllib module first? |
|
"Tests / CIFuzz / python3-libraries (address)" failed: it generated a TOML file of 1860 characters with 593 |
hauntsaninja
left a comment
The lazy modules change looks good to me.
It would be good to benchmark try_simple_decimal. If it's slower, I'm not sure that part of the PR is worth it... you end up saving the one-time cost of an import at the time of first load and only in a fraction of TOML documents. (If PEP 829 is accepted and there is no use for numbers in site.toml, then we can reconsider)
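To put a number on the one-time import cost mentioned above, one option is CPython's `-X importtime` flag. This is a rough sketch, not part of the PR: `import_times_us` is a hypothetical helper, and it assumes the flag's usual stderr layout (`import time: self | cumulative | name`).

```python
import subprocess
import sys


def import_times_us(module):
    """Return {module name: cumulative import time in microseconds}.

    Runs a fresh interpreter with -X importtime so that already
    imported modules in this process do not hide the cost.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", f"import {module}"],
        capture_output=True,
        text=True,
    )
    times = {}
    prefix = "import time:"
    for line in proc.stderr.splitlines():
        if not line.startswith(prefix):
            continue
        parts = line[len(prefix):].split("|")
        if len(parts) != 3:
            continue
        try:
            times[parts[2].strip()] = int(parts[1])
        except ValueError:
            continue  # header line ("cumulative" is not a number)
    return times


times = import_times_us("tomllib")
for name in ("re", "tomllib"):
    # None means the module was not imported at all in that run.
    print(name, times.get(name))
```

A single run is noisy, so repeating the measurement several times and comparing before and after the change gives a fairer picture.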
I wrote a benchmark on parsing TOML with integer keys, 1000 lines of:
Results on Python built in release mode (
Oh, I'm surprised that this change makes "small int" and "big int" cases actually faster. I expected a regex (
The "sep int" is expected to be slower since
Benchmark code:

import pyperf
import os.path
import tomllib

LINES = 10**3
SMALL_TOML = 'bench_small.toml'
BIGINT_TOML = 'bench_bigint.toml'
SEP_TOML = 'bench_sep.toml'

def create_smallint_toml(filename):
    with open(filename, 'w', encoding='utf8') as fp:
        print('[section]', file=fp)
        for i in range(1, LINES + 1):
            print(f'small{i} = 123', file=fp)

def create_bigint_toml(filename):
    bigint = 10 ** 40 - 1
    with open(filename, 'w', encoding='utf8') as fp:
        print('[section]', file=fp)
        for i in range(1, LINES + 1):
            print(f'bigint{i} = {bigint}', file=fp)

def create_sep_toml(filename):
    with open(filename, 'w', encoding='utf8') as fp:
        print('[section]', file=fp)
        for i in range(1, LINES + 1):
            print(f'sep{i} = 123_456_789', file=fp)

if not os.path.exists(SMALL_TOML):
    create_smallint_toml(SMALL_TOML)
if not os.path.exists(BIGINT_TOML):
    create_bigint_toml(BIGINT_TOML)
if not os.path.exists(SEP_TOML):
    create_sep_toml(SEP_TOML)

def parse_toml(filename):
    with open(filename, 'rb') as fp:
        tomllib.load(fp)

runner = pyperf.Runner()
runner.bench_func('parse small int', parse_toml, SMALL_TOML)
runner.bench_func('parse big int', parse_toml, BIGINT_TOML)
runner.bench_func('parse sep int', parse_toml, SEP_TOML)
|
I'm not aware of any official guidance. I believe so far most changes have gone into Tomli first. One benefit is that Tomli runs all the tests in https://github.com/toml-lang/toml-test. Tomli also has
I agree with this. If my memory (from past attempts to improve Tomli's performance) serves right, I'm not familiar with site.toml. Does it always have decimal integers? E.g. in pyproject.toml I feel like integers are surprisingly rarely used.
The results seem surprising to me too! Could this be because the parser now prioritizes decimal integers over floats, datetimes and local times? A document with mixed data types may yield different results? It could also be that interpreted Python is quite a bit faster now than a few years ago, which is great! |
Ok, I created hukkin/tomli#292 which is based on this PR.
[metadata]
schema_version = 1
[paths]
dirs = ["../lib", "/opt/mylib", "{sitedir}/extra"]
[entrypoints]
init = ["foo.startup:initialize", "foo.plugins"]
I consider that making |

