gh-146073: Add fitness/exit quality mechanism for JIT trace frontend#147966
gh-146073: Add fitness/exit quality mechanism for JIT trace frontend#147966cocolato wants to merge 4 commits intopython:mainfrom
Conversation
| /* Default fitness configuration values for trace quality control. | ||
| * These can be overridden via PYTHON_JIT_FITNESS_* environment variables. */ | ||
| #define FITNESS_INITIAL 1000 | ||
| #define FITNESS_INITIAL_SIDE 800 | ||
| #define FITNESS_PER_INSTRUCTION 2 | ||
| #define FITNESS_BRANCH_BIASED 5 | ||
| #define FITNESS_BRANCH_UNBIASED 25 | ||
| #define FITNESS_BACKWARD_EDGE 80 | ||
| #define FITNESS_FRAME_ENTRY 10 | ||
|
|
||
| /* Default exit quality constants for fitness-based trace termination. | ||
| * Higher values mean better places to stop the trace. | ||
| * These can be overridden via PYTHON_JIT_EXIT_QUALITY_* environment variables. */ | ||
| #define EXIT_QUALITY_ENTER_EXECUTOR 500 | ||
| #define EXIT_QUALITY_DEFAULT 200 | ||
| #define EXIT_QUALITY_SPECIALIZABLE 50 |
There was a problem hiding this comment.
I ran some benchmarks with the current configuration:
Performance version: 1.14.0
Python version: 3.15.0a7+ (64-bit) revision 2f9438a25f
Report on Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.41
Number of logical CPUs: 12
Start date: 2026-04-01 21:04:33.548904
End date: 2026-04-01 21:09:52.782976
+----------------------+---------------+--------------+--------------+------------------------+
| Benchmark | baseline.json | fitness.json | Change | Significance |
+======================+===============+==============+==============+========================+
| chaos | 44.2 ms | 45.1 ms | 1.02x slower | Not significant |
+----------------------+---------------+--------------+--------------+------------------------+
| deltablue | 2.14 ms | 2.23 ms | 1.04x slower | Significant (t=-9.76) |
+----------------------+---------------+--------------+--------------+------------------------+
| fannkuch | 246 ms | 256 ms | 1.04x slower | Significant (t=-7.84) |
+----------------------+---------------+--------------+--------------+------------------------+
| float | 46.1 ms | 49.6 ms | 1.07x slower | Significant (t=-6.55) |
+----------------------+---------------+--------------+--------------+------------------------+
| go | 82.8 ms | 91.2 ms | 1.10x slower | Significant (t=-11.45) |
+----------------------+---------------+--------------+--------------+------------------------+
| json_dumps | 7.59 ms | 8.32 ms | 1.10x slower | Significant (t=-19.49) |
+----------------------+---------------+--------------+--------------+------------------------+
| json_loads | 18.4 us | 18.9 us | 1.03x slower | Significant (t=-6.08) |
+----------------------+---------------+--------------+--------------+------------------------+
| nbody | 50.2 ms | 50.7 ms | 1.01x slower | Not significant |
+----------------------+---------------+--------------+--------------+------------------------+
| pickle_pure_python | 279 us | 262 us | 1.07x faster | Significant (t=13.59) |
+----------------------+---------------+--------------+--------------+------------------------+
| pidigits | 137 ms | 139 ms | 1.02x slower | Not significant |
+----------------------+---------------+--------------+--------------+------------------------+
| pyflate | 264 ms | 274 ms | 1.04x slower | Significant (t=-8.33) |
+----------------------+---------------+--------------+--------------+------------------------+
| raytrace | 252 ms | 211 ms | 1.19x faster | Significant (t=40.18) |
+----------------------+---------------+--------------+--------------+------------------------+
| regex_compile | 102 ms | 103 ms | 1.01x slower | Not significant |
+----------------------+---------------+--------------+--------------+------------------------+
| regex_effbot | 2.16 ms | 2.17 ms | 1.01x slower | Not significant |
+----------------------+---------------+--------------+--------------+------------------------+
| richards | 16.0 ms | 16.0 ms | 1.00x faster | Not significant |
+----------------------+---------------+--------------+--------------+------------------------+
| spectral_norm | 60.8 ms | 57.3 ms | 1.06x faster | Significant (t=2.43) |
+----------------------+---------------+--------------+--------------+------------------------+
| telco | 6.02 ms | 5.80 ms | 1.04x faster | Significant (t=5.54) |
+----------------------+---------------+--------------+--------------+------------------------+
| unpickle_pure_python | 177 us | 170 us | 1.04x faster | Significant (t=5.28) |
+----------------------+---------------+--------------+--------------+------------------------+
| xml_etree_generate | 78.5 ms | 74.4 ms | 1.06x faster | Significant (t=5.83) |
+----------------------+---------------+--------------+--------------+------------------------+
| xml_etree_iterparse | 75.6 ms | 71.5 ms | 1.06x faster | Significant (t=4.14) |
+----------------------+---------------+--------------+--------------+------------------------+
| xml_etree_parse | 134 ms | 120 ms | 1.12x faster | Significant (t=6.42) |
+----------------------+---------------+--------------+--------------+------------------------+
| xml_etree_process | 48.7 ms | 49.4 ms | 1.02x slower | Not significant |
+----------------------+---------------+--------------+--------------+------------------------+
Fidget-Spinner
left a comment
There was a problem hiding this comment.
Great work. Thanks for doing this!
|
I also forgot to mention: but we should adjust for the following:
|
markshannon
left a comment
There was a problem hiding this comment.
This looks good, and thanks for doing this.
Overall, the approach looks good.
I originally had in a mind that the fitness would reduce geometrically, not arithmetically, but your arithmetic approach looks easier to reason about and at least as good in terms of trace quality.
We need to reduce the number of configurable parameters to just one or two.
Then we can make sure that fitness and penalties are set such that we are guaranteed not to overflow the trace or optimizer buffers.
I'm not sure that rewinding is worth it. As long as "good" exits have a much higher score than "bad" exits, then we should (almost) always end up at a good exit.
Python/optimizer.c
Outdated
| // Check if fitness is depleted — should we stop the trace? | ||
| if (ts->fitness < eq && | ||
| !(progress_needed && uop_buffer_length(trace) < CODE_SIZE_NO_PROGRESS)) { | ||
| // Prefer stopping at the best recorded exit point |
There was a problem hiding this comment.
Just stop here. Backing up could give us a better exit, but it might give us a worse trace. And it is more complex to implement.
Python/optimizer.c
Outdated
| else { | ||
| // No valid best exit — stop at current position | ||
| ADD_TO_TRACE(_EXIT_TRACE, 0, 0, target); | ||
| uop_buffer_last(trace)->operand1 = true; // is_control_flow |
There was a problem hiding this comment.
This doesn't count as control flow. It is a terminator, not a branch.
| trace->end -= needs_guard_ip; | ||
|
|
||
| int space_needed = expansion->nuops + needs_guard_ip + 2 + (!OPCODE_HAS_NO_SAVE_IP(opcode)); | ||
| if (uop_buffer_remaining_space(trace) < space_needed) { |
There was a problem hiding this comment.
| assert (uop_buffer_remaining_space(trace) > space_needed); |
If we choose the fitness and exit values correctly, we can't run out of space.
Python/optimizer.c
Outdated
| _PyJitTracerTranslatorState *ts_depth = &tracer->translator_state; | ||
| if (ts_depth->frame_depth <= 0) { | ||
| // Underflow | ||
| ts_depth->fitness -= (int32_t)tstate->interp->opt_config.fitness_frame_entry * 2; |
There was a problem hiding this comment.
I think this is fundamentally different to making a call, so should have its own distinct (and probably larger) penalty.
| // Underflow | ||
| ts_depth->fitness -= (int32_t)tstate->interp->opt_config.fitness_frame_entry * 2; | ||
| } | ||
| ts_depth->frame_depth = ts_depth->frame_depth <= 0 ? 0 : ts_depth->frame_depth - 1; |
There was a problem hiding this comment.
| ts_depth->frame_depth = ts_depth->frame_depth <= 0 ? 0 : ts_depth->frame_depth - 1; | |
| ts_depth->frame_depth = ts_depth->frame_depth <= 0 ? 0 : ts_depth->frame_depth - 1; | |
| ts_depth->fitness -= penalty; |
We should increase the fitness on returning, we want to inline calls to small functions, so we shouldn't penalize it.
|
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
Background
See: #146073
Introduced a preliminary fitness/exit quality mechanism for JIT trace frontend has, enabling the tracer to:
ENTER_EXECUTOR) to avoid stopping on instructions that can be optimized