Skip to content

gh-146073: Add fitness/exit quality mechanism for JIT trace frontend#147966

Open
cocolato wants to merge 4 commits intopython:mainfrom
cocolato:jit-tracer-fitness
Open

gh-146073: Add fitness/exit quality mechanism for JIT trace frontend#147966
cocolato wants to merge 4 commits intopython:mainfrom
cocolato:jit-tracer-fitness

Conversation

@cocolato
Copy link
Copy Markdown
Member

@cocolato cocolato commented Apr 1, 2026

Background

See: #146073

Introduced a preliminary fitness/exit quality mechanism for JIT trace frontend has, enabling the tracer to:

  • Stop proactively: Stop at the appropriate time without waiting for the buffer to fill up or hitting a dead end
  • Stop at good locations: Prioritize stopping at the entry points of existing executors (ENTER_EXECUTOR) to avoid stopping on instructions that can be optimized
  • Control frame depth: Apply a penalty that increases with depth for function call inlining

Comment on lines +18 to +33
/* Default fitness configuration values for trace quality control.
* These can be overridden via PYTHON_JIT_FITNESS_* environment variables. */
#define FITNESS_INITIAL 1000
#define FITNESS_INITIAL_SIDE 800
#define FITNESS_PER_INSTRUCTION 2
#define FITNESS_BRANCH_BIASED 5
#define FITNESS_BRANCH_UNBIASED 25
#define FITNESS_BACKWARD_EDGE 80
#define FITNESS_FRAME_ENTRY 10

/* Default exit quality constants for fitness-based trace termination.
* Higher values mean better places to stop the trace.
* These can be overridden via PYTHON_JIT_EXIT_QUALITY_* environment variables. */
#define EXIT_QUALITY_ENTER_EXECUTOR 500
#define EXIT_QUALITY_DEFAULT 200
#define EXIT_QUALITY_SPECIALIZABLE 50
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran some benchmarks with the current configuration:

Performance version: 1.14.0
Python version: 3.15.0a7+ (64-bit) revision 2f9438a25f
Report on Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.41
Number of logical CPUs: 12
Start date: 2026-04-01 21:04:33.548904
End date: 2026-04-01 21:09:52.782976

+----------------------+---------------+--------------+--------------+------------------------+
| Benchmark            | baseline.json | fitness.json | Change       | Significance           |
+======================+===============+==============+==============+========================+
| chaos                | 44.2 ms       | 45.1 ms      | 1.02x slower | Not significant        |
+----------------------+---------------+--------------+--------------+------------------------+
| deltablue            | 2.14 ms       | 2.23 ms      | 1.04x slower | Significant (t=-9.76)  |
+----------------------+---------------+--------------+--------------+------------------------+
| fannkuch             | 246 ms        | 256 ms       | 1.04x slower | Significant (t=-7.84)  |
+----------------------+---------------+--------------+--------------+------------------------+
| float                | 46.1 ms       | 49.6 ms      | 1.07x slower | Significant (t=-6.55)  |
+----------------------+---------------+--------------+--------------+------------------------+
| go                   | 82.8 ms       | 91.2 ms      | 1.10x slower | Significant (t=-11.45) |
+----------------------+---------------+--------------+--------------+------------------------+
| json_dumps           | 7.59 ms       | 8.32 ms      | 1.10x slower | Significant (t=-19.49) |
+----------------------+---------------+--------------+--------------+------------------------+
| json_loads           | 18.4 us       | 18.9 us      | 1.03x slower | Significant (t=-6.08)  |
+----------------------+---------------+--------------+--------------+------------------------+
| nbody                | 50.2 ms       | 50.7 ms      | 1.01x slower | Not significant        |
+----------------------+---------------+--------------+--------------+------------------------+
| pickle_pure_python   | 279 us        | 262 us       | 1.07x faster | Significant (t=13.59)  |
+----------------------+---------------+--------------+--------------+------------------------+
| pidigits             | 137 ms        | 139 ms       | 1.02x slower | Not significant        |
+----------------------+---------------+--------------+--------------+------------------------+
| pyflate              | 264 ms        | 274 ms       | 1.04x slower | Significant (t=-8.33)  |
+----------------------+---------------+--------------+--------------+------------------------+
| raytrace             | 252 ms        | 211 ms       | 1.19x faster | Significant (t=40.18)  |
+----------------------+---------------+--------------+--------------+------------------------+
| regex_compile        | 102 ms        | 103 ms       | 1.01x slower | Not significant        |
+----------------------+---------------+--------------+--------------+------------------------+
| regex_effbot         | 2.16 ms       | 2.17 ms      | 1.01x slower | Not significant        |
+----------------------+---------------+--------------+--------------+------------------------+
| richards             | 16.0 ms       | 16.0 ms      | 1.00x faster | Not significant        |
+----------------------+---------------+--------------+--------------+------------------------+
| spectral_norm        | 60.8 ms       | 57.3 ms      | 1.06x faster | Significant (t=2.43)   |
+----------------------+---------------+--------------+--------------+------------------------+
| telco                | 6.02 ms       | 5.80 ms      | 1.04x faster | Significant (t=5.54)   |
+----------------------+---------------+--------------+--------------+------------------------+
| unpickle_pure_python | 177 us        | 170 us       | 1.04x faster | Significant (t=5.28)   |
+----------------------+---------------+--------------+--------------+------------------------+
| xml_etree_generate   | 78.5 ms       | 74.4 ms      | 1.06x faster | Significant (t=5.83)   |
+----------------------+---------------+--------------+--------------+------------------------+
| xml_etree_iterparse  | 75.6 ms       | 71.5 ms      | 1.06x faster | Significant (t=4.14)   |
+----------------------+---------------+--------------+--------------+------------------------+
| xml_etree_parse      | 134 ms        | 120 ms       | 1.12x faster | Significant (t=6.42)   |
+----------------------+---------------+--------------+--------------+------------------------+
| xml_etree_process    | 48.7 ms       | 49.4 ms      | 1.02x slower | Not significant        |
+----------------------+---------------+--------------+--------------+------------------------+

Copy link
Copy Markdown
Member

@Fidget-Spinner Fidget-Spinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work. Thanks for doing this!

@Fidget-Spinner
Copy link
Copy Markdown
Member

I also forgot to mention: but we should adjust for the following:

  1. Penalize frame depth underflow more so than normal frame push/pop.
  2. Stop tracing when we hit MAX_ABSTRACT_FRAME_DEPTH, as we can't optimize it anyways.

Copy link
Copy Markdown
Member

@markshannon markshannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, and thanks for doing this.

Overall, the approach looks good.

I originally had in a mind that the fitness would reduce geometrically, not arithmetically, but your arithmetic approach looks easier to reason about and at least as good in terms of trace quality.

We need to reduce the number of configurable parameters to just one or two.
Then we can make sure that fitness and penalties are set such that we are guaranteed not to overflow the trace or optimizer buffers.

I'm not sure that rewinding is worth it. As long as "good" exits have a much higher score than "bad" exits, then we should (almost) always end up at a good exit.

// Check if fitness is depleted — should we stop the trace?
if (ts->fitness < eq &&
!(progress_needed && uop_buffer_length(trace) < CODE_SIZE_NO_PROGRESS)) {
// Prefer stopping at the best recorded exit point
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just stop here. Backing up could give us a better exit, but it might give us a worse trace. And it is more complex to implement.

else {
// No valid best exit — stop at current position
ADD_TO_TRACE(_EXIT_TRACE, 0, 0, target);
uop_buffer_last(trace)->operand1 = true; // is_control_flow
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't count as control flow. It is a terminator, not a branch.

trace->end -= needs_guard_ip;

int space_needed = expansion->nuops + needs_guard_ip + 2 + (!OPCODE_HAS_NO_SAVE_IP(opcode));
if (uop_buffer_remaining_space(trace) < space_needed) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert (uop_buffer_remaining_space(trace) > space_needed);

If we choose the fitness and exit values correctly, we can't run out of space.

_PyJitTracerTranslatorState *ts_depth = &tracer->translator_state;
if (ts_depth->frame_depth <= 0) {
// Underflow
ts_depth->fitness -= (int32_t)tstate->interp->opt_config.fitness_frame_entry * 2;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fundamentally different to making a call, so should have its own distinct (and probably larger) penalty.

// Underflow
ts_depth->fitness -= (int32_t)tstate->interp->opt_config.fitness_frame_entry * 2;
}
ts_depth->frame_depth = ts_depth->frame_depth <= 0 ? 0 : ts_depth->frame_depth - 1;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ts_depth->frame_depth = ts_depth->frame_depth <= 0 ? 0 : ts_depth->frame_depth - 1;
ts_depth->frame_depth = ts_depth->frame_depth <= 0 ? 0 : ts_depth->frame_depth - 1;
ts_depth->fitness -= penalty;

We should increase the fitness on returning, we want to inline calls to small functions, so we shouldn't penalize it.

@bedevere-app
Copy link
Copy Markdown

bedevere-app bot commented Apr 1, 2026

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants