gh-127958: Trace from RESUME in the JIT #145905
Fidget-Spinner wants to merge 42 commits into python:main
Conversation
This reverts commit 6b65f76.
Do you know why the comprehensions benchmark is so much slower, and do you have a link to the x86 results? The code looks good, but I'm a bit concerned about our increasingly ad-hoc approach to when to stop tracing.
Updated the original description with the x86 results.
Ok
markshannon left a comment
The comprehensions benchmark is so much slower because RESUME_CHECK_JIT is that much slower than RESUME_CHECK, last I checked.
I think that will need to be addressed before we can merge this. I have one suggestion that might help.
| continue
| executors.append(exec)
| seen.add(id(exec))
| # Add the side exit executors as well.
I'm a bit surprised this doesn't break other tests.
Maybe change the signature of this function to get_all_executors(func, side_exits=False) and only get the side exits for the tests that need them?
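The suggested `side_exits=False` flag can be sketched with stand-in objects — `Executor`, its `side_exit_executors` attribute, and the list-of-executors argument are all invented here for illustration, not CPython's real test helpers:

```python
class Executor:
    """Stand-in for a Tier 2 executor with optional side-exit executors."""
    def __init__(self, side_exit_executors=()):
        # Hypothetical attribute name, purely for this sketch.
        self.side_exit_executors = list(side_exit_executors)

def get_all_executors(executors, side_exits=False):
    """Collect executors, deduplicated by identity.

    Side-exit executors are only walked when side_exits=True, so tests
    that don't ask for them keep their old behavior.
    """
    seen = set()
    result = []
    stack = list(executors)
    while stack:
        exec_ = stack.pop()
        if id(exec_) in seen:
            continue
        result.append(exec_)
        seen.add(id(exec_))
        if side_exits:
            stack.extend(exec_.side_exit_executors)
    return result
```

With the flag defaulting to False, only the tests that opt in see the extra side-exit executors.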
| return testfunc(x-1)
| sys.setrecursionlimit(TIER2_RESUME_THRESHOLD * 2)
| testfunc(TIER2_RESUME_THRESHOLD)
Suggested change:
| testfunc(TIER2_RESUME_THRESHOLD)
| for _ in range((TIER2_RESUME_THRESHOLD + 99)//100):
|     testfunc(101)
Recursing too deep might have some odd interactions with stack chunking, causing a stack chunk overflow to deopt at the wrong point.
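The suggested loop count uses ceiling division so that many shallow recursions still cross the threshold. A quick check of the arithmetic (the threshold value here is an assumed placeholder, not the real configured one):

```python
import math

TIER2_RESUME_THRESHOLD = 4096  # assumed value, purely for illustration

# (T + 99) // 100 is ceiling division by 100: the smallest loop count
# such that `loops` calls of testfunc(101), each ~101 frames deep,
# execute RESUME at least TIER2_RESUME_THRESHOLD times in total.
loops = (TIER2_RESUME_THRESHOLD + 99) // 100
```

Each iteration only recurses ~101 frames, which stays well inside one stack chunk.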
| self.assertEqual(opnames[:load_attr_top].count("_GUARD_TYPE_VERSION"), 1)
| self.assertEqual(opnames[call:load_attr_bottom].count("_CHECK_VALIDITY"), 2)
| if ex is not None:
| opnames = list(iter_opnames(ex))
Why is the assert no longer valid?
| start_line, end_line, start_col, end_col = pos
| if i == 0 and start_col == end_col == 0:
| # ignore the RESUME in the beginning
| if i == 0 or i == 1:
Suggested change:
| if i == 0 or i == 1:
| if i <= 1:
| long resume_threshold = interp->opt_config.resume_initial_value + 2;
| if (PyModule_Add(module, "TIER2_RESUME_THRESHOLD",
| // + 1 more due to one loop spent on tracing.
The +2 above and this comment don't match up
| PAUSE_ADAPTIVE_COUNTER(this_instr[1].counter);
| }
| DISPATCH_GOTO();
| DISPATCH_GOTO_NON_TRACING();
Why make this change? They should be equivalent at this point, but DISPATCH_GOTO is generally faster.
| #if _Py_TIER2
| // 0 for success, -1 for error.
| static int
| static Py_NO_INLINE int
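The hunk above marks the helper Py_NO_INLINE so this cold path is not inlined into the hot interpreter loop. A minimal sketch of the idiom — the NO_INLINE macro and cold_setup function here are stand-ins, not CPython's definitions:

```c
#include <assert.h>

/* Stand-in for Py_NO_INLINE: keep a rarely-taken helper out of the
 * caller's instruction stream so the hot path stays small. */
#if defined(_MSC_VER)
#  define NO_INLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
#  define NO_INLINE __attribute__((noinline))
#else
#  define NO_INLINE
#endif

/* 0 for success, -1 for error -- mirroring the comment in the hunk. */
static NO_INLINE int
cold_setup(int ok)
{
    return ok ? 0 : -1;
}
```

The attribute changes code layout only; the function's behavior is identical with or without it.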
| // We want to trace over RESUME traces. Otherwise, functions with lots of RESUME
| // end up with many fragmented traces which perform badly.
| // See for example, the richards benchmark in pyperformance.
I don't think this comment belongs here. This function is for translating a single bytecode; it is not concerned with trace formation.
| void
| _Py_Specialize_Resume(_Py_CODEUNIT *instr, PyThreadState *tstate)
| {
| if (tstate->tracing == 0 && instr->op.code == RESUME) {
Why are we excluding (sys.monitoring) tracing functions? Those are likely to be hot.
| if (!IS_JIT_TRACING() && backoff_counter_triggers(counter) &&
| this_instr->op.code == JUMP_BACKWARD_JIT &&
| if (!IS_JIT_TRACING() &&
| (backoff_counter_triggers(counter) &&
!IS_JIT_TRACING() is mostly true and backoff_counter_triggers(counter) is mostly false, so this would be faster as backoff_counter_triggers(counter) && !IS_JIT_TRACING() && ...
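The reordering relies on &&'s short-circuit evaluation: with the cheap, mostly-false test first, the remaining operands are rarely evaluated at all. A self-contained illustration — the function names are invented stand-ins for the real predicates:

```c
#include <assert.h>

static int trigger_evals = 0;
static int tracing_evals = 0;

/* Mostly false, like backoff_counter_triggers(counter). */
static int mostly_false_trigger(void) { trigger_evals++; return 0; }
/* Mostly true, like !IS_JIT_TRACING(). */
static int mostly_true_not_tracing(void) { tracing_evals++; return 1; }

static void demo(void)
{
    for (int i = 0; i < 1000; i++) {
        /* Mostly-false operand first: the && short-circuits, so the
         * second predicate is never even called on the common path. */
        if (mostly_false_trigger() && mostly_true_not_tracing()) {
            /* cold path: start tracing */
        }
    }
}
```

With the operands in the other order, both predicates would run on every iteration.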
|
This adds a CACHE entry to RESUME and a RESUME_CHECK_JIT opcode.
Performance numbers are rather promising:
telco shows a 16% improvement!!!
Mean +- std dev: [telco_noresume] 169 ms +- 5 ms -> [telco_resume] 145 ms +- 8 ms: 1.16x faster
I'm currently running the full pyperformance on macOS and Linux.
0% faster on Linux x86_64, 3.7% faster on macOS AArch64.
Trace from RESUME in the JIT #127958
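The "1.16x faster" figure is simply the ratio of the two benchmark means; pyperf derives it from the unrounded data, but the rounded means quoted above give nearly the same number:

```python
# Means quoted in the telco benchmark line, in milliseconds.
noresume_ms = 169
resume_ms = 145

# ~1.1655: the JIT-traced build finishes in ~86% of the baseline time.
speedup = noresume_ms / resume_ms
```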