gh-127958: Trace from RESUME in the JIT#145905
gh-127958: Trace from RESUME in the JIT#145905Fidget-Spinner wants to merge 47 commits intopython:mainfrom
Conversation
This reverts commit 6b65f76.
|
Do you know why the comprehensions benchmark is so much slower, and do you have a link to the x86 results? The code looks good, but I'm a bit concerned about our increasingly ad-hoc approach to when to stop tracing. |
Updated the original description with the x86 results.
Ok |
markshannon
left a comment
There was a problem hiding this comment.
The comprehensions benchmark is so much slower due to RESUME_CHECK_JIT being that much slower than RESUME_CHECK last I checked.
I think that will need to be addressed before we can merge this. I've one suggestion that might help.
Lib/test/test_capi/test_opt.py
Outdated
| continue | ||
| executors.append(exec) | ||
| seen.add(id(exec)) | ||
| # Add the side exit executors as well. |
There was a problem hiding this comment.
I'm a bit surprised this doesn't break other tests.
Maybe change the signature of this function to get_all_executors(func, side_exits=False) and only get the side exits for the tests that need them?
There was a problem hiding this comment.
I think this was leftover from an old commit, so I'm removing this.
Lib/test/test_capi/test_opt.py
Outdated
| return testfunc(x-1) | ||
|
|
||
| sys.setrecursionlimit(TIER2_RESUME_THRESHOLD * 2) | ||
| testfunc(TIER2_RESUME_THRESHOLD) |
There was a problem hiding this comment.
| testfunc(TIER2_RESUME_THRESHOLD) | |
| for _ in range((TIER2_RESUME_THRESHOLD + 99)//100) | |
| testfunc(101) |
Recursing too deep might have some odd interactions with stack chunking, causing a stack chunk overflow to deopt at the wrong point.
Lib/test/test_capi/test_opt.py
Outdated
| self.assertEqual(opnames[:load_attr_top].count("_GUARD_TYPE_VERSION"), 1) | ||
| self.assertEqual(opnames[call:load_attr_bottom].count("_CHECK_VALIDITY"), 2) | ||
| if ex is not None: | ||
| opnames = list(iter_opnames(ex)) |
There was a problem hiding this comment.
Why is the assert no longer valid?
There was a problem hiding this comment.
The test tests for invalidation. From my understanding: the test file may invalidate executors at different times now (due to presence of more RESUME executors, or other reasons, I'm not too clear). So it's possible for there to be no more executors after invalidation in the original test.
I think the proper fix is instead to lower the numbers for the loop in this specific test so that it completes immediately rather than running for awhile.
Lib/test/test_compile.py
Outdated
| start_line, end_line, start_col, end_col = pos | ||
| if i == 0 and start_col == end_col == 0: | ||
| # ignore the RESUME in the beginning | ||
| if i == 0 or i == 1: |
There was a problem hiding this comment.
| if i == 0 or i == 1: | |
| if i <= 1: |
Modules/_testinternalcapi.c
Outdated
|
|
||
| long resume_threshold = interp->opt_config.resume_initial_value + 2; | ||
| if (PyModule_Add(module, "TIER2_RESUME_THRESHOLD", | ||
| // + 1 more due to one loop spent on tracing. |
There was a problem hiding this comment.
The +2 above and this comment don't match up
Python/bytecodes.c
Outdated
| PAUSE_ADAPTIVE_COUNTER(this_instr[1].counter); | ||
| } | ||
| DISPATCH_GOTO(); | ||
| DISPATCH_GOTO_NON_TRACING(); |
There was a problem hiding this comment.
Why make this change? They should be equivalent at this point, but DISPATCH_GOTO is generally faster.
There was a problem hiding this comment.
woops leftover from old commits.
Python/ceval.c
Outdated
| #if _Py_TIER2 | ||
| // 0 for success, -1 for error. | ||
| static int | ||
| static Py_NO_INLINE int |
There was a problem hiding this comment.
i'll just remove it, but I'm pretty sure it should be Py_NO_INLINE.
Python/optimizer.c
Outdated
|
|
||
| // We want to trace over RESUME traces. Otherwise, functions with lots of RESUME | ||
| // end up with many fragmented traces which perform badly. | ||
| // See for example, the richards benchmark in pyperformance. |
There was a problem hiding this comment.
I don't think this comment belongs here.
This function is for translating a single bytecode; it is not concerned with trace formation
| void | ||
| _Py_Specialize_Resume(_Py_CODEUNIT *instr, PyThreadState *tstate) | ||
| { | ||
| if (tstate->tracing == 0 && instr->op.code == RESUME) { |
There was a problem hiding this comment.
Why are we excluding (sys.monitoring) tracing functions? Those are likely to be hot.
There was a problem hiding this comment.
This code is taken from the original bytecodes.c https://github.com/python/cpython/blob/main/Python/bytecodes.c#L176
Python/bytecodes.c
Outdated
| if (!IS_JIT_TRACING() && backoff_counter_triggers(counter) && | ||
| this_instr->op.code == JUMP_BACKWARD_JIT && | ||
| if (!IS_JIT_TRACING() && | ||
| (backoff_counter_triggers(counter) && |
There was a problem hiding this comment.
!IS_JIT_TRACING() is mostly true and backoff_counter_triggers(counter) is mostly false, so this would be faster as backoff_counter_triggers(counter) && !IS_JIT_TRACING() && ...
|
When you're done making the requested changes, leave the comment: |
From my investigations, this is one of the strangest slowdowns I have found so far: Leaving RESUME_CHECK_JIT in the interpreter loop but only allowing specialization to RESUME_CHECK restores the original performance. I was bout to think this is some compiler/CPU magic, but it does show up on AArch64 and on GCC/Clang as well. Luckily, I remembered RESUME is part of generator static magic, and I finally fixed the slowdown in this commit 5ccf8e9 |
|
I have made the requested changes; please review again. |
|
Thanks for making the requested changes! @markshannon: please review the changes made to this pull request. |
This adds a CACHE entry to RESUME and a RESUME_CHECK_JIT opcode.
Performance numbers are rather promising:
RESUMEin the JIT #127958