Debugging support in Wasmtime #34
Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>
Co-authored-by: Rainy Sinclair <844493+itsrainy@users.noreply.github.com>
Co-authored-by: bjorn3 <17426603+bjorn3@users.noreply.github.com>
accepted/wasmtime-debugging.md
| features we outline in the [Live Debugging](#live-debugging) section below, the
| DAP supports debugging debuggees that cannot step or set breakpoints (such as
| core dumps) and debuggees that can reverse step (such as time-traveling while
| replaying a recorded trace). Many other debugging protocols do not have these
Both the gdbserver and gdb-mi interfaces support time-traveling. rr makes use of it, and gdb itself has a very basic reverse-stepping implementation too. It also supports core dumps, so debuggees that cannot step or set breakpoints are supported as well.
I'm happy to remove this sentence, though I would again point to @fitzgen's comments about ease of initial implementation for why we shouldn't try to use gdb in our initial implementation.
| implement and verify basic protocol support before moving onto live debugging
| use-cases.
|
| Why the debug adapter protocol? [Like the Language Server Protocol
Gdbserver allows delegating all DWARF handling, expression parsing, conditional breakpoints, scripting, ... to GDB. The debug adapter protocol requires the DAP server (like wasmtime here) to implement this all afaik.
That is true for DWARF and compiled-to-Wasm programs, but in order to provide debugging of interpreted-within-Wasm programs, we need to be in control of all this and can't rely on gdb to do it for us.
Gdb and the debug adapter components would need roughly the same interface to interact with wasmtime, right? (stepping, breakpoints, inspecting wasm locals, ...) Could we export the interface imported by debug adapter components with a builtin adapter as the gdbserver interface? That way you can choose gdb for debugging with DWARF and the right debug adapter for interpreted code.
One complication I see with that is that gdb currently doesn't have a wasm target and thus needs the dwarf translation we currently do, so either we add a wasm target to gdb or do the translation in the gdbserver adapter.
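To make the comparison concrete, here is a hypothetical Rust sketch of the kind of front-end-agnostic interface being discussed, which either a gdbserver translation layer or a DAP adapter could sit on top of. Every name below is invented for illustration; none of these are existing Wasmtime APIs.

```rust
/// Hypothetical low-level debugging interface that both a gdbserver
/// translation layer and a DAP adapter could be layered on top of.
/// None of these names exist in Wasmtime today; this is only a sketch.
pub trait DebugTarget {
    /// Pause execution of the debuggee instance.
    fn pause(&mut self);
    /// Resume execution until the next breakpoint or trap.
    fn resume(&mut self);
    /// Execute a single Wasm instruction, then pause again.
    fn step_instruction(&mut self);
    /// Enable or disable a breakpoint at a (function, Wasm offset) location.
    fn set_breakpoint(&mut self, func_index: u32, wasm_offset: u32, enabled: bool);
    /// Read the raw bytes of a local in the given frame, if known.
    fn read_local(&self, frame: u32, local_index: u32) -> Option<Vec<u8>>;
    /// Read a range of linear memory for inspection.
    fn read_memory(&self, addr: u64, len: usize) -> Option<Vec<u8>>;
}
```

A gdbserver adapter would translate remote-protocol memory and register reads into calls like these, while a DAP adapter would additionally need the DWARF and expression handling discussed above.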
| resumption later on
|
| 3. When an instance hits that breakpoint it can automatically continue if it
| isn’t intended to stop, otherwise it can wait for user input.
Another option is to still use int3, but when the breakpoint is hit, interrupt all other instances that use the same module, swap the int3 away, single step the instance which shouldn't have hit it and then insert the int3 again and resume all other instances. This only slows down execution when a breakpoint is actually hit.
I agree with @bjorn3 here -- this can be eventually implemented as an optimization in a second pass, as I'm sure we at first mostly want to see this whole contraption working.
I'd like to see, though, in the Future Work section, something that allows implementing platform-specific debugging facilities (e.g. using ptrace(2) on Unix-like systems), and leaving this approach as a baseline implementation for those host platforms where we don't yet implement them.
Some host environments will have constraints about modifying executable code, so starting with a strategy that doesn't involve post-compilation code mutation would be ideal for maximizing the portability of debugging.
Another approach I've seen used (e.g., in Firefox's baseline JIT when debugging is enabled) is to, before every instruction, compile an inline branch on a memory location (set when stepping is enabled) that branches around the call to the debug handler function. This ends up being fairly inexpensive when stepping is not active. When stepping is active, you can then optimize the overhead as much as desired by having a separate testable memory location per { instance, function, instruction }.
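For illustration, the guard described above boils down to something like the following Rust equivalent of the emitted check; the flag and hook names are invented for this sketch.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical flag the engine flips when single-stepping is enabled.
static STEPPING_ENABLED: AtomicU32 = AtomicU32::new(0);

/// Rust equivalent of the check the JIT would emit before each instruction:
/// a cheap load-and-branch when stepping is off, a call out when it is on.
#[inline]
fn maybe_debug_hook(wasm_pc: u32) {
    if STEPPING_ENABLED.load(Ordering::Relaxed) != 0 {
        debug_hook(wasm_pc);
    }
}

/// Placeholder for the out-of-line handler that pauses execution and
/// talks to the debugger front end.
#[cold]
fn debug_hook(wasm_pc: u32) {
    // ... suspend this instance, report `wasm_pc`, wait for a resume command ...
    let _ = wasm_pc;
}
```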
As mentioned above, I think we should focus on easy-to-correctly-implement mechanisms for stepping/breakpoints/etc...
In that light, I think all juggling of int3s, swapping them in and out, and trying to determine whether this instance is the target debuggee or not is going to make debugging buggier than it need be.
That said, I don't think literally making a libcall in between every wasm instruction is the right call either. Just below this point in the text, the document refers to "the <check for breakpoint> pseudo opcode". Thinking about this operation as a pseudo opcode, rather than literally a libcall, is the right approach IMO. I think we could, if we wanted, start with literal libcalls in the MVP.
But what I expect sits in a sweet spot with regards to balancing between performance and simplicity is to have a bitmap with a bit for every Wasm instruction in the function. If the ith instruction's bit is set, then there is a breakpoint on that instruction. The code between Wasm instructions simply loads from the bit map and masks to the appropriate bit (lots of reuse potential across each instruction's check available) and only if that bit is set will it call out to the host.
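As a rough sketch of that bitmap (purely illustrative names, not an actual Wasmtime data structure), the host side could look like the following, with the emitted code performing the equivalent of `is_set` inline and only calling out to the host when the bit is set.

```rust
/// Hypothetical per-function breakpoint bitmap: one bit per Wasm
/// instruction in the function, set when a breakpoint is active there.
pub struct BreakpointBitmap {
    bits: Vec<u8>,
}

impl BreakpointBitmap {
    pub fn new(instruction_count: usize) -> Self {
        Self { bits: vec![0; (instruction_count + 7) / 8] }
    }

    /// Set or clear the breakpoint bit for the i-th Wasm instruction.
    pub fn set(&mut self, instr_index: usize, enabled: bool) {
        let (byte, bit) = (instr_index / 8, instr_index % 8);
        if enabled {
            self.bits[byte] |= 1 << bit;
        } else {
            self.bits[byte] &= !(1 << bit);
        }
    }

    /// The code emitted between Wasm instructions would do the equivalent
    /// of this load-and-mask before deciding whether to call the host.
    pub fn is_set(&self, instr_index: usize) -> bool {
        self.bits[instr_index / 8] & (1 << (instr_index % 8)) != 0
    }
}
```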
| Stepping into a function means pausing execution on the first instruction
| _inside_ a function that is about to be called. We can accomplish this by
| setting a temporary breakpoint on that first instruction, but first we need to
You can also use the processor's single stepping support to step until right after the call instruction without setting any breakpoints.
That's true, but it does require us to have a bunch of per-architecture logic. I suppose the same could be said for the instrumentation-based approach, in that the instrumentation will be different snippets of machine instructions for each different architecture, but at least there the semantics of each snippet of instrumentation as a whole are identical and we don't need to worry about subtle differences that may exist between different architectures' single-stepping support.
Ultimately, I personally expect that the instrumentation approach will be easier to implement correctly (I'm not going to stop harping on this point :-p) than relying on hardware single stepping.
I thought PTRACE_SINGLESTEP was universally supported, but it seems that 32-bit ARM doesn't support it architecturally and the emulation support was removed in https://lore.kernel.org/linux-arm-kernel/1297248175-11952-1-git-send-email-will.deacon@arm.com/
so neat to see this happening! great work, everybody 🖤
lpereira left a comment:
I'm really excited for this as I mentioned before! In general I like this as a first step in having debugging support in wasmtime; none of the comments I have are showstoppers.
| function. The call sites (store instructions) would have enough local
| information available to determine if a watchpoint would trigger, and could
| provide us with another implementation path that would avoid modifying the
| signal handler.
Another approach is protecting the memory where the variable is being watched so it's read-only, and handling the fault in user mode, if possible (e.g. userfaultfd(2) on Linux). Maybe for Future Work as a possible optimization?
Yep, we could also use regular signal handlers for this kind of thing, no need for userfaultfd in particular.
But again: I think juggling signal handlers will be more complicated and bug prone than emitting instrumentation directly.
As you say, it is an option for the future if we find that the overhead of the instrumentation is unacceptable.
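For comparison, here is a minimal sketch (with invented names) of the instrumentation-based check a store site could perform instead of relying on memory protection or signal handlers: a simple overlap test against the set of active watchpoints.

```rust
/// Hypothetical description of a watchpoint over a range of linear memory.
#[derive(Clone, Copy)]
pub struct Watchpoint {
    pub start: u64,
    pub len: u64,
}

/// Check the engine would run (or the JIT would inline) before each store:
/// report a hit if the store's address range overlaps any active watchpoint.
pub fn store_hits_watchpoint(watchpoints: &[Watchpoint], addr: u64, size: u64) -> bool {
    watchpoints
        .iter()
        .any(|w| addr < w.start + w.len && w.start < addr + size)
}
```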
| 1. When code is compiled for debugging, for every instruction emitted, emit a
| call to a utility function that can check if execution should be paused:
Another thing that we could possibly do, if we're using DWARF, is to only insert these calls for every source line we encounter. This would still allow for line-by-line debugging while improving performance of the debugged code.
Instruction-level debugging can be quite useful when debugging UB. Maybe it could be an option to enable or disable?
The other thing that this would preclude is switching between source-level and instruction-level debugging on-the-fly during the same debugging session, which is rarely needed but crucial when you do need it.
So I agree that this could be an option (eventually).
But I think we should really focus on making debugging simple and correct. Nothing worse than a buggy debugger that lies to you! So let's focus first on correctness and worry about the performance of debuggable code second. Not to say we shouldn't consider it at all, but as long as we don't box ourselves in architecturally, we should be able to safely set such questions aside for the most part.
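If this ever becomes configurable, it could be as small as a compile-time knob along these lines; the names are hypothetical, not an existing Wasmtime option.

```rust
/// Hypothetical option controlling where the "check for breakpoint"
/// pseudo opcode is emitted when compiling for debugging.
pub enum DebugCheckGranularity {
    /// Emit a check before every Wasm instruction (supports
    /// instruction-level stepping, highest overhead).
    EveryInstruction,
    /// Emit a check only where the DWARF line table starts a new
    /// source line (source-level stepping only, lower overhead).
    EverySourceLine,
}
```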
| Once we have both core dumps and the live debugging capabilities described
| above, we start to open the door to more interesting debugging capabilities. One
| of those is the ability to record a trace in a production setting, and then
| replay it in a debug environment to aid with the quick reproduction of a bug.
| This is a major feature of the [rr](https://rr-project.org/) debugger, which is
| able to record the results of syscalls and deterministically replay program
| execution with that trace to enable easy debugging of otherwise
| tricky-to-reproduce bugs. As WebAssembly execution is already well-sandboxed and
| semi-deterministic, recording the result of calling imports would enable the
| same sort of offline analysis of a production failure.
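As a hedged sketch of what recording import results could look like (the shape and names here are invented for illustration and are not part of Wasmtime):

```rust
use std::collections::VecDeque;

/// Hypothetical trace of import-call results captured during a production run.
#[derive(Default)]
pub struct ImportTrace {
    results: VecDeque<i64>,
}

impl ImportTrace {
    /// Recording mode: call the real host import and remember its result.
    pub fn record(&mut self, real_import: impl Fn(i64) -> i64, arg: i64) -> i64 {
        let result = real_import(arg);
        self.results.push_back(result);
        result
    }

    /// Replay mode: return the recorded result instead of calling the host,
    /// so execution is deterministic and can be debugged offline.
    pub fn replay(&mut self) -> Option<i64> {
        self.results.pop_front()
    }
}
```

In replay mode the engine would substitute `replay()` results for every import call, which is what makes the trace deterministic to step through.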
| Choosing an existing debugger like gdb or lldb would compromise our three
| principles outlined in the [motivation](#motivation) section —
| particularly principle (2) and supporting interpreted debuggees[^possible]
| — and therefore we feel that they are not viable choices.
If someone out there really likes gdb or lldb, I'm pretty sure there's something that bridges DAP to the gdbserver protocol, making it possible for them to use their favorite debugger. (Also worth noting that gdb already supports DAP from the get go, but only as a replacement to the gdbserver or its terminal UI.)
> Also worth noting that gdb already supports DAP from the get go, but only as a replacement to the gdbserver or its terminal UI.

I would expect it to be a replacement only for its terminal UI. Gdbserver provides low-level access to the tracee (read/write memory or registers, set a breakpoint at a given address); gdb on top of that handles debug info, calling functions, stepping whole lines or functions at a time, setting a breakpoint at a given function or line, and so on. Finally, the terminal UI, gdb-mi, or DAP provide an interface through which you interact with gdb.
| Stepping out of a function is a bit easier to implement, as it will require us
| to stop execution at any instruction that could cause the function to return
| (branches with a sufficiently large stack depth, or a return instruction), and
| then single-step out. For tail calls we have multiple options: we could treat
| stepping out as breaking when the next function is entered, or we could break
| once we finally return from the tail callee.
Thinking out loud here; this is not a fully-fleshed out thought.
If this debug_trap() function is implemented like this pseudo-code:

```
def debug_trap(reason):
    mask = global_debug_trap_mask & reason
    if mask & (TRAP_ON_FUNCTION_RETURN | TRAP_ON_INSTRUCTION | ...):
        platform_debug_trap()
```

Then this makes it possible to select the kind of single-stepping we're doing by setting/resetting bits in global_debug_trap_mask without implementing anything specific for stepping into or out of functions. This incurs a bit more overhead than just checking a boolean flag, but allows a bit more flexibility that can help the implementation.
Can even have things like debug_trap(TRAP_ON_END_OF_LOOP) emitted after every loop to skip long, boring loops while single-stepping code. Likewise, debug_trap(TRAP_ON_FUNCTION_PROLOGUE) can be emitted to support stepping into functions and whatnot.
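The same idea expressed in Rust, roughly; the constants and the global mask are made up to mirror the pseudo-code above.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Hypothetical reason bits mirroring the pseudo-code above.
const TRAP_ON_INSTRUCTION: u32 = 1 << 0;
const TRAP_ON_FUNCTION_RETURN: u32 = 1 << 1;
const TRAP_ON_FUNCTION_PROLOGUE: u32 = 1 << 2;
const TRAP_ON_END_OF_LOOP: u32 = 1 << 3;

/// Which trap reasons are currently interesting to the debugger.
static GLOBAL_DEBUG_TRAP_MASK: AtomicU32 = AtomicU32::new(0);

/// Emitted with a constant `reason` at the relevant points in compiled code;
/// it only pauses when that reason is enabled in the global mask.
fn debug_trap(reason: u32) {
    if GLOBAL_DEBUG_TRAP_MASK.load(Ordering::Relaxed) & reason != 0 {
        platform_debug_trap();
    }
}

/// Placeholder for whatever actually suspends the instance and notifies
/// the debugger front end.
fn platform_debug_trap() {}
```

With this shape, "step out" becomes "set TRAP_ON_FUNCTION_RETURN in the mask and resume", with no function-specific bookkeeping.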
| * We will need to either maintain a mapping from locals to register/stack slot
| for all Wasm PC points that we might hit a breakpoint or watchpoint, or force
| winch to unconditionally spill locals to the stack. The latter would greatly
| simplify tracking local locations, while the former would greatly increase the
| amount of context we would need to pass into the utility function for
| inspecting the execution state.
If I get a vote, I'd say spill to the stack every time, but still keep the current behavior for when debugging is not enabled.
At a later date, we can implement some sort of deoptimizer that transform every function into a trampoline function that either calls the non-debugging-instrumented code, or lazily generates a debugging-instrumented code and calls that instead. This way we can reduce the overhead for everything that we're not interested in debugging (e.g. if we're running code normally without hitting any breakpoint, the debug_trap() function should not be emitted; as soon as a breakpoint is set, the deoptimized/debuggable version of that function is emitted and the debug_trap() gets to be called, and functions will be re-generated with debugging instrumentation lazily as we reach them.)
As an alternative to the trampoline (or a transform that essentially does if (!debugging) { original_code; } else { ensure_instrumented_version_exists; call_it_instead; }), a mechanism similar to a PLT could be used instead, but would require all calls to be indirect calls when running winch for debugging. With this mechanism, we'd have a table like so:
```
0  func_0_orig
1  func_1_orig
2  func_2_orig
```
And then emit calls as the target machine equivalent of table[index_constant_for_func_n](arg1, arg2, ...) instead of func_n(arg1, arg2, ...).
If a breakpoint were set on, say, func_2, we'd replace table[2] with func_2_debug after generating the code for it with all the calls to debug_trap() as one would expect. This can be done entirely in the Winch side.
To allow for stepping across functions, the *_orig functions can check if global_debug_trap_mask is not 0 (cheap!), ask Winch to generate an instrumented version, patch the respective table entry, and call that. It doesn't need to be a tail call at this point (although it potentially could to save a stack frame for unwinding reasons), because this will happen only the first time this point is reached; every other time the instrumented version will be called automatically. This of course needs to be done partially in the Winch side, and partially by the generated code.
This seems like a more elegant solution in my opinion, including making it reentrant if the table we patch is part of each Winch state. Suffice to say, this has to be implemented directly as part of the JIT-generated code rather than relying on indirect call mechanisms from Wasm itself.
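A rough sketch of that table-patching mechanism, using ordinary function pointers to stand in for the JIT-level indirection. Everything here is hypothetical; the real mechanism would live in Winch-generated code.

```rust
/// Hypothetical per-module call table: with debugging enabled, every call in
/// Winch-compiled code goes through one of these slots rather than calling
/// its target directly.
pub struct DebugCallTable {
    slots: Vec<fn(i64) -> i64>,
}

impl DebugCallTable {
    /// Initially every slot points at the uninstrumented (*_orig) version.
    pub fn new(originals: Vec<fn(i64) -> i64>) -> Self {
        Self { slots: originals }
    }

    /// Setting a breakpoint on function `index` swaps in a version compiled
    /// with debug_trap() instrumentation; nothing else needs to change.
    pub fn set_breakpoint(&mut self, index: usize, instrumented: fn(i64) -> i64) {
        self.slots[index] = instrumented;
    }

    /// Generated code would perform the equivalent of this indirect call.
    pub fn call(&self, index: usize, arg: i64) -> i64 {
        (self.slots[index])(arg)
    }
}
```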
Yes, the plan is to spill every time. I think this will need to be configurable behavior in winch, but it greatly simplifies knowing where locals live.
I like this plan and agree with it too; I think there's a distinction or principle that might be worth making explicit in the RFC related to it:
One interesting constraint that the debug support in Cranelift traditionally had was that it must not modify the generated code; in other words, the user should be able to witness a strange behavior, attach a debugger, and be confident they are debugging exactly the same code. That constraint filtered through to how regalloc2 handles debug maps (e.g., disallows a "spill everything between instructions" approach) and consequently led to a lot of complexity.
In hindsight I think that was focusing on the wrong thing, because it was implicitly assuming that the sort of bugs the user would be debugging would have to do with Cranelift optimizations interacting with their code. (E.g., if you're gdb'ing a native program, and suspect UB-related issues in your C++, you really do want to debug the exact machine code with the issue.) But what we care about here is the Wasm virtual machine-level behavior and we're free to swap out the implementation.
I bring all this up because I think there might still be folks surprised by an "insert hooks" approach, with as much new plumbing and deoptimization as needed between Wasm instructions, and it seems worthwhile to state what we're preserving (Wasm-level semantics) and why this is OK to do.
For cg_clif it is still important for debug info to not cause a huge perf hit as full debug info is enabled by default. Note that cg_clif currently doesn't support emitting debug info for locals though.
There are some other tradeoffs we should mention around this choice too, I think. One aspect of the opposite end of the spectrum, the "must not modify machine code" approach, is that it can work even on release builds, in a best-effort way (values may be the infamous "<optimized out>"). This is sometimes desirable in large programs that take a while to get to the point of interest; if the continues between breakpoints are compute-intensive, basically. (The few times I've gdb'd Cranelift itself I did it on release builds!)
I think the general question to answer is the tradeoff between performance-under-debug and... all the other metrics of goodness (debug info richness, ease of debugger implementation, etc). We've swung very much to the latter here. How do we address use-cases that need the former?
One answer could be that record-and-replay addresses one common cause for that need: bug occurs very far into program execution. One could think of it as fast-forwarding with optimized code then switching (by loading the R&R snapshot and entering the debugger) to debug-mode code. To be ideally useful, we'd want R&R support for Cranelift-compiled code too. Is that tentatively part of the plan? (Maybe, if most of the logic is in Wasmtime imports/exports?)
Another answer could be on-stack replacement (OSR), and IIRC this is what SpiderMonkey does when a debug breakpoint is hit. But de-opt'ing from Cranelift to Winch code is an extremely hard problem to think about so maybe let's not cross that bridge yet!
Anyway, my opinions/setpoints on the above are: we should mention that perf is sometimes important; we should name record & replay as the answer when it's needed; we should eventually plan to implement R&R support for Cranelift-compiled code, after we do it for Winch.
I wasn't even thinking about deoptimizations between Winch and Cranelift; mostly only within Winch so we could have code without the hooks unless we need them. We'd need to measure the performance impact of having those hooks, though, because it might be OK (and it'll have to be from the get go!).
At a much later date, however, yes, if we're debugging with Cranelift being used as a JIT compiler, it might be worth investigating using Winch as a debugging tier, or even as a tier 0 compiler like other JIT compilers do. (I agree about being able to swap the implementation as long as the wasm semantics are preserved.)
But one thing at a time!
To be clear, my suggestion is not to do that. Rather I am suggesting to describe this tradeoff in the RFC, and point to R&R as a solution for the "fast-forward to the bug" subproblem of the general performance problem.
I tried to capture this with an additional paragraph after these two bulleted points. What do you think?
As the commenting seems to have died down, I'd like to start the process to merge this RFC.

Motion to Finalize
Disposition: Merge

As always, details on the RFC process can be found here: https://github.com/bytecodealliance/rfcs/blob/main/accepted/rfc-process.md#making-a-decision-merge-or-close
My unofficial +1
As there has been signoff from a different stakeholder organization, this RFC is entering its 10-day Final Comment Period, and the last day to raise objections before this can merge is 2024-05-24.
As no objections have been raised during the final comment period, I'm going to merge this RFC. Thanks everyone!

Rendered
Co-authored-by: Nick Fitzgerald - @fitzgen
Co-authored-by: Rainy Sinclair - @itsrainy