Pulley: a portable, optimizing interpreter for Wasmtime#35
Pulley: a portable, optimizing interpreter for Wasmtime#35fitzgen merged 2 commits intobytecodealliance:mainfrom
Conversation
|
Since no one else has commented on this I'll try to kick off discussion: I very much appreciate the explicit focus on good performance within the scope of a portable interpreter. It's non-obvious at first why this should be -- usually interpreters are seen as the lowest tier, a low-latency way to begin execution quickly -- but here explicitly we're focused on portability and then performance within that scope. IIRC from earlier discussions with you as well, the core Pulley interpreter is also Wasmtime-agnostic, yes? In other words, accesses to runtime data structures are lowered to raw loads and stores at the bytecode level; it really is close to the abstraction level that generated machine code would run at, rather than a "wasm load" opcode or somesuch that needs to perform dynamic checks; this is a strategy to benefit from the same kinds of codegen optimization (EDIT: e.g., bounds-checking cases) that we've done already for actual compiled code. This is another non-obvious but very fruitful design choice, I think. Perhaps worth adding as a benefit is that, as this backend exercises |
|
I also agree that emphasis on portability here makes the most sense in the context of Wasmtime. A major use case for a feature like this will be to take Wasmtime where Cranelift can't go which motivates both the portability and performance parts. Other than that though I don't have too much to add, just wanted to say 👍 |
Yes, this is correct. And as alluded to, I have a WIP branch that sketches out the new backend, bytecode format, and Wasmtime integration. It isn't quite ready for sharing more widely, very messy and lots of TODO comments, but it can run some very simple Wasm programs already. I also don't want to focus too much on this WIP implementation though; I want to focus on the RFC and its proposal and whether we want that shape of thing and its trade offs or not, rather than coloring those decisitons too much by what is or is not implemented. FWIW, the low-levelness of the bytecode also meant that implementing all control flow was pretty straightforward. Didn't have to implement all of if/else/loop/br/br_if/etc... Just implemented conditional and unconditional jumps and Cranelift took care of the rest. I think that is another really nice benefit of this proposed approach. (I haven't implemented br_table yet, however, which will require some more work on the part of the backend.) |
|
One question I thought of while rereading this: what about tiering? I thing @fitzgen made clear the portability benefits, but people may think, "oh, we can avoid some startup cost due to JIT compilation!" But the RFC doesn't sketch out this part of the story beyond the semi-related "no fast startup" time paragraph. What should the story be here: non-goal? open to eventual tiering? |
|
We talked a bit in the Wasmtime meeting today about this but wanted to record here as well: no this isn't intended to eventually be used for tiering compilation. Tiered compilation is a pretty major feature and would be separate from this, so I think it's safe to say it's a non-goal of this RFC. |
|
I have an idea for a way to make a fast interpreter, loosely inspired by the Chicken compiler: #[repr(usize)]
pub enum InsnResult {
Done,
UnwindStackAndContinue,
}
pub type InsnFn = unsafe fn(
pc: *const u8, // program counter
state: &mut State,
// most ABIs allow passing some arguments in registers, exploit that for fast temporaries
fast_reg0: usize,
fast_reg1: usize,
) -> InsnResult;
macro_rules! decl_opcodes {
(
$vis:vis enum $Opcode:ident {}
#[fn($pc:ident, $state:ident, $fast_reg0: ident, $fast_reg1: ident)]
match _ {
$(Self::$Variant:ident {
$(#[type = $field_ty:ty]
$field:ident,)*
} => $body:block)*
}
) => {
#[derive(Copy, Clone, Eq, PartialEq, Hash, Debug)]
#[repr(u8)]
$vis enum $Opcode {
$($Variant,)*
}
$(
#[derive(Copy, Clone, Eq, PartialEq, Hash, Debug)]
#[repr(packed)]
$vis struct $Variant {
$($vis $field: $field_ty,)*
}
impl $Variant {
$vis unsafe fn run(
mut $pc: *const u8,
$state: &mut State,
$fast_reg0: usize,
$fast_reg1: usize,
) -> InsnResult {
let Self {
$($field,)*
} = unsafe {
let insn_ptr = $pc.cast::<Opcode>().add(1).cast::<Self>();
$pc = insn_ptr.add(1).cast();
*insn_ptr
};
$body
}
}
)*
impl $Opcode {
$vis const DISPATCH: &[InsnFn] = &[
$($Variant::run,)*
];
#[inline(always)]
$vis unsafe fn run(
$pc: *const u8,
$state: &mut State,
$fast_reg0: usize,
$fast_reg1: usize,
) -> InsnResult {
unsafe {
Self::DISPATCH.get_unchecked(*$pc as usize)($pc, $state, $fast_reg0, $fast_reg1)
}
}
}
};
}
pub unsafe fn interpret(pc: *const u8, state: &mut State) {
loop {
state.depth = 0;
let fast_reg0 = state.fast_reg0;
let fast_reg1 = state.fast_reg1;
match Opcode::run(pc, state, fast_reg0, fast_reg1) {
InsnResult::UnwindStackAndContinue => continue,
InsnResult::Done => return,
}
}
}
opcodes! {
pub enum Opcode {}
#[fn(pc, state, fast_reg0, fast_reg1)]
match _ {
Self::UnwindIfTooDeep {
#[type = u16]
inc,
} => {
state.depth += inc as u32; // account for all intervening instructions too
if state.depth > 50000 {
state.pc = pc;
state.fast_reg0 = fast_reg0;
state.fast_reg1 = fast_reg1;
return InsnResult::UnwindStackAndContinue;
}
Opcode::run(pc, state, fast_reg0, fast_reg1)
}
Self::Done => {
state.pc = pc;
state.fast_reg0 = fast_reg0;
state.fast_reg1 = fast_reg1;
return InsnResult::Done;
}
Self::BrIfF0IsZ {
#[type = i32]
offset,
} => {
if fast_reg0 == 0 {
unsafe {
pc = pc.offset(offset);
}
}
Opcode::run(pc, state, fast_reg0, fast_reg1)
}
Self::BrIfF1IsZ {
#[type = i32]
offset,
} => {
if fast_reg1 == 0 {
unsafe {
pc = pc.offset(offset);
}
}
Opcode::run(pc, state, fast_reg0, fast_reg1)
}
Self::AddF0F0F1 {} => {
let fast_reg0 = fast_reg0.wrapping_add(fast_reg1);
Opcode::run(pc, state, fast_reg0, fast_reg1)
}
// add more instructions here
}
} |
|
@programmerjake I suspect the failure mode of that optimization -- a periodic giant pause where 50k calls are unwound -- might be a "medicine worse than the disease" so to speak; especially when designing a portable interpreter, we should assume that optimizations like "maybe tail call, no guarantees" will be brittle or fail on some platforms, so we should have a solid no-frills standard interpreter loop design, until and unless Rust itself provides us better primitives for this. That said I also suspect that the high order bit for discussion here is whether we want to adopt an interpreter approach for portability, and whether the Cranelift compilation to bytecode approach is desirable; we'll have plenty of time to bikeshed interpreter design details later :-) |
on platforms where |
Just pushed a commit adding this as an explicit non-goal. Doesn't necessarily mean that Wasmtime will never do that sort of thing, but it definitely isn't a goal for this work. |
|
@programmerjake thanks for taking a look at the RFC and brainstorming interpreter design! That kind of bounded recursion is indeed a cool technique; I've done similar things in the past for recursive marking in GC implementations before. However, I'd prefer (at least initially) to start with the basic We can, of course, evaluate proposed changes to improve perf once we have a thing that generally works and is not vaporware -- if the speedup-to-complexity ratio is high enough it may indeed make sense! -- but I'd prefer to start with getting something out the door that folks can start playing with first :) |
|
@programmerjake oh also, it might be fun sending a PR to https://github.com/tipo159/rust-instruction-dispatch to add this bounded-recursion approach and see how it stacks up with the existing implementation strategies they compare between. If you end up doing this, let me know, I'd certainly be interested in seeing the results! |
|
@fitzgen I made a PR here: tipo159/rust-instruction-dispatch#7 |
Awesome! Did you happen to run their benchmarks and compare this approach's performance to the others? |
|
As there seems to be pretty broad consensus and support for pursuing Pulley in the discussion here, I'd like to start the process of merging this RFC! Motion to FinalizeDisposition: Merge As always, details on the RFC process can be found here: https://github.com/bytecodealliance/rfcs/blob/main/accepted/rfc-process.md#making-a-decision-merge-or-close |
not yet, i ended up staying up waay too late, also I didn't integrate the longjmp unwinding which i think will be helpful since forcing it to use calls instead of tail calls (the |
|
As there has been signoff from two different stakeholder organizations, this RFC is entering its 10 day Final Comment Periodand the last day to raise objections before this can merge is 2024-07-11. Thanks everyone! |
|
Since no objections were raised during the final comment period, I'm going to go ahead and merge this RFC. Thanks everyone! |
This commit is the first step towards implementing bytecodealliance/rfcs#35 This commit introduces the `pulley-interpreter` crate which contains the Pulley bytecode definition, encoder, decoder, disassembler, and interpreter. This is still very much a work in progress! It is expected that we will tweak encodings and bytecode definitions, that we will overhaul the interpreter (to, for example, optionally support the unstable Rust `explicit_tail_calls` feature), and otherwise make large changes. This is just a starting point to get the ball rolling. Subsequent commits and pull requests will do things like add the Cranelift backend to produce Pulley bytecode from Wasm as well as the runtime integration to run the Pulley interpreter inside Wasmtime.
* Introduce the `pulley-interpreter` crate This commit is the first step towards implementing bytecodealliance/rfcs#35 This commit introduces the `pulley-interpreter` crate which contains the Pulley bytecode definition, encoder, decoder, disassembler, and interpreter. This is still very much a work in progress! It is expected that we will tweak encodings and bytecode definitions, that we will overhaul the interpreter (to, for example, optionally support the unstable Rust `explicit_tail_calls` feature), and otherwise make large changes. This is just a starting point to get the ball rolling. Subsequent commits and pull requests will do things like add the Cranelift backend to produce Pulley bytecode from Wasm as well as the runtime integration to run the Pulley interpreter inside Wasmtime. * remove stray fn main * Add small tests for special x registers * Remove now-unused import * always generate 0 pc rel offsets in arbitrary * Add doc_auto_cfg feature for docs.rs * enable all optional features for docs.rs * Consolidate `BytecodeStream::{advance,get1,get2,...}` into `BytecodeStream::read` * fix fuzz targets build * inherit workspace lints in pulley's fuzz crate * Merge fuzz targets into one target; fix a couple small fuzz bugs
* Introduce the `pulley-interpreter` crate This commit is the first step towards implementing bytecodealliance/rfcs#35 This commit introduces the `pulley-interpreter` crate which contains the Pulley bytecode definition, encoder, decoder, disassembler, and interpreter. This is still very much a work in progress! It is expected that we will tweak encodings and bytecode definitions, that we will overhaul the interpreter (to, for example, optionally support the unstable Rust `explicit_tail_calls` feature), and otherwise make large changes. This is just a starting point to get the ball rolling. Subsequent commits and pull requests will do things like add the Cranelift backend to produce Pulley bytecode from Wasm as well as the runtime integration to run the Pulley interpreter inside Wasmtime. * remove stray fn main * Add small tests for special x registers * Remove now-unused import * always generate 0 pc rel offsets in arbitrary * Add doc_auto_cfg feature for docs.rs * enable all optional features for docs.rs * Consolidate `BytecodeStream::{advance,get1,get2,...}` into `BytecodeStream::read` * fix fuzz targets build * inherit workspace lints in pulley's fuzz crate * Merge fuzz targets into one target; fix a couple small fuzz bugs * Add Pulley to our cargo vet config * Add pulley as a crate to publish
* Introduce the `pulley-interpreter` crate This commit is the first step towards implementing bytecodealliance/rfcs#35 This commit introduces the `pulley-interpreter` crate which contains the Pulley bytecode definition, encoder, decoder, disassembler, and interpreter. This is still very much a work in progress! It is expected that we will tweak encodings and bytecode definitions, that we will overhaul the interpreter (to, for example, optionally support the unstable Rust `explicit_tail_calls` feature), and otherwise make large changes. This is just a starting point to get the ball rolling. Subsequent commits and pull requests will do things like add the Cranelift backend to produce Pulley bytecode from Wasm as well as the runtime integration to run the Pulley interpreter inside Wasmtime. * remove stray fn main * Add small tests for special x registers * Remove now-unused import * always generate 0 pc rel offsets in arbitrary * Add doc_auto_cfg feature for docs.rs * enable all optional features for docs.rs * Consolidate `BytecodeStream::{advance,get1,get2,...}` into `BytecodeStream::read` * fix fuzz targets build * inherit workspace lints in pulley's fuzz crate * Merge fuzz targets into one target; fix a couple small fuzz bugs * Add Pulley to our cargo vet config * Add pulley as a crate to publish * Move Pulley fuzz target into top level fuzz directory
|
Just an FYI for folks here who might not have seen, but the first Pulley-related PR, introducing the skeleton of the interpreter, the bytecode format, encoder, decoder, and disassembler just landed: bytecodealliance/wasmtime#9008 (Still very much a WIP!) More stuff incoming soon. Won't cross-post everything over here though, just this initial message. |
Introduce Pulley — a portable, optimizing interpreter — to Wasmtime.
Rendered RFC