Pulley: a portable, optimizing interpreter for Wasmtime by fitzgen · Pull Request #35 · bytecodealliance/rfcs

fitzgen · 2024-06-12T23:13:03Z

Introduce Pulley — a portable, optimizing interpreter — to Wasmtime.

Rendered RFC

cfallin · 2024-06-14T19:41:36Z

Since no one else has commented on this I'll try to kick off discussion: I very much appreciate the explicit focus on good performance within the scope of a portable interpreter. It's non-obvious at first why this should be -- usually interpreters are seen as the lowest tier, a low-latency way to begin execution quickly -- but here explicitly we're focused on portability and then performance within that scope.

IIRC from earlier discussions with you as well, the core Pulley interpreter is also Wasmtime-agnostic, yes? In other words, accesses to runtime data structures are lowered to raw loads and stores at the bytecode level; it really is close to the abstraction level that generated machine code would run at, rather than a "wasm load" opcode or somesuch that needs to perform dynamic checks; this is a strategy to benefit from the same kinds of codegen optimization (EDIT: e.g., bounds-checking cases) that we've done already for actual compiled code. This is another non-obvious but very fruitful design choice, I think.

Perhaps worth adding as a benefit is that, as this backend exercises cranelift-wasm and associated lowering of JIT-side accesses to runtime data structures, and otherwise behaves very similarly to generated machine code, it might serve as a nice debugging tool for engine developers. That's a niche audience but if, for example, we can more easily see, step through, and debug what is happening with a GC fastpath without having to fall back to rr and stepping through assembly, that might be a win. Likewise for testing one could imagine checking raw loads/store opcodes as in-range for known VM data structures (and perhaps carry through metadata to signal intent?); a sort of "built-in ASan/Valgrind" done more easily than in actual JIT code.
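
To make the "built-in ASan/Valgrind" idea above concrete, here is a minimal sketch of what a checking build of the interpreter could do for a raw load. Everything in it is an assumption for illustration: `KnownRegions`, `checked_load_u32`, and the modeling of addresses as offsets into a byte slice are hypothetical and are not taken from the RFC or from Pulley.

```rust
/// A half-open address range `[start, end)` that the interpreter may touch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Region {
    pub start: usize,
    pub end: usize,
}

/// Hypothetical debug-mode registry of known VM data structures.
#[derive(Default)]
pub struct KnownRegions {
    regions: Vec<Region>,
}

impl KnownRegions {
    /// Record that `[start, start + len)` belongs to a known data structure.
    pub fn register(&mut self, start: usize, len: usize) {
        self.regions.push(Region { start, end: start + len });
    }

    /// Returns true if `[addr, addr + size)` lies entirely inside one region.
    pub fn contains(&self, addr: usize, size: usize) -> bool {
        match addr.checked_add(size) {
            Some(end) => self
                .regions
                .iter()
                .any(|r| r.start <= addr && end <= r.end),
            None => false,
        }
    }
}

/// What a raw 32-bit load handler could do in a checking build.
/// Assumption: addresses are modeled as offsets into `memory`.
pub fn checked_load_u32(
    known: &KnownRegions,
    memory: &[u8],
    addr: usize,
) -> Result<u32, String> {
    if !known.contains(addr, 4) {
        return Err(format!(
            "4-byte load at {:#x} is outside every known region",
            addr
        ));
    }
    match memory.get(addr..addr + 4) {
        Some(b) => Ok(u32::from_le_bytes([b[0], b[1], b[2], b[3]])),
        None => Err(format!("address {:#x} is out of bounds", addr)),
    }
}
```

A real implementation would work with raw pointers and might carry intent metadata per opcode, as suggested above; the sketch only shows the range check itself.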

alexcrichton · 2024-06-17T16:28:37Z

I also agree that the emphasis on portability here makes the most sense in the context of Wasmtime. A major use case for a feature like this will be to take Wasmtime where Cranelift can't go, which motivates both the portability and performance parts.

Other than that though I don't have too much to add, just wanted to say 👍

fitzgen · 2024-06-17T17:58:04Z

> IIRC from earlier discussions with you as well, the core Pulley interpreter is also Wasmtime-agnostic, yes? In other words, accesses to runtime data structures are lowered to raw loads and stores at the bytecode level; it really is close to the abstraction level that generated machine code would run at, rather than a "wasm load" opcode or somesuch that needs to perform dynamic checks; this is a strategy to benefit from the same kinds of codegen optimization (EDIT: e.g., bounds-checking cases) that we've done already for actual compiled code. This is another non-obvious but very fruitful design choice, I think.

Yes, this is correct.

And as alluded to, I have a WIP branch that sketches out the new backend, bytecode format, and Wasmtime integration. It isn't quite ready for sharing more widely (it is very messy and has lots of TODO comments), but it can already run some very simple Wasm programs. I also don't want to focus too much on this WIP implementation, though; I want to focus on the RFC and its proposal, and on whether we want that shape of thing and its trade-offs or not, rather than coloring those decisions too much by what is or is not implemented.

FWIW, the low-levelness of the bytecode also meant that implementing all control flow was pretty straightforward. Didn't have to implement all of if/else/loop/br/br_if/etc... Just implemented conditional and unconditional jumps and Cranelift took care of the rest. I think that is another really nice benefit of this proposed approach. (I haven't implemented br_table yet, however, which will require some more work on the part of the backend.)
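
As a rough illustration of that point, the sketch below runs a counting loop on a bytecode that has no `loop`, `if`, or `br` opcodes, only a conditional and an unconditional jump, using a plain `loop { match }` interpreter. The opcode set, register layout, and the `run` and `sum_to` helpers are all hypothetical and are not Pulley's actual bytecode; in the proposed design the lowering to jumps would be done by Cranelift rather than written by hand.

```rust
/// Hypothetical low-level ops: structured control flow is already gone,
/// only jumps remain. This is not Pulley's real instruction set.
#[derive(Clone, Copy, Debug)]
pub enum Op {
    /// regs[dst] = value
    Const { dst: usize, value: i64 },
    /// regs[dst] = regs[a] + regs[b]
    Add { dst: usize, a: usize, b: usize },
    /// Unconditional jump to an absolute instruction index.
    Jump { target: usize },
    /// Jump to `target` if regs[cond] == 0, else fall through.
    BrIfZero { cond: usize, target: usize },
    /// Stop and return regs[src].
    Ret { src: usize },
}

/// A basic `loop { match }` interpreter over the ops above.
pub fn run(program: &[Op]) -> i64 {
    let mut regs = [0i64; 4];
    let mut pc = 0;
    loop {
        match program[pc] {
            Op::Const { dst, value } => {
                regs[dst] = value;
                pc += 1;
            }
            Op::Add { dst, a, b } => {
                regs[dst] = regs[a].wrapping_add(regs[b]);
                pc += 1;
            }
            Op::Jump { target } => pc = target,
            Op::BrIfZero { cond, target } => {
                pc = if regs[cond] == 0 { target } else { pc + 1 };
            }
            Op::Ret { src } => return regs[src],
        }
    }
}

/// Hand-lowered form of: sum = 0; i = n; while i != 0 { sum += i; i -= 1 }
/// Assumes n >= 0, otherwise the loop never reaches zero.
pub fn sum_to(n: i64) -> Vec<Op> {
    vec![
        Op::Const { dst: 0, value: 0 },      // 0: sum = 0
        Op::Const { dst: 1, value: n },      // 1: i = n
        Op::Const { dst: 2, value: -1 },     // 2: step = -1
        Op::BrIfZero { cond: 1, target: 7 }, // 3: loop header, exit when i == 0
        Op::Add { dst: 0, a: 0, b: 1 },      // 4: sum += i
        Op::Add { dst: 1, a: 1, b: 2 },      // 5: i -= 1
        Op::Jump { target: 3 },              // 6: back edge
        Op::Ret { src: 0 },                  // 7: return sum
    ]
}
```

For example, `run(&sum_to(5))` evaluates to 15.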

abrown · 2024-06-19T15:07:29Z

One question I thought of while rereading this: what about tiering? I think @fitzgen made the portability benefits clear, but people may think, "oh, we can avoid some startup cost due to JIT compilation!" The RFC doesn't sketch out this part of the story beyond the semi-related "no fast startup" time paragraph. What should the story be here: a non-goal, or open to eventual tiering?

alexcrichton · 2024-06-20T17:46:44Z

We talked a bit about this in the Wasmtime meeting today, but I wanted to record it here as well: no, this isn't intended to eventually be used for tiered compilation. Tiered compilation is a pretty major feature and would be separate from this, so I think it's safe to say it's a non-goal of this RFC.

programmerjake · 2024-06-20T23:39:52Z

I have an idea for a way to make a fast interpreter, loosely inspired by the Chicken compiler. Basically, use calls that will generally be optimized into tail calls, but also occasionally return to a top-level loop to unwind the stack in case the compiler didn't actually generate tail calls:

#[repr(usize)]
pub enum InsnResult {
    Done,
    UnwindStackAndContinue,
}
pub type InsnFn = unsafe fn(
    pc: *const u8, // program counter
    state: &mut State,
    // most ABIs allow passing some arguments in registers, exploit that for fast temporaries
    fast_reg0: usize,
    fast_reg1: usize,
) -> InsnResult;

macro_rules! decl_opcodes {
    (
        $vis:vis enum $Opcode:ident {}
        #[fn($pc:ident, $state:ident, $fast_reg0: ident, $fast_reg1: ident)]
        match _ {
            $(Self::$Variant:ident {
                $(#[type = $field_ty:ty]
                $field:ident,)*
            } => $body:block)*
        }
    ) => {
        #[derive(Copy, Clone, Eq, PartialEq, Hash, Debug)]
        #[repr(u8)]
        $vis enum $Opcode {
            $($Variant,)*
        }

        $(
        #[derive(Copy, Clone, Eq, PartialEq, Hash, Debug)]
        #[repr(packed)]
        $vis struct $Variant {
            $($vis $field: $field_ty,)*
        }

        impl $Variant {
            $vis unsafe fn run(
                mut $pc: *const u8,
                $state: &mut State,
                $fast_reg0: usize,
                $fast_reg1: usize,
            ) -> InsnResult {
                 let Self {
                     $($field,)*
                 } = unsafe {
                     let insn_ptr = $pc.cast::<Opcode>().add(1).cast::<Self>();
                     $pc = insn_ptr.add(1).cast();
                     *insn_ptr
                 };
                 $body
            }
        }
        )*

        impl $Opcode {
            $vis const DISPATCH: &[InsnFn] = &[
                $($Variant::run,)*
            ];
            #[inline(always)]
            $vis unsafe fn run(
                $pc: *const u8,
                $state: &mut State,
                $fast_reg0: usize,
                $fast_reg1: usize,
            ) -> InsnResult {
                unsafe {
                    Self::DISPATCH.get_unchecked(*$pc as usize)($pc, $state, $fast_reg0, $fast_reg1)
                }
            }
        }
    };
}

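pub unsafe fn interpret(pc: *const u8, state: &mut State) {
    state.pc = pc;
    loop {
        state.depth = 0;
        let fast_reg0 = state.fast_reg0;
        let fast_reg1 = state.fast_reg1;
        // Resume from the saved pc: after an unwind, `state.pc` holds the
        // next instruction to run rather than the start of the program.
        match Opcode::run(state.pc, state, fast_reg0, fast_reg1) {
            InsnResult::UnwindStackAndContinue => continue,
            InsnResult::Done => return,
        }
    }
}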

decl_opcodes! {
    pub enum Opcode {}
    #[fn(pc, state, fast_reg0, fast_reg1)]
    match _ {
        Self::UnwindIfTooDeep {
            #[type = u16]
            inc,
        } => {
            state.depth += inc as u32; // account for all intervening instructions too
            if state.depth > 50000 {
                state.pc = pc;
                state.fast_reg0 = fast_reg0;
                state.fast_reg1 = fast_reg1;
                return InsnResult::UnwindStackAndContinue;
            }
            Opcode::run(pc, state, fast_reg0, fast_reg1)
        }
        Self::Done {} => {
            state.pc = pc;
            state.fast_reg0 = fast_reg0;
            state.fast_reg1 = fast_reg1;
            return InsnResult::Done;
        }
        Self::BrIfF0IsZ {
            #[type = i32]
            offset,
        } => {
            if fast_reg0 == 0 {
                unsafe {
                    pc = pc.offset(offset as isize);
                }
            }
            Opcode::run(pc, state, fast_reg0, fast_reg1)
        }
        Self::BrIfF1IsZ {
            #[type = i32]
            offset,
        } => {
            if fast_reg1 == 0 {
                unsafe {
                    pc = pc.offset(offset as isize);
                }
            }
            Opcode::run(pc, state, fast_reg0, fast_reg1)
        }
        Self::AddF0F0F1 {} => {
            let fast_reg0 = fast_reg0.wrapping_add(fast_reg1);
            Opcode::run(pc, state, fast_reg0, fast_reg1)
        }
        // add more instructions here
    }
}

cfallin · 2024-06-21T00:15:32Z

@programmerjake I suspect the failure mode of that optimization -- a periodic giant pause where 50k calls are unwound -- might be a "medicine worse than the disease" so to speak; especially when designing a portable interpreter, we should assume that optimizations like "maybe tail call, no guarantees" will be brittle or fail on some platforms, so we should have a solid no-frills standard interpreter loop design, until and unless Rust itself provides us better primitives for this.

That said I also suspect that the high order bit for discussion here is whether we want to adopt an interpreter approach for portability, and whether the Cranelift compilation to bytecode approach is desirable; we'll have plenty of time to bikeshed interpreter design details later :-)

programmerjake · 2024-06-21T00:21:32Z

> @programmerjake I suspect the failure mode of that optimization -- a periodic giant pause where 50k calls are unwound -- might be a "medicine worse than the disease" so to speak

On platforms where longjmp just loads the new stack pointer and doesn't unwind, using that could eliminate the giant pause. Alternatively, the limit could be set to a much lower value, such as 1000, so you get a bunch of really short pauses. You can kind of think of the loop { match opcode { ... } } option as setting the limit to 1.
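
The effect of the limit can be seen in a small, safe miniature of the bounded-recursion scheme. This is only a sketch under stated assumptions: the `Op`, `Step`, and `State` types, the `limit` and `unwinds` fields, and the three-instruction program in the usage note are hypothetical, and the handlers recurse through one `step` function instead of a dispatch table of function pointers.

```rust
/// Hypothetical three-instruction bytecode for the demonstration.
#[derive(Clone, Copy, Debug)]
pub enum Op {
    /// acc += n
    AddImm(u64),
    /// counter -= 1; jump to the given index if counter != 0
    DecBrNz(usize),
    /// Stop execution.
    Halt,
}

pub enum Step {
    Done,
    Unwind,
}

pub struct State {
    pub pc: usize,
    pub acc: u64,
    pub counter: u64,
    pub depth: u32,
    pub limit: u32,
    /// How many times control returned to the top-level loop.
    pub unwinds: u32,
}

impl State {
    /// `counter` must be at least 1 for the demo program to terminate.
    pub fn new(counter: u64, limit: u32) -> State {
        State { pc: 0, acc: 0, counter, depth: 0, limit, unwinds: 0 }
    }
}

/// Runs one instruction, then calls itself for the next one. The call is in
/// tail position, but Rust does not guarantee it becomes a tail call, so the
/// depth check bounds stack growth either way.
fn step(prog: &[Op], pc: usize, state: &mut State) -> Step {
    state.depth += 1;
    if state.depth > state.limit {
        state.pc = pc; // remember where to resume
        return Step::Unwind;
    }
    match prog[pc] {
        Op::AddImm(n) => {
            state.acc = state.acc.wrapping_add(n);
            step(prog, pc + 1, state)
        }
        Op::DecBrNz(target) => {
            state.counter -= 1;
            let next = if state.counter != 0 { target } else { pc + 1 };
            step(prog, next, state)
        }
        Op::Halt => {
            state.pc = pc;
            Step::Done
        }
    }
}

/// Top-level loop: resets the depth and resumes from the saved pc.
pub fn interpret(prog: &[Op], state: &mut State) {
    loop {
        state.depth = 0;
        match step(prog, state.pc, state) {
            Step::Unwind => state.unwinds += 1,
            Step::Done => return,
        }
    }
}
```

Running the program `[Op::AddImm(2), Op::DecBrNz(0), Op::Halt]` with `State::new(5000, 1000)` executes 10001 instructions and returns to the top-level loop 10 times; with a limit of 1 it returns after every instruction, which is the `loop { match }` shape.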

fitzgen · 2024-06-26T19:43:32Z

@alexcrichton @abrown

> Tiered compilation is a pretty major feature and would be separate from this, so I think it's safe to say it's a non-goal of this RFC.

Just pushed a commit adding this as an explicit non-goal.

Doesn't necessarily mean that Wasmtime will never do that sort of thing, but it definitely isn't a goal for this work.

fitzgen · 2024-06-26T19:57:41Z

@programmerjake thanks for taking a look at the RFC and brainstorming interpreter design! That kind of bounded recursion is indeed a cool technique; I've done similar things in the past for recursive marking in GC implementations.

However, I'd prefer (at least initially) to start with the basic loop { match opcode { .. } } form by default but write the code with some light macro_rules usage where we can have a cargo feature that enables the explicit tail calls nightly Rust feature and uses tail calls instead of loop and match. This way, users can opt into guaranteed tail calls, getting our ~exact desired codegen for the interpreter loop, if they really need that performance and are willing to work with potential breakage due to working with unstable, cutting-edge Rust features. And otherwise, we keep the interpreter loop as complexity-free as we can.

We can, of course, evaluate proposed changes to improve perf once we have a thing that generally works and is not vaporware -- if the speedup-to-complexity ratio is high enough it may indeed make sense! -- but I'd prefer to start with getting something out the door that folks can start playing with first :)

fitzgen · 2024-06-26T20:10:12Z

@programmerjake oh also, it might be fun sending a PR to https://github.com/tipo159/rust-instruction-dispatch to add this bounded-recursion approach and see how it stacks up with the existing implementation strategies they compare between. If you end up doing this, let me know, I'd certainly be interested in seeing the results!

programmerjake · 2024-06-28T12:09:13Z

@fitzgen I made a PR here: tipo159/rust-instruction-dispatch#7

fitzgen · 2024-06-28T16:01:01Z

> @fitzgen I made a PR here: tipo159/rust-instruction-dispatch#7

Awesome! Did you happen to run their benchmarks and compare this approach's performance to the others?

fitzgen · 2024-06-28T16:08:40Z

As there seems to be pretty broad consensus and support for pursuing Pulley in the discussion here, I'd like to start the process of merging this RFC!

Motion to Finalize

Disposition: Merge

As always, details on the RFC process can be found here: https://github.com/bytecodealliance/rfcs/blob/main/accepted/rfc-process.md#making-a-decision-merge-or-close

programmerjake · 2024-06-28T21:14:40Z

> > @fitzgen I made a PR here: tipo159/rust-instruction-dispatch#7
>
> Awesome! Did you happen to run their benchmarks and compare this approach's performance to the others?

Not yet, I ended up staying up way too late. Also, I didn't integrate the longjmp unwinding, which I think will be helpful, since forcing it to use calls instead of tail calls (the force_use_stack feature) drastically reduced performance, by something like 3x on a Ryzen 7950X.

elliottt

🎉

fitzgen · 2024-07-01T21:11:20Z

As there has been signoff from two different stakeholder organizations, this RFC is entering its 10 day Final Comment Period, and the last day to raise objections before this can merge is 2024-07-11.

Thanks everyone!

fitzgen · 2024-07-12T16:51:15Z

Since no objections were raised during the final comment period, I'm going to go ahead and merge this RFC. Thanks everyone!

This commit is the first step towards implementing bytecodealliance/rfcs#35.

This commit introduces the `pulley-interpreter` crate which contains the Pulley bytecode definition, encoder, decoder, disassembler, and interpreter.

This is still very much a work in progress! It is expected that we will tweak encodings and bytecode definitions, that we will overhaul the interpreter (to, for example, optionally support the unstable Rust `explicit_tail_calls` feature), and otherwise make large changes. This is just a starting point to get the ball rolling.

Subsequent commits and pull requests will do things like add the Cranelift backend to produce Pulley bytecode from Wasm as well as the runtime integration to run the Pulley interpreter inside Wasmtime.

* Introduce the `pulley-interpreter` crate
* remove stray fn main
* Add small tests for special x registers
* Remove now-unused import
* always generate 0 pc rel offsets in arbitrary
* Add doc_auto_cfg feature for docs.rs
* enable all optional features for docs.rs
* Consolidate `BytecodeStream::{advance,get1,get2,...}` into `BytecodeStream::read`
* fix fuzz targets build
* inherit workspace lints in pulley's fuzz crate
* Merge fuzz targets into one target; fix a couple small fuzz bugs
* Add Pulley to our cargo vet config
* Add pulley as a crate to publish
* Move Pulley fuzz target into top level fuzz directory

fitzgen · 2024-07-25T23:12:39Z

Just an FYI for folks here who might not have seen, but the first Pulley-related PR, introducing the skeleton of the interpreter, the bytecode format, encoder, decoder, and disassembler just landed: bytecodealliance/wasmtime#9008

(Still very much a WIP!)

More stuff incoming soon. Won't cross-post everything over here though, just this initial message.

[New RFC] Pulley: a portable, optimizing interpreter for Wasmtime

607417c

Add "tiering up" as an explicit non-goal

b67e6da

programmerjake mentioned this pull request Jun 28, 2024

add demo for using tail calls -- with unwinding for if the calls aren't optimized to tail calls tipo159/rust-instruction-dispatch#7

Open

jameysharp approved these changes Jul 1, 2024

elliottt approved these changes Jul 1, 2024

alexcrichton approved these changes Jul 1, 2024

cfallin approved these changes Jul 1, 2024

tschneidereit mentioned this pull request Jul 9, 2024

Draft implementation of wasm shaper harfbuzz/rustybuzz#122

Merged

fitzgen merged commit de8616b into bytecodealliance:main Jul 12, 2024

fitzgen deleted the pulley branch July 12, 2024 16:51

bjorn3 mentioned this pull request Jul 22, 2024

Does wasmtime support interpreter? bytecodealliance/wasmtime#8984

Closed

fitzgen mentioned this pull request Jul 24, 2024

Introduce the pulley-interpreter crate bytecodealliance/wasmtime#9008

Merged
