Challenge 2: #debugging. You spend a ton of time writing the compiler, fingers crossed that at least hello world works, but instead you get "table/memory out of bounds"! "Unreachable instruction executed"! What on earth went wrong!?? There's no shame in not getting things right on the first try, especially with compilers: one small mistake may be amplified repeatedly at compile-time, turning the output into a pile of trash.
But you're unlucky if you're targeting #wasm. You may have searched the internet and found blog posts about source maps, DWARF, the V8 inspector, or some wasm engine claiming to support debugging via lldb/gdb. My own experience as of today: they are extremely fragile, not to say non-existent. Give them a try anyway, but keep in mind they don't qualify as your lifeboat, and getting the DWARF stuff working takes a ton of extra effort during code generation!
There are still some strategies you can follow.
First and foremost: crash early. Instrument your code aggressively: whenever you doubt that a property holds at runtime, assert it. It's common for the runtime state to be corrupt already while the module keeps running and only trips later, in some seemingly unrelated place. You may also dump logs; they do help sometimes.
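As a rough sketch of what "assert it" can look like inside a code generator, here is a helper that wraps a condition in folded WebAssembly text so the module traps right where the property breaks. Everything here is hypothetical: the helper name, the `$assert_failed` import (assumed to be a host function that logs which check fired), and the idea that your compiler emits textual wasm at all.

```python
def emit_assert(cond_wat: str, site_id: int) -> str:
    """Return folded WebAssembly text that traps when `cond_wat` evaluates to 0.

    `cond_wat` is an i32-typed expression in folded text form.
    `$assert_failed` is a hypothetical imported function that reports
    `site_id` to the host before the trap, so you know which check fired.
    """
    return (
        f"(if (i32.eqz {cond_wat})\n"
        f"  (then\n"
        f"    (call $assert_failed (i32.const {site_id}))\n"
        f"    (unreachable)))"
    )


if __name__ == "__main__":
    # Example: check that a (hypothetical) pointer local is 8-byte aligned.
    aligned = "(i32.eqz (i32.and (local.get $ptr) (i32.const 7)))"
    print(emit_assert(aligned, site_id=42))
```

Giving each check its own id makes the trap point back at the emitting site in the compiler, which is most of the value when no debugger is available.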
Next: shrink it. Use wasm-reduce in #binaryen to shrink the wasm module, or even better, use #creduce to shrink the miscompiled module's assembly source (if you know it's the crime scene) or the offending input that triggers the bug. Shrinking is an absolute must to minimize the debugging overhead. In the worst case you don't get additional insight, but at least you get some coffee breaks to relax :/
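To make the idea of shrinking concrete, here is a toy greedy reducer. It is a simplified chunk-removal loop in the spirit of delta debugging, not what wasm-reduce or creduce actually do internally; the names `shrink` and `still_fails` are made up for illustration. In practice `still_fails` would run your compiler and the resulting module and report whether the same failure still shows up.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def shrink(items: Sequence[T], still_fails: Callable[[List[T]], bool]) -> List[T]:
    """Greedily drop chunks of `items` while `still_fails` keeps returning True."""
    current = list(items)
    assert still_fails(current), "the starting input must reproduce the bug"
    chunk = max(len(current) // 2, 1)
    while True:
        i = 0
        progress = False
        while i < len(current):
            # Try the input with current[i : i + chunk] removed.
            candidate = current[:i] + current[i + chunk:]
            if still_fails(candidate):
                current = candidate  # keep the smaller input, retry at same index
                progress = True
            else:
                i += chunk  # this chunk is needed, move on
        if chunk > 1:
            chunk //= 2  # retry with finer granularity
        elif not progress:
            return current  # no single item can be removed any more


if __name__ == "__main__":
    # Pretend the bug needs both item 3 and item 17 to be present.
    print(shrink(list(range(20)), lambda xs: 3 in xs and 17 in xs))
```

The important part is the predicate: make it check for the specific failure you are chasing, otherwise the reducer will happily converge on a different bug.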
Sometimes you have an alternative compiler which emits correct wasm from the same input and can be treated as the source of truth. Luckily this was the case for the #ghc wasm backend! GHC has target-specific assembly generators, but also a target-independent C generator, which is meant to ease porting GHC to new platforms. It was tremendously useful when I debugged the code generator part of the wasm backend. I even spent extra effort to make the callconv & symbol names coherent between the two codegens so I could mix good/bad objects at link-time, which was super useful for narrowing down the actual crime scenes.
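Mixing good and bad objects turns into a plain bisection once both codegens agree on calling convention and symbol names. A minimal sketch, assuming exactly one object is miscompiled and that a hypothetical `links_ok` callback links the given set of objects from the suspect backend (everything else from the trusted one), runs the result, and reports whether it behaves:

```python
from typing import Callable, List, Set


def find_bad_object(names: List[str], links_ok: Callable[[Set[str]], bool]) -> str:
    """Bisect for the single object that breaks the program.

    `links_ok(from_suspect)` is a hypothetical callback: it should build the
    program with the objects in `from_suspect` taken from the suspect codegen
    and all others from the trusted one, then return True if the result works.
    """
    suspects = list(names)
    assert not links_ok(set(suspects)), "all-suspect build should reproduce the bug"
    while len(suspects) > 1:
        mid = len(suspects) // 2
        first_half = suspects[:mid]
        if links_ok(set(first_half)):
            suspects = suspects[mid:]  # first half is clean, culprit is in the rest
        else:
            suspects = first_half  # culprit is in the first half
    return suspects[0]


if __name__ == "__main__":
    objects = ["A.o", "B.o", "C.o", "D.o", "E.o"]
    culprit = "D.o"  # made-up example data
    print(find_bad_object(objects, lambda chosen: culprit not in chosen))
```

With more than one bad object the single-culprit assumption breaks and this finds only one of them; rerun after fixing, or fall back to a reducer-style search.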
Another low-effort thing to try, especially if your compiler piggybacks on other toolchains like #llvm or binaryen: turn off all optimizations. If you're lucky, it's someone else's bug :)
#llvm #ghc #creduce #binaryen #wasm #debugging
Now may be a good time for a "what i wish i knew when writing a #wasm targeting compiler" thread.
A seldom-mentioned challenge is separate compilation. Lots of compilers have separate compiling/linking steps and only emit the final wasm module during linking. So what does the compiling step emit as output object files? It can be some ad hoc IR to support link-time optimization, or it can be something that looks like wasm but whose data segments & functions can be re-arranged at link time.
When I first worked on asterius, I wrote a custom linker that used the #binaryen library to emit the final wasm. The object files were just serialized Haskell datatypes that model the binaryen IR. This approach cost me dearly: the custom object format and linking logic proved to be very slow, and what was worse, it didn't support linking with objects produced by other toolchains, like C/C++ compiled by clang. So it was incredibly hard to make asterius use the #ghc rts proper, and I was obviously not as good as Simon Marlow or other people at re-implementing the entire ghc rts!
Don't get me wrong, binaryen is not to blame, it's a fantastic tool. But binaryen doesn't support emitting object files that conform to the #llvm wasm linking convention. If your compiler targets wasm and you've firmly made up your mind to not support linking any C/C++/Rust code, then have fun cooking your own linker. Otherwise, if you want to use binaryen as anything other than a post-linking optimizer, you really really should think twice. There is much to lose in this choice.
Fine, you may say, let's emit llvm-compatible wasm objects. How to get started? There's a spec at https://github.com/WebAssembly/tool-conventions/blob/main/Linking.md: each object is itself a valid wasm module with custom sections containing linking metadata. Once you've consumed the spec, good luck playing around with LEB128 encodings and calculating binary offsets lol.
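For a taste of the LEB128 part, here is a minimal sketch of the encoders you end up writing by hand. The function names are my own. The padded variant reflects my reading of the linking convention, where relocatable LEB fields are written at a fixed 5-byte width so the linker can patch them in place; treat that as an assumption and check it against the spec.

```python
def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_sleb128(value: int) -> bytes:
    """Encode a signed integer as signed LEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7  # Python's >> is an arithmetic shift, so the sign is kept
        sign_bit_set = (byte & 0x40) != 0
        done = (value == 0 and not sign_bit_set) or (value == -1 and sign_bit_set)
        out.append(byte if done else byte | 0x80)
        if done:
            return bytes(out)


def encode_uleb128_padded(value: int, width: int = 5) -> bytes:
    """Encode as unsigned LEB128 using exactly `width` bytes (zero-padded)."""
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative values")
    out = bytearray()
    for i in range(width):
        byte = value & 0x7F
        value >>= 7
        out.append(byte | 0x80 if i < width - 1 else byte)
    if value:
        raise ValueError("value does not fit in the requested width")
    return bytes(out)


if __name__ == "__main__":
    print(encode_uleb128(624485).hex())
    print(encode_sleb128(-123456).hex())
    print(encode_uleb128_padded(1).hex())
```

The encoding itself is the easy bit. The offsets are what hurt: every section and function body is prefixed with its byte size, so changing one variable-width number can shift everything after it.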
But decent platforms all have assembly languages that liberate you from 0s and 1s and allow you to focus on symbols and entities, right? For wasm, the official textual format doesn't help with separate compilation. Luckily (o really?), there's a completely undocumented assembly language for wasm too! Just compile some C stuff with clang, use -S instead of -c, and start your cargo-cult journey from there! I repeat: no documents, no tutorials, and occasionally you need to dig into the llvm codebase and be prepared to file bugs when working with this format. Which is still the best bet in wasm separate compilation today.