Mike's corner of the web.

Adventures in WebAssembly object files and linking

Tuesday 20 April 2021 20:10

I've been tinkering with a toy compiler, and recently added support for compiling to WebAssembly (aka Wasm). At first, I compiled to the text format (aka .wat), and also wrote the runtime using the text format. I didn't want to write the entire runtime directly in Wasm though, so I looked into how to write the runtime in C instead. I managed to get things working after a few days of tinkering and some helpful advice, but since WebAssembly is comparatively new, guides were a bit thin on the ground. This is my attempt, then, to capture what I did in case it's of use to anyone.

My compiler originally worked by outputting everything -- code compiled from my language and the runtime alike -- into a single .wat file, and then compiling that into a WebAssembly module. Since I wanted to use different compilers for different parts of the final code -- my compiler for my language, Clang for the runtime written in C -- outputting everything into one enormous .wat file wasn't possible any more. Instead, Clang can produce a WebAssembly object file, which meant I needed to change my compiler to similarly produce WebAssembly object files, and then link them together using wasm-ld. In other words, at a high level, I wanted to be able to do something like this:

clang --target=wasm32 -c runtime.c -o runtime.o
my-compiler program.src -o program.o
wasm-ld runtime.o program.o -o program.wasm

WebAssembly object files are structured as WebAssembly modules with additional custom sections. Since the text format doesn't support custom sections, the first step was to change my compiler to output in the binary format instead. Fortunately, having previously read the WebAssembly spec to understand the structure of WebAssembly and the text format, this was reasonably straightforward. The only bit I found particularly fiddly was dealing with LEB128 encoding.
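To give a flavour of the fiddly bit, here is a minimal sketch of LEB128 encoding in Python. It isn't the code from my compiler; the function names encode_uleb128 and encode_sleb128 are just made up for illustration. The unsigned form is used for things like indices and sizes, and the signed form for operands such as the argument to i32.const.

```python
def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F  # take the low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow: set the continuation bit
        else:
            out.append(byte)  # last byte: continuation bit clear
            return bytes(out)


def encode_sleb128(value: int) -> bytes:
    """Encode an integer as signed LEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7  # arithmetic shift, so negative values tend towards -1
        sign_bit_set = bool(byte & 0x40)
        # Stop once the remaining value is pure sign extension of this byte.
        if (value == 0 and not sign_bit_set) or (value == -1 and sign_bit_set):
            out.append(byte)
            return bytes(out)
        out.append(byte | 0x80)
```

The easy mistake is the stopping condition for the signed form: 64 needs two bytes (0xC0 0x00), since a lone 0x40 would decode as -64.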

The structure of object files isn't part of the WebAssembly spec itself, and so is documented separately. I'll leave the details to the docs, and just mention at a high level the changes I had to make compared to directly producing a Wasm module:

  • Instead of defining memory in the module and exporting it, import __linear_memory from the env module.

  • Instead of defining a table in the module, import __indirect_function_table from the env module.

  • Instead of having an immutable global containing a memory address, have a mutable global with the memory address being written as part of the initialisation function. This was necessary since the memory address needs to be relocatable, but relocations aren't valid in the global section.

  • Whenever emitting relocatable data -- in my case, this was type indices, function indices, global indices, table entry indices and memory addresses in the code section -- write the data as maximally padded LEB128 encoded integers so that updated values can be written without affecting the position of other bytes. For instance, the function index 3 should be written as 0x83 0x80 0x80 0x80 0x00.

  • Add a linking section i.e. a custom section with the name "linking".

  • Make sure all of the data that need to be contiguous are in the same data segment. For instance, my compiler compiles strings to a {length, UTF-8 data} struct. Previously, my compiler was generating this as two data segments, one for the length, and one for the UTF-8 data. Since the linker can, in principle (I think!), rearrange data segments, I changed this to be represented by a single data segment.

  • Write all of the data segments into a WASM_SEGMENT_INFO subsection in the linking section. Since each segment requires a name, I just gave each segment the name "DATA_SEGMENT_${dataSegmentIndex}".

  • Add a symbol table to the linking section. For my compiler, the symbols I needed to emit were:

    • Functions (both imported and defined)
    • Globals
    • Memory addresses (for my compiler, these have a 1:1 mapping with the data segments, so I just generated the name "DATA_${dataSegmentIndex}")

  • Instead of emitting a start section that points to the initialisation function, emit a WASM_INIT_FUNCS subsection in the linking section with a single entry that points to the initialisation function. Since the function is referenced by symbol index, not function index, this subsection needs to go after the symbol table.

  • Add a relocation custom section for the code section. Anywhere in the code section that references relocatable entities should have a relocation entry. Note that the indices are for symbol entries, not the indices that are used by instructions (such as function indices). For my compiler, I emit relocation entries for:

    • Function indices (arguments to call instructions)
    • Global indices (arguments to global.set and global.get instructions)
    • Type indices (arguments to call_indirect instructions)
    • Memory addresses (arguments to i32.const instructions to produce values eventually used by load and store instructions)
    • Table entry indices (arguments to i32.const instructions to produce values eventually used by call_indirect instructions). Note that the index in the relocation entry itself should reference the index of the function symbol, not a table entry index.

I didn't have any relocatable data in my data section, so I didn't need to worry about that.
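The maximally padded encoding mentioned in the list above is easiest to see in code. This is a rough sketch in Python rather than anything lifted from my compiler or from wasm-ld: the helper names and the tiny byte sequence being patched are made up for illustration, and it only covers the unsigned case (signed operands, such as the argument to i32.const, need the signed equivalent). The point is that every value occupies exactly five bytes, so a new value can be written in place without moving anything else.

```python
PADDED_LEN = 5  # enough bytes for any 32-bit value, 7 bits per byte


def encode_padded_uleb128(value: int) -> bytes:
    """Encode a 32-bit unsigned value as exactly five LEB128 bytes."""
    if not 0 <= value < 2**32:
        raise ValueError("value does not fit in 32 bits")
    out = bytearray()
    for i in range(PADDED_LEN):
        byte = (value >> (7 * i)) & 0x7F
        if i < PADDED_LEN - 1:
            byte |= 0x80  # continuation bit on every byte except the last
        out.append(byte)
    return bytes(out)


def decode_uleb128(data, offset=0):
    """Decode unsigned LEB128 at offset; return (value, offset after it)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return result, offset


def patch_padded_uleb128(code: bytearray, offset: int, new_value: int) -> None:
    """Overwrite a padded value in place; the buffer length is unchanged."""
    code[offset:offset + PADDED_LEN] = encode_padded_uleb128(new_value)


# Hypothetical fragment of a function body: call (0x10) with function
# index 3, followed by end (0x0B).
code = bytearray([0x10]) + encode_padded_uleb128(3) + bytearray([0x0B])
length_before = len(code)

# Roughly what applying a relocation amounts to: the index changes,
# but the surrounding bytes stay where they are.
patch_padded_uleb128(code, 1, 300)
```

After the patch, the buffer is the same length as before, the end opcode is still the last byte, and decoding at offset 1 gives 300.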

With those changes, I was able to make an object file that could be combined with the object file from Clang to make the final Wasm module.
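For reference, the linking and relocation sections are ordinary custom sections as far as the binary format is concerned: section id 0, then the size of the contents, then a length-prefixed name, then the payload. Here's a minimal sketch of that framing in Python. The function names are made up, and the payload is treated as opaque bytes, since its actual contents (the subsections, symbol table and so on) are defined by the object file documentation rather than shown here.

```python
def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_name(name: str) -> bytes:
    """Encode a name as a byte length followed by its UTF-8 bytes."""
    raw = name.encode("utf-8")
    return encode_uleb128(len(raw)) + raw


def encode_custom_section(name: str, payload: bytes) -> bytes:
    """Wrap a payload in a custom section (section id 0)."""
    contents = encode_name(name) + payload
    # The size covers the name and the payload, but not the id or the size itself.
    return bytes([0x00]) + encode_uleb128(len(contents)) + contents


# Placeholder payload only: a real linking section has structured contents.
section = encode_custom_section("linking", b"\x01\x02\x03")
```

Note that the section size here uses the ordinary, unpadded encoding: the padding is only needed for values that the linker will rewrite.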

One thing worth noting is that a __wasm_call_ctors function is synthesised in the final Wasm module, which calls all of the initialisation functions. LLVM 12 has a change to wasm-ld which means that any entry point (considered to be any exported function) has a call to __wasm_call_ctors automatically inserted if there isn't an existing explicit call to __wasm_call_ctors. In other words, if you're using LLVM 12, you probably don't need to worry about calling __wasm_call_ctors yourself.

One last thought: I struggled to work out what the right place to ask for advice was, such as a mailing list or forum. I stumbled across the WebAssembly Discord server after I'd already found answers and gotten some helpful advice on a GitHub issue, but it seems pretty active, so that might be a good starting place if you have questions or get stuck. If there's anywhere else with an active community, I'd love to hear about it!

Topics: WebAssembly

Thoughts? Comments? Feel free to drop me an email at hello@zwobble.org. You can also find me on Twitter as @zwobble.