Mike's corner of the web.

Adventures in WebAssembly object files and linking

Tuesday 20 April 2021 20:10

I've been tinkering with a toy compiler, and recently added support for compiling to WebAssembly (aka Wasm). At first, I compiled to the text format (aka .wat), and also wrote the runtime using the text format. I didn't want to write the entire runtime directly in Wasm though, so I looked into how to write the runtime in C instead. I managed to get things working after a few days of tinkering and some helpful advice, but since WebAssembly is comparatively new, guides were a bit thin on the ground. This is my attempt, then, to capture what I did in case it's of use to anyone.

My compiler originally worked by outputting everything -- code compiled from my language and the runtime alike -- into a single .wat file, and then compiling that into a WebAssembly module. Since I wanted to use different compilers for different parts of the final code -- my compiler for my language, Clang for the runtime written in C -- outputting everything into one enormous .wat file wasn't possible any more. Instead, Clang can produce a WebAssembly object file, which meant I needed to change my compiler to similarly produce WebAssembly object files, and then link them together using wasm-ld. In other words, at a high level, I wanted to be able to do something like this:

clang --target=wasm32 -c runtime.c -o runtime.o
my-compiler program.src -o program.o
wasm-ld runtime.o program.o -o program.wasm

WebAssembly object files are structured as WebAssembly modules with additional custom sections. Since the text format doesn't support custom sections, the first step was to change my compiler to output in the binary format instead. Fortunately, having previously read the WebAssembly spec to understand the structure of WebAssembly and the text format, this was reasonably straightforward. The only bit I found particularly fiddly was dealing with LEB128 encoding.
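
To give a flavour, here's a minimal sketch of unsigned LEB128 encoding in Python -- illustrative rather than production code -- along with the maximally padded variant that comes up when emitting relocatable data (covered below):

def encode_u32(value):
    # Emit seven bits at a time, setting the continuation bit (0x80)
    # on every byte except the last.
    encoded = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            encoded.append(byte | 0x80)
        else:
            encoded.append(byte)
            return bytes(encoded)

def encode_u32_padded(value):
    # Relocatable fields are written at the maximum width for a u32
    # (five bytes) so the linker can patch them in place without
    # shifting any other bytes.
    encoded = bytearray()
    for _ in range(4):
        encoded.append((value & 0x7F) | 0x80)
        value >>= 7
    encoded.append(value & 0x7F)
    return bytes(encoded)

assert encode_u32(3) == b"\x03"
assert encode_u32_padded(3) == b"\x83\x80\x80\x80\x00"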

The structure of object files isn't part of the WebAssembly spec itself, and so is documented separately. I'll leave the details to the docs, and just mention at a high level the changes I had to make compared to directly producing a Wasm module:

  • Instead of defining memory in the module and exporting it, import __linear_memory from the env module.

  • Instead of defining a table in the module, import __indirect_function_table from the env module.

  • Instead of having an immutable global containing a memory address, have a mutable global with the memory address being written as part of the initialisation function. This was necessary since the memory address needs to be relocatable, but relocations aren't valid in the global section.

  • Whenever emitting relocatable data -- in my case, this was type indices, function indices, global indices, table entry indices and memory addresses in the code section -- write the data as maximally padded LEB128 encoded integers so that updated values can be written without affecting the position of other bytes. For instance, the function index 3 should be written as 0x83 0x80 0x80 0x80 0x00.

  • Add a linking section i.e. a custom section with the name "linking".

  • Make sure all of the data that need to be contiguous are in the same data segment. For instance, my compiler compiles strings to a {length, UTF-8 data} struct. Previously, my compiler was generating this as two data segments, one for the length, and one for the UTF-8 data. Since the linker can, in principle (I think!), rearrange data segments, I changed this to be represented by a single data segment.

  • Write all of the data segments into a WASM_SEGMENT_INFO subsection in the linking section. Since each segment requires a name, I just gave each segment the name "DATA_SEGMENT_${dataSegmentIndex}".

  • Add a symbol table to the linking section. For my compiler, the symbols I needed to emit were:

    • Functions (both imported and defined)
    • Globals
    • Memory addresses (for my compiler, these have a 1:1 mapping with the data segments, so I just generated the name "DATA_${dataSegmentIndex}")

  • Instead of emitting a start section that points to the initialisation function, emit a WASM_INIT_FUNCS subsection (in the linking section) with a single entry that points to the initialisation function. Since the function is referenced by symbol index, not function index, this subsection needs to go after the symbol table.

  • Add a relocation custom section for the code section. Anywhere in the code section that references relocatable entities should have a relocation entry. Note that the indices are for symbol entries, not the indices that are used by instructions (such as function indices). For my compiler, I emit relocation entries for:

    • Function indices (arguments to call instructions)
    • Global indices (arguments to global.set and global.get instructions)
    • Type indices (arguments to call_indirect instructions)
    • Memory addresses (arguments to i32.const instructions to produce values eventually used by load and store instructions)
    • Table entry indices (arguments to i32.const instructions to produce values eventually used by call_indirect instructions). Note that the index in the relocation entry itself should reference the index of the function symbol, not a table entry index.

I didn't have any relocatable data in my data section, so I didn't need to worry about that.

With those changes, I was able to make an object file that could be combined with the object file from Clang to make the final Wasm module.

One thing worth noting is that a __wasm_call_ctors function is synthesised in the final Wasm module, which calls all of the initialisation functions. LLVM 12 has a change to wasm-ld which means that any entry point (considered to be any exported function) has a call to __wasm_call_ctors automatically inserted if there isn't an existing explicit call to __wasm_call_ctors. In other words, if you're using LLVM 12, you probably don't need to worry about calling __wasm_call_ctors yourself.

One last thought: I struggled to work out the right place to ask for advice, such as a mailing list or forum. I stumbled across the WebAssembly Discord server after I'd already found answers and gotten some helpful advice on a GitHub issue, but it seems pretty active, so that might be a good starting place if you have questions or get stuck. If there's anywhere else with an active community, I'd love to hear about it!

Topics: WebAssembly

GraphQL composition can radically simplify data query maintenance

Saturday 27 March 2021 12:00

Using GraphQL has plenty of well-documented benefits, avoiding under- and over-fetching, strong typing, and tooling support among them. However, I think the value of the composability of GraphQL tends to be significantly underplayed. This composability allows the data requirements for a component to be specified alongside the component itself, making the code easier to understand, and radically easier and safer to change.

Let's take an example. Suppose I'm building a music application in React, and I have a page showing a playlist of tracks. I might write a PlaylistPage component, which in turn uses a PlaylistTracks component to render the tracks in a playlist:

// PlaylistPage.js

const playlistQuery = graphql`
    query PlaylistQuery($playlistId: ID!) {
        playlist(id: $playlistId) {
            name
            tracks {
                id
                title
            }
        }
    }
`;

function PlaylistPage(props) {
    const {playlistId} = props;
    const result = useQuery(playlistQuery, {playlistId: playlistId});

    if (result.type === "loaded") {
        const {playlist} = result.data;
        return (
            <Page>
                <Heading>{playlist.name}</Heading>
                <PlaylistTracks playlist={playlist} />
            </Page>
        );
    } else ...
}

// PlaylistTracks.js

function PlaylistTracks(props) {
    const {playlist} = props;
    return (
        <ul>
            {playlist.tracks.map(track => (
                <li key={track.id}>
                    {track.title}
                </li>
            ))}
        </ul>
    );
}

The query for our page currently includes everything needed to render the page, much the same as if we'd used a REST endpoint to fetch the data. The name field is needed since that's used directly in the PlaylistPage component. The tracks field, with id and title subfields, is also needed since it's used by PlaylistTracks. This seems manageable for the moment: if we stop using PlaylistTracks in PlaylistPage, we should remove the tracks field from the query. If we start using the artist field on a playlist in PlaylistTracks, we should add the artist field to the query for PlaylistPage.

Now imagine that PlaylistPage uses many more components that each take the playlist as a prop, and those components might in turn use other components. If you stopped using a component, would you know which fields were safe to remove from the query? Imagine if a deeply nested component is used on many pages. If you change the component to use another field, are you sure you've updated all of the queries to now include that field?

As an alternative, we can define a fragment for PlaylistTracks that we can then use in PlaylistPage:

// PlaylistPage.js

const playlistQuery = graphql`
    query PlaylistQuery($playlistId: ID!) {
        playlist(id: $playlistId) {
            name
            ${PlaylistTracks.playlistFragment}
        }
    }
`;

export default function PlaylistPage(props) {
    const result = useQuery(playlistQuery, {playlistId: props.playlistId});

    if (result.type === "loaded") {
        const {playlist} = result.data;
        return (
            <Page>
                <Heading>{playlist.name}</Heading>
                <PlaylistTracks playlist={playlist} />
            </Page>
        );
    } else ...
}

// PlaylistTracks.js

export default function PlaylistTracks(props) {
    return (
        <ul>
            {props.playlist.tracks.map(track => (
                <li key={track.id}>
                    {track.title}
                </li>
            ))}
        </ul>
    );
}

PlaylistTracks.playlistFragment = graphql`
    ... on Playlist {
        tracks {
            id
            title
        }
    }
`;

If we stop using PlaylistTracks in PlaylistPage, it's very clear which part of the query we should also remove: ${PlaylistTracks.playlistFragment}. If we start using the artist field on a playlist in PlaylistTracks, we can add the artist field to playlistFragment on PlaylistTracks, without having to directly edit playlistQuery. If the component is used in twenty different pages and therefore twenty different queries need to fetch the data for PlaylistTracks, we still only need to update that one fragment.

On larger codebases, I've found this approach of colocating components with their data requirements radically simplifies data handling. Changes to a single component only require editing that one component, even when the data requirements change, without worrying about whether the data is available, why some data is queried, or whether it can be safely removed.

Topics: Software design

GraphJoiner: Implementing GraphQL with joins

Thursday 27 October 2016 20:23

I've had the chance to play around with GraphQL recently, and so far have found it all quite pleasant. Compared to fixed REST endpoints, being able to describe the data you need has resulted in less churn in the API. It also allows descriptions of the data needed to be co-located with the components actually using the data, making the code easier to understand and change.

However, most implementations of GraphQL libraries require some batching logic to behave efficiently. For instance, consider the request:

{
    books(genre: "comedy") {
        title
        author {
            name
        }
    }
}

A naive GraphQL implementation would issue one SQL query to get the list of all books in the comedy genre, and then N queries to get the author of each book (where N is the number of books returned by the first query).

The standard solution I've seen is to batch together requests. In node.js this is reasonably straightforward using promises and the event loop: don't start fulfilling the promise for each author until other requests in the same tick of the event loop have run, at which point you can fulfil those promises in bulk. The downside? This is trickier to implement in other languages that aren't built around an event loop.
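
As a toy sketch of that batching trick using Python's asyncio (so, again, leaning on an event loop), loads requested in the same tick are collected and fulfilled with one bulk fetch. The names here are illustrative:

import asyncio

class BatchLoader:
    # Collects loads made in the same event loop tick and fulfils them
    # with a single bulk fetch.
    def __init__(self, bulk_fetch):
        self._bulk_fetch = bulk_fetch
        self._pending = {}

    def load(self, key):
        loop = asyncio.get_running_loop()
        if not self._pending:
            # First load in this tick: schedule the bulk fetch to run
            # once the other loads in this tick have been requested.
            loop.call_soon(self._dispatch)
        if key not in self._pending:
            self._pending[key] = loop.create_future()
        return self._pending[key]

    def _dispatch(self):
        pending, self._pending = self._pending, {}
        results = self._bulk_fetch(list(pending))
        for key, future in pending.items():
            future.set_result(results[key])

async def main():
    def fetch_authors(author_ids):
        print("one bulk fetch:", author_ids)
        return {author_id: "Author %s" % author_id for author_id in author_ids}

    authors = BatchLoader(fetch_authors)
    print(await asyncio.gather(authors.load(1), authors.load(2), authors.load(1)))

asyncio.run(main())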

Another issue with the standard implementation is performance. Normally, you define a resolver that gets executed on each field. In our example, the resolvers for books and author would issue requests to get data from the database, but most resolvers just read a field: for instance, the resolver for title can just read the title field from the book that we got back from the database. The problem? If some resolvers can return asynchronous results, then you always need to handle the possibility of an asynchronous result. When dealing with larger responses, this can mean most of the time is spent in the overhead of invoking resolvers.

As an alternative, I suggest that you can get the data you want in three steps. First, get all of the books in the comedy genre with the requested scalar fields (i.e. their titles), along with their author IDs. Next, get all of the authors of books in the comedy genre with the requested scalar fields (i.e. their names), along with their IDs. Finally, using the IDs that we've fetched, join the authors onto the books.
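
As a sketch of those three steps in Python -- using sqlite3 and an illustrative schema (a books table with an author_id column, and an authors table with id and name) -- rather than anything GraphJoiner-specific:

import sqlite3

def fetch_comedy_books(connection):
    connection.row_factory = sqlite3.Row

    # Step 1: one query for the books, with their scalar fields and
    # author IDs.
    books = connection.execute(
        "SELECT id, title, author_id FROM books WHERE genre = 'comedy'"
    ).fetchall()
    if not books:
        return []

    # Step 2: one query for all of the authors of those books.
    author_ids = sorted({book["author_id"] for book in books})
    placeholders = ", ".join("?" * len(author_ids))
    authors = connection.execute(
        "SELECT id, name FROM authors WHERE id IN (%s)" % placeholders,
        author_ids,
    ).fetchall()

    # Step 3: join the authors onto the books using the IDs.
    authors_by_id = {author["id"]: author for author in authors}
    return [
        {"title": book["title"],
         "author": {"name": authors_by_id[book["author_id"]]["name"]}}
        for book in books
    ]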

In other words: rather than calling resolve for every field in the response, and batching together requests, I suggest that you can execute intermediate queries for each non-trivial field in the request, and then join the results together. As proof that the idea is at least somewhat workable, I've created a library called GraphJoiner, available as node-graphjoiner for node.js and python-graphjoiner for Python. When a response contains arrays with hundreds or thousands of results, I've found not having to call resolvers on every response field has massively reduced execution time (it was previously the majority of time spent handling the request, far exceeding the time to actually execute the SQL query). Hopefully someone else finds it useful!

Topics: Software design

Power à la Carte: fine-grained control of power in programming languages

Monday 28 March 2016 11:05

Proposal: general purpose programming languages should provide fine-grained control of the power that a particular module can use. "Power" includes language features and what modules it is allowed to depend on, whether built into the language, from another package, or within the same codebase.

Suppose I'm writing some code to calculate how much tax someone should pay in a given year. I might want to forbid the use of floating point arithmetic in any of these calculations, but this isn't possible in most languages. In this proposal, I can declare what domain a module belongs to. For each domain, I can then declare what language features it's allowed to use. In this case, I'd be able to declare code as belonging to the "tax calculation" domain, and ensure that that domain cannot use floating point.

Take mutability as another example. I find most code I write is easier to reason about if it doesn't use mutability, but there are a small set of problems that I find are best expressed with mutable code, either for clarity or performance. By choosing exactly when mutability is permitted, I can still use mutability when needed without introducing its complexity into the rest of my codebase. Provided that the code doesn't mutate its arguments nor global state, the rest of the code doesn't need to know that mutability is being used: the contract of the functions can still be pure.

For each domain, I can also declare what other domains that domain is allowed to access. This allows enforcement of separation of concerns i.e. we can ensure that domains aren't intertwined when they shouldn't be. For instance, how tax is calculated shouldn't depend on personal details of an individual, such as their name. This also allows us to see what domains are used without having to read and understand the code of a module.

As another example, suppose I'm writing code to calculate the trajectory of a spacecraft. If I represent values such as distance or speed as integers, then it's possible to do something nonsensical like adding together two distances that use different units, or add together a distance and a speed. Within the domain of the trajectory calculation, I could represent these values with specialised data types preventing the incorrect use of these types. At some point, I may wish to unwrap these values, say to print them, but I can enforce that such unwrapping never happens during calculations. I'll also need to create these wrapped values in the first place, but I can declare and ensure that this value creation only occurs at well-defined boundaries, such as when reading sensors, and never directly within the calculation code.

In other words: the "mechanics" domain uses the "integer arithmetic" domain in its implementation, but that fact is private to the "mechanics" domain. In the "trajectory" domain, I explicitly declare that I can use values from the "mechanics" domain (such as adding together distances or dividing a distance by a time to get a speed), but that doesn't allow me to get their underlying integer representation nor create them from integers. In the "sensor" domain, I explicitly declare that I can create these values, meaning I can read the raw integers from some hardware, and turn them into their appropriate data types.
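
Today's languages can't enforce those domain boundaries, but as a rough sketch of the wrapped values themselves (in Python, with illustrative names):

from dataclasses import dataclass

@dataclass(frozen=True)
class Metres:
    raw: int

    def __add__(self, other):
        # Adding two distances makes sense; adding a distance to a
        # speed (or to a bare integer) does not.
        if not isinstance(other, Metres):
            return NotImplemented
        return Metres(self.raw + other.raw)

@dataclass(frozen=True)
class MetresPerSecond:
    raw: int

# The "sensor" domain: the only place raw integers should become
# wrapped values. Making "only" checkable is what this proposal adds.
def read_altitude_sensor(raw_reading):
    return Metres(raw_reading)

altitude = read_altitude_sensor(100) + read_altitude_sensor(20)  # Metres(raw=120)
# read_altitude_sensor(100) + MetresPerSecond(5) raises a TypeError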

In the previous example, we saw how we wanted the fact that the "mechanics" domain uses the "integer arithmetic" domain to be private, but there are times when we don't want to allow this hiding. For instance, it's useful to definitively know whether a piece of code might write to disk, even if the write doesn't happen directly in that code.

I might also want to be able to enforce separation of concerns at a much higher level. Being able to enforce this separation is one of the cited benefits of a microservice architecture: it would be nice to be able to get this benefit without having to build a distributed system.

I believe part of the potential benefit of this is from going through the process of declaring what other domains a domain can access. It's often extremely easy to pull in another dependency or to rely on another part of our code without thinking about whether that makes conceptual sense. This is still possible in the system I describe, but by making the programmer be explicit, they are prompted to stop and think. I've made many bad decisions because I didn't even realise I was making a decision.

Some of this separation of domains is already possible: although I can't restrict access to language features, I have some control over what other modules each module can access: for instance, by using separate jars in Java, or assemblies in .NET. However, these are quite coarse and heavyweight: it's hard to use them to express the level of fine-grained control that I've described. The other downside is that by requiring a dependency, you normally end up being able to access its transitive dependencies: again, something that's often undesirable as previously described.

Using something like algebraic effect systems is probably more complex than necessary. The use of a language feature or domain could be modelled as an effect, but tracking this on an expression granularity is probably unhelpful: instead, tracking it by module is both simpler and more appropriate.

In fact, something like Java's annotations, C#'s attributes or Python's decorators are probably sufficient to express what domain each module resides in. You'd then need to separately define what power each domain has, which could be written in a comparatively simple declaration language. I suspect the difficulty comes not in implementing such a checker, but in working out how to use it effectively, particularly what granularity is appropriate.
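
As a minimal sketch of such a checker in Python -- a toy with illustrative names, checking only one power (floating point):

import ast

DOMAIN_POWERS = {
    "tax-calculation": {"floating-point": False},
    "reporting": {"floating-point": True},
}

def check_module(source, domain):
    # Walk the module's AST, rejecting any feature the domain lacks
    # the power to use.
    powers = DOMAIN_POWERS[domain]
    for node in ast.walk(ast.parse(source)):
        uses_float = isinstance(node, ast.Constant) and isinstance(node.value, float)
        if uses_float and not powers["floating-point"]:
            raise ValueError(
                "domain %r may not use floating point (line %d)"
                % (domain, node.lineno)
            )

check_module("total = price * quantity", "tax-calculation")  # fine
check_module("total = price * 0.2", "reporting")             # fine
# check_module("total = price * 0.2", "tax-calculation")     # raises ValueError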

Topics: Software design, Language design

Nope: a statically-typed subset of Python that compiles to JS and C#

Sunday 22 March 2015 15:55

Nope is a statically-typed subset of Python that uses comments as type annotations. For instance, here's the definition of a Fibonacci function:

#:: int -> int
def fib(n):
    seq = [0, 1]
    for i in range(2, n + 1):
        seq.append(seq[i - 1] + seq[i - 2])
    
    return seq[n]

print(fib(10))

And here's a generic identity function:

#:: T => T -> T
def identity(value):
    return value

Since the types are just comments, any Nope program is directly executable as a Python program without any extra dependencies.

Having written your program with type annotations, you can now compile it to some horrible-looking JavaScript or C# and run it:

$ python3 fib.py
55
$ nope compile fib.py --backend=dotnet --output-dir=/tmp/fib
$ /tmp/fib/fib.exe
55
$ nope compile fib.py --backend=node --output-dir=/tmp/fib
$ node /tmp/fib/fib.js
55

Why?

A little while ago, I wrote mammoth.js, a library for converting Word documents to HTML. Since some people found it useful, I ported it to Python. Both implementations are extremely similar, and don't heavily exploit the dynamic features of either language. Therefore, a third port, even to a statically-typed language such as C# or Java, is likely to end up looking extremely similar.

This led to the question: if I annotated the Python implementation with appropriate types, could I use it to generate vaguely sensible C# or Java code? Nope is an experiment to find out the answer.

Should I use it?

This project is primarily an experiment and a bit of fun. Feel free to have a play around, but I'd strongly suggest not using it for anything remotely practical. Not that you'd be able to anyway: many essential features are still missing, while existing features are likely to change. The type-checking is sufficiently advanced to allow partial type-checking of the Python port of Mammoth.

I discuss a few alternatives a little later on, and explain how Nope differs. In particular, there are a number of existing type checkers for Python, the most prominent being mypy.

Examples

Simple functions

Functions must have a preceding comment as a type annotation.

#:: int -> int
def triple(value):
    return value * 3

Functions with optional and named arguments

Optional arguments are indicated with a leading question mark, and must have a default value of None. For instance:

#:: name: str, ?salutation: str -> str
def greeting(name, salutation=None):
    if salutation is None:
        salutation = "Hello"
    
    return salutation + " " + name

print(greeting("Alice"))
print(greeting("Bob", salutation="Hi"))

Note that the type of salutation changes as it is reassigned in a branch of the if-else statement. It is initially of type str | none, which is the union of the formal argument type and none (since it's optional). After the if-else statement, it is of the narrower type str, which allows it to be safely used in the string concatenation.

Variables

Most of the time, Nope can infer a suitable type for variables. However, there are occasions where an explicit type is required, such as when inferring the type of empty lists. In these cases, an explicit type can be specified in a similar manner to functions:

#:: list[int]
x = []

Classes

When annotating the self argument of a method, you can use the explicit name of the class:

class Greeter(object):
    #:: Greeter, str -> str
    def hello(self, name):
        return "Hello " + name

For convenience, Nope also introduces Self into the scope of the class, which can be used instead of referring to the containing class directly:

class Greeter(object):
    #:: Self, str -> str
    def hello(self, name):
        return "Hello " + name

As with local variables, instance variables assigned in __init__ can have type annotations, but will often be fine using the inferred type:

class Greeter(object):
    #:: Self, str -> none
    def __init__(self, salutation):
        self._salutation = salutation

    #:: Self, str -> str
    def hello(self, name):
        return self._salutation + " " + name

Generic classes are also supported, although the exact syntax might change:

#:generic T
class Result(object):
    #:: Self, T, list[str] -> none
    def __init__(self, value, messages):
        self.value = value
        self.messages = messages

Code transformations

To preserve some of the advantages of working in a dynamic language, Nope supports code transformations: given the AST of a module, a transformation returns an AST that should be used for type-checking and code generation. So long as the transformation and the original runtime behaviour are consistent, this allows you to use code such as collections.namedtuple:

import collections

User = collections.namedtuple("User", [
    #:: str
    "username",
    #:: str
    "password",
])

The current implementation is a bit of a hack, but the ultimate goal is to let a user specify transformations to apply to their code. Ideally, this would allow Python libraries such as SQLAlchemy to be supported in a type-safe manner.

And the rest

Nope also supports while and for loops, try statements, raise statements, destructuring assignment and plenty more. However, I've left them out of this post since they look the same as they do in Python. The only difference is that Nope will detect when inappropriate types are used, such as when trying to raise a value that isn't an exception.

I've started using Nope on a branch of Mammoth. Only some of Mammoth's modules are currently being type-checked, such as html_generation.

If you're feeling brave, Nope has a set of execution tests that check and compile sample programs. It's not the greatest codebase in the world, with many unhandled or improperly handled cases, but feel free to read the tests if you want to dive in and see exactly what Nope supports. At the moment, Nope compiles to Python (which means just copying the files verbatim), JavaScript and C# with varying degrees of feature-completeness. The C# implementation in particular has huge scope for optimisation (since it currently relies heavily on (ab)using dynamic), but should already be fast enough for many uses.

Type system

Nope ends up with a few different kinds of type in its type system. It would be nice to be able to combine some of these, but for the time being I've preferred to avoid introducing extra complexity into existing types. At the moment, Nope supports:

  • Ordinary classes, made by using class in the usual way. Since inheritance is not yet supported, a type T is a subclass of itself and no other ordinary classes.
  • Function types, such as:
    • int, str -> str (Requires two positional arguments of type int and str respectively, and returns a str.)
    • int, x: int, ?y: str -> int (Has two required arguments, the second of which can be passed by the name x. Has an optional third argument called y.)
  • Structural types. For instance, we might define a structural type HasLength with the attribute __len__ of type -> int (takes no arguments, returns an integer). Any type with an appropriately typed attribute __len__ would be a subtype of HasLength. Currently not definable by users, but should be.
  • Generic types. Nope currently supports generic functions, generic classes and generic structural types. For instance, the generic structural type Iterable has a single formal type parameter, T, and an attribute __iter__ of type -> Iterator[T], where Iterator is also a generic structural type.
  • Type unions. For instance, a variable of type int | str could hold an int or a str at runtime.

What about IronPython or Jython?

Both IronPython and Jython aim to be Python implementations, rather than implementing a restricted subset. This allows them to run plenty of Python code with little to no modification.

Nope differs in that it aims to allow the generation of code without any additional runtime dependencies. Since the code is statically typed, it should also allow better integration with the platform, such as auto-complete or Intellisense in IDEs such as Visual Studio, Eclipse and IntelliJ.

What about mypy?

mypy is an excellent project that can type check Python code. If you want a project that you can practically use, then mypy is by far more appropriate. There are other type checkers that also use Python annotations to specify types, such as obiwan.

At this point, Nope differs in two main regards. Firstly, Nope's scope is slightly different in that I'm aiming to compile to other languages.

The second main difference is that Nope aims to have zero runtime dependencies. Since mypy uses Python annotations to add type information, programs written using mypy have mypy as a runtime dependency. Using comments for types also allows some meta-programming with somewhat consistent type annotations, as shown in the collections.namedtuple example from earlier:

import collections

User = collections.namedtuple("User", [
    #:: str
    "username",
    #:: str
    "password",
])

What next?

I'm only working on Nope in my spare time, so progress is a little slow. The aim is to get a C# version of Mammoth working by using Nope in the next few months. Although there might not be feature parity to begin with, I'd be delighted if I got far enough to get the core program working.

If, in a surprising turn of events, this turns out to work, I'd be interested to add effects to the type system. Koka is the most notable example I know of such a system, although I'll happily admit to being somewhat unknowledgeable in this area. Thoughts welcome!

Topics: Software design, Language design

Code smears: code smells that spread across your codebase

Monday 5 January 2015 21:34

The term "code smells" gives us a handy shorthand for signs that our code might not be especially clean. In my experience, the worst sort of code smell are "code smears": those smells that tend to spread their badness across your codebase, which also makes them more difficult to fix as time goes on. For instance, suppose a function has too many arguments, making it difficult to see the purpose of each argument. That code smell isn't confined to the function definition: it's going to appear every time that the function is called.

Code smells tell us that there's probably room for improvement. Ideally, we'd avoid all code smells, but there often comes a point where our time is better spent elsewhere. Yet not all code smells are created equal. Some code smells can be fixed later with little or no impact on the rest of the codebase: for instance, if I take an extremely long function and break it up into smaller functions that compose together nicely, nobody who calls the function needs to be aware that anything has changed.

On the other hand, some code smells tend to make their presence felt across a codebase: for instance, if a function takes four boolean arguments, this leads to somewhat mysterious function calls such as update(true, false, true, true). Fixing this requires changing everywhere that calls that function. This task can range from tedious, if the function is called in many places, to impossible, if third parties are calling into our code. As time goes on, it's likely that there will be more callers of our function, and so more work to clean it up. I'd suggest that this second group of code smells is more severe than the first, and therefore deserves the more severe name of code smears.

As a further incentive to avoid code smears more so than code smells, I've found that code smears are often quicker and easier to fix so long as you do so as soon as they appear. Splitting up a long method can be time-consuming and difficult. Giving a function a better name or grouping together related arguments into a single object can often be accomplished in a couple of minutes, provided that the function is only used in one or two places under your control.
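
For instance, a hypothetical sketch in Python of grouping boolean arguments into a single object while the function still has few callers:

from dataclasses import dataclass

# Before: every call site spreads the smear.
#     update(True, False, True, True)

@dataclass(frozen=True)
class UpdateOptions:
    force: bool = False
    dry_run: bool = False
    recursive: bool = False
    verbose: bool = False

def update(options):
    ...

# After: the call site explains itself.
update(UpdateOptions(force=True, recursive=True, verbose=True))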

Sometimes, it's a good trade-off not to fix a code smell. It's rarely a good trade-off to leave behind a code smear.

Topics: Software design

The Dangerous Allure of Code Reuse

Tuesday 20 May 2014 20:42

Back when I was learning to code, one of the ideas that kept cropping up was the idea of code reuse. If one of the main benefits of functions is the ability to reuse a bit of code, doesn't it follow that we should have future code reuse in mind when writing functions? My experience says: nope. It's this thinking that leads me to create functions with six Boolean arguments, or to create functions with complex, deeply nested switching logic. By following principles such as "Don't repeat yourself" (DRY) and "You ain't gonna need it" (YAGNI), and creating short functions that operate at a single level of abstraction, I find that useful functions pop out without consciously thinking about code reuse.

If I try to write a function with code reuse in mind, it becomes tempting to write features that might be useful. In other words, it encourages the opposite of YAGNI (you ain't gonna need it). Far better to have a tightly focused function that does exactly what I need, no more nor less. Sometimes that function will already exist exactly as we need it. Great! Use it.

But what if there's an existing function that's similar, but would require some changes for our required functionality? Should I create a new function or should I modify the existing function? It's tempting to tweak the existing function in the name of code reuse, but I'd suggest that we should err on the side of creating new functions.

One of my guiding principles is also one of the main benefits of functions: abstraction. Imagine that there's already a function that does exactly what you need. What would it be called? What concept does it represent? Now take a look at the existing function. Are they the same concept? If the concepts are related but distinct, that suggests they belong in different functions. If it happens that they have similar implementations, then you'll find the same consideration happening recursively: is this sub-concept in my new function actually the same sub-concept in the existing function? And so on. Although you might have created a new function, it might not actually lead to much new code if it shares many of the same underlying concepts.

I try to write short functions that operate at a single level of abstraction. The result is that I write plenty of short functions that only get used in one place, at least initially. Such functions might end up getting used in other places, but they weren't really designed with reuse in mind: instead, the foremost consideration was whether that function represented a coherent abstraction of a concept.

This isn't to say that I completely ignore code reuse when writing functions. Sometimes, when weighing up different design choices that are similarly elegant for my current use-case, I'll pick a design based on which one is likely to be best suited to usages that I anticipate in the future. However, I won't add anything specifically for code reuse: I've still chosen a design which does only what I need, but is also amenable to anticipated future change. If that future change never happens, then the function still makes sense and doesn't have extra unused functionality.

To summarise: I try to avoid crowbarring in functionality to existing functions just because it's similar. Separate concepts should go in separate functions, even if they share underlying concepts. Better to have two coherent, small, simple concepts than a larger but ill-defined concept. Code reuse is a good thing, but it's not something that I actively design for.

Footnote: for brevity, I've only talked about functions in this post, but it applies equally to classes and other such constructs for the same reasons.

Topics: Software design

Beware Common Sense

Monday 19 May 2014 13:42

My hodge-podge page on test-driven development originally had a part that read:

Even when following techniques such as TDD, common sense must prevail!

This was in reference to making new test cases pass by just hard-coding new values, something that I refuted by claiming that the implementation was "obviously" not the one we wanted.

After Graham Helliwell reminded me of Uncle Bob's Transformation Priority Premise, I changed the page to something a little more helpful.

In retrospect, appealing to common sense was a cop-out. A few years ago, I decided to stop using phrases such as "obviously" and "of course" in my writing. The reason? If it is actually obvious, then it doesn't need stating. If it does need stating, then it's not as obvious as I think, and I run the risk of making the reader feel stupid for no good reason.

An appeal to common sense is much the same. When I say common sense, what I really mean is: from my own experience and knowledge, the "right thing" to do is second nature, and I don't have to think too hard about it. This is the "curse of knowledge": it might be common sense to me, but one of the points of writing is to explain ideas and concepts to others. An appeal to common sense gives up on that point while running the risk of insulting the reader. Being forced to consciously go through a process is a useful exercise in and of itself, making explanations of common sense useful to both listener and explainer.

Topics:

Converting docx to clean HTML: handling the XML structure mismatch

Tuesday 17 December 2013 08:11

One of my recent side projects is Mammoth, which converts docx files produced by Microsoft Word into HTML. It aims to produce clean HTML by using semantic information in the original document, such as the styles applied to each paragraph, rather than trying to exactly copy the font, size, colour, and so on. I wrote Mammoth so that editors wouldn't have to spend hours manually converting Word documents into HTML. Although we're converting XML to XML, there's quite a mismatch in structure. This blog post describes how Mammoth handles the mismatch. If you're interested in trying it out, you can find a Python version (including a CLI) and a JavaScript version.

The docx format stores each paragraph as a distinct w:p element. Each paragraph optionally has a style. For instance, the following docx XML represents a heading followed by an ordinary paragraph [1].

<w:p style="Heading1>A Study in Scarlet</w:p>
<w:p>In the year 1878 I took my degree</w:p>

We'd like to convert this to an h1 element and a p element:

<h1>A Study in Scarlet</h1>
<p>In the year 1878 I took my degree</p>

This seems fairly straightforward: we take each paragraph from the docx XML, and convert it to an HTML element depending on the style. We can use a small DSL to let the user control how to map docx styles to HTML elements without having to write any code. In this case, we might write:

p.Heading1 => h1:fresh
p => p:fresh

To the left of the arrow, we have a paragraph matcher. p.Heading1 from the first rule matches any paragraph with the style Heading1, while p from the second rule matches any paragraph. To the right of the arrow, we have an HTML path. To process a docx paragraph:

  • Find the first rule whose paragraph matcher matches the current docx paragraph
  • Generate HTML to satisfy the HTML path. h1 is satisfied if there's a top-level h1 i.e. an h1 with no parents. h1:fresh means generate a fresh (i.e. newly-opened) top-level h1 element. We'll see a little later why this notion of freshness is useful.

Things become a bit more tricky when we'd expect to generate some nested HTML, such as lists. For instance, consider the following list:

  • Apple
  • Banana

One way of representing this in docx is:

<w:p style="Bullet1">Apple</w:p>
<w:p style="Bullet1">Banana</w:p>

Note that there's no nesting of elements, even though the two docx paragraphs are part of the same structure (in this case, a list). The only way to tell that these bullets are in the same list is by inspecting the style of sibling elements. Compare this to the HTML we expect to generate:

<ul>
  <li>Apple</li>
  <li>Banana</li>
</ul>

To generate this HTML, you can write the following rule:

p.Bullet1 => ul > li:fresh

The HTML path uses > to indicate children. In this case, the HTML path is satisfied when there's a top-level ul with a fresh li as a child. Let's see how this example works by processing each docx paragraph.

The first paragraph matches p.Bullet1, so we require a top-level ul with a fresh li as a child. Since we have no open elements, we open both elements followed by the text of the paragraph:

<ul>
  <li>Apple

The second paragraph also requires a top-level ul with a fresh li as a child. We close and open the li since it needs to be fresh, but leave the ul alone:

<ul>
  <li>Apple</li>
  <li>Banana

Finally, we close all elements at the end of the document:

<ul>
  <li>Apple</li>
  <li>Banana</li>
</ul>

The key is that HTML elements aren't closed after processing a docx paragraph. Instead, HTML elements are kept open in case following docx paragraphs are actually part of the same structure. An element will eventually be closed either by processing a docx paragraph that isn't part of the same structure, or by reaching the end of the document.
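
As a sketch of that idea in Python (not Mammoth's actual implementation): keep a stack of open elements, reuse the prefix of the HTML path that's already open, and always freshly open the final element:

def satisfy(open_elements, path, emit):
    # Keep the longest prefix of the path that is already open; the
    # final element of the path is ":fresh", so it never counts.
    common = 0
    while (common < len(open_elements) and common < len(path) - 1
            and open_elements[common] == path[common]):
        common += 1

    # Close anything that doesn't match, innermost first.
    while len(open_elements) > common:
        emit("</%s>" % open_elements.pop())

    # Open the rest of the path, including a fresh final element.
    for tag in path[common:]:
        emit("<%s>" % tag)
        open_elements.append(tag)

open_elements = []
html = []
for path, text in [(["ul", "li"], "Apple"), (["ul", "li"], "Banana")]:
    satisfy(open_elements, path, html.append)
    html.append(text)
while open_elements:
    html.append("</%s>" % open_elements.pop())
print("".join(html))  # <ul><li>Apple</li><li>Banana</li></ul>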

A more complicated case is that of nested lists. For instance, given the list:

  • Fruit
    • Apple
    • Banana
  • Vegetable
    • Cucumber
    • Lettuce

This would be represented in docx by:

<w:p style="Bullet1">Fruit</w:p>
<w:p style="Bullet2">Apple</w:p>
<w:p style="Bullet2">Banana</w:p>
<w:p style="Bullet1">Vegetable</w:p>
<w:p style="Bullet2">Cucumber</w:p>
<w:p style="Bullet2">Lettuce</w:p>

And we'd like to generate this HTML:

<ul>
  <li>
    Fruit
    <ul>
      <li>Apple</li>
      <li>Banana</li>
    </ul>
  </li>
  <li>
    Vegetable
    <ul>
      <li>Cucumber</li>
      <li>Lettuce</li>
    </ul>
  </li>
</ul>

In this case, we need two rules: one each for Bullet1 and Bullet2:

p.Bullet1 => ul > li:fresh
p.Bullet2 => ul > li > ul > li:fresh

To see how this works, let's follow step by step. We start by processing the first docx paragraph. This has the style Bullet1, which requires a ul and li element to be open. This generates the following HTML:

<ul>
  <li>
    Fruit

The second paragraph has the style Bullet2, which means we need to satisfy the HTML path ul > li > ul > li:fresh. Since the ul and li from processing the first docx paragraph have been left open, we only need to generate the second set of ul and li elements, giving the HTML:

<ul>
  <li>
    Fruit
    <ul>
      <li>Apple

The third paragraph also has the style Bullet2. The first three elements of the style rule (ul > li > ul) are already satisfied, but the final li needs to be fresh. Therefore, we close the currently open li, and then open a new li:

<ul>
  <li>
    Fruit
    <ul>
      <li>Apple</li>
      <li>Banana

The fourth paragraph has the style Bullet1. The first element of the style rule (ul) is satisfied, but the li needs to be fresh. Therefore, we close the outer li, along with its children, before opening a fresh li:

<ul>
  <li>
    Fruit
    <ul>
      <li>Apple</li>
      <li>Banana</li>
    </ul>
  </li>
  <li>
    Vegetable

The processing of the final two paragraphs proceeds in the same way as before, giving us the HTML:

<ul>
  <li>
    Fruit
    <ul>
      <li>Apple</li>
      <li>Banana</li>
    </ul>
  </li>
  <li>
    Vegetable
    <ul>
      <li>Cucumber</li>
      <li>Lettuce

Since we've reached the end of the document, all that remains is to close all open elements:

<ul>
  <li>
    Fruit
    <ul>
      <li>Apple</li>
      <li>Banana</li>
    </ul>
  </li>
  <li>
    Vegetable
    <ul>
      <li>Cucumber</li>
      <li>Lettuce</li>
    </ul>
  </li>
</ul>

I've left plenty of details out, such as handling of hyperlinks and images, but this gives an overview of how Mammoth deals with the greatest mismatch between the structure of docx XML and HTML.

[1] If you go and look at an actual docx file, you'll discover that the XML is more complicated than what I've presented. I've only included the bits that matter for an overview.

Topics: Algorithms, Programs

Fun with Prolog: write an algorithm, then run it backwards

Sunday 10 November 2013 21:21

Compared to most other languages, Prolog encourages you to write code in a highly declarative style. One of the results is that you can write an algorithm, and then run the same algorithm "backwards" without any additional code.

For instance, suppose you want to find out whether a list is a palindrome or not. We write a predicate like so:

palindrome(L) :- reverse(L, L).

We can read this as: palindrome(L) is true if reverse(L, L) is true. In turn, reverse(L1, L2) is true when L1 is the reverse of L2. We try out the palindrome predicate in the interpreter:

?- palindrome([]).
true.

?- palindrome([1]).
true.

?- palindrome([1, 1]).
true.

?- palindrome([1, 2]).
false.

?- palindrome([1, 2, 1]).
true.

So far, not that different from any other programming language. However, if we set some of the elements of the list to be variables, Prolog tries to fill in the blanks -- that is, it tries to find values for those variables so that the predicate is true. For instance:

?- palindrome([1, A]).
A = 1.

In the above, Prolog tells us that the list [1, A] is a palindrome if A has the value 1. We can do something a bit more fancy if we use a variable for the tail of the list, rather than just one element. [1 | A] means a list with 1 as the first element, with any remaining elements represented by A.

?- palindrome([1 | A]).
A = [1]

Prolog tells us that [1 | A] is a palindrome if A has the value [1]. However, if we hit the semicolon in the interpreter, Prolog gives us another value for A that satisfies the predicate.

?- palindrome([1, 2 | A]).
A = [1] ;
A = [2, 1]

Now Prolog is telling us that [2, 1] is another value for A that satisfies the predicate. If we hit semicolon again, we get another result:

?- palindrome([1, 2 | A]).
A = [1] ;
A = [2, 1] ;
A = [_G313, 2, 1]

This time, Prolog says A = [_G313, 2, 1] satisfies the predicate. The value _G313 means that any value would be valid in that position. Another hit of the semicolon, and another possibility:

?- palindrome([1, 2 | A]).
A = [1] ;
A = [2, 1] ;
A = [_G313, 2, 1] ;
A = [_G313, _G313, 2, 1]

We still have _G313, but this time it appears twice. The first and second element of A can be anything so long as they're the same value. We can keep hitting semicolon, and Prolog will keep giving us possibilities:

?- palindrome([1, 2 | A]).
A = [1] ;
A = [2, 1] ;
A = [_G313, 2, 1] ;
A = [_G313, _G313, 2, 1] ;
A = [_G313, _G319, _G313, 2, 1] ;
A = [_G313, _G319, _G319, _G313, 2, 1] ;
A = [_G20, _G26, _G32, _G26, _G20, 2, 1] ;
A = [_G20, _G26, _G32, _G32, _G26, _G20, 2, 1]

In each of these possibilities, Prolog correctly determines which of the elements in the list must be the same. Now for one last example: what if we don't put any constraints on the list?

?- palindrome(A).
A = [] ;
A = [_G295] ;
A = [_G295, _G295] ;
A = [_G295, _G301, _G295] ;
A = [_G295, _G301, _G301, _G295] ;
A = [_G295, _G301, _G307, _G301, _G295] ;
A = [_G295, _G301, _G307, _G307, _G301, _G295] ;
A = [_G295, _G301, _G307, _G313, _G307, _G301, _G295]

Once again, Prolog generates possibilities for palindromes, telling us which elements need to be the same, but otherwise not putting any restrictions on values.

In summary, we wrote some code to tell us whether lists were palindromes or not, but that same code can be used to generate palindromes. As another example, we might want to implement run-length encoding of lists:

?- encode([a, a, a, b, b, a, c, c], X).
X = [[3, a], [2, b], [1, a], [2, c]] .

Once we've written encode to work in one direction (turning ordinary lists into run-length-encoded lists), we can use the same predicate to work in the other direction (turning run-length-encoded lists into ordinary lists):

?- encode(X, [[3, a], [2, b], [1, a], [2, c]]).
X = [a, a, a, b, b, a, c, c] .

For the interested, the implementation can be found as a GitHub gist. One caveat is that the implementation of encode has to be written carefully so that it works in both directions. Although this might be harder (and much less efficient) than writing two separate predicates, one for encoding and one for decoding, using a single predicate gives a high degree of confidence that the decode operation is correctly implemented as the inverse of the encode operation. Writing a version of encode that actually works in both directions is an interesting challenge, and also the topic of another blog post.

(Thanks to Ninety-Nine Prolog Problems for inspiration for examples.)

Topics: Prolog, Language design