Mike's corner of the web.

Archive: Software development

My tiny side project has had more impact than my decade in the software industry

Sunday 1 August 2021 12:55

Way back in 2013, I started mammoth.js, a library that converts Word documents into HTML. It's not a large project - roughly 3,000 lines - nor is it particularly exciting.

I have a strong suspicion, though, that that tiny side project has had more of a positive impact than the entirety of the decade I've spent working as a software developer.

I wrote the original version on a Friday afternoon at work, when I realised some of my colleagues were spending hours and hours each week painstakingly copying text from a Word document into our CMS and formatting it. I wrote a tool to automate the process, taking advantage of our consistent in-house styling to map Word styles to the right CSS classes rather than producing the soup of HTML that Word would produce. It wasn't perfect - my colleagues would normally still have to tweak a few things - but I'd guess it saved them over 90% of the time they were spending before on a menial and repetitive task.
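A toy sketch of the core idea described above - this is not mammoth's actual API, and the style and class names are made up for illustration - is a lookup from in-house Word style names to the CSS classes the CMS expects, with unknown styles falling back to plain paragraphs:

```java
import java.util.Map;

// Toy illustration (not mammoth's real API): map each in-house Word
// style name to the CSS class the CMS expects, rather than reproducing
// Word's formatting soup. Style and class names here are hypothetical.
public final class StyleMapper {
    private static final Map<String, String> STYLE_TO_CLASS = Map.of(
            "Heading 1", "article-title",
            "Intro Paragraph", "standfirst",
            "Pull Quote", "pull-quote");

    public static String paragraphToHtml(String wordStyle, String text) {
        String cssClass = STYLE_TO_CLASS.get(wordStyle);
        if (cssClass == null) {
            // Unknown styles fall back to an unstyled paragraph.
            return "<p>" + text + "</p>";
        }
        return "<p class=\"" + cssClass + "\">" + text + "</p>";
    }
}
```

Because the mapping is driven by the document's own named styles, consistent in-house styling is what makes the output clean.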

Since it seemed like this was likely a problem that other people had, I made an open source implementation on my own time, first in JavaScript, later with ports to Python and Java. Since then, I've had messages from people telling me how much time it's saved them: perhaps the most heartwarming being from someone telling me that the hours they saved each week were being spent with their son instead.

I don't know what the total amount of time saved is, but I'm almost certain that it's at least hundreds of times more than the time I've spent working on the tool.

Admittedly, I've not really done all that much development on the project in recent years. The stability of the docx format means that the core functionality continues to work without changes, and most people use the same, small subset of features, so adding support for more cases and more features has rapidly diminishing returns. The nature of the project means that I don't actually need to support all that much of docx: since it tries to preserve semantic information by converting Word styles to CSS classes, rather than producing a high fidelity copy in HTML as Word does, it can happily ignore most of the actual details of Word formatting.

By comparison, having worked as a software developer for over a decade, the impact of the stuff I actually got paid to do seems underwhelming.

I've tried to pick companies working on domains that seem useful: developer productivity, treating diseases, education. While my success in those jobs has been variable - in some cases, I'm proud of what I accomplished, in others I'm pretty sure my net effect was, at best, zero - I'd have a tough time saying that the cumulative impact was greater than my little side project.

Sometimes I wonder whether it'd be possible to earn a living off mammoth. Although there's an option to donate - I currently get a grand total of £1.15 a week from regular donations - it's not something I push very hard. There are some more involved use cases - for instance, support for equations - that I'll probably never be able to support in my spare time, so potentially there's money to be made there.

I'm not sure it would make me any happier though. If I were a solo developer, I'd probably miss working with other people, and I'm not sure I really have the temperament to do the work to get enough sales to live off.

Somehow, though, it feels like a missed opportunity. Working on tools where the benefit is immediately visible is extremely satisfying, and there are probably plenty of domains where software could still help without requiring machine learning or a high-growth startup backed by venture capital. I'm just not sure what the best way is for someone like me to actually do so.

Topics: Software development

External code quality and libification

Tuesday 26 February 2013 20:11

If you ask a programmer to list symptoms of low code quality, they could probably produce a long list: deeply nested conditionals and loops, long methods, overly terse variable names. Most of these code smells tend to focus on the implementation of the code. They're about internal code quality.

External code quality instead asks you to consider the programmer who has to call your code. When trying to judge how easily somebody else can use your code, you might ask yourself:

  • Do the class and method names describe what the caller wants to accomplish?
  • How many times must we call into your code to complete a single, discrete task?
  • Does your code have minimal dependencies on other parts of your codebase and external libraries?

As an example, consider this snippet of Java to write an XML document to an OutputStream:

import org.w3c.dom.*;
import java.io.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.*;
import javax.xml.transform.stream.*;

private static final void writeDoc(Document document, OutputStream output)
        throws IOException {
    try {
        Transformer transformer =
            TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(
            OutputKeys.DOCTYPE_SYSTEM, document.getDoctype().getSystemId());
        transformer.transform(new DOMSource(document), new StreamResult(output));
    } catch (TransformerException e) {
        throw new AssertionError(e); // Can't happen!
    }
}

While there are probably good reasons for all of those methods, and there are cases where having a high level of control is valuable, this isn't a good API for a user who just wants to write out their XML document to an output stream.

  • Do the class and method names describe what they want to accomplish? We want to write out our XML document, and instead we're talking about TransformerFactory and OutputKeys.DOCTYPE_SYSTEM.
  • How many times must we call into your code to complete a single, discrete task? Writing out an XML document seems simple, but we have to create an instance of a transformer factory, then ask it for a transformer, set the output property (whatever that is), wrap up our document and output stream, before we can finally use the transformer to write out our document.
  • Does your code have minimal dependencies on other parts of your codebase and external libraries? The code above actually does quite well here, since that snippet should work on a normal installation of Java.
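To make the contrast concrete, here's a sketch of the sort of wrapper a caller might prefer. The class name XmlWriter and its single write method are hypothetical - they're not part of the JDK - but they answer the three questions above rather better: the name describes the task, and one call completes it.

```java
import org.w3c.dom.Document;
import java.io.IOException;
import java.io.OutputStream;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

// Hypothetical wrapper: one method, named for what the caller wants to do.
// The transformer machinery becomes an implementation detail.
public final class XmlWriter {
    public static void write(Document document, OutputStream output)
            throws IOException {
        try {
            Transformer transformer =
                    TransformerFactory.newInstance().newTransformer();
            transformer.transform(new DOMSource(document), new StreamResult(output));
        } catch (TransformerException e) {
            throw new AssertionError(e); // Can't happen!
        }
    }
}
```

The caller now writes `XmlWriter.write(document, output);` and never hears about factories, transformers, sources or results.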

So, why is it valuable to distinguish between internal and external code quality? The effect of low internal code quality is contained within a small scope (by definition!). I'm certainly not advocating one-letter names for all local variables, but cleaning up that code is straightforward compared to improving an API. The effects of low external code quality tend to pervade your entire system. If you change the signature of a method, you now have to change every use of that method.

When writing code, we often trade off code quality against speed of execution. Even when writing good quality code, we're not going to spend weeks refactoring to make it perfect. I'm suggesting that we should be spending more time worrying about the external quality of our code. Internal quality is important, but it's not as important.

A good measure of whether a piece of your code has minimal dependencies is to try "libifying" it: turn it into an independent library. If the code you write frequently depends on large parts of the entire system, then it probably depends on too much. Once you've split out your code into a separate library, there's a good chance that external code quality will improve. For starters, once you've pulled out that code, you're unlikely to accidentally introduce new dependencies that aren't really required. Beyond that: when you've written a bad API deep within the internals of your large system, it's easy to ignore. If you've split it out into a library, it's much harder to ignore whether your library makes it hard or easy to do what it says on the tin.

Decomposing your code into libraries has plenty of advantages, such as code reuse and being able to test components independently. But I have a hypothesis that aggressively libifying your code will leave you with a much higher quality of code in the long run.

Topics: Software development, Software design

The best retrospectives are in the middle of a project, not the end

Sunday 24 February 2013 22:52

Retrospectives are unfortunately named. The name (correctly) suggests looking back over what has gone before, but I've noticed this leads many people to run retrospectives after a project has finished. The other part of a retrospective is looking forward: how can we improve in the future? What can we do differently? What can we try?

Retrospectives after a completed project can certainly be educational, but the lessons learnt and things to do in the future tend to be somewhat abstract and vague. Since the project is over, you can't make immediate changes over the next couple of weeks, so there's little motivation to come up with concrete actions. Retrospectives are about improvement, but in this case you're often improving the vague notion of a similar project in the future.

On the other hand, if you run a retrospective in the middle of a project, you can try out new ideas quickly, perhaps as soon as you leave the retrospective. These ideas will hopefully improve your working life within the next couple of weeks, rather than affecting some vague future project. This gives a strong incentive to come up with useful, concrete actions. If you're running regular retrospectives, you also have the opportunity to experiment and iterate on ideas.

Retrospectives shouldn't be held at the end of a project out of a sense of obligation, or the need to learn something from a failed project. Regular retrospectives in the middle of a project give the best chance for real improvement.

Topics: Software development

Don't make big decisions, make big decisions irrelevant

Monday 10 December 2012 11:42

We're often faced with decisions that we'll have to live with for a long time. What language should we write our application in? What framework should we use? What will our architecture look like? We spend lots of time and effort in trying to find the right answer, but we often forget the alternative: instead of making this big decision, could we make the decision irrelevant?

Suppose you need to pick a language to build your system in. This is tricky since it often takes months or even years to discover all the annoyances and issues of a language, by which point rewriting the entire system in another language is impractical. An alternative is to split your system up into components, and make communication between components language-agnostic, for instance by only allowing communication over HTTP. Then, the choice of language affects only a single component, rather than the entire system. You could change the language each component is written in one-by-one, or leave older components that don't need much development in their original language. Regardless, picking the “wrong” language no longer has such long-lasting effects.
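As a sketch of what such a boundary can look like, here's a minimal component built on the JDK's built-in com.sun.net.httpserver (the /greet endpoint and its response are made up for illustration). Anything that can speak HTTP - whatever language it happens to be written in - could later replace this component behind the same URL:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// A tiny component that talks plain HTTP. Callers depend only on the URL
// and the shape of the response, not on the language it's written in.
public final class GreetingComponent {
    public static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/greet", exchange -> {
            byte[] bytes = "hello".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, bytes.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(bytes);
            }
        });
        server.start();
        return server;
    }
}
```

Rewriting this component in another language changes nothing for the rest of the system, so long as the same URL returns the same response.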

This flexibility in language isn't without cost though – now you potentially have to know multiple languages to work on a system, rather than just one. What if there's a component written in a language that nobody on the team understands anymore? There's also the overhead of using HTTP. Not only is an HTTP request slower than an ordinary function call, it makes the call-site more complicated.

Making any big decision irrelevant has a cost associated with it, but confidently making the “right” decision upfront is often impossible. For any big decision, it's worth considering: what's the cost of making the wrong decision versus the cost of making the decision irrelevant?

Topics: Software development

Modularity through HTTP

Monday 10 December 2012 10:41

As programmers, we spend quite a lot of effort in pursuit of some notion of modularity. We hope that this allows us to solve problems more easily by splitting them up, as well as then letting us reuse parts of the code in other applications. Plenty of attempts have been made to get closer to this ideal, object-orientation perhaps being the most obvious example, yet one of the most successful approaches to modularity is almost accidental: the web.

Modularity makes our code easier to reason about by allowing us to take our large problem, split it into small parts, and solve those small parts without having to worry about the whole. Programming languages give us plenty of ways to do this, functions and classes among them. So far, so good. But modularity has some other benefits that we’d like to be able to take advantage of. If I’ve written an independent module, say to send out e-mails to my customers, I’d like to be able to reuse that module in another application. And by creating DLLs or JARs or your platform’s package container of choice, you can do just that – provided your new application is on the same platform. Want to use a Java library from C#? Well, good luck – it might be possible, but it’s not going to be smooth sailing.

What’s more, just because the library exists, it doesn’t mean it’s going to be a pleasant experience. If nobody can understand the interface to your code, nobody’s going to use it. Let’s say we want to write out an XML document to an output stream in Java. You’d imagine this would be a simple one-liner. You’d be wrong:

import org.w3c.dom.*;
import java.io.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.*;
import javax.xml.transform.stream.*;

private static final void writeDoc(Document doc, OutputStream out)
        throws IOException {
    try {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(
            OutputKeys.DOCTYPE_SYSTEM, doc.getDoctype().getSystemId());
        t.transform(new DOMSource(doc), new StreamResult(out));
    } catch (TransformerException e) {
        throw new AssertionError(e); // Can't happen!
    }
}

The result is that most of the code we write is just a variation on a theme. Odds are, somebody else has written the code before. Despite our best efforts, we’ve fallen a little short.

However, the web brings us a little closer to the ideal. If I want to send e-mails to my customers, I could write my own e-mail sending library. More likely, I’d use an existing one for my language. But even then, I probably wouldn’t have some niceties like A/B testing or DKIM signing. Instead, I could just fire some HTTP at MailChimp, and get a whole slew of features without getting anywhere near the code that implements them.

The web is inherently language agnostic. So long as your language can send and receive text over HTTP, and probably parse some JSON, you’re about as well-equipped as everybody else. Instead of building libraries for a specific language, you can build a service that can be used from virtually every language.
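For instance, fetching text from a service takes only a few lines in Java using the java.net.http.HttpClient that ships with the JDK (the URL passed in is whatever endpoint the service exposes; nothing here cares what language sits on the other end):

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Calling a service over HTTP: the same few lines regardless of what
// language the service itself is written in.
public final class ServiceClient {
    public static String fetch(String url)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

The equivalent in Python, Ruby or PHP is just as short, which is exactly the point.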

The text-based nature of HTTP also helps to limit the complexity of the API. As SOAP will attest, you can still make a horrible mess using HTTP, but that horrible mess is plain to see. Complex data structures are tedious to marshal to and from text, providing a strong incentive to keep things simple. Spotting the complexities in a class hierarchy is often not as easy.

HTTP doesn’t solve every problem – using it inside an inner loop that’s executed thousands of times per second probably isn’t such a good idea. What’s more, this approach might introduce some new problems. For instance, if we’re combining existing applications using HTTP for communication, we often need to add a thin shim to each application, such as a small plugin in PHP to integrate WordPress into your system. Now, instead of a system written in one language, you’ve got to maintain a system with several distinct languages and platforms.

Even then, we should strive to avoid reimplementing the same old thing. As programmers, we consistently underestimate the cost of building a system, not to mention ongoing maintenance. By integrating existing applications, even if they’re written in unfamiliar languages, we save ourselves those development and maintenance costs, as well as being able to pick the best solution for our problem. Thanks to the web, HTTP is often the easiest way to get there.

In case you recognised the topic, an edited version of this post was used as the Simple-Talk editorial a few months ago.

Topics: Software development, Software design

Peaks and troughs in software development

Monday 20 August 2012 19:47

The problem with a smooth development process is that every day is pretty much the same as the last. You might be writing great code and solving interesting problems with other passionate people, but constantly working on the same thing can begin to feel dull or even frustrating. By having a silky-smooth development process with reliable code and regular releases, you've removed those natural peaks and troughs, like the high of fixing another critical bug in production before you head home and crash. I think it was Steve Freeman who once mentioned that sometimes it's valuable to put some of those peaks and troughs back in, but preferably without putting critical bugs back in.

For instance, I like the idea of spending one day a week working on unprioritised work. It might be that the developers are keen to try out a new rendering architecture that'll halve page load times, or that there's a piece of code that can be turned into a separate library that'll be useful on other projects. Maybe there's a little visual bug that's never going to be deemed important enough to be prioritised, but a developer takes enough pride in their work to spend half an hour fixing it. This feels like a peak to me: there's a lot of value to the product in polishing the user experience, in refactoring the code, and trying out risky ideas, and the developers get to scratch some of their own itches.

However, its regularity can make it feel routine, and you're still working on the same product. As useful as these small, regular peaks and troughs are, I think you also need the occasional Everest. Maybe it's saying “This week, I'm going to try something I've never tried before that's completely unrelated to the project”. Or perhaps you need a Grand Canyon: “Today, we're just going to concentrate on being better programmers by doing a code retreat”. Finding something that works is hard, and you can't even reuse the same idea too much without risking its value as an artificial peak or trough. But I think it's important to keep trying. You don't just want a project and its team to be alive: you need them to be invigorated.

Topics: Software development

Writing maintainable code

Tuesday 27 September 2011 20:14

I heartily endorse this fine article on writing maintainable code. What do you mean I'm biased because I wrote it?

Topics: Software design, Software development, Testing

Orders of Magnitude

Sunday 21 February 2010 20:57

Improving performance is often a desirable goal. Sometimes you'll have a precise number for just how much performance needs to be improved by, particularly in real-time systems. More often, though, the request for improved performance is far more vague. So, what sort of numbers should we aim for when we want things to go faster? This depends on why you want faster performance -- do you just want to save a bit of time, or do you really want things to change?

Take, for instance, the time it takes to run your entire test suite. This can vary wildly, depending on the application, from seconds to days. Let's say we're working on a small project, and we have a test suite that covers the entire application in one minute. This is fast enough that we can run the entire suite every time before we commit, but we won't be running it every time we make a small change. If we made it go, say, twice as fast, we'd definitely save ourselves some time -- thirty seconds for every commit, if we really do run all the tests before every commit. This is still too slow to be running each time we make a small change, but what if we sped up the suite by an order of magnitude instead, so it takes only a few seconds to run? Now, running the entire suite every minute or two is practical, rather than just before every commit.

Sometimes, we really can get performance improvements of an order of magnitude, by improvements in technology or a clever new algorithm. Otherwise, we might still be able to do what all programmers do -- cheat. If we can get most of the benefit in a much shorter time, then this is often good enough. Going back to our test suite, if we can identify some subset of the tests that run in 10% of the time with 90% of the coverage, then most bugs we might introduce are still picked up, while our tool, the test suite, becomes more flexible.

By improving performance not by small amounts, but by orders of magnitude, we can change the way we use and think about our tools. Performance really does matter.

Topics: Software development