Mike's corner of the web.

The Cthulhu Effect, or what happens to old programmers

Sunday 28 July 2013 13:27

There's a perception in at least some parts of the world of software development that coding is a game for the young. I'm not sure whether it's actually true or not, but after a conversation with a friend, I've decided to chalk up the phenomenon to "the Cthulhu Effect". Apparently, the more you stare at Cthulhu, the more insane you go. Similarly, I expect a decade or two of staring at awful code is enough to drive any programmer to madness, leaving you with the choice of either running from the code or embracing madness. (That also goes a long way towards explaining the demeanor of many experienced programmers.)

Topics: Nonsense

Adding git (or hg, or svn) dependencies in setup.py (Python)

Wednesday 29 May 2013 21:02

Update: the behaviour of pip has changed, meaning that the option --process-dependency-links is required when running pip install.
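
For instance, assuming the package being installed is in the current directory:

pip install --process-dependency-links .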

You can specify dependencies for your Python project in setup.py by referencing packages on the Python Package Index (PyPI). But what if you want to depend on your own package that you don't want to make public? Using dependency_links, you can reference a package's source repository directly.

For instance, mayo is a public package on PyPI, so I can reference it directly:

setup(
    install_requires=[
        "mayo>=0.2.1,<0.3"
    ],
    # Skipping other arguments to setup for brevity
)

But suppose that mayo is a private package that I don't want to share. Using the dependency_links argument, I can reference the package by its source repository. The only way I could get this working with git was to use an explicit SSH git URL, which requires a small transformation from the SSH URLs that GitHub or BitBucket provide. For instance, if GitHub lists the SSH URL as:

git@github.com:mwilliamson/mayo.git

then we need to explicitly set the URL as being SSH, which means adding ssh:// at the front, and replacing the colon after github.com with a forward slash. Finally, we need to indicate the URL is a git URL by adding git+ to the front. This gives a URL like:

git+ssh://git@github.com/mwilliamson/mayo.git
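
Since the transformation is mechanical, a small helper can do it for us (a sketch, not part of any library):

def github_ssh_to_pip_url(ssh_url):
    # git@github.com:user/repo.git -> git+ssh://git@github.com/user/repo.git
    return "git+ssh://" + ssh_url.replace(":", "/", 1)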

To use a specific commit, add an at symbol followed by a commit identifier. For instance, if we wanted to use version 0.2.1, which has the tag 0.2.1 in git:

git+ssh://git@github.com/mwilliamson/mayo.git@0.2.1

Then, we can use the URL in setup.py like so:

setup(
    install_requires=[
        "mayo==0.2.1"
    ],
    dependency_links=[
        "git+ssh://git@github.com/mwilliamson/mayo.git@0.2.1#egg=mayo-0.2.1"
    ],
    # Skipping other arguments to setup for brevity
)

Note that we depend on a specific version of the package, and that we use the URL fragment (the bit after #) to indicate both the package name and version.

Topics: Python

External code quality and libification

Tuesday 26 February 2013 20:11

If you ask a programmer to list symptoms of low code quality, they could probably produce a long list: deeply nested conditionals and loops, long methods, overly terse variable names. Most of these code smells tend to focus on the implementation of the code. They're about internal code quality.

External code quality instead asks you to consider the programmer who has to call your code. When trying to judge how easily somebody else can use your code, you might ask yourself:

  • Do the class and method names describe what the caller wants to accomplish?
  • How many times must we call into your code to complete a single, discrete task?
  • Does your code have minimal dependencies on other parts of your codebase and external libraries?

As an example, consider this snippet of Java to write an XML document to an OutputStream:

import org.w3c.dom.*;
import java.io.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.*;
import javax.xml.transform.stream.*;

private static final void writeDoc(Document document, OutputStream output)
        throws IOException {
    try {
        Transformer transformer =
            TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(
            OutputKeys.DOCTYPE_SYSTEM,
            document.getDoctype().getSystemId()
        );
        transformer.transform(new DOMSource(document), new StreamResult(output));
    } catch (TransformerException e) {
        throw new AssertionError(e); // Can't happen!
    }
}

While there are probably good reasons for all of those methods, and there are cases where having a high level of control is valuable, this isn't a good API for a user who just wants to write out their XML document to an output stream.

  • Do the class and method names describe what they want to accomplish? We want to write out our XML document, and instead we're talking about TransformerFactory and OutputKeys.DOCTYPE_SYSTEM.
  • How many times must we call into your code to complete a single, discrete task? Writing out an XML document seems simple, but we have to create an instance of a transformer factory, ask it for a transformer, set the output property (whatever that is), and wrap up our document and output stream, before we can finally use the transformer to write out our document.
  • Does your code have minimal dependencies on other parts of your codebase and external libraries? The code above actually does quite well here, since that snippet should work on a normal installation of Java.
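
For contrast, the API our user is hoping for is closer to a single call. As a sketch in Python-ish pseudocode, with an invented function name rather than a real library:

# What our user actually wants to write (invented name, not a real library):
write_xml_document(document, output)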

So, why is it valuable to distinguish between internal and external code quality? The effects of low internal code quality are contained within a small scope (by definition!). I'm certainly not advocating one-letter names for all local variables, but cleaning up that code is straightforward compared to improving an API. The effects of low external code quality tend to pervade your entire system: if you change the signature of a method, you now have to change every use of that method.

When writing code, we often trade off code quality against speed of delivery. Even when writing good quality code, we're not going to spend weeks refactoring to make it perfect. I'm suggesting that we should be spending more time worrying about the external quality of our code. Internal quality is important, but it's not as important.

A good measure of whether a piece of your code has minimal dependencies is to try "libifying" it: turn it into an independent library. If the code you write frequently depends on large parts of the entire system, then it probably depends on too much. Once you've split out your code into a separate library, there's a good chance that external code quality will improve. For starters, once you've pulled out that code, you're unlikely to accidentally introduce new dependencies that aren't really required. Beyond that: when you've written a bad API deep within the internals of your large system, it's easy to ignore. If you've split it out into a library, it's much harder to ignore whether your library makes it hard or easy to do what it says on the tin.

Decomposing your code into libraries has plenty of advantages, such as code reuse and being able to test components independently. But I have a hypothesis that aggressively libifying your code will leave you with a much higher quality of code in the long run.

Topics: Software development, Software design

The best retrospectives are in the middle of a project, not the end

Sunday 24 February 2013 22:52

Retrospectives are unfortunately named. The name (correctly) suggests looking back over what has gone before, but I've noticed this leads many people to run retrospectives after a project has finished. The other part of a retrospective is looking forward: how can we improve in the future? What can we do differently? What can we try?

Retrospectives after a completed project can certainly be educational, but the lessons learnt and things to do in the future tend to be somewhat abstract and vague. Since the project is over, you can't make immediate changes over the next couple of weeks, so there's little motivation to come up with concrete actions. Retrospectives are about improvement, but in this case you're often improving the vague notion of a similar project in the future.

On the other hand, if you run a retrospective in the middle of a project, you can try out new ideas quickly, perhaps as soon as you leave the retrospective. These ideas will hopefully improve your working life within the next couple of weeks, rather than affecting some vague future project. This gives a strong incentive to come up with useful, concrete actions. If you're running regular retrospectives, you also have the opportunity to experiment and iterate on ideas.

Retrospectives shouldn't be held at the end of a project out of a sense of obligation, or the need to learn something from a failed project. Regular retrospectives in the middle of a project give the best chance for real improvement.

Topics: Software development

Test reuse

Monday 18 February 2013 10:38

Code reuse is often discussed, but what about test reuse? I don't just mean reusing common code between tests -- I mean running exactly the same tests against different code. Imagine you're writing a number of different implementations of the same interface. If you write a suite of tests against the interface, any one of your implementations should be able to make the tests pass. Taking the idea even further, I've found that you can reuse the same tests whenever you're exposing the same functionality through different methods, whether as a library, an HTTP API, or a command line interface.

As an example, suppose you want to start up a virtual machine from some Python code. We could use QEMU, a command line application on Linux that lets you start up virtual machines. Invoking QEMU directly is a bit ugly, so we wrap it up in a class. As an example of usage, here's what a single test case might look like:

from nose.tools import assert_equal  # or any equivalent assertion helper

def can_run_commands_on_machine():
    provider = QemuProvider()
    with provider.start("ubuntu-precise-amd64") as machine:
        shell = machine.shell()
        result = shell.run(["echo", "Hello there"])
        assert_equal("Hello there\n", result.output)

We create an instance of QemuProvider, use the start method to start a virtual machine, then run a command on the virtual machine and check its output. However, other than the original construction of the virtual machine provider, there's nothing in the test that relies on QEMU specifically. So, we can rewrite the test to accept the provider as an argument, making it implementation-agnostic:

def can_run_commands_on_machine(provider):
    with provider.start("ubuntu-precise-amd64") as machine:
        shell = machine.shell()
        result = shell.run(["echo", "Hello there"])
        assert_equal("Hello there\n", result.output)

If we decide to implement a virtual machine provider using a different technology, for instance by writing a VirtualBoxProvider class, then we can reuse exactly the same test case. Not only does this save us from duplicating the test code, it gives us a degree of confidence that each implementation can be used in the same way.
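
For instance, each implementation gets a thin wrapper that passes its own provider into the shared test (a sketch; the wrapper names are invented, and VirtualBoxProvider is the hypothetical second implementation):

def test_qemu_can_run_commands_on_machine():
    can_run_commands_on_machine(QemuProvider())

def test_virtual_box_can_run_commands_on_machine():
    can_run_commands_on_machine(VirtualBoxProvider())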

If other people are implementing your interface, you could provide the same suite of tests so they can run it against their own implementation. This can give them some confidence that they've implemented your interface correctly.

What about when you're implementing somebody else's interface? Writing your own set of implementation-agnostic tests and running it against existing implementations is a great way to check that you've understood the interface. You can then run the same tests against your code to make sure your own implementation is correct.

We can take the idea of test reuse a step further by testing user interfaces with the same suites of tests that we use for the underlying library. Using our virtual machine example, suppose we write a command line interface (CLI) to let people start virtual machines manually. We could test the CLI by writing a separate suite of tests. Alternatively, we could write an adaptor that invokes our own application to implement the provider interface:

import subprocess

class CliProvider(object):
    def start(self, image_name):
        # Shell out to our own command-line application, then wrap the
        # result in the same machine interface the other providers use.
        output = subprocess.check_output([
            _APPLICATION_NAME, "start", image_name
        ])

        return CliMachine(_parse_output(output))

Now, we can make sure that our command-line interface behaves correctly using the same suite of tests that we used to test the underlying code. If our interface is just a thin layer on top of the underlying code, then writing such an adaptor is often reasonably straightforward.

I often find writing clear and clean UI tests is hard. Keeping a clean separation between the intent of the test and the implementation is often tricky, and it takes discipline to stop the implementation details from leaking out. Reusing tests in this way forces you to hide those details behind the common interface.

If you're using nose in Python to write your tests, then I've put the code I've been using to do this in a separate library called nose-set-tests.

Topics: Testing, Software design, Python

spur.py: A simplified interface for SSH and subprocess in Python

Sunday 10 February 2013 14:45

Over the last few months, I've frequently needed to use SSH from Python, but didn't find any of the existing solutions to be well-suited for what I needed (see below for discussion of other solutions). So, I've created spur.py to make using SSH from Python easy. For instance, to run echo over SSH:

import spur

shell = spur.SshShell(hostname="localhost", username="bob", password="password1")
result = shell.run(["echo", "-n", "hello"])
print result.output # prints hello

shell.run() executes a command, and returns the result once it's finished executing. If you don't want to wait until the command has finished, you can call shell.spawn() instead, which returns a process object:

process = shell.spawn(["sh", "-c", "read value; echo $value"])
process.stdin_write("hello\n")
result = process.wait_for_result()
print result.output # prints hello

spur.py also allows commands to be run locally using the same interface:

import spur

shell = spur.LocalShell()
result = shell.run(["echo", "-n", "hello"])
print result.output # prints hello

For a complete list of supported operations, take a look at the project on GitHub.

spur.py is certainly not the only way to use SSH from Python, and it's possible that one of the other solutions might be better suited for what you need. I've come across three other main alternatives.

The first is to shell out to ssh. It works, but it's ugly.

The second is to use Fabric. Unfortunately, I found Fabric to be a bit too high-level. It's useful for implementing deployment scripts using SSH, but I found it awkward to use as a general-purpose library for SSH.

Finally, there's paramiko. I found paramiko to be a bit too low-level, but both Fabric and spur.py are built on top of paramiko.

Topics: Python

The importance of extremes

Tuesday 18 December 2012 21:01

When exploring unfamiliar ideas, the best approach is often to take them to the extreme. For instance, suppose you're trying to follow the principle "tell, don't ask". I've often found it tricky to know where to draw the line, but as an exercise, try writing your code without a single getter or setter. This may seem ludicrous, but by throwing pragmatism completely out the window, you're forced to move outside your comfort zone. While some of the code might be awful, some of it might present ideas in a new way.

As an example, suppose I have two coordinates which represent the top-left and bottom-right corners of a rectangle, and I want to iterate through every integer coordinate in that rectangle. My first thought might be:

def find_coordinates_in_rectangle(top_left, bottom_right):
    for x in range(top_left.x, bottom_right.x + 1):
        for y in range(top_left.y, bottom_right.y + 1):
            yield Coordinate(x, y)

Normally, I might be perfectly happy with this code (although there is a bit of duplication!). But if we've forbidden getters and setters, then we can't retrieve the x and y values from each coordinate. Instead, we can write something like:

def find_coordinates_in_rectangle(top_left, bottom_right):
    return top_left.all_coordinates_in_rectangle_to(bottom_right)

The method name needs a bit more thought, but the important difference is that we've moved some of the knowledge of our coordinate system into the actual coordinate class. Whether or not this turns out to be a good idea, it's food for thought that we might not have come across without such a severe constraint as "no getters or setters".
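
As a sketch of where that could lead, assuming a Coordinate class that keeps its position in private fields:

class Coordinate(object):
    def __init__(self, x, y):
        self._x = x
        self._y = y

    def all_coordinates_in_rectangle_to(self, bottom_right):
        # The knowledge of how to walk the grid lives with the coordinates
        # themselves, so callers never read x or y directly.
        for x in range(self._x, bottom_right._x + 1):
            for y in range(self._y, bottom_right._y + 1):
                yield Coordinate(x, y)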

Topics: Software design

Polymorphism and reimplementing integers

Tuesday 18 December 2012 20:39

While taking part in the Global Day of Coderetreat, one of the sessions had us implementing Conway's Game of Life without using any "if" statements to force us to use polymorphism instead. Anything that was an "if" in spirit, such as a switch statement or storing functions in a dictionary, was also forbidden. For most of the code, this was fairly straightforward, but the interesting problem was the code that decided whether a cell lived or died based on how many neighbours it had:

if numberOfNeighbours in [2, 3]:
    return CellState.LIVE
else:
    return CellState.DEAD

Polymorphism allows different code to be executed depending on the type of a value. In this particular case, we need to execute different code depending on which value we have. It follows that each number has to have a different type so we can give it different behaviour:

class Zero(object):
    def increment(self):
        return One()

    def live_cell_next_generation(self):
        return CellState.DEAD

class One(object):
    def increment(self):
        return Two()

    def live_cell_next_generation(self):
        return CellState.DEAD

class Two(object):
    def increment(self):
        return Three()

    def live_cell_next_generation(self):
        return CellState.LIVE

class Three(object):
    def increment(self):
        return FourOrMore()

    def live_cell_next_generation(self):
        return CellState.LIVE

class FourOrMore(object):
    def increment(self):
        # Four or more is always overcrowding, so incrementing further
        # changes nothing.
        return FourOrMore()

    def live_cell_next_generation(self):
        return CellState.DEAD

In the code that counts the number of neighbours, we use our new number system by starting with Zero and incrementing when we find a neighbour. To choose the next state of the cell, rather than inspecting the number of neighbours, we ask the number of neighbours for the next state directly:

numberOfNeighbours.live_cell_next_generation()
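
Putting it together, the neighbour counting might look something like this sketch, where live_neighbours stands in for however we find the living neighbours:

numberOfNeighbours = Zero()
for neighbour in live_neighbours:
    numberOfNeighbours = numberOfNeighbours.increment()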

And now we have no "if"s! It's possible to move the logic for choosing the next cell state out of the number classes, for instance by using the visitor pattern, which might feel a bit more natural. I suspect that reimplementing the natural numbers is still going to feel about the same amount of crazy though.

Topics: Software design

Don't make big decisions, make big decisions irrelevant

Monday 10 December 2012 11:42

We're often faced with decisions that we'll have to live with for a long time. What language should we write our application in? What framework should we use? What will our architecture look like? We spend lots of time and effort in trying to find the right answer, but we often forget the alternative: instead of making this big decision, could we make the decision irrelevant?

Suppose you need to pick a language to build your system in. This is tricky since it often takes months or even years to discover all the annoyances and issues of a language, by which point rewriting the entire system in another language is impractical. An alternative is to split your system up into components, and make communication between components language-agnostic, for instance by only allowing communication over HTTP. Then, the choice of language affects only a single component, rather than the entire system. You could change the language each component is written in one-by-one, or leave older components that don't need much development in their original language. Regardless, picking the “wrong” language no longer has such long-lasting effects.

This flexibility in language isn't without cost though – now you potentially have to know multiple languages to work on a system, rather than just one. What if there's a component written in a language that nobody on the team understands anymore? There's also the overhead of using HTTP. Not only is an HTTP request slower than an ordinary function call, it makes the call-site more complicated.

Making any big decision irrelevant has a cost associated with it, but confidently making the “right” decision upfront is often impossible. For any big decision, it's worth considering: what's the cost of making the wrong decision versus the cost of making the decision irrelevant?

Topics: Software development

Modularity through HTTP

Monday 10 December 2012 10:41

As programmers, we spend quite a lot of effort in pursuit of some notion of modularity. We hope that this allows us to solve problems more easily by splitting them up, as well as then letting us reuse parts of the code in other applications. Plenty of attempts have been made to get closer to this ideal, object-orientation perhaps being the most obvious example, yet one of the most successful approaches to modularity is almost accidental: the web.

Modularity makes our code easier to reason about by allowing us to take our large problem, split it into small parts, and solve those small parts without having to worry about the whole. Programming languages give us plenty of ways to do this, functions and classes among them. So far, so good. But modularity has some other benefits that we’d like to be able to take advantage of. If I’ve written an independent module, say to send out e-mails to my customers, I’d like to be able to reuse that module in another application. And by creating DLLs or JARs or your platform’s package container of choice, you can do just that – provided your new application is on the same platform. Want to use a Java library from C#? Well, good luck – it might be possible, but it’s not going to be smooth sailing.

What’s more, just because the library exists, it doesn’t mean that using it is going to be a pleasant experience. If nobody can understand the interface to your code, nobody’s going to use it. Let’s say we want to write out an XML document to an output stream in Java. You’d imagine this would be a simple one-liner. You’d be wrong:

import org.w3c.dom.*;
import java.io.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.*;
import javax.xml.transform.stream.*;

private static final void writeDoc(Document doc, OutputStream out) 
        throws IOException {
    try {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(
            OutputKeys.DOCTYPE_SYSTEM, doc.getDoctype().getSystemId());
        t.transform(new DOMSource(doc), new StreamResult(out));
    } catch (TransformerException e) {
        throw new AssertionError(e); // Can't happen!
    }
}

The result is that most of the code we write is just a variation on a theme. Odds are, somebody else has written the code before. Despite our best efforts, we’ve fallen a little short.

However, the web brings us a little closer to the ideal. If I want to send e-mails to my customers, I could write my own e-mail sending library. More likely, I’d use an existing one for my language. But even then, I probably wouldn’t have some niceties like A/B testing or DKIM signing. Instead, I could just fire some HTTP at MailChimp, and get a whole slew of features without getting anywhere near the code that implements them.
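
As a sketch of how little code that takes, here’s what firing that HTTP at a mail service might look like in Python with the requests library (the URL and payload are invented for illustration, not MailChimp’s real API):

import requests

# A hypothetical mail-sending service endpoint.
response = requests.post(
    "https://mail.example.com/api/v1/messages",
    json={
        "to": "customer@example.com",
        "subject": "Hello",
        "body": "Thanks for signing up!",
    },
)
response.raise_for_status()  # raise if the service reported an error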

The web is inherently language agnostic. So long as your language can send and receive text over HTTP, and probably parse some JSON, you’re about as well-equipped as everybody else. Instead of building libraries for a specific language, you can build a service that can be used from virtually every language.

The text-based nature of HTTP also helps to limit the complexity of the API. As SOAP will attest, you can still make a horrible mess using HTTP, but that horrible mess is plain to see. Complex data structures are tedious to marshal to and from text, providing a strong incentive to keep things simple. Spotting the complexities in a class hierarchy is often not as easy.

HTTP doesn’t solve every problem – using it inside an inner loop that’s executed thousands of times per second probably isn’t such a good idea. What’s more, this approach might introduce some new problems. For instance, if we’re combining existing applications using HTTP for communication, we often need to add a thin shim to each application: to integrate WordPress into your system, say, you might need to write a small plugin in PHP. Now, instead of a system written in one language, you’ve got to maintain a system with several distinct languages and platforms.

Even then, we should strive to avoid reimplementing the same old thing. As programmers, we consistently underestimate the cost of building a system, not to mention its ongoing maintenance. By integrating existing applications, even if they’re written in unfamiliar languages, we save ourselves those development and maintenance costs, as well as being able to pick the best solution for our problem. Thanks to the web, HTTP is often the easiest way to get there.

In case you recognised the topic, an edited version of this post was used as the Simple-Talk editorial a few months ago.

Topics: Software development, Software design