Mike's corner of the web.

Converting docx to clean HTML: handling the XML structure mismatch

Tuesday 17 December 2013 08:11

One of my recent side projects is Mammoth, which converts docx files produced by Microsoft Word into HTML. It aims to produce clean HTML by using semantic information in the original document, such as the styles applied to each paragraph, rather than trying to exactly copy the font, size, colour, and so on. I wrote Mammoth so that editors wouldn't have to spend hours manually converting Word documents into HTML. Although we're converting XML to XML, there's quite a mismatch in structure. This blog post describes how Mammoth handles the mismatch. If you're interested in trying it out, you can find a Python version (including a CLI) and a JavaScript version.

The docx format stores each paragraph as a distinct w:p element. Each paragraph optionally has a style. For instance, the following docx XML represents a heading followed by an ordinary paragraph [1].

<w:p style="Heading1">A Study in Scarlet</w:p>
<w:p>In the year 1878 I took my degree</w:p>

We'd like to convert this to an h1 element and a p element:

<h1>A Study in Scarlet</h1>
<p>In the year 1878 I took my degree</p>

This seems fairly straightforward: we take each paragraph from the docx XML, and convert it to an HTML element depending on the style. We can use a small DSL to let the user control how to map docx styles to HTML elements without having to write any code. In this case, we might write:

p.Heading1 => h1:fresh
p => p:fresh

To the left of the arrow, we have a paragraph matcher. p.Heading1 from the first rule matches any paragraph with the style Heading1, while p from the second rule matches any paragraph. To the right of the arrow, we have an HTML path. To process a docx paragraph:

  • Find the first rule whose paragraph matcher matches the current docx paragraph
  • Generate HTML to satisfy that rule's HTML path. h1 is satisfied if there's a top-level h1, i.e. an h1 with no parents. h1:fresh means generate a fresh (i.e. newly-opened) top-level h1 element. We'll see a little later why this notion of freshness is useful.
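
Before walking through more examples, it's worth pinning down what a parsed rule looks like. The sketch below is a simplification of my own rather than Mammoth's actual parser (the real grammar handles more cases), but it shows the idea: each rule pairs a style name with a list of (tag, fresh) elements.

def parse_rule(rule):
    # "p.Heading1 => h1:fresh" becomes ("Heading1", [("h1", True)])
    matcher, html_path = [part.strip() for part in rule.split("=>")]
    # A bare "p" matcher (no style) matches any paragraph
    style = matcher.split(".", 1)[1] if "." in matcher else None
    elements = []
    for element in html_path.split(">"):
        element = element.strip()
        fresh = element.endswith(":fresh")
        tag = element[:-len(":fresh")] if fresh else element
        elements.append((tag, fresh))
    return style, elements

print(parse_rule("p.Heading1 => h1:fresh"))
# ('Heading1', [('h1', True)])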

Things become a bit trickier when we want to generate nested HTML, such as lists. For instance, consider the following list:

  • Apple
  • Banana

One way of representing this in docx is:

<w:p style="Bullet1">Apple</w:p>
<w:p style="Bullet1">Banana</w:p>

Note that there's no nesting of elements, even though the two docx paragraphs are part of the same structure (in this case, a list). The only way to tell that these bullets are in the same list is by inspecting the style of sibling elements. Compare this to the HTML we expect to generate:

<ul>
  <li>Apple</li>
  <li>Banana</li>
</ul>

To generate this HTML, you can write the following rule:

p.Bullet1 => ul > li:fresh

The HTML path uses > to indicate children. In this case, the HTML path is satisfied when there's a top-level ul with a fresh li as a child. Let's see how this example works by processing each docx paragraph.

The first paragraph matches p.Bullet1, so we require a top-level ul with a fresh li as a child. Since we have no open elements, we open both elements followed by the text of the paragraph:

<ul>
  <li>Apple

The second paragraph also requires a top-level ul with a fresh li as a child. We close and reopen the li, since it needs to be fresh, but leave the ul alone:

<ul>
  <li>Apple</li>
  <li>Banana

Finally, we close all elements at the end of the document:

<ul>
  <li>Apple</li>
  <li>Banana</li>
</ul>

The key is that HTML elements aren't closed after processing a docx paragraph. Instead, HTML elements are kept open in case following docx paragraphs are actually part of the same structure. An element will eventually be closed either by processing a docx paragraph that isn't part of the same structure, or by reaching the end of the document.

A more complicated case is that of nested lists. For instance, given the list:

  • Fruit
    • Apple
    • Banana
  • Vegetable
    • Cucumber
    • Lettuce

This would be represented in docx by:

<w:p style="Bullet1">Fruit</w:p>
<w:p style="Bullet2">Apple</w:p>
<w:p style="Bullet2">Banana</w:p>
<w:p style="Bullet1">Vegetable</w:p>
<w:p style="Bullet2">Cucumber</w:p>
<w:p style="Bullet2">Lettuce</w:p>

And we'd like to generate this HTML:

<ul>
  <li>
    Fruit
    <ul>
      <li>Apple</li>
      <li>Banana</li>
    </ul>
  </li>
  <li>
    Vegetable
    <ul>
      <li>Cucumber</li>
      <li>Lettuce</li>
    </ul>
  </li>
</ul>

In this case, we need two rules: one each for Bullet1 and Bullet2:

p.Bullet1 => ul > li:fresh
p.Bullet2 => ul > li > ul > li:fresh

To see how this works, let's walk through the document step by step. We start by processing the first docx paragraph. This has the style Bullet1, which requires a top-level ul with a fresh li as a child. This generates the following HTML:

<ul>
  <li>
    Fruit

The second paragraph has the style Bullet2, which means we need to satisfy the HTML path ul > li > ul > li:fresh. Since the ul and li from processing the first docx paragraph have been left open, we only need to generate the second set of ul and li elements, giving the HTML:

<ul>
  <li>
    Fruit
    <ul>
      <li>Apple

The third paragraph also has the style Bullet2. The first three elements of the style rule (ul > li > ul) are already satisfied, but the final li needs to be fresh. Therefore, we close the currently open li, and then open a new li:

<ul>
  <li>
    Fruit
    <ul>
      <li>Apple</li>
      <li>Banana

The fourth paragraph has the style Bullet1. The first element of the style rule (ul) is satisfied, but the li needs to be fresh. Therefore, we close the outer li, along with its children, before opening a fresh li:

<ul>
  <li>
    Fruit
    <ul>
      <li>Apple</li>
      <li>Banana</li>
    </ul>
  </li>
  <li>
    Vegetable

The processing of the final two paragraphs proceeds in the same way as before, giving us the HTML:

<ul>
  <li>
    Fruit
    <ul>
      <li>Apple</li>
      <li>Banana</li>
    </ul>
  </li>
  <li>
    Vegetable
    <ul>
      <li>Cucumber</li>
      <li>Lettuce

Since we've reached the end of the document, all that remains is to close all open elements:

<ul>
  <li>
    Fruit
    <ul>
      <li>Apple</li>
      <li>Banana</li>
    </ul>
  </li>
  <li>
    Vegetable
    <ul>
      <li>Cucumber</li>
      <li>Lettuce</li>
    </ul>
  </li>
</ul>
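
The whole procedure is short enough to sketch in full. Again, this is a simplification of my own rather than Mammoth's real implementation: paragraphs are (style, text) pairs, and rules map each style to the (tag, fresh) form produced by the earlier parsing sketch.

def convert(paragraphs, rules):
    open_tags = []  # stack of currently open tags, outermost first
    html = []

    def close_down_to(depth):
        while len(open_tags) > depth:
            html.append("</{0}>".format(open_tags.pop()))

    for style, text in paragraphs:
        path = rules[style]
        # Reuse open elements for as long as they match non-fresh path elements
        depth = 0
        while (depth < len(path) and depth < len(open_tags)
                and open_tags[depth] == path[depth][0]
                and not path[depth][1]):
            depth += 1
        # Close anything deeper than the matched prefix, then open the rest
        close_down_to(depth)
        for tag, _ in path[depth:]:
            html.append("<{0}>".format(tag))
            open_tags.append(tag)
        html.append(text)

    close_down_to(0)  # end of document: close all remaining elements
    return "".join(html)

rules = {
    "Bullet1": [("ul", False), ("li", True)],
    "Bullet2": [("ul", False), ("li", False), ("ul", False), ("li", True)],
}
paragraphs = [("Bullet1", "Fruit"), ("Bullet2", "Apple"), ("Bullet2", "Banana")]
print(convert(paragraphs, rules))
# <ul><li>Fruit<ul><li>Apple</li><li>Banana</li></ul></li></ul>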

I've left plenty of details out, such as handling of hyperlinks and images, but this gives an overview of how Mammoth deals with the greatest mismatch between the structure of docx XML and HTML.

[1] If you go and look at an actual docx file, you'll discover that the XML is more complicated than what I've presented. I've only included the bits that matter for an overview.

Topics: Algorithms, Programs

Fun with Prolog: write an algorithm, then run it backwards

Sunday 10 November 2013 21:21

Compared to most other languages, Prolog encourages you to write code in a highly declarative style. One of the results is that you can write an algorithm, and then run the same algorithm "backwards" without any additional code.

For instance, suppose you want to find out whether a list is a palindrome or not. We write a predicate like so:

palindrome(L) :- reverse(L, L).

We can read this as: palindrome(L) is true if reverse(L, L) is true. In turn, reverse(L1, L2) is true when L1 is the reverse of L2. We try out the palindrome predicate in the interpreter:

?- palindrome([]).
true.

?- palindrome([1]).
true.

?- palindrome([1, 1]).
true.

?- palindrome([1, 2]).
false.

?- palindrome([1, 2, 1]).
true.

So far, not that different from any other programming language. However, if we set some of the elements of the list to be variables, Prolog tries to fill in the blanks -- that is, it tries to find values for those variables so that the predicate is true. For instance:

?- palindrome([1, A]).
A = 1.

In the above, Prolog tells us that the list [1, A] is a palindrome if A has the value 1. We can do something a bit more fancy if we use a variable for the tail of the list, rather than just single elements: [1, 2 | A] means a list whose first two elements are 1 and 2, with any remaining elements represented by A.

?- palindrome([1, 2 | A]).
A = [1]

Prolog tells us that [1, 2 | A] is a palindrome if A has the value [1] -- that is, when the whole list is [1, 2, 1]. However, if we hit semicolon in the interpreter, Prolog gives us another value for A that satisfies the predicate.

?- palindrome([1, 2 | A]).
A = [1] ;
A = [2, 1]

Now Prolog is telling us that [2, 1] is another value for A that satisfies the predicate. If we hit semicolon again, we get another result:

?- palindrome([1, 2 | A]).
A = [1] ;
A = [2, 1] ;
A = [_G313, 2, 1]

This time, Prolog says A = [_G313, 2, 1] satisfies the predicate. The value _G313 means that any value would be valid in that position. Another hit of the semicolon, and another possibility:

?- palindrome([1, 2 | A]).
A = [1] ;
A = [2, 1] ;
A = [_G313, 2, 1] ;
A = [_G313, _G313, 2, 1]

We still have _G313, but this time it appears twice. The first and second elements of A can be anything, so long as they're the same value. We can keep hitting semicolon, and Prolog will keep giving us possibilities:

?- palindrome([1, 2 | A]).
A = [1] ;
A = [2, 1] ;
A = [_G313, 2, 1] ;
A = [_G313, _G313, 2, 1] ;
A = [_G313, _G319, _G313, 2, 1] ;
A = [_G313, _G319, _G319, _G313, 2, 1] ;
A = [_G20, _G26, _G32, _G26, _G20, 2, 1] ;
A = [_G20, _G26, _G32, _G32, _G26, _G20, 2, 1]

In each of these possibilities, Prolog correctly determines which of the elements in the list must be the same. Now for one last example: what if we don't put any constraints on the list?

?- palindrome(A).
A = [] ;
A = [_G295] ;
A = [_G295, _G295] ;
A = [_G295, _G301, _G295] ;
A = [_G295, _G301, _G301, _G295] ;
A = [_G295, _G301, _G307, _G301, _G295] ;
A = [_G295, _G301, _G307, _G307, _G301, _G295] ;
A = [_G295, _G301, _G307, _G313, _G307, _G301, _G295]

Once again, Prolog generates possibilities for palindromes, telling us which elements need to be the same, but otherwise not putting any restrictions on values.

In summary, we wrote some code to tell us whether lists were palindromes or not, but that same code can be used to generate palindromes. As another example, we might want to implement run-length encoding of lists:

?- encode([a, a, a, b, b, a, c, c], X).
X = [[3, a], [2, b], [1, a], [2, c]] .

Once we've written encode to work in one direction (turning ordinary lists into run-length-encoded lists), we can use the same predicate to work in the other direction (turning run-length-encoded lists into ordinary lists):

?- encode(X, [[3, a], [2, b], [1, a], [2, c]]).
X = [a, a, a, b, b, a, c, c] .

For the interested, the implementation can be found as a GitHub gist. One caveat is that the implementation of encode has to be written carefully so that it works in both directions. Although this might be harder (and much less efficient) than writing two separate predicates, one for encoding and one for decoding, using a single predicate gives a high degree of confidence that the decode operation is correctly implemented as the inverse of the encode operation. Writing a version of encode that actually works in both directions is an interesting challenge, and also the topic of another blog post.

(Thanks to Ninety-Nine Prolog Problems for inspiration for examples.)

Topics: Prolog, Language design

Relocatable Python virtualenvs using Whack

Saturday 7 September 2013 17:25

One of the uses for Whack is creating relocatable (aka path-independent) Python virtualenvs. Normally, a virtualenv is tied to a specific absolute path:

$ virtualenv venv
$ venv/bin/pip install glances
(Snipping pip output)
$ mv venv venv2
$ venv2/bin/glances -v
bash: venv2/bin/glances: /tmp/venv/bin/python: bad interpreter: No such file or directory

Copying the entire virtualenv has similar but subtler problems. Rather than getting a straightforward error, the scripts in the new virtualenv will use the Python interpreter and libraries in the original virtualenv.
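
The underlying problem is that virtualenv bakes absolute paths into the scripts it generates: each script's shebang line points at the interpreter in the virtualenv's original location. You can see this by peeking at the first line of any generated script (the exact path depends on where the virtualenv was created):

# Print the shebang line that ties a script to the virtualenv's location
with open("venv/bin/pip") as script:
    print(script.readline())
# Prints something like: #!/tmp/venv/bin/python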

Whack allows virtualenvs to be created, and then moved to any other location:

$ whack install \
    git+https://github.com/mwilliamson/whack-package-python-virtualenv-env.git \
    venv
$ venv/bin/pip install glances
(Snipping pip output)
$ whack deploy venv --in-place
$ # Now we can copy the virtualenv to any other path,
$ # and it will continue to work
$ mv venv venv2
$ venv2/bin/glances -v
Glances version 1.7.1 with PsUtil 1.0.1

The whack deploy command is necessary to add any newly-installed scripts in the virtualenv to the bin directory.

One question is: why not use the --relocatable argument that virtualenv itself provides? This works in many cases, and doesn't require installation of Whack, but it also comes with a warning from virtualenv's documentation:

The --relocatable option currently has a number of issues, and is not guaranteed to work in all circumstances. It is possible that the option will be deprecated in a future version of virtualenv.

Topics: Python, Whack, Programs

An experiment in reusable web widgets

Wednesday 31 July 2013 10:28

For the same reasons that breaking down programs into short, composable functions is a good idea, it seems like breaking down web code into short, composable web widgets would be a good idea. (By web widget, I mean the HTML, CSS and JavaScript that go together to implement a particular piece of functionality.) Having shorter snippets makes code easier to understand and change, with the potential for reuse.

Yet it feels like there's no good way of sharing such widgets across libraries. For instance, the usual way of creating a web widget using jQuery is to create a jQuery plugin, but there's no natural way of using such a plugin from Knockout. Over the past few days, I've tried an experiment in creating web widgets that can be written and consumed independently of technology.

First of all, I've defined a widget as being a function that accepts a single options argument. That options argument must contain an element property, which is the element that will be transformed into the widget (for instance, we might turn an <input> element into a date picker). The options argument can also contain any number of other options for that widget. The interface is kept simple so it's easy to implement, while still being sufficiently general. It's not exactly something to write home about, but the value is in choosing a fixed interface.

Now that we've defined the notion of web widget, we'll want to start consuming and creating widgets. For instance, we can create a widget that turns its message option to uppercase and wraps it in <strong> tags:

function shoutingWidget(options) {
    var element = options.element;
    var contents = options.message.toUpperCase();
    // Assuming that we've defined htmlEscape elsewhere
    element.innerHTML = "<strong>" + htmlEscape(contents) + "</strong>";
}

We can use it like so:

shoutingWidget({
    element: document.getElementById("example"),
    message: "Hello!"
});

which will transform the following HTML:

<span id="example"></span>

into:

<span id="example"><strong>HELLO!</strong></span>

However, most of the time, I'm not writing web code using raw JavaScript. So, for any given web framework/library, we can start to answer two questions:

  • What's the easiest way we can consume a widget?
  • What's the easiest way we can create a widget?

In particular, when a widget is used, we shouldn't care about the underlying implementation. Whether it was created using jQuery or Knockout or something else, we should be able to use it with the same interface.

Let's see how this works with Knockout. To create a web widget, I call the function knockoutWidgets.widget() with an object with an init function, and I get back a widget (which is just a function). The init function is called with the options object each time the widget is rendered. The init function should return the view model and template for that widget. For instance, to implement the previous example using Knockout:

var shoutingWidget = knockoutWidgets.widget({
    init: function(options) {
        var contents = options.message.toUpperCase();
        return {
            viewModel: {contents: contents},
            template: '<strong data-bind="text: contents"></strong>'
        }
    }
});

To consume widgets from Knockout, we have to explicitly specify dependencies. By avoiding putting all widgets into a single namespace, we avoid collisions without using long, unwieldy names. For instance, to create an emphatic greeter widget that transforms:

<span id="example"></span>

into:

<span id="example">Hello <strong>BOB</strong>!</span>

we can write:

var emphaticGreeterWidget = knockoutWidgets.widget({
    init: function(options) {
        return {
            viewModel: {name: options.name},
            template: 'Hello <span data-bind="widget: \'shout\', widgetOptions: {message: name}"></span>!'
        }
    },
    dependencies: {
        shout: shoutingWidget
    }
});

emphaticGreeterWidget({
    element: document.getElementById("example"),
    name: "Bob"
});

Importantly, although we've created the widget using Knockout, any code that supports our general notion of a web widget should be able to use it. Similarly, emphaticGreeterWidget can use shoutingWidget regardless of whether it was written using Knockout, raw JavaScript, or something else altogether.

Although I've successfully used this style with Knockout for some small bits of work, there are still two rather major unsolved problems.

The first problem: how should data binding be handled? All the above examples have data flowing in one direction: into the widget. What if we want data to flow in both directions, such as a date picker widget?

The second problem: should content within widgets be allowed? Our shouting widget had the message passed in via the options argument, but it could have been specified in the body of the element that the widget was applied to. Using raw JavaScript, that means a definition that looks something like:

function shoutingWidget(options) {
    var element = options.element;
    var message = "message" in options ? options.message : element.textContent;
    var contents = message.toUpperCase();
    // Assuming that we've defined htmlEscape elsewhere
    element.innerHTML = "<strong>" + htmlEscape(contents) + "</strong>";
}

If we allow content within widgets, then we have to work out how the widget interacts with the content and the web library in use. For instance, if we're using Knockout, do we apply the bindings before or after the widget is executed? How should the widget detect changes when its children change as a result of those Knockout bindings?

Also notably absent from the examples is any handling of CSS, despite my mentioning it earlier. The reason: it hasn't been needed in my small experiments so far, so I haven't thought that much about it! It's something that will need dealing with at some point though.

Thoughts on the overall idea or those specific problems are welcome! You can take a look at the code on GitHub.

Topics: HTML, JavaScript

The Cthulhu Effect, or what happens to old programmers

Sunday 28 July 2013 13:27

There's a perception in at least some parts of the world of software development that coding is a game for the young. I'm not sure whether it's actually true or not, but after a conversation with a friend, I've decided to chalk up the phenomenon to "the Cthulhu Effect". Apparently, the more you stare at Cthulhu, the more insane you go. Similarly, I expect a decade or two of staring at awful code is enough to drive any programmer to madness, leaving you with the choice of either running from the code or embracing madness. (That also goes a long way towards explaining the demeanour of many experienced programmers.)

Topics: Nonsense

Adding git (or hg, or svn) dependencies in setup.py (Python)

Wednesday 29 May 2013 21:02

You can specify dependencies for your Python project in setup.py by referencing packages on the Python Package Index (PyPI). But what if you want to depend on your own package that you don't want to make public? Using dependency_links, you can reference a package's source repository directly.

For instance, mayo is a public package on PyPI, so I can reference it directly:

setup(
    install_requires=[
        "mayo>=0.2.1,<0.3"
    ],
    # Skipping other arguments to setup for brevity
)

But suppose that mayo is a private package that I don't want to share. Using the dependency_links argument, I can reference the package by its source repository. The only way I could get this working with git was to use an explicit SSH git URL, which requires a small transformation of the SSH URLs that GitHub or BitBucket provide. For instance, if GitHub lists the SSH URL as:

git@github.com:mwilliamson/mayo.git

then we need to explicitly mark the URL as being SSH, which means adding ssh:// at the front and replacing the colon after github.com with a forward slash. Finally, we need to indicate that this is a git URL by adding git+ to the front. This gives a URL like:

git+ssh://git@github.com/mwilliamson/mayo.git

To use a specific commit, add an at symbol followed by a commit identifier. For instance, if we wanted to use version 0.2.1, which has the tag 0.2.1 in git:

git+ssh://git@github.com/mwilliamson/mayo.git@0.2.1

Then, we can use the URL in setup.py like so:

setup(
    install_requires=[
        "mayo==0.2.1"
    ],
    dependency_links=[
        "git+ssh://git@github.com/mwilliamson/mayo.git@0.2.1#egg=mayo-0.2.1"
    ]
    # Skipping other arguments to setup for brevity
)

Note that we depend on a specific version of the package, and that we use the URL fragment (the bit after #) to indicate both the package name and version.
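
Since the transformation from GitHub's SSH URL to a dependency_links entry is entirely mechanical, it's easy to script. Here's a hypothetical helper (the function is my own invention, not part of any library):

def github_ssh_to_dependency_link(ssh_url, version, package_name):
    # "git@github.com:user/repo.git" -> explicit SSH git URL, pinned to a
    # tag and annotated with the egg fragment
    host, path = ssh_url.split(":", 1)
    return "git+ssh://{0}/{1}@{2}#egg={3}-{2}".format(
        host, path, version, package_name)

print(github_ssh_to_dependency_link("git@github.com:mwilliamson/mayo.git", "0.2.1", "mayo"))
# git+ssh://git@github.com/mwilliamson/mayo.git@0.2.1#egg=mayo-0.2.1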

Topics: Python

External code quality and libification

Tuesday 26 February 2013 20:11

If you ask a programmer to list symptoms of low code quality, they could probably produce a long list: deeply nested conditionals and loops, long methods, overly terse variable names. Most of these code smells tend to focus on the implementation of the code. They're about internal code quality.

External code quality instead asks you to consider the programmer who has to call your code. When trying to judge how easily somebody else can use your code, you might ask yourself:

  • Do the class and method names describe what the caller wants to accomplish?
  • How many times must we call into your code to complete a single, discrete task?
  • Does your code have minimal dependencies on other parts of your codebase and external libraries?

As an example, consider this snippet of Java to write an XML document to an OutputStream:

import org.w3c.dom.*;
import java.io.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.*;
import javax.xml.transform.stream.*;

private static final void writeDoc(Document document, OutputStream output)
        throws IOException {
    try {
        Transformer transformer =
            TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(
            OutputKeys.DOCTYPE_SYSTEM,
            document.getDoctype().getSystemId()
        );
        transformer.transform(new DOMSource(document), new StreamResult(output));
    } catch (TransformerException e) {
        throw new AssertionError(e); // Can't happen!
    }
}

While there are probably good reasons for all of those methods, and there are cases where having a high level of control is valuable, this isn't a good API for a user who just wants to write their XML document to an output stream.

  • Do the class and method names describe what they want to accomplish? We want to write out our XML document, and instead we're talking about TransformerFactory and OutputKeys.DOCTYPE_SYSTEM.
  • How many times must we call into your code to complete a single, discrete task? Writing out an XML document seems simple, but we have to create an instance of a transformer factory, ask it for a transformer, set the output property (whatever that is), and wrap up our document and output stream, before we can finally use the transformer to write out our document.
  • Does your code have minimal dependencies on other parts of your codebase and external libraries? The code above actually does quite well here, since that snippet should work on a normal installation of Java.
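
For contrast, consider an API designed around the caller's task. This isn't an exact equivalent of the Java snippet above (it ignores the doctype handling, for one), but Python's ElementTree makes writing out a document a single call named after what we want to do:

import xml.etree.ElementTree as ET

# Parse a document and write it back out: one call per task
tree = ET.parse("document.xml")
with open("output.xml", "wb") as output:
    tree.write(output)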

So, why is it valuable to distinguish between internal and external code quality? The effect of low internal code quality is contained within a small scope (by definition!). I'm certainly not advocating one-letter names for all local variables, but cleaning up that kind of code is straightforward compared to improving an API. The effects of low external code quality tend to pervade your entire system: if you change the signature of a method, you now have to change every use of that method.

When writing code, we often trade off code quality against speed of delivery. Even when writing good quality code, we're not going to spend weeks refactoring to make it perfect. I'm suggesting that we should be spending more of that time worrying about the external quality of our code. Internal quality is important, but it's not as important.

A good measure of whether a piece of your code has minimal dependencies is to try "libifying" it: turn it into an independent library. If the code you write frequently depends on large parts of the entire system, then it probably depends on too much. Once you've split out your code into a separate library, there's a good chance that external code quality will improve. For starters, once you've pulled out that code, you're unlikely to accidentally introduce new dependencies that aren't really required. Beyond that: when you've written a bad API deep within the internals of your large system, it's easy to ignore. If you've split it out into a library, it's much harder to ignore whether your library makes it hard or easy to do what it says on the tin.

Decomposing your code into libraries has plenty of advantages, such as code reuse and being able to test components independently. But I have a hypothesis that aggressively libifying your code will leave you with a much higher quality of code in the long run.

Topics: Software development, Software design

The best retrospectives are in the middle of a project, not the end

Sunday 24 February 2013 22:52

Retrospectives are unfortunately named. The name (correctly) suggests looking back over what has gone before, but I've noticed this leads many people to run retrospectives after a project has finished. The other part of a retrospective is looking forward: how can we improve in the future? What can we do differently? What can we try?

Retrospectives after a completed project can certainly be educational, but the lessons learnt and things to do in the future tend to be somewhat abstract and vague. Since the project is over, you can't make immediate changes over the next couple of weeks, so there's little motivation to come up with concrete actions. Retrospectives are about improvement, but in this case you're often improving the vague notion of a similar project in the future.

On the other hand, if you run a retrospective in the middle of a project, you can try out new ideas quickly, perhaps as soon as you leave the retrospective. These ideas will hopefully improve your working life within the next couple of weeks, rather than affecting some vague future project. This gives a strong incentive to come up with useful, concrete actions. If you're running regular retrospectives, you also have the opportunity to experiment and iterate on ideas.

Retrospectives shouldn't be held at the end of a project out of a sense of obligation, or the need to learn something from a failed project. Regular retrospectives in the middle of a project give the best chance for real improvement.

Topics: Software development

Test reuse

Monday 18 February 2013 10:38

Code reuse is often discussed, but what about test reuse? I don't just mean reusing common code between tests -- I mean running exactly the same tests against different code. Imagine you're writing a number of different implementations of the same interface. If you write a suite of tests against the interface, any one of your implementations should be able to make the tests pass. Taking the idea even further, I've found that you can reuse the same tests whenever you're exposing the same functionality through different methods, whether as a library, an HTTP API, or a command line interface.

As an example, suppose you want to start up a virtual machine from some Python code. We could use QEMU, a command line application on Linux that lets you start up virtual machines. Invoking QEMU directly is a bit ugly, so we wrap it up in a class. As an example of usage, here's what a single test case might look like:

def can_run_commands_on_machine():
    provider = QemuProvider()
    with provider.start("ubuntu-precise-amd64") as machine:
        shell = machine.shell()
        result = shell.run(["echo", "Hello there"])
        assert_equal("Hello there\n", result.output)

We create an instance of QemuProvider, use its start method to start a virtual machine, run a command on that machine, and check the output. However, other than the original construction of the virtual machine provider, there's nothing in the test that relies on QEMU specifically. So, we could rewrite the test to accept the provider as an argument, making it implementation-agnostic:

def can_run_commands_on_machine(provider):
    with provider.start("ubuntu-precise-amd64") as machine:
        shell = machine.shell()
        result = shell.run(["echo", "Hello there"])
        assert_equal("Hello there\n", result.output)

If we decide to implement a virtual machine provider using a different technology, for instance by writing a VirtualBoxProvider class, we can reuse exactly the same test case. Not only does this save us from duplicating the test code, it gives us a degree of confidence that each implementation can be used in the same way.
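
One way to wire this up is to build the suite from a provider factory. This is a sketch rather than working test infrastructure -- the make_suite helper is invented for illustration, and in practice test collection is handled by a tool such as nose (see nose-set-tests, mentioned below) -- but it shows the shape of the idea:

from nose.tools import assert_equal

def make_suite(create_provider):
    # The same implementation-agnostic tests, bound to a particular provider
    def can_run_commands_on_machine():
        provider = create_provider()
        with provider.start("ubuntu-precise-amd64") as machine:
            shell = machine.shell()
            result = shell.run(["echo", "Hello there"])
            assert_equal("Hello there\n", result.output)
    return [can_run_commands_on_machine]

# Each implementation gets exactly the same suite
qemu_tests = make_suite(QemuProvider)
virtual_box_tests = make_suite(VirtualBoxProvider)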

If other people are implementing your interface, you could provide the same suite of tests so they can run it against their own implementation. This can give them some confidence that they've implemented your interface correctly.

What about when you're implementing somebody else's interface? Writing your own set of implementation-agnostic tests and running it against existing implementations is a great way to check that you've understood the interface. You can then run the same tests against your code to make sure your own implementation is correct.

We can take the idea of test reuse a step further by testing user interfaces with the same suites of tests that we use for the underlying library. Using our virtual machine example, suppose we write a command line interface (CLI) to let people start virtual machines manually. We could test the CLI by writing a separate suite of tests. Alternatively, we could write an adaptor that invokes our own application to implement the provider interface:

class CliProvider(object):
    def start(self, image_name):
        output = subprocess.check_output([
            _APPLICATION_NAME, "start", image_name
        ])
        
        return CliMachine(_parse_output(output))

Now, we can make sure that our command-line interface behaves correctly using the same suite of tests that we used to test the underlying code. If our interface is just a thin layer on top of the underlying code, then writing such an adaptor is often reasonably straightforward.

I often find writing clear and clean UI tests is hard. Keeping a clean separation between the intent of the test and the implementation is often tricky, and it takes discipline to stop the implementation details from leaking out. Reusing tests in this way forces you to hide those details behind the common interface.

If you're using nose in Python to write your tests, then I've put the code I've been using to do this in a separate library called nose-set-tests.

Topics: Testing, Software design, Python

spur.py: A simplified interface for SSH and subprocess in Python

Sunday 10 February 2013 14:45

Over the last few months, I've frequently needed to use SSH from Python, but didn't find any of the existing solutions to be well-suited for what I needed (see below for discussion of other solutions). So, I've created spur.py to make using SSH from Python easy. For instance, to run echo over SSH:

import spur

shell = spur.SshShell(hostname="localhost", username="bob", password="password1")
result = shell.run(["echo", "-n", "hello"])
print result.output # prints hello

shell.run() executes a command, and returns the result once it's finished executing. If you don't want to wait until the command has finished, you can call shell.spawn() instead, which returns a process object:

process = shell.spawn(["sh", "-c", "read value; echo $value"])
process.stdin_write("hello\n")
result = process.wait_for_result()
print result.output # prints hello

spur.py also allows commands to be run locally using the same interface:

import spur

shell = spur.LocalShell()
result = shell.run(["echo", "-n", "hello"])
print result.output # prints hello

For a complete list of supported operations, take a look at the project on GitHub.

spur.py is certainly not the only way to use SSH from Python, and it's possible that one of the other solutions might be better suited for what you need. I've come across three other main alternatives.

The first is to shell out to ssh. It works, but it's ugly.

The second is to use Fabric. Unfortunately, I found Fabric to be a bit too high-level. It's useful for implementing deployment scripts using SSH, but I found it awkward to use as a general-purpose library for SSH.

Finally, there's paramiko. I found paramiko to be a bit too low-level, but both Fabric and spur.py are built on top of paramiko.

Topics: Python