Mike's corner of the web.

Converting docx to clean HTML: handling the XML structure mismatch

Tuesday 17 December 2013 08:11

One of my recent side projects is Mammoth, which converts docx files produced by Microsoft Word into HTML. It aims to produce clean HTML by using semantic information in the original document, such as the styles applied to each paragraph, rather than trying to exactly copy the font, size, colour, and so on. I wrote Mammoth so that editors wouldn't have to spend hours manually converting Word documents into HTML. Although we're converting XML to XML, there's quite a mismatch in structure. This blog post describes how Mammoth handles the mismatch. If you're interested in trying it out, you can find a Python version (including a CLI) and a JavaScript version.

The docx format stores each paragraph as a distinct w:p element. Each paragraph optionally has a style. For instance, the following docx XML represents a heading followed by an ordinary paragraph [1].

<w:p style="Heading1">A Study in Scarlet</w:p>
<w:p>In the year 1878 I took my degree</w:p>

We'd like to convert this to an h1 element and a p element:

<h1>A Study in Scarlet</h1>
<p>In the year 1878 I took my degree</p>

This seems fairly straightforward: we take each paragraph from the docx XML, and convert it to an HTML element depending on the style. We can use a small DSL to let the user control how to map docx styles to HTML elements without having to write any code. In this case, we might write:

p.Heading1 => h1:fresh
p => p:fresh

To the left of the arrow, we have a paragraph matcher. p.Heading1 from the first rule matches any paragraph with the style Heading1, while p from the second rule matches any paragraph. To the right of the arrow, we have an HTML path. To process a docx paragraph:

  • Find the first rule whose paragraph matcher matches the current docx paragraph.
  • Generate HTML to satisfy that rule's HTML path. The path h1 is satisfied if there's an open top-level h1, i.e. an h1 with no parents. h1:fresh means generate a fresh (i.e. newly-opened) top-level h1 element. We'll see a little later why this notion of freshness is useful.
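
As a rough sketch of the first step, here's how the rules above might be parsed and matched in Python. This isn't Mammoth's actual code: the names HtmlElement, Rule, parse_rule and find_rule are made up for illustration, and the parser only handles the simplified syntax used in this post.

```python
from typing import List, NamedTuple, Optional


class HtmlElement(NamedTuple):
    tag: str
    fresh: bool


class Rule(NamedTuple):
    style: Optional[str]  # None means "matches any paragraph"
    path: List[HtmlElement]


def parse_rule(line):
    """Parse a rule such as "p.Bullet1 => ul > li:fresh" (hypothetical helper)."""
    matcher, path = (part.strip() for part in line.split("=>", 1))
    # "p.Heading1" gives the style "Heading1"; a plain "p" gives no style.
    _, _, style = matcher.partition(".")
    elements = []
    for step in path.split(">"):
        tag, _, modifier = step.strip().partition(":")
        elements.append(HtmlElement(tag, modifier == "fresh"))
    return Rule(style or None, elements)


def find_rule(rules, paragraph_style):
    """Return the first rule whose matcher accepts the paragraph's style."""
    for rule in rules:
        if rule.style is None or rule.style == paragraph_style:
            return rule
    return None


rules = [parse_rule("p.Heading1 => h1:fresh"), parse_rule("p => p:fresh")]
print(find_rule(rules, "Heading1").path[0].tag)  # h1
print(find_rule(rules, None).path[0].tag)        # p
```

Since the first matching rule wins, the catch-all p rule has to come last.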

Things become a bit more tricky when we'd expect to generate some nested HTML, such as lists. For instance, consider the following list:

  • Apple
  • Banana

One way of representing this in docx is:

<w:p style="Bullet1">Apple</w:p>
<w:p style="Bullet1">Banana</w:p>

Note that there's no nesting of elements, even though the two docx paragraphs are part of the same structure (in this case, a list). The only way to tell that these bullets are in the same list is by inspecting the style of sibling elements. Compare this to the HTML we expect to generate:

<ul>
  <li>Apple</li>
  <li>Banana</li>
</ul>
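
To make the point about sibling styles concrete, here's a small Python sketch that reads the simplified docx XML and groups adjacent paragraphs that share a style. It only illustrates the flat structure, and isn't the approach Mammoth takes. The wrapping body element, the placeholder namespace URI and the function name runs_by_style are all assumptions of mine, and real docx XML is more complicated than this.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

# Simplified XML in the style of this post, wrapped in a root element that
# binds the "w" prefix. The namespace URI is a placeholder, not the real one.
DOCX_XML = """
<body xmlns:w="urn:example:simplified-docx">
  <w:p style="Bullet1">Apple</w:p>
  <w:p style="Bullet1">Banana</w:p>
  <w:p>In the year 1878 I took my degree</w:p>
</body>
"""


def runs_by_style(xml_text):
    """Group adjacent paragraphs that have the same style."""
    root = ET.fromstring(xml_text)
    paragraphs = [(p.get("style"), p.text) for p in root]
    return [
        (style, [text for _, text in group])
        for style, group in groupby(paragraphs, key=lambda p: p[0])
    ]


print(runs_by_style(DOCX_XML))
# [('Bullet1', ['Apple', 'Banana']), (None, ['In the year 1878 I took my degree'])]
```

The two bullets only end up together because they're adjacent and share the style Bullet1; nothing in the XML itself groups them.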

To generate this HTML, you can write the following rule:

p.Bullet1 => ul > li:fresh

The HTML path uses > to indicate children. In this case, the HTML path is satisfied when there's a top-level ul with a fresh li as a child. Let's see how this example works by processing each docx paragraph.

The first paragraph matches p.Bullet1, so we require a top-level ul with a fresh li as a child. Since we have no open elements, we open both elements followed by the text of the paragraph:

<ul>
  <li>Apple

The second paragraph also requires a top-level ul with a fresh li as a child. Since the li needs to be fresh, we close the open li and open a new one, but leave the ul alone:

<ul>
  <li>Apple</li>
  <li>Banana

Finally, we close all elements at the end of the document:

<ul>
  <li>Apple</li>
  <li>Banana</li>
</ul>

The key is that HTML elements aren't closed after processing a docx paragraph. Instead, HTML elements are kept open in case following docx paragraphs are actually part of the same structure. An element will eventually be closed either by processing a docx paragraph that isn't part of the same structure, or by reaching the end of the document.
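
The idea of keeping elements open can be sketched in a few lines of Python. This is a minimal illustration rather than Mammoth's implementation: satisfy and convert are hypothetical names, an HTML path is represented as a list of (tag, fresh) pairs, and the sketch assumes every paragraph style has a path.

```python
from html import escape


def satisfy(open_elements, path, output):
    """Close and open elements until the open elements satisfy the path.

    open_elements: tags that are currently open, outermost first.
    path: list of (tag, fresh) pairs, outermost first.
    output: list of HTML fragments, appended to in place.
    """
    # Count how many open elements can stay as they are. We stop at the
    # first element that differs from the path or that has to be fresh.
    keep = 0
    for (tag, fresh), open_tag in zip(path, open_elements):
        if fresh or tag != open_tag:
            break
        keep += 1
    # Close everything past that point, innermost first.
    while len(open_elements) > keep:
        output.append("</%s>" % open_elements.pop())
    # Open whatever the path still needs.
    for tag, _ in path[keep:]:
        open_elements.append(tag)
        output.append("<%s>" % tag)


def convert(paragraphs, paths_by_style):
    """paragraphs: list of (style, text) pairs, in document order."""
    open_elements, output = [], []
    for style, text in paragraphs:
        satisfy(open_elements, paths_by_style[style], output)
        output.append(escape(text))
    # End of the document: an empty path closes every open element.
    satisfy(open_elements, [], output)
    return "".join(output)


BULLET_PATH = [("ul", False), ("li", True)]  # ul > li:fresh
html = convert(
    [("Bullet1", "Apple"), ("Bullet1", "Banana")],
    {"Bullet1": BULLET_PATH},
)
print(html)  # <ul><li>Apple</li><li>Banana</li></ul>
```

Note that nothing is closed after writing the text of a paragraph. Closing only happens when the next paragraph, or the end of the document, demands it.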

A more complicated case is that of nested lists. For instance, consider the following list:

  • Fruit
    • Apple
    • Banana
  • Vegetable
    • Cucumber
    • Lettuce

This would be represented in docx by:

<w:p style="Bullet1">Fruit</w:p>
<w:p style="Bullet2">Apple</w:p>
<w:p style="Bullet2">Banana</w:p>
<w:p style="Bullet1">Vegetable</w:p>
<w:p style="Bullet2">Cucumber</w:p>
<w:p style="Bullet2">Lettuce</w:p>

And we'd like to generate this HTML:

<ul>
  <li>
    Fruit
    <ul>
      <li>Apple</li>
      <li>Banana</li>
    </ul>
  </li>
  <li>
    Vegetable
    <ul>
      <li>Cucumber</li>
      <li>Lettuce</li>
    </ul>
  </li>
</ul>

In this case, we need two rules, one each for Bullet1 and Bullet2:

p.Bullet1 => ul > li:fresh
p.Bullet2 => ul > li > ul > li:fresh

To see how this works, let's follow step by step. We start by processing the first docx paragraph. This has the style Bullet1, which requires a ul and li element to be open. This generates the following HTML:

<ul>
  <li>
    Fruit

The second paragraph has the style Bullet2, which means we need to satisfy the HTML path ul > li > ul > li:fresh. Since the ul and li from processing the first docx paragraph have been left open, we only need to generate the second set of ul and li elements, giving the HTML:

<ul>
  <li>
    Fruit
    <ul>
      <li>Apple

The third paragraph also has the style Bullet2. The first three elements of the HTML path (ul > li > ul) are already satisfied, but the final li needs to be fresh. Therefore, we close the currently open li, and then open a new li:

<ul>
  <li>
    Fruit
    <ul>
      <li>Apple</li>
      <li>Banana

The fourth paragraph has the style Bullet1. The first element of the HTML path (ul) is satisfied by the outermost ul, but the li needs to be fresh. Therefore, we close the outer li, along with the nested ul and li inside it, before opening a fresh li:

<ul>
  <li>
    Fruit
    <ul>
      <li>Apple</li>
      <li>Banana</li>
    </ul>
  </li>
  <li>
    Vegetable
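
The decision made at each step can be written as a small Python function that, given the open elements and an HTML path, says which elements to close and which to open. As before, this is a sketch under my own assumptions rather than Mammoth's code, and plan is a made-up name.

```python
def plan(open_elements, path):
    """Return (to_close, to_open) for a single docx paragraph.

    open_elements: tags that are currently open, outermost first.
    path: an HTML path as text, e.g. "ul > li > ul > li:fresh".
    """
    steps = [step.strip().partition(":") for step in path.split(">")]
    # Keep open elements for as long as they match the path and the
    # path doesn't ask for a fresh element.
    keep = 0
    for (tag, _, modifier), open_tag in zip(steps, open_elements):
        if modifier == "fresh" or tag != open_tag:
            break
        keep += 1
    to_close = open_elements[keep:][::-1]  # innermost first
    to_open = [tag for tag, _, _ in steps[keep:]]
    return to_close, to_open


# State after processing "Banana": both lists and both items are open.
open_now = ["ul", "li", "ul", "li"]

# Third paragraph (Bullet2): only the innermost li is replaced.
print(plan(open_now, "ul > li > ul > li:fresh"))  # (['li'], ['li'])

# Fourth paragraph (Bullet1): the outer li goes, along with its children.
print(plan(open_now, "ul > li:fresh"))  # (['li', 'ul', 'li'], ['li'])
```

Freshness is what stops the fourth paragraph from being appended to the existing outer li: without :fresh, the path ul > li would already be satisfied by the two outermost open elements.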

The processing of the final two paragraphs proceeds in the same way as before, giving us the HTML:

<ul>
  <li>
    Fruit
    <ul>
      <li>Apple</li>
      <li>Banana</li>
    </ul>
  </li>
  <li>
    Vegetable
    <ul>
      <li>Cucumber</li>
      <li>Lettuce

Since we've reached the end of the document, all that remains is to close all open elements:

<ul>
  <li>
    Fruit
    <ul>
      <li>Apple</li>
      <li>Banana</li>
    </ul>
  </li>
  <li>
    Vegetable
    <ul>
      <li>Cucumber</li>
      <li>Lettuce</li>
    </ul>
  </li>
</ul>

I've left plenty of details out, such as handling of hyperlinks and images, but this gives an overview of how Mammoth deals with the greatest mismatch between the structure of docx XML and HTML.

[1] If you go and look at an actual docx file, you'll discover that the XML is more complicated than what I've presented. I've only included the bits that matter for an overview.

Topics: Algorithms, Programs