Converting docx to clean HTML: handling the XML structure mismatch
Tuesday 17 December 2013 08:11
One of my recent side projects is Mammoth, which converts docx files produced by Microsoft Word into HTML. It aims to produce clean HTML by using semantic information in the original document, such as the styles applied to each paragraph, rather than trying to exactly copy the font, size, colour, and so on. I wrote Mammoth so that editors wouldn't have to spend hours manually converting Word documents into HTML. Although we're converting XML to XML, there's quite a mismatch in structure. This blog post describes how Mammoth handles the mismatch. If you're interested in trying it out, you can find a Python version (including a CLI) and a JavaScript version.
The docx format stores each paragraph as a distinct w:p element. Each paragraph optionally has a style. For instance, the following docx XML represents a heading followed by an ordinary paragraph [1].
<w:p style="Heading1">A Study in Scarlet</w:p>
<w:p>In the year 1878 I took my degree</w:p>
We'd like to convert this to an h1 element and a p element:
<h1>A Study in Scarlet</h1>
<p>In the year 1878 I took my degree</p>
This seems fairly straightforward: we take each paragraph from the docx XML, and convert it to an HTML element depending on the style. We can use a small DSL to let the user control how to map docx styles to HTML elements without having to write any code. In this case, we might write:
p.Heading1 => h1:fresh
p => p:fresh
To the left of the arrow, we have a paragraph matcher. p.Heading1 from the first rule matches any paragraph with the style Heading1, while p from the second rule matches any paragraph. To the right of the arrow, we have an HTML path.
To process a docx paragraph:
- Find the first rule whose paragraph matcher matches the current docx paragraph.
- Generate HTML to satisfy the HTML path.
h1 is satisfied if there's a top-level h1, i.e. an h1 with no parents. h1:fresh means generate a fresh (i.e. newly-opened) top-level h1 element. We'll see a little later why this notion of freshness is useful.
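As a rough illustration of the first step, here is a minimal Python sketch that parses rules written in the DSL and picks the first one that matches a paragraph's style. The names PathElement, Rule, parse_rule and find_rule are made up for this example and are not Mammoth's actual API; the sketch only understands the p and p.StyleName matchers shown above.

```python
from typing import NamedTuple, Optional, Tuple


class PathElement(NamedTuple):
    tag: str
    fresh: bool


class Rule(NamedTuple):
    style: Optional[str]  # None means "matches any paragraph"
    path: Tuple[PathElement, ...]


def parse_rule(line: str) -> Rule:
    """Parse a rule such as 'p.Bullet1 => ul > li:fresh' (hypothetical helper)."""
    matcher, html_path = (side.strip() for side in line.split("=>", 1))
    if matcher == "p":
        style = None
    elif matcher.startswith("p."):
        style = matcher[2:]
    else:
        raise ValueError(f"unsupported paragraph matcher: {matcher!r}")

    elements = []
    for part in html_path.split(">"):
        part = part.strip()
        fresh = part.endswith(":fresh")
        tag = part[: -len(":fresh")] if fresh else part
        elements.append(PathElement(tag, fresh))
    return Rule(style, tuple(elements))


def find_rule(rules, style: Optional[str]) -> Rule:
    """Return the first rule whose matcher accepts a paragraph with this style."""
    for rule in rules:
        if rule.style is None or rule.style == style:
            return rule
    raise LookupError(f"no rule matches style {style!r}")


rules = [parse_rule("p.Heading1 => h1:fresh"), parse_rule("p => p:fresh")]
heading_rule = find_rule(rules, "Heading1")
plain_rule = find_rule(rules, None)
```

Because rules are tried in order, the catch-all p rule has to come last; otherwise it would shadow the more specific p.Heading1 rule.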
Things become a bit more tricky when we'd expect to generate some nested HTML, such as lists. For instance, consider the following list:
- Apple
- Banana
One way of representing this in docx is:
<w:p style="Bullet1">Apple</w:p>
<w:p style="Bullet1">Banana</w:p>
Note that there's no nesting of elements, even though the two docx paragraphs are part of the same structure (in this case, a list). The only way to tell that these bullets are in the same list is by inspecting the style of sibling elements. Compare this to the HTML we expect to generate:
<ul>
  <li>Apple</li>
  <li>Banana</li>
</ul>
To generate this HTML, you can write the following rule:
p.Bullet1 => ul > li:fresh
The HTML path uses > to indicate children. In this case, the HTML path is satisfied when there's a top-level ul with a fresh li as a child.
Let's see how this example works by processing each docx paragraph. The first paragraph matches p.Bullet1, so we require a top-level ul with a fresh li as a child. Since we have no open elements, we open both elements and then write the text of the paragraph:
<ul>
  <li>Apple
The second paragraph also requires a top-level ul with a fresh li as a child. We close the open li and open a new one, since it needs to be fresh, but leave the ul alone:
<ul>
  <li>Apple</li>
  <li>Banana
Finally, we close all elements at the end of the document:
<ul>
  <li>Apple</li>
  <li>Banana</li>
</ul>
The key is that HTML elements aren't closed after processing a docx paragraph. Instead, HTML elements are kept open in case following docx paragraphs are actually part of the same structure. An element will eventually be closed either by processing a docx paragraph that isn't part of the same structure, or by reaching the end of the document.
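The idea of leaving elements open can be sketched with a stack of currently open tags. The following Python is a simplified illustration under my own assumptions rather than Mammoth's real implementation: convert and parse_path are hypothetical names, paragraphs are plain (style, text) pairs, and text is not escaped.

```python
def parse_path(text):
    """Turn 'ul > li:fresh' into [('ul', False), ('li', True)]."""
    path = []
    for part in text.split(">"):
        part = part.strip()
        fresh = part.endswith(":fresh")
        path.append((part[: -len(":fresh")] if fresh else part, fresh))
    return path


def convert(paragraphs, rules):
    """paragraphs: list of (style, text); rules: list of (style or None, path).

    A rule with style None matches any paragraph. Assumes some rule always matches.
    """
    open_tags = []  # stack of tags left open by earlier paragraphs
    out = []
    for style, text in paragraphs:
        path = next(p for rule_style, p in rules if rule_style in (None, style))

        # Count how many already-open elements can be reused: they must have
        # the same tag at the same depth, and must not be required to be fresh.
        keep = 0
        for open_tag, (tag, fresh) in zip(open_tags, path):
            if fresh or tag != open_tag:
                break
            keep += 1

        # Close everything that can't be reused, innermost first.
        while len(open_tags) > keep:
            out.append(f"</{open_tags.pop()}>")

        # Open whatever the path still needs, then write the paragraph text.
        for tag, _ in path[keep:]:
            out.append(f"<{tag}>")
            open_tags.append(tag)
        out.append(text)

    # End of document: close all remaining open elements.
    while open_tags:
        out.append(f"</{open_tags.pop()}>")
    return "".join(out)


rules = [("Bullet1", parse_path("ul > li:fresh"))]
html = convert([("Bullet1", "Apple"), ("Bullet1", "Banana")], rules)
print(html)  # <ul><li>Apple</li><li>Banana</li></ul>
```

Note that nothing is closed at the end of a paragraph: closing only happens when a later paragraph can't reuse an open element, or when the document ends.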
A more complicated case is that of nested lists. For instance, given the list:
- Fruit
  - Apple
  - Banana
- Vegetable
  - Cucumber
  - Lettuce
This would be represented in docx by:
<w:p style="Bullet1">Fruit</w:p>
<w:p style="Bullet2">Apple</w:p>
<w:p style="Bullet2">Banana</w:p>
<w:p style="Bullet1">Vegetable</w:p>
<w:p style="Bullet2">Cucumber</w:p>
<w:p style="Bullet2">Lettuce</w:p>
And we'd like to generate this HTML:
<ul>
  <li>
    Fruit
    <ul>
      <li>Apple</li>
      <li>Banana</li>
    </ul>
  </li>
  <li>
    Vegetable
    <ul>
      <li>Cucumber</li>
      <li>Lettuce</li>
    </ul>
  </li>
</ul>
In this case, we need two rules: one each for Bullet1 and Bullet2:
p.Bullet1 => ul > li:fresh
p.Bullet2 => ul > li > ul > li:fresh
To see how this works, let's follow step by step. We start by processing the first docx paragraph. This has the style Bullet1, which requires a ul and li element to be open. This generates the following HTML:
<ul>
  <li>
    Fruit
The second paragraph has the style Bullet2, which means we need to satisfy the HTML path ul > li > ul > li:fresh. Since the ul and li from processing the first docx paragraph have been left open, we only need to generate the second set of ul and li elements, giving the HTML:
<ul>
  <li>
    Fruit
    <ul>
      <li>Apple
The third paragraph also has the style Bullet2. The first three elements of the HTML path (ul > li > ul) are already satisfied, but the final li needs to be fresh. Therefore, we close the currently open li, and then open a new li:
<ul>
  <li>
    Fruit
    <ul>
      <li>Apple</li>
      <li>Banana
The fourth paragraph has the style Bullet1. The first element of the HTML path (ul) is satisfied, but the li needs to be fresh. Therefore, we close the outer li, along with its children, before opening a fresh li:
<ul>
  <li>
    Fruit
    <ul>
      <li>Apple</li>
      <li>Banana</li>
    </ul>
  </li>
  <li>
    Vegetable
The processing of the final two paragraphs proceeds in the same way as before, giving us the HTML:
<ul>
  <li>
    Fruit
    <ul>
      <li>Apple</li>
      <li>Banana</li>
    </ul>
  </li>
  <li>
    Vegetable
    <ul>
      <li>Cucumber</li>
      <li>Lettuce
Since we've reached the end of the document, all that remains is to close all open elements:
<ul>
  <li>
    Fruit
    <ul>
      <li>Apple</li>
      <li>Banana</li>
    </ul>
  </li>
  <li>
    Vegetable
    <ul>
      <li>Cucumber</li>
      <li>Lettuce</li>
    </ul>
  </li>
</ul>
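To check the walkthrough above, it can help to look only at the decision made for each paragraph: which open elements get closed and which new ones get opened. The sketch below traces those decisions for the first four paragraphs of the nested list. The function plan and the constants are hypothetical names for this illustration, and the paths are written out by hand as (tag, fresh) pairs instead of being parsed from the DSL.

```python
def plan(open_tags, path):
    """Return (tags_to_close, tags_to_open) needed to satisfy the path.

    open_tags is the stack of currently open tags, outermost first.
    path is a list of (tag, fresh) pairs, outermost first.
    """
    keep = 0
    for open_tag, (tag, fresh) in zip(open_tags, path):
        # An open element is reusable only if the tag matches and the path
        # doesn't demand a fresh one at this depth.
        if fresh or tag != open_tag:
            break
        keep += 1
    to_close = list(reversed(open_tags[keep:]))  # innermost first
    to_open = [tag for tag, _ in path[keep:]]
    return to_close, to_open


BULLET1 = [("ul", False), ("li", True)]                               # ul > li:fresh
BULLET2 = [("ul", False), ("li", False), ("ul", False), ("li", True)]  # ul > li > ul > li:fresh

open_tags = []
trace = []
# Fruit, Apple, Banana, Vegetable
for path in [BULLET1, BULLET2, BULLET2, BULLET1]:
    to_close, to_open = plan(open_tags, path)
    trace.append((to_close, to_open))
    open_tags = open_tags[: len(open_tags) - len(to_close)] + to_open

for step in trace:
    print(step)
# ([], ['ul', 'li'])
# ([], ['ul', 'li'])
# (['li'], ['li'])
# (['li', 'ul', 'li'], ['li'])
```

The last step is the interesting one: requiring a fresh outer li forces the inner li, the inner ul and the outer li to be closed, which is what keeps Vegetable from ending up inside the Fruit item.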
I've left plenty of details out, such as handling of hyperlinks and images, but this gives an overview of how Mammoth deals with the greatest mismatch between the structure of docx XML and HTML.
[1] If you go and look at an actual docx file, you'll discover that the XML is more complicated than what I've presented. I've only included the bits that matter for an overview.
Topics: Algorithms, Programs