In this blogpost I’ll explain my recent bypass in DOMPurify – the popular HTML sanitizer library. In a nutshell, DOMPurify’s job is to take an untrusted HTML snippet, supposedly coming from an end-user, and remove all elements and attributes that can lead to Cross-Site Scripting (XSS).
This is the bypass:
<style></math><img src onerror=alert(1)>
Believe me that there’s not a single element in this snippet that is superfluous 🙂
To understand why this particular code worked, I need to give you a ride through some interesting features of the HTML specification that I used to make the bypass work.
Usage of DOMPurify
Let’s begin with the basics and explain how DOMPurify is usually used. Assuming that we have untrusted HTML in htmlMarkup and we want to assign it to a certain div, we use the following code to sanitize it with DOMPurify and assign the result to the div:
div.innerHTML = DOMPurify.sanitize(htmlMarkup)
In terms of parsing and serializing HTML as well as operations on the DOM tree, the following operations happen in the short snippet above:
- htmlMarkup is parsed into the DOM tree.
- DOMPurify sanitizes the DOM Tree (in a nutshell, the process is about walking through all elements and attributes in the DOM tree, and deleting all nodes that are not in the allow-list).
- The DOM tree is serialized back into the HTML markup.
- After assignment to innerHTML, the browser parses the HTML markup again.
- The parsed DOM tree is appended into the DOM tree of the document.
Let’s see that on a simple example. Assume that our initial markup is AB. In the first step it is parsed into the following tree:
Then, DOMPurify sanitizes it, leaving the following DOM tree:
Then it is serialized to:
And this is what DOMPurify.sanitize returns. Then the markup is parsed again by the browser on assignment to innerHTML:
The DOM tree is identical to the one that DOMPurify worked on, and it is then appended to the document.
In short, we have the following order of operations: parsing ➡️ serialization ➡️ parsing. Intuition suggests that serializing a DOM tree and parsing it again should always return the initial DOM tree, but this is not true at all. There’s even a warning in the HTML spec, in the section about serializing HTML fragments:
It is possible that the output of this algorithm [serializing HTML], if parsed with an HTML parser, will not return the original tree structure. Tree structures that do not roundtrip a serialize and reparse step can also be produced by the HTML parser itself, although such cases are typically non-conforming.
The important take-away is that the serialize-parse roundtrip is not guaranteed to return the original DOM tree (this is also the root cause of a type of XSS known as mutation XSS). While these situations usually result from some kind of parser or serializer error, there are at least two cases of spec-compliant mutations.
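One rough way to look for such mutations yourself is to serialize, reparse, and serialize again, then compare the two strings. The sketch below assumes a browser environment; roundTrip is a hypothetical helper name, not part of DOMPurify or of any browser API. Treat it as a heuristic only: two different DOM trees can, in principle, serialize to the same string, so an unchanged string does not prove that the trees are identical.

```javascript
// Sketch only: roundTrip is a hypothetical helper, not a DOMPurify or browser API.
// `doc` defaults to the page's document. A <template> element is used because
// its contents are inert: images are not loaded and event handlers do not fire.
function roundTrip(markup, doc = globalThis.document) {
  const first = doc.createElement("template");
  first.innerHTML = markup;               // parse
  const serialized = first.innerHTML;     // serialize

  const second = doc.createElement("template");
  second.innerHTML = serialized;          // parse again
  const reserialized = second.innerHTML;  // serialize again

  return {
    serialized,
    reserialized,
    // false means the markup changed across the serialize-reparse roundtrip
    stable: serialized === reserialized,
  };
}
```

In a browser console you can call roundTrip with any markup you are curious about and inspect the stable flag of the returned object.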
Nesting FORM element
One of these cases is related to the FORM element. It is quite a special element in HTML because it cannot be nested in itself. The specification is explicit that it cannot have any descendant that is also a FORM:
This can be confirmed in any browser, with the following markup:
Which would yield the following DOM tree:
The nested form is completely omitted from the DOM tree, as if it had never been there.
Now comes the interesting part. If we keep reading the HTML specification, it actually gives an example showing that, with slightly broken markup containing mis-nested tags, it is possible to create nested forms. Here it comes (taken directly from the spec):
<form id="outer"><div></form><form id="inner"><input>
It yields the following DOM tree, which contains a nested form element:
This is not a bug in any particular browser; it results directly from the HTML spec, and is described in the algorithm of parsing HTML. Here’s the general idea:
- When you open a <form> tag, the parser needs to keep a record of the fact that it was opened, using a form element pointer (that’s what it’s called in the spec). If the pointer is not null, a form element cannot be created.
- When you end a <form> tag, the form element pointer is always set to null.
Thus, going back to the snippet:
<form id="outer"><div></form><form id="inner"><input>
In the beginning, the form element pointer is set to the form with id="outer". Then a div is started, and the </form> end tag sets the form element pointer to null. Because it’s null, the next form with id="inner" can be created; and because we’re currently within the div, we effectively have a form nested in a form.
Now, if we try to serialize the resulting DOM tree, we’ll get the following markup:
<form id="outer"><div><form id="inner"><input></form></div></form>
Note that this markup no longer has any mis-nested tags. And when the markup is parsed again, the following DOM tree is created:
This proves that the serialize-reparse roundtrip is not guaranteed to return the original DOM tree. And even more interestingly, this is basically a spec-compliant mutation.
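To make the role of the form element pointer easier to follow, here is a toy model of it in plain JavaScript. It is a sketch under heavy assumptions, not a real HTML parser: toyParse and toySerialize are hypothetical names, text nodes are ignored, and everything the real algorithm does around scope checks, implied end tags, and templates is left out. It only models the two rules described above, plus the fact that a </form> end tag removes just the form from the stack of open elements while elements opened inside it stay open.

```javascript
// Toy model of the "form element pointer". NOT a real HTML parser:
// it only understands simple tags and silently ignores text.
const VOID_ELEMENTS = new Set(["input", "img", "br"]);

function toyParse(markup) {
  const root = { tag: "#root", attrs: "", children: [] };
  const stack = [root];    // simplified stack of open elements
  let formPointer = null;  // the form element pointer
  const tokenRe = /<(\/?)([a-z]+)([^>]*)>/gi;
  let m;
  while ((m = tokenRe.exec(markup)) !== null) {
    const closing = m[1] === "/";
    const tag = m[2].toLowerCase();
    if (!closing) {
      // Rule 1: a <form> start tag is ignored while the pointer is not null.
      if (tag === "form" && formPointer !== null) continue;
      const el = { tag, attrs: m[3].trim(), children: [] };
      stack[stack.length - 1].children.push(el);
      if (tag === "form") formPointer = el;
      if (!VOID_ELEMENTS.has(tag)) stack.push(el);
    } else if (tag === "form") {
      // Rule 2: </form> always resets the pointer to null. Only the form
      // itself leaves the stack; elements opened inside it remain open.
      const node = formPointer;
      formPointer = null;
      const idx = stack.indexOf(node);
      if (idx > 0) stack.splice(idx, 1);
    } else {
      // Other end tags: close everything up to the nearest matching element.
      for (let i = stack.length - 1; i > 0; i--) {
        if (stack[i].tag === tag) { stack.length = i; break; }
      }
    }
  }
  return root;
}

function toySerialize(node) {
  const inner = node.children.map(toySerialize).join("");
  if (node.tag === "#root") return inner;
  const open = `<${node.tag}${node.attrs ? " " + node.attrs : ""}>`;
  return VOID_ELEMENTS.has(node.tag) ? open : `${open}${inner}</${node.tag}>`;
}

const first = toySerialize(
  toyParse('<form id="outer"><div></form><form id="inner"><input>')
);
console.log(first);
// <form id="outer"><div><form id="inner"><input></form></div></form>

const second = toySerialize(toyParse(first));
console.log(second);
// <form id="outer"><div><input></div></form>
```

The first output contains the nested form; after one more roundtrip the inner form is gone, because its start tag is now seen while the pointer still refers to the outer form.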
Since the very moment I was made aware of this quirk, I’ve been pretty sure that it must be possible to somehow abuse it to bypass HTML sanitizers. And after a long time of not getting any ideas of how to make use of it, I finally stumbled upon another quirk in the HTML specification. But before going into the specific quirk itself, let’s talk about my favorite Pandora’s box of the HTML specification: foreign content.
The HTML parser can create a DOM tree with elements of three namespaces:
- HTML namespace (
- SVG namespace (
- MathML namespace (
By default, all elements are in HTML namespace; however if the parser encounters