Can regex be used to check if input conforms to a very strict subset of HTML?
TL;DR: I don't need to parse HTML, but I need to check if user-submitted input conforms to a very strict subset of HTML. Can regex be a suitable tool for this?
Details
I have a frontend sanitiser that accepts user input from keystrokes and clipboard (as a wysiwyg editor) and outputs HTML with three important guarantees:
- The only tag types will be p, span, br, i, b, and u.
- The only attributes will be style and href.
- Angle brackets will only exist as part of tags; everywhere else they will be represented as HTML entities.
There are some other guarantees too, but these are the ones that matter.
The sanitiser then sends the HTML to my backend as a string type. I am confident that once the HTML reaches the backend, if it contains anything that does not conform to the above specifications then it has been tampered with by a malicious user.
Before saving the HTML to disk I need to check that it does indeed conform to the above specifications. I don't need the backend to sanitise non-conforming HTML; instead I will simply return an error. It's well established that you shouldn't parse HTML with regex, but I don't really want the overhead of a full HTML parser, especially since I'm not really parsing the HTML. To check if the HTML conforms I plan to use the following regex:
- Check if any invalid tag types:
  /<\/?(?!((span(\s|>))|(p(\s|>))|(a(\s|>))|(b(\s|>))|(i(\s|>))|(u(\s|>))|(br(\s|>))|(\/\s*))).*?/gi
- Check if any invalid attributes:
  /<(.(?!>))*?\s+?(?!(href|style)).*?>/gi
I don't really care if the HTML itself is invalid - if the user somehow submits <p></span></p>>><br href="./></span> then that's their loss - this is to ensure the HTML can't load any scripts or run any events.
Question
Is using regex in this way a watertight method of ensuring no scripts or events can be attached to user submitted HTML?
4 answers
tl;dr
Although it can be done with regex (and work for "most" cases), I still prefer to use a parser.
Long answer
I'd use something such as DOMParser to do the job:
let validTags = ['p', 'span', 'br', 'i', 'b', 'u'];
let validAttribs = ['style', 'href'];
function validHtml(string) {
    let domparser = new DOMParser();
    let doc = domparser.parseFromString(string, 'application/xml');
    for (const node of doc.querySelectorAll('*')) {
        if (! validTags.includes(node.nodeName)) return false;
        for (const attr of node.attributes) {
            if (! validAttribs.includes(attr.name)) return false;
        }
    }
    return true;
}
console.log(validHtml('<p></span></p>>><br href="./></span>')); // false
console.log(validHtml('<p onclick="alert(\'hi\')"></p>')); // false
console.log(validHtml('<p style="padding: 2px">Lorem <span>ipsum <b>dolor</b> sit <u>amet</u></span></p>')); // true
console.log(validHtml('<p class="whatever">Lorem <span>ipsum <b>dolor</b> sit <u>amet</u></span></p>')); // false
console.log(validHtml('abc')); // false
With this, I can not only easily update the rules (by changing the arrays of valid tags and attributes) but also validate the HTML itself: according to the docs, when the string is invalid, parseFromString returns an error document:
<parsererror xmlns="http://www.mozilla.org/newlayout/xml/parsererror.xml">
(error description)
<sourcetext>(a snippet of the source XML)</sourcetext>
</parsererror>
So, when checking it in the for loop, it will enter the first if (because the tag parsererror is not in the array of valid tags) and it'll return false as well.
You said you don't care if the HTML is valid, but even without that restriction, doing it with regex is - IMO - much worse. I could think of something like this:
function validHtml(string) {
    // matches any tag (opening or closing) whose name is not in the allowed list
    let tags = /<(?!\/?(?:[piu]|span|br?)\b)[^>]*>/;
    // matches an allowed tag whose first attribute is not href or style
    let attributes = /<\b([piu]|span|br?)\b[^>\w]*(?!\b(href|style)\b=[^>]*)\w[^>]*>/;
    // valid only if neither pattern finds a violation
    return (! tags.test(string)) && (! attributes.test(string));
}
console.log(validHtml('<p></span></p>>><br href="./></span>')); // true
console.log(validHtml('<p onclick="alert(\'hi\')"></p>')); // false
console.log(validHtml('<p style="padding: 2px">Lorem <span>ipsum <b>dolor</b> sit <u>amet</u></span></p>')); // true
console.log(validHtml('<p class="whatever">Lorem <span>ipsum <b>dolor</b> sit <u>amet</u></span></p>')); // false
console.log(validHtml('abc')); // true
The only difference between this and the first example is the first and the last cases. In the first case, the HTML is invalid, although it has all the valid tags (so the regex says it's valid). And in the last case, the string is not even HTML at all, but the regex also says it's valid - and this also happens with strings like ' ' and '!@#$%¨&*'.
But the problem here is - IMO - how easy or hard each option is to read, understand and maintain. I think the first one with DOMParser is much easier - and you have more control over the structure (having DOM nodes, you can easily check whatever information they have, making it easier to change the criteria - such as checking attribute values, comments, text nodes and so on).
And I haven't done extensive tests with those regexes, so I'm pretty sure there could be lots of corner cases that they don't catch - which are already handled by an HTML parser.
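As one illustration of such a corner case (a sketch only, reusing the attributes regex from above; the helper name hasInvalidAttribute is made up for this example): the pattern only inspects the first attribute of a tag, so a forbidden attribute placed after an allowed one is not detected.

```javascript
// Same attributes regex as in the regex-based validHtml above.
const attributes = /<\b([piu]|span|br?)\b[^>\w]*(?!\b(href|style)\b=[^>]*)\w[^>]*>/;

// Hypothetical helper: true when the regex finds a forbidden attribute.
function hasInvalidAttribute(html) {
    return attributes.test(html);
}

// A forbidden attribute in first position is caught...
console.log(hasInvalidAttribute('<p onclick="x">')); // true
// ...but the same attribute after an allowed one slips through.
console.log(hasInvalidAttribute('<p style="x" onclick="y">')); // false
```

A parser-based check does not have this blind spot, because it iterates over every attribute of every node.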
Regarding overhead, regex also has its own. If performance is an issue, you should benchmark it anyway. But I believe you should also consider how easy it is to maintain the code - IMO, the regexes above are not trivial to understand.
My conclusion is that using regex might work, but I wouldn't recommend it as "the best way".
I mostly agree with the upshot of hkotsubo's answer, but I want to both tailor the answer more specifically to your question and give some more general advice.
First, the restricted subset you describe is still not a regular language. As a rule of thumb, if you have constructs, e.g. <span>, which can be arbitrarily nested, you don't have a regular language. While there are various extensions to regular expressions that allow them to parse non-regular languages of various sorts, they are still a poor tool for that. It is very likely that an attempt to capture the actual grammar via regular expressions with extensions will produce a solution that is worse in every way than using an HTML parser as illustrated by hkotsubo.
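To make the nesting point concrete, here is a rough sketch of what checking nesting actually requires: some form of unbounded memory, here a stack. The function name isProperlyNested and the simplistic tag pattern are my own illustration; it assumes, as the question guarantees, that angle brackets only ever appear as part of tags.

```javascript
// Elements that never have a closing tag (only br in the question's subset).
const VOID_ELEMENTS = new Set(['br']);

// Hypothetical helper: checks that every opening tag is closed in order.
function isProperlyNested(html) {
    const stack = [];
    // Naive tag tokenizer: optional slash, a tag name, then anything up to '>'.
    const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^<>]*>/g;
    for (const [, slash, rawName] of html.matchAll(tagPattern)) {
        const name = rawName.toLowerCase();
        if (VOID_ELEMENTS.has(name)) continue;      // nothing to balance
        if (slash === '') {
            stack.push(name);                        // opening tag
        } else if (stack.pop() !== name) {
            return false;                            // closing tag does not match
        }
    }
    return stack.length === 0;                       // everything was closed
}

console.log(isProperlyNested('<p><span>x</span></p>')); // true
console.log(isProperlyNested('<p></span></p>'));        // false
console.log(isProperlyNested('<span><span></span>'));   // false
```

The regex here only finds individual tags; the stack is what tracks nesting depth, which is exactly the part a regular expression without extensions cannot do.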
Taking a much more general perspective now, I strongly recommend a LANGSEC approach. Curing the Vulnerable Parser offers a good overview of both the problem and solutions. To summarize the most relevant of their recommendations: 1) you should explicitly and formally define your input language, 2) you should use appropriate and quality tools/libraries written by experts to handle language recognition, 3) you should completely validate/recognize the input as belonging to the defined language before any additional processing, and 4) an output language should also be defined and there should be a clear, centralized position that produces the final output.
Another LANGSEC recommendation is to prefer simpler (e.g. in the Chomsky hierarchy sense) and restrictive input languages. At first this may seem not to apply to your case as HTML is already defined and is what it is. However, your description strongly suggests a scenario similar to the following: You have a front-end which produces HTML client-side (e.g. via a WYSIWYG editor or from Markdown) which then sends that generated HTML to an API endpoint (e.g. when submitting a comment). You want to validate the input to the API endpoint and, likely, ultimately want to present back the provided HTML in the future (e.g. display the comments).
In this case, while you don't control the definition of the HTML, you do control the input language of your API endpoint which does not need to be HTML. Here we can apply the guidance to make illegal states unrepresentable which overlaps with the guidance of using a restrictive input language. For example, you could have the front-end pass JSON to the API endpoint which presents a vastly simplified model of the desired HTML, e.g. {"element": "span", "style": "color: red;", "body": "foobar"}. This JSON can be validated against a schema server-side using one of many JSON schema approaches, e.g. JSON Schema. This replaces the need for HTML parsing server-side with parsing and validating JSON server-side which, while still a context-free language, is nevertheless much simpler than HTML. If needed, HTML can be generated (and cached) from the JSON representation. Ideally, this would be done with an HTML combinator library/builder that ensures the output is the intended HTML, i.e. isn't just blindly concatenating strings.
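A minimal sketch of that idea, assuming a node shape that extends the example above: element, optional style and href strings, and a body that is either a string or an array of strings and nested nodes. The function names (isValidNode, renderNode, escapeHtml) and the array form of body are my own illustration, not part of any library; in practice a JSON Schema validator would likely replace the hand-written check.

```javascript
const ALLOWED_ELEMENTS = new Set(['p', 'span', 'br', 'i', 'b', 'u']);
const ALLOWED_KEYS = new Set(['element', 'style', 'href', 'body']);

// Escape text so it can never be interpreted as markup.
function escapeHtml(text) {
    return text
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;')
        .replace(/'/g, '&#39;');
}

// Normalise body to an array of children (hypothetical convention).
function childrenOf(node) {
    if (node.body === undefined) return [];
    return Array.isArray(node.body) ? node.body : [node.body];
}

// Recognise the input language: plain strings or well-formed nodes only.
function isValidNode(node) {
    if (typeof node === 'string') return true;
    if (node === null || typeof node !== 'object' || Array.isArray(node)) return false;
    if (!Object.keys(node).every((key) => ALLOWED_KEYS.has(key))) return false;
    if (!ALLOWED_ELEMENTS.has(node.element)) return false;
    for (const attr of ['style', 'href']) {
        if (attr in node && typeof node[attr] !== 'string') return false;
    }
    return childrenOf(node).every(isValidNode);
}

// Produce the output language from an already validated node.
function renderNode(node) {
    if (typeof node === 'string') return escapeHtml(node);
    let attrs = '';
    for (const attr of ['style', 'href']) {
        if (attr in node) attrs += ` ${attr}="${escapeHtml(node[attr])}"`;
    }
    if (node.element === 'br') return `<br${attrs}>`;
    const inner = childrenOf(node).map(renderNode).join('');
    return `<${node.element}${attrs}>${inner}</${node.element}>`;
}

const input = JSON.parse('{"element": "span", "style": "color: red;", "body": "foobar"}');
if (isValidNode(input)) {
    console.log(renderNode(input)); // <span style="color: red;">foobar</span>
}
```

Note that this only constrains structure. The contents of style and href (for example javascript: URLs) would still need their own checks.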
To reiterate, mismatches between the input language your application actually accepts and what you intended it to are one of the most common sources of security failures. Explicitly and formally specifying the desired input language and generating a recognizer from this specification avoids this. Similarly, ambiguity in how to interpret the language is another very common source of security failures. This can be mitigated by using simpler and narrower input languages. See A Patch for Postel’s Robustness Principle for more discussion about ambiguity. (Postel's Robustness Principle is the (in)famous guideline: "Be conservative in what you do, and liberal in what you accept from others.")
There are some downsides to this approach. The main one is that a non-trivial transformation process leads to the possibility of the generated output not matching the preview. Beyond security and correctness, though, there are also benefits, such as easing processing. It is quite possible that the approach I've outlined can be simpler to develop, easier to maintain, and more efficient in addition to being more secure and correct than ad-hoc stabs at trying to pin down the desired language via regular expressions.
Can regex be used to check if input conforms to a very strict subset of HTML?
The theoretical answer is Yes. The JavaScript regex language is more than powerful enough to parse a recursive grammar.
In practice it is a bad idea.
- Bugs! Writing a regex that can validate arbitrarily nested HTML elements (including the context rules) is complicated. Thoroughly testing the regex is difficult.
- Writing a recursive regex that is not vulnerable to a "backtracking attack" could be difficult. Such an attack would entail crafting some input HTML that would trigger catastrophic backtracking ... in a regex that wasn't designed to defend against this problem.
Okay, I'll be the contrarian.
For this case, yes, I think a regex-based approach can be used to validate these properties. This approach will not guarantee that the provided input is valid HTML; in particular, it won't ensure that elements are nested correctly. For this reason, I would go with the conventional wisdom, as exemplified in other answers, to use a built-for-purpose HTML parser and validator for this if you were planning to inject the submitted HTML into a server-rendered document without further inspection; in such a context, an unbalanced HTML tag might have far-reaching effects. But if your only intended use for this HTML is to send it to some dynamic JavaScript client code which pushes it into some element's innerHTML property, then this can work, because any unclosed tags will be closed at the boundary of the element being innerHTML'd.
To ensure that a valid HTML fragment (or an invalid fragment which could be made valid by placing close tags at or before the close of the fragment) meets the conditions you've laid out, it suffices to ensure that every character is either not an angle bracket, or is an angle bracket that begins or ends a tag that meets your criteria (allowed element name, allowed attributes, no special characters inside the tag). The language thus described is regular; here is a regex that tests for it:
^([^<>]|<\/?(p|span|br|i|b|u)(\s+(style|href)(\s*=\s*([^'"\s<>=`]+|'[^'<>]*'|"[^"<>]*"))?)*\s*\/?>)*$
With whitespace and // comments:
^
( [^<>] // Non-angle bracket characters are always fine.
| < // An open angle must start a valid tag
\/? // (might be an open or close tag)
(p|span|br|i|b|u) // of one of these types,
( \s+(style|href) // with zero or more of these attributes,
( \s*=\s* // optionally having values,
( [^'"\s<>=`]+ // which might be unquoted,
| '[^'<>]*' // or single-quoted,
| "[^"<>]*" // or double-quoted.
)
)?
)*
\s*
\/? // Void tag syntax also allowed.
> // End with a close angle.
)*
$
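As a usage sketch (the constant and function names here are just for illustration), the one-line form of the regex can be wrapped in a small JavaScript function and tried against a few inputs:

```javascript
// The single-line regex from above, as a JavaScript literal.
const allowedFragment = /^([^<>]|<\/?(p|span|br|i|b|u)(\s+(style|href)(\s*=\s*([^'"\s<>=`]+|'[^'<>]*'|"[^"<>]*"))?)*\s*\/?>)*$/;

// Hypothetical wrapper: true when every angle bracket belongs to an allowed tag.
function conformsToSubset(html) {
    return allowedFragment.test(html);
}

console.log(conformsToSubset('<p style="padding: 2px">Lorem <span>ipsum <b>dolor</b></span></p>')); // true
console.log(conformsToSubset('<p onclick="alert(\'hi\')"></p>'));      // false
console.log(conformsToSubset('<p style="x" onclick="y"></p>'));        // false
console.log(conformsToSubset('<script>alert(1)</script>'));            // false
console.log(conformsToSubset('<p></span></p>>><br href="./></span>')); // false (stray > characters)
console.log(conformsToSubset('abc'));                                  // true
```

Because the whole string must match from ^ to $, a single disallowed tag, attribute, or stray angle bracket anywhere causes rejection.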
A more complex regex might exclude invalid constructs like </span href />, but this is invalid HTML and no browser is at risk of parsing a fragment containing the above into an HTML fragment that violates your guarantees, so I don't consider it worth protecting against.
Given that neither style nor href are reasonably empty attributes, you might want to remove the ( and )? making attribute values optional; I left it in for completeness.
I don't guarantee the above regex against bugs! But the general approach should be sound; I will stand behind the claim that recognizing only tags (as opposed to properly nesting the tags) that are valid according to your principles is a regular language, and thus fair game for regexes.
(Source for details of HTML syntax)