Welcome to Software Development on Codidact!
Will you help us build our independent community of developers helping developers? We're small and trying to grow. We welcome questions about all aspects of software development, from design to code to QA and more. Got questions? Got answers? Got code you'd like someone to review? Please join us.
Comments on Can regex be used to check if input conforms to a very strict subset of HTML?
Parent
Can regex be used to check if input conforms to a very strict subset of HTML?
Tldr; I don't need to parse HTML, but I need to check if user submitted input conforms to a very strict subset of HTML. Can regex be a suitable tool for this?
Details
I have a frontend sanitiser that accepts user input from keystrokes and clipboard (as a wysiwyg editor) and outputs HTML with three important guarantees:
- The only tag types will be p, span, br, i, b, and u.
- The only attributes will be style and href.
- Angle brackets will only exist as part of tags; everywhere else they will be represented as HTML entities. There are some other guarantees too, but these are the ones that matter.
The sanitiser then sends the HTML to my backend as a string type. I am confident that once the HTML reaches the backend, if it contains anything that does not conform to the above specifications then it has been tampered with by a malicious user.
Before saving the HTML to disk I need to check if indeed it does conform to the above specifications. I don't need the backend to sanitise non conforming HTML, instead I will simply return an error. It's well established that you shouldn't parse HTML with regex, but I don't really want the overhead of a full HTML parser, especially since I'm not really parsing the HTML. To check if the HTML conforms I plan to use the following regex:
-
Check if any invalid tag types:
/<\/?(?!((span(\s|>))|(p(\s|>))|(a(\s|>))|(b(\s|>))|(i(\s|>))|(u(\s|>))|(br(\s|>))|(\/\s*))).*?/gi
-
Check if any invalid attributes:
/<(.(?!>))*?\s+?(?!(href|style)).*?>/gi
I don't really care if the HTML itself is invalid - if the user somehow submits <p></span></p>>><br href="./></span>
then that's their loss - this is to ensure the HTML can't load any scripts or run any events.
Question
Is using regex in this way a watertight method of ensuring no scripts or events can be attached to user submitted HTML?
Post
I mostly agree with the upshot of hkotsubo's answer, but I want to both tailor the answer more specifically to your question and give some more general advice.
First, the restricted subset you describe is still not a regular language. As a rule of thumb, if you have constructs, e.g. <span>
, which can be arbitrarily nested, you don't have a regular language. While there are various extensions to regular expressions that allow them to parse non-regular languages of various sorts, they are still a poor tool for that. It is very likely that an attempt to capture the actual grammar via regular expressions with extensions will produce a solution that is worse in every way to using an HTML parser as illustrated by hkotsubo.
Taking a much more general perspective now, I strongly recommend a LANGSEC approach. Curing the Vulnerable Parser offers a good overview of both the problem and solutions. To summarize the most relevant of their recommendations: 1) you should explicitly and formally define your input language, 2) you should use appropriate and quality tools/libraries written by experts to handle language recognition, 3) you should completely validate/recognize the input as belonging to the defined language before any additional processing, and 4) an output language should also be defined and there should be a clear, centralized position that produces the final output.
Another LANGSEC recommendation is to prefer simpler (e.g. in the Chomsky hierarchy sense) and restrictive input languages. At first this may seem not to apply to your case as HTML is already defined and is what it is. However, your description strongly suggests a scenario similar to the following: You have a front-end which produces HTML client-side (e.g. via a WYSIWYG editor or from Markdown) which then sends that generated HTML to an API endpoint (e.g. when submitting a comment). You want to validate the input to the API endpoint and, likely, ultimately want to present back the provided HTML in the future (e.g. display the comments).
In this case, while you don't control the definition of the HTML, you do control the input language of your API endpoint which does not need to be HTML. Here we can apply the guidance to make illegal states unrepresentable which overlaps with the guidance of using a restrictive input language. For example, you could have the frond-end pass JSON to the API endpoint which presents a vastly simplified model of the desired HTML, e.g. {"element": "span", "style": "color: red;", "body": "foobar"}
. This JSON can be validated against a schema server-side using one of many JSON schema approaches, e.g. JSON Schema. This replaces the need for HTML parsing server-side with parsing and validating JSON server-side which, while still a context-free language, is nevertheless much simpler than HTML. If needed, HTML can be generated (and cached) from the JSON representation. Ideally, this would be done with a HTML combinator library/builder that ensures the output is the intended HTML, i.e. isn't just blindly concatenating strings.
To reiterate, mismatches between the input language your application actually accepts and what you intended it to are one of the most common sources of security failures. Explicitly and formally specifying the desired input language and generating a recognizer from this specification avoids this. Similarly, ambiguity in how to interpret the language is another very common source of security failures. This can be mitigated by using simpler and narrower input languages. See A Patch for Postel’s Robustness Principle for more discussion about ambiguity. (Postel's Robustness Principle is the (in)famous guideline: "Be conservative in what you do, and liberal in what you accept from others.")
There are some downsides to this approach. The main one is that a non-trivial transformation process leads to the possibility of the generated output not matching the preview. Beyond security and correctness, though, there are also benefits, such as easing processing. It is quite possible that the approach I've outlined can be simpler to develop, easier to maintain, and more efficient in addition to being more secure and correct than ad-hoc stabs at trying to pin down the desired language via regular expressions.
1 comment thread