Post History

Q&A Can regex be used to check if input conforms to a very strict subset of HTML?

posted 4y ago by Derek Elkins‭

Answer

#1: Initial revision by

Derek Elkins‭ · 2020-09-13T01:02:18Z (over 4 years ago)

I mostly agree with the upshot of hkotsubo's answer, but I want to both tailor the answer more specifically to your question and give some more general advice.

First, the restricted subset you describe is still not a regular language. As a rule of thumb, if you have constructs, e.g. `<span>`, which can be arbitrarily nested, you don't have a regular language. While there are various extensions to regular expressions that allow them to parse non-regular languages of various sorts, they are still a poor tool for that. It is very likely that an attempt to capture the actual grammar via regular expressions with extensions will produce a solution that is worse in *every* way than using an HTML parser as illustrated by hkotsubo.
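
To make the nesting point concrete, here is a minimal Python sketch. It assumes, purely for illustration, that the subset is bare `<span>` elements (no attributes) and text; the names `ONE_LEVEL`, `SpanChecker`, and `is_nested_spans` are my own. The regex can only hard-code a fixed nesting depth, whereas a parser-driven check tracks depth with a stack. Note that `html.parser` is deliberately lenient about malformed markup, so treat this as a demonstration of the idea rather than a hardened validator.

```python
import re
from html.parser import HTMLParser

# A plain regex has to fix the nesting depth in advance.
# This one accepts text containing at most ONE level of <span>...</span>.
ONE_LEVEL = re.compile(r"(?:[^<>]|<span>[^<>]*</span>)*")


class SpanChecker(HTMLParser):
    """Accepts only bare <span> tags, tracking open tags with a stack."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.ok = True

    def handle_starttag(self, tag, attrs):
        # Assumption: only <span> without attributes is allowed.
        if tag != "span" or attrs:
            self.ok = False
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Every end tag must close the most recently opened tag.
        if not self.stack or self.stack.pop() != tag:
            self.ok = False

    def handle_comment(self, data):
        self.ok = False

    def handle_decl(self, decl):
        self.ok = False

    def handle_pi(self, data):
        self.ok = False


def is_nested_spans(text: str) -> bool:
    checker = SpanChecker()
    checker.feed(text)
    checker.close()
    # Valid only if nothing was rejected and every tag was closed.
    return checker.ok and not checker.stack
```

With this, `is_nested_spans("<span>a<span>b</span></span>")` is accepted at any depth, while `ONE_LEVEL.fullmatch` on the same string fails because the pattern only anticipates one level.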

Taking a much more general perspective now, I strongly recommend a [LANGSEC](http://langsec.org/) approach. [Curing the Vulnerable Parser](http://langsec.org/papers/curing-the-vulnerable-parser.pdf) offers a good overview of both the problem and solutions. To summarize the most relevant of their recommendations: 1) you should explicitly and formally define your input language, 2) you should use appropriate and quality tools/libraries written by experts to handle language recognition, 3) you should completely validate/recognize the input as belonging to the defined language before any additional processing, and 4) an output language should also be defined and there should be a clear, centralized position that produces the final output.

Another LANGSEC recommendation is to prefer simpler (e.g. in the Chomsky hierarchy sense) and more restrictive input languages. At first this may seem not to apply to your case, as HTML is already defined and is what it is. However, your description strongly suggests a scenario similar to the following: you have a front-end which produces HTML client-side (e.g. via a WYSIWYG editor or from Markdown) and then sends that generated HTML to an API endpoint (e.g. when submitting a comment). You want to validate the input to the API endpoint and, likely, ultimately want to present the provided HTML back in the future (e.g. to display the comments).

In this case, while you don't control the definition of the HTML, you *do* control the input language of your API endpoint, which does not need to be HTML. Here we can apply the guidance to make illegal states unrepresentable, which overlaps with the guidance of using a restrictive input language. For example, you could have the front-end pass JSON to the API endpoint which presents a vastly simplified model of the desired HTML, e.g. `{"element": "span", "style": "color: red;", "body": "foobar"}`. This JSON can be validated against a schema server-side using one of many JSON schema approaches, e.g. [JSON Schema](https://json-schema.org/). This replaces the need for HTML parsing server-side with parsing and validating JSON server-side which, while still a context-free language, is nevertheless much simpler than HTML. If needed, HTML can be generated (and cached) from the JSON representation. Ideally, this would be done with an HTML combinator library/builder that ensures the output is the intended HTML, i.e. isn't just blindly concatenating strings.
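
As a rough sketch of that flow in Python, using only the standard library: the hand-written checks below stand in for a real JSON Schema validator, and the whitelists (`ALLOWED_ELEMENTS`, `ALLOWED_STYLES`) and function names are hypothetical choices of mine, not part of any library. The input is fully recognized first, and HTML is only ever produced in one place, with escaping, from the validated model.

```python
import json
from html import escape

# Hypothetical whitelists; a real application would define its own.
ALLOWED_ELEMENTS = {"span"}
ALLOWED_STYLES = {"color: red;", "color: blue;"}
REQUIRED_KEYS = {"element", "body"}
ALLOWED_KEYS = REQUIRED_KEYS | {"style"}


def _require(condition: bool, message: str) -> None:
    if not condition:
        raise ValueError(message)


def parse_node(raw: str) -> dict:
    """Recognize the whole input before doing anything else with it."""
    node = json.loads(raw)  # malformed JSON raises a ValueError subclass
    _require(isinstance(node, dict), "expected a JSON object")
    _require(REQUIRED_KEYS <= set(node) <= ALLOWED_KEYS,
             "missing or unexpected keys")
    element = node["element"]
    _require(isinstance(element, str) and element in ALLOWED_ELEMENTS,
             "element not allowed")
    _require(isinstance(node["body"], str), "body must be a string")
    if "style" in node:
        style = node["style"]
        _require(isinstance(style, str) and style in ALLOWED_STYLES,
                 "style not allowed")
    return node


def render(node: dict) -> str:
    """Single place where HTML is produced, from a validated node only."""
    tag = node["element"]
    attrs = ""
    if "style" in node:
        attrs = ' style="' + escape(node["style"], quote=True) + '"'
    # The body is always escaped, so it can never introduce markup.
    return "<" + tag + attrs + ">" + escape(node["body"]) + "</" + tag + ">"
```

For the example object above, `render(parse_node(...))` yields `<span style="color: red;">foobar</span>`, while a request naming any element outside the whitelist is rejected before rendering is ever reached.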

To reiterate, mismatches between the input language your application actually accepts and the one you intended it to accept are among the most common sources of security failures. Explicitly and formally specifying the desired input language and generating a recognizer from this specification avoids this. Similarly, ambiguity in how to interpret the language is another very common source of security failures. This can be mitigated by using simpler and narrower input languages. See [A Patch for Postel’s Robustness Principle](http://langsec.org/papers/postel-patch.pdf) for more discussion about ambiguity. (Postel's Robustness Principle is the (in)famous guideline: "Be conservative in what you do, and liberal in what you accept from others.")
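
Even JSON leaves room for this kind of ambiguity. One small, concrete case: an object with a repeated key. Python's `json.loads` silently keeps the last value, while another component in the pipeline might keep the first. The sketch below (the names `AMBIGUOUS`, `reject_duplicates`, and `strict_loads` are mine) uses the standard `object_pairs_hook` parameter to reject such input outright instead of picking an interpretation.

```python
import json

# Two components could disagree about which "element" this object has.
AMBIGUOUS = '{"element": "span", "element": "script", "body": "x"}'


def reject_duplicates(pairs):
    """object_pairs_hook: refuse objects that repeat a key."""
    keys = [key for key, _ in pairs]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate key in JSON object")
    return dict(pairs)


def strict_loads(raw: str):
    # Narrower input language: the same JSON, minus repeated keys.
    return json.loads(raw, object_pairs_hook=reject_duplicates)
```

Here `json.loads(AMBIGUOUS)["element"]` is `"script"`, because the default decoder keeps the last occurrence, whereas `strict_loads(AMBIGUOUS)` raises `ValueError`.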

There are some downsides to this approach. The main one is that a non-trivial transformation process leads to the possibility of the generated output not matching the preview. Beyond security and correctness, though, there are also benefits, such as easing processing. It is quite possible that the approach I've outlined can be simpler to develop, easier to maintain, and more efficient in addition to being more secure and correct than ad-hoc stabs at trying to pin down the desired language via regular expressions.
