Software Development

I mostly agree with the upshot of hkotsubo's answer, but I want to both tailor the answer more specifically to your question and give some more general advice.

First, the restricted subset you describe is still not a regular language. As a rule of thumb, if you have constructs, e.g. <span>, which can be arbitrarily nested, you don't have a regular language. While there are various extensions to regular expressions that allow them to parse non-regular languages of various sorts, they are still a poor tool for that. It is very likely that an attempt to capture the actual grammar via regular expressions with extensions will produce a solution that is worse in every way than using an HTML parser as illustrated by hkotsubo.
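To make the nesting point concrete, here is a minimal sketch in Python using the standard library's `html.parser`. It is not hkotsubo's code, and the subset it checks (only `<span>` elements with an optional `style` attribute) is my assumption about what you want to allow. The names `SpanOnlyChecker` and `is_allowed` are made up for illustration. Note that `html.parser` is a lenient tokenizer rather than a validating HTML5 parser, so treat this as a starting point only.

```python
from html.parser import HTMLParser


class SpanOnlyChecker(HTMLParser):
    """Accepts fragments built only from (possibly nested) <span> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0   # number of currently open <span> elements
        self.ok = True   # flips to False on the first violation

    def handle_starttag(self, tag, attrs):
        # Assumption: only <span>, and only a "style" attribute, are allowed.
        if tag != "span" or any(name != "style" for name, _ in attrs):
            self.ok = False
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1
        if tag != "span" or self.depth < 0:
            self.ok = False

    # Comments, doctypes and processing instructions are outside the subset.
    def handle_comment(self, data):
        self.ok = False

    def handle_decl(self, decl):
        self.ok = False

    def handle_pi(self, data):
        self.ok = False


def is_allowed(fragment):
    checker = SpanOnlyChecker()
    checker.feed(fragment)
    checker.close()
    # Every opened <span> must also have been closed.
    return checker.ok and checker.depth == 0
```

The depth counter is exactly the piece of state a plain regular expression cannot keep: it has to grow without bound as the nesting gets deeper.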

Taking a much more general perspective now, I strongly recommend a LANGSEC approach. Curing the Vulnerable Parser offers a good overview of both the problem and solutions. To summarize the most relevant of their recommendations: 1) you should explicitly and formally define your input language, 2) you should use appropriate and quality tools/libraries written by experts to handle language recognition, 3) you should completely validate/recognize the input as belonging to the defined language before any additional processing, and 4) an output language should also be defined and there should be a clear, centralized position that produces the final output.

Another LANGSEC recommendation is to prefer simpler (e.g. in the Chomsky hierarchy sense) and more restrictive input languages. At first this may seem not to apply to your case, as HTML is already defined and is what it is. However, your description strongly suggests a scenario similar to the following: you have a front-end which produces HTML client-side (e.g. via a WYSIWYG editor or from Markdown) and then sends that generated HTML to an API endpoint (e.g. when submitting a comment). You want to validate the input to the API endpoint and, likely, ultimately want to present the provided HTML back in the future (e.g. to display the comments).

In this case, while you don't control the definition of HTML, you do control the input language of your API endpoint, which does not need to be HTML. Here we can apply the guidance to make illegal states unrepresentable, which overlaps with the guidance to use a restrictive input language. For example, you could have the front-end pass JSON to the API endpoint which presents a vastly simplified model of the desired HTML, e.g. {"element": "span", "style": "color: red;", "body": "foobar"}. This JSON can be validated against a schema server-side using one of many schema validation approaches, e.g. JSON Schema. This replaces the need for HTML parsing server-side with parsing and validating JSON server-side which, while still a context-free language, is nevertheless much simpler than HTML. If needed, HTML can be generated (and cached) from the JSON representation. Ideally, this would be done with an HTML combinator library/builder that ensures the output is the intended HTML, i.e. isn't just blindly concatenating strings.
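As a rough illustration of the validate-then-generate flow, here is a sketch in Python using only the standard library. Everything beyond the example JSON above is my assumption: the allowlist of elements, the tiny allowlist of styles, the choice to let `body` be either a string or a list of strings and nested nodes, and the function names `validate` and `render`. In a real system you would more likely validate with a JSON Schema library and render with a proper HTML builder; the hand-written checks here just keep the example self-contained.

```python
import html
import json
import re

# Assumptions for illustration: only <span>, and only a few colour styles.
ALLOWED_ELEMENTS = {"span"}
STYLE_RE = re.compile(r"color: (red|green|blue);")
ALLOWED_KEYS = {"element", "style", "body"}


def validate(node):
    """Return node unchanged if it fits the model, else raise ValueError."""
    if isinstance(node, str):
        return node  # plain text is always acceptable; it is escaped on output
    if not isinstance(node, dict) or set(node) - ALLOWED_KEYS:
        raise ValueError("unexpected node shape")
    element = node.get("element")
    if not isinstance(element, str) or element not in ALLOWED_ELEMENTS:
        raise ValueError("element not allowed")
    style = node.get("style")
    if style is not None:
        if not isinstance(style, str) or not STYLE_RE.fullmatch(style):
            raise ValueError("style not allowed")
    body = node.get("body")
    if isinstance(body, list):
        for child in body:
            validate(child)
    elif not isinstance(body, str):
        raise ValueError("body must be a string or a list")
    return node


def render(node):
    """Generate HTML from an already validated node, escaping all text."""
    if isinstance(node, str):
        return html.escape(node)
    style = node.get("style")
    attrs = "" if style is None else f' style="{html.escape(style)}"'
    body = node["body"]
    children = body if isinstance(body, list) else [body]
    inner = "".join(render(child) for child in children)
    return f"<{node['element']}{attrs}>{inner}</{node['element']}>"


payload = '{"element": "span", "style": "color: red;", "body": "foobar"}'
print(render(validate(json.loads(payload))))
# prints: <span style="color: red;">foobar</span>
```

The important property is the ordering: the whole input is recognized as belonging to the model before anything is rendered, and there is exactly one place where HTML is produced.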

To reiterate, mismatches between the input language your application actually accepts and the one you intended it to accept are one of the most common sources of security failures. Explicitly and formally specifying the desired input language and generating a recognizer from this specification avoids this. Similarly, ambiguity in how to interpret the language is another very common source of security failures. This can be mitigated by using simpler and narrower input languages. See A Patch for Postel’s Robustness Principle for more discussion about ambiguity. (Postel's Robustness Principle is the (in)famous guideline: "Be conservative in what you do, and liberal in what you accept from others.")

There are some downsides to this approach. The main one is that a non-trivial transformation process makes it possible for the generated output not to match the preview. Beyond security and correctness, though, there are also benefits, such as easier processing. It is quite possible that the approach I've outlined will be simpler to develop, easier to maintain, and more efficient, in addition to being more secure and correct, than ad-hoc stabs at pinning down the desired language via regular expressions.

posted over 4 years ago

CC BY-SA 4.0

Derek Elkins‭


Can regex be used to check if input conforms to a very strict subset of HTML?
