Can regex be used to check if input conforms to a very strict subset of HTML?

+11
−0

Tldr; I don't need to parse HTML, but I need to check if user submitted input conforms to a very strict subset of HTML. Can regex be a suitable tool for this?

Details

I have a frontend sanitiser that accepts user input from keystrokes and clipboard (as a wysiwyg editor) and outputs HTML with three important guarantees:

  • The only tag types will be p, span, br, i, b, and u.
  • The only attributes will be style and href.
  • Angle brackets will only exist as part of tags; everywhere else they will be represented as HTML entities.

There are some other guarantees too, but these are the ones that matter.

The sanitiser then sends the HTML to my backend as a string type. I am confident that once the HTML reaches the backend, if it contains anything that does not conform to the above specifications then it has been tampered with by a malicious user.

Before saving the HTML to disk I need to check whether it does indeed conform to the above specifications. I don't need the backend to sanitise non-conforming HTML; instead I will simply return an error. It's well established that you shouldn't parse HTML with regex, but I don't really want the overhead of a full HTML parser, especially since I'm not really parsing the HTML. To check if the HTML conforms I plan to use the following regex:

  • Check if any invalid tag types:

    /<\/?(?!((span(\s|>))|(p(\s|>))|(a(\s|>))|(b(\s|>))|(i(\s|>))|(u(\s|>))|(br(\s|>))|(\/\s*))).*?/gi

  • Check if any invalid attributes:

    /<(.(?!>))*?\s+?(?!(href|style)).*?>/gi

I don't really care if the HTML itself is invalid - if the user somehow submits <p></span></p>>><br href="./></span> then that's their loss - this is to ensure the HTML can't load any scripts or run any events.

Question

Is using regex in this way a watertight method of ensuring no scripts or events can be attached to user submitted HTML?

Post
+5
−1

I mostly agree with the upshot of hkotsubo's answer, but I want to both tailor the answer more specifically to your question and give some more general advice.

First, the restricted subset you describe is still not a regular language. As a rule of thumb, if you have constructs, e.g. <span>, which can be arbitrarily nested, you don't have a regular language. While there are various extensions to regular expressions that allow them to parse non-regular languages of various sorts, they are still a poor tool for that. It is very likely that an attempt to capture the actual grammar via regular expressions with extensions will produce a solution that is worse in every way than using an HTML parser as illustrated by hkotsubo.

Taking a much more general perspective now, I strongly recommend a LANGSEC approach. Curing the Vulnerable Parser offers a good overview of both the problem and solutions. To summarize the most relevant of their recommendations: 1) you should explicitly and formally define your input language, 2) you should use appropriate and quality tools/libraries written by experts to handle language recognition, 3) you should completely validate/recognize the input as belonging to the defined language before any additional processing, and 4) an output language should also be defined and there should be a clear, centralized position that produces the final output.
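As a rough illustration of recommendations 2 and 3, the sketch below recognizes the input with an existing parser and collects every violation before anything else touches the string. It uses Python's standard `html.parser` only as a stand-in for whichever parser your backend would use; that module is lenient and is not a spec-conformant HTML5 parser. `AllowlistChecker` and `find_violations` are hypothetical names, and including `a` in the tag set is my assumption, based on `href` being allowed.

```python
from html.parser import HTMLParser

# Assumed allowlists: the question's tags plus `a` (implied by `href`).
ALLOWED_TAGS = {"p", "span", "br", "i", "b", "u", "a"}
ALLOWED_ATTRS = {"style", "href"}


class AllowlistChecker(HTMLParser):
    """Collects everything that falls outside the allowlists."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.violations = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        if tag not in ALLOWED_TAGS:
            self.violations.append(f"tag <{tag}>")
        for name, _value in attrs:
            # Attribute *values* (href schemes, style contents) are not
            # checked here; they would need their own rules.
            if name not in ALLOWED_ATTRS:
                self.violations.append(f"attribute {name!r} on <{tag}>")

    def handle_endtag(self, tag):
        if tag not in ALLOWED_TAGS:
            self.violations.append(f"closing tag </{tag}>")

    def handle_comment(self, data):
        self.violations.append("comment")

    def handle_decl(self, decl):
        self.violations.append("declaration")

    def handle_pi(self, data):
        self.violations.append("processing instruction")


def find_violations(html: str) -> list[str]:
    """Parse the whole input first; an empty list means it was accepted."""
    checker = AllowlistChecker()
    checker.feed(html)
    checker.close()
    return checker.violations


print(find_violations('<p style="color: red">hi<br></p>'))
print(find_violations('<p onclick="x">hi</p>'))
```

The backend would return an error whenever the list is non-empty, and only store the input otherwise.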

Another LANGSEC recommendation is to prefer simpler (e.g. in the Chomsky hierarchy sense) and restrictive input languages. At first this may seem not to apply to your case as HTML is already defined and is what it is. However, your description strongly suggests a scenario similar to the following: You have a front-end which produces HTML client-side (e.g. via a WYSIWYG editor or from Markdown) which then sends that generated HTML to an API endpoint (e.g. when submitting a comment). You want to validate the input to the API endpoint and, likely, ultimately want to present back the provided HTML in the future (e.g. display the comments).

In this case, while you don't control the definition of the HTML, you do control the input language of your API endpoint which does not need to be HTML. Here we can apply the guidance to make illegal states unrepresentable which overlaps with the guidance of using a restrictive input language. For example, you could have the front-end pass JSON to the API endpoint which presents a vastly simplified model of the desired HTML, e.g. {"element": "span", "style": "color: red;", "body": "foobar"}. This JSON can be validated against a schema server-side using one of many JSON schema approaches, e.g. JSON Schema. This replaces the need for HTML parsing server-side with parsing and validating JSON server-side which, while still a context-free language, is nevertheless much simpler than HTML. If needed, HTML can be generated (and cached) from the JSON representation. Ideally, this would be done with an HTML combinator library/builder that ensures the output is the intended HTML, i.e. isn't just blindly concatenating strings.
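A minimal sketch of that idea, using only the Python standard library: a hand-written check stands in for a real JSON Schema validator, and a tiny escaping renderer stands in for a real HTML builder library. The node shape beyond the example above (a `body` that is either a string or a list of strings and nodes) and the names `validate_node` and `render` are assumptions of mine.

```python
import json
from html import escape

# Assumed model: a node is either a text string or an object with
# "element", "body", and optionally "style" / "href".
ALLOWED_ELEMENTS = {"p", "span", "i", "b", "u", "a"}
OPTIONAL_KEYS = ("style", "href")


def validate_node(node) -> None:
    """Raise ValueError unless `node` fits the assumed model."""
    if isinstance(node, str):
        return  # text leaf
    if not isinstance(node, dict):
        raise ValueError(f"unexpected node type: {type(node).__name__}")
    unknown = set(node) - {"element", "body", *OPTIONAL_KEYS}
    if unknown:
        raise ValueError(f"unknown keys: {sorted(unknown)}")
    element = node.get("element")
    if not isinstance(element, str) or element not in ALLOWED_ELEMENTS:
        raise ValueError(f"element not allowed: {element!r}")
    for key in OPTIONAL_KEYS:
        # Only the type is checked; the values would need their own rules.
        if key in node and not isinstance(node[key], str):
            raise ValueError(f"{key} must be a string")
    body = node.get("body")
    if isinstance(body, str):
        return
    if not isinstance(body, list):
        raise ValueError("body must be a string or a list")
    for child in body:
        validate_node(child)


def render(node) -> str:
    """Build HTML from an already validated node, escaping all text."""
    if isinstance(node, str):
        return escape(node)
    tag = node["element"]
    attrs = "".join(
        f' {key}="{escape(node[key], quote=True)}"'
        for key in OPTIONAL_KEYS
        if key in node
    )
    body = node["body"]
    children = [body] if isinstance(body, str) else body
    return f"<{tag}{attrs}>{''.join(render(c) for c in children)}</{tag}>"


payload = json.loads('{"element": "span", "style": "color: red;", "body": "foobar"}')
validate_node(payload)
print(render(payload))  # <span style="color: red;">foobar</span>
```

Because the renderer only emits element names taken from the allowlist and escapes every piece of text and every attribute value, a `script` element or an `onclick` key is rejected before any HTML is produced, rather than having to be detected in it afterwards.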

To reiterate, mismatches between the input language your application actually accepts and the one you intended it to accept are one of the most common sources of security failures. Explicitly and formally specifying the desired input language and generating a recognizer from this specification avoids this. Similarly, ambiguity in how to interpret the language is another very common source of security failures. This can be mitigated by using simpler and narrower input languages. See A Patch for Postel’s Robustness Principle for more discussion about ambiguity. (Postel's Robustness Principle is the (in)famous guideline: "Be conservative in what you do, and liberal in what you accept from others.")

There are some downsides to this approach. The main one is that a non-trivial transformation process leads to the possibility of the generated output not matching the preview. Beyond security and correctness, though, there are also benefits, such as easing processing. It is quite possible that the approach I've outlined can be simpler to develop, easier to maintain, and more efficient in addition to being more secure and correct than ad-hoc stabs at trying to pin down the desired language via regular expressions.

General comments
jla‭ wrote over 4 years ago

Thanks, this is very insightful as to why regex may not be a good solution.

ShowMeBillyJo‭ wrote over 4 years ago

I appreciate the expanded answer and reasoning, but I disagree with the suggestion to send JSON to the server. That would make the browser responsible for parsing the user-generated HTML and creating the JSON payload. Still needs a parser and slows down the client. Better in my mind to just accept HTML, parse it properly on the server side, sanitize and then optionally convert to a different meta-language, then store.

jla‭ wrote over 4 years ago

The client should already be sanitising the HTML, in which case it already has the overhead of a parser.

Derek Elkins‭ wrote over 4 years ago

@ShowMeBillyJo In the scenario I believe the OP is in and the scenarios I've described, you are not starting with HTML but rather generating it client-side, e.g. from Markdown. In this case, no one needs to parse HTML. To me, it's better to offload user-specific non-security-sensitive work to the client. Parsing 1KB of HTML is negligible for the client. Parsing 1 million 1KB chunks of HTML isn't negligible for the server. Nevertheless, the advice could be taken to apply to an internal interface.