Can regex be used to check if input conforms to a very strict subset of HTML?
TL;DR: I don't need to parse HTML, but I need to check whether user-submitted input conforms to a very strict subset of HTML. Is regex a suitable tool for this?
I have a frontend sanitiser that accepts user input from keystrokes and the clipboard (it acts as a WYSIWYG editor) and outputs HTML with three important guarantees:
- The only tag types will be p, span, br, i, b, and u.
- The only attributes will be style and href.
- Angle brackets will only exist as part of tags; everywhere else they will be represented as HTML entities.
There are some other guarantees too, but these are the ones that matter.
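To make the three guarantees concrete, here is a minimal Python sketch of how they might be checked on the backend. It is only an illustration under stated assumptions, not the regexes from this question: the names `conforms`, `TAG_RE` and `ATTR_RE` are hypothetical, and it assumes attribute values are always double-quoted and never contain `<`, `>` or `"`.

```python
import re

# Allowlists taken from the guarantees listed above.
ALLOWED_TAGS = {"p", "span", "br", "i", "b", "u"}
ALLOWED_ATTRS = {"style", "href"}

# Assumed tag shape: '<', optional '/', a tag name, zero or more
# name="value" attributes, optional trailing '/', then '>'.
TAG_RE = re.compile(r'<(/?)([a-zA-Z]+)((?:\s+[a-zA-Z-]+="[^"<>]*")*)\s*/?>')
ATTR_RE = re.compile(r'\s+([a-zA-Z-]+)="[^"<>]*"')


def conforms(html: str) -> bool:
    """Return True only if every tag and attribute is on the allowlist
    and no angle bracket appears outside a well-formed tag."""
    pos = 0
    for m in TAG_RE.finditer(html):
        # Text between recognised tags must not contain raw angle brackets.
        gap = html[pos:m.start()]
        if "<" in gap or ">" in gap:
            return False
        closing, name, attrs = m.group(1), m.group(2).lower(), m.group(3)
        if name not in ALLOWED_TAGS:
            return False
        # Closing tags should not carry attributes.
        if closing and attrs:
            return False
        for a in ATTR_RE.finditer(attrs):
            if a.group(1).lower() not in ALLOWED_ATTRS:
                return False
        pos = m.end()
    # Anything after the last tag must also be free of angle brackets.
    tail = html[pos:]
    return "<" not in tail and ">" not in tail
```

Anything that does not look like an allowlisted tag leaves a stray angle bracket in the surrounding text and is rejected, so the check fails closed. Note that this sketch only looks at tag and attribute names; it does not inspect the values of `style` or `href`.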
The sanitiser then sends the HTML to my backend as a string. I am confident that once the HTML reaches the backend, anything that does not conform to the above specifications means it has been tampered with by a malicious user.
Before saving the HTML to disk I need to check that it does indeed conform to the above specifications. I don't need the backend to sanitise non-conforming HTML; I will simply return an error. It's well established that you shouldn't parse HTML with regex, but I don't really want the overhead of a full HTML parser, especially since I'm not really parsing the HTML. To check whether the HTML conforms I plan to use the following regexes:
Check if any invalid tag types:
Check if any invalid attributes:
I don't really care if the HTML itself is invalid - if the user somehow submits <p></span></p>>><br href="./></span> then that's their loss - this is to ensure the HTML can't load any scripts or run any events.
Is using regex in this way a watertight method of ensuring no scripts or events can be attached to user-submitted HTML?