Comments on Can regex be used to check if input conforms to a very strict subset of HTML?

Post

Can regex be used to check if input conforms to a very strict subset of HTML?

+11
−0

TL;DR: I don't need to parse HTML, but I need to check whether user-submitted input conforms to a very strict subset of HTML. Can regex be a suitable tool for this?

Details

I have a frontend sanitiser that accepts user input from keystrokes and the clipboard (as a WYSIWYG editor) and outputs HTML with three important guarantees:

  • The only tag types will be p, span, br, a, i, b, and u.
  • The only attributes will be style and href.
  • Angle brackets will only exist as part of tags; everywhere else they will be represented as HTML entities.

There are some other guarantees too, but these are the ones that matter.
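The three guarantees above can also be checked positively, by accepting only input that matches the expected shape instead of searching for forbidden patterns. What follows is a minimal sketch under stated assumptions, not the sanitiser's actual code: `conformsToSubset`, `TAG` and `ATTR` are hypothetical names, and the sketch assumes lowercase names, double-quoted attribute values and no self-closing tags, none of which the post states. The `a` tag is included on the assumption that the `href` attribute implies it.

```javascript
// Allowlists from the post. "a" is an assumption (implied by href).
const ALLOWED_TAGS = new Set(["p", "span", "br", "i", "b", "u", "a"]);
const ALLOWED_ATTRS = new Set(["style", "href"]);

// One tag in the strict shape assumed here: <name attr="value" ...> or </name>.
// The sticky flag (y) makes exec() match only at TAG.lastIndex.
const TAG = /<(\/?)([a-z]+)((?:\s+[a-z]+="[^"<>]*")*)\s*>/y;
// One attribute inside the attribute part captured by TAG.
const ATTR = /([a-z]+)="[^"<>]*"/g;

function conformsToSubset(html) {
  let pos = 0;
  while (pos < html.length) {
    const ch = html[pos];
    if (ch === ">") return false; // stray ">" outside a tag
    if (ch !== "<") {
      pos += 1; // ordinary text character
      continue;
    }
    TAG.lastIndex = pos;
    const m = TAG.exec(html);
    if (m === null) return false; // "<" that does not start a strict tag
    const [, slash, name, attrs] = m;
    if (!ALLOWED_TAGS.has(name)) return false;
    if (slash !== "" && attrs !== "") return false; // closing tags carry no attributes
    for (const a of attrs.matchAll(ATTR)) {
      if (!ALLOWED_ATTRS.has(a[1])) return false;
    }
    pos = TAG.lastIndex; // continue right after the tag
  }
  return true;
}

console.log(conformsToSubset('<p style="color:red">Hi <b>there</b><br></p>'));
console.log(conformsToSubset('<p onclick="x">hi</p>'));
```

Under these assumptions `<p onclick="x">hi</p>` and `1 > 0` are rejected, while `<a href="javascript:alert(1)">x</a>` is accepted, because the sketch only checks names and structure and never inspects attribute values.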

The sanitiser then sends the HTML to my backend as a string. I am confident that once the HTML reaches the backend, anything in it that does not conform to the above specifications means it has been tampered with by a malicious user.

Before saving the HTML to disk I need to check that it does conform to the above specifications. I don't need the backend to sanitise non-conforming HTML; I will simply return an error. It's well established that you shouldn't parse HTML with regex, but I don't want the overhead of a full HTML parser, especially since I'm not really parsing the HTML. To check whether the HTML conforms, I plan to use the following regexes:

  • Check for any invalid tag types:

    /<\/?(?!((span(\s|>))|(p(\s|>))|(a(\s|>))|(b(\s|>))|(i(\s|>))|(u(\s|>))|(br(\s|>))|(\/\s*))).*?/gi

  • Check for any invalid attributes:

    /<(.(?!>))*?\s+?(?!(href|style)).*?>/gi

I don't really care if the HTML itself is malformed: if the user somehow submits <p></span></p>>><br href="./></span> then that's their loss. The check only has to ensure that the HTML can't load any scripts or run any events.
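One way to probe the two patterns is to run them against a handful of hand-written inputs. The sketch below is only an illustration: `passesPostRegexes`, `INVALID_TAG` and `INVALID_ATTR` are hypothetical names, the sample strings are made up, and the `g` flag is dropped because `test()` on a global regex keeps state in `lastIndex` between calls.

```javascript
// The two patterns from the post, with only the i flag.
const INVALID_TAG = /<\/?(?!((span(\s|>))|(p(\s|>))|(a(\s|>))|(b(\s|>))|(i(\s|>))|(u(\s|>))|(br(\s|>))|(\/\s*))).*?/i;
const INVALID_ATTR = /<(.(?!>))*?\s+?(?!(href|style)).*?>/i;

// Hypothetical helper: true when neither pattern finds anything to object to.
function passesPostRegexes(html) {
  return !INVALID_TAG.test(html) && !INVALID_ATTR.test(html);
}

// Made-up probes, not taken from the post.
const samples = [
  '<p style="color:red">ok</p>',                  // plain allowed markup
  '<script>alert(1)</script>',                    // forbidden tag
  '<p onclick="alert(1)">x</p>',                  // forbidden attribute
  '<a href="javascript:alert(1)">x</a>',          // allowed attribute, script URL
  '<p style="color:red"onclick="alert(1)">x</p>', // no space before onclick
  '<p style="color: red">ok</p>',                 // space inside an allowed value
];

for (const s of samples) {
  console.log(passesPostRegexes(s), s);
}
```

On these samples the first input passes and the `<script>` and spaced `onclick` inputs are rejected, as intended. The `javascript:` link and the `onclick` with no whitespace before it also pass, and the style value containing a space is rejected, so the patterns appear to be neither watertight nor free of false positives.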

Question

Is using regex in this way a watertight method of ensuring no scripts or events can be attached to user-submitted HTML?


1 comment thread

General comments (6 comments)
EJP wrote over 4 years ago

No. HTML is a context-free language, and regular expressions only handle regular languages. You need a parser. Specifically, you need a schema validator.

Peter Taylor wrote over 4 years ago

@EJP, regular expression != regex. Regex languages vary in power from Chomsky level 1 to 4.

Lundin wrote over 4 years ago

In case you haven't read this old SO meme: parsing HTML with regex :)

jla wrote over 4 years ago

I am familiar with TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ :) I'm not actually parsing HTML, just checking the output of a very predictable sanitiser.

BobJarvis wrote over 4 years ago · edited over 4 years ago

Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn

EJP wrote over 4 years ago

@jla Actually you are parsing HTML. No other way to do it.