

Review Suggested Edit


Rejected.
This suggested edit was rejected about 4 years ago by jla:

Can regex be used to check if input conforms to a very strict subset of HTML?
**TL;DR:** I don't need to parse HTML, but I need to check whether user-submitted input conforms to a very strict subset of HTML. Is regex a suitable tool for this?

# Details

I have a frontend sanitiser that accepts user input from keystrokes and the clipboard (as a WYSIWYG editor) and outputs HTML with three important guarantees:

- The only tag types will be `p`, `span`, `br`, `a`, `i`, `b`, and `u`.
- The only attributes will be `style` and `href`.
- Angle brackets will only exist as part of tags; everywhere else they will be represented as HTML entities.

There are some other guarantees too, but these are the ones that matter.

The sanitiser then sends the HTML to my backend as a string. I am confident that if the HTML reaching the backend contains anything that does not conform to the above specification, it has been tampered with by a malicious user.

Before saving the HTML to disk I need to check that it does indeed conform to the above specification. **I don't need the backend to sanitise non-conforming HTML; I will simply return an error.** It's well established that you shouldn't parse HTML with regex, but I don't want the overhead of a full HTML parser, especially since I'm not really parsing the HTML. To check whether the HTML conforms, I plan to use the following regexes:

- Check for invalid tag types:
  `/<\/?(?!((span(\s|>))|(p(\s|>))|(a(\s|>))|(b(\s|>))|(i(\s|>))|(u(\s|>))|(br(\s|>))|(\/\s*))).*?/gi`
- Check for invalid attributes:
  `/<(.(?!>))*?\s+?(?!(href|style)).*?>/gi`

**I don't really care if the HTML itself is invalid.** If the user somehow submits `<p></span></p>>><br href="./></span>` then that's their loss. The only goal is to ensure that the HTML can't load any scripts or run any events.

# Question

Is using regex in this way a watertight method of ensuring that no scripts or events can be attached to user-submitted HTML?
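One rough way to probe this is to run the two patterns against a few hand-written inputs. The sketch below is only an illustration under stated assumptions: `passesChecks` and the probe strings are hypothetical and not part of my backend, and the `g` flag is dropped because `test()` on a global regex keeps state in `lastIndex` between calls.

```javascript
// The two patterns from above, unchanged apart from dropping the `g` flag
// (RegExp.prototype.test() on a global regex is stateful via lastIndex).
const invalidTag = /<\/?(?!((span(\s|>))|(p(\s|>))|(a(\s|>))|(b(\s|>))|(i(\s|>))|(u(\s|>))|(br(\s|>))|(\/\s*))).*?/i;
const invalidAttr = /<(.(?!>))*?\s+?(?!(href|style)).*?>/i;

// Hypothetical helper: true when neither pattern reports a violation.
function passesChecks(html) {
  return !invalidTag.test(html) && !invalidAttr.test(html);
}

// Hand-written probe inputs (assumptions chosen for illustration only).
const probes = [
  '<p>hello world</p>',                    // true:  plain allowed markup
  '<script>alert(1)</script>',             // false: tag pattern fires
  '<p onclick="alert(1)">',                // false: attribute pattern fires
  '<p style="color: red">x</p>',           // false: space inside the value trips the attribute pattern
  '<a href="javascript:alert(1)">x</a>',   // true:  the href scheme is never inspected
  '<p style="x"onclick="alert(1)">x</p>',  // true:  no whitespace before onclick
];

for (const html of probes) {
  console.log(passesChecks(html), html);
}
```

If those results hold, the last two probes suggest that at least the `href` scheme and attributes written without a separating space are not covered by the two patterns, while the fourth probe suggests some legitimate `style` values would be rejected.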

Suggested about 4 years ago by hkotsubo