Q&A Can regex be used to check if input conforms to a very strict subset of HTML?

posted 4y ago by r~~‭

Answer

#1: Initial revision by

r~~‭ · 2020-09-14T19:41:00Z (over 4 years ago)

Okay, I'll be the contrarian.

For this case, **yes**, I think a regex-based approach can be used to validate these properties. This approach will *not* guarantee that the provided input is valid HTML; in particular, it won't ensure that elements are nested correctly. For that reason, *if* you were planning to inject the submitted HTML into a server-rendered document without further inspection, I would go with the conventional wisdom exemplified in other answers and use a purpose-built HTML parser and validator; in that context, an unbalanced HTML tag might have far-reaching effects. But if your only intended use for this HTML is to send it to dynamic JavaScript client code that assigns it to some element's `innerHTML` property, then this can work, because any unclosed tags will be closed at the boundary of the element whose `innerHTML` is being set.

To ensure that a valid HTML fragment (or an invalid fragment which could be made valid by placing close tags at or before the close of the fragment) meets the conditions you've laid out, it suffices to ensure that every character is either not an angle bracket, or is an angle bracket that begins or ends a tag that meets your criteria (allowed element name, allowed attributes, no special characters inside the tag). The language thus described is regular; here is a regex that tests for it:

```
^([^<>]|<\/?(p|span|br|i|b|u)(\s+(style|href)(\s*=\s*([^'"\s<>=`]+|'[^'<>]*'|"[^"<>]*"))?)*\s*\/?>)*$
```
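As a rough way to try the pattern out, here is a minimal Python sketch that wraps it in a helper. The function name `is_allowed_fragment` and the sample strings are my own illustrations, not part of the question; the pattern itself is copied unchanged.

```python
import re

# The regex from above, copied verbatim. Python's `re` treats `\/` as an
# escaped slash, so no changes are needed.
ALLOWED_FRAGMENT = re.compile(
    r"""^([^<>]|<\/?(p|span|br|i|b|u)(\s+(style|href)(\s*=\s*([^'"\s<>=`]+|'[^'<>]*'|"[^"<>]*"))?)*\s*\/?>)*$"""
)


def is_allowed_fragment(html: str) -> bool:
    """Hypothetical helper: True if every angle bracket belongs to an allowed tag."""
    return ALLOWED_FRAGMENT.fullmatch(html) is not None


if __name__ == "__main__":
    samples = [
        '<p style="color:red">hi</p>',  # allowed element and attribute
        "<p><b>unbalanced</p>",         # nesting is deliberately not checked
        "<script>alert(1)</script>",    # element not on the allow-list
        '<p onclick="x()">hi</p>',      # attribute not on the allow-list
        "1 < 2",                        # stray angle bracket
    ]
    for sample in samples:
        print(is_allowed_fragment(sample), repr(sample))
```

Note that the unbalanced sample passes: the check is only about which tags appear, not how they nest.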

The same regex, laid out with whitespace and `//` comments for readability:

```
^
( [^<>]                // Non-angle bracket characters are always fine.
| <                    // An open angle must start a valid tag
  \/?                  // (might be an open or close tag)
  (p|span|br|i|b|u)    // of one of these types,
  ( \s+(style|href)    // with zero or more of these attributes,
    ( \s*=\s*          // optionally having values,
      ( [^'"\s<>=`]+   // which might be unquoted,
      | '[^'<>]*'      // or single-quoted,
      | "[^"<>]*"      // or double-quoted.
      )
    )?
  )*
  \s*
  \/?                  // Void tag syntax also allowed.
  >                    // End with a close angle.
)*
$
```
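The annotated layout maps fairly directly onto Python's `re.VERBOSE` mode, which ignores whitespace outside character classes and treats `#` as the start of a comment. The sketch below is one way to make the commented form executable, assuming Python is an option for you; `COMPACT_PATTERN`, `VERBOSE_PATTERN`, and `patterns_agree` are names I've made up, and the comments use `#` rather than `//`.

```python
import re

# One-line form, copied verbatim.
COMPACT_PATTERN = re.compile(
    r"""^([^<>]|<\/?(p|span|br|i|b|u)(\s+(style|href)(\s*=\s*([^'"\s<>=`]+|'[^'<>]*'|"[^"<>]*"))?)*\s*\/?>)*$"""
)

# Annotated form; re.VERBOSE drops the layout whitespace and the # comments.
VERBOSE_PATTERN = re.compile(
    r"""
    ^
    ( [^<>]                # Non-angle bracket characters are always fine.
    | <                    # An open angle must start a valid tag
      \/?                  # (might be an open or close tag)
      (p|span|br|i|b|u)    # of one of these types,
      ( \s+(style|href)    # with zero or more of these attributes,
        ( \s*=\s*          # optionally having values,
          ( [^'"\s<>=`]+   # which might be unquoted,
          | '[^'<>]*'      # or single-quoted,
          | "[^"<>]*"      # or double-quoted.
          )
        )?
      )*
      \s*
      \/?                  # Void tag syntax also allowed.
      >                    # End with a close angle.
    )*
    $
    """,
    re.VERBOSE,
)


def patterns_agree(html: str) -> bool:
    """Hypothetical check: both forms give the same verdict on `html`."""
    compact = COMPACT_PATTERN.fullmatch(html) is not None
    verbose = VERBOSE_PATTERN.fullmatch(html) is not None
    return compact == verbose


if __name__ == "__main__":
    for sample in ["<i>ok</i>", "<span style='x'>", "<img src=x>", "a > b"]:
        print(patterns_agree(sample), repr(sample))
```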

A more complex regex could reject malformed constructs like `</span href />` (a closing tag carrying an attribute and a void-style slash). That is invalid HTML, but no browser is at risk of parsing a fragment containing it into an HTML fragment that violates your guarantees, so I don't consider it worth protecting against.

Given that neither `style` nor `href` is reasonably an empty attribute, you might want to remove the `(` … `)?` wrapper that makes attribute values optional; I left it in for completeness.
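If you do drop the optional group, the change is small. Here is a sketch of that variant in Python (the names `STRICT_PATTERN` and `is_allowed_strict` are mine, not from the question):

```python
import re

# Same regex, except the `( ... )?` around the value part is removed, so every
# attribute name must be followed by `=` and a value.
STRICT_PATTERN = re.compile(
    r"""^([^<>]|<\/?(p|span|br|i|b|u)(\s+(style|href)\s*=\s*([^'"\s<>=`]+|'[^'<>]*'|"[^"<>]*"))*\s*\/?>)*$"""
)


def is_allowed_strict(html: str) -> bool:
    """Hypothetical helper: like the original check, but bare attributes fail."""
    return STRICT_PATTERN.fullmatch(html) is not None


if __name__ == "__main__":
    for sample in ['<p style="x">a</p>', "<p style>a</p>", "<br>"]:
        print(is_allowed_strict(sample), repr(sample))
```

An empty quoted value such as `style=""` still matches this variant; only the valueless form `<p style>` is newly rejected.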

I don't guarantee the above regex against bugs! But the general approach should be sound; I will stand behind the claim that recognizing only *tags* (as opposed to properly nesting the tags) that are valid according to your principles is a regular language, and thus fair game for regexes.

([Source for details of HTML syntax](https://html.spec.whatwg.org/#elements-2))
