Security Tip: Validating HTML & Markdown Input!
[Tip#22] Validating user input is easy enough to forget, even before you add HTML or Markdown into the mix!
Let’s start with this question:
Curious as to what validation strategy can be applied when saving input from a text editor (ie ckeditor, quill) that is html markup?
My initial thinking is that it would have to be a regex pattern that includes all the tags that are enabled, or something along those lines.
I love the way they’re thinking about the output of the editor and how to protect against Cross-Site Scripting (XSS) attacks. It’s far too easy to assume that an editor like CKEditor won’t allow the user to submit an XSS payload (they can modify the submitted HTML in the browser), or that because Markdown isn’t HTML you can’t inject HTML (you can, and it’s even allowed in the spec!). So you very much do need to be thinking like this, and planning how to defend against XSS in your user input.
However, you can’t simply reach for something like a regex to solve the problem (can you ever "simply" use regex?). You’ll have a lot of trouble writing a regex to match all possible XSS payloads without also squashing legitimate tags. Just take a quick browse through this XSS cheat sheet and you’ll realise the wide scope of the task…
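To make that concrete, here is a minimal sketch of how a regex-based filter falls over. The naiveStripScripts() function is a hypothetical example I made up for illustration, not something from a real library, and it only tries to remove script blocks:

```php
<?php

// Hypothetical naive filter: remove <script>...</script> blocks with a regex.
function naiveStripScripts(string $html): string
{
    return (string) preg_replace('/<script\b[^>]*>.*?<\/script>/is', '', $html);
}

// The obvious payload is removed...
$obvious = naiveStripScripts('<p>Hi</p><script>alert(1)</script>');

// ...but an event-handler payload has no <script> tag, so it passes untouched.
$eventHandler = naiveStripScripts('<img src="x" onerror="alert(1)">');

// ...and a nested payload is reassembled into a working tag by the removal itself.
$nested = naiveStripScripts('<scr<script></script>ipt>alert(1)</scr<script></script>ipt>');

echo $obvious, PHP_EOL;      // <p>Hi</p>
echo $eventHandler, PHP_EOL; // <img src="x" onerror="alert(1)">
echo $nested, PHP_EOL;       // <script>alert(1)</script>
```

You could keep patching the pattern for each new payload, but you would be chasing the cheat sheet forever.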
That said, the solution doesn’t have to be hard. This is such a common problem that it’s been solved many times before. 😁
HTML Purifier
If you’re receiving raw HTML from the user, then you can pass it through an HTML purifier. A purifier parses the HTML and strips out everything you haven’t specifically allowed, so you can be very specific about which tags and attributes your users can use.
This is the one I’ve used before, and it seems to be by far the most popular one on Packagist: https://github.com/ezyang/htmlpurifier
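As a rough sketch of what that looks like, assuming you have installed ezyang/htmlpurifier through Composer: the allowlist below (p, b, i, and a with href) is just an example I picked, so adjust it to match whatever your editor is configured to produce.

```php
<?php

require __DIR__ . '/vendor/autoload.php';

$config = HTMLPurifier_Config::createDefault();

// Example allowlist (an assumption for this demo): anything not listed is removed.
$config->set('HTML.Allowed', 'p,b,i,a[href]');

// Turn off the on-disk definition cache so the demo runs without a writable cache directory.
$config->set('Cache.DefinitionImpl', null);

$purifier = new HTMLPurifier($config);

$dirty = '<p onclick="alert(1)">Hello <b>world</b><script>alert("XSS")</script></p>';
$clean = $purifier->purify($dirty);

// The onclick attribute and the script element are not on the allowlist, so they should be gone.
echo $clean, PHP_EOL;
```

Run the purifier on the server when you save or render the content, never only in the browser.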
Stripping HTML in Markdown
If you’re receiving Markdown, you should use a converter that includes the option to strip out all HTML when converting. This will save you having to pass the rendered Markdown into a purifier and double handling the data.
The one I recommend is league/commonmark, which follows the CommonMark spec.
It is important to note that part of the spec is that raw HTML is allowed, so you’ll want to read their security page and ensure you’re enabling the security features:
use League\CommonMark\CommonMarkConverter;

$converter = new CommonMarkConverter([
    'html_input' => 'escape',
    'allow_unsafe_links' => false,
]);

echo $converter->convert('<script>alert("Hello XSS!");</script>');
// &lt;script&gt;alert("Hello XSS!");&lt;/script&gt;
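The allow_unsafe_links option matters just as much as html_input, because a Markdown link can carry a javascript: URL without containing any HTML at all. Here is a small sketch, assuming league/commonmark 2.x installed through Composer, that renders the same link with the option on and off so you can compare the two:

```php
<?php

require __DIR__ . '/vendor/autoload.php';

use League\CommonMark\CommonMarkConverter;

// A link payload with no raw HTML in it, so html_input alone would not catch it.
$markdown = '[click me](javascript:alert(1))';

$unsafe = new CommonMarkConverter([
    'html_input' => 'escape',
    'allow_unsafe_links' => true, // shown here only for comparison
]);

$safe = new CommonMarkConverter([
    'html_input' => 'escape',
    'allow_unsafe_links' => false,
]);

$unsafeHtml = (string) $unsafe->convert($markdown);
$safeHtml = (string) $safe->convert($markdown);

echo $unsafeHtml; // the javascript: URL ends up in the rendered link
echo $safeHtml;   // the javascript: URL should no longer appear
```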
Laravel’s Markdown Helper
Some of you may know that Laravel includes a Markdown helper in the Str class (Str::markdown()), but you might not be aware that by default it does not strip out raw HTML.
Laravel uses CommonMark internally, so you can just pass the converter options when you’re converting your Markdown:
use Illuminate\Support\Str;

Str::markdown('Inject: <script>alert("Hello XSS!");</script>', [
    'html_input' => 'strip',
    'allow_unsafe_links' => false,
]);
// <p>Inject: alert("Hello XSS!");</p>
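Since the safe options are opt-in, it’s easy for one call site to forget them. One way to avoid that is to wrap the call in a small helper, sketched below. The safeMarkdown() name is hypothetical (it is not part of Laravel), and the snippet assumes it runs inside a Laravel app where Illuminate\Support\Str is already autoloaded:

```php
<?php

use Illuminate\Support\Str;

// Hypothetical helper, not part of Laravel: keeps the secure options in one place
// so individual call sites cannot forget to pass them.
function safeMarkdown(string $markdown): string
{
    return Str::markdown($markdown, [
        'html_input' => 'strip',
        'allow_unsafe_links' => false,
    ]);
}

$html = safeMarkdown('Inject: <script>alert("Hello XSS!");</script>');

echo $html;
```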
So, if you’re dealing with a WYSIWYG editor, raw HTML, or Markdown, don’t forget to secure it! Use a purifier or configure the server-side parser, whichever makes sense for your use case - just make sure you secure it!