Security Tip: Avoiding Filename Collisions

[Tip#55] Let's look at my old buddy time(), who always has something for me during my audits. This time it's helping avoid filename collisions?

Security Tip: Avoiding Filename Collisions

Greetings friends! Today we’re going to take a quick look at a common weakness I’ve found around generated filenames. It’s a super easy trap to fall into when you’re focused on solving a single aspect of a more complicated problem.

If you’re waiting on part 2 of my Laracon talk walkthrough, I’ll be picking that up next week with the final challenges for the talk, plus explaining my cookie hijacking trick from the end.

⚠️ Just a reminder that I’m almost booked out for the year, so if you’re interested in a Laravel Security Audit and Penetration Test, you’ll need to reach out now, so we can get the ball rolling. 🕵️



Avoiding Filename Collisions

A very common pattern I’ve come across during my audits is around generating filenames and the desire to avoid collisions.

I’ll often find code that looks like this:

function generateFilename($originalName, $extension): string 
{
    $hash = substr(md5($originalName), 0, 5);
    return time().random_int(100, 1000).$hash.'.'.$extension;
}

When writing the code, the developer is clearly thinking about (in this order):

  1. Avoiding naming collisions
  2. Making the name unique
  3. Limiting the length and/or characters used
  4. Security?

They’ve immediately reached for `time()` (or if they’re lucky `microtime()`) to provide unique names, added in some small randomness and hashed the original name for more uniqueness, and a known length, and called it a day.

We may laugh or groan about it now, but if we’re honest with ourselves as developers, we know we’ve done this exact thing before. The concern was about collisions and uniqueness, and the fact that using randomness opens up a possibility1 for a collision, so the developer brain sought out a different solution…

Ignoring the fact that this doesn’t actually avoid collisions, the format is incredibly predictable. If you were attacking this app, trying to identify the filename format, the timestamp would jump out immediately, and you’d already have various MD52 hashes already generated and be looking for them. All you’d need is the random number in the middle, of which there are 900 possible values?3 That won’t take long to find when you’re looking for a specific file around a specific time.

So what’s the solution to this?

This may be controversial4, but I recommend you just use more randomness. You could back it up with a unique column in a DB5, but I’d just generate a sufficiently long random string and use that for your filenames.

The more random a string you generate, the lower the chances of collisions actually are6, so I suggest just swapping all of the logic out for a lengthy random string.

For example, instead of:

// time().random(100, 1000).$hash.'.'.$extension
1660338149543ab3d.png

Just use this:

// Str::random(16)
CHQvtbkA89jC1Ojl.png

Update

After posting this article, Tim Düsterhus over on the Fediverse made the excellent point about my length of 16 in the above example:

A summary of the posts follow the image.

Tim pointed out that `Str::random(16)` only generates 95 bits of randomness, while 128 bits of randomness is currently considered unguessable by today’s computing standards. By his estimate, this can be achieved by using `Str::random(22)`, or using his recommended approach: `bin2hex(random_bytes(16))`.

He then went on to talk about timing attacks and implementations, which I recommend you check out if you’re interested in diving deeper. Timing attacks won’t be an issue for the filename collisions we talked about above, but they could be an issue for other uses of `Str::random()` in specific implementations.


Looking to learn more?
Security Tip #36: Encoding/Serialising Data
▶️ In Depth #12: "Password Generator" Security Audit

  1. However small…

  2. and SHA1, SHA256, etc…

  3. If my maths is correct?

  4. I’ll happily have this discussion in the comments if you disagree with me.

  5. Which you’ll probably need for a huge filestore anyway.

  6. Someone who loves this sort of maths can confirm, but I’m guessing there becomes a point where multiple requests in the same second (or even same microsecond) are more likely to occur than two matching sufficiently long random strings are generated.