The content upload nightmare - How to properly detect the content of a file

“What is this file?”

This is a simple question I didn’t care much about. But that was before I had to properly handle file uploading.

This question looks easy, and anyone can answer it quickly: test.pdf is a PDF because it ends with .pdf, and image.jpg is a JPG. That's when things start to get interesting, because there are a lot of questions you can ask to make someone doubt:

  • Are you sure it’s a PDF? Someone could have changed the extension.
  • Your machine says it’s a PDF, but can you trust your machine? How does it know that it is a PDF?
  • What if someone changed the bytes inside the file to make your machine think it’s a valid PDF?
  • What if the file you’re trying to open is malware in disguise?

It’s almost impossible for a system that handles file uploading to be sure, but let’s try to find a way to be as close to the truth as we can.

In this article I’ll share with you how you can detect and verify files that are being uploaded to your server (so server-side). Alright, let’s jump into it!

The content-type header

First things first: when a file is uploaded to your server over HTTP from a browser, the request carries a content-type for that file. This is a field provided by the browser to help identify what kind of file is being uploaded. Say I'm uploading my test.pdf: Firefox will add the content-type to the request and say it's application/pdf.
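To make this concrete, here is a minimal sketch (Python, standard library only) of where that browser-supplied value sits in a multipart/form-data upload. The function name declared_content_types, the boundary and the sample body are made up for illustration; a real web framework would normally do this parsing for you.

```python
from email.parser import BytesParser
from email.policy import HTTP


def declared_content_types(content_type_header: str, body: bytes) -> dict:
    """Map each uploaded file name to the content type the client declared for it."""
    # The parser needs the request's Content-Type header to learn the boundary.
    raw = b"Content-Type: " + content_type_header.encode("ascii") + b"\r\n\r\n" + body
    message = BytesParser(policy=HTTP).parsebytes(raw)
    declared = {}
    for part in message.iter_parts():
        filename = part.get_filename()
        if filename is not None:
            # This value is copied from the request as-is; nothing checks the bytes.
            declared[filename] = part.get_content_type()
    return declared


# Hypothetical request body, shaped like what a browser sends for a file input.
SAMPLE_BODY = (
    b"--example-boundary\r\n"
    b'Content-Disposition: form-data; name="file"; filename="test.pdf"\r\n'
    b"Content-Type: application/pdf\r\n"
    b"\r\n"
    b"%PDF-1.4\n(rest of the file)\r\n"
    b"--example-boundary--\r\n"
)

print(declared_content_types("multipart/form-data; boundary=example-boundary", SAMPLE_BODY))
```

The point of the sketch is that the server only reads back a label written by the client; the file bytes play no part in it.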

[Figure: Browser content-type header]

How does the browser detect the file's content-type?

The simple answer is “like you and me, the browser checks the extension and then decides what kind of file it is”. It sees a .pdf, so it is a PDF.

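The truth is a bit more complex than that, and to make it a bit more fun, results are not always the same between browsers. This is often loosely called MIME sniffing, although strictly speaking that term refers to guessing a type from the bytes of a resource rather than from its file name. Let’s look at a couple of browsers: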

Firefox

// OK. We want to try the following sources of mimetype information, in this
// order:
// 1. defaultMimeEntries array
// 2. OS-provided information
// 3. our "extras" array
// 4. Information from plugins
// 5. The "ext-to-type-mapping" category
// Note that, we are intentionally not looking at the handler service, because
// that can be affected by websites, which leads to undesired behavior.

From function nsExternalHelperAppService::GetTypeFromExtension in mozilla-central/uriloader/exthandler/nsExternalHelperAppService.cpp (link).

So first they browse through an array, let’s take a look:

/**
 * Default extension->mimetype mappings. These are not overridable.
 * If you add types here, make sure they are lowercase, or you'll regret it.
 */
static const nsDefaultMimeTypeEntry defaultMimeEntries[] = {
    // The following are those extensions that we're asked about during startup,
    // sorted by order used
    {IMAGE_GIF, "gif"},
    {TEXT_XML, "xml"},
    {APPLICATION_RDF, "rdf"},
    {IMAGE_PNG, "png"},
    // -- end extensions used during startup
    {TEXT_CSS, "css"},
    {IMAGE_JPEG, "jpeg"},
    {IMAGE_JPEG, "jpg"},
    {IMAGE_SVG_XML, "svg"},
    {TEXT_HTML, "html"},
    {TEXT_HTML, "htm"},
    {APPLICATION_XPINSTALL, "xpi"},
    {"application/xhtml+xml", "xhtml"},
    {"application/xhtml+xml", "xht"},
    {TEXT_PLAIN, "txt"},
    {APPLICATION_JSON, "json"},
    {APPLICATION_XJAVASCRIPT, "js"},
    {APPLICATION_XJAVASCRIPT, "jsm"},
    {VIDEO_OGG, "ogv"},
    {VIDEO_OGG, "ogg"},
    {APPLICATION_OGG, "ogg"},
    {AUDIO_OGG, "oga"},
    {AUDIO_OGG, "opus"},
    {APPLICATION_PDF, "pdf"},
    {VIDEO_WEBM, "webm"},
    {AUDIO_WEBM, "webm"},
    {IMAGE_ICO, "ico"},
    {TEXT_PLAIN, "properties"},
    {TEXT_PLAIN, "locale"},
    {TEXT_PLAIN, "ftl"},
#if defined(MOZ_WMF)
    {VIDEO_MP4, "mp4"},
    {AUDIO_MP4, "m4a"},
    {AUDIO_MP3, "mp3"},
#endif
#ifdef MOZ_RAW
    {VIDEO_RAW, "yuv"}
#endif
};

From the same file.

This array seems pretty short compared to the number of existing extensions, but should do the trick for most of the files uploaded. What about files that do not match this list?

Then we check the OS-provided information and…

nsresult nsExternalHelperAppService::GetMIMEInfoFromOS(
    const nsACString& aMIMEType, const nsACString& aFileExt, bool* aFound,
    nsIMIMEInfo** aMIMEInfo) {
  *aMIMEInfo = nullptr;
  *aFound = false;
  return NS_ERROR_NOT_IMPLEMENTED;
}

At least in this class, it’s not implemented ¯\_(ツ)_/¯.

I won’t go through the whole file, but the extras array is similar to the first array, with a bit more information and more extensions handled. Then it looks for plugins that could have the answer (which is very smart), and at the end, it queries another database with more information to find the answer (available here).

OK, so Firefox is pretty straightforward: it just looks up the extension in several places to find the content-type. Classic, and easy to fool if you ask me.
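You can reproduce this kind of extension-only guess outside a browser. Python's standard mimetypes module also maps a file name to a type through a table, so it makes a convenient stand-in (it is not the code Firefox runs, and guess_from_name is just a name chosen for this sketch):

```python
import mimetypes
from typing import Optional


def guess_from_name(filename: str) -> Optional[str]:
    """Guess a MIME type from the file name alone; the file is never opened."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


# Neither file has to exist: only the extension is looked at.
print(guess_from_name("test.pdf"))
print(guess_from_name("virus.jpg"))  # a renamed virus.exe would get the same answer
```

Because only the name is inspected, renaming a file is enough to change the answer.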

Let’s take a look at another one.

Chrome

Chrome is very similar to Firefox in its handling:

// 2) File extension -> MIME type.  Implemented in GetMimeTypeFromExtension().
//
//    Sources are considered in the following order:
//
//    a) kPrimaryMappings.  Order matters here since file extensions can appear
//       multiple times on these lists.  The first mapping in order of
//       appearance in the list wins.
//
//    b) Underlying platform.
//
//    c) kSecondaryMappings.  Again, the order matters.

I won’t copy-paste the full list here, but you can check it out at this URL. From the comments, Chrome does essentially the same thing as Firefox (except, maybe, that the underlying platform check is implemented in Chrome).
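The lookup order described in that comment can be sketched in a few lines. This is only an illustration of "first match wins, then the platform, then a fallback list": the two tables below are tiny made-up samples rather than Chromium's real lists, and platform_lookup uses Python's mimetypes module as a stand-in for asking the OS.

```python
import mimetypes
from typing import Callable, Optional

# Hypothetical, heavily shortened tables in (MIME type, extension) form.
PRIMARY_MAPPINGS = [
    ("video/webm", "webm"),
    ("audio/webm", "webm"),  # shadowed: the first "webm" entry above wins
    ("application/pdf", "pdf"),
]
SECONDARY_MAPPINGS = [
    ("application/x-example", "example"),  # made-up type for the demo
]


def platform_lookup(extension: str) -> Optional[str]:
    """Stand-in for asking the OS; here Python's own extension table is used."""
    mime_type, _encoding = mimetypes.guess_type("file." + extension)
    return mime_type


def first_match(mappings, extension: str) -> Optional[str]:
    """Return the MIME type of the first entry whose extension matches."""
    for mime_type, known_extension in mappings:
        if known_extension == extension:
            return mime_type
    return None


def mime_type_from_extension(
    extension: str,
    platform: Callable[[str], Optional[str]] = platform_lookup,
) -> Optional[str]:
    """Try the primary list, then the platform, then the secondary list."""
    ext = extension.lower().lstrip(".")
    return (
        first_match(PRIMARY_MAPPINGS, ext)
        or platform(ext)
        or first_match(SECONDARY_MAPPINGS, ext)
    )
```

Passing the platform lookup as a parameter is a choice made for this sketch so that each step can be tried in isolation, for example mime_type_from_extension("example", platform=lambda ext: None).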

Other browsers

Sadly, Internet Explorer is not open source, but it’s probably doing the same thing as its two competitors.

Since the hard-coded lists are limited, your browser will generally send either a content-type found in one of its lists or the one provided by your OS.

Furthermore, this solution is not really viable since it can easily be tricked. Say I take my file virus.exe and rename it test.jpg. Now I upload it to your server. You receive image/jpeg, so you trust it and try to resize it. Your image library will probably crash because it's really a .exe file, so your server fails, and you are sad because you absolutely need to resize every uploaded jpg file.

Even without renaming anything, an attacker can simply intercept the request with a proxy and modify the header, and now virus.exe is uploaded under its own name. Or they can simply call your server using curl or Postman and set the content-type manually.
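No special tooling is needed for this. As a rough sketch, any HTTP client lets the caller pick the header freely; the example below uses Python's urllib and only builds the request without sending it. The URL is a placeholder and build_upload_request is a name invented for the example.

```python
import urllib.request


def build_upload_request(url: str, payload: bytes, claimed_type: str) -> urllib.request.Request:
    """Build (but do not send) a POST whose Content-Type is whatever the caller says."""
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": claimed_type},
        method="POST",
    )


# "MZ" is how Windows executables start; the header below claims a JPEG anyway.
request = build_upload_request("https://example.invalid/upload", b"MZ\x90\x00", "image/jpeg")

# urllib stores header names capitalized as "Content-type".
print(request.get_header("Content-type"))
```

Nothing on the client side ties the declared type to the payload, which is why the server cannot treat it as evidence.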

OK, so we can’t trust the content-type. What do we trust then? Chromium (and hopefully Firefox in the near future?) uses the type provided by the OS (the underlying platform). Let’s check how it works.

File type defined by the OS

For example, on Linux, the file command runs three sets of tests in order, and the first one that succeeds determines the type that is printed:

  • filesystem test: based on the result of a stat system call, to determine whether it is a regular file or a special file.
  • magic number test: checks specific bytes of the file (usually near the beginning) against a database of known signatures to determine what type of file it is.
  • language test: if the file does not match the two prior tests, it is examined in order to determine whether it is a text file or not.

[Figure: JPG file that actually is an HEIC file detected by the OS]

More info on the file command
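The three stages can be imitated in a short Python sketch. This is a simplified illustration of the idea and not how file is implemented: the signature table is a tiny hand-picked sample, the 4096-byte read size is arbitrary, and the text check is reduced to "decodes as UTF-8 and has no NUL byte". The name describe is invented for the example.

```python
import codecs
import os
import stat

# Tiny sample of well-known signatures; the real magic database is far larger.
MAGIC_NUMBERS = {
    b"%PDF": "PDF document",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\xff\xd8\xff": "JPEG image",
}


def describe(path: str) -> str:
    """Return a rough description of a path, stopping at the first test that answers."""
    # 1. Filesystem test: ask the OS what kind of object this is.
    mode = os.stat(path).st_mode
    if stat.S_ISDIR(mode):
        return "directory"
    if not stat.S_ISREG(mode):
        return "special file"

    with open(path, "rb") as handle:
        head = handle.read(4096)
    if not head:
        return "empty"

    # 2. Magic number test: compare the first bytes with known signatures.
    for signature, label in MAGIC_NUMBERS.items():
        if head.startswith(signature):
            return label

    # 3. Language test (simplified): is this plausibly text?
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False (the default) tolerates a character cut at the end of the chunk.
        decoder.decode(head)
    except UnicodeDecodeError:
        return "data"
    return "data" if b"\x00" in head else "text"
```

Note that the file name never enters the decision, so a PDF saved as a.jpg is still reported as a PDF document.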

Hey so we can use the magic number to find out what type of file is uploaded! If we do this check server-side, we will have no problem!

Note: “magic number” is a short-hand for file signatures, learn more about it here.

Well, yes and no. But first, let’s take a look at what a magic number actually is.

A magic number is a set of bytes (sometimes at an offset) that is characteristic of a type of file. Let’s take a PDF file as an example:

» xxd test.pdf | head -n 2
00000000: 2550 4446 2d31 2e34 0a25 f6e4 fcdf 0a31  %PDF-1.4.%.....1
00000010: 2030 206f 626a 0a3c 3c0a 2f54 7970 6520   0 obj.<<./Type

These are the first two lines of a hexadecimal dump of my test.pdf file. It starts with 25 50 44 46, which is the magic number used to determine if a file is a PDF; these bytes are the ASCII encoding of %PDF. You could modify the magic number of a file, but unless you know exactly what you're doing, you will just destroy your file, since a lot of these values are linked together.
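The equivalence between the hexadecimal values and %PDF is easy to check. The short Python sketch below rebuilds the first row of the dump above and compares its first four bytes with the signature; starts_with_pdf_magic is a name chosen for the example.

```python
# The bytes 25 50 44 46 from the dump, written as hexadecimal.
PDF_MAGIC = bytes.fromhex("25504446")


def starts_with_pdf_magic(data: bytes) -> bool:
    """True if the data begins with the PDF signature."""
    return data[:4] == PDF_MAGIC


# First row of the xxd output shown above.
first_row = bytes.fromhex("255044462d312e340a25f6e4fcdf0a31")

print(PDF_MAGIC)                         # the same four bytes, shown as ASCII
print(starts_with_pdf_magic(first_row))
```

On a server you would typically read only the first few bytes of the uploaded file and pass them to a check like this one.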

Sadly, a magic number is not always enough. A PDF file starts with this magic number, but an Adobe Illustrator file (.ai) also starts with %PDF. We also need to check whether the file contains the string Adobe Illustrator to know whether it is a PDF or an Adobe Illustrator file.

See for yourself in the popular file-type library for JavaScript:

if (checkString('%PDF')) {
	// Check if this is an Adobe Illustrator file
	const isAiFile = await checkSequence('Adobe Illustrator', 1350);
	if (isAiFile) {
		return {
			ext: 'ai',
			mime: 'application/postscript'
		};
	}
	// Assume this is just a normal PDF
	return {
		ext: 'pdf',
		mime: 'application/pdf'
	};
}

You can find one such list of file signatures here.
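For readers who prefer Python, the same decision can be sketched as follows. This is a loose port for illustration, not the library's code: it assumes that the second argument of checkSequence bounds how many leading bytes are searched, and classify_pdf_like is a name invented here.

```python
from typing import Optional

AI_MARKER = b"Adobe Illustrator"
AI_SEARCH_WINDOW = 1350  # assumption: only this many leading bytes are searched


def classify_pdf_like(data: bytes) -> Optional[dict]:
    """Tell a plain PDF from an Illustrator file; None if the signature is absent."""
    if not data.startswith(b"%PDF"):
        return None
    # Same signature, so a second marker is needed to tell the two apart.
    if AI_MARKER in data[:AI_SEARCH_WINDOW]:
        return {"ext": "ai", "mime": "application/postscript"}
    # Assume this is just a normal PDF.
    return {"ext": "pdf", "mime": "application/pdf"}
```

A file whose marker only appears after the search window would be classified as a plain PDF, which is one more reminder that these checks are heuristics.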

Then again, I can still break your server if I know you are using this: I only need to modify these specific bytes to match what you look for, and you end up with an invalid PDF that is considered valid (and who knows what happens next). Do note that this is probably the hardest part to fake, since modifying the bytes of a file to match a file signature can be very painful.
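Here is a small sketch of that weakness. A file only has to start with the right bytes to pass a signature-only check; making it also behave like a usable file is the painful part. The second function adds one possible extra sanity check, looking for the %%EOF marker that PDF files end with. The 1024-byte window is an arbitrary choice for this example, and both function names are invented.

```python
def has_pdf_signature(data: bytes) -> bool:
    """Signature-only check: just the first four bytes."""
    return data.startswith(b"%PDF")


def looks_like_complete_pdf(data: bytes) -> bool:
    """Slightly stricter: also expect the %%EOF marker near the end of the file."""
    return has_pdf_signature(data) and b"%%EOF" in data[-1024:]


# Arbitrary bytes dressed up with a PDF signature.
forged = b"%PDF-1.4\n" + b"this is not a real document"

print(has_pdf_signature(forged))        # passes the signature-only check
print(looks_like_complete_pdf(forged))  # fails the extra check
```

An attacker who knows about the second check can append %%EOF as well, so each extra test raises the effort required without ever proving that the file is valid.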

Before leaving you, let’s talk a bit about what could be your worst nightmare: application/octet-stream. When an OS or a browser cannot work out what kind of file it is dealing with, it falls back to calling it an octet-stream. It’s true, your file is a bunch of bytes, but accepting files on your server that have the type octet-stream is the same as accepting anything and everything, and never knowing what’s in front of you.
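One way to avoid that trap is to accept only types you explicitly expect, so that octet-stream and anything unknown are rejected by default. A minimal sketch, assuming a hypothetical allowlist and a detected type that comes from your own server-side check:

```python
from typing import Optional

# Hypothetical allowlist: only what the application actually needs.
ALLOWED_TYPES = frozenset({"application/pdf", "image/jpeg", "image/png"})


def is_accepted(detected_type: Optional[str]) -> bool:
    """Accept only allowlisted types; unknown or generic types are refused."""
    if detected_type is None:
        return False
    # Drop parameters such as "; charset=binary" and normalize the case.
    normalized = detected_type.split(";", 1)[0].strip().lower()
    return normalized in ALLOWED_TYPES
```

With this shape, is_accepted("application/octet-stream") is False without octet-stream ever being listed as forbidden.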

We can never know precisely what file was sent to us unless we try to use it and fail. That is the sad part. But we can always try to get as close to reality as we can, for example by using magic numbers as part of our verification on upload. I hope this short article gave you a better view of content uploading and the nightmare of handling all cases.


--Vincent Dufrasnes, Software Engineer @PayFit