“What is this file?”
This is a simple question I didn’t care much about. But that was before I had to properly handle file uploading.
This question seems easy; anyone can answer it quickly. test.pdf is a pdf because it ends with pdf, image.jpg is a jpg, and so on. That's when things start to get interesting, because there are a lot of questions you can ask to make someone doubt:
- Are you sure it’s a PDF? Someone could have changed the extension.
- Your machine says it’s a pdf, but can you trust your machine? How does it know that it is a pdf?
- What if someone changed the bytes inside the file to make your machine think it’s a valid pdf?
- What if the file you’re trying to open is malware in disguise?
It’s almost impossible for a system that handles file uploading to be sure, but let’s try to find a way to be as close to the truth as we can.
In this article I’ll share with you how you can detect and verify files that are being uploaded to your server (so server-side). Alright, let’s jump into it!
The content-type header
First things first: when a file is uploaded to your server over HTTP through a browser, the request will normally carry a content-type header for that file. This is a field provided by your browser to help easily identify what kind of file is uploaded. Say I'm uploading my test.pdf: Firefox will add the content-type to the request and say it's an application/pdf.
How does the browser detect the file's content-type?
The simple answer is “like you and me, the browser checks the extension and then decides what kind of file it is”. It sees a .pdf, so it is a pdf.
The truth is a bit more complex than that, and to make it a bit more fun, results are not always the same between browsers. This guessing is often lumped in with mime-sniffing, although strictly speaking that term covers inspecting a file's content rather than its name. Let’s see how some of them do it:
Firefox
// OK. We want to try the following sources of mimetype information, in this
// order:
// 1. defaultMimeEntries array
// 2. OS-provided information
// 3. our "extras" array
// 4. Information from plugins
// 5. The "ext-to-type-mapping" category
// Note that, we are intentionally not looking at the handler service, because
// that can be affected by websites, which leads to undesired behavior.
From function nsExternalHelperAppService::GetTypeFromExtension in mozilla-central/uriloader/exthandler/nsExternalHelperAppService.cpp.
So first they browse through an array, let’s take a look:
/**
* Default extension->mimetype mappings. These are not overridable.
* If you add types here, make sure they are lowercase, or you'll regret it.
*/
static const nsDefaultMimeTypeEntry defaultMimeEntries[] = {
// The following are those extensions that we're asked about during startup,
// sorted by order used
{IMAGE_GIF, "gif"},
{TEXT_XML, "xml"},
{APPLICATION_RDF, "rdf"},
{IMAGE_PNG, "png"},
// -- end extensions used during startup
{TEXT_CSS, "css"},
{IMAGE_JPEG, "jpeg"},
{IMAGE_JPEG, "jpg"},
{IMAGE_SVG_XML, "svg"},
{TEXT_HTML, "html"},
{TEXT_HTML, "htm"},
{APPLICATION_XPINSTALL, "xpi"},
{"application/xhtml+xml", "xhtml"},
{"application/xhtml+xml", "xht"},
{TEXT_PLAIN, "txt"},
{APPLICATION_JSON, "json"},
{APPLICATION_XJAVASCRIPT, "js"},
{APPLICATION_XJAVASCRIPT, "jsm"},
{VIDEO_OGG, "ogv"},
{VIDEO_OGG, "ogg"},
{APPLICATION_OGG, "ogg"},
{AUDIO_OGG, "oga"},
{AUDIO_OGG, "opus"},
{APPLICATION_PDF, "pdf"},
{VIDEO_WEBM, "webm"},
{AUDIO_WEBM, "webm"},
{IMAGE_ICO, "ico"},
{TEXT_PLAIN, "properties"},
{TEXT_PLAIN, "locale"},
{TEXT_PLAIN, "ftl"},
#if defined(MOZ_WMF)
{VIDEO_MP4, "mp4"},
{AUDIO_MP4, "m4a"},
{AUDIO_MP3, "mp3"},
#endif
#ifdef MOZ_RAW
{VIDEO_RAW, "yuv"}
#endif
};
From the same file.
This array seems pretty short compared to the number of existing extensions, but it should do the trick for most uploaded files. What about files that do not match this list?
Then Firefox asks the OS for its information and…
nsresult nsExternalHelperAppService::GetMIMEInfoFromOS(
const nsACString& aMIMEType, const nsACString& aFileExt, bool* aFound,
nsIMIMEInfo** aMIMEInfo) {
*aMIMEInfo = nullptr;
*aFound = false;
return NS_ERROR_NOT_IMPLEMENTED;
}
It’s not implemented ¯\_(ツ)_/¯.
I won’t go through the whole file, but the extra array is the same kind of table as the first one, with a bit more information and more extensions handled. Then it looks for plugins that could have the answer (which is very smart), and at the end, it queries another database with more information to find the answer (available here).
Ok, so Firefox is pretty straightforward: it just looks up the extension in multiple places to find the content-type. Classic, and easy to fool if you ask me.
Let’s take a look at another one.
Chrome
Chrome is very similar to Firefox in its handling:
// 2) File extension -> MIME type. Implemented in GetMimeTypeFromExtension().
//
// Sources are considered in the following order:
//
// a) kPrimaryMappings. Order matters here since file extensions can appear
// multiple times on these lists. The first mapping in order of
// appearance in the list wins.
//
// b) Underlying platform.
//
// c) kSecondaryMappings. Again, the order matters.
I won’t copy-paste the full list here, but you can check it out at this URL. From the comments, Chrome is doing exactly the same thing as Firefox (except maybe for the fact that the underlying platform check is implemented in Chrome).
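To make the "sources tried in order" idea concrete, here is a minimal Python sketch of such a lookup. The two tables are tiny, hypothetical stand-ins for the browsers' hard-coded lists, and Python's mimetypes module plays the role of the "underlying platform"; none of this is the browsers' actual code.

```python
import mimetypes

# Hypothetical, heavily shortened stand-ins for the hard-coded browser tables.
PRIMARY_MAPPINGS = [
    ("image/gif", "gif"),
    ("image/png", "png"),
    ("image/jpeg", "jpeg"),
    ("image/jpeg", "jpg"),
    ("application/pdf", "pdf"),
    ("video/ogg", "ogg"),        # "ogg" appears twice: the first entry wins
    ("application/ogg", "ogg"),
]
SECONDARY_MAPPINGS = [("text/plain", "properties")]


def lookup(table, ext):
    """Return the first MIME type registered for ext, or None."""
    for mime, known_ext in table:
        if known_ext == ext:
            return mime
    return None


def guess_content_type(filename):
    """Guess a content-type from the file name only, never from the content."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return (
        lookup(PRIMARY_MAPPINGS, ext)
        or mimetypes.guess_type(filename)[0]   # stand-in for the platform lookup
        or lookup(SECONDARY_MAPPINGS, ext)
        or "application/octet-stream"          # nothing matched
    )
```

Since only the name is inspected, guess_content_type("test.jpg") answers image/jpeg whatever the file really contains.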
Other browsers
Sadly, Internet Explorer is not open source, but it’s probably doing the same thing as its two competitors.
Since the hard-coded lists are limited, your browser will always send either a content-type found in its list, or the one provided by your OS.
Furthermore, relying on this header is not really viable since it can easily be tricked. Say I take my file virus.exe and rename it test.jpg. Now I upload it to your server and you receive image/jpeg, so you trust it and try to resize it. Your image library will probably fail because it's really a .exe file, so your server fails, and you are sad because you absolutely need to resize every uploaded jpg file.
Even without renaming anything, an attacker can simply intercept the request with a proxy and modify the header, and now virus.exe is uploaded as is. Or they can simply call your server using curl or Postman and set the content-type manually.
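To see how little the header proves, here is a small Python sketch that builds a multipart upload by hand and then parses it the way a server might. The boundary, field name and file bytes are made up for the illustration, and Python's email parser stands in for whatever your framework uses.

```python
from email.parser import BytesParser
from email.policy import default

BOUNDARY = "sketchboundary123"   # hypothetical boundary
FAKE_EXE = b"MZ\x90\x00"         # bytes that start like a Windows executable

# What a client can send: an .exe body that is declared as image/jpeg.
body = (
    f"--{BOUNDARY}\r\n"
    'Content-Disposition: form-data; name="file"; filename="virus.exe"\r\n'
    "Content-Type: image/jpeg\r\n"
    "\r\n"
).encode("ascii") + FAKE_EXE + f"\r\n--{BOUNDARY}--\r\n".encode("ascii")

# What the server sees once the request is parsed.
raw = (
    f"Content-Type: multipart/form-data; boundary={BOUNDARY}\r\n\r\n"
).encode("ascii") + body
message = BytesParser(policy=default).parsebytes(raw)
part = next(message.iter_parts())

declared = part.get_content_type()       # whatever the client wrote
payload = part.get_payload(decode=True)  # the actual uploaded bytes
```

Here declared is image/jpeg because the client said so, while payload still starts with MZ.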
Ok, so we can’t trust the content-type. What do we trust then? Chromium (and hopefully Firefox in the near future?) uses the type provided by the OS (the underlying platform). Let’s check how it works.
File type defined by the OS
For example, on Linux, the file command runs a set of 3 tests, and the first one that succeeds determines what is printed:
- filesystem test: based on the result of a stat system call, to determine whether it is a regular file or a special file.
- magic number test: checks the first bytes of a file to determine what type of file it is, based on a list of known byte patterns and the types they map to.
- language test: if the file does not match the two prior tests, it is examined to determine whether it is a text file or not.
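The three tests above can be sketched in a few lines of Python. This is a rough imitation under simplifying assumptions, not how file is implemented: the labels are illustrative, the magic list has a single entry, and the "language test" is reduced to "does it decode as UTF-8 text?".

```python
import codecs
import os
import stat


def identify(path):
    """Return a rough type label using three tests; the first match wins."""
    # 1. Filesystem test: ask the OS what kind of filesystem object this is.
    mode = os.stat(path).st_mode
    if stat.S_ISDIR(mode):
        return "inode/directory"
    if not stat.S_ISREG(mode):
        return "inode/special"  # hypothetical label for sockets, devices, ...

    with open(path, "rb") as handle:
        head = handle.read(512)

    # 2. Magic number test (one entry here; a real list is much longer).
    if head.startswith(b"%PDF"):
        return "application/pdf"

    # 3. Language test, reduced to "does this look like text?".
    if head and b"\x00" not in head:
        try:
            # The incremental decoder tolerates a character cut at byte 512.
            codecs.getincrementaldecoder("utf-8")().decode(head)
            return "text/plain"
        except UnicodeDecodeError:
            pass
    return "application/octet-stream"
```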
Hey so we can use the magic number to find out what type of file is uploaded! If we do this check server-side, we will have no problem!
Note: “magic number” is a short-hand for file signatures, learn more about it here.
Well yes and no, but first let’s take a look at what a magic number actually is.
A magic number is a set of bytes (sometimes at an offset) that is characteristic of a type of file. Let’s take a pdf file as an example:
» xxd test.pdf | head -n 2
00000000: 2550 4446 2d31 2e34 0a25 f6e4 fcdf 0a31 %PDF-1.4.%.....1
00000010: 2030 206f 626a 0a3c 3c0a 2f54 7970 6520 0 obj.<<./Type
These are the first two lines of the hexadecimal dump of my test.pdf file. It starts with 25 50 44 46, which is the magic number used to determine if a file is a pdf. This magic number is equivalent to %PDF in ASCII. You could modify the magic number of a file, but if you don't know what you're doing beyond that, you will just destroy your file, since a lot of these values are linked together.
Sadly, a magic number is not always enough. A PDF file starts with this magic number, but an Adobe Illustrator file (.ai) also starts with %PDF. We also need to check whether the file's bytes contain the string Adobe Illustrator to know whether this file is a pdf or an Adobe Illustrator file.
See for yourself in the famous file-type library in JavaScript:
if (checkString('%PDF')) {
  // Check if this is an Adobe Illustrator file
  const isAiFile = await checkSequence('Adobe Illustrator', 1350);
  if (isAiFile) {
    return {
      ext: 'ai',
      mime: 'application/postscript'
    };
  }
  // Assume this is just a normal PDF
  return {
    ext: 'pdf',
    mime: 'application/pdf'
  };
}
You can find one of the lists here.
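The same decision can be sketched server-side in Python. This is a simplified imitation, not a port of the library: the function name is made up, and I am assuming that 1350 is how far into the file the search for Adobe Illustrator may look, which is my reading of the snippet and not something it states.

```python
def detect_pdf_or_ai(data: bytes, search_limit: int = 1350):
    """Tell a PDF from an Adobe Illustrator file, or return None for neither.

    search_limit is an assumption about what the 1350 in the snippet means.
    """
    if not data.startswith(b"%PDF"):
        return None
    # Same magic number, so look for the extra marker to tell them apart.
    if b"Adobe Illustrator" in data[:search_limit]:
        return {"ext": "ai", "mime": "application/postscript"}
    # Assume this is just a normal PDF.
    return {"ext": "pdf", "mime": "application/pdf"}
```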
Then again, I can still break your server if I know you are using this: I only need to modify these specific bytes to match what you look for, and you end up with an invalid PDF that is considered valid (and who knows what happens next). Do note that this is probably the hardest check to fool, since modifying the bytes of a file to match a file signature while keeping the file usable can be very painful.
Before leaving you, let’s talk a bit about what could be your worst nightmare: the octet-stream. When an OS or a browser cannot figure out what kind of file it is dealing with, it calls it application/octet-stream. It’s true, your file is a bunch of bytes, but accepting files of type octet-stream on your server is the same as accepting anything and everything, and never knowing what’s in front of you.
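One way to act on this is to accept an upload only when the type detected from its bytes is on an explicit allowlist and agrees with what the client declared. The Python sketch below is one possible policy, not a complete defense: the allowlist and the function name are hypothetical, and sniffed_type is assumed to come from a signature check done on the server.

```python
# Hypothetical allowlist: only the types your application really needs.
ALLOWED_TYPES = {"application/pdf", "image/png", "image/jpeg"}


def accept_upload(declared_type, sniffed_type):
    """Accept only known types whose declared and detected values agree."""
    # Unknown content, or the generic catch-all, is refused outright.
    if sniffed_type is None or sniffed_type == "application/octet-stream":
        return False
    if sniffed_type not in ALLOWED_TYPES:
        return False
    # A mismatch means the header (or the extension behind it) was wrong.
    return declared_type == sniffed_type
```

A matching signature only vouches for the few bytes that were checked, so the file can still be invalid further in.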
We can never know precisely what file was sent to us unless we try to use it and fail. That is the sad part, but we can always try to get as close to reality as we can, for example by using magic numbers as part of our verification on upload. I hope this short article gave you a better view of content uploading and the nightmare of handling all cases.
--Vincent Dufrasnes, Software Engineer @PayFit