Dan's Mail Format Site:

Headers: MIME

In the context of Internet mail format, MIME does not refer to a silent entertainer; rather, Multipurpose Internet Mail Extensions are the method by which e-mail was transformed from plain text in the ASCII character set to something much more versatile. Here are the facts about these new features.

E-Mail from Plain to Fancy

Traditionally, e-mail could contain only plain text. Though some of us still prefer it this way, plain-text e-mail could be limiting even for those who disdain multimedia fanciness: messages were limited to the ASCII character set, which is fine for English but not so great for the other languages of the world. And even the traditionalists who stuck to plain text most of the time would sometimes want to send other sorts of data by e-mail, such as family pictures to Mom or a spreadsheet to the boss. To make these things possible, the MIME standard was adopted, superseding a patchwork of earlier half-baked techniques for sending non-text data by e-mail. This was a great success, and almost all mail programs now support it. MIME has enabled all sorts of things ranging from useful to entertaining to pointless to harmful... maybe you'll catch an e-mail virus from Mom and get a pointlessly bloated MS-Word attachment from your boss containing a memo that could and should have been done in plain text... but, as you engage in an e-mail exchange in Chinese with the Hong Kong branch office, aren't you glad that international character sets are now supported via MIME e-mail?

What is MIME?

The MIME standard consists of the definition of a few new message headers, which indicate what sort of content is in a message -- what content type (plain text, HTML, graphics, etc.) and how it's encoded. Some of the content types are "multipart", meaning that they define a complex message structure with more than one part, each of which has headers of its own. These parts can be nested in arbitrarily complex ways, allowing for an enormous degree of versatility in expressing structured data within an e-mail message. As we'll see later when multipart messages are described, each part has its own set of MIME headers, in addition to the headers at the beginning of the message.

MIME not only revolutionized e-mail, it was also adopted as a major part of the World Wide Web; the HTTP protocol uses many of the same MIME headers as e-mail messages.

The MIME Headers

Here are the headers used in a MIME message.

MIME-Version

This must always be present in any MIME message, and the only recognized value for it is 1.0. It indicates the version of the MIME standard that is in use, of which only one exists so far. Since the present MIME standard permits a great degree of extensibility through the definition of new content types, subtypes, encodings, and the like, it's unlikely that any other MIME version will ever need to be defined, and doing so would necessarily break compatibility with all existing mail programs (the standards explicitly say that they can't give any guidance on what a MIME-1.0-compliant program ought to do if it encounters a hypothetical MIME 1.1 or MIME 2.0 message, so it's probably best off throwing up its hands and giving up on it).

Since parenthesized comments are permitted in message headers, the following is a valid MIME-Version header:

MIME-Version: 1.0 (produced by FooSoft 4.5)

Content-Type

This header defines what type of data is being sent, using what is known as "MIME types". A MIME type is a string that identifies a data format. MIME types always have a slash in them, separating a major type from a subtype. For instance, image/jpeg is the MIME type for JPEG images, where the major type is image (other major types are text, audio, video, application, message, and multipart), and the subtype is jpeg (identifying the specific format). Plain text is text/plain, while HTML is text/html. A registry list gives the current "official" MIME types, but some unofficial ones are used too; unofficial and experimental types can be used with names beginning with x-, which signifies a type that's not actually registered. In some cases MIME types without any x- have been popularized despite not appearing on the official list; this is against the standards, but some of these types are in such widespread use that it's impossible to avoid them. These include text/javascript and audio/wav.

Text-based types, like text/plain and text/html, can also have a charset parameter indicating what character encoding is in use, like this:

Content-type: text/plain; charset=iso-8859-1

In this case, the ISO-8859-1 (Latin-1) character set is specified, giving a character range that includes many accented letters in addition to the normal ASCII characters.
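
As a rough illustration (not something the MIME standards themselves prescribe), here is how a program might pull the type, subtype, and charset out of headers like the ones shown above. This sketch assumes Python and its standard email package; the sample message text is made up for the example.

```python
from email import message_from_string

# A made-up message using the header examples from this page.
raw = (
    "MIME-Version: 1.0 (produced by FooSoft 4.5)\n"
    "Content-type: text/plain; charset=iso-8859-1\n"
    "\n"
    "Some plain text goes here.\n"
)

msg = message_from_string(raw)

# Header names are matched case-insensitively, so "Content-type" is found.
full_type = msg.get_content_type()        # "text/plain"
major_type = msg.get_content_maintype()   # "text"
subtype = msg.get_content_subtype()       # "plain"
charset = msg.get_content_charset()       # "iso-8859-1"

# The parenthesized comment is still part of the raw header value.
version = msg["MIME-Version"]

print(full_type, major_type, subtype, charset)
print(version)
```

A message with no Content-Type header at all is treated by this parser as text/plain, which matches the default for plain old non-MIME mail.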

Content-Transfer-Encoding

Since e-mail messages (including MIME messages) are still supposed to be limited to the characters in the ASCII set, to ensure compatibility with programs that might not be able to handle anything else, any non-ASCII content (including text with other characters and binary files) needs to be encoded in a manner that can be transmitted in plain ASCII. Two such encodings are defined in the MIME standards, base64 and quoted-printable, with base64 being more appropriate for binary data and quoted-printable for text data. I discuss these encodings further in the file attachments and character sets pages. When a transfer encoding is used, it's noted in the Content-Transfer-Encoding header:

Content-Transfer-Encoding: quoted-printable
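
To see the difference between the two encodings, here is a small sketch using Python's standard base64 and quopri modules (the choice of language and the sample word are mine, not anything the standards call for). It encodes the same three ISO-8859-1 bytes both ways.

```python
import base64
import quopri

# The French word for "summer" in ISO-8859-1: each accented letter is a
# single byte above 127, so the raw bytes are not plain ASCII.
latin1_bytes = "\u00e9t\u00e9".encode("iso-8859-1")

# Quoted-printable leaves ASCII letters readable and escapes the rest as =XX.
qp = quopri.encodestring(latin1_bytes)

# Base64 turns every 3 bytes into 4 ASCII characters, readable or not.
b64 = base64.b64encode(latin1_bytes)

print(qp)    # the "t" survives as-is; each accented byte becomes "=E9"
print(b64)

# Both encodings are reversible.
assert quopri.decodestring(qp) == latin1_bytes
assert base64.b64decode(b64) == latin1_bytes
```

For mostly-ASCII text, the quoted-printable form stays legible in raw mode, which is why it suits text; base64 costs a fixed one-third in size no matter what the data is, which is why it suits binary files.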

Content-ID

The Content-ID header is primarily of use in multi-part messages (as discussed below); a Content-ID is a unique identifier for a message part, allowing it to be referred to (e.g., in IMG tags of an HTML message allowing the inline display of attached images). The content ID is contained within angle brackets in the Content-ID header. Here is an example:

Content-ID: <5.31.32252.1057009685@server01.example.net>

The standards don't really have a lot to say about exactly what is in a Content-ID; they're only supposed to be globally and permanently unique (meaning that no two are the same, even when generated by different people in different times and places). To achieve this, some conventions have been adopted; one of them is to include an at sign, with the hostname of the computer which created the content ID to the right of it. This ensures the content ID is different from any created by other computers (well, at least it is when the originating computer has a unique Internet hostname; if, as sometimes happens, an anonymous machine inserts something generic like localhost, uniqueness is no longer guaranteed). Then, the part to the left of the at sign is designed to be unique within that machine; a good way to do this is to append several constantly-changing strings that programs have access to. In this case, four different numbers were inserted, with dots between them: the rightmost one is a timestamp of the number of seconds since January 1, 1970; to the left of it is the process ID of the program that generated the message (on servers running Unix or Linux, each process has a number which is unique among the processes in progress at any moment, though they do repeat over time); to the left of that is a count of the number of messages generated so far by the current process; and the leftmost number is the number of parts in the current message that have been generated so far. Put together, these guarantee that the content ID will never repeat; even if multiple messages are generated within the same second, they either have different process IDs or a different count of messages generated by the same process.

That's just an example of how a unique content ID can be generated; different programs do it differently. It's only necessary that they remain unique, a requirement that is necessary to ensure that, even if a bunch of different messages are joined together as part of a bigger multi-part message (as happens when a message is forwarded as an attachment, or assembled into a MIME-format digest), you won't have two parts with the same content ID, which would be likely to confuse mail programs greatly.

There's a similar header called Message-ID which assigns a unique identifier to the message as a whole; this is not actually part of the MIME standards, since it can be used on non-MIME as well as MIME messages. If the originating mail program doesn't add a message ID, a server handling the message later on probably will, since a number of programs (both clients and servers) want every message to have one in order to keep track of them. Some headers discussed in the Other Headers article make use of message IDs.

When referenced in the form of a Web URI (the term "URL" is being deprecated by the newest proposed Web standards in favor of "URI"), content IDs and message IDs are placed within the URI schemes cid and mid respectively, without the angle brackets:

cid:5.31.32252.1057009685@server01.example.net
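
The numbering scheme described above can be sketched in a few lines of Python. The function names make_content_id and cid_url are hypothetical, invented for this illustration, and real programs use whatever scheme they like; Python's own email.utils.make_msgid, for instance, builds its IDs differently.

```python
import os
import socket
import time

def make_content_id(part_number, message_count, hostname=None):
    """Build <parts.messages.pid.timestamp@host>, as in the example above."""
    # Assumption: the machine's fully qualified hostname is unique.
    host = hostname or socket.getfqdn()
    local = "%d.%d.%d.%d" % (
        part_number, message_count, os.getpid(), int(time.time()))
    return "<%s@%s>" % (local, host)

def cid_url(content_id):
    """Turn a Content-ID header value into a cid: reference."""
    # The angle brackets are dropped; percent-encoding of unusual
    # characters is not handled in this sketch.
    return "cid:" + content_id.strip("<>")

example = make_content_id(5, 31, hostname="server01.example.net")
print(example)
print(cid_url("<5.31.32252.1057009685@server01.example.net>"))
```

The second print reproduces the cid reference shown above; the first will differ on every run, which is the whole point.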

Content-Description

Content-Description is a free-form plain text field that can be used to briefly describe the purpose of a MIME message part. Some mail programs display it along with other information about an attachment in the list of things within a message that can be saved or opened.

Content-Disposition

The Content-Disposition header suggests to the receiving program what should be done with a MIME message part. Acceptable values are inline, indicating that the part is intended to be displayed as a portion of a compound document (e.g., an image to be shown within an HTML document) and attachment, indicating an object to be opened or saved separately.

Parameters can be appended, separated by a semicolon from the main value; the most common parameter is filename, which allows you to suggest what name the file should be saved as. (The destination program might not necessarily save it under this name, however; it could be an invalid name under the naming conventions of the destination operating system, and also the end user generally gets a chance to type or select a filename for saving regardless of what the MIME headers suggest.)

A sample Content-Disposition header:

Content-Disposition: attachment; filename="test.jpg"
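
Here is a hedged sketch of how a receiving program might read this header, again assuming Python's standard email package. The safe_name helper is hypothetical; it only illustrates the point that the suggested filename is a hint, not a command.

```python
import os
from email import message_from_string

part = message_from_string(
    "Content-Type: image/jpeg\n"
    'Content-Disposition: attachment; filename="test.jpg"\n'
    "\n"
)

disposition = part.get_content_disposition()   # "attachment"
suggested = part.get_filename()                # "test.jpg"

def safe_name(name, default="attachment.bin"):
    """Hypothetical helper: keep only the last path component of a name."""
    # Treat backslashes as separators too, so that a Windows-style path
    # from the sender cannot smuggle in directory components.
    base = os.path.basename(name.replace("\\", "/")) if name else ""
    return base or default

print(disposition, suggested, safe_name(suggested))
```

A real mail program would go further (checking for reserved names, overlong names, and existing files), but stripping directory components is the bare minimum before saving anything.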

Multipart MIME Message Bodies

As has been mentioned, the MIME types starting with multipart define structures that contain multiple parts. The most commonly used varieties are multipart/alternative, which indicates that each part is a different version of the same document (e.g., one version in text form and one in HTML) and the viewer ought to pick one and display it; multipart/related, which indicates that the parts form a larger unit intended to be displayed together (e.g., an HTML document and its inline images); and multipart/mixed, indicating a "mixed bag" of different types of data, probably intended to be separately opened and/or saved. Another MIME type that can have a complex structure is message/rfc822, which contains an entire e-mail message (this is what's used when you forward a message as an attachment); this message may, in turn, have the MIME features that allow it to contain multiple parts itself, even though message/rfc822 isn't actually in the "multipart" MIME type group itself. A compilation of multiple messages into a structured digest would use the type multipart/digest. A PGP-signed message (using encryption to authenticate its sender) might be sent as a multipart/signed structure, but some mail programs (e.g., Outlook Express) don't like that and end up showing a completely blank message... not very nice.

A multipart unit has some special parameters in its Content-Type header:

Content-Type: multipart/related;
	boundary="----=_NextPart_32252.1057009685.31.001";
	type="multipart/alternative"

The boundary parameter gives a string that's used to mark the boundary between parts of the document. It needs to be some sequence of characters that doesn't occur elsewhere in the document -- including in the boundary markers of any nested multipart units within it. For this reason, boundary strings should be generated to be globally unique like message and content IDs (you never know what messages might get forwarded as an attachment to other messages, so you don't want to reuse the same boundary strings in different messages), and should contain a sufficiently unusual combination of numbers, letters, and punctuation characters that it's extremely unlikely to come up by chance in the text of the message itself.
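
One way to follow this advice is sketched below in Python. The make_boundary function and its numbering scheme are assumptions for illustration, loosely modeled on the boundary string in the example above; mail programs each have their own recipe.

```python
import itertools
import os
import time

# A per-process counter, so two boundaries made in the same second differ.
_sequence = itertools.count(1)

def make_boundary(part_texts):
    """Return a boundary string that appears in none of the given texts."""
    while True:
        candidate = "----=_NextPart_%d.%d.%03d" % (
            os.getpid(), int(time.time()), next(_sequence))
        # Retry in the (very unlikely) case that the text contains it.
        if not any(candidate in text for text in part_texts):
            return candidate

parts = ["Some plain text goes here.", "<html><!-- HTML code --></html>"]
outer = make_boundary(parts)
inner = make_boundary(parts)
print(outer)
print(inner)
```

Because the counter advances on every call, the outer and inner boundaries of a nested message come out different, as they must.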

The type parameter gives the MIME type of the main part within the multipart group; it's used when one of the parts is the main message body and the rest are auxiliary things like attachments or images. In this case, the main part type is yet another multipart group, nested within the first one; the outer multipart group consists of the message body and its related items, while the inner one consists of the alternative (plain text and HTML) versions of the body.

After the message headers, and the blank line that terminates the headers, the multipart message continues like this:


This is a multi-part message in MIME format.

------=_NextPart_32252.1057009685.31.001
Content-Type: multipart/alternative;
	boundary="----=_NextPart_32252.1057009685.31.002"
Content-Description: Message in alternative text and HTML forms


------=_NextPart_32252.1057009685.31.002
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Content-Description: Message in plain-text form

Some plain text goes here.

First comes the line that says "This is a multi-part message in MIME format.", for the benefit of any non-MIME-capable reader who might be wondering what the message is. Then comes the boundary marker, as defined in the headers; you may note that it begins here with six dashes while the definition had only four; that's because the standards call for the boundary to be preceded by two dashes.

The boundary marker is followed by the MIME headers for the next part; it's unnecessary to include the MIME-Version header (it appears only once in the main message headers), but each part has a content type and possibly a description, transfer encoding, and disposition.

If a part is itself a multipart entity, it has its own boundary marker (which must be different from the outer one; note here that the inner boundary ends in "002" while the outer one ended in "001"). The parts of the inner multipart unit follow; here, we see the beginning of a plain text portion. The end of that part and the beginning of the next one looks like this:

More plain text goes here.

------=_NextPart_32252.1057009685.31.002
Content-Type: text/html;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Content-Description: Message in HTML form

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
 "http://www.w3.org/TR/html4/strict.dtd">
<html>
<!-- HTML code goes here -->

Once all parts are complete, the end is marked with a boundary marker followed immediately by two more dashes, at which point the outer multipart group resumes with its next part:

</html>

------=_NextPart_32252.1057009685.31.002--

------=_NextPart_32252.1057009685.31.001
Content-Type: image/gif
Content-Transfer-Encoding: base64
Content-Description: Graph
Content-Disposition: inline; filename="image.gif"
Content-ID: <1.31.32252.1057009685@server01.example.net>

The end of the message is signalled by a top-level boundary marker followed by two dashes. Anything after that closing marker, such as the -- End -- line shown here, is an epilogue that MIME-aware programs ignore:

------=_NextPart_32252.1057009685.31.001--


-- End --
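
To check how the pieces fit together, here is a sketch that feeds a condensed version of the message above to Python's standard email parser and walks the resulting tree. The HTML body is shortened, and the image data is a stand-in (only the six-byte GIF signature, not a real picture).

```python
from email import message_from_string

OUTER = "----=_NextPart_32252.1057009685.31.001"
INNER = "----=_NextPart_32252.1057009685.31.002"
CID = "1.31.32252.1057009685@server01.example.net"

raw = "\n".join([
    "MIME-Version: 1.0",
    "Content-Type: multipart/related;",
    '\tboundary="%s";' % OUTER,
    '\ttype="multipart/alternative"',
    "",
    "This is a multi-part message in MIME format.",
    "",
    "--" + OUTER,                      # marker = two dashes + boundary
    "Content-Type: multipart/alternative;",
    '\tboundary="%s"' % INNER,
    "",
    "--" + INNER,
    'Content-Type: text/plain; charset="iso-8859-1"',
    "Content-Transfer-Encoding: quoted-printable",
    "",
    "Some plain text goes here.",
    "",
    "--" + INNER,
    'Content-Type: text/html; charset="iso-8859-1"',
    "Content-Transfer-Encoding: quoted-printable",
    "",
    "<html><!-- HTML code goes here --></html>",
    "",
    "--" + INNER + "--",               # closing marker of the inner group
    "",
    "--" + OUTER,
    "Content-Type: image/gif",
    "Content-Transfer-Encoding: base64",
    'Content-Disposition: inline; filename="image.gif"',
    "Content-ID: <%s>" % CID,
    "",
    "R0lGODlh",                        # base64 of b"GIF89a", a stand-in
    "",
    "--" + OUTER + "--",               # closing marker of the outer group
    "",
    "-- End --",
    "",
])

msg = message_from_string(raw)

# walk() visits the outer container first, then each nested part in order.
types = [part.get_content_type() for part in msg.walk()]

# Index the parts by Content-ID, the way a cid: reference would find them.
by_cid = {
    part["Content-ID"].strip("<>"): part
    for part in msg.walk()
    if part["Content-ID"]
}
image = by_cid[CID]
image_bytes = image.get_payload(decode=True)   # base64 undone

print(types)
print(msg.preamble.strip())
print(msg.epilogue.strip())
```

The parser keeps the text before the first boundary marker as the preamble and the text after the final closing marker as the epilogue; neither is treated as a message part.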

Non-ASCII Characters in Headers

The MIME standards also provide a way to get characters outside the ASCII set into the headers themselves, which is useful for people whose names include accented letters, for instance. Unfortunately, such headers look like a mess when viewed in raw mode:

=?iso-8859-1?Q?l'=E9t=E9_c'est_arriv=E9!?=

By the standard, an "encoded word" is a sequence of characters that begins with "=?", ends with "?=", and has two "?"s in between. (That means that this sequence of characters had better not occur accidentally in a header, or else it'll be interpreted as an encoded word by MIME-compatible programs.) After the first question mark is the name of the character encoding being used; after the second question mark is the manner in which it's being encoded into plain ASCII (Q = a variant of quoted-printable in which an underscore stands for a space, B = base64); and after the third question mark is the text itself.
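
Decoding such a header is best left to a library. The sketch below assumes Python's standard email.header module; the two header values and the decode_words helper are hypothetical, made up to show one word in each encoding.

```python
from email.header import decode_header, make_header

# Hypothetical header values, one per encoding letter.
q_encoded = "=?iso-8859-1?Q?Caf=E9_menu?="
b_encoded = "=?iso-8859-1?B?Q2Fm6Q==?="

def decode_words(value):
    """Decode any encoded words in a header value into ordinary text."""
    return str(make_header(decode_header(value)))

from_q = decode_words(q_encoded)   # underscore becomes a space, =E9 an e-acute
from_b = decode_words(b_encoded)

# decode_header() exposes the pieces: (raw bytes, charset name) pairs.
pieces = decode_header(q_encoded)
print(pieces)
```

Text that contains no encoded words passes through decode_words unchanged, so it is safe to apply to every displayed header.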

The above format for encoding special characters is the one specified in RFC 2047. However, it has some drawbacks such as the lack of any means of specifying particular encoding information for parameters such as filenames that are appended to MIME headers. For this, a newer and more versatile format was specified later in RFC 2231, which supports providing encoding and language code information for each parameter, and breaking up of such parameter values to multiple lines (often necessary when special characters are used which require lengthy encoding sequences). This looks like:

Content-Type: application/x-stuff;
	title*0*=us-ascii'en'This%20is%20even%20more%20;
	title*1*=%2A%2A%2Afun%2A%2A%2A%20;
	title*2="isn't it!"

This format uses asterisks after the name of the parameter. The number after the first asterisk (0, 1, 2) is a sequence number indicating that the lines are segments of the same parameter, and a trailing asterisk marks a segment whose value is percent-encoded. The first segment begins with an encoding (us-ascii) and a language code (en for English), each followed by a single quote, and then the encoded value.
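
As a sketch of what a compliant parser does with such a parameter, here is the same header run through Python's standard email package (my choice of tool, and the semicolons between segments are my assumption about the proper separator syntax). The segments are joined in order and the percent-encoding is undone.

```python
from email import message_from_string
from email.utils import collapse_rfc2231_value

raw = (
    "Content-Type: application/x-stuff;\n"
    "\ttitle*0*=us-ascii'en'This%20is%20even%20more%20;\n"
    "\ttitle*1*=%2A%2A%2Afun%2A%2A%2A%20;\n"
    "\ttitle*2=\"isn't it!\"\n"
    "\n"
)

msg = message_from_string(raw)

# For a parameter in this format, get_param() returns a 3-tuple of
# (charset, language, value) rather than a plain string.
charset, language, value = msg.get_param("title")

# collapse_rfc2231_value() turns that tuple into one ordinary string.
title = collapse_rfc2231_value(msg.get_param("title"))

print(charset, language)
print(title)
```

The same call works for the filename parameter of Content-Disposition, which is where this format matters most in practice.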

Unfortunately, many mail programs still fail to support this "new" format (RFC 2231 dates from 1997), and will fail to correctly parse the filename. This has led some developers to make their programs output filenames in standards-noncompliant formats that nevertheless work in commonly used programs, when special characters are needed.