
Dan's Mail Format Site:

Body: Character Sets


See also these translations, done by others on their own sites with my permission: Serbo-Croatian, Portuguese (from Travel Ticker), Indonesian (from ChameleonJohn), German (provided by Game Period), Bosnian (by Ratko Kecmanovic), and Japanese (by Daily Deals Coupon).

If you need to use accented letters or mathematical symbols in your messages, or you wonder if there's a way to insert a "euro sign", or you observe that somebody else's message contains garbage where a special character should be, this article will help you understand the issues involved.

Note: Some people are winding up on this page when they search for the string: The message contains Unicode characters and has been sent as a binary attachment. They probably received an e-mail message with this text. It's a virus; don't open the attachment. Real Unicode messages (which are explained below) don't need binary attachments.

Your Computer's Cast of Characters

Computers are very powerful devices. However, they have a very significant limitation: all they are really able to deal with is numbers. Anything else -- words, pictures, sounds, video clips -- needs to be converted into a sequence of numbers in order for a computer to deal with it. That's the job of data format standards: to ensure that different computers and programs agree with one another about what data is represented by a particular bunch of numbers. In this age of "point-and-click" software, users have become accustomed to being able to drag, drop, cut, paste, upload, and download any sort of multimedia. They seldom stop to think about what's actually going on "beneath the hood" of their computer, except when something goes wrong and a data file comes out as a mass of garbage on the computer screen; at that point, knowing how the data is coded is essential to figuring out what failed and how to fix it.

This article concerns itself with how a computer stores and transmits text. (Other types of data are discussed in the file attachments page.) Text is one of the earliest kinds of data that people wanted to store on computers, so developers have been coming up with schemes to represent text as numbers for the last half century. After computer manufacturers devised a few proprietary coding systems, the desirability of a universal character coding standard, used consistently by everybody, led to the creation of ASCII (American Standard Code for Information Interchange) in the early 1960s. For a while, ASCII fought a "VHS vs. Beta"-style battle with contending character codings like EBCDIC and Baudot, but it won out in the end. (However, just as Beta-based video formats survive in specialized professional uses, the other character codings still have their niches; there are IBM mainframes that use EBCDIC, and telecommunications devices for the deaf that use Baudot. Anybody needing to transfer data from these to anything else, however, needs to convert it to ASCII.) After a few revisions over the years, a form of ASCII known as US-ASCII is now the "common denominator" character set understood by pretty much every computer system in use.

In the ASCII character set, each letter, number, and punctuation mark in a piece of text is represented by a number from 0 to 127. (In the binary code used by computers, this takes 7 bits, or binary digits, to store.) For instance, a capital letter A is represented by the number 65. You can see the importance of consistent character set standards; if another computer used a character encoding that represented the letter Z by the number 65, then anybody trying to read a document transferred to this computer from one that uses ASCII would see a Z everywhere an A was intended by the author. Aristotle and Ayn Rand make a big deal about how "A is A", but if your character sets don't match, A might be Z!
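
You can check these code numbers for yourself in most programming languages; here's a minimal sketch in Python (purely illustrative, any language with character-to-number functions would do):

# The number behind each ASCII character (Python 3)
print(ord('A'))    # 65 -- the number that stands for capital A
print(chr(90))     # 'Z' -- the character stored as the number 90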

While there are 128 characters in the ASCII set, some of them are control characters like tabs and linefeeds (and more exotic stuff like Unit Separator and Device Control 2 which are seldom used these days). The regular characters include the 26-letter alphabet in upper and lower case, the 10 digits, and various common punctuation like periods and semicolons. Normal English-language text can be written very well in "plain" ASCII (though you need to use only "straight" quotes and apostrophes, not the curly sort, which I'll discuss later).

US-ASCII Characters
0 NUL   16 DLE   32 SP   48 0   64 @   80 P   96 `   112 p
1 SOH   17 DC1   33 !   49 1   65 A   81 Q   97 a   113 q
2 STX   18 DC2   34 "   50 2   66 B   82 R   98 b   114 r
3 ETX   19 DC3   35 #   51 3   67 C   83 S   99 c   115 s
4 EOT   20 DC4   36 $   52 4   68 D   84 T   100 d   116 t
5 ENQ   21 NAK   37 %   53 5   69 E   85 U   101 e   117 u
6 ACK   22 SYN   38 &   54 6   70 F   86 V   102 f   118 v
7 BEL   23 ETB   39 '   55 7   71 G   87 W   103 g   119 w
8 BS   24 CAN   40 (   56 8   72 H   88 X   104 h   120 x
9 HT   25 EM   41 )   57 9   73 I   89 Y   105 i   121 y
10 LF   26 SUB   42 *   58 :   74 J   90 Z   106 j   122 z
11 VT   27 ESC   43 +   59 ;   75 K   91 [   107 k   123 {
12 FF   28 FS   44 ,   60 <   76 L   92 \   108 l   124 |
13 CR   29 GS   45 -   61 =   77 M   93 ]   109 m   125 }
14 SO   30 RS   46 .   62 >   78 N   94 ^   110 n   126 ~
15 SI   31 US   47 /   63 ?   79 O   95 _   111 o   127 DEL

Fortunately, ASCII was adopted in a sufficiently universal way that you can be almost certain that anything written using the characters in this set (other than the control characters, anyway) will show up the same way it was written, no matter what systems and programs it's sent through. For e-mail users (yes, I did plan on getting back on-topic for this site eventually!), this means that ASCII characters are the very safest characters to use. If your message consists entirely of the letters, numbers, and punctuation in the ASCII set, you won't have any problem with their readability. (In fact, it's even legal under the e-mail format standards to include the control characters in a message, with the special condition that carriage returns and linefeeds can only occur together to make up a line break, not separately. However, aside from line breaks and tabs, there's really no point to including control characters in e-mail, and programs on the receiving end make no consistent interpretation of them. The formfeed character, #12, does have some traditional use in newsgroups to mark "spoilers" in discussions of books, movies, and the like; some news readers pause for a keypress before continuing from that point, or otherwise obscure what follows the character until you're ready to see it. This feature is less common in current-day mail or news readers, however.)

One thing to note about the control characters is that there is some platform divergence in how a line break is represented; by the traditional standards, the two characters CR (#13) and LF (#10) go together to end a line. Windows systems do it this way (so Microsoft actually follows traditional standards here for a change!), while Unix, Linux and similar systems use only the LF character, and MacOS traditionally only used the CR character. (However, recent MacOS versions are Unix-based and have switched to using the LF character.) This can sometimes cause hassles when text files are transferred between systems, but I haven't noticed any e-mail problems; either all the mail clients and servers follow the standards properly in encoding line breaks regardless of platform, or they're robust enough to recognize the variant breaks of other systems and work transparently with them.
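
If you ever need to clean up line breaks from different platforms yourself, the conversion is trivial; here's a minimal sketch in Python (the function name is my own invention):

# Normalize CRLF (Windows), lone CR (old MacOS), and lone LF (Unix) to LF;
# CRLF must be handled first so its CR isn't converted separately
def normalize_newlines(text):
    return text.replace('\r\n', '\n').replace('\r', '\n')

print(repr(normalize_newlines('one\r\ntwo\rthree\n')))   # 'one\ntwo\nthree\n'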

Tabs (#9) can also be problematic, as programs may differ in how many spaces they make between tab stops.

Beyond ASCII

The rest of the world doesn't all speak English, though, and that's where ASCII becomes problematic. You don't have to be a leftist PC freak to see some cultural bias in giving computers a "standard" character set that represents English very well but omits the letters with accents, umlauts, and other diacritical marks used in many other languages. Also missing are other alphabets such as Greek and Cyrillic, currency symbols other than the dollar sign, and specialized symbols needed for advanced applications such as higher mathematics. For computers to be usable worldwide, it is necessary to go beyond ASCII.

Since the standard byte (unit of data storage) on personal computers is 8 bits, and ASCII uses only 7 bits, the obvious thing to do was to put the eighth bit into use, doubling the number of characters that could be represented. This could be a problem with older software that used the eighth bit as a checksum or mode flag, but it eventually became commonplace for computers to use all eight bits for character storage. Unfortunately, it took a while for a standard to emerge regarding just what characters were in those other 128 positions (representing numbers from 128 to 255). Different platforms used different combinations of accented letters, symbols, box-drawing characters, and other things. The IBM PC text mode had one set, the Macintosh used another, and when Windows came along it had yet a different set. Versions of computer systems intended for the markets of different countries would also vary so that the particular characters needed for the local language would be supported. This wasn't a very good situation for the interchange of data between different systems.

Fortunately, the International Organization for Standardization came to the rescue with a group of standard character sets. (The organization is abbreviated ISO rather than IOS; according to their site, the short name isn't actually meant to stand for the initials at all, so as not to offend the various nationalities who would abbreviate the full name differently in their own languages. Marketing types these days seem to like initialisms and acronyms that don't stand for anything, anyway.) They couldn't just come out with one unified character set, because the different languages of the world have more characters between them than would fit in a single 8-bit encoding. Instead, they came out with a series of character sets (designated as the ISO 8859 series) designed for different groups of languages. The most commonly used one is ISO-8859-1, also known as "Latin-1", which contains characters useful for the languages of Western Europe. (Strictly speaking, this is a "character encoding" rather than a "character set"; purists will point out that the "set", or "repertoire", is the group of available characters, while the "encoding" specifies which numbers correspond to which characters.) ISO-8859-1 is actually the same as the proprietary "Windows-1252" encoding, with one exception: the positions #128 through #159, where Windows puts characters such as the trademark sign (™) and "curly" quotes, are reserved for control characters in ISO-8859-1. Another ISO standard, ISO 6429, gives geeky names and abbreviations to these control characters, like "Reverse Line Feed" and "Control Sequence Introducer". I don't know what programs actually use these control characters, but I don't think it makes any sense to use them in e-mail messages. (Even if it did, it wouldn't be safe, since programs, in Windows at least, tend to assume that those positions hold the proprietary Microsoft characters from the Windows character set, rather than the control characters the standards actually put there.) However, for completeness, I'm including them here in a chart of characters #128 through #255 of the ISO-8859-1 encoding (characters #0-#127 are the same as in US-ASCII).

ISO-8859-1 characters (with ISO 6429 controls)
128 XXX   144 DCS   160 NBSP   176 °   192 À   208 Ð   224 à   240 ð
129 XXX   145 PU1   161 ¡   177 ±   193 Á   209 Ñ   225 á   241 ñ
130 BPH   146 PU2   162 ¢   178 ²   194 Â   210 Ò   226 â   242 ò
131 NBH   147 STS   163 £   179 ³   195 Ã   211 Ó   227 ã   243 ó
132 IND   148 CCH   164 ¤   180 ´   196 Ä   212 Ô   228 ä   244 ô
133 NEL   149 MW   165 ¥   181 µ   197 Å   213 Õ   229 å   245 õ
134 SSA   150 SPA   166 ¦   182 ¶   198 Æ   214 Ö   230 æ   246 ö
135 ESA   151 EPA   167 §   183 ·   199 Ç   215 ×   231 ç   247 ÷
136 HTS   152 SOS   168 ¨   184 ¸   200 È   216 Ø   232 è   248 ø
137 HTJ   153 XXX   169 ©   185 ¹   201 É   217 Ù   233 é   249 ù
138 VTS   154 SCI   170 ª   186 º   202 Ê   218 Ú   234 ê   250 ú
139 PLD   155 CSI   171 «   187 »   203 Ë   219 Û   235 ë   251 û
140 PLU   156 ST   172 ¬   188 ¼   204 Ì   220 Ü   236 ì   252 ü
141 RI   157 OSC   173 SHY   189 ½   205 Í   221 Ý   237 í   253 ý
142 SS2   158 PM   174 ®   190 ¾   206 Î   222 Þ   238 î   254 þ
143 SS3   159 APC   175 ¯   191 ¿   207 Ï   223 ß   239 ï   255 ÿ

The "XXX" control characters, incidentally, aren't used by the porn industry; they're just left undefined by the standard. Anyway, since ISO-8859-1 is just one of several language-specific character encodings, it is necessary for any protocol that sends and receives text to have some manner of indicating which encoding is being used. One possibility is to declare by fiat that one encoding is the standard; ISO-8859-1 (Latin-1) is the de-facto standard these days in most cases where nothing indicates otherwise; the characters in this set are, next to the US-ASCII ones, the "safest" ones for use in text, as most computer systems can understand them. However, this leaves out the other languages represented by different encodings. Fortunately, most protocols, including those for the Web and e-mail, provide for the explicit indication of a character encoding. For e-mail, it is done in the Content-Type header with the addition of a charset parameter. So, to indicate a plain text message in ISO-8859-1 encoding, this appears in the headers:

Content-Type: text/plain; charset=iso-8859-1
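
Mail libraries typically add this parameter for you; here's a minimal sketch using Python's standard email package (the message content is just an example):

# Build a plain-text message labeled with an explicit character encoding
from email.message import EmailMessage

msg = EmailMessage()
msg['Subject'] = 'Charset demo'
msg.set_content('Voilà: un café', charset='iso-8859-1')
print(msg['Content-Type'])    # text/plain; charset="iso-8859-1"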

Quoted Printable

There's just one more problem: the mail format standards (RFC 2822) disallow the use of characters outside the 7-bit ASCII range. The reason is that 8-bit characters might have unpredictable effects on programs and networks unused to them. This is probably more of an abstract academic concern nowadays, but in the not-so-distant past much e-mail was transferred through networks that used the eighth bit as a flag or checksum. To avoid causing problems in such situations, the quoted printable and base64 encoding systems were devised to allow any sort of data to be sent purely as safe ASCII characters. Base64 is designed for transmitting binary data, and is discussed more in the file attachments article. (Some spammers encode their main body text in base64 as an obscuring technique!) Quoted printable is designed for plain-text messages that might contain some non-ASCII characters. The parts of the message composed of normal ASCII printable characters are kept unchanged, while "special" characters (including control characters, and anything above character #127) are encoded as sequences consisting of an equal sign (=) followed by two hexadecimal (base 16) digits (the digits 0 through 9 and the letters A through F). Since the equal sign serves as a special character, it too must be encoded (as "=3D"). A few more rules deal with line breaks and whitespace.

If the receiving mail program understands quoted printable encoding (as almost all do these days), the encoding is undone at the receiving end, so the characters come out the same way they went in. If the recipient's program doesn't understand the encoding (or the message is viewed in raw source-code form), the message will mostly look like ordinary, readable text, but will have a few oddities like equal signs and hex digits interspersed in it, and may also have odd line breaks. (Quoted printable adds line breaks to keep lines within the length limits, marking each added break with an equal sign at the end of the line; these "soft line breaks" are removed at the receiving end.)

This header line is added to indicate that quoted printable encoding is in use:

Content-Transfer-Encoding: quoted-printable
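
Python's standard library happens to include a quoted-printable codec, which makes a handy way to see the encoding in action (a sketch; the sample text is my own):

# Quoted printable: 8-bit Latin-1 bytes become =XX hex sequences
import quopri

original = 'Voilà, a café'.encode('iso-8859-1')
encoded = quopri.encodestring(original)
print(encoded.decode('ascii'))                            # Voil=E0, a caf=E9
print(quopri.decodestring(encoded).decode('iso-8859-1'))  # Voilà, a café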

Onward to Unicode

The standardization of the ISO character encodings helped bring order to the chaos of proprietary vendor-specific character sets, but some people still had a dream of creating a single, unified character set that encompassed the characters needed by all languages. This would obviously take more than 8 bits to represent; Chinese alone has more characters than can fit in a 256-character set. So, when the character standard that would be known as Unicode first took form, it was a 16-bit encoding, taking two bytes per character (twice as much as the 8-bit encodings) and able to represent 65,536 different characters. (As we'll see later, it was ultimately expanded to an even broader range than this.) These characters have numbers (or "code positions") ranging from 0 to 65,535, but are more often given in hexadecimal as 0000 through FFFF. The first 256 positions of Unicode correspond exactly to ISO-8859-1 (Latin-1), making that older standard a subset of Unicode; and since Latin-1 in turn includes US-ASCII in its first 128 positions, that too is encompassed within Unicode. The remaining positions, #256 and beyond, include everything from Greek to Hebrew to Chinese to mathematical symbols to chess pieces... and also the euro sign (€), which Europeans now need to symbolize their unified currency, but which didn't exist when the earlier character set standards were devised.
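
Each character's code position is easy to inspect; a quick sketch in Python, which uses Unicode for all of its strings:

# Unicode code positions; note how ASCII and Latin-1 keep their old numbers
print(hex(ord('A')))   # 0x41   -- same position as in US-ASCII
print(hex(ord('é')))   # 0xe9   -- same position as in ISO-8859-1
print(hex(ord('€')))   # 0x20ac -- the euro sign, beyond the 8-bit range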

Since most online text is in English or Western European languages, where most characters are in the US-ASCII set, requiring two bytes per character was considered wasteful, since it doubles the size of a text document. Hence, some more efficient encodings were devised, the most popular being UTF-8. This encoding drops the concept that all characters take the same number of bits, and represents characters as variable-length sequences. Notably, the 128 US-ASCII characters are encoded as single bytes, identical to their representation in US-ASCII and ISO-8859-1, so that any UTF-8 document consisting entirely of those characters is indistinguishable from a plain ASCII document, which is good for backward and forward compatibility. Beyond this, various combinations of bytes with their high bit set are used to represent other Unicode characters. In particular, it should be noted that the Latin-1 characters from #128 to #255 can't be included as "raw" single bytes in UTF-8, since those bytes are used as parts of multi-byte sequences; such characters must be encoded as more than one byte, unlike US-ASCII characters. This can sometimes cause a problem when Latin-1 characters are pasted into a UTF-8 document and the software involved doesn't do the appropriate conversion. However, as software authors get more globally aware (as the computer market spreads to countries where non-ASCII characters are essential), it is becoming more common for software to properly handle all sorts of characters without the users having to think too much about it... except on the occasions where something screws up!
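
You can see the variable-length scheme at work by encoding a few characters (a sketch):

# UTF-8 byte counts grow with the code position
print('A'.encode('utf-8'))   # b'A'            -- ASCII character: one byte, unchanged
print('é'.encode('utf-8'))   # b'\xc3\xa9'     -- Latin-1 letter: two bytes
print('€'.encode('utf-8'))   # b'\xe2\x82\xac' -- euro sign: three bytes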

Once UTF-8 was established (and used much more commonly than raw 16-bit encoding), Unicode itself dropped the concept that all its characters fit in the same number of bits, and revised its standard to permit more characters to be assigned at positions even higher than #65535. These characters take four bytes to encode in UTF-8, but allow for the addition of characters too obscure to make it in before. (So far, efforts to get Klingon added to the Unicode set have been rejected; however, they have seen fit to add such useful characters as "Pile of Poo", at hex code U+1F4A9.) The Unicode character set has also been adopted as a standard by ISO, which has designated it as ISO 10646.
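
Characters from beyond the original 16-bit range are in everyday use now, and the four-byte encoding is easy to verify (a sketch):

# A character above position #65535 takes four bytes in UTF-8
print('\U0001F4A9'.encode('utf-8'))   # b'\xf0\x9f\x92\xa9' -- "Pile of Poo", U+1F4A9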

The UTF-8 coding is very efficient for documents containing mostly ASCII characters with just a few others. It's also the best way to encode a document containing text from multiple languages, where most other encodings would be unable to represent all the needed characters at once. However, if something is written entirely in a single language composed of non-ASCII characters, a different encoding, specific to that language's character set, is more efficient. Hence, UTF-8 will never crowd out all other encodings; however, the underlying Unicode standard is the "common ground" by which characters in all encodings can be compared and converted, a "lingua franca" for character sets.

A UTF-8-encoded document has this header line to indicate its encoding:

Content-Type: text/plain; charset=utf-8

In an e-mail message, it should be further transfer-encoded as quoted printable, as described above, so that the byte-sequences denoting non-ASCII characters get represented in ASCII (hex-digit) form.
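
Chaining the two encodings looks like this in Python (a sketch using the standard quopri module; the sample text is my own):

# UTF-8 first, then quoted printable on top: all-ASCII output
import quopri

utf8_bytes = 'price: 10€'.encode('utf-8')
print(quopri.encodestring(utf8_bytes).decode('ascii'))
# price: 10=E2=82=AC  -- the euro sign's three UTF-8 bytes, in hex form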

Curly Quotes, Em-Dashes, and Trademark Signs

Earlier, I mentioned that some characters in the Windows character set, including "curly" quotes and the ™ sign, were not part of ISO-8859-1. Despite this, many programs (especially ones from Microsoft) like to insert them into documents and e-mail messages. The feature of so-called "smart quotes", found in a number of programs, causes normal ASCII quotes and apostrophes, " and ', to be converted to the "curly" variety, “”‘’. Even if your e-mail program doesn't do this, you might still introduce these characters when you paste in text from somewhere else, such as a word processor or Web page. Typographic purists say that this is more correct, though old-time computerists (and people familiar with typewriters before that) are used to the "straight" variety of quote. There are several ways a "curly quote", and other characters in the group that's in the Windows set but not Latin-1, can be represented in an e-mail message, and they range from being completely wrong (by the standards) to being correct but problematic. (Even in Web pages they can be problematic; if your browser shows question marks or raw code like &lsquo; above where example curly quotes should be, that means it does not support these character entities.)

  1. Some programs just plop these characters down into a document or message as 8-bit characters, straight out of Windows. If the message header indicates that it's in us-ascii, iso-8859-1, or utf-8, then this is just plain wrong: such characters are undefined in ASCII, are control characters in ISO-8859-1, and are parts of multi-byte sequences in UTF-8; they don't stand for what Windows thinks they do. (A short demonstration of this mismatch appears after this list.) However, if the message header indicates the encoding is windows-1252, then these characters are technically proper, though the use of a proprietary, platform-specific encoding is not a good idea (non-Windows systems may not know what to make of it). For that matter, some non-Windows systems (especially MacOS) sometimes plop down their own proprietarily-encoded "smart quotes", with characters differing from the Windows variety, into documents and messages, so that an apostrophe ends up looking at the other end like a superscripted number 1.

  2. Sometimes these characters are represented as numeric references in HTML (or SGML or XML) syntax. This makes no sense in a plain-text message (where no markup-language syntax has any business being used), but that doesn't always stop programs from doing it anyway. In HTML e-mail, it does make sense, just as in Web pages. However, the numeric references sometimes used are bogus ones like &#147;, corresponding to the position of the desired character in the Windows encoding. Numeric character references in HTML are always with respect to the Unicode positions of characters, and the control character at #147 in Unicode is in a range specifically disallowed in HTML. The characters in question do exist in Unicode, however, at much higher-numbered positions; thus, &#8220; is a valid numeric reference to a left curly double quote.

  3. Finally, if UTF-8 encoding is used, these characters can be included as multi-byte sequences under this encoding. This is properly standards-compliant, and works for plain-text as well as HTML e-mail. Unfortunately, not all e-mail programs support UTF-8; here's what an attempt to use it might look like (taken from an actual screenshot of an inbound message as displayed in a mail program):

    [Screen shot: a mail program displaying garbage in place of a message's UTF-8 characters]

    UTF-8 characters have also been known to get similarly mangled when a message containing them is quoted, forwarded, copied and pasted, or otherwise manipulated; or when a bunch of different messages are put together in a single digest or archive file (which can have only one "charset" header; if this is something other than UTF-8, even programs that would normally understand the encoded characters would see garbage instead).
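
Here is the mismatch from item 1 made concrete; a minimal sketch in Python (the sample bytes are my own):

# One byte, three interpretations: 0x92 is a curly apostrophe only in Windows-1252
raw = b'It\x92s here'
print(raw.decode('windows-1252'))     # It's here (curly apostrophe, as intended)
print(repr(raw.decode('iso-8859-1'))) # 'It\x92s here' -- an invisible control character
try:
    raw.decode('utf-8')
except UnicodeDecodeError:
    print('not valid UTF-8')          # 0x92 may only appear inside a multi-byte sequence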

Because of the problems and glitches involved, it's best to stick to the "safe" US-ASCII characters, including "straight quotes", rather than trying to be "fancy" with so-called "smart quotes" instead. If you actually need non-ASCII characters from the Unicode repertoire, such as in a multilingual message, then go ahead and use the appropriate encoding (and any users with nonsupporting reader programs will be out of luck), but if it's just a "frippery" like curly quotes, it's better to Keep It Simple, Stupid. Anyway, a curly apostrophe encoded in UTF-8 and transfer-encoded in quoted printable comes out as =E2=80=99, which takes a whopping nine bytes... a waste of bandwidth and disk space even if it's displayed correctly. The HTML reference &#8217; takes seven bytes. A normal ASCII apostrophe (') takes one byte.

People trying to imitate curly quotes have sometimes "appropriated" other ASCII and Latin-1 characters, with results I regard as more awkward than just using straight quotes. The grave accent (`), which is in ASCII, and the acute accent (´), which is in Latin-1, are sometimes pressed into service as single quotes or apostrophes; however, they are not intended to be any sort of quote. They lean too far to look good as quotes, and some software additionally treats the keys for them as "dead keys" for typing accented letters -- the accent is combined with the letter typed immediately after it. Thus, people who get in the habit of using them as quotes find that they sometimes don't work right. U.S. keyboards have a key only for the grave accent, anyway, not the acute one (though keyboards in other countries often have both). I've also seen people use a grave accent as an apostrophe (how`s that?), though it leans in entirely the wrong direction. Then there's what I call "Unix Geek Quoting" (also common in news wire services), which uses a grave accent as the opening single quote and a normal straight single quote to close it, like `this'. This was encouraged by archaic versions of the ASCII standard, implemented in the fonts of some old computer systems, that called for the normal ASCII apostrophe to "lean". Since the '80s at least, the standard has called for the ASCII apostrophe to be straight, and most current fonts follow this, so the two sides of a quote done this way don't come close to matching. People who use this quoting style often make opening double quotes with two grave accents, leaving things even more ``out of whack" when matched against the single-character double quote at the other end.

Besides quotes and the trademark sign, commonly used and abused Windows characters outside Latin-1 include the "em dash" (—) and the ellipsis (…). "Plain-ASCII" substitutes are two dashes (--) and three dots (...) respectively.

ROT13

ROT13 isn't really a character set, but it's a form of encoding you might sometimes encounter, especially on newsgroups. It's not part of any official, documented standard (as far as I know), and has no header lines to indicate its presence; rather, it's normally just embedded in the middle of a plain-text message. Suddenly (with or without warning), you hit a piece of gibberish text, though it's made up of normal letters (no funny control characters or hexadecimal digits). If it's on a geeky newsgroup or mailing list, you've probably run into ROT13. It is a trivial "encryption" scheme, designed not to keep a message secret (since it's easy to decode once you know how) but to provide a minor degree of protection against its being accidentally seen when it shouldn't be. It's used for such things as plot spoilers in discussions of books and movies, dirty jokes that might offend people, or mentions of names of people and companies in heated rants about office politics that the ranters would prefer not be indexed by Google where their boss might read them.

In ROT13 encoding, the 26 letters of the standard English alphabet are shifted 13 positions, with the alphabet considered to wrap from Z back to A in an endless loop. All other characters (numbers, punctuation, and accented letters, for instance) are left as-is. (This probably makes ROT13 inadequate for hiding text in non-English languages that have a high proportion of characters outside the ASCII alphabet.) Since 13 is exactly half of 26, the same operation serves both to encode and to decode a message.
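
ROT13 is simple enough to apply in one line; Python even ships a codec for it (a sketch):

# ROT13: shifting by 13 twice gets you back where you started
import codecs

spoiler = codecs.encode('The butler did it', 'rot_13')
print(spoiler)                           # Gur ohgyre qvq vg
print(codecs.encode(spoiler, 'rot_13'))  # The butler did it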

Traditionally, Unix-based news readers have had a built-in ROT13 encoding/decoding function, making it easy to read such encoded messages or create your own. Windows mail/news programs don't always have this function, but Web sites exist to do it for you.

Links

Next: You may have seen emoticons, or "smileys", in messages. Are they good :-) or bad :-( ?



This page was first created 17 May 2003, and was last modified 11 Mar 2013.
Copyright © 2003-2018 by Daniel R. Tobias. All rights reserved.

webmaster@mailformat.dan.info