10.1. The UTF-8 encoding

How, you might ask, do we pack Unicode characters, whose code points can need up to 21 bits, into 8-bit bytes? The UTF-8 encoding, widely used on the Web and the Internet generally, allows any Unicode character to be represented as a sequence of one or more 8-bit bytes.

First, some definitions:

A code point is a number representing a unique member of the Unicode character set.
The Unicode code points are visualized as a three-dimensional structure made of planes, each of which has a range of 65536 code points organized as 256 rows of 256 columns each.
The low-order eight bits of the code point select the column; the next eight more significant bits select the row; and the remaining most significant bits select the plane.
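
The plane, row, and column arithmetic described above can be sketched with a few shifts and masks. The function name split_code_point is a hypothetical helper made up for this illustration, not part of Python:

```python
def split_code_point(cp):
    """Return (plane, row, column) for an integer code point.

    Hypothetical helper for illustration only.
    """
    column = cp & 0xFF        # low-order eight bits select the column
    row = (cp >> 8) & 0xFF    # next eight bits select the row
    plane = cp >> 16          # remaining high-order bits select the plane
    return plane, row, column

print(split_code_point(0x0456))    # (0, 4, 86)
print(split_code_point(0xE1234))   # (14, 18, 52)
```

So code point hex 0456 sits in plane 0, row 4, column 86 (hex 56), while hex E1234 sits in plane 14 (hex E).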

This diagram shows how UTF-8 encoding works. The first 128 code points (hexadecimal 00 through 7F) are encoded as in normal 7-bit ASCII, with the high-order bit always 0. For code points above hex 7F, all bytes have the high-order (0x80) bit set, and the bits of the code point are distributed through two, three, or four bytes, depending on the number of bits needed to represent the code point value.
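
As a rough sketch of the bit distribution just described, the following function encodes a single code point by hand using the standard UTF-8 byte layouts. The name utf8_encode_code_point is made up for this illustration; in practice you would use the .encode() method shown below, and this sketch does not reject surrogate code points as a strict encoder would.

```python
def utf8_encode_code_point(cp):
    """Return a list of UTF-8 byte values for one integer code point.

    Hypothetical helper for illustration; no surrogate checking.
    """
    if cp < 0x80:
        # 7 bits: 0xxxxxxx (same as ASCII)
        return [cp]
    elif cp < 0x800:
        # 11 bits: 110xxxxx 10xxxxxx
        return [0xC0 | (cp >> 6),
                0x80 | (cp & 0x3F)]
    elif cp < 0x10000:
        # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    elif cp <= 0x10FFFF:
        # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return [0xF0 | (cp >> 18),
                0x80 | ((cp >> 12) & 0x3F),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    raise ValueError('code point out of range: %r' % (cp,))

print([hex(b) for b in utf8_encode_code_point(0x7E)])     # ['0x7e']
print([hex(b) for b in utf8_encode_code_point(0x0456)])   # ['0xd1', '0x96']
```

Every byte after the first in a multi-byte sequence starts with the bits 10, and the first byte's leading bits indicate how many bytes the sequence contains.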

To encode a Unicode string U, use this method:

U.encode('utf_8')

To decode a regular str value S that contains a UTF-8 encoded value, use this method:

S.decode('utf_8')

Examples:

>>> tilde = u'~'
>>> tilde.encode('utf_8')
'~'
>>> u16 = u'\u0456'
>>> s = u16.encode('utf_8')
>>> s
'\xd1\x96'
>>> s.decode('utf_8')
u'\u0456'
>>> u32 = u'\U000E1234'
>>> s = u32.encode('utf_8')
>>> s
'\xf3\xa1\x88\xb4'
>>> s.decode('utf_8')
u'\U000e1234'

UTF-8 is not the only encoding method. For more details, consult the documentation for the Python module codecs.
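
To illustrate the point, here is a small sketch that encodes the same character with three of the standard codecs. The choice of utf_16_be and utf_32_be is only an example; many other codec names are available.

```python
ch = u'\u0456'

# Encode one character three ways and show the resulting byte values.
for name in ('utf_8', 'utf_16_be', 'utf_32_be'):
    encoded = ch.encode(name)
    print('%s: %r' % (name, list(bytearray(encoded))))

# utf_8: [209, 150]
# utf_16_be: [4, 86]
# utf_32_be: [0, 0, 4, 86]
```

The same code point yields two, two, and four bytes respectively, so a byte string must always be decoded with the encoding that produced it.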

Python 2.7 quick reference
