With the advent of the Web as a medium for worldwide information interchange, the Unicode character set has become vital. For general background on this character set, see the Unicode homepage.
To get a Unicode string, prefix the string with u. For example, u'klarn' is a five-character Unicode string.
To include one of the special Unicode characters in a string constant, use these escape sequences:
\xHH | For a code with the 8-bit hexadecimal value HH. |
\uHHHH | For a code with the 16-bit hexadecimal value HHHH. |
\UHHHHHHHH | For a code with the 32-bit hexadecimal value HHHHHHHH. |
Examples:
>>> u'Klarn.'
u'Klarn.'
>>> u'Non-breaking-\xa0-space.'
u'Non-breaking-\xa0-space.'
>>> u'Less-than-or-equal symbol: \u2264'
u'Less-than-or-equal symbol: \u2264'
>>> u"Phoenician letter 'wau': \U00010905"
u"Phoenician letter 'wau': \U00010905"
>>> len(u'\U00010905')
1
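As a quick check on the three escape forms, the following sketch compares each escape against its numeric code point using the built-in ord() function. The variable names are illustrative only. The u prefix is accepted by Python 2 and by Python 3.3 or later, so the snippet should run under either.

```python
# Each escape form names a single code point; ord() returns it as an integer.
nbsp = u'\xa0'          # 8-bit form: exactly two hex digits
le_sign = u'\u2264'     # 16-bit form: exactly four hex digits
assert ord(nbsp) == 0xa0
assert ord(le_sign) == 0x2264

# The same character can be written with any form wide enough to hold it.
assert u'\xa0' == u'\u00a0' == u'\U000000a0'

# The 32-bit form is needed for code points above U+FFFF.
wau = u'\U00010905'
# len() is 1 on wide Python 2 builds and on Python 3.3+;
# a narrow Python 2 build stores a surrogate pair and reports 2.
wau_len = len(wau)
```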
All the operators and methods of str type are available with unicode values.
Additionally, for a Unicode value U, use this method to encode
its value as a string of type str:
U.encode(encoding[, error])
Return the value of U as type str. The encoding argument
is a string that specifies the encoding method. In most
cases, this will be 'utf_8'. For
discussion and examples, see Section 10.1, “The UTF-8 encoding”.
The optional error string specifies what
to do with characters that do not have exact
equivalents in the target encoding, for example when you are converting to
the ASCII character set, where the encoding argument is 'ascii'. Values of the error argument are given in the table below.
'strict' | Raise a UnicodeError exception if any character has no ASCII equivalent. This is the default behavior. |
'ignore' | Leave out characters that have no equivalent. |
'replace' | Substitute a '?' for each character that has no equivalent. |
'xmlcharrefreplace' | Use the XML character entity escape sequence for characters with no ASCII equivalent. The general form of this sequence is "&#N;", where N is the decimal value of the character's code point. This feature is very handy for generating internationalized Web pages. |
'backslashreplace' | Use Python backslash escape sequences to represent characters with no equivalent. |
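The error argument matters only when the target encoding cannot represent a character. As a minimal sketch (variable names are illustrative), encoding ordinary text to 'utf_8' does not trigger any of these policies, and decode() turns the resulting byte string back into the original Unicode value. The b prefix on the expected value is a portability assumption: it is accepted by Python 2.6 and later and is required under Python 3, where encode() returns a bytes object rather than a str.

```python
s = u'a\u262ez'

# UTF-8 can represent U+262E, so the default 'strict' handling never fires.
encoded = s.encode('utf_8')
assert encoded == b'a\xe2\x98\xaez'   # U+262E becomes three bytes
assert len(encoded) == 5

# Decoding with the same encoding recovers the original Unicode value.
decoded = encoded.decode('utf_8')
assert decoded == s
```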
Here are some examples to demonstrate error
argument values.
>>> s = u"a\u262ez"
>>> len(s)
3
>>> s
u'a\u262ez'
>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u262e'
in position 1: ordinal not in range(128)
>>> s.encode('ascii', 'ignore')
'az'
>>> s.encode('ascii', 'replace')
'a?z'
>>> s.encode('ascii', 'xmlcharrefreplace')
'a&#9774;z'
>>> hex(9774)
'0x262e'
>>> t = s.encode('ascii', 'backslashreplace')
>>> t
'a\\u262ez'
>>> print t
a\u262ez
>>> len(t)
8
>>> t[1]
'\\'
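The session above can be condensed into a short script that applies each error policy in turn. This is a sketch rather than part of the reference: the names results and strict_failed are hypothetical, and the b prefixes are there so that the comparisons hold on both Python 2.6+ and Python 3.3+.

```python
s = u'a\u262ez'

# Collect the ASCII encoding of s under each non-strict error policy.
results = {}
for policy in ('ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace'):
    results[policy] = s.encode('ascii', policy)

assert results['ignore'] == b'az'
assert results['replace'] == b'a?z'
assert results['xmlcharrefreplace'] == b'a&#9774;z'
assert results['backslashreplace'] == b'a\\u262ez'

# 'strict' (the default) raises instead of returning a value.
# UnicodeEncodeError is a subclass of UnicodeError, so this clause catches it.
try:
    s.encode('ascii')
    strict_failed = False
except UnicodeError:
    strict_failed = True
assert strict_failed
```

Catching UnicodeError rather than the more specific UnicodeEncodeError matches the wording of the table; either exception class works here.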