python - Why does unicode() use str() on my object only with no encoding given?

Reposted. Author: 太空狗. Updated: 2023-10-29 22:17:31

I start by creating a string variable that holds some non-ASCII, UTF-8 encoded data:

>>> text = 'á'
>>> text
'\xc3\xa1'
>>> text.decode('utf-8')
u'\xe1'

Using unicode() on it raises an error...

>>> unicode(text)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

...but if I know the encoding, I can pass it as the second argument:

>>> unicode(text, 'utf-8')
u'\xe1'
>>> unicode(text, 'utf-8') == text.decode('utf-8')
True

Now, suppose I have a class that returns this text from its __str__() method:

>>> class ReturnsEncoded(object):
...     def __str__(self):
...         return text
...
>>> r = ReturnsEncoded()
>>> str(r)
'\xc3\xa1'

unicode(r) seems to use str() on it, since it raises the same error as unicode(text) above:

>>> unicode(r)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

So far, everything is as expected!

But, as no one would expect, unicode(r, 'utf-8') won't even try:

>>> unicode(r, 'utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: coercing to Unicode: need string or buffer, ReturnsEncoded found

Why? Why this inconsistent behavior? Is it a bug? Is it intended? It is very awkward.

Best answer

The behavior does seem confusing, but it is intentional. I reproduce here the entire unicode entry from the Python Built-In Functions documentation (for version 2.5.2, as I write this):

unicode([object[, encoding [, errors]]])

Return the Unicode string version of object using one of the following modes:

If encoding and/or errors are given, unicode() will decode the object which can either be an 8-bit string or a character buffer using the codec for encoding. The encoding parameter is a string giving the name of an encoding; if the encoding is not known, LookupError is raised. Error handling is done according to errors; this specifies the treatment of characters which are invalid in the input encoding. If errors is 'strict' (the default), a ValueError is raised on errors, while a value of 'ignore' causes errors to be silently ignored, and a value of 'replace' causes the official Unicode replacement character, U+FFFD, to be used to replace input characters which cannot be decoded. See also the codecs module.

If no optional parameters are given, unicode() will mimic the behaviour of str() except that it returns Unicode strings instead of 8-bit strings. More precisely, if object is a Unicode string or subclass it will return that Unicode string without any additional decoding applied.

For objects which provide a __unicode__() method, it will call this method without arguments to create a Unicode string. For all other objects, the 8-bit string version or representation is requested and then converted to a Unicode string using the codec for the default encoding in 'strict' mode.

New in version 2.0. Changed in version 2.2: Support for __unicode__() added.
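
The error-handling modes in the quoted documentation can be tried out on a current interpreter. This is only a sketch under an assumption: unicode() does not exist in Python 3, so bytes.decode() is used here as a stand-in, with a bytes object playing the role of the 8-bit string from the question.

```python
# Python 3 sketch: bytes plays the role of Python 2's 8-bit str.
raw = b'\xc3\xa1'  # UTF-8 encoding of 'á', as in the question

# The matching codec decodes cleanly.
decoded = raw.decode('utf-8')

# 'strict' (the default) raises; UnicodeDecodeError is a ValueError subclass.
try:
    raw.decode('ascii', 'strict')
    strict_failed = False
except ValueError:
    strict_failed = True

# 'ignore' drops undecodable bytes; 'replace' substitutes U+FFFD for each one.
ignored = raw.decode('ascii', 'ignore')
replaced = raw.decode('ascii', 'replace')

# An unknown codec name raises LookupError.
try:
    raw.decode('no-such-codec')
    lookup_failed = False
except LookupError:
    lookup_failed = True
```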

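So, when you call unicode(r, 'utf-8'), the first argument must already be an 8-bit string or a character buffer. In this mode unicode() does not call __str__() on your object; a ReturnsEncoded instance is neither a string nor a buffer, so the call fails with the TypeError shown in the question. Without the 'utf-8' argument, unicode() looks for a __unicode__() method on your object and, not finding one, calls the __str__() method, as you suggested, then tries to convert the result to unicode with the default codec (ASCII here, as the traceback shows). To decode with an explicit codec, convert first: unicode(str(r), 'utf-8').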
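
The two modes can be modelled in a few lines. This is a hedged sketch for Python 3, not the real implementation: the helper name py2_unicode is hypothetical, bytes stands in for the Python 2 8-bit str, and __bytes__() stands in for a Python 2 __str__() that returns encoded data. It only covers the cases discussed in the question.

```python
# Python 3 model of Python 2's unicode() dispatch; NOT the real implementation.
# Assumptions: bytes stands in for the 8-bit str type, and __bytes__() stands
# in for Python 2's __str__(). The helper name py2_unicode is hypothetical.

DEFAULT_ENCODING = 'ascii'  # the usual Python 2 default encoding


def py2_unicode(obj, encoding=None, errors=None):
    if encoding is not None or errors is not None:
        # Decoding mode: only 8-bit strings and buffers are accepted;
        # the object's own conversion methods are never consulted.
        if not isinstance(obj, (bytes, bytearray, memoryview)):
            raise TypeError('coercing to Unicode: need string or buffer, '
                            '%s found' % type(obj).__name__)
        return bytes(obj).decode(encoding or DEFAULT_ENCODING,
                                 errors or 'strict')
    # Conversion mode: mimic str(), but produce text.
    if isinstance(obj, str):
        return obj
    if hasattr(type(obj), '__unicode__'):
        return obj.__unicode__()
    # Only bytes and objects defining __bytes__() are modelled here.
    raw = obj if isinstance(obj, bytes) else obj.__bytes__()
    return raw.decode(DEFAULT_ENCODING, 'strict')


class ReturnsEncoded(object):
    def __bytes__(self):  # stand-in for __str__() returning encoded data
        return b'\xc3\xa1'


class ReturnsText(ReturnsEncoded):
    def __unicode__(self):  # the hook the conversion mode looks for first
        return self.__bytes__().decode('utf-8')


r = ReturnsEncoded()
t = ReturnsText()
```

Under this model, py2_unicode(r) fails with UnicodeDecodeError and py2_unicode(r, 'utf-8') fails with TypeError, matching the two tracebacks in the question. py2_unicode(t) succeeds because __unicode__() is found, and py2_unicode(r.__bytes__(), 'utf-8') succeeds because the argument is already an encoded string.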

Regarding "python - Why does unicode() use str() on my object only with no encoding given?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/106630/
