
Python: converting unicode to its "print" form

Reposted. Author: 行者123. Last updated: 2023-12-01 04:52:57

I scraped this paragraph from a web page:

It doesn’t look like a controversial new case management system is going anywhere. So the city plans to spend the next few months helping local social assistance workers learn to live with it.

In the Python unicode HTML data I downloaded, it looks like this:

mystr = u'It doesn\u2019t look like a controversial new case management system is going anywhere. So\xa0the city plans to spend the next few months helping local social assistance workers learn to live with it.'

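My plan is to find the position of a word with something like mystr.find("doesn't"). At the moment, mystr.find("doesn't") returns -1, because the word is actually stored as doesn\u2019t in mystr.

Is there a quick way to convert mystr so that it is exactly the same as the paragraph above, with every unicode "character" replaced by a "normal" character, so that I can use str.find()?

So far, the best post I have found on the web suggests replacing u'\u2019' with "'" and then replacing u'\xa0' with ' '. Is there a more convenient way, so that I don't have to write a method and build a conversion dictionary myself?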

P.S.:

I also tried things like unicodedata.normalize, which did not seem to work.

Edit: I forgot to mention that the Python version is 2.7.

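Best answer

You already have what the web page contains. \u2019 is U+2019 RIGHT SINGLE QUOTATION MARK, a fancy single quote, but you are searching with a plain ASCII single quote, the humble U+0027 APOSTROPHE.

If you print the value, you will see that it produces something that looks very much like it has a single quote in it, only slightly curved: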

>>> mystr = u'It doesn\u2019t look like a controversial new case management system is going anywhere. So\xa0the city plans to spend the next few months helping local social assistance workers learn to live with it.'
>>> print mystr
It doesn’t look like a controversial new case management system is going anywhere. So the city plans to spend the next few months helping local social assistance workers learn to live with it.

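What Python showed you before was just the representation (repr) of the string, which replaces anything non-printable or non-ASCII with escape sequences so that the value is reproducible; you can copy that value and paste it into any Python interpreter or script and it will produce the same string. Because the default source encoding in Python 2 is ASCII, only ASCII characters are used to describe the value.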
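To check which characters actually differ from plain ASCII, rather than judging by how they look when printed, one option is to list every non-ASCII code point together with its official Unicode name. The following is a minimal sketch that uses only the standard unicodedata module; mystr is a shortened copy of the string from the question, and the name non_ascii is purely illustrative.

```python
import unicodedata

# Shortened copy of the string from the question (assumption: trimmed for brevity).
mystr = u'It doesn\u2019t look like a controversial new case. So\xa0the city plans.'

# Collect (index, code point, Unicode name) for every character outside ASCII.
non_ascii = [(i, u'U+%04X' % ord(ch), unicodedata.name(ch))
             for i, ch in enumerate(mystr) if ord(ch) > 127]

for entry in non_ascii:
    print(entry)
```

For this string the list should contain two entries: the curly quote at index 8 and the no-break space after "So".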

You can search for that text directly:

>>> u'doesn\u2019t' in mystr
True
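Since the question asks for a position rather than a membership test, the same needle can be passed to find(). This is a small sketch on a shortened copy of the string; the names needle, position and plain_position are illustrative only.

```python
# Shortened copy of the string from the question (assumption: trimmed for brevity).
mystr = u'It doesn\u2019t look like a controversial new case. So\xa0the city plans.'

# Search with the same curly quote that the data contains.
needle = u'doesn\u2019t'
position = mystr.find(needle)            # index of the first match, or -1

# Searching with the ASCII apostrophe finds nothing, as in the question.
plain_position = mystr.find(u"doesn't")  # -1 here
```

Here position is 3, because the word starts right after "It ", while plain_position stays at -1.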

Or you can use a library such as unidecode to replace non-ASCII code points with ASCII "lookalikes"; it will replace the fancy quote with a plain ASCII quote:

>>> from unidecode import unidecode
>>> unidecode(mystr)
"It doesn't look like a controversial new case management system is going anywhere. So the city plans to spend the next few months helping local social assistance workers learn to live with it."
>>> "doesn't" in unidecode(mystr)
True
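If installing a third-party package is not an option, a small hand-written table passed to translate() covers the two characters from the question. This is only a sketch: ASCII_LOOKALIKES and to_plain are hypothetical names, and the table is deliberately incomplete compared with what unidecode handles. The last lines also show that NFKC normalization turns the no-break space into a regular space but leaves U+2019 untouched, which may be why unicodedata.normalize did not seem to help in the question.

```python
import unicodedata

# Shortened copy of the string from the question (assumption: trimmed for brevity).
mystr = u'It doesn\u2019t look like a controversial new case. So\xa0the city plans.'

# Hypothetical, incomplete table: code point -> ASCII replacement.
ASCII_LOOKALIKES = {
    0x2018: u"'",   # LEFT SINGLE QUOTATION MARK
    0x2019: u"'",   # RIGHT SINGLE QUOTATION MARK
    0x201C: u'"',   # LEFT DOUBLE QUOTATION MARK
    0x201D: u'"',   # RIGHT DOUBLE QUOTATION MARK
    0x00A0: u' ',   # NO-BREAK SPACE
}

def to_plain(text):
    """Replace the code points listed in ASCII_LOOKALIKES; leave the rest alone."""
    return text.translate(ASCII_LOOKALIKES)

plain = to_plain(mystr)
plain_position = plain.find(u"doesn't")

# Compatibility normalization maps U+00A0 to a space but has no mapping for U+2019.
nfkc = unicodedata.normalize('NFKC', mystr)
```

With this table, plain_position is 3. Anything not listed in the table passes through unchanged, so a library such as unidecode remains the more thorough choice for arbitrary text.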

This question and answer, "python convert unicode to 'print' form", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/28053850/
