
python - How do I remove special characters when extracting data from the web?

Reposted · Author: 太空宇宙 · Updated: 2023-11-04 10:32:34

I am extracting data from a website, and one of the entries contains special characters: Comfort Inn And Suites�?Blazing Stump. When I try to extract it, an error is thrown:

    Traceback (most recent call last):
      File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "C:\Python27\lib\site-packages\twisted\internet\task.py", line 638, in _tick
        taskObj._oneWorkUnit()
      File "C:\Python27\lib\site-packages\twisted\internet\task.py", line 484, in _oneWorkUnit
        result = next(self._iterator)
      File "C:\Python27\lib\site-packages\scrapy\utils\defer.py", line 57, in <genexpr>
        work = (callable(elem, *args, **named) for elem in iterable)
    --- <exception caught here> ---
      File "C:\Python27\lib\site-packages\scrapy\utils\defer.py", line 96, in iter_errback
        yield it.next()
      File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\offsite.py", line 24, in process_spider_output
        for x in result:
      File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\referer.py", line 14, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\urllength.py", line 32, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\depth.py", line 48, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "E:\Scrapy projects\emedia\emedia\spiders\test_spider.py", line 46, in parse
        print repr(business.select('a[@class="name"]/text()').extract()[0])
      File "C:\Python27\lib\site-packages\scrapy\selector\lxmlsel.py", line 51, in select
        result = self.xpathev(xpath)
      File "xpath.pxi", line 318, in lxml.etree.XPathElementEvaluator.__call__ (src\lxml\lxml.etree.c:145954)
      File "xpath.pxi", line 241, in lxml.etree._XPathEvaluatorBase._handle_result (src\lxml\lxml.etree.c:144987)
      File "extensions.pxi", line 621, in lxml.etree._unwrapXPathObject (src\lxml\lxml.etree.c:139973)
      File "extensions.pxi", line 655, in lxml.etree._createNodeSetResult (src\lxml\lxml.etree.c:140328)
      File "extensions.pxi", line 676, in lxml.etree._unpackNodeSetEntry (src\lxml\lxml.etree.c:140524)
      File "extensions.pxi", line 784, in lxml.etree._buildElementStringResult (src\lxml\lxml.etree.c:141695)
      File "apihelpers.pxi", line 1373, in lxml.etree.funicode (src\lxml\lxml.etree.c:26255)
    exceptions.UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 22: invalid continuation byte

After searching online I have tried many different things, such as decode('utf-8') and unicodedata.normalize('NFC', business.select('a[@class="name"]/text()').extract()[0]), but the problem persists.
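The decode attempts cannot help here, because the bytes in the page are simply not valid UTF-8: after the lead byte C3 the decoder expects a continuation byte, but finds 3F (the '?' character) instead. A minimal sketch (Python 3 bytes literals standing in for the Python 2 str of the question):

```python
# The garbled bytes from the page: ... C3 3F C2 A0 ... ('?' is 0x3F).
data = b'Comfort Inn And Suites\xc3?\xc2\xa0Blazing Stump'

try:
    data.decode('utf-8')
except UnicodeDecodeError as exc:
    # 0xC3 opens a two-byte UTF-8 sequence, but 0x3F is not a valid
    # continuation byte -- the same failure lxml reports at position 22.
    print(exc.start, exc.reason)
```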

The source URL is "http://www.truelocal.com.au/find/hotels/97/"; on that page it is the fourth entry that I am talking about.

Best Answer

You have a bad case of Mojibake in the original web page, probably caused by mishandling of Unicode during data entry somewhere. Expressed in hexadecimal, the actual UTF-8 bytes in the source are C3 3F C2 A0.

I think it was originally a U+00A0 NO-BREAK SPACE. Encoded to UTF-8 that becomes C2 A0; interpreting *that* as Latin-1 and encoding to UTF-8 again gives C3 82 C2 A0. But 82 is a control character when interpreted as Latin-1 once more, so it was replaced by a ? question mark, which is 3F in hexadecimal.
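The chain just described can be reproduced step by step (a Python 3 sketch; the variable names are mine):

```python
# Step 1: the intended character, a no-break space, encoded to UTF-8.
original = '\u00a0'                      # U+00A0 NO-BREAK SPACE
step1 = original.encode('utf-8')
assert step1 == b'\xc2\xa0'              # C2 A0

# Step 2: misread those bytes as Latin-1, then re-encode as UTF-8.
step2 = step1.decode('latin-1').encode('utf-8')
assert step2 == b'\xc3\x82\xc2\xa0'      # C3 82 C2 A0

# Step 3: 0x82 is a control character in Latin-1; some layer replaced
# it with '?' (0x3F), producing the bytes actually seen in the page.
broken = step2.replace(b'\x82', b'?')
assert broken == b'\xc3?\xc2\xa0'        # C3 3F C2 A0
```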

When you follow the link to the detail page for that venue, you get a different Mojibake of the same name: Comfort Inn And Suites Ã‚Â Blazing Stump, giving us the Unicode characters U+00C3, U+201A, U+00C2 and a &nbsp; HTML entity, the last being the Unicode character U+00A0 again. Encode that to Windows Codepage 1252 (a superset of Latin-1) and you get C3 82 C2 A0 once more.
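That round-trip can be checked directly (a sketch; cp1252 is Python's codec name for Windows Codepage 1252):

```python
# The visible mojibake characters on the detail page:
# Ã (U+00C3), ‚ (U+201A), Â (U+00C2), and the &nbsp; entity (U+00A0).
visible = '\u00c3\u201a\u00c2\u00a0'

# cp1252 maps U+201A to byte 0x82, so encoding reproduces C3 82 C2 A0.
assert visible.encode('cp1252') == b'\xc3\x82\xc2\xa0'
```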

You can only get rid of this by targeting it in the page source directly:

    pagesource.replace('\xc3?\xc2\xa0', '\xc2\xa0')

This "repairs" the data by replacing the train wreck with the originally intended UTF-8 bytes.
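In Python 3 terms (the snippet above uses Python 2 str literals, which are byte strings) the repair and the subsequent decode look like this:

```python
raw = b'Comfort Inn And Suites\xc3?\xc2\xa0Blazing Stump'

# Swap the broken C3 3F C2 A0 sequence for the intended C2 A0 (U+00A0).
fixed = raw.replace(b'\xc3?\xc2\xa0', b'\xc2\xa0')

# The repaired body now decodes cleanly; the wreck is a no-break space.
print(fixed.decode('utf-8'))
```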

If you have a scrapy Response object, replace the body:

    body = response.body.replace('\xc3?\xc2\xa0', '\xc2\xa0')
    response = response.replace(body=body)

Regarding python - How do I remove special characters when extracting data from the web?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/25502318/
