
python - Scrapy exports strange symbols into csv file

OK, here's the problem. I'm a beginner just starting to dig into scrapy/python.

I use the code below to scrape a website and save the results to a csv file. When I look at the command prompt, it turns words like Officiële into Offici\xeble. In the csv file, it changes them to officiële. I think this is because it saves in Unicode instead of UTF-8? However, I have no idea how to change my code accordingly, and I have been trying all morning.

Can anyone help me? I'm specifically looking at making sure item["publicatietype"] works properly. How do I encode/decode it? What do I need to write? I tried using replace('?', 'ë'), but that gave me an error (non-ASCII character, but no encoding declared).
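
(A side note on that last error: Python 2 only accepts non-ASCII literals such as ë in a source file if the file declares its encoding, per PEP 263, and is actually saved in that encoding. A minimal sketch with placeholder names, added for illustration:)

# -*- coding: utf-8 -*-
# The declaration above lets Python 2 parse non-ASCII literals in this file;
# the file itself must be saved as UTF-8 for the declaration to be accurate.
text = u'Offici?le'                # placeholder value for illustration
print text.replace(u'?', u'ë')     # -> Officiële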

import scrapy
from scrapy import Spider
from scrapy.exceptions import DropItem

# ThingsToGather is the scrapy.Item subclass defined elsewhere in the project (not shown in the question).


class pagespider(Spider):
    name = "OBSpider"
    # max_pages is here to prevent endless loops; make it as large as you need. The spider will try to go
    # up to that page even if there is nothing there. A number that is too high just takes way too much
    # time and yields no results.
    max_pages = 1

    def start_requests(self):
        for i in range(self.max_pages):
            yield scrapy.Request("https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=%d&sorttype=1&sortorder=4" % (i+1), callback=self.parse)

    def parse(self, response):
        for sel in response.xpath('//div[@class = "lijst"]/ul/li'):
            item = ThingsToGather()
            item["titel"] = ' '.join(sel.xpath('a/text()').extract())
            deeplink = ''.join(["https://zoek.officielebekendmakingen.nl/", ' '.join(sel.xpath('a/@href').extract())])
            request = scrapy.Request(deeplink, callback=self.get_page_info)
            request.meta['item'] = item
            yield request

    def get_page_info(self, response):
        for sel in response.xpath('//*[@id="Inhoud"]'):
            item = response.meta['item']

            # Load some general info from the header. If this string is shorter than 5 characters, the page
            # is probably a faulty link (i.e. an error 404), so the item is dropped. Otherwise, continue.
            if len(' '.join(sel.xpath('//div[contains(@class, "logo-nummer")]/div[contains(@class, "nummer")]/text()').extract())) < 5:
                raise DropItem()
            else:
                item["filename"] = ' '.join(sel.xpath('//*[@id="downloadPdfHyperLink"]/@href').extract())
                item['publicatiedatum'] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/available")]/text()').extract()
                item["publicatietype"] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/type")]/text()').extract()
                item = self.__normalise_item(item, response.url)

                # If the date string is shorter than 5 characters, the required data is not on this page and
                # has to be retrieved from the technical information link. Otherwise (the else clause), the
                # item is complete and is yielded.
                if len(item['publicatiedatum']) < 5:
                    tech_inf_link = ''.join(["https://zoek.officielebekendmakingen.nl/", ' '.join(sel.xpath('//*[@id="technischeInfoHyperlink"]/@href').extract())])
                    request = scrapy.Request(tech_inf_link, callback=self.get_date_info)
                    request.meta['item'] = item
                    yield request
                else:
                    yield item

    def get_date_info(self, response):
        for sel in response.xpath('//*[@id="Inhoud"]'):
            item = response.meta['item']
            item["filename"] = sel.xpath('//span[contains(@property, "http://standaarden.overheid.nl/oep/meta/publicationName")]/text()').extract()
            item['publicatiedatum'] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/available")]/text()').extract()
            item['publicatietype'] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/type")]/text()').extract()
            item["filename"] = ' '.join(sel.xpath('//*[@id="downloadPdfHyperLink"]/@href').extract())
            item = self.__normalise_item(item, response.url)
            return item

    # The methods below clean up strings: everything goes through __normalise_item to strip unwanted
    # characters and collapse double spaces.

    def __normalise_item(self, item, base_url):
        for key, value in vars(item).values()[0].iteritems():
            item[key] = self.__normalise(item[key])

        item['titel'] = item['titel'].replace(';', '& ')
        return item

    def __normalise(self, value):
        value = value if type(value) is not list else ' '.join(value)
        value = value.strip()
        value = " ".join(value.split())
        return value

Answer:

See paul trmbrth's remarks below. The problem is not Scrapy, but Excel.

For anyone running into this problem, the tl;dr is: import the data into Excel (via the Data menu in the ribbon) and switch the encoding from Windows (ANSI), or whatever it is set to, to Unicode (UTF-8).
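
To check the exported file outside of Excel, here is a minimal Python 2 sketch (assuming the export is named test.csv, as in the answer below); Python 2's csv module works on byte strings, so each cell is decoded from UTF-8 explicitly:

import csv

with open('test.csv', 'rb') as f:        # the csv module in Python 2 expects bytes
    for row in csv.reader(f):
        # decode every cell from UTF-8 to get proper unicode strings
        print [cell.decode('utf-8') for cell in row]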

Best Answer

Officiële will be represented as u'Offici\xeble' in Python 2, as shown in the example python shell session below (no need to worry about the \xXX characters; that is just how Python represents non-ASCII Unicode characters).

$ python
Python 2.7.9 (default, Apr 2 2015, 15:33:21)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'Officiële'
u'Offici\xeble'
>>> u'Offici\u00EBle'
u'Offici\xeble'
>>>

I think this is because it's saving in unicode instead of UTF-8

UTF-8 is an encoding; Unicode is not.

ë, a.k.a. U+00EB, a.k.a. LATIN SMALL LETTER E WITH DIAERESIS, is encoded in UTF-8 as 2 bytes: \xc3\xab

>>> u'Officiële'.encode('UTF-8')
'Offici\xc3\xable'
>>>

In the csv file, it changes it to officiële.

If you see this, you probably need to set the input encoding to UTF-8 when opening the CSV file in your program.
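
That officiële rendering is the classic symptom of UTF-8 bytes being decoded as Latin-1 (or Windows-1252). A minimal Python 2 illustration of the mechanism (not from the original answer):

>>> u'Officiële'.encode('utf-8')       # the correct UTF-8 bytes on disk
'Offici\xc3\xable'
>>> print u'Officiële'.encode('utf-8').decode('latin-1')   # what a Latin-1 reader displays
OfficiÃ«le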

Scrapy's CSV exporter writes Python Unicode strings to the output file as UTF-8 encoded strings.

Scrapy selectors output Unicode strings:

$ scrapy shell "https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4"
2016-03-15 10:44:51 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
(...)
2016-03-15 10:44:52 [scrapy] DEBUG: Crawled (200) <GET https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4> (referer: None)
(...)
In [1]: response.css('div.menu-bmslink > ul > li > a::text').extract()
Out[1]:
[u'Offici\xeble bekendmakingen vandaag',
 u'Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011',
 u'Uitleg nieuwe\r\n            nummering Staatscourant vanaf 1 juli 2009']

In [2]: for t in response.css('div.menu-bmslink > ul > li > a::text').extract():
   ...:     print t
   ...:
Officiële bekendmakingen vandaag
Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011
Uitleg nieuwe
            nummering Staatscourant vanaf 1 juli 2009

Let's see what CSV you get when a spider extracts these strings into items:

$ cat testspider.py
import scrapy


class TestSpider(scrapy.Spider):
    name = 'testspider'
    start_urls = ['https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4']

    def parse(self, response):
        for t in response.css('div.menu-bmslink > ul > li > a::text').extract():
            yield {"link": t}

Run the spider and ask for CSV output:

$ scrapy runspider testspider.py -o test.csv
2016-03-15 11:00:13 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
2016-03-15 11:00:13 [scrapy] INFO: Optional features available: ssl, http11
2016-03-15 11:00:13 [scrapy] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': 'test.csv'}
2016-03-15 11:00:14 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-03-15 11:00:14 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-15 11:00:14 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-15 11:00:14 [scrapy] INFO: Enabled item pipelines:
2016-03-15 11:00:14 [scrapy] INFO: Spider opened
2016-03-15 11:00:14 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-15 11:00:14 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-15 11:00:14 [scrapy] DEBUG: Crawled (200) <GET https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4> (referer: None)
2016-03-15 11:00:14 [scrapy] DEBUG: Scraped from <200 https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4>
{'link': u'Offici\xeble bekendmakingen vandaag'}
2016-03-15 11:00:14 [scrapy] DEBUG: Scraped from <200 https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4>
{'link': u'Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011'}
2016-03-15 11:00:14 [scrapy] DEBUG: Scraped from <200 https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4>
{'link': u'Uitleg nieuwe\r\n            nummering Staatscourant vanaf 1 juli 2009'}
2016-03-15 11:00:14 [scrapy] INFO: Closing spider (finished)
2016-03-15 11:00:14 [scrapy] INFO: Stored csv feed (3 items) in: test.csv
2016-03-15 11:00:14 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 488,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 12018,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 3, 15, 10, 0, 14, 991735),
 'item_scraped_count': 3,
 'log_count/DEBUG': 5,
 'log_count/INFO': 8,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 3, 15, 10, 0, 14, 59471)}
2016-03-15 11:00:14 [scrapy] INFO: Spider closed (finished)

Checking the content of the CSV file:

$ cat test.csv
link
Officiële bekendmakingen vandaag
Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011
"Uitleg nieuwe
nummering Staatscourant vanaf 1 juli 2009"
$ hexdump -C test.csv
00000000  6c 69 6e 6b 0d 0a 4f 66  66 69 63 69 c3 ab 6c 65  |link..Offici..le|
00000010  20 62 65 6b 65 6e 64 6d  61 6b 69 6e 67 65 6e 20  | bekendmakingen |
00000020  76 61 6e 64 61 61 67 0d  0a 55 69 74 6c 65 67 20  |vandaag..Uitleg |
00000030  6e 69 65 75 77 65 20 6e  75 6d 6d 65 72 69 6e 67  |nieuwe nummering|
00000040  20 48 61 6e 64 65 6c 69  6e 67 65 6e 20 76 61 6e  | Handelingen van|
00000050  61 66 20 31 20 6a 61 6e  75 61 72 69 20 32 30 31  |af 1 januari 201|
00000060  31 0d 0a 22 55 69 74 6c  65 67 20 6e 69 65 75 77  |1.."Uitleg nieuw|
00000070  65 0d 0a 20 20 20 20 20  20 20 20 20 20 20 20 6e  |e..            n|
00000080  75 6d 6d 65 72 69 6e 67  20 53 74 61 61 74 73 63  |ummering Staatsc|
00000090  6f 75 72 61 6e 74 20 76  61 6e 61 66 20 31 20 6a  |ourant vanaf 1 j|
000000a0  75 6c 69 20 32 30 30 39  22 0d 0a                 |uli 2009"..|
000000ab

You can verify that ë was correctly encoded as c3 ab.
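
The same check can be done from Python; a quick sketch, assuming test.csv is in the current directory:

>>> data = open('test.csv', 'rb').read()
>>> '\xc3\xab' in data                 # the two UTF-8 bytes for ë are present
True
>>> data.decode('utf-8')[:38]          # and the file decodes cleanly as UTF-8
u'link\r\nOffici\xeble bekendmakingen vandaag'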

For example, when using LibreOffice, the file data is displayed correctly (note "Character set: Unicode (UTF-8)"):

[Screenshot: opening test.csv in LibreOffice]

You are probably using Latin-1 instead. Here is what you get when Latin-1 rather than UTF-8 is used as the input encoding setting (again in LibreOffice):

[Screenshot: opening test.csv in LibreOffice with Latin-1 as the input encoding]

Regarding python - Scrapy exports strange symbols into csv file, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35989705/
