
python - Scrapy exports strange symbols into csv file

OK, here's the problem. I'm a beginner just starting to dig into scrapy/python.

I use the code below to scrape a website and save the results to a csv file. When I look at the command prompt, it turns words like Officiële into Offici\xeble. In the csv file, it changes them to officiële. I think this is because it saves in Unicode instead of UTF-8? However, I have no idea how to change my code accordingly, and I have been trying all morning.

Can anyone help me? I'm specifically looking at making sure item["publicatietype"] works properly. How do I encode/decode it? What do I need to write? I tried using replace('?', 'ë'), but that gave me an error (non-ASCII character, but no encoding declared).
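
(A side note on that last error: Python 2 only accepts non-ASCII literals such as ë in a source file if the file declares its encoding, per PEP 263, and is actually saved in that encoding. A minimal sketch with placeholder names, added for illustration:)

# -*- coding: utf-8 -*-
# The declaration above lets Python 2 parse non-ASCII literals in this file;
# the file itself must be saved as UTF-8 for the declaration to be accurate.
text = u'Offici?le'                # placeholder value for illustration
print text.replace(u'?', u'ë')     # -> Officiële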

import scrapy
from scrapy import Spider
from scrapy.exceptions import DropItem

# ThingsToGather is the scrapy.Item subclass defined elsewhere in the project (not shown in the question).


class pagespider(Spider):
    name = "OBSpider"
    # max_pages is here to prevent endless loops; make it as large as you need. The spider will try to go
    # up to that page even if there is nothing there. A number that is too high just takes way too much
    # time and yields no results.
    max_pages = 1

    def start_requests(self):
        for i in range(self.max_pages):
            yield scrapy.Request("https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=%d&sorttype=1&sortorder=4" % (i+1), callback=self.parse)

    def parse(self, response):
        for sel in response.xpath('//div[@class = "lijst"]/ul/li'):
            item = ThingsToGather()
            item["titel"] = ' '.join(sel.xpath('a/text()').extract())
            deeplink = ''.join(["https://zoek.officielebekendmakingen.nl/", ' '.join(sel.xpath('a/@href').extract())])
            request = scrapy.Request(deeplink, callback=self.get_page_info)
            request.meta['item'] = item
            yield request

    def get_page_info(self, response):
        for sel in response.xpath('//*[@id="Inhoud"]'):
            item = response.meta['item']

            # Load some general info from the header. If this string is shorter than 5 characters, the page
            # is probably a faulty link (i.e. an error 404), so the item is dropped. Otherwise, continue.
            if len(' '.join(sel.xpath('//div[contains(@class, "logo-nummer")]/div[contains(@class, "nummer")]/text()').extract())) < 5:
                raise DropItem()
            else:
                item["filename"] = ' '.join(sel.xpath('//*[@id="downloadPdfHyperLink"]/@href').extract())
                item['publicatiedatum'] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/available")]/text()').extract()
                item["publicatietype"] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/type")]/text()').extract()
                item = self.__normalise_item(item, response.url)

                # If the date string is shorter than 5 characters, the required data is not on this page and
                # has to be retrieved from the technical information link. Otherwise (the else clause), the
                # item is complete and is yielded.
                if len(item['publicatiedatum']) < 5:
                    tech_inf_link = ''.join(["https://zoek.officielebekendmakingen.nl/", ' '.join(sel.xpath('//*[@id="technischeInfoHyperlink"]/@href').extract())])
                    request = scrapy.Request(tech_inf_link, callback=self.get_date_info)
                    request.meta['item'] = item
                    yield request
                else:
                    yield item

    def get_date_info(self, response):
        for sel in response.xpath('//*[@id="Inhoud"]'):
            item = response.meta['item']
            item["filename"] = sel.xpath('//span[contains(@property, "http://standaarden.overheid.nl/oep/meta/publicationName")]/text()').extract()
            item['publicatiedatum'] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/available")]/text()').extract()
            item['publicatietype'] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/type")]/text()').extract()
            item["filename"] = ' '.join(sel.xpath('//*[@id="downloadPdfHyperLink"]/@href').extract())
            item = self.__normalise_item(item, response.url)
            return item

    # The methods below clean up strings: everything goes through __normalise_item to strip unwanted
    # characters and collapse double spaces.

    def __normalise_item(self, item, base_url):
        for key, value in vars(item).values()[0].iteritems():
            item[key] = self.__normalise(item[key])

        item['titel'] = item['titel'].replace(';', '& ')
        return item

    def __normalise(self, value):
        value = value if type(value) is not list else ' '.join(value)
        value = value.strip()
        value = " ".join(value.split())
        return value

Answer:

See paul trmbrth's remarks below. The problem is not Scrapy, but Excel.

For anyone running into this problem, the tl;dr is: import the data into Excel (via the Data menu in the ribbon) and switch the encoding from Windows (ANSI), or whatever it is set to, to Unicode (UTF-8).
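
To check the exported file outside of Excel, here is a minimal Python 2 sketch (assuming the export is named test.csv, as in the answer below); Python 2's csv module works on byte strings, so each cell is decoded from UTF-8 explicitly:

import csv

with open('test.csv', 'rb') as f:        # the csv module in Python 2 expects bytes
    for row in csv.reader(f):
        # decode every cell from UTF-8 to get proper unicode strings
        print [cell.decode('utf-8') for cell in row]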

Best Answer

Officiële will be represented as u'Offici\xeble' in Python 2, as shown in the example python shell session below (no need to worry about the \xXX characters; that is just how Python represents non-ASCII Unicode characters).

$ python
Python 2.7.9 (default, Apr 2 2015, 15:33:21)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'Officiële'
u'Offici\xeble'
>>> u'Offici\u00EBle'
u'Offici\xeble'
>>>

I think this is because it's saving in unicode instead of UTF-8

UTF-8 is an encoding; Unicode is not.

ë, a.k.a. U+00EB, a.k.a. LATIN SMALL LETTER E WITH DIAERESIS, is encoded in UTF-8 as 2 bytes: \xc3\xab

>>> u'Officiële'.encode('UTF-8')
'Offici\xc3\xable'
>>>

In the csv file, it changes it to officiële.

If you see this, you probably need to set the input encoding to UTF-8 when opening the CSV file in your program.
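
That officiële rendering is the classic symptom of UTF-8 bytes being decoded as Latin-1 (or Windows-1252). A minimal Python 2 illustration of the mechanism (not from the original answer):

>>> u'Officiële'.encode('utf-8')       # the correct UTF-8 bytes on disk
'Offici\xc3\xable'
>>> print u'Officiële'.encode('utf-8').decode('latin-1')   # what a Latin-1 reader displays
OfficiÃ«le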

Scrapy's CSV exporter writes Python Unicode strings to the output file as UTF-8 encoded strings.

Scrapy selectors output Unicode strings:

$ scrapy shell "https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4"
2016-03-15 10:44:51 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
(...)
2016-03-15 10:44:52 [scrapy] DEBUG: Crawled (200) <GET https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4> (referer: None)
(...)
In [1]: response.css('div.menu-bmslink > ul > li > a::text').extract()
Out[1]:
[u'Offici\xeble bekendmakingen vandaag',
 u'Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011',
 u'Uitleg nieuwe\r\n            nummering Staatscourant vanaf 1 juli 2009']

In [2]: for t in response.css('div.menu-bmslink > ul > li > a::text').extract():
   ...:     print t
   ...:
Officiële bekendmakingen vandaag
Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011
Uitleg nieuwe
            nummering Staatscourant vanaf 1 juli 2009

Let's see what CSV you get when a spider extracts these strings into items:

$ cat testspider.py
import scrapy


class TestSpider(scrapy.Spider):
    name = 'testspider'
    start_urls = ['https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4']

    def parse(self, response):
        for t in response.css('div.menu-bmslink > ul > li > a::text').extract():
            yield {"link": t}

Run the spider and ask for CSV output:

$ scrapy runspider testspider.py -o test.csv
2016-03-15 11:00:13 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
2016-03-15 11:00:13 [scrapy] INFO: Optional features available: ssl, http11
2016-03-15 11:00:13 [scrapy] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': 'test.csv'}
2016-03-15 11:00:14 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-03-15 11:00:14 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-15 11:00:14 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-15 11:00:14 [scrapy] INFO: Enabled item pipelines:
2016-03-15 11:00:14 [scrapy] INFO: Spider opened
2016-03-15 11:00:14 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-15 11:00:14 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-15 11:00:14 [scrapy] DEBUG: Crawled (200) <GET https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4> (referer: None)
2016-03-15 11:00:14 [scrapy] DEBUG: Scraped from <200 https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4>
{'link': u'Offici\xeble bekendmakingen vandaag'}
2016-03-15 11:00:14 [scrapy] DEBUG: Scraped from <200 https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4>
{'link': u'Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011'}
2016-03-15 11:00:14 [scrapy] DEBUG: Scraped from <200 https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4>
{'link': u'Uitleg nieuwe\r\n            nummering Staatscourant vanaf 1 juli 2009'}
2016-03-15 11:00:14 [scrapy] INFO: Closing spider (finished)
2016-03-15 11:00:14 [scrapy] INFO: Stored csv feed (3 items) in: test.csv
2016-03-15 11:00:14 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 488,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 12018,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 3, 15, 10, 0, 14, 991735),
 'item_scraped_count': 3,
 'log_count/DEBUG': 5,
 'log_count/INFO': 8,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 3, 15, 10, 0, 14, 59471)}
2016-03-15 11:00:14 [scrapy] INFO: Spider closed (finished)

Checking the content of the CSV file:

$ cat test.csv
link
Officiële bekendmakingen vandaag
Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011
"Uitleg nieuwe
nummering Staatscourant vanaf 1 juli 2009"
$ hexdump -C test.csv
00000000  6c 69 6e 6b 0d 0a 4f 66  66 69 63 69 c3 ab 6c 65  |link..Offici..le|
00000010  20 62 65 6b 65 6e 64 6d  61 6b 69 6e 67 65 6e 20  | bekendmakingen |
00000020  76 61 6e 64 61 61 67 0d  0a 55 69 74 6c 65 67 20  |vandaag..Uitleg |
00000030  6e 69 65 75 77 65 20 6e  75 6d 6d 65 72 69 6e 67  |nieuwe nummering|
00000040  20 48 61 6e 64 65 6c 69  6e 67 65 6e 20 76 61 6e  | Handelingen van|
00000050  61 66 20 31 20 6a 61 6e  75 61 72 69 20 32 30 31  |af 1 januari 201|
00000060  31 0d 0a 22 55 69 74 6c  65 67 20 6e 69 65 75 77  |1.."Uitleg nieuw|
00000070  65 0d 0a 20 20 20 20 20  20 20 20 20 20 20 20 6e  |e..            n|
00000080  75 6d 6d 65 72 69 6e 67  20 53 74 61 61 74 73 63  |ummering Staatsc|
00000090  6f 75 72 61 6e 74 20 76  61 6e 61 66 20 31 20 6a  |ourant vanaf 1 j|
000000a0  75 6c 69 20 32 30 30 39  22 0d 0a                 |uli 2009"..|
000000ab

You can verify that ë was correctly encoded as c3 ab.
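
The same check can be done from Python; a quick sketch, assuming test.csv is in the current directory:

>>> data = open('test.csv', 'rb').read()
>>> '\xc3\xab' in data                 # the two UTF-8 bytes for ë are present
True
>>> data.decode('utf-8')[:38]          # and the file decodes cleanly as UTF-8
u'link\r\nOffici\xeble bekendmakingen vandaag'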

For example, when using LibreOffice, the file data is displayed correctly (note "Character set: Unicode (UTF-8)"):

[Screenshot: opening test.csv in LibreOffice]

You are probably using Latin-1 instead. Here is what you get when Latin-1 rather than UTF-8 is used as the input encoding setting (again in LibreOffice):

[Screenshot: opening test.csv in LibreOffice with Latin-1 as the input encoding]

Regarding python - Scrapy exports strange symbols into csv file, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35989705/
