
python - How can I keep an array from being reset after a callback?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 14:32:31

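I want to use scrapy to scrape review data from a website. The code is below.

The problem is that every time the program moves to the next page, it starts over from the top of parse (because of the callback) and resets records[]. The array is therefore empty again, and every review saved in records[] is lost. As a result, when I open the csv file I only see the reviews from the last page.

What I want is for all of the data to end up in my csv file, so that records[] is not reset each time the next page is requested. I cannot put the line records = [] before the parse method, because the array is then undefined inside it.

Here is my code: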

def parse(self, response):
    records = []

    for r in response.xpath('//div[contains(@class, "a-section review")]'):
        rtext = r.xpath('.//div[contains(@class, "a-row review-data")]').extract_first()
        rating = r.xpath('.//span[contains(@class, "a-icon-alt")]/text()').extract_first()
        votes = r.xpath('normalize-space(.//span[contains(@class, "review-votes")]/text())').extract_first()

        if not votes:
            votes = "none"

        records.append((rating, votes, rtext))
        print(records)

    nextPage = response.xpath('//li[contains(@class, "a-last")]/a/@href').extract_first()
    if nextPage:
        nextPage = response.urljoin(nextPage)
        yield scrapy.Request(url=nextPage)

    import pandas as pd
    df = pd.DataFrame(records, columns=['rating', 'votes', 'rtext'])
    df.to_csv('ama.csv', sep='|', index=False, encoding='utf-8')

Best answer

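Moving the declaration of records into the method signature relies on a common gotcha described in the Python docs. In this case, however, the odd behavior of instantiating a list in the method declaration works in your favor.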

Python’s default arguments are evaluated once when the function is defined, not each time the function is called (like it is in say, Ruby). This means that if you use a mutable default argument and mutate it, you will and have mutated that object for all future calls to the function as well.
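
The behavior described in the quote can be seen in isolation with a minimal sketch. The helper name append_review and the sample strings are hypothetical and only serve the illustration; they are not part of the original answer.

```python
def append_review(review, records=[]):
    # The default list is created once, when the def statement runs,
    # so every call that omits `records` shares the same list object.
    records.append(review)
    return records


first = append_review("page 1 review")
second = append_review("page 2 review")

# Both names refer to the one shared default list.
print(second)
print(first is second)
```

Because both calls return the same object, anything appended during an earlier call is still there on the next one, which is exactly what the scraper needs across pages.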

def parse(self, response, records=[]):

    for r in response.xpath('//div[contains(@class, "a-section review")]'):
        rtext = r.xpath('.//div[contains(@class, "a-row review-data")]').extract_first()
        rating = r.xpath('.//span[contains(@class, "a-icon-alt")]/text()').extract_first()
        votes = r.xpath('normalize-space(.//span[contains(@class, "review-votes")]/text())').extract_first()

        if not votes:
            votes = "none"

        records.append((rating, votes, rtext))
        print(records)

    nextPage = response.xpath('//li[contains(@class, "a-last")]/a/@href').extract_first()
    if nextPage:
        nextPage = response.urljoin(nextPage)
        yield scrapy.Request(url=nextPage)

    import pandas as pd
    df = pd.DataFrame(records, columns=['rating', 'votes', 'rtext'])
    df.to_csv('ama.csv', sep='|', index=False, encoding='utf-8')

The approach above is a bit odd. A more general solution is simply to use a global variable. Here is a post going over how to use globals.
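
As a rough sketch of the global-variable idea, the snippet below keeps records at module level and simulates two page callbacks without Scrapy. The names collect_page and write_csv and the sample review rows are assumptions made up for the illustration, and the standard csv module stands in for pandas so that the sketch runs on its own.

```python
import csv
import io

# Module-level list: created once at import time, so it survives every callback.
records = []


def collect_page(page_reviews):
    """Stand-in for a parse callback: append one page's rows to the shared list."""
    for rating, votes, rtext in page_reviews:
        records.append((rating, votes or "none", rtext))


def write_csv(out):
    """Write everything collected so far, once, after the last page."""
    writer = csv.writer(out, delimiter="|")
    writer.writerow(["rating", "votes", "rtext"])
    writer.writerows(records)


# Two simulated pages (made-up data, for illustration only).
collect_page([("5.0 out of 5 stars", "3 people found this helpful", "Great")])
collect_page([("1.0 out of 5 stars", "", "Broke quickly")])

buffer = io.StringIO()
write_csv(buffer)
print(buffer.getvalue())
```

In a real spider the same list could probably live on the spider instance (for example as self.records) instead of at module level, with the file written once when the crawl finishes rather than on every page; that variant is a suggestion here, not something the original answer shows.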

Regarding "python - How can I keep an array from being reset after a callback?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/47185209/
