python - Scrapy: multiple "start_urls" yield duplicated results

Reposted. Author: 行者123. Updated: 2023-12-01 08:44:54

Although my simple code looks fine according to the official document, it produces unexpectedly duplicated results, for example:

  • 9 rows/results when 3 URLs are set
  • 4 rows/results when 2 URLs are set

(each test page has a single table row, so N URLs produce N² rows)

My code works fine when I set only 1 URL. I have also tried the answer solution in this SO question, but it did not solve my problem.

[Scrapy command]

$ scrapy crawl test -o test.csv

[Scrapy spider: test.py]

import scrapy
from ..items import TestItem

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = [
        'file:///Users/Name/Desktop/tutorial/test1.html',
        'file:///Users/Name/Desktop/tutorial/test2.html',
        'file:///Users/Name/Desktop/tutorial/test3.html',
    ]

    def parse(self, response):
        for url in self.start_urls:
            table_rows = response.xpath('//table/tbody/tr')

            for table_row in table_rows:
                item = TestItem()
                item['test_01'] = table_row.xpath('td[1]/text()').extract_first()
                item['test_02'] = table_row.xpath('td[2]/text()').extract_first()

                yield item

[Target HTML: test1.html, test2.html, test3.html]

<html>
  <head>
    <title>test2</title> <!-- Same as the file name -->
  </head>
  <body>
    <table>
      <tbody>
        <tr>
          <td>test2 A1</td> <!-- Same as the file name -->
          <td>test2 B1</td> <!-- Same as the file name -->
        </tr>
      </tbody>
    </table>
  </body>
</html>

[Generated CSV results for 3 URLs]

test_01,test_02
test1 A1,test1 B1
test1 A1,test1 B1
test1 A1,test1 B1
test2 A1,test2 B1
test2 A1,test2 B1
test2 A1,test2 B1
test3 A1,test3 B1
test3 A1,test3 B1
test3 A1,test3 B1

[Expected results for 3 URLs]

test_01,test_02
test1 A1,test1 B1
test2 A1,test2 B1
test3 A1,test3 B1

[Generated CSV results for 2 URLs]

test_01,test_02
test1 A1,test1 B1
test1 A1,test1 B1
test2 A1,test2 B1
test2 A1,test2 B1

[Expected results for 2 URLs]

test_01,test_02
test1 A1,test1 B1
test2 A1,test2 B1

Best Answer

You are iterating over start_urls again, which you don't need to do: Scrapy already requests each of those URLs for you and calls parse() once per response. So you are effectively looping over start_urls twice, and every response's rows get yielded once per URL in the list.
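The multiplication is easy to reproduce with a few lines of plain Python, no Scrapy needed (the file names and single-row pages below just mirror the question's setup):

```python
# Sketch of the bug: each parsed response re-loops over start_urls,
# so every row is yielded len(start_urls) times.
start_urls = ['test1.html', 'test2.html', 'test3.html']

# One table row per page, matching the question's test HTML files.
rows_per_page = {url: [f'{url} A1'] for url in start_urls}

def buggy_parse(response_url):
    # Mirrors the question's parse(): an extra loop over start_urls
    # that does not use the loop variable at all.
    for _ in start_urls:
        for row in rows_per_page[response_url]:
            yield row

# Scrapy calls parse() once per response; simulate that here.
items = [item for url in start_urls for item in buggy_parse(url)]
print(len(items))  # → 9 (3 pages x 3 redundant iterations, not 3)
```

With 2 URLs the same logic gives 2 × 2 = 4 rows, exactly the counts reported above.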

Try this:

import scrapy
from ..items import TestItem

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = [
        'file:///Users/Name/Desktop/tutorial/test1.html',
        'file:///Users/Name/Desktop/tutorial/test2.html',
        'file:///Users/Name/Desktop/tutorial/test3.html',
    ]

    def parse(self, response):
        table_rows = response.xpath('//table/tbody/tr')

        for table_row in table_rows:
            item = TestItem()
            item['test_01'] = table_row.xpath('td[1]/text()').extract_first()
            item['test_02'] = table_row.xpath('td[2]/text()').extract_first()

            yield item

Regarding python - Scrapy: multiple "start_urls" yield duplicated results, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53348203/
