
python - Scrapy: storing items across multiple form request pages? Meta?


So I have my scraper working with one form request. I can even see the terminal print out the scraped data in the single-page version:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser
from myproject.items import swimItem  # the project's Item class

class MySpider(BaseSpider):
    name = "swim"
    start_urls = ["example.website"]
    DOWNLOAD_DELAY = 30.0

    def parse(self, response):
        # Submit the TTForm with the first set of query fields.
        return [FormRequest.from_response(
            response, formname="TTForm",
            formdata={"Ctype": "A", "Req_Team": "", "AgeGrp": "0-6",
                      "lowage": "", "highage": "", "sex": "W",
                      "StrkDist": "10025", "How_Many": "50",
                      "foolOldPerl": ""},
            callback=self.swimparse1, dont_click=True)]

    def swimparse1(self, response):
        open_in_browser(response)
        rows = Selector(response).xpath(".//tr")
        items = []

        # Rows 4-53 hold the results table.
        for row in rows[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["free"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)

        return items

However, when I add a second form request and callback, it only scrapes the items from the second form. It also only prints the second page's scrape, as if the first page's scrape were being skipped entirely:

class MySpider(BaseSpider):
    name = "swim"
    start_urls = ["example.website"]
    DOWNLOAD_DELAY = 30.0

    def parse(self, response):
        return [FormRequest.from_response(
            response, formname="TTForm",
            formdata={"Ctype": "A", "Req_Team": "", "AgeGrp": "0-6",
                      "lowage": "", "highage": "", "sex": "W",
                      "StrkDist": "10025", "How_Many": "50",
                      "foolOldPerl": ""},
            callback=self.swimparse1, dont_click=True)]

    def swimparse1(self, response):
        open_in_browser(response)
        rows = Selector(response).xpath(".//tr")
        items = []

        for row in rows[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["free"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)

        # Return the second form request for the next set of results.
        return [FormRequest.from_response(
            response, formname="TTForm",
            formdata={"Ctype": "A", "Req_Team": "", "AgeGrp": "0-6",
                      "lowage": "", "highage": "", "sex": "W",
                      "StrkDist": "40025", "How_Many": "50",
                      "foolOldPerl": ""},
            callback=self.swimparse2, dont_click=True)]

    def swimparse2(self, response):
        open_in_browser(response)
        rows = Selector(response).xpath(".//tr")
        items = []

        for row in rows[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["fly"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)

        return items

My guesses: A) How do I export or pass the items from the first scrape into the second one, so that all the item data ends up together at the end, as if it were scraped from a single page?

B) Or, if the first scrape is being skipped entirely, how do I stop it from being skipped and pass those items along to the next callback?

Thanks!

PS: I have also tried using:

item = response.request.meta["item"]
item = response.request.meta[]
item = response.request.meta["names": item, "age": item, "free": item, "team": item]

All of these raise a KeyError or some other exception.

I also tried modifying the form request to include meta={"names": item, "age": item, "free": item, "team": item}. That raises no errors, but nothing gets scraped or stored.
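For reference, the usual pattern for carrying data between callbacks is to stash it under a single key in the request's meta dict and read it back from response.meta in the next callback. A minimal sketch of that pattern against the spider above (the "items" key name is arbitrary, and swimItem is the question's own item class); this is one possible approach, not necessarily the only fix:

    def swimparse1(self, response):
        items = []
        for row in Selector(response).xpath(".//tr")[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["free"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)

        # Hand the first page's items to the next callback via meta.
        return FormRequest.from_response(
            response, formname="TTForm",
            formdata={"Ctype": "A", "Req_Team": "", "AgeGrp": "0-6",
                      "lowage": "", "highage": "", "sex": "W",
                      "StrkDist": "40025", "How_Many": "50",
                      "foolOldPerl": ""},
            meta={"items": items},  # arbitrary key, read back below
            callback=self.swimparse2, dont_click=True)

    def swimparse2(self, response):
        # Read back the items collected by the previous callback.
        items = response.meta["items"]
        for row in Selector(response).xpath(".//tr")[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["fly"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)
        return items  # both pages' items together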

EDIT: I tried using yield like this:

class MySpider(BaseSpider):
    name = "swim"
    start_urls = ["www.website.com"]
    DOWNLOAD_DELAY = 30.0

    def parse(self, response):
        open_in_browser(response)
        rows = Selector(response).xpath(".//tr")
        items = []

        for row in rows[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["free"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)
        yield [FormRequest.from_response(
            response, formname="TTForm",
            formdata={"Ctype": "A", "Req_Team": "", "AgeGrp": "0-6",
                      "lowage": "", "highage": "", "sex": "W",
                      "StrkDist": "10025", "How_Many": "50",
                      "foolOldPerl": ""},
            callback=self.parse, dont_click=True)]

        for row in rows[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["fly"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)

        yield [FormRequest.from_response(
            response, formname="TTForm",
            formdata={"Ctype": "A", "Req_Team": "", "AgeGrp": "0-6",
                      "lowage": "", "highage": "", "sex": "W",
                      "StrkDist": "40025", "How_Many": "50",
                      "foolOldPerl": ""},
            callback=self.parse, dont_click=True)]

Still nothing gets scraped. I know the xpaths are correct, because when I scrape just one form (using return instead of yield) it works fine. I have read the Scrapy docs, but they are not much help :(

Best Answer

You are missing a very simple fix: change return to yield.

Then you do not have to accumulate items in a list at all; just yield as many items and requests from the function as you need, each Item or Request object directly rather than a list of them, and Scrapy will do the rest.

From the Scrapy docs:

from scrapy.selector import Selector
from scrapy.spider import Spider
from scrapy.http import Request
from myproject.items import MyItem

class MySpider(Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        sel = Selector(response)
        for h3 in sel.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in sel.xpath('//a/@href').extract():
            yield Request(url, callback=self.parse)
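Applied to the spider in the question, that means yielding each item and then the follow-up request directly, with no list wrapper around either. A rough sketch of the first callback using the question's own form fields (not tested against the real site):

    def swimparse1(self, response):
        for row in Selector(response).xpath(".//tr")[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["free"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            yield item  # each item goes straight to the pipeline

        # Then yield the next request itself -- not wrapped in a list.
        yield FormRequest.from_response(
            response, formname="TTForm",
            formdata={"Ctype": "A", "Req_Team": "", "AgeGrp": "0-6",
                      "lowage": "", "highage": "", "sex": "W",
                      "StrkDist": "40025", "How_Many": "50",
                      "foolOldPerl": ""},
            callback=self.swimparse2, dont_click=True)

The second callback does the same: yield its items one by one, with no need to merge them with the first page's, since Scrapy collects everything yielded across all callbacks.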

For python - Scrapy: storing items across multiple form request pages? Meta?, see the similar question on Stack Overflow: https://stackoverflow.com/questions/21323123/
