python - Scrapy: storing items across multiple FormRequest pages? Meta?

Reposted by: 行者123. Last updated: 2023-11-30 23:29:03

So I have my scraper working with a single form request. I can even see the terminal print the scraped data for the single-page version:

class MySpider(BaseSpider):
    name = "swim"
    start_urls = ["example.website"]
    download_delay = 30.0

    def parse(self, response):
        return [FormRequest.from_response(response, formname="TTForm",
                    formdata={"Ctype": "A", "Req_Team": "", "AgeGrp": "0-6",
                    "lowage": "", "highage": "", "sex": "W", "StrkDist": "10025",
                    "How_Many": "50", "foolOldPerl": ""},
                    callback=self.swimparse1, dont_click=True)]

    def swimparse1(self, response):
        open_in_browser(response)
        hxs = Selector(response)
        rows = hxs.xpath(".//tr")
        items = []

        for row in rows[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["free"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)

        return items

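However, when I add a second form request callback, only the items from the second form are scraped. It also only prints the scrape of the second page, as if it skipped the first page's scrape entirely: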

class MySpider(BaseSpider):
    name = "swim"
    start_urls = ["example.website"]
    download_delay = 30.0

    def parse(self, response):
        return [FormRequest.from_response(response, formname="TTForm",
                    formdata={"Ctype": "A", "Req_Team": "", "AgeGrp": "0-6",
                    "lowage": "", "highage": "", "sex": "W", "StrkDist": "10025",
                    "How_Many": "50", "foolOldPerl": ""},
                    callback=self.swimparse1, dont_click=True)]

    def swimparse1(self, response):
        open_in_browser(response)
        hxs = Selector(response)
        rows = hxs.xpath(".//tr")
        items = []

        for row in rows[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["free"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)
            # print item
        return [FormRequest.from_response(response, formname="TTForm",
                    formdata={"Ctype": "A", "Req_Team": "", "AgeGrp": "0-6",
                    "lowage": "", "highage": "", "sex": "W", "StrkDist": "40025",
                    "How_Many": "50", "foolOldPerl": ""},
                    callback=self.swimparse2, dont_click=True)]

    def swimparse2(self, response):
        open_in_browser(response)
        hxs = Selector(response)
        rows = hxs.xpath(".//tr")
        items = []

        for row in rows[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["fly"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)
            # print item
        return items

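My guesses: A) How do I export or return the items from the first scrape into the second, so that all the item data ends up together, as if it had been scraped from one page?

B) Or, if the first scrape is being skipped entirely, how can I stop that from happening and pass those items on to the next one?

Thanks!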

PS: I have also tried using:

item = response.request.meta = ["item"]
item = response.request.meta = []
item = response.request.meta = ["names":item, "age":item, "free":item, "team":item]

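All of these raise a KeyError or some other exception.

I also tried modifying the form request to include meta={"names":item, "age":item, "free":item, "team":item}. That raises no errors, but nothing gets scraped or stored.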

Edit: I tried using yield like this:

class MySpider(BaseSpider):
    name = "swim"
    start_urls = ["www.website.com"]
    download_delay = 30.0

    def parse(self, response):
        open_in_browser(response)
        hxs = Selector(response)
        rows = hxs.xpath(".//tr")
        items = []

        for row in rows[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["free"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)
            yield [FormRequest.from_response(response, formname="TTForm",
                    formdata={"Ctype": "A", "Req_Team": "", "AgeGrp": "0-6",
                    "lowage": "", "highage": "", "sex": "W", "StrkDist": "10025",
                    "How_Many": "50", "foolOldPerl": ""},
                    callback=self.parse, dont_click=True)]

        for row in rows[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["fly"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)

            yield [FormRequest.from_response(response, formname="TTForm",
                    formdata={"Ctype": "A", "Req_Team": "", "AgeGrp": "0-6",
                    "lowage": "", "highage": "", "sex": "W", "StrkDist": "40025",
                    "How_Many": "50", "foolOldPerl": ""},
                    callback=self.parse, dont_click=True)]

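Still nothing gets scraped. I know the XPaths are correct, because when I scrape just one form (using return instead of yield) it works perfectly. I've read the Scrapy docs, but they weren't very helpful :(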

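Best answer

You're missing a very simple solution: change return to yield.

Then you don't have to accumulate items in an array. Just yield as many items and requests as you need from the function, and Scrapy will do the rest.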
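To see why yielding helps, here is a toy model in plain Python of a crawl loop that consumes generator callbacks. It is not Scrapy's implementation: `FakeRequest`, `PAGES`, and `crawl` are stand-ins invented for illustration, and only the pattern (a callback yields items and a follow-up request from the same function) mirrors what the answer describes.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class FakeRequest:
    # Stand-in for a request object; NOT Scrapy's Request class.
    page: str
    callback: Callable


# Hypothetical "responses", keyed by page name.
PAGES = {
    "free": ["Ann 31.2", "Bea 33.0"],
    "fly": ["Ann 35.9", "Bea 37.4"],
}


def parse_free(rows):
    for row in rows:
        yield {"free": row}              # items are handed over one by one
    yield FakeRequest("fly", parse_fly)  # then the follow-up request


def parse_fly(rows):
    for row in rows:
        yield {"fly": row}


def crawl(first_request):
    """Toy engine loop: collect dicts as items, queue FakeRequests."""
    items, queue = [], deque([first_request])
    while queue:
        request = queue.popleft()
        for result in request.callback(PAGES[request.page]):
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                items.append(result)
    return items


items = crawl(FakeRequest("free", parse_free))
print(len(items))  # 4: two items from each page
```

With `return`, `parse_free` could hand back either its items or the follow-up request, which is how the first page's items got lost in the question.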

From the Scrapy docs:

from scrapy.selector import Selector
from scrapy.spider import Spider
from scrapy.http import Request
from myproject.items import MyItem

class MySpider(Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        sel = Selector(response)
        for h3 in sel.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in sel.xpath('//a/@href').extract():
            yield Request(url, callback=self.parse)

This article covers "python - Scrapy: storing items across multiple FormRequest pages? Meta?". A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21323123/
