
javascript - Scrapyjs + Splash: clicking a controller button

Reposted · Author: 数据小太阳 · Updated: 2023-10-29 05:12:00

Hi, I have installed Scrapyjs + Splash and I am using the code below:

import json

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spider import Spider
from scrapy.selector import Selector
import urlparse, random

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["whoscored.com"]
    start_urls = ['http://www.whoscored.com/Regions/81/Tournaments/3/Seasons/4336/Stages/9192/Fixtures/Germany-Bundesliga-2014-2015']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse(self, response):
        cnt = 0
        with open('links2.txt', 'a') as f:
            while True:
                try:
                    data = ''.join(Selector(text=response.body).xpath('//a[@class="match-link match-report rc"]/@href')[cnt].extract())
                    data = "https://www.whoscored.com" + data
                except:
                    break
                f.write(data + '\n')
                cnt += 1

So far this works fine, but now I want to click the "previous" button in the controller, which has no id and no real href.

I tried

splash:runjs("$('#date-controller').click()")

splash:runjs("window.location = document.getElementsByTagName('a')[64].href")

but neither worked.
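(Side note: `splash:runjs` belongs to Splash's Lua scripting API, so it only takes effect inside a script sent to the `/execute` endpoint, which is what the accepted answer below does. With the `render.html` endpoint that the spider above already uses, JavaScript can instead be passed through the endpoint's `js_source` argument. A minimal sketch of the request meta; the CSS selector here is illustrative, not verified against the page:)

```python
# Sketch: render.html also accepts "js_source", which Splash executes in
# the page context before rendering. The selector below is hypothetical.
js = "document.querySelector('#date-controller a').click();"

splash_meta = {
    'splash': {
        'endpoint': 'render.html',
        'args': {
            'wait': 0.5,          # give the page time to settle
            'js_source': js,      # run this JS before taking the HTML
        },
    }
}

print(splash_meta['splash']['args']['js_source'])
```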

Best Answer

Here is a basic (but working) example of how to pass JavaScript code inside a Splash Lua script using the /execute endpoint.

# -*- coding: utf-8 -*-
import json
from six.moves.urllib.parse import urljoin

import scrapy


class WhoscoredspiderSpider(scrapy.Spider):
    name = "whoscoredspider"
    allowed_domains = ["whoscored.com"]
    start_urls = (
        'http://www.whoscored.com/Regions/81/Tournaments/3/Seasons/4336/Stages/9192/Fixtures/Germany-Bundesliga-2014-2015',
    )

    def start_requests(self):
        script = """
        function main(splash)
            local url = splash.args.url
            assert(splash:go(url))
            assert(splash:wait(1))

            -- go back 1 month in time and wait a little (1 second)
            assert(splash:runjs("$('#date-controller > a:first-child').click()"))
            assert(splash:wait(1))

            -- return result as a JSON object
            return {
                html = splash:html(),
                -- we don't need screenshot or network activity
                --png = splash:png(),
                --har = splash:har(),
            }
        end
        """
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse_result, meta={
                'splash': {
                    'args': {'lua_source': script},
                    'endpoint': 'execute',
                }
            })

    def parse_result(self, response):
        # fetch base URL because response url is the Splash endpoint
        baseurl = response.meta["splash"]["args"]["url"]

        # decode JSON response
        splash_json = json.loads(response.body_as_unicode())

        # and build a new selector from the "html" key of that object
        selector = scrapy.Selector(text=splash_json["html"], type="html")

        # loop on the fixtures table
        for table in selector.css('table#tournament-fixture'):

            # separating on each date (<tr> elements with a <th>)
            for cnt, header in enumerate(table.css('tr.rowgroupheader'), start=1):
                self.logger.info("date: %s" % header.xpath('string()').extract_first())

                # after each date, look for sibling <tr> elements
                # that have only N preceding tr/th,
                # N being the number of headers seen so far
                for row in header.xpath('''
                        ./following-sibling::tr[not(th/@colspan)]
                        [count(preceding-sibling::tr[th/@colspan])=%d]''' % cnt):
                    self.logger.info("record: %s" % row.xpath('string()').extract_first())
                    match_report_href = row.css('td > a.match-report::attr(href)').extract_first()
                    if match_report_href:
                        self.logger.info("match report: %s" % urljoin(baseurl, match_report_href))
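The sibling-counting XPath above attaches each fixture row to the most recent date header. The equivalent grouping, sketched in plain Python over simplified row tuples (illustrative data, not the real table markup):

```python
# Group fixture rows under the most recent date header, mirroring what
# count(preceding-sibling::tr[th/@colspan]) = N does in the XPath.
rows = [
    ("header", "Saturday, Apr 4 2015"),       # <tr class="rowgroupheader">
    ("match", "Werder Bremen 0:0 Mainz 05"),
    ("match", "Eintracht Frankfurt 2:2 Hannover 96"),
    ("header", "Sunday, Apr 26 2015"),
    ("match", "Paderborn 2:2 Werder Bremen"),
]

grouped = {}
current_date = None
for kind, text in rows:
    if kind == "header":
        current_date = text          # a new date starts a new group
        grouped[current_date] = []
    else:
        grouped[current_date].append(text)   # fixture row under that date

for date, matches in grouped.items():
    print(date, "->", len(matches), "matches")
```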

Sample log:

$ scrapy crawl whoscoredspider 
2016-03-07 19:21:38 [scrapy] INFO: Scrapy 1.0.5 started (bot: whoscored)
(...stripped...)
2016-03-07 19:21:38 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, SplashMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-07 19:21:38 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-07 19:21:38 [scrapy] INFO: Enabled item pipelines:
2016-03-07 19:21:38 [scrapy] INFO: Spider opened
2016-03-07 19:21:38 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-07 19:21:43 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/execute> (referer: None)
2016-03-07 19:21:43 [whoscoredspider] INFO: date: Saturday, Apr 4 2015
2016-03-07 19:21:43 [whoscoredspider] INFO: record: 14:30FTWerder Bremen0 : 0Mainz 05Match Report2
2016-03-07 19:21:43 [whoscoredspider] INFO: match report: http://www.whoscored.com/Matches/834843/MatchReport
2016-03-07 19:21:43 [whoscoredspider] INFO: record: 14:30FTEintracht Frankfurt2 : 2Hannover 96Match Report1
2016-03-07 19:21:43 [whoscoredspider] INFO: match report: http://www.whoscored.com/Matches/834847/MatchReport
(...stripped...)
2016-03-07 19:21:43 [whoscoredspider] INFO: date: Sunday, Apr 26 2015
2016-03-07 19:21:43 [whoscoredspider] INFO: record: 14:30FT1Paderborn2 : 2Werder BremenMatch Report2
2016-03-07 19:21:43 [whoscoredspider] INFO: match report: http://www.whoscored.com/Matches/834837/MatchReport
2016-03-07 19:21:43 [whoscoredspider] INFO: record: 16:30FTBorussia M.Gladbach1 : 0WolfsburgMatch Report12
2016-03-07 19:21:43 [whoscoredspider] INFO: match report: http://www.whoscored.com/Matches/834809/MatchReport
2016-03-07 19:21:43 [scrapy] INFO: Closing spider (finished)
2016-03-07 19:21:43 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1015,
'downloader/request_count': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 143049,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 3, 7, 18, 21, 43, 662973),
'log_count/DEBUG': 2,
'log_count/INFO': 90,
'log_count/WARNING': 3,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'splash/execute/request_count': 1,
'splash/execute/response_count/200': 1,
'start_time': datetime.datetime(2016, 3, 7, 18, 21, 38, 772848)}
2016-03-07 19:21:43 [scrapy] INFO: Spider closed (finished)
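Note that `parse_result` rebuilds absolute links with `urljoin` because `response.url` points at the Splash /execute endpoint rather than the crawled page. For example, with a root-relative match-report href like the ones in the log above:

```python
from urllib.parse import urljoin  # six.moves.urllib.parse resolves to this on Python 3

# the base URL recovered from the request meta, as parse_result does
baseurl = ("http://www.whoscored.com/Regions/81/Tournaments/3/Seasons/4336/"
           "Stages/9192/Fixtures/Germany-Bundesliga-2014-2015")
# a root-relative match-report href as it appears in the table
href = "/Matches/834843/MatchReport"

print(urljoin(baseurl, href))  # -> http://www.whoscored.com/Matches/834843/MatchReport
```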

Regarding "javascript - Scrapyjs + Splash: clicking a controller button", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35720323/
