python - Scraping an onclick page with scrapyjs and Splash


I'm trying to get the URL from a page that uses JavaScript like this:

<span onclick="go1()">click here </span>
<script>
function go1(){
    window.location = "../innerpages/" + myname + ".php";
}
</script>

Here is my code, using scrapyjs with Splash:

def start_requests(self):
    for url in self.start_urls:
        yield Request(url, self.parse, meta={
            'splash': {
                'endpoint': 'render.html',
                'args': {
                    'wait': 4,
                    'html': 1,
                    'png': 1,
                    'render_all': 1,
                    'js_source': 'document.getElementsByTagName("span")[0].click()',
                },
            }
        })

If I write

'js_source': 'document.title="hello world"'

it works.

It seems I can manipulate text inside the page, but I can't get the URL produced by go1().

What should I do if I want to get the URL from go1()?

Thanks!

Best Answer

You can use the /execute endpoint:

import json

import scrapy


class MySpider(scrapy.Spider):
    ...

    def start_requests(self):
        script = """
        function main(splash)
            local url = splash.args.url
            assert(splash:go(url))
            assert(splash:wait(1))

            assert(splash:runjs('document.getElementsByTagName("span")[0].click()'))
            assert(splash:wait(1))

            -- return result as a JSON object
            return {
                html = splash:html()
            }
        end
        """
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse_result, meta={
                'splash': {
                    'args': {'lua_source': script},
                    'endpoint': 'execute',
                }
            })

    def parse_result(self, response):
        # fetch the base URL, because response.url is the Splash endpoint
        baseurl = response.meta["_splash_processed"]["args"]["url"]

        # decode the JSON response (body_as_unicode() is spelled
        # response.text in newer Scrapy versions)
        splash_json = json.loads(response.body_as_unicode())

        # and build a new selector from the "html" key of that object
        selector = scrapy.Selector(text=splash_json["html"], type="html")

        ...
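
Since go1() navigates via a relative window.location, any links you extract from the rendered HTML will usually be relative as well; that is what baseurl is for. A minimal sketch of that last step, continuing parse_result (the a::attr(href) selector and the self.parse callback are illustrative assumptions, not part of the original answer):

from urllib.parse import urljoin

# hypothetical continuation of parse_result: resolve relative hrefs
# against the original page URL, since response.url here is the
# Splash endpoint rather than the scraped page
for href in selector.css('a::attr(href)').extract():
    yield scrapy.Request(urljoin(baseurl, href), self.parse)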

Regarding python - scraping an onclick page with scrapyjs and Splash, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/35052999/
