
python - Scrapy Splash always returns the same page

Reposted. Author: 行者123. Updated: 2023-11-28 18:20:12

For each of several Disqus users whose profile URLs are known in advance, I want to scrape their name and the usernames of their followers. I am using scrapy-splash to do this. However, when I parse the responses, it always seems to be scraping the first user's page. I tried setting wait to 10 and dont_filter to True, but it doesn't work. What should I do now?

Here is my spider:

import scrapy
from disqus.items import DisqusItem

class DisqusSpider(scrapy.Spider):
    name = "disqusSpider"
    start_urls = ["https://disqus.com/by/disqus_sAggacVY39/", "https://disqus.com/by/VladimirUlayanov/", "https://disqus.com/by/Beasleyhillman/", "https://disqus.com/by/Slick312/"]
    splash_def = {"endpoint" : "render.html", "args" : {"wait" : 10}}

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url = url, callback = self.parse_basic, dont_filter = True, meta = {
                "splash" : self.splash_def,
                "base_profile_url" : url
            })

    def parse_basic(self, response):
        name = response.css("h1.cover-profile-name.text-largest.truncate-line::text").extract_first()
        disqusItem = DisqusItem(name = name)
        request = scrapy.Request(url = response.meta["base_profile_url"] + "followers/", callback = self.parse_followers, dont_filter = True, meta = {
            "item" : disqusItem,
            "base_profile_url" : response.meta["base_profile_url"],
            "splash": self.splash_def
        })
        print "parse_basic", response.url, request.url
        yield request

    def parse_followers(self, response):
        print "parse_followers", response.meta["base_profile_url"], response.meta["item"]
        followers = response.css("div.user-info a::attr(href)").extract()

DisqusItem is defined as follows:

class DisqusItem(scrapy.Item):
    name = scrapy.Field()
    followers = scrapy.Field()
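As an aside, the selection that parse_followers performs with response.css("div.user-info a::attr(href)") can be mimicked with only the standard library: collect the href attributes of anchors nested inside div elements with class user-info. The sketch below uses html.parser instead of Scrapy's selectors, and the HTML snippet (including the follower names) is invented for illustration:

```python
from html.parser import HTMLParser

class FollowerLinkParser(HTMLParser):
    """Collect hrefs of <a> tags that are descendants of <div class="user-info">."""
    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth inside a div.user-info subtree
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and "user-info" in attrs.get("class", "").split():
            self.depth += 1
        elif self.depth > 0:
            if tag == "div":
                self.depth += 1  # track nested divs so we know when we leave
            if tag == "a" and "href" in attrs:
                self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

# Invented stand-in for a rendered followers page.
html = """
<div class="user-info">
  <a href="https://disqus.com/by/follower_one/">follower_one</a>
</div>
<div class="user-info">
  <a href="https://disqus.com/by/follower_two/">follower_two</a>
</div>
"""

parser = FollowerLinkParser()
parser.feed(html)
print(parser.hrefs)
```

This only matters if the rendered HTML actually reaches the callback; the question's problem is that the same page is rendered every time, not the extraction itself.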

Here are the results:

2017-08-07 23:09:12 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://localhost:8050/render.html> (referer: None)
parse_followers https://disqus.com/by/disqus_sAggacVY39/ {'name': u'Trailer Trash'}
2017-08-07 23:09:14 [scrapy.extensions.logstats] INFO: Crawled 5 pages (at 5 pages/min), scraped 0 items (at 0 items/min)
2017-08-07 23:09:18 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://localhost:8050/render.html> (referer: None)
parse_followers https://disqus.com/by/VladimirUlayanov/ {'name': u'Trailer Trash'}
2017-08-07 23:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://localhost:8050/render.html> (referer: None)
parse_followers https://disqus.com/by/Beasleyhillman/ {'name': u'Trailer Trash'}
2017-08-07 23:09:40 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://localhost:8050/render.html> (referer: None)
parse_followers https://disqus.com/by/Slick312/ {'name': u'Trailer Trash'}

Here is the settings.py file:

# -*- coding: utf-8 -*-

# Scrapy settings for disqus project
#

BOT_NAME = 'disqus'

SPIDER_MODULES = ['disqus.spiders']
NEWSPIDER_MODULE = 'disqus.spiders'

ROBOTSTXT_OBEY = False

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
DUPEFILTER_DEBUG = True

DOWNLOAD_DELAY = 10
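One thing that stands out in these settings: DUPEFILTER_CLASS points at the old scrapyjs package, while the middlewares come from scrapy_splash. For reference, the scrapy-splash README recommends the following combination (a sketch of the documented defaults, not necessarily the fix for this question):

```python
# Settings as recommended by the scrapy-splash README; note that the
# dupe filter comes from scrapy_splash, not the older scrapyjs package.
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```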

Best answer

I was able to get it to work by using SplashRequest instead of scrapy.Request.

For example:

import scrapy
from disqus.items import DisqusItem
from scrapy_splash import SplashRequest


class DisqusSpider(scrapy.Spider):
    name = "disqusSpider"
    start_urls = ["https://disqus.com/by/disqus_sAggacVY39/", "https://disqus.com/by/VladimirUlayanov/", "https://disqus.com/by/Beasleyhillman/", "https://disqus.com/by/Slick312/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_basic, dont_filter = True, endpoint='render.json',
                                args={
                                    'wait': 2,
                                    'html': 1
                                })
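For context on the endpoint switch: per the Splash HTTP API documentation, render.json returns a JSON object, and passing html: 1 includes the rendered page under its "html" key (scrapy-splash normally unpacks this for you, so the callback still sees the rendered HTML). A stdlib-only sketch of what such a reply body contains, using a made-up stand-in for a real Splash response:

```python
import json

# Made-up stand-in for the body of a Splash render.json reply with html=1;
# a real reply also carries fields such as "requestedUrl" and "geometry".
body = '''
{
  "url": "https://disqus.com/by/disqus_sAggacVY39/",
  "html": "<html><body><h1 class=\\"cover-profile-name\\">Trailer Trash</h1></body></html>"
}
'''

data = json.loads(body)
rendered_html = data["html"]   # the fully rendered DOM as a string
print(data["url"])
print(rendered_html.startswith("<html>"))
```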

Regarding "python - Scrapy Splash always returns the same page", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/45555878/
