python - Stuck with Scrapy: following imgur links from subreddits

Reposted by: 太空宇宙. Updated: 2023-11-03 17:22:06


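I am scraping reddit to collect the link of every entry in a subreddit. I also want to follow links that match http://imgur.com/gallery/\w*, but the Imgur callback never runs. What is going wrong?

I detect Imgur URLs with a simple if "http://imgur.com/gallery/" in item['link'][0]: statement. Does Scrapy perhaps provide a better way to detect them?

Here is what I tried: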

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from reddit.items import RedditItem


class RedditSpider(CrawlSpider):
    name = "reddit"
    allowed_domains = ["reddit.com"]
    start_urls = [
        "http://www.reddit.com/r/pics",
    ]

    rules = [
        Rule(
            LinkExtractor(allow=[r'/r/pics/\?count=\d.*&after=\w.*']),
            callback='parse_item',
            follow=True
        )
    ]

    def parse_item(self, response):
        for title in response.xpath("//div[contains(@class, 'entry')]/p/a"):
            item = RedditItem()
            item['title'] = title.xpath('text()').extract()
            item['link'] = title.xpath('@href').extract()

            yield item

            if "http://imgur.com/gallery/" in item['link'][0]:
                # print(item['link'][0])
                url = response.urljoin(item['link'][0])
                print(url)
                yield scrapy.Request(url, callback=self.parse_imgur_gallery)

    def parse_imgur_gallery(self, response):
        print(response.url)

Here is my Item class:

import scrapy


class RedditItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()

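This is the output when I run the spider with --nolog and print the url variable inside the if condition (it is not the response.url output from the callback). The callback still does not run: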

PS C:\repos\python\scrapy\reddit> scrapy crawl --output=export.json --nolog reddit
http://imgur.com/gallery/W7sXs/new
http://imgur.com/gallery/v26KnSX
http://imgur.com/gallery/fqqBq
http://imgur.com/gallery/9GDTP/new
http://imgur.com/gallery/5gjLCPV
http://imgur.com/gallery/l6Tpavl
http://imgur.com/gallery/Ow4gQ
...
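
On the side question of detecting Imgur gallery links: one possible alternative to the substring test is a compiled regular expression anchored at the start of the URL. The sketch below uses only Python's re module. The names IMGUR_GALLERY_RE and is_imgur_gallery are hypothetical and not part of Scrapy, and the pattern assumes gallery URLs look like the ones printed above.

```python
import re

# Hypothetical helper, not part of Scrapy. Assumes gallery URLs have the
# form http(s)://imgur.com/gallery/<id>, optionally with a www. prefix.
IMGUR_GALLERY_RE = re.compile(r"^https?://(?:www\.)?imgur\.com/gallery/\w+")


def is_imgur_gallery(url):
    """Return True if url starts with an Imgur gallery address."""
    return IMGUR_GALLERY_RE.match(url) is not None
```

Unlike the substring test, the anchored pattern does not accept a URL on another host that merely contains the Imgur address somewhere in its query string.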

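Best answer

I found it. The imgur.com domain was not in allowed_domains, so Scrapy filtered the Imgur requests out as offsite and the callback was never reached. Just add it: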

allowed_domains = ["reddit.com", "imgur.com"]
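
Why the callback never fired can be illustrated without Scrapy. The sketch below is a simplified approximation of an allowed_domains check, not Scrapy's actual code: a host passes if it equals an allowed domain or is a subdomain of one. The function name is_offsite is hypothetical.

```python
from urllib.parse import urlparse


def is_offsite(url, allowed_domains):
    """Approximate an allowed_domains check (illustration only).

    A host is considered on-site if it equals an allowed domain or is a
    subdomain of one; everything else is treated as offsite.
    """
    host = (urlparse(url).hostname or "").lower()
    return not any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


gallery = "http://imgur.com/gallery/W7sXs/new"
print(is_offsite(gallery, ["reddit.com"]))               # True
print(is_offsite(gallery, ["reddit.com", "imgur.com"]))  # False
```

Running the spider without --nolog would typically also surface a "Filtered offsite request" debug message for the dropped Imgur requests, which makes this kind of problem easier to spot.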

Regarding "python - Stuck with Scrapy: following imgur links from subreddits", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33048105/
