
python - Stuck with scrapy, following imgur links from subreddits


I'm crawling reddit to get the link for every entry in a subreddit. I also want to follow links matching http://imgur.com/gallery/\w*. But I'm having trouble getting the Imgur callback to run. It simply never executes. What's wrong?

I'm detecting the Imgur URLs with a simple if "http://imgur.com/gallery/" in item['link'][0]: statement. Does scrapy perhaps provide a better way to detect them?
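For example, I imagine something like a second Rule in the spider below could do the matching instead of the string check. This is just a rough, untested sketch on my part (the gallery regex is a guess):

rules = [
    # pagination rule, same as in the spider below
    Rule(
        LinkExtractor(allow=['/r/pics/\?count=\d.*&after=\w.*']),
        callback='parse_item',
        follow=True
    ),
    # guessed rule: let the LinkExtractor pick out imgur gallery
    # links directly and send them to the imgur callback
    Rule(
        LinkExtractor(allow=[r'https?://imgur\.com/gallery/\w+']),
        callback='parse_imgur_gallery'
    ),
]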

Here's what I've tried:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from reddit.items import RedditItem


class RedditSpider(CrawlSpider):
    name = "reddit"
    allowed_domains = ["reddit.com"]
    start_urls = [
        "http://www.reddit.com/r/pics",
    ]

    # follow subreddit pagination and hand each page to parse_item
    rules = [
        Rule(
            LinkExtractor(allow=['/r/pics/\?count=\d.*&after=\w.*']),
            callback='parse_item',
            follow=True
        )
    ]

    def parse_item(self, response):
        for title in response.xpath("//div[contains(@class, 'entry')]/p/a"):
            item = RedditItem()
            item['title'] = title.xpath('text()').extract()
            item['link'] = title.xpath('@href').extract()

            yield item

            # if the entry links to an imgur gallery, request it
            if "http://imgur.com/gallery/" in item['link'][0]:
                # print item['link'][0]
                url = response.urljoin(item['link'][0])
                print url
                yield scrapy.Request(url, callback=self.parse_imgur_gallery)

    def parse_imgur_gallery(self, response):
        print response.url

This is my Item class:

import scrapy


class RedditItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()

This is the output when executing the spider with --nolog and printing the url variable inside the if condition (so it's the url variable being printed, not response.url). The callback still never runs:

PS C:\repos\python\scrapy\reddit> scrapy crawl --output=export.json --nolog reddit
http://imgur.com/gallery/W7sXs/new
http://imgur.com/gallery/v26KnSX
http://imgur.com/gallery/fqqBq
http://imgur.com/gallery/9GDTP/new
http://imgur.com/gallery/5gjLCPV
http://imgur.com/gallery/l6Tpavl
http://imgur.com/gallery/Ow4gQ
...

Best Answer

Found it. The imgur.com domain wasn't allowed, so Scrapy's OffsiteMiddleware was silently filtering the requests. Just add it...

allowed_domains = ["reddit.com", "imgur.com"]
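If you'd rather not widen allowed_domains, an alternative (a sketch of my own, not from the accepted answer) is to flag the individual request so the offsite filter lets it through. Note that dont_filter=True also bypasses the duplicate filter, so use it with care:

    def parse_item(self, response):
        for title in response.xpath("//div[contains(@class, 'entry')]/p/a"):
            item = RedditItem()
            item['title'] = title.xpath('text()').extract()
            item['link'] = title.xpath('@href').extract()
            yield item

            if "http://imgur.com/gallery/" in item['link'][0]:
                url = response.urljoin(item['link'][0])
                # dont_filter=True makes OffsiteMiddleware (and the
                # duplicate filter) let this request through, so
                # imgur.com doesn't have to be in allowed_domains
                yield scrapy.Request(url,
                                     callback=self.parse_imgur_gallery,
                                     dont_filter=True)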

Regarding python - stuck with scrapy, following imgur links from subreddits, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33048105/
