
python - Scrapy: crawling extracted links

Reposted. Author: 行者123. Updated: 2023-11-28 17:32:33

I need to crawl a website and scrape every URL found at a specific XPath. For example: "http://someurl.com/world/" has a container (xpath("//div[@class='pane-content']")) holding 10 links; I need to follow all 10 of those links and extract the images from each page. The links inside "http://someurl.com/world/" look like "http://someurl.com/node/xxxx".

What I have so far:

import scrapy
# scrapy.contrib was deprecated and removed; the current import paths are:
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor
from imgur.items import ImgurItem

class ImgurSpider(CrawlSpider):
    name = 'imgur'
    allowed_domains = ['someurl.com']  # domain only — no scheme, path, or trailing slash
    start_urls = ['http://someurl.com/news']
    rules = [Rule(LinkExtractor(allow=('/node/.*',)), callback='parse_imgur', follow=True)]

    def parse_imgur(self, response):
        image = ImgurItem()
        image['title'] = response.xpath(
            "//h1[@class='pane-content']/a/text()").extract()
        image['image_urls'] = response.xpath("//img/@src").extract()
        return image

Best answer

You can rewrite your "rules" to cover all of your requirements:

rules = [Rule(LinkExtractor(allow=('/node/.*',), restrict_xpaths=('//div[@class="pane-content"]',)), callback='parse_imgur', follow=True)]

To download the images from the extracted image links, you can use Scrapy's bundled ImagesPipeline.
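Enabling the pipeline is a matter of two project settings. A minimal sketch (the storage directory name is an assumption, not from the question):

```python
# settings.py -- minimal sketch for enabling Scrapy's bundled ImagesPipeline.
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}
# Directory where downloaded image files are written; "downloaded_images"
# is a hypothetical choice.
IMAGES_STORE = "downloaded_images"
```

The pipeline reads URLs from the item's `image_urls` field (which the spider above already fills) and stores its download results in an `images` field.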

Regarding "python - Scrapy: crawling extracted links", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33318203/
