
xml - Scrapy RSS Scraper

Reposted. Author: 行者123. Updated: 2023-12-03 17:26:18

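I am trying to scrape an RSS feed from Yahoo (their open company RSS feed | https://developer.yahoo.com/finance/company.html).

The URL I am trying to scrape is: https://feeds.finance.yahoo.com/rss/2.0/headline?s=BPMX

For some reason my spider does not work. I think it may be related to the XPath I generated; if not, there may be a problem with how parse_item is defined.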

import scrapy
from scrapy.spiders import CrawlSpider
from YahooScrape.items import YahooScrapeItem

class Spider(CrawlSpider):
    name= "YahooScrape"
    allowed_domains = ["yahoo.com"]
    start_urls = ('https://feeds.finance.yahoo.com/rss/2.0/headline?s=BPMX',)

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = EmperyscraperItem()
        item['title'] = response.xpath('//*[@id="collapsible"]/div[1]/div[2]/span',).extract() #define XPath for title
        item['link'] = response.xpath('//*[@id="collapsible"]/div[1]/div[2]/span',).extract() #define XPath for link
        item['description'] = response.xpath('//*[@id="collapsible"]/div[1]/div[2]/span',).extract() #define XPath for description
        return item


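What could be wrong with the code? If the code itself is not the problem, what is the right XPath approach for extracting the title, description, and link? I am new to Scrapy and just need some help to work it out!

Edit: I have updated my spider and converted it to an XMLFeedSpider, as shown below: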

import scrapy

from scrapy.spiders import XMLFeedSpider
from YahooScrape.items import YahooScrapeItem

class Spider(XMLFeedSpider):
    name = "YahooScrape"
    allowed_domains = ["yahoo.com"]
    start_urls = ('https://feeds.finance.yahoo.com/rss/2.0/headline?s=BPMX') #Crawl BPMX
    itertag = 'item'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))

        item = YahooScrapeItem()
        item['title'] = node.xpath('item/title/text()',).extract() #define XPath for title
        item['link'] = node.xpath('item/link/text()').extract()
        item['pubDate'] = node.xpath('item/link/pubDate/text()').extract()
        item['description'] = node.xpath('item/category/text()').extract() #define XPath for description
        return item

#Yahoo RSS feeds http://finance.yahoo.com/rss/headline?s=BPMX,APPL


I now get the following error:

2017-06-13 11:25:57 [scrapy.core.engine] ERROR: Error while obtaining start requests


Any idea why this error occurs? My HTML paths look correct.

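Best answer

As far as I can tell, CrawlSpider only works for HTML responses. So I suggest you build on the simpler scrapy.Spider or the more specialized XMLFeedSpider.

Next, the XPaths you use in parse_item appear to have been built from the HTML your browser renders for the XML/RSS feed, not from the feed itself.
There is no *[@id="collapsible"] and there are no <div> elements in the feed.

Look at view-source:https://feeds.finance.yahoo.com/rss/2.0/headline?s=BPMX instead: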

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<rss version="2.0">
    <channel>
        <copyright>Copyright (c) 2017 Yahoo! Inc. All rights reserved.</copyright>
        <description>Latest Financial News for BPMX</description>
        <image>
            <height>45</height>
            <link>http://finance.yahoo.com/q/h?s=BPMX</link>
            <title>Yahoo! Finance: BPMX News</title>
            <url>http://l.yimg.com/a/i/brand/purplelogo/uh/us/fin.gif</url>
            <width>144</width>
        </image>
        <item>
            <description>MENLO PARK, Calif., June 7, 2017 /PRNewswire/ -- BioPharmX Corporation (NYSE MKT: BPMX), a specialty pharmaceutical company focusing on dermatology, today announced that it will release its financial results ...</description>
            <guid isPermaLink="false">f56d5bf8-f278-37fd-9aa5-fe04b2e1fa53</guid>
            <link>https://finance.yahoo.com/news/biopharmx-report-first-quarter-financial-101500259.html?.tsrc=rss</link>
            <pubDate>Wed, 07 Jun 2017 10:15:00 +0000</pubDate>
            <title>BioPharmX to Report First Quarter Financial Results</title>
        </item>
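
The structure of this excerpt can also be checked outside Scrapy. The following is a minimal sketch, assuming a trimmed copy of the feed and using the standard library's xml.etree.ElementTree in place of Scrapy selectors; the FEED string and all variable names are illustrative only. Once the current node is an <item>, its fields are direct children, so a path like title finds a value while item/title finds nothing.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the feed excerpt (illustrative, not the full response).
FEED = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<rss version="2.0">
  <channel>
    <description>Latest Financial News for BPMX</description>
    <item>
      <link>https://finance.yahoo.com/news/biopharmx-report-first-quarter-financial-101500259.html?.tsrc=rss</link>
      <pubDate>Wed, 07 Jun 2017 10:15:00 +0000</pubDate>
      <title>BioPharmX to Report First Quarter Financial Results</title>
    </item>
  </channel>
</rss>"""

# Parse from bytes so the encoding declaration is handled by the parser.
root = ET.fromstring(FEED.encode("utf-8"))
first_item = root.find("channel/item")

# Relative to the <item> node, the fields are direct children.
title = first_item.findtext("title")
pub_date = first_item.findtext("pubDate")

# These mirror the kinds of paths used in the question: there is no <item>
# inside an <item>, no <pubDate> inside <link>, and no id="collapsible".
nested_title = first_item.findtext("item/title")
nested_pub_date = first_item.findtext("link/pubDate")
collapsible = root.findall(".//*[@id='collapsible']")

print(title)
print(pub_date)
print(nested_title, nested_pub_date, collapsible)
```

The same reasoning should carry over to node.xpath(...) inside parse_node, where node is already one <item> element.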




Working spider example:

import scrapy

from scrapy.spiders import XMLFeedSpider
#from YahooScrape.items import YahooScrapeItem

class Spider(XMLFeedSpider):
    name = "YahooScrape"
    allowed_domains = ["yahoo.com"]
    start_urls = ('https://feeds.finance.yahoo.com/rss/2.0/headline?s=BPMX',) #Crawl BPMX
    itertag = 'item'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))

        # XPaths here are relative to the current <item> node
        item = {}
        item['title'] = node.xpath('title/text()').extract_first()
        item['link'] = node.xpath('link/text()').extract_first()
        item['pubDate'] = node.xpath('pubDate/text()').extract_first() # <pubDate> is a sibling of <link>, not a child
        item['description'] = node.xpath('description/text()').extract_first()
        return item
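
One detail in the example above is easy to miss: the trailing comma in start_urls. In Python, parentheses around a single string without a comma do not create a tuple, so the edited spider in the question sets start_urls to a plain string. A likely consequence, assuming Scrapy iterates over start_urls to build its first requests, is that each character is treated as a separate URL, which would explain an error while obtaining start requests. A minimal sketch in plain Python, with illustrative variable names:

```python
URL = "https://feeds.finance.yahoo.com/rss/2.0/headline?s=BPMX"

without_comma = (URL)   # parentheses only: still a str
with_comma = (URL,)     # trailing comma: a one-element tuple
as_list = [URL]         # a list avoids the pitfall entirely

# Iterating a str yields single characters, not URLs.
first_without = next(iter(without_comma))
first_with = next(iter(with_comma))

print(type(without_comma).__name__, repr(first_without))
print(type(with_comma).__name__, repr(first_with))
```

Writing start_urls as a list is a common way to sidestep the missing-comma mistake.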

Regarding xml - Scrapy RSS Scraper, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44507594/
