
python - How to correctly use Rules and restrict_xpaths in Scrapy to crawl and parse URLs?


I am trying to write a crawl spider that crawls a site's RSS feeds and then parses the meta tags of the articles.

The first RSS page is a page that displays the RSS categories. I managed to extract the links, since the <a> tags sit inside <td> tags. It looks like this:

<tr>
  <td class="xmlLink">
    <a href="http://feeds.example.com/subject1">subject1</a>
  </td>
</tr>
<tr>
  <td class="xmlLink">
    <a href="http://feeds.example.com/subject2">subject2</a>
  </td>
</tr>

Once you click one of those links, it brings you to the articles of that RSS category, like this:

<li class="regularitem">
  <h4 class="itemtitle">
    <a href="http://example.com/article1">article1</a>
  </h4>
</li>
<li class="regularitem">
  <h4 class="itemtitle">
    <a href="http://example.com/article2">article2</a>
  </h4>
</li>

As you can see, I can again grab the links with XPath if I use the <h4> tag. I want my crawler to follow the link inside that tag and parse the meta tags for me.
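To make the target concrete, here is a quick sketch of both XPath expressions run against the snippets above, using the same HtmlXPathSelector API as the spider below (the text= keyword, which builds a selector from a raw string, is an assumption based on the Scrapy API of that era):

from scrapy.selector import HtmlXPathSelector

# Category page: the feed links sit inside <td class="xmlLink">
category_hxs = HtmlXPathSelector(text='<td class="xmlLink">'
    '<a href="http://feeds.example.com/subject1">subject1</a></td>')
category_hxs.select('//td[@class="xmlLink"]/a/@href').extract()
# -> ['http://feeds.example.com/subject1']

# Feed page: the article links sit inside <h4 class="itemtitle">
feed_hxs = HtmlXPathSelector(text='<h4 class="itemtitle">'
    '<a href="http://example.com/article1">article1</a></h4>')
feed_hxs.select('//h4[@class="itemtitle"]/a/@href').extract()
# -> ['http://example.com/article1']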

Here is my crawler code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import exampleItem


class MetaCrawl(CrawlSpider):
    name = 'metaspider'
    start_urls = ['http://example.com/tools/rss']  # urls from which the spider will start crawling
    rules = [
        Rule(SgmlLinkExtractor(restrict_xpaths=('//td[@class="xmlLink"]')), follow=True),
        Rule(SgmlLinkExtractor(restrict_xpaths=('//h4[@class="itemtitle"]')), callback='parse_articles'),
    ]

    def parse_articles(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//meta')
        items = []
        for m in meta:
            item = exampleItem()
            item['link'] = response.url
            item['meta_name'] = m.select('@name').extract()
            item['meta_value'] = m.select('@content').extract()
            items.append(item)
        return items

However, this is the output when I run the crawler:

DEBUG: Crawled (200) <GET http://http://feeds.example.com/subject1> (referer: http://example.com/tools/rss)
DEBUG: Crawled (200) <GET http://http://feeds.example.com/subject2> (referer: http://example.com/tools/rss)

What am I doing wrong here? I've been reading the documentation over and over again, but I feel like I keep overlooking something. Any help would be appreciated.

EDIT: Added items.append(item); I had forgotten it in the original post.

EDIT: I also tried the following, and got the same output:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from reuters.items import exampleItem
from scrapy.http import Request


class MetaCrawl(CrawlSpider):
    name = 'metaspider'
    start_urls = ['http://example.com/tools/rss']  # urls from which the spider will start crawling
    rules = [
        Rule(SgmlLinkExtractor(allow=[r'.*'], restrict_xpaths=('//td[@class="xmlLink"]')), follow=True),
        Rule(SgmlLinkExtractor(allow=[r'.*'], restrict_xpaths=('//h4[@class="itemtitle"]')), follow=True),
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//td[@class="xmlLink"]/a/@href')
        for m in meta:
            yield Request(m.extract(), callback=self.parse_link)

    def parse_link(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//h4[@class="itemtitle"]/a/@href')
        for m in meta:
            yield Request(m.extract(), callback=self.parse_again)

    def parse_again(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//meta')
        items = []
        for m in meta:
            item = exampleItem()
            item['link'] = response.url
            item['meta_name'] = m.select('@name').extract()
            item['meta_value'] = m.select('@content').extract()
            items.append(item)
        return items
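(One pitfall worth flagging in this second version: CrawlSpider implements parse() internally to drive its rules, and the Scrapy documentation explicitly warns against overriding it. With parse redefined, the two Rules become dead code and only the hand-built Request chain runs. If you chain requests manually like this, a plain spider is the safer base class. A minimal sketch under that assumption, using the BaseSpider of that Scrapy era and keeping parse_link and parse_again unchanged:)

from scrapy.spider import BaseSpider  # plain spider: no rule machinery to clobber
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request


class MetaCrawl(BaseSpider):
    name = 'metaspider'
    start_urls = ['http://example.com/tools/rss']

    def parse(self, response):  # overriding parse is safe on BaseSpider
        hxs = HtmlXPathSelector(response)
        for href in hxs.select('//td[@class="xmlLink"]/a/@href').extract():
            yield Request(href, callback=self.parse_link)

    # parse_link and parse_again stay exactly as above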

Best Answer

You were returning an empty items list; you need to append each item to items.
You could also just yield item inside the loop.
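In concrete terms, a sketch of the callback yielding each item directly, as the answer suggests:

def parse_articles(self, response):
    hxs = HtmlXPathSelector(response)
    for m in hxs.select('//meta'):
        item = exampleItem()
        item['link'] = response.url
        item['meta_name'] = m.select('@name').extract()
        item['meta_value'] = m.select('@content').extract()
        yield item  # yield each item as it is built; no list to assemble and return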

Regarding "python - How to correctly use Rules and restrict_xpaths in Scrapy to crawl and parse URLs?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/15192362/
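A closing note for readers on a current Scrapy release: scrapy.contrib, SgmlLinkExtractor, and HtmlXPathSelector have all since been removed. A sketch of the equivalent spider with today's API, assuming the same page structure as above:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MetaCrawl(CrawlSpider):
    name = 'metaspider'
    start_urls = ['http://example.com/tools/rss']
    rules = [
        # LinkExtractor replaces SgmlLinkExtractor; restrict_xpaths works the same way
        Rule(LinkExtractor(restrict_xpaths='//td[@class="xmlLink"]'), follow=True),
        Rule(LinkExtractor(restrict_xpaths='//h4[@class="itemtitle"]'),
             callback='parse_articles'),
    ]

    def parse_articles(self, response):
        # response.xpath() replaces HtmlXPathSelector; plain dicts can be yielded as items
        for m in response.xpath('//meta'):
            yield {
                'link': response.url,
                'meta_name': m.xpath('@name').getall(),
                'meta_value': m.xpath('@content').getall(),
            }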
