
python - How to implement the Request function in a scrapy Spider

Reposted · Author: 太空宇宙 · Updated: 2023-11-03 17:21:10

from string import join
from scrapy.contrib.spiders.crawl import CrawlSpider
from scrapy.selector import Selector
from scrapy.http.request import Request
from article.items import ArticleItem

class ArticleSpider(CrawlSpider):
    name = "article"
    allowed_domains = ["http://joongang.joins.com"]
    j_classifications = ['politics', 'money', 'society', 'culture']

    start_urls = ["http://news.joins.com/%s" % classification
                  for classification in j_classifications]

    def parse_item(self, response):
        sel = Selector(response)
        urls = sel.xpath('//div[@class="bd"]/ul/li/strong')
        items = []
        for url in urls:
            item = ArticleItem()
            item['url'] = url.xpath('a/@href').extract()
            items.append(item)

        request = Request(items['url'], callback=self.parse_item2)
        request.meta['item'] = items
        return request

    def parse_item2(self, response):
        item = response.meta['item']
        sel = Selector(response)
        articles = sel.xpath('//div[@id="article_body"]')
        for article in articles:
            item['article'] = article.xpath('text()').extract()
            items.append(item)

        return item

This code is for scraping articles; I am using scrapy. The parse_item method is supposed to send each article URL on to parse_item2 using the Request function, but the code does not work. The Item class does implement url = Field() and article = Field(). How do I fix this? PS: the XPath selectors are accurate; I did test them in the scrapy shell.

Best Answer

There is a problem in your code:

request =  Request(items['url'], callback=self.parse_item2)

items is a list of item objects, so indexing it with the string 'url' raises a TypeError. You can fix this with a second for loop:
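To see why the original line fails, here is a minimal standalone snippet (not from the original post; the sample URL is made up) showing that a list only accepts integer indices:

```python
# items is a list, so indexing it with a string raises TypeError;
# the dict-like item *inside* the list is what has the 'url' key.
items = [{'url': 'http://news.joins.com/politics'}]

try:
    items['url']
except TypeError as e:
    error_name = type(e).__name__

print(error_name)       # TypeError
print(items[0]['url'])  # http://news.joins.com/politics
```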

for itm in items:
    request = Request(itm['url'], callback=self.parse_item2)
    request.meta['item'] = itm
    yield request

Or yield a request directly from the first for loop:

for url in urls:
    item = ArticleItem()
    item['url'] = url.xpath('a/@href').extract()
    request = Request(item['url'], callback=self.parse_item2)
    request.meta['item'] = item
    yield request

Regarding "python - How to implement the Request function in a scrapy Spider", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33120937/
