
python - Parallel requests in Scrapy


I have run into the following problem in Scrapy: I am trying to populate my item in the function parse_additional_info, and to do that I need to scrape a bunch of additional URLs in a second callback, parse_player:

for path in path_player:
    url = path.xpath('url_extractor').extract()[0]
    yield Request(url, meta={'item': item}, callback=self.parse_player, priority=300)

My understanding is that when I do this, the requests are executed asynchronously at some later point and populate item, but yield item returns the item immediately, before it is fully populated. I know it is impossible to wait for all of the yield Request(url, meta={'item': item}, callback=self.parse_player, priority=300) calls to finish, but how would you solve this? That is, how do you make sure an item is only yielded once all the information from its requests has been filled in?

from scrapy.spiders import Spider, CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request
from datetime import datetime
from footscript.items import MatchResultItem
import re, json, string, uuid

class PreliminarySpider(Spider):
    name = "script"
    start_urls = [
        start_url1,
        start_url2,
        start_url3,
        start_url4,
        start_url5,
        start_url6,
        start_url7,
        start_url8,
        start_url9,
        start_url10,
    ]
    allowed_domains = ['domain.com']

    def parse(self, response):
        sel = Selector(response)
        matches = sel.xpath('match_selector')
        for match in matches:
            try:
                item = MatchResultItem()
                item['url'] = match.xpath('match_url_extractor').extract()[0]
            except Exception:
                print "Unable to get: %s" % match.extract()
            yield Request(url=item['url'], meta={'item': item}, callback=self.parse_additional_info)

    def parse_additional_info(self, response):
        item = response.request.meta['item']
        sel = Selector(response)

        try:
            item['roun'] = sel.xpath('round_extractor').extract()[0]
            item['stadium'] = sel.xpath('stadium_extractor').extract()[0]
            item['attendance'] = sel.xpath('attendance_extractor').extract()[0]
        except Exception:
            print "Attributes not found at: %s" % item['url']

        item['player'] = []
        path_player = sel.xpath('path_extractor')
        for path in path_player:
            player = path.xpath('player_extractor').extract()[0]
            player_id = path.xpath('player_d_extractor').extract()[0]
            country = path.xpath('country_extractor').extract()[0]
            item['player'].append([player_id, player, country])
            url = path.xpath('url_extractor').extract()[0]
            yield Request(url, meta={'item': item}, callback=self.parse_player, priority=300)
        # except Exception:
        #     print "Unable to get players"
        yield item

    def parse_player(self, response):
        item = response.request.meta['item']
        sel = Selector(response)
        play_id = re.sub("[^0-9]", "", response.url)
        name = sel.xpath('//div[@class="fdh-wrap contentheader"]/h1/text()').extract()[0].encode('utf-8').rstrip()
        index = [i for i, row in enumerate(item['player']) if play_id in row[0]]
        item['player'][index[0]][1] = name
        return item

EDIT, new code:

yield Request(url, meta={'item': item}, callback=self.parse_player, errback=self.err_player)
# except Exception:
#     print "Unable to get players"
yield item

def parse_player(self, response):
    item = response.request.meta['item']
    sel = Selector(response)
    play_id = re.sub("[^0-9]", "", response.url)
    name = sel.xpath('//div[@class="fdh-wrap contentheader"]/h1/text()').extract()[0].encode('utf-8').rstrip()
    index = [i for i, row in enumerate(item['player']) if play_id in row[0]]
    item['player'][index[0]][1] = name
    item['player'][index[0]].append("1")
    return item

def err_player(self, response):
    print "****************"
    print "Player not found"
    print "****************"
    item = response.request.meta['item']
    play_id = re.sub("[^0-9]", "", response.url)
    index = [i for i, row in enumerate(item['player']) if play_id in row[0]]
    item['player'][index[0]].append("1")
    return item

Best Answer

Passing an item across multiple callbacks is a very delicate approach. It can work in very simple cases, but you can run into all sorts of problems:

  • A request fails (you can fix this with Request(..., errback=self.my_parse_err), but creating two callbacks for every request is quite tedious).
  • The second request has a duplicate URL (you can fix this with Request(..., dont_filter=True) and by adding HTTPCACHE_ENABLED=True to settings.py). Both fixes are sketched below.
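
For illustration, a minimal sketch of both mitigations, written as methods of the question's spider and reusing its placeholder names (path_extractor, url_extractor, parse_player). Note that Scrapy invokes an errback with a twisted Failure rather than a Response, so the failed request is reached through failure.request:

def parse_additional_info(self, response):
    item = response.request.meta['item']
    for path in response.xpath('path_extractor'):
        url = path.xpath('url_extractor').extract()[0]
        yield Request(
            url,
            meta={'item': item},
            callback=self.parse_player,
            errback=self.err_player,  # invoked on download errors (DNS failure, timeout, ...)
            dont_filter=True,         # do not drop duplicate player URLs
        )

def err_player(self, failure):
    # Errbacks receive a twisted Failure, not a Response;
    # the request that failed is available as failure.request.
    item = failure.request.meta['item']
    # ... mark the missing player in item, then return/yield it

And in settings.py, so that re-requesting a duplicate URL hits the local cache instead of the site:

HTTPCACHE_ENABLED = True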

The approach that is safe from both a development and a production standpoint is to create one item type per page type, and then combine the two related items in a post-processing step.
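
For illustration, a minimal sketch of that idea; MatchItem, PlayerItem and the join helper below are hypothetical names, not part of the question:

from scrapy import Item, Field

class MatchItem(Item):
    url = Field()
    stadium = Field()
    player_ids = Field()   # only the ids scraped from the match page
    players = Field()      # filled in during post-processing

class PlayerItem(Item):
    player_id = Field()
    name = Field()
    country = Field()

# Post-processing over the two exported feeds: join players onto matches.
def join_matches_and_players(matches, players):
    players_by_id = {p['player_id']: p for p in players}
    for match in matches:
        match['players'] = [players_by_id.get(pid) for pid in match['player_ids']]
        yield match

Each callback then yields its own item type as soon as its page is parsed, so no item has to wait for other requests to complete.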

Also note that if you have duplicate URLs, you may end up with duplicate data in your items, which in turn leads to normalization problems in your database.

Regarding python - Parallel requests in Scrapy, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34634730/
