
python - Yield Request call gives strange results in a recursive Scrapy method


I am trying to scrape all departures and arrivals on a single day, from all airports in all countries, using Python and Scrapy.

The JSON database used by this well-known site (Flightradar24) has to be queried page by page whenever an airport has more than 100 departures or arrivals. I also compute a timestamp based on the actual UTC date of the query.

I am trying to build a database with this hierarchy:

country 1
    airport 1
        departures
            page 1
            page ...
        arrivals
            page 1
            page ...
    airport 2
        departures
            page 1
            page ...
        arrivals
            page 1
            page ...
...

I use two methods to compute the timestamp and the paged URL queries:

def compute_timestamp(self):
    from datetime import datetime, date
    import calendar
    # +/- 24 hours
    d = date(2017, 4, 27)
    timestamp = calendar.timegm(d.timetuple())
    return timestamp

def build_api_call(self, code, page, timestamp):
    return 'https://api.flightradar24.com/common/v1/airport.json?code={code}&plugin[]=&plugin-setting[schedule][mode]=&plugin-setting[schedule][timestamp]={timestamp}&page={page}&limit=100&token='.format(
        code=code, page=page, timestamp=timestamp)
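
For illustration, here is what these helpers produce; the IATA code 'TLV' below is just a made-up example input:

# Illustration only: 'TLV' is a made-up example airport code.
from datetime import date
import calendar

ts = calendar.timegm(date(2017, 4, 27).timetuple())
print(ts)  # 1493251200, i.e. midnight UTC on 2017-04-27

# build_api_call('TLV', 1, ts) then returns:
# https://api.flightradar24.com/common/v1/airport.json?code=TLV&plugin[]=&plugin-setting[schedule][mode]=&plugin-setting[schedule][timestamp]=1493251200&page=1&limit=100&token=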

I store the results in a CountryItem, which itself contains many AirportItem entries. My items.py is:

class CountryItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
    num_airports = scrapy.Field()
    airports = scrapy.Field()
    other_url = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

class AirportItem(scrapy.Item):
    name = scrapy.Field()
    code_little = scrapy.Field()
    code_total = scrapy.Field()
    lat = scrapy.Field()
    lon = scrapy.Field()
    link = scrapy.Field()
    departures = scrapy.Field()
    arrivals = scrapy.Field()
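
As a quick illustration of how the two items nest (all values below are made up):

country = CountryItem(name='Israel', link='https://www.flightradar24.com/data/airports/israel', airports=[])
airport = AirportItem(name='Ben Gurion', code_little='TLV', departures=[], arrivals=[])
country['airports'].append(airport)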

My main parse builds a Country item for every country (here I limit it to Israel as an example). Next, I yield one scrapy.Request per country to scrape the airports.

###################################
# MAIN PARSE
###################################
def parse(self, response):
    count_country = 0
    countries = []
    for country in response.xpath('//a[@data-country]'):
        item = CountryItem()
        url = country.xpath('./@href').extract()
        name = country.xpath('./@title').extract()
        item['link'] = url[0]
        item['name'] = name[0]
        item['airports'] = []
        count_country += 1
        if name[0] == "Israel":
            countries.append(item)
            self.logger.info("Country name : %s with link %s", item['name'], item['link'])
            yield scrapy.Request(url[0], meta={'my_country_item': item}, callback=self.parse_airports)

This method scrapes the information for each airport, and for each airport it also issues a scrapy.Request with the airport URL to scrape departures and arrivals:

###################################
# PARSE EACH AIRPORT
###################################
def parse_airports(self, response):
    item = response.meta['my_country_item']
    item['airports'] = []

    for airport in response.xpath('//a[@data-iata]'):
        url = airport.xpath('./@href').extract()
        iata = airport.xpath('./@data-iata').extract()
        iatabis = airport.xpath('./small/text()').extract()
        name = ''.join(airport.xpath('./text()').extract()).strip()
        lat = airport.xpath("./@data-lat").extract()
        lon = airport.xpath("./@data-lon").extract()
        iAirport = AirportItem()
        iAirport['name'] = self.clean_html(name)
        iAirport['link'] = url[0]
        iAirport['lat'] = lat[0]
        iAirport['lon'] = lon[0]
        iAirport['code_little'] = iata[0]
        iAirport['code_total'] = iatabis[0]

        item['airports'].append(iAirport)

    urls = []
    for airport in item['airports']:
        json_url = self.build_api_call(airport['code_little'], 1, self.compute_timestamp())
        urls.append(json_url)
    if not urls:
        return item

    # start with first url
    next_url = urls.pop()
    return scrapy.Request(next_url, self.parse_schedule, meta={'airport_item': item, 'airport_urls': urls, 'i': 0})

Using the recursive method parse_schedule, I add each airport to the country item. SO members already helped me with this part.

###################################
# PARSE EACH AIRPORT OF COUNTRY
###################################
def parse_schedule(self, response):
    """we want to loop this continuously to build every departure and arrivals requests"""
    item = response.meta['airport_item']
    i = response.meta['i']
    urls = response.meta['airport_urls']

    urls_departures, urls_arrivals = self.compute_urls_by_page(response, item['airports'][i]['name'], item['airports'][i]['code_little'])

    print("urls_departures = ", len(urls_departures))
    print("urls_arrivals = ", len(urls_arrivals))

    ## YIELD NOT CALLED
    yield scrapy.Request(response.url, self.parse_departures_page, meta={'airport_item': item, 'page_urls': urls_departures, 'i': 0, 'p': 0}, dont_filter=True)

    # now do next schedule items
    if not urls:
        yield item
        return
    url = urls.pop()

    yield scrapy.Request(url, self.parse_schedule, meta={'airport_item': item, 'airport_urls': urls, 'i': i + 1})

The method self.compute_urls_by_page computes the correct URLs for retrieving all departures and arrivals of one airport.
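
That method is not reproduced here (the full version is in the GitHub repository linked below), but a minimal sketch of the idea could look like this. The key layout inside the schedule JSON ('item'/'total') is an assumption inferred from the jmespath query used in parse_departures_page, not a verified part of the Flightradar24 API:

def compute_urls_by_page(self, response, name, code_little):
    # Minimal sketch, not the implementation from the repo.
    import json
    import math

    jsonload = json.loads(response.body_as_unicode())
    schedule = jsonload['result']['response']['airport']['pluginData']['schedule']

    urls_departures, urls_arrivals = [], []
    for section, target in (('departures', urls_departures), ('arrivals', urls_arrivals)):
        total = schedule[section]['item']['total']  # assumed key layout
        pages = math.ceil(total / 100)              # 100 = the 'limit' query parameter
        for page in range(1, pages + 1):
            target.append(self.build_api_call(code_little, page, self.compute_timestamp()))
    return urls_departures, urls_arrivals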

###################################
# PARSE EACH DEPARTURES / ARRIVALS
###################################
def parse_departures_page(self, response):
    item = response.meta['airport_item']
    p = response.meta['p']
    i = response.meta['i']
    page_urls = response.meta['page_urls']

    print("PAGE URL = ", page_urls)

    if not page_urls:
        yield item
        return
    page_url = page_urls.pop()

    print("GET PAGE FOR ", item['airports'][i]['name'], ">> ", p)

    jsonload = json.loads(response.body_as_unicode())
    json_expression = jmespath.compile("result.response.airport.pluginData.schedule.departures.data")
    item['airports'][i]['departures'] = json_expression.search(jsonload)

    yield scrapy.Request(page_url, self.parse_departures_page, meta={'airport_item': item, 'page_urls': page_urls, 'i': i, 'p': p + 1})
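
For reference, the jmespath expression above simply walks the nested dictionary keys; a tiny standalone check with made-up data:

import jmespath

data = {"result": {"response": {"airport": {"pluginData": {
    "schedule": {"departures": {"data": ["flight1", "flight2"]}}}}}}}
expr = jmespath.compile("result.response.airport.pluginData.schedule.departures.data")
print(expr.search(data))  # ['flight1', 'flight2']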

Next comes the strange result: the first yield in parse_schedule, which should call the recursive method self.parse_departures_page, behaves oddly. Scrapy does call the method, but I only collect the departure pages of one airport, and I don't understand why... There is probably an ordering error in my requests or in my yield statements, so maybe you can help me find it.

The full code is on GitHub: https://github.com/IDEES-Rouen/Flight-Scrapping/tree/master/flight/flight_project

You can run it with the scrapy crawl airports command.

UPDATE 1:

I tried to answer the question on my own using yield from, without success, as you can see in the answer at the bottom... so if you have an idea?

Best Answer

Yes, I finally found the answer here on SO...

When you yield recursively, you need to use yield from. Here is a simplified example:

airport_list = ["airport1", "airport2", "airport3", "airport4"]

def parse_page_departure(airport, next_url, page_urls):
    print(airport, " / ", next_url)

    if not page_urls:
        return

    next_url = page_urls.pop()

    yield from parse_page_departure(airport, next_url, page_urls)

###################################
# PARSE EACH AIRPORT OF COUNTRY
###################################
def parse_schedule(next_airport, airport_list):

    ## GET EACH DEPARTURE PAGE
    departures_list = ["p1", "p2", "p3", "p4"]

    next_departure_url = departures_list.pop()
    yield parse_page_departure(next_airport, next_departure_url, departures_list)

    if not airport_list:
        print("no new airport")
        return

    next_airport_url = airport_list.pop()

    yield from parse_schedule(next_airport_url, airport_list)

next_airport_url = airport_list.pop()
result = parse_schedule(next_airport_url, airport_list)

for i in result:
    print(i)
    for d in i:
        print(d)
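
Note the asymmetry in parse_schedule: the plain yield hands back the generator object created by calling parse_page_departure (which is why the driver loop at the bottom iterates twice, once over the schedule results and once inside each of them), while yield from delegates to the recursive parse_schedule call and flattens its output into the current generator.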

UPDATE, doesn't work with the real program:

I tried to reproduce the same yield from pattern with the real program here, but using it on scrapy.Request raises an error, and I don't understand why...

Here is the Python traceback:

Traceback (most recent call last):
  File "/home/reyman/.pyenv/versions/venv352/lib/python3.5/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/home/reyman/.pyenv/versions/venv352/lib/python3.5/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/home/reyman/.pyenv/versions/venv352/lib/python3.5/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/reyman/.pyenv/versions/venv352/lib/python3.5/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/reyman/.pyenv/versions/venv352/lib/python3.5/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/reyman/Projets/Flight-Scrapping/flight/flight_project/spiders/AirportsSpider.py", line 209, in parse_schedule
    yield from scrapy.Request(url, self.parse_schedule, meta={'airport_item': item, 'airport_urls': urls, 'i': i + 1})
TypeError: 'Request' object is not iterable
2017-06-27 17:40:50 [scrapy.core.engine] INFO: Closing spider (finished)
2017-06-27 17:40:50 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
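
A side note on that traceback: yield from delegates to an iterable or a sub-generator, and a scrapy.Request is neither, hence the TypeError. Inside a Scrapy callback a Request has to be produced with a plain yield; the recursion then happens through Scrapy scheduling the request and invoking the callback again. A minimal sketch of what line 209 would need, assuming the surrounding parse_schedule context:

# plain yield: hand the Request to Scrapy's scheduler instead of iterating it
yield scrapy.Request(url, self.parse_schedule,
                     meta={'airport_item': item, 'airport_urls': urls, 'i': i + 1})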

Regarding python - Yield Request call gives strange results in a recursive Scrapy method, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43667622/
