
python - Scrapy (Python): Iterating over 'next' page without multiple functions


I'm using Scrapy to get stock data from Yahoo! Finance.

Sometimes I need to iterate over several pages, 19 in this example, in order to get all of the stock data.

Previously (when I knew there would only ever be two pages), I would use one function for each page, like so:

def stocks_page_1(self, response):
    returns_page1 = []

    # Grabs data here...

    current_page = response.url
    next_page = current_page + "&z=66&y=66"
    yield Request(next_page, self.stocks_page_2, meta={'returns_page1': returns_page1})

def stocks_page_2(self, response):
    # Grab data again...
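For completeness, the list attached to the Request via meta above is read back in the second callback from response.meta, roughly like this (a minimal sketch; the page-2 parsing is left as a placeholder):

def stocks_page_2(self, response):
    # Retrieve the list that page 1 attached to the Request via meta
    returns_page1 = response.meta['returns_page1']
    returns_page2 = []

    # Grab page 2's data here, appending to returns_page2...

    all_returns = returns_page1 + returns_page2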

Now, instead of writing 19 or more functions, I was wondering if there is a way to loop through the iteration with a single function and grab all the data from every page available for a given stock.

Something like this:

for x in range(30):  # 30 was randomly selected
    current_page = response.url
    # Grabs Data
    # Check if there is a 'next' page:
    if response.xpath('//td[@align="right"]/a[@rel="next"]').extract() != ' ':
        u = x * 66
        next_page = current_page + "&z=66&y={0}".format(u)
        # Go to the next page somehow within the function???

Updated code:

It works, but only returns data for one page.

class DmozSpider(CrawlSpider):

    name = "dnot"
    allowed_domains = ["finance.yahoo.com", "http://eoddata.com/"]
    start_urls = ['http://finance.yahoo.com/q?s=CAT']
    rules = [
        Rule(LinkExtractor(restrict_xpaths='//td[@align="right"]/a[@rel="next"]'),
             callback='stocks1',
             follow=True),
    ]

    def stocks1(self, response):
        returns = []
        rows = response.xpath('//table[@class="yfnc_datamodoutline1"]//table/tr')[1:]
        for row in rows:
            cells = row.xpath('.//td/text()').extract()
            try:
                values = cells[-1]
                try:
                    float(values)
                    returns.append(values)
                except ValueError:
                    continue
            except ValueError:
                continue

        unformatted_returns = response.meta.get('returns_pages')
        returns = [float(i) for i in returns]
        global required_amount_of_returns, counter
        if counter == 1 and "CAT" in response.url:
            required_amount_of_returns = len(returns)
        elif required_amount_of_returns == 0:
            raise CloseSpider("'Error with initiating required amount of returns'")

        counter += 1
        print counter

        # Iterator to calculate Rate of return
        # ====================================
        if data_intervals == "m":
            k = 12
        elif data_intervals == "w":
            k = 4
        else:
            k = 30

        sub_returns_amount = required_amount_of_returns - k
        sub_returns = returns[:sub_returns_amount]
        rate_of_return = []

        if len(returns) == required_amount_of_returns or "CAT" in response.url:
            for number in sub_returns:
                numerator = number - returns[k]
                rate = numerator/returns[k]
                if rate == '':
                    rate = 0
                rate_of_return.append(rate)
                k += 1

        item = Website()
        items = []
        item['url'] = response.url
        item['name'] = response.xpath('//div[@class="title"]/h2/text()').extract()
        item['avg_returns'] = numpy.average(rate_of_return)
        item['var_returns'] = numpy.cov(rate_of_return)
        item['sd_returns'] = numpy.std(rate_of_return)
        item['returns'] = returns
        item['rate_of_returns'] = rate_of_return
        item['exchange'] = response.xpath('//span[@class="rtq_exch"]/text()').extract()
        item['ind_sharpe'] = ((numpy.average(rate_of_return) - RFR) / numpy.std(rate_of_return))
        items.append(item)
        yield item

Best Answer

You see, a parse callback is just a function that takes a response and returns or yields Items, Requests, or both. There is no problem at all with reusing these callbacks, so you can simply pass the same callback for every request.
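For instance, a single callback can keep requesting the next page with itself as the callback while a 'next' link exists (a minimal sketch of that idea, not the approach recommended below; the selectors and field names are assumptions):

# Assumes `from scrapy.http import Request` at module level.
def parse_stocks(self, response):
    returns = response.meta.get('returns', [])

    # Grab this page's data and accumulate it across pages
    returns += response.xpath(
        '//table[@class="yfnc_datamodoutline1"]//td/text()').extract()

    # If a 'next' link exists, request it with this same callback
    next_href = response.xpath('//a[@rel="next"]/@href').extract()
    if next_href:
        yield Request(response.urljoin(next_href[0]),  # response.urljoin needs Scrapy 1.0+
                      callback=self.parse_stocks,
                      meta={'returns': returns})
    else:
        # Last page: everything has been collected, yield the result
        yield {'returns': returns}  # Scrapy 1.0 accepts plain dicts as items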

Now, you could pass the current page information along using Request meta, but instead I would leverage CrawlSpider to crawl every page. It's really simple; start by generating the spider from the command line:

scrapy genspider --template crawl finance finance.yahoo.com

and then write it like this:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

Scrapy 1.0 has deprecated the scrapy.contrib namespace for the modules above; if you're stuck on 0.24, use scrapy.contrib.linkextractors and scrapy.contrib.spiders.
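For reference, the 0.24 equivalents named above are:

from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule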

from yfinance.items import YfinanceItem


class FinanceSpider(CrawlSpider):
    name = 'finance'
    allowed_domains = ['finance.yahoo.com']
    start_urls = ['http://finance.yahoo.com/q/hp?s=PWF.TO&a=04&b=19&c=2005&d=04&e=19&f=2010&g=d&z=66&y=132']

    rules = (
        Rule(LinkExtractor(restrict_css='[rel="next"]'),
             callback='parse_items',
             follow=True),
    )

LinkExtractor will pick up the links to follow from the response, but it can be limited with XPaths (or CSS) and regular expressions. See the documentation for more.

Rule will follow the links and call the callback on each response. follow=True will keep extracting links on every new response, but it can be limited by depth. See the documentation again.
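To make that concrete, here is a minimal sketch (the allow pattern and the XPath are assumptions) of a Rule whose LinkExtractor is limited by both an XPath region and a regular expression, with depth capped through the DEPTH_LIMIT setting:

rules = (
    Rule(LinkExtractor(allow=r'&y=\d+',                          # URL must match this regex
                       restrict_xpaths='//td[@align="right"]'),  # only links inside this region
         callback='parse_items',
         follow=True),
)

# Depth can be capped globally, e.g. in settings.py:
# DEPTH_LIMIT = 19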

    def parse_items(self, response):
        for line in response.css('.yfnc_datamodoutline1 table tr')[1:-1]:
            yield YfinanceItem(date=line.css('td:first-child::text').extract()[0])

Just yield the Items; the Requests for the next pages will be handled by the CrawlSpider Rules.
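The YfinanceItem imported above is not shown in the answer; a minimal definition matching the single date field used in parse_items might look like this (an assumption, not the answerer's actual file):

# yfinance/items.py
import scrapy


class YfinanceItem(scrapy.Item):
    # only the field populated in parse_items above
    date = scrapy.Field()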

Regarding python - Scrapy (Python): Iterating over 'next' page without multiple functions, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/30420151/
