
python - How to iterate a web-scraping script over a daily time series object to create daily time series data from a webpage

Reprinted · Author: 太空宇宙 · Updated: 2023-11-04 04:15:21

Thanks for looking at my question. Using BeautifulSoup and Pandas, I have created a script that scrapes projection data from the Federal Reserve's website. The projections are released once per quarter (roughly every 3 months). I would like to write a script that builds a daily time series and checks the Fed's website once a day: if a new projection has been released, the script adds it to the series; if there is no update, the script simply extends the series with the last valid, published projection.

From my initial digging, it seems there are external tools I could use to "trigger" the script daily, but I would prefer to keep everything in pure Python.
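For what it's worth, a pure-Python daily trigger needs nothing beyond the standard library. This is a minimal sketch, assuming the job should fire at a fixed local hour; the 6:00 default and the function names are illustrative, not part of the original question:

```python
import datetime
import time

def seconds_until_next_run(hour=6, now=None):
    """Seconds until the next occurrence of hour:00 local time."""
    now = now or datetime.datetime.now()
    target = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if target <= now:
        # Today's slot has already passed, so schedule tomorrow's.
        target += datetime.timedelta(days=1)
    return (target - now).total_seconds()

def run_daily(job, hour=6):
    """Call job() once a day at hour:00, sleeping in between."""
    while True:
        time.sleep(seconds_until_next_run(hour))
        job()
```

In practice a cron job or Task Scheduler entry is more robust (it survives reboots), but this keeps everything inside one Python process as the question asks.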

The code I wrote to do the scraping is as follows:

from bs4 import BeautifulSoup
import requests
import re
import wget
import pandas as pd

# Starting url and the indicator (key) for links of interest
url = "https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm"
key = '/monetarypolicy/fomcprojtabl'

# Cook the soup
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, "html.parser")

# Create the tuple of links for projection pages
projections = []
for link in soup.find_all('a', href=re.compile(key)):
    projections.append(link["href"])

# Create a tuple to store the projections
decfcasts = []
for i in projections:
    url = "https://www.federalreserve.gov{}".format(i)
    file = wget.download(url)
    df_list = pd.read_html(file)
    fcast = df_list[-1].iloc[:, 0:2]
    fcast.columns = ['Target', 'Votes']
    fcast.fillna(0, inplace=True)
    decfcasts.append(fcast)

So far, the code I have written puts everything into a tuple, but the data has no time/date index. I have been sketching out pseudocode, which I imagine would look something like:

Create daily time series object
for each day in time series:
    if day in time series == day in link:
        run webscraper
    otherwise, append time series with last available observation

At least, that is my thinking. The final time series will probably end up looking fairly "blocky": there will be many days with the same observation, then a "jump" when a new projection appears, followed by more repetition until the next projection comes out.
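That "blocky" shape is exactly what a forward-filled daily reindex produces in pandas. A minimal sketch, using made-up release dates and values purely for illustration:

```python
import pandas as pd

# Hypothetical quarterly releases keyed by release date (illustrative data)
releases = pd.Series(
    [2.5, 2.25],
    index=pd.to_datetime(["2019-03-20", "2019-06-19"]),
)

# Reindex onto a daily calendar; ffill repeats the last release each day,
# producing the flat "blocks" with jumps at each new projection.
daily = releases.reindex(
    pd.date_range(releases.index.min(), "2019-06-30", freq="D")
).ffill()
```

With this approach there is no need to loop over days at all: scrape the releases whenever they appear, then reindex and forward-fill onto a daily calendar in one step.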

Obviously, any help is greatly appreciated. Thanks in advance either way!

Best answer

I have edited the code for you. It now gets the date from the URL, and the date is stored in the dataframe as a Period. A projection is only processed and appended if its date is not already in the dataframe (restored from the pickle).

from bs4 import BeautifulSoup
import requests
import re
import wget
import pandas as pd

# Starting url and the indicator (key) for links of interest
url = "https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm"
key = '/monetarypolicy/fomcprojtabl'

# Cook the soup
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, "html.parser")

# Create the tuple of links for projection pages
projections = []
for link in soup.find_all('a', href=re.compile(key)):
    projections.append(link["href"])

# past results from pickle, when no pickle init empty dataframe
try:
    decfcasts = pd.read_pickle('decfcasts.pkl')
except FileNotFoundError:
    decfcasts = pd.DataFrame(columns=['target', 'votes', 'date'])


for i in projections:

    # parse date from url
    date = pd.Period(''.join(re.findall(r'\d+', i)), 'D')

    # process projection if it wasn't included in data from pickle
    if date not in decfcasts['date'].values:

        url = "https://www.federalreserve.gov{}".format(i)
        file = wget.download(url)
        df_list = pd.read_html(file)
        fcast = df_list[-1].iloc[:, 0:2]
        fcast.columns = ['target', 'votes']
        fcast.fillna(0, inplace=True)

        # set date and append (DataFrame.append was removed in pandas 2.0,
        # so use pd.concat instead)
        fcast.insert(2, 'date', date)
        decfcasts = pd.concat([decfcasts, fcast], ignore_index=True)

# save to pickle
pd.to_pickle(decfcasts, 'decfcasts.pkl')
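To see what the date-parsing line in the loop produces, here is the same expression run on a standalone link. The `20190320` filename is an assumed example of the Fed's URL pattern, not taken from the original post:

```python
import re
import pandas as pd

href = "/monetarypolicy/fomcprojtabl20190320.htm"  # assumed link shape
digits = "".join(re.findall(r"\d+", href))         # pull out "20190320"
date = pd.Period(digits, "D")                      # daily-frequency period
```

Storing the date as a daily `Period` (rather than a string) is what makes the `date not in decfcasts['date'].values` membership check reliable across runs.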

Regarding "python - How to iterate a web-scraping script over a daily time series object to create daily time series data from a webpage", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55553929/
