
python - Rolling OLS in pandas using time as the independent variable

Reposted. Author: 行者123. Updated: 2023-12-01 08:07:58

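I'm trying to build a rolling OLS model in pandas using a dataframe/time series of stock prices. What I want to do is run an OLS calculation over the past N days, return the predicted price and the slope, and add them to their own columns in the dataframe. As far as I can tell, my only option is PandasRollingOLS from pyfinance, so I'll use it in my example, but if there is another way I'm happy to use that instead.

My dataframe looks like this: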

Date                     Price
....
2019-03-31 08:59:59.999 1660
2019-03-31 09:59:59.999 1657
2019-03-31 10:59:59.999 1656
2019-03-31 11:59:59.999 1652
2019-03-31 12:59:59.999 1646
2019-03-31 13:59:59.999 1645
2019-03-31 14:59:59.999 1650
2019-03-31 15:59:59.999 1669
2019-03-31 16:59:59.999 1674

I want to perform a rolling regression using the Date column as the independent variable. Normally I would do this:

X = df['Date']
y = df['Price']
model = ols.PandasRollingOLS(y, X, window=250)

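However, unsurprisingly, using df['Date'] as my X returns an error.

So my first question is: what do I need to do to my Date column to get PandasRollingOLS working? My next question is: what exactly do I need to call to return the predicted value and the slope? With regular OLS I would do something like model.predict and model.slope, but those options are apparently not available for PandasRollingOLS.

I actually want to add these values to new columns in my df, so I was thinking of something like df['Predict'] = model.predict, but obviously that's not the answer. The ideal resulting df would look something like this: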

Date                     Price  Predict  Slope
....
2019-03-31 08:59:59.999 1660 1665 0.10
2019-03-31 09:59:59.999 1657 1663 0.10
2019-03-31 10:59:59.999 1656 1661 0.09
2019-03-31 11:59:59.999 1652 1658 0.08
2019-03-31 12:59:59.999 1646 1651 0.07
2019-03-31 13:59:59.999 1645 1646 0.07
2019-03-31 14:59:59.999 1650 1643 0.07
2019-03-31 15:59:59.999 1669 1642 0.07
2019-03-31 16:59:59.999 1674 1645 0.08

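Any help would be greatly appreciated, cheers.

Best answer

You can use datetime.datetime.strptime and time.mktime to convert the dates to numbers, then use statsmodels to build a model for the desired subset of the dataframe, together with a custom function that handles the rolling windows:

Output: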

                         Price      Predict     Slope
Date
2019-03-31 10:59:59.999 1656 1657.670504 0.000001
2019-03-31 11:59:59.999 1652 1655.003830 0.000001
2019-03-31 12:59:59.999 1646 1651.337151 0.000001
2019-03-31 13:59:59.999 1645 1647.670478 0.000001
2019-03-31 14:59:59.999 1650 1647.003818 0.000001
2019-03-31 15:59:59.999 1669 1654.670518 0.000001
2019-03-31 16:59:59.999 1674 1664.337207 0.000001

Code:

#%%
# imports
import datetime, time
import pandas as pd
import numpy as np
import statsmodels.api as sm
from collections import OrderedDict

# your data in a more easily reproducible format
data = {'Date': ['2019-03-31 08:59:59.999', '2019-03-31 09:59:59.999', '2019-03-31 10:59:59.999',
                 '2019-03-31 11:59:59.999', '2019-03-31 12:59:59.999', '2019-03-31 13:59:59.999',
                 '2019-03-31 14:59:59.999', '2019-03-31 15:59:59.999', '2019-03-31 16:59:59.999'],
        'Price': [1660, 1657, 1656, 1652, 1646, 1645, 1650, 1669, 1674]}

# function to make a useful time structure as independent variable
# (time.mktime interprets the timestamp in the machine's local time zone
# and returns seconds since the epoch as a float)
def myTime(date_time_str):
    date_time_obj = datetime.datetime.strptime(date_time_str, '%Y-%m-%d %H:%M:%S.%f')
    return time.mktime(date_time_obj.timetuple())

# add time structure to dataset
data['Time'] = [myTime(obs) for obs in data['Date']]

# time for pandas
df = pd.DataFrame(data)

# Function for rolling OLS of a desired window size on a pandas dataframe

def RegressionRoll(df, subset, dependent, independent, const, win):
    """
    RegressionRoll takes a dataframe, makes a subset of the data if you like,
    runs a series of regressions with a specified window length, and
    returns a dataframe with the price, the prediction and the fitted
    coefficients for the last observation of each window.

    Parameters:
    ===========
    df -- pandas dataframe
    subset -- integer - has to be smaller than the size of the df or 0 if no subset.
    dependent -- string that specifies name of dependent variable
    independent -- LIST of strings that specifies name of independent variables
    const -- boolean - whether or not to include a constant term
    win -- integer - window length of each model

    Example:
    ========
    df_rolling = RegressionRoll(df=df, subset = 0,
                                dependent = 'Price', independent = ['Time'],
                                const = False, win = 3)

    """

    # Data subset (0 means: use the whole dataframe)
    if subset != 0:
        df = df.tail(subset)

    # Loop info: end position (exclusive) of each window
    end = df.shape[0] + 1
    rng = np.arange(start = win, stop = end, step = 1)

    # Subset and store dataframes
    frames = {}
    n = 1

    for i in rng:
        df_temp = df.iloc[:i].tail(win)
        newname = 'df' + str(n)
        frames.update({newname: df_temp})
        n += 1

    # Analysis on subsets
    df_results = pd.DataFrame()
    for frame in frames:

        #debug
        #print(frames[frame])

        # Rolling data frames
        dfr = frames[frame]
        y = dependent
        x = independent

        # Model with or without constant
        if const:
            x = sm.add_constant(dfr[x])
            model = sm.OLS(dfr[y], x).fit()
        else:
            model = sm.OLS(dfr[y], dfr[x]).fit()

        # Retrieve price and in-sample prediction for the last row of the window
        Prediction = model.predict()[-1]
        d = {'Price': dfr['Price'].iloc[-1], 'Predict': Prediction}
        df_prediction = pd.DataFrame(d, index = dfr['Date'][-1:])

        # Retrieve parameters (constant and slope, or slope only)
        theParams = model.params[0:]
        coefs = theParams.to_frame()
        df_temp = pd.DataFrame(coefs.T)
        df_temp.index = dfr['Date'][-1:]

        # Build dataframe with Price, Prediction and Slope (+constant if desired)
        df_temp2 = pd.concat([df_prediction, df_temp], axis = 1)
        df_temp2 = df_temp2.rename(columns = {'Time': 'Slope'})
        df_results = pd.concat([df_results, df_temp2], axis = 0)

    return df_results

# test run
df_rolling = RegressionRoll(df=df, subset = 0,
dependent = 'Price', independent = ['Time'],
const = False, win = 3)
print(df_rolling)
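
The myTime helper converts one string at a time and depends on the local time zone of the machine. As a minimal sketch of an alternative, the same conversion can be done on the whole column with pandas. The column names Time and Hours are only illustrative, and treating the naive timestamps as UTC is an assumption, not something stated in the question:

```python
import pandas as pd

# First three rows of the question's data, for brevity.
df = pd.DataFrame({
    'Date': ['2019-03-31 08:59:59.999', '2019-03-31 09:59:59.999',
             '2019-03-31 10:59:59.999'],
    'Price': [1660, 1657, 1656],
})

# Parse the strings into datetime64 values.
dates = pd.to_datetime(df['Date'])

# Seconds since the Unix epoch, treating the naive timestamps as UTC
# (an assumption; time.mktime would use the local time zone instead).
df['Time'] = (dates - pd.Timestamp('1970-01-01')) / pd.Timedelta(seconds=1)

# Hours elapsed since the first observation. Small numbers like these are
# easier to read as a regressor: the slope becomes "price change per hour".
df['Hours'] = (dates - dates.iloc[0]) / pd.Timedelta(hours=1)

print(df)
```

Either column can then be passed as the independent variable, for example independent = ['Hours'] in the RegressionRoll call.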

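The RegressionRoll code could easily be shortened by declaring fewer variables and putting more of the expressions directly into the dictionaries and function calls, but first we can check whether the output it produces really represents what you are after. Also, you did not specify whether the analysis should include a constant term, so I have included an option to handle that as well.

Regarding python - Rolling OLS in pandas using time as the independent variable, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55443071/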
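
For the case with a constant term and a single regressor, the rolling slope and prediction can also be sketched without a loop, using the closed-form simple-regression formulas (slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)) on pandas rolling windows. This is a different technique from the statsmodels loop in the answer, not a rewrite of it. The window length of 3 and the choice of hours since the first observation as the regressor are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': ['2019-03-31 08:59:59.999', '2019-03-31 09:59:59.999',
             '2019-03-31 10:59:59.999', '2019-03-31 11:59:59.999',
             '2019-03-31 12:59:59.999', '2019-03-31 13:59:59.999',
             '2019-03-31 14:59:59.999', '2019-03-31 15:59:59.999',
             '2019-03-31 16:59:59.999'],
    'Price': [1660, 1657, 1656, 1652, 1646, 1645, 1650, 1669, 1674],
})

window = 3  # illustrative window length, same as the answer's test run

# Regressor: hours elapsed since the first observation (an assumption).
dates = pd.to_datetime(df['Date'])
x = (dates - dates.iloc[0]) / pd.Timedelta(hours=1)
y = df['Price'].astype(float)

# Closed-form OLS with an intercept, evaluated on each rolling window.
slope = x.rolling(window).cov(y) / x.rolling(window).var()
intercept = y.rolling(window).mean() - slope * x.rolling(window).mean()

# Slope in price units per hour, and the fitted value at the last row
# of each window. The first window - 1 rows are NaN.
df['Slope'] = slope
df['Predict'] = intercept + slope * x

print(df)
```

With these settings the slope is expressed in price units per hour, so it is directly comparable to the Slope column sketched in the question, unlike a slope measured per second since the epoch.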
