
Python Pandas -- Random sampling of time series

Repost. Author: 太空宇宙. Updated: 2023-11-03 11:55:09

I'm new to Pandas and looking for the most efficient way to do this.

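I have a Series of DataFrames. Every DataFrame has the same columns but a different index, and each one is indexed by date. The Series itself is indexed by ticker symbol, so each item in the Series is a single time series of one stock's performance.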
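To make that layout concrete, here is a minimal sketch of such a Series built from synthetic data. The helper name `make_stock_series`, the tickers, the column names `close` and `volume`, and the history lengths are all made up for illustration; the real data will differ.

```python
import numpy as np
import pandas as pd

def make_stock_series(tickers, lengths, seed=0):
    """Build a Series of DataFrames keyed by ticker (synthetic data only)."""
    rng = np.random.default_rng(seed)
    # Fill an object array one element at a time so that NumPy stores each
    # DataFrame as a single object instead of trying to unpack it.
    frames = np.empty(len(tickers), dtype=object)
    for i, n in enumerate(lengths):
        index = pd.date_range("2012-01-02", periods=n, freq="B")
        frames[i] = pd.DataFrame(
            {
                "close": rng.normal(size=n).cumsum(),      # hypothetical column
                "volume": rng.integers(1, 1000, size=n),   # hypothetical column
            },
            index=index,
        )
    return pd.Series(frames, index=list(tickers))

# Three hypothetical tickers with different amounts of history.
data = make_stock_series(["AAA", "BBB", "CCC"], [600, 650, 700])
```

With this layout, `data["BBB"]` looks a stock up by ticker and `data.iloc[1]` looks it up by position.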

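I need to randomly generate a list of n DataFrames, where each DataFrame is a subset of the history of some randomly chosen stock. Overlap is fine, as long as the start and end dates differ.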

The code below does this, but it is really slow, and I'm wondering whether there is a better way to approach it:

Code

def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
    if type(data) != pd.Series:
        return None

    if subset=='validate':
        offset = 0
    elif subset=='test':
        offset = 200
    elif subset=='train':
        offset = 400

    tickers = np.random.randint(0, len(data), size=len(data))

    ret_data = []
    while len(ret_data) != batch_size:
        for t in tickers:
            data_t = data[t]
            max_len = len(data_t)-timesteps-1
            if len(ret_data)==batch_size: break
            if max_len-offset < 0: continue

            index = np.random.randint(offset, max_len)
            d = data_t[index:index+timesteps]
            if len(d)==timesteps: ret_data.append(d)

    return ret_data

Profiler output:

Timer unit: 1e-06 s

File: finance.py
Function: random_sample at line 137
Total time: 0.016142 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   137                                           @profile
   138                                           def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
   139         1            5      5.0      0.0      if type(data) != pd.Series:
   140                                                   return None
   141
   142         1            1      1.0      0.0      if subset=='validate':
   143                                                   offset = 0
   144         1            1      1.0      0.0      elif subset=='test':
   145                                                   offset = 200
   146         1            0      0.0      0.0      elif subset=='train':
   147         1            1      1.0      0.0          offset = 400
   148
   149         1         1835   1835.0     11.4      tickers = np.random.randint(0, len(data), size=len(data))
   150
   151         1            2      2.0      0.0      ret_data = []
   152         2            3      1.5      0.0      while len(ret_data) != batch_size:
   153       116          148      1.3      0.9          for t in tickers:
   154       116         2497     21.5     15.5              data_t = data[t]
   155       116          317      2.7      2.0              max_len = len(data_t)-timesteps-1
   156       116           80      0.7      0.5              if len(ret_data)==batch_size: break
   157       115           69      0.6      0.4              if max_len-offset < 0: continue
   158
   159       100          101      1.0      0.6              index = np.random.randint(offset, max_len)
   160       100        10840    108.4     67.2              d = data_t[index:index+timesteps]
   161       100          241      2.4      1.5              if len(d)==timesteps: ret_data.append(d)
   162
   163         1            1      1.0      0.0      return ret_data

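Best answer

Are you sure you need a faster method? Your current approach is not that slow. The changes below may simplify the code, but they won't necessarily make it faster:

Step 1: Take a random sample (with replacement) from the list of DataFrames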

rand_stocks = np.random.randint(0, len(data), size=batch_size) 

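You can think of the array rand_stocks as a list of indices to apply to your Series of DataFrames. Its size is already the batch size, so the while loop and the comparison on line 156 are no longer needed.

That is, you can iterate over rand_stocks and access each stock like this:

for idx in rand_stocks:
    stock = data.iloc[idx]  # positional lookup; the original .ix accessor was removed in pandas 1.0
    # Get a sample from this stock.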

Step 2: Get a random date range for each stock you have randomly selected.

start_idx = np.random.randint(offset, len(stock)-timesteps)
d = stock[start_idx:start_idx+timesteps]

I don't have your data, but this is how I put it together:

def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
    if subset=='train': offset = 0 #you can obviously change this back
    rand_stocks = np.random.randint(0, len(data), size=batch_size)
    ret_data = []
    for idx in rand_stocks:
        stock = data[idx]
        start_idx = np.random.randint(offset, len(stock)-timesteps)
        d = stock[start_idx:start_idx+timesteps]
        ret_data.append(d)
    return ret_data
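The function above sets offset only for the 'train' subset and assumes every stock has more than offset + timesteps rows. As a rough sketch of one way to restore the per-subset offsets from the question and skip stocks with too little history: the name `random_windows`, the `OFFSETS` dictionary, and the `rng` parameter are introduced here for illustration and are not part of the original answer. Unlike the question's code, it also allows the very last full window to be chosen.

```python
import numpy as np
import pandas as pd

# Offsets copied from the question's code.
OFFSETS = {"validate": 0, "test": 200, "train": 400}

def random_windows(data, timesteps=100, batch_size=100, subset="train", rng=None):
    """Return batch_size random windows of `timesteps` consecutive rows."""
    rng = np.random.default_rng() if rng is None else rng
    offset = OFFSETS[subset]
    # Keep only stocks long enough to hold one full window after the offset.
    # Iterating works for a list and for a Series of DataFrames alike.
    eligible = [stock for stock in data if len(stock) - timesteps >= offset]
    if not eligible:
        raise ValueError("no stock has enough history for this subset")
    windows = []
    # Sample stocks with replacement, one draw per requested window.
    for idx in rng.integers(0, len(eligible), size=batch_size):
        stock = eligible[idx]
        # The upper bound is exclusive, so +1 makes the last full window reachable.
        start = rng.integers(offset, len(stock) - timesteps + 1)
        windows.append(stock.iloc[start:start + timesteps])
    return windows

# Usage on synthetic data: one stock with 72 rows and one with only 3 rows.
index = pd.date_range("2012-01-02", periods=72, freq="B")
long_stock = pd.Series(np.arange(72.0), index=index)
short_stock = long_stock.iloc[:3]  # too short for a 5-row window, never sampled
batch = random_windows([long_stock, short_stock], timesteps=5,
                       batch_size=10, subset="validate",
                       rng=np.random.default_rng(0))
```

Filtering the eligible stocks up front means every draw yields a window of exactly `timesteps` rows, so no retry loop or length check is needed.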

Creating the dataset:

In [22]: import numpy as np
In [23]: import pandas as pd

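In [24]: rndrange = pd.date_range('1/1/2012', periods=72, freq='B')  # pd.DateRange no longer exists in pandas; freq='B' (business days) matches the dates printed below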
In [25]: rndseries = pd.Series(np.random.randn(len(rndrange)), index=rndrange)
In [26]: rndseries.head()
Out[26]:
2012-01-02 2.025795
2012-01-03 1.731667
2012-01-04 0.092725
2012-01-05 -0.489804
2012-01-06 -0.090041

In [27]: data = [rndseries,rndseries,rndseries,rndseries,rndseries,rndseries]

Testing the function:

In [42]: random_sample(data, timesteps=2, batch_size = 2)
Out[42]:
[2012-01-23 1.464576
2012-01-24 -1.052048,
2012-01-23 1.464576
2012-01-24 -1.052048]

A similar question about Python Pandas -- random sampling of time series can be found on Stack Overflow: https://stackoverflow.com/questions/13239297/
