
python - Pandas dataframe.apply() applies values to the wrong DataFrame columns

Repost. Author: 太空宇宙. Updated: 2023-11-03 15:05:33

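My code uses dataframe.apply() to call a function. The function returns multiple values in a pandas.Series. However, dataframe.apply() applies the values to the wrong columns.

The code below tries to return dte, mark, and iv. The values are printed just before the return statement to verify them.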

import pandas as pd
from pandas import Timestamp
from pandas.tseries.holiday import get_calendar, HolidayCalendarFactory, GoodFriday
from datetime import datetime
from math import sqrt, pi, log, exp, isnan
from scipy.stats import norm


# dff = Daily Fed Funds Rate https://research.stlouisfed.org/fred2/data/DFF.csv
dff = pd.read_csv('https://research.stlouisfed.org/fred2/data/DFF.csv', parse_dates=[0], index_col='DATE')
rf = float('%.4f' % (dff['VALUE'][-1:][0] / 100))
tradingMinutesDay = 450 # 7.5 hours per day * 60 minutes per hour
tradingMinutesAnnum = 113400 # trading minutes per day * 252 trading days per year
USFedCal = get_calendar('USFederalHolidayCalendar') # Load US Federal holiday calendar
USFedCal.rules.pop(7) # Remove Veteran's Day
USFedCal.rules.pop(6) # Remove Columbus Day
tradingCal = HolidayCalendarFactory('TradingCalendar', USFedCal, GoodFriday) # Add Good Friday
cal = tradingCal()


def newtonRap(row):
    # Initialize variables
    dte, mark, iv = 0.0, 0.0, 0.0
    if row['Bid'] == 0.0 or row['Ask'] == 0.0 or row['RootPrice'] == 0.0 or row['Strike'] == 0.0 or \
            row['TimeStamp'] == row['Expiry']:
        iv, vega = 0.0, 0.0  # Set iv and vega to zero if option contract is invalid or expired
    else:
        # dte (Days to expiration) uses pandas bdate_range method to determine the number of business days to expiration
        # minus USFederalHolidays minus constant of 1 for the TimeStamp date
        dte = float(len(pd.bdate_range(row['TimeStamp'], row['Expiry'])) -
                    len(cal.holidays(row['TimeStamp'], row['Expiry']).to_pydatetime()) - 1)
        mark = (row['Bid'] + row['Ask']) / 2
        cp = 1 if row['OptType'] == 'C' else -1
        S = row['RootPrice']
        K = row['Strike']
        T = (dte * tradingMinutesDay) / tradingMinutesAnnum
        iv = sqrt(2 * pi / T) * mark / S  # Initialize IV (Brenner and Subrahmanyam 1988)
        vega = 0.0  # Initialize vega
        # Newton-Raphson iteration on the Black-Scholes price
        for i in range(1, 100):
            d1 = (log(S / K) + T * (rf + iv ** 2 / 2)) / (iv * sqrt(T))
            d2 = d1 - iv * sqrt(T)
            vega = S * norm.pdf(d1) * sqrt(T)
            model = cp * S * norm.cdf(cp * d1) - cp * K * exp(-rf * T) * norm.cdf(cp * d2)
            iv -= (model - mark) / vega
            if abs(model - mark) < 1.0e-5:
                break
        if isnan(iv) or isnan(vega):
            iv, vega = 0.0, 0.0
    print('DTE', dte, 'Mark', mark, 'newtRaphIV', iv)
    return pd.Series({'DTE': dte, 'Mark': mark, 'IV': iv})


if __name__ == "__main__":
    # sample data
    col_order = ['TimeStamp', 'OpraSymbol', 'RootSymbol', 'Expiry', 'Strike', 'OptType', 'RootPrice', 'Last', 'Bid', 'Ask', 'Volume', 'OpenInt', 'IV']
    df = pd.DataFrame({'Ask': {0: 3.7000000000000002, 1: 2.4199999999999999, 2: 3.0, 3: 2.7999999999999998, 4: 2.4500000000000002, 5: 3.25, 6: 5.9500000000000002, 7: 6.2999999999999998},
                       'Bid': {0: 3.6000000000000001, 1: 2.3399999999999999, 2: 2.8599999999999999, 3: 2.7400000000000002, 4: 2.4399999999999999, 5: 3.1000000000000001, 6: 5.7000000000000002, 7: 6.0999999999999996},
                       'Expiry': {0: Timestamp('2015-10-16 16:00:00'), 1: Timestamp('2015-10-16 16:00:00'), 2: Timestamp('2015-10-16 16:00:00'), 3: Timestamp('2015-10-16 16:00:00'), 4: Timestamp('2015-10-16 16:00:00'), 5: Timestamp('2015-10-16 16:00:00'), 6: Timestamp('2015-11-20 16:00:00'), 7: Timestamp('2015-11-20 16:00:00')},
                       'IV': {0: 0.3497, 1: 0.3146, 2: 0.3288, 3: 0.3029, 4: 0.3187, 5: 0.2926, 6: 0.3635, 7: 0.3842},
                       'Last': {0: 3.46, 1: 2.34, 2: 3.0, 3: 2.81, 4: 2.35, 5: 3.20, 6: 5.90, 7: 6.15},
                       'OpenInt': {0: 1290.0, 1: 3087.0, 2: 28850.0, 3: 44427.0, 4: 2318.0, 5: 3773.0, 6: 17112.0, 7: 15704.0},
                       'OpraSymbol': {0: 'AAPL151016C00109000', 1: 'AAPL151016P00109000', 2: 'AAPL151016C00110000', 3: 'AAPL151016P00110000', 4: 'AAPL151016C00111000', 5: 'AAPL151016P00111000', 6: 'AAPL151120C00110000', 7: 'AAPL151120P00110000'},
                       'OptType': {0: 'C', 1: 'P', 2: 'C', 3: 'P', 4: 'C', 5: 'P', 6: 'C', 7: 'P'},
                       'RootPrice': {0: 109.95, 1: 109.95, 2: 109.95, 3: 109.95, 4: 109.95, 5: 109.95, 6: 109.95, 7: 109.95},
                       'RootSymbol': {0: 'AAPL', 1: 'AAPL', 2: 'AAPL', 3: 'AAPL', 4: 'AAPL', 5: 'AAPL', 6: 'AAPL', 7: 'AAPL'},
                       'Strike': {0: 109.0, 1: 109.0, 2: 110.0, 3: 110.0, 4: 111.0, 5: 111.0, 6: 110.0, 7: 110.0},
                       'TimeStamp': {0: Timestamp('2015-09-30 16:00:00'), 1: Timestamp('2015-09-30 16:00:00'), 2: Timestamp('2015-09-30 16:00:00'), 3: Timestamp('2015-09-30 16:00:00'), 4: Timestamp('2015-09-30 16:00:00'), 5: Timestamp('2015-09-30 16:00:00'), 6: Timestamp('2015-09-30 16:00:00'), 7: Timestamp('2015-09-30 16:00:00')},
                       'Volume': {0: 1565.0, 1: 3790.0, 2: 10217.0, 3: 12113.0, 4: 6674.0, 5: 2031.0, 6: 5330.0, 7: 3724.0}})
    df = df[col_order]

    df[['DTE', 'Mark', 'newtRaphIV']] = df.apply(newtonRap, axis=1)
    print(df[['DTE', 'Mark', 'newtRaphIV']])

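When I print the DataFrame columns for dte, mark, and iv, the iv values have been applied to the Mark column and the mark values to the newtRaphIV column.

See the output below: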

DTE 12.0 Mark 3.65 newtRaphIV 0.330446529117
DTE 12.0 Mark 2.38 newtRaphIV 0.297287843836
DTE 12.0 Mark 2.93 newtRaphIV 0.308354580411
DTE 12.0 Mark 2.77 newtRaphIV 0.287119199001
DTE 12.0 Mark 2.445 newtRaphIV 0.305461340472
DTE 12.0 Mark 3.175 newtRaphIV 0.272517270403
DTE 37.0 Mark 5.825 newtRaphIV 0.347642501561
DTE 37.0 Mark 6.2 newtRaphIV 0.368273860485
   DTE      Mark  newtRaphIV
0   12  0.330447       3.650
1   12  0.297288       2.380
2   12  0.308355       2.930
3   12  0.287119       2.770
4   12  0.305461       2.445
5   12  0.272517       3.175
6   37  0.347643       5.825
7   37  0.368274       6.200

This is not the behavior I expected. What is going on?

Best answer

df.apply(newtonRap, axis=1)

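is a DataFrame with columns ['DTE', 'Mark', 'IV'], but the order of those columns is not guaranteed (the reason is explained below). So, to fix the order of the DataFrame columns, you can fix the order of the index of the Series returned by newtonRap: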

return pd.Series((dte, mark, iv), index=['DTE','Mark','IV'])

or fix the order of the columns after df.apply returns:

df[['DTE', 'Mark', 'newtRaphIV']] = df.apply(newtonRap, axis=1)[['DTE', 'Mark', 'IV']]

The first option is better, because

df.apply(newtonRap, axis=1)[['DTE', 'Mark', 'IV']]

creates two intermediate DataFrames, df.apply(newtonRap, axis=1) and df.apply(newtonRap, axis=1)[['DTE', 'Mark', 'IV']], whereas the first option builds the correct DataFrame from the start.
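As a minimal sketch of the first option, the snippet below uses a hypothetical, simplified stand-in for newtonRap (called mark_row here, with placeholder values for DTE and IV that are not computed) and a two-row quotes frame made up for illustration. It only shows that an explicit index fixes the column order of the apply result.

```python
import pandas as pd


def mark_row(row):
    """Hypothetical stand-in for newtonRap; DTE and IV are placeholders."""
    dte = 12.0                             # placeholder, not computed
    mark = (row['Bid'] + row['Ask']) / 2   # same mark formula as the question
    iv = 0.3                               # placeholder, not computed
    # The explicit index fixes the order of the returned labels.
    return pd.Series((dte, mark, iv), index=['DTE', 'Mark', 'IV'])


# Illustrative data only
quotes = pd.DataFrame({'Bid': [3.6, 2.34], 'Ask': [3.7, 2.42]})

result = quotes.apply(mark_row, axis=1)
print(list(result.columns))  # ['DTE', 'Mark', 'IV']

# The positional pairing is now DTE<-DTE, Mark<-Mark, newtRaphIV<-IV.
quotes[['DTE', 'Mark', 'newtRaphIV']] = result
print(quotes)
```

Printing list(result.columns) before the assignment is also a quick way to check which order apply actually produced.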


DataFrame assignment aligns on the index but not on the columns:

Note that an assignment of the form

df[['C','E','D']] = other_df

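aligns on the index, not on column names. So it does not matter what the column names of df.apply(newtonRap, axis=1) are. For example, it would not help to change

return pd.Series({'DTE': dte, 'Mark': mark, 'IV': iv})

to

return pd.Series({'DTE': dte, 'Mark': mark, 'newtRaphIV': iv})

so that the column names of df.apply(newtonRap, axis=1) match those of df[['DTE', 'Mark', 'newtRaphIV']]. If that worked, it would be dumb luck that the order of the columns returned by df.apply(newtonRap, axis=1) happened to match the desired order. To substantiate this claim, consider the example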

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, size=(3,2)), columns=list('AB'))
new = pd.DataFrame(np.arange(9).reshape((3,3)), columns=list('CDE'), index=[2,1,0])
#    C  D  E
# 2  0  1  2
# 1  3  4  5
# 0  6  7  8

df[['C','E','D']] = new
# Example result (the A and B values are random):
#    A  B  C  E  D
# 0  7  9  6  7  8
# 1  4  9  3  4  5
# 2  8  2  0  1  2

Note that the indexes of new and df were aligned, but there was no alignment based on column labels.
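If label-based pairing is what you want, one way to get it is to select the right-hand columns by label in the target order before assigning. The sketch below reuses the shape of the example above with a fixed column A instead of random data; the targets variable is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30]})
new = pd.DataFrame(np.arange(9).reshape((3, 3)), columns=list('CDE'), index=[2, 1, 0])

targets = ['C', 'E', 'D']
# Selecting by label first puts the right-hand columns in the target order,
# so the positional pairing becomes C<-C, E<-E, D<-D.
# Rows are still aligned on the index, as before.
df[targets] = new[targets]
print(df)
```

This is the same idea as the second option in the answer: reorder the right-hand side explicitly rather than relying on the order it happens to have.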


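Fixing the order of the columns of the DataFrame returned by apply:

Note that dictionary keys are unordered in older Python versions; insertion order is only guaranteed by the language from Python 3.7 onward. In other words, when iterated over, the keys may appear in any order. In fact, in Python 3 versions before 3.6, dict.keys() may return the same keys in a different order each time the same code is run.

Because dictionary keys have no defined order there,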

pd.Series({'DTE': dte, 'Mark': mark, 'IV': iv})

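is a Series whose index has no defined order, and so df.apply(newtonRap, axis=1) is a DataFrame whose columns appear in no defined order. In the question's output the columns evidently came back as DTE, IV, Mark, which is why the iv values landed in the Mark column.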

However, if you use

return pd.Series((dte, mark, iv), index=['DTE','Mark','IV'])

then the order of the Series index is fixed. Therefore df.apply(newtonRap, axis=1) has a fixed column order, and

df[['DTE', 'Mark', 'newtRaphIV']] = df.apply(newtonRap, axis=1)

works as desired.
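A variation on the same fix, shown here only as a sketch: if you would rather keep the dict so each value stays next to its label, you can still pass an explicit index. When a dict is passed together with index, pandas looks the values up by label, so the resulting order follows the index argument regardless of the dict's own order. The values and the order list below are made up for illustration.

```python
import pandas as pd

order = ['DTE', 'Mark', 'IV']

# Deliberately written in a different order than the one we want.
values = {'Mark': 3.65, 'IV': 0.33, 'DTE': 12.0}

s = pd.Series(values, index=order)
print(list(s.index))  # ['DTE', 'Mark', 'IV']
print(s['DTE'])       # 12.0
```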

Regarding python - Pandas dataframe.apply() applies values to the wrong DataFrame columns, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33138116/
