gpt4 book ai didi

python - 替代 df.iterrows() 用于连接两个 Postgres 表和计算功能

转载 作者:行者123 更新时间:2023-11-29 12:16:29 25 4
gpt4 key购买 nike

我有一个数据框 (game_df) 和一个 Postgres 表 (team_stats_1970_2017)。

game_df 由几千行数据组成,如下所示......

      season_yr home_team visitor_team  home_team_runs  visitor_team_runs
0 2017 ARI SFG 6 5
1 2017 ARI SFG 4 8
2 2017 ARI SFG 8 6
3 2017 ARI SFG 9 3
4 2017 ARI CLE 7 3
5 2017 ARI CLE 11 2
6 2017 ATL LAD 2 3

team_stats_1970_2017 会有相应的数据

   team season_yr  r_per_g pa      ab    b_r  b_h   b2   b3   b_hr
0 ARI 2017 5.01 6224.0 5525 812 1405 314 39 220
1 ATL 2017 4.52 6216.0 5584 732 1467 289 26 165
2 CLE 2017 5.05 6234.0 5511 818 1449 333 29 212
3 LAD 2017 4.75 6191.0 5408 770 1347 312 20 221
4 SFG 2017 3.94 6137.0 5551 639 1382 290 28 128

例如,对于 game_df 的第 1 行,代码从 Postgres 中的 team_stats_1970_2017 选择“ARI”和“SFG”数据,并从中创建特征。然后对 game_df 中的其余行重复此操作。

我目前正在使用 df.iterrows,但我注意到它非常慢,因为我只测试了一小部分数据,而且它仍然需要一段时间。有人对此有更好/更快的选择吗?

features = []
results = []
for index,row in game_df.iterrows():
import psycopg2 as pg2
connect = pg2.connect(login)
cursor=connect.cursor()

year, t1, t2, p1, p2 = row
p1, p2 = map(int, [p1, p2])
feature1 = cursor.execute("SELECT * FROM team_stats_1970_2017 WHERE team = (%s) and season_yr = (%s)",(t1,year))
feature1 = list(cursor.fetchone()[2::])
feature2 = cursor.execute("SELECT * FROM team_stats_1970_2017 WHERE team = (%s) and season_yr = (%s)",(t2,year))
feature2 = list(cursor.fetchone()[2::])
feature = np.array(feature2) - np.array(feature1)

## Calculate result of game
if (p2 - p1) > 0:
result = 1
else:
result = 0

features.append(feature)
results.append(result)

最佳答案

这是另一种更容易理解的方法,但使用 merge 作为@sacul 的解决方案。我将使用 team_stats_1970_2017 中的值创建两个数据帧 df_visitordf_home,为每一行和 'visitor_team 列中的团队''home_team' 分别为:

df_visitor = (game_df[['season_yr','visitor_team']].rename(columns={'visitor_team':'team'})
.merge(team_stats_1970_2017, how='left'))

df_home = (game_df[['season_yr','home_team']].rename(columns={'home_team':'team'})
.merge(team_stats_1970_2017, how='left'))

例如,对于 df_home,您会得到:

   season_yr team  r_per_g      pa    ab  b_r   b_h   b2  b3  b_hr
0 2017 ARI 5.01 6224.0 5525 812 1405 314 39 220
1 2017 ARI 5.01 6224.0 5525 812 1405 314 39 220
2 2017 ARI 5.01 6224.0 5525 812 1405 314 39 220
3 2017 ARI 5.01 6224.0 5525 812 1405 314 39 220
4 2017 ARI 5.01 6224.0 5525 812 1405 314 39 220
5 2017 ARI 5.01 6224.0 5525 812 1405 314 39 220
6 2017 ATL 4.52 6216.0 5584 732 1467 289 26 165

对于每一行,它是来自 team_stats_1970_2017 的值,该值与 game_df 中的团队相关联,位于 'home_team' 列的同一行。

现在要在原始数据帧 game_df 处添加差异,您可以执行以下操作:

# first get the lists of columns you want to add
col_features = team_stats_1970_2017.columns[2:]
game_df[col_features] = df_visitor[col_features] - df_home[col_features]

最后要添加列结果,您可以使用 np.where :

import numpy as np
game_df['results'] = np.where(game_df['visitor_team_runs'] > game_df['home_team_runs'], 1, 0)
# meaning if p2 > p1 then 1 else 0 as in your code I think

关于python - 替代 df.iterrows() 用于连接两个 Postgres 表和计算功能,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/51662445/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com