
python - How do I use Featuretools to create features from multiple columns of a single dataframe, grouped by column value?

Reposted. Author: 太空宇宙. Last updated: 2023-11-04 02:11:52

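I am trying to predict the outcome of football matches based on previous results. I am running Python 3.6 on Windows and using Featuretools 0.4.1.

Suppose I have the following dataframe representing the history of results.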

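Original DataFrame

Using the dataframe above, I want to create the following dataframe, which will be fed to a machine learning algorithm as X. Note that the average goals for both the home team and the away team must be calculated per team, even though a team appears in different columns (home or away) in past results. Is there a way to create such a dataframe with Featuretools?

Resulting DataFrame

The Excel file used to mock up the transformation can be found here.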

Best answer

This is a tricky feature, but a great use of custom primitives in Featuretools.

The first step is to load the CSV of matches into a Featuretools EntitySet.

import featuretools as ft
import pandas as pd

es = ft.EntitySet()
matches_df = pd.read_csv("./matches.csv")
es.entity_from_dataframe(entity_id="matches",
                         index="match_id",
                         time_index="match_date",
                         dataframe=matches_df)
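
The CSV itself is not reproduced on this page. As a rough sketch, an equivalent in-memory dataframe can be rebuilt from the rows of the final feature matrix printed at the end of this answer. The column names are taken from that matrix; the dates are an assumption inferred from its DAY, MONTH, YEAR and WEEKDAY columns (2014-01-01 through 2014-01-12).

```python
import pandas as pd

# Rows transcribed from the feature matrix at the end of this answer.
# match_date is inferred from the DAY/MONTH/YEAR columns, not from the CSV.
matches_df = pd.DataFrame({
    "match_id": list(range(1, 13)),
    "match_date": pd.date_range("2014-01-01", periods=12, freq="D"),
    "home_team": (["Arsenal"] * 3 + ["Chelsea"] * 3) * 2,
    "away_team": (["Chelsea"] * 3 + ["Arsenal"] * 3) * 2,
    "home_goals": [2, 1, 0, 1, 2, 2, 2, 0, 1, 3, 2, 4],
    "away_goals": [0, 0, 3, 1, 0, 1, 2, 1, 3, 1, 2, 1],
    "label": ["1", "1", "2", "X", "1", "1", "X", "2", "2", "1", "X", "1"],
})

print(matches_df.head())
```

This `matches_df` could be passed as the `dataframe` argument in place of the result of `pd.read_csv`.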

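Next we define a custom transform primitive that calculates the average number of goals scored over the previous n games. It takes two parameters: one controls how many past games to use, the other whether to calculate for the home team or the away team. Information on defining custom primitives is in our documentation here.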

import pandas as pd
from featuretools.variable_types import Numeric, Categorical
from featuretools.primitives import make_trans_primitive

def avg_goals_previous_n_games(home_team, away_team, home_goals, away_goals, which_team=None, n=1):
    # make dataframe so it's easier to work with
    df = pd.DataFrame({
        "home_team": home_team,
        "away_team": away_team,
        "home_goals": home_goals,
        "away_goals": away_goals
    })

    result = []
    # enumerate yields the row position, so the slice below works
    # regardless of the index labels on the incoming series
    for pos, (_, current_game) in enumerate(df.iterrows()):
        # get the right team for this game
        team = current_game[which_team]

        # find all previous games that have been played
        prev_games = df.iloc[:pos]

        # only get games the team participated in
        participated = prev_games[(prev_games["home_team"] == team) | (prev_games["away_team"] == team)]
        if participated.shape[0] < n:
            # not enough history yet for a window of n games
            result.append(None)
            continue

        # get last n games
        last_n = participated.tail(n)

        # calculate goals per game
        goal_as_home = (last_n["home_team"] == team) * last_n["home_goals"]
        goal_as_away = (last_n["away_team"] == team) * last_n["away_goals"]

        # calculate mean across all home and away games
        mean = (goal_as_home + goal_as_away).mean()

        result.append(mean)

    return result

# custom function so the name of the feature prints out correctly
def make_name(self):
    return "%s_goal_last_%d" % (self.kwargs['which_team'], self.kwargs['n'])


AvgGoalPreviousNGames = make_trans_primitive(function=avg_goals_previous_n_games,
                                             input_types=[Categorical, Categorical, Numeric, Numeric],
                                             return_type=Numeric,
                                             cls_attributes={"generate_name": make_name, "uses_full_entity": True})
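
The core of the function is the multiplication of a boolean mask by a goals column, which keeps only the goals scored by the team of interest. The sketch below isolates that step in plain pandas. The three rows are matches 3, 4 and 5 from the feature matrix in this answer, and `team` is fixed to Arsenal purely for illustration.

```python
import pandas as pd

# Matches 3, 4 and 5 as they appear in the feature matrix in this answer.
last_n = pd.DataFrame({
    "home_team": ["Arsenal", "Chelsea", "Chelsea"],
    "away_team": ["Chelsea", "Arsenal", "Arsenal"],
    "home_goals": [0, 1, 2],
    "away_goals": [3, 1, 0],
})
team = "Arsenal"

# A boolean mask behaves as 0/1 under multiplication, so each product
# keeps the goals from rows where the team played on that side.
goal_as_home = (last_n["home_team"] == team) * last_n["home_goals"]
goal_as_away = (last_n["away_team"] == team) * last_n["away_goals"]

# Exactly one of the two terms is non-masked per row, so the sum is
# the team's goals in each game, whichever side it played on.
goals_for_team = goal_as_home + goal_as_away
mean_goals = goals_for_team.mean()

print(goals_for_team.tolist(), mean_goals)
```

The mean here corresponds to the away_team_goal_last_3 value for match 6, where Arsenal is the away side.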

Now we can define features using this primitive. In this case, we have to do it manually.

input_vars = [es["matches"]["home_team"], es["matches"]["away_team"], es["matches"]["home_goals"], es["matches"]["away_goals"]]
home_team_last1 = AvgGoalPreviousNGames(*input_vars, which_team="home_team", n=1)
home_team_last3 = AvgGoalPreviousNGames(*input_vars, which_team="home_team", n=3)
home_team_last5 = AvgGoalPreviousNGames(*input_vars, which_team="home_team", n=5)
away_team_last1 = AvgGoalPreviousNGames(*input_vars, which_team="away_team", n=1)
away_team_last3 = AvgGoalPreviousNGames(*input_vars, which_team="away_team", n=3)
away_team_last5 = AvgGoalPreviousNGames(*input_vars, which_team="away_team", n=5)

features = [home_team_last1, home_team_last3, home_team_last5,
            away_team_last1, away_team_last3, away_team_last5]
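
The six definitions differ only in `which_team` and `n`, so the list could also be generated in a loop. The following is a sketch only: `build_features` is a hypothetical helper, not part of Featuretools, and `fake_primitive` is a stand-in that echoes the feature name so the snippet runs without Featuretools installed.

```python
from itertools import product

def build_features(primitive, input_vars,
                   teams=("home_team", "away_team"), windows=(1, 3, 5)):
    """Hypothetical helper: call `primitive` once per (team, window) pair."""
    return [primitive(*input_vars, which_team=team, n=n)
            for team, n in product(teams, windows)]

# Stand-in for AvgGoalPreviousNGames (an assumption for this sketch):
# it only returns the name the real feature would be given.
def fake_primitive(*input_vars, which_team, n):
    return "%s_goal_last_%d" % (which_team, n)

names = build_features(fake_primitive, input_vars=[])
print(names)
```

With the real primitive, the call would be `features = build_features(AvgGoalPreviousNGames, input_vars)`, which yields the features in the same order as the manual list.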

Finally, we can calculate the feature matrix

fm = ft.calculate_feature_matrix(entityset=es, features=features)

which returns

          home_team_goal_last_1  home_team_goal_last_3  home_team_goal_last_5  away_team_goal_last_1  away_team_goal_last_3  away_team_goal_last_5
match_id
1                           NaN                    NaN                    NaN                    NaN                    NaN                    NaN
2                           2.0                    NaN                    NaN                    0.0                    NaN                    NaN
3                           1.0                    NaN                    NaN                    0.0                    NaN                    NaN
4                           3.0               1.000000                    NaN                    0.0               1.000000                    NaN
5                           1.0               1.333333                    NaN                    1.0               0.666667                    NaN
6                           2.0               2.000000                    1.2                    0.0               0.333333                    0.8
7                           1.0               0.666667                    0.6                    2.0               1.666667                    1.6
8                           2.0               1.000000                    0.8                    2.0               2.000000                    2.0
9                           0.0               1.000000                    0.8                    1.0               1.666667                    1.6
10                          3.0               2.000000                    2.0                    1.0               1.000000                    0.8
11                          3.0               2.333333                    2.2                    1.0               0.666667                    1.0
12                          2.0               2.666667                    2.2                    2.0               1.333333                    1.2
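
As a sanity check on these numbers, the same per-team averages can be reproduced in plain pandas without Featuretools. This is a sketch of an alternative technique (reshape to one row per team per match, then a shifted rolling mean per team), not the method used by the primitive. The function name `team_goal_averages` is made up for this example, and the data is the first six matches from the feature matrix in this answer.

```python
import pandas as pd

def team_goal_averages(matches, n):
    """Mean goals each side scored over its previous n games, home or away."""
    home = matches[["match_id", "home_team", "home_goals"]].rename(
        columns={"home_team": "team", "home_goals": "goals"})
    away = matches[["match_id", "away_team", "away_goals"]].rename(
        columns={"away_team": "team", "away_goals": "goals"})
    # One row per (match, team), in chronological order.
    long = pd.concat([home, away], ignore_index=True).sort_values("match_id")
    # shift(1) excludes the current match; rolling(n) leaves NaN until
    # the team has n earlier games.
    long["avg"] = long.groupby("team")["goals"].transform(
        lambda s: s.shift(1).rolling(n).mean())
    lookup = long.set_index(["match_id", "team"])["avg"]

    out = pd.DataFrame(index=matches["match_id"])
    for side in ("home_team", "away_team"):
        keys = pd.MultiIndex.from_arrays([matches["match_id"], matches[side]])
        out["%s_goal_last_%d" % (side, n)] = lookup.reindex(keys).to_numpy()
    return out

# First six matches, transcribed from the feature matrix in this answer.
matches = pd.DataFrame({
    "match_id": [1, 2, 3, 4, 5, 6],
    "home_team": ["Arsenal"] * 3 + ["Chelsea"] * 3,
    "away_team": ["Chelsea"] * 3 + ["Arsenal"] * 3,
    "home_goals": [2, 1, 0, 1, 2, 2],
    "away_goals": [0, 0, 3, 1, 0, 1],
})

avg1 = team_goal_averages(matches, n=1)
avg3 = team_goal_averages(matches, n=3)
print(avg1.join(avg3))
```

This assumes the rows can be ordered by `match_id`; with real data, sorting by the match date would be the safer choice.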

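Finally, we can also use these manually defined features as input to automated feature engineering with Deep Feature Synthesis, which is explained here. By passing the manually defined features as seed_features, ft.dfs will automatically stack new features on top of them.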

fm, feature_defs = ft.dfs(entityset=es,
                          target_entity="matches",
                          seed_features=features,
                          agg_primitives=[],
                          trans_primitives=["day", "month", "year", "weekday", "percentile"])

feature_defs

[<Feature: home_team>,
<Feature: away_team>,
<Feature: home_goals>,
<Feature: away_goals>,
<Feature: label>,
<Feature: home_team_goal_last_1>,
<Feature: home_team_goal_last_3>,
<Feature: home_team_goal_last_5>,
<Feature: away_team_goal_last_1>,
<Feature: away_team_goal_last_3>,
<Feature: away_team_goal_last_5>,
<Feature: DAY(match_date)>,
<Feature: MONTH(match_date)>,
<Feature: YEAR(match_date)>,
<Feature: WEEKDAY(match_date)>,
<Feature: PERCENTILE(home_goals)>,
<Feature: PERCENTILE(away_goals)>,
<Feature: PERCENTILE(home_team_goal_last_1)>,
<Feature: PERCENTILE(home_team_goal_last_3)>,
<Feature: PERCENTILE(home_team_goal_last_5)>,
<Feature: PERCENTILE(away_team_goal_last_1)>,
<Feature: PERCENTILE(away_team_goal_last_3)>,
<Feature: PERCENTILE(away_team_goal_last_5)>]

The feature matrix is

         home_team away_team  home_goals  away_goals label  home_team_goal_last_1  home_team_goal_last_3  home_team_goal_last_5  away_team_goal_last_1  away_team_goal_last_3  away_team_goal_last_5  DAY(match_date)  MONTH(match_date)  YEAR(match_date)  WEEKDAY(match_date)  PERCENTILE(home_goals)  PERCENTILE(away_goals)  PERCENTILE(home_team_goal_last_1)  PERCENTILE(home_team_goal_last_3)  PERCENTILE(home_team_goal_last_5)  PERCENTILE(away_team_goal_last_1)  PERCENTILE(away_team_goal_last_3)  PERCENTILE(away_team_goal_last_5)
match_id
1 Arsenal Chelsea 2 0 1 NaN NaN NaN NaN NaN NaN 1 1 2014 2 0.666667 0.166667 NaN NaN NaN NaN NaN NaN
2 Arsenal Chelsea 1 0 1 2.0 NaN NaN 0.0 NaN NaN 2 1 2014 3 0.333333 0.166667 0.590909 NaN NaN 0.227273 NaN NaN
3 Arsenal Chelsea 0 3 2 1.0 NaN NaN 0.0 NaN NaN 3 1 2014 4 0.125000 0.958333 0.272727 NaN NaN 0.227273 NaN NaN
4 Chelsea Arsenal 1 1 X 3.0 1.000000 NaN 0.0 1.000000 NaN 4 1 2014 5 0.333333 0.500000 0.909091 0.333333 NaN 0.227273 0.500000 NaN
5 Chelsea Arsenal 2 0 1 1.0 1.333333 NaN 1.0 0.666667 NaN 5 1 2014 6 0.666667 0.166667 0.272727 0.555556 NaN 0.590909 0.277778 NaN
6 Chelsea Arsenal 2 1 1 2.0 2.000000 1.2 0.0 0.333333 0.8 6 1 2014 0 0.666667 0.500000 0.590909 0.722222 0.571429 0.227273 0.111111 0.214286
7 Arsenal Chelsea 2 2 X 1.0 0.666667 0.6 2.0 1.666667 1.6 7 1 2014 1 0.666667 0.791667 0.272727 0.111111 0.142857 0.909091 0.833333 0.785714
8 Arsenal Chelsea 0 1 2 2.0 1.000000 0.8 2.0 2.000000 2.0 8 1 2014 2 0.125000 0.500000 0.590909 0.333333 0.357143 0.909091 1.000000 1.000000
9 Arsenal Chelsea 1 3 2 0.0 1.000000 0.8 1.0 1.666667 1.6 9 1 2014 3 0.333333 0.958333 0.090909 0.333333 0.357143 0.590909 0.833333 0.785714
10 Chelsea Arsenal 3 1 1 3.0 2.000000 2.0 1.0 1.000000 0.8 10 1 2014 4 0.916667 0.500000 0.909091 0.722222 0.714286 0.590909 0.500000 0.214286
11 Chelsea Arsenal 2 2 X 3.0 2.333333 2.2 1.0 0.666667 1.0 11 1 2014 5 0.666667 0.791667 0.909091 0.888889 0.928571 0.590909 0.277778 0.428571
12 Chelsea Arsenal 4 1 1 2.0 2.666667 2.2 2.0 1.333333 1.2 12 1 2014 6 1.000000 0.500000 0.590909 1.000000 0.928571 0.909091 0.666667 0.571429
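
To make the stacked columns easier to interpret, here is a rough pandas equivalent of the five transform primitives applied to match_date and home_goals. It is an illustration, not Featuretools' own implementation: the PERCENTILE values in the matrix above are consistent with `Series.rank(pct=True)`, and the dates are inferred from the DAY, MONTH and YEAR columns.

```python
import pandas as pd

# Dates inferred from the DAY/MONTH/YEAR columns; goals copied from the matrix.
dates = pd.Series(pd.date_range("2014-01-01", periods=12, freq="D"))
home_goals = pd.Series([2, 1, 0, 1, 2, 2, 2, 0, 1, 3, 2, 4])

derived = pd.DataFrame({
    "DAY(match_date)": dates.dt.day,
    "MONTH(match_date)": dates.dt.month,
    "YEAR(match_date)": dates.dt.year,
    "WEEKDAY(match_date)": dates.dt.weekday,  # Monday is 0
    # Average rank divided by the number of non-null values.
    "PERCENTILE(home_goals)": home_goals.rank(pct=True),
})

print(derived)
```

Because `rank(pct=True)` skips missing values, the same call applied to a column such as home_team_goal_last_1 would leave the NaN rows as NaN and rank the rest among the non-null entries.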

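For "python - How do I use Featuretools to create features from multiple columns of a single dataframe, grouped by column value?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53579465/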
