
python - FeatureTools: handling many-to-many relationships

Reposted. Author: 太空宇宙. Updated: 2023-11-04 00:10:28

I have a purchases dataframe with several columns, including the following three:

 PURCHASE_ID (index of purchase)
WORKER_ID (index of worker)
ACCOUNT_ID (index of account)

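A worker can be associated with multiple accounts, and an account can have multiple workers.

If I create WORKER and ACCOUNT entities and add the relationships, I get an error: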

KeyError: 'Variable: ACCOUNT_ID not found in entity'

Here is my code so far:

import pandas as pd
import featuretools as ft
import featuretools.variable_types as vtypes

d = {'PURCHASE_ID': [1, 2],
     'WORKER_ID': [0, 0],
     'ACCOUNT_ID': [1, 2],
     'COST': [5, 10],
     'PURCHASE_TIME': ['2018-01-01 01:00:00', '2016-01-01 02:00:00']}
df = pd.DataFrame(data=d)

data_variable_types = {'PURCHASE_ID': vtypes.Id,
                       'WORKER_ID': vtypes.Id,
                       'ACCOUNT_ID': vtypes.Id,
                       'COST': vtypes.Numeric,
                       'PURCHASE_TIME': vtypes.Datetime}

es = ft.EntitySet('Purchase')
es = es.entity_from_dataframe(entity_id='purchases',
                              dataframe=df,
                              index='PURCHASE_ID',
                              time_index='PURCHASE_TIME',
                              variable_types=data_variable_types)

es.normalize_entity(base_entity_id='purchases',
                    new_entity_id='workers',
                    index='WORKER_ID',
                    additional_variables=['ACCOUNT_ID'],
                    make_time_index=False)

es.normalize_entity(base_entity_id='purchases',
                    new_entity_id='accounts',
                    index='ACCOUNT_ID',
                    additional_variables=['WORKER_ID'],
                    make_time_index=False)

fm, features = ft.dfs(entityset=es,
                      target_entity='purchases',
                      agg_primitives=['mean'],
                      trans_primitives=[],
                      verbose=True)
features

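How can I separate the entities so that they capture the many-to-many relationship?

Best answer

Your approach is correct, but you don't need the additional_variables argument. If you omit it, your code runs without any problems.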

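The purpose of additional_variables in EntitySet.normalize_entity is to include any other variables you want in the new parent entity you are creating. For example, say you had variables for hire date, salary, location, and so on. You would pass those as additional variables because they are static with respect to a worker. In this case, I don't think you have any variables like that.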
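To see why the original code raised the KeyError: the Featuretools documentation describes additional_variables as variables that are removed from the base entity and moved to the new one, so after the first call ACCOUNT_ID no longer lives in purchases. The sketch below imitates that move in plain pandas. It is only an analogy, not Featuretools' implementation; the `normalize` helper and the wording of its error message are hypothetical and exist for illustration only.

```python
import pandas as pd

# Same toy data as in the question (only the columns needed here).
df = pd.DataFrame({
    "PURCHASE_ID": [1, 2],
    "WORKER_ID": [0, 0],
    "ACCOUNT_ID": [1, 2],
    "COST": [5, 10],
})


def normalize(base, index, additional=()):
    """Hypothetical pandas stand-in for normalize_entity (illustration only)."""
    if index not in base.columns:
        # Mimics the message from the question; not the library's own code path.
        raise KeyError(f"Variable: {index} not found in entity")
    # Parent table: one row per unique index value, plus the moved columns.
    parent = (base[[index, *additional]]
              .drop_duplicates(subset=index)
              .reset_index(drop=True))
    # The additional columns are moved, so they are dropped from the base table.
    remaining = base.drop(columns=list(additional))
    return remaining, parent


purchases, workers = normalize(df, "WORKER_ID", additional=["ACCOUNT_ID"])
print(list(purchases.columns))  # ACCOUNT_ID is no longer a purchases column

try:
    # The second split fails because ACCOUNT_ID was moved to workers.
    normalize(purchases, "ACCOUNT_ID", additional=["WORKER_ID"])
except KeyError as err:
    print("second normalize failed:", err)
```

Note that the `workers` table in this sketch keeps only one ACCOUNT_ID for worker 0 even though that worker has purchases on two accounts, which is another sign that ACCOUNT_ID is not a static attribute of a worker and does not belong in additional_variables.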

Here is the code and the output I see:

import pandas as pd
import featuretools as ft
import featuretools.variable_types as vtypes

d = {'PURCHASE_ID': [1, 2],
     'WORKER_ID': [0, 0],
     'ACCOUNT_ID': [1, 2],
     'COST': [5, 10],
     'PURCHASE_TIME': ['2018-01-01 01:00:00', '2016-01-01 02:00:00']}
df = pd.DataFrame(data=d)

data_variable_types = {'PURCHASE_ID': vtypes.Id,
                       'WORKER_ID': vtypes.Id,
                       'ACCOUNT_ID': vtypes.Id,
                       'COST': vtypes.Numeric,
                       'PURCHASE_TIME': vtypes.Datetime}

es = ft.EntitySet('Purchase')
es = es.entity_from_dataframe(entity_id='purchases',
                              dataframe=df,
                              index='PURCHASE_ID',
                              time_index='PURCHASE_TIME',
                              variable_types=data_variable_types)

es.normalize_entity(base_entity_id='purchases',
                    new_entity_id='workers',
                    index='WORKER_ID',
                    make_time_index=False)

es.normalize_entity(base_entity_id='purchases',
                    new_entity_id='accounts',
                    index='ACCOUNT_ID',
                    make_time_index=False)

fm, features = ft.dfs(entityset=es,
                      target_entity='purchases',
                      agg_primitives=['mean'],
                      trans_primitives=[],
                      verbose=True)
features

This outputs:

[<Feature: WORKER_ID>,
<Feature: ACCOUNT_ID>,
<Feature: COST>,
<Feature: workers.MEAN(purchases.COST)>,
<Feature: accounts.MEAN(purchases.COST)>]

If we change the target entity and increase the depth:

fm, features = ft.dfs(entityset=es,
                      target_entity='workers',
                      agg_primitives=['mean', 'count'],
                      max_depth=3,
                      trans_primitives=[],
                      verbose=True)
features

The output is now the features for the workers entity:

[<Feature: COUNT(purchases)>,
<Feature: MEAN(purchases.COST)>,
<Feature: MEAN(purchases.accounts.MEAN(purchases.COST))>,
<Feature: MEAN(purchases.accounts.COUNT(purchases))>]

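Let's walk through the feature named MEAN(purchases.accounts.COUNT(purchases)):

  1. For a given worker, find every purchase related to that worker.
  2. For each of those purchases, compute the total number of purchases made by the account involved in that purchase.
  3. Average those counts over all of the given worker's purchases.

In other words: "What is the average number of purchases made by the accounts related to this worker's purchases?"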
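The three steps above can be reproduced by hand in pandas, which may help when checking what a stacked feature means. This is a minimal sketch on the question's toy data, not how Featuretools computes the feature internally; the column name `account_purchase_count` is an assumption introduced only for illustration.

```python
import pandas as pd

# Same toy data as in the question (only the columns needed here).
purchases = pd.DataFrame({
    "PURCHASE_ID": [1, 2],
    "WORKER_ID": [0, 0],
    "ACCOUNT_ID": [1, 2],
    "COST": [5, 10],
})

# Step 2: accounts.COUNT(purchases), broadcast back onto every purchase row.
purchases["account_purchase_count"] = (
    purchases.groupby("ACCOUNT_ID")["PURCHASE_ID"].transform("count")
)

# Steps 1 and 3: group the purchases by worker and average the per-purchase counts.
mean_account_count = (
    purchases.groupby("WORKER_ID")["account_purchase_count"].mean()
)

print(mean_account_count)
```

With this data each account has exactly one purchase, so the value for worker 0 is 1.0. The feature only becomes informative once accounts are shared across several purchases or workers.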

Regarding python - FeatureTools: handling many-to-many relationships, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52629549/
