
data-science - Deep feature synthesis depth for transformation primitives | featuretools


I am trying to create new features on a simple dataset using the featuretools library; however, whenever I try a larger max_depth, nothing happens... Here is my code so far:

# imports
import featuretools as ft

# creating the EntitySet
es = ft.EntitySet()
es.entity_from_dataframe(entity_id='data', dataframe=data, make_index=True, index='index')

# Run deep feature synthesis with transformation primitives
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='data', max_depth=3,
                                      trans_primitives=['add_numeric', 'multiply_numeric'])

When I look at the features that were created, I only get the basic ones like f1*f2 and f1+f2, but I would like more complex engineered features such as f2*(f1+f2) or f1+(f2+f1). I thought increasing max_depth would do that, but apparently it does not.
How can I do this, if at all?

Best answer

I have managed to answer my own question, so I will post it here.
You can create deeper features by running deep feature synthesis on the already generated features. Here is an example:

# imports
import featuretools as ft

# creating the EntitySet
es = ft.EntitySet()
es.entity_from_dataframe(entity_id='data', dataframe=data, make_index=True, index='index')

# Run deep feature synthesis with transformation primitives
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='data',
                                      trans_primitives=['add_numeric', 'multiply_numeric'])

# creating an EntitySet from the new features
deep_es = ft.EntitySet()
deep_es.entity_from_dataframe(entity_id='data', index='index', dataframe=feature_matrix)

# Run deep feature synthesis with transformation primitives
deep_feature_matrix, deep_feature_defs = ft.dfs(entityset=deep_es, target_entity='data',
                                                trans_primitives=['add_numeric', 'multiply_numeric'])

Now, looking at the columns of deep_feature_matrix, this is what we see (assuming the dataset has 2 features):
"f1", "f2", "f1+f2", "f1*f2", "f1+f1*f2", "f1+f1+f2", "f1*f2+f1+f2", "f1*f2+f2", "f1+f2+f2", "f1*f1*f2", "f1*f1+f2", "f1*f2*f1+f2", "f1*f2*f2", "f1+f2*f2"
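
(For reference, the snippets above assume a pandas DataFrame named data already exists; a minimal toy frame with two numeric columns, called f1 and f2 here purely as placeholders, could be built like this:)

import numpy as np
import pandas as pd

# Minimal stand-in for the `data` frame used above; f1/f2 are placeholder names
data = pd.DataFrame({
    "f1": np.random.rand(100),
    "f2": np.random.rand(100),
})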

I also made a function that does this automatically (complete with a full docstring):

def auto_feature_engineering(X, y, selection_percent=0.1, selection_strategy="best", num_depth_steps=2, transformatives=['divide_numeric', 'multiply_numeric']):
    """
    Automatically perform deep feature engineering and
    feature selection.

    Parameters
    ----------
    X : pd.DataFrame
        Data to perform automatic feature engineering on.
    y : pd.DataFrame
        Target variable to find correlations of all
        features at each depth step to perform feature
        selection, y is not needed if selection_percent=1.
    selection_percent : float, optional
        Defines what percent of all the new features to
        keep for the next depth step.
    selection_strategy : {'best', 'random'}, optional
        Strategy used for feature selection, if 'best',
        it will select the best features for the next depth
        step, if 'random', it will select features at random.
    num_depth_steps : integer, optional
        The number of depth steps. Every depth step, the model
        generates brand new features from the features made in
        the last step, then selects a percent of these new
        features.
    transformatives : list, optional
        List of all possible transformations of the data to use
        when feature engineering, you can find the full list
        of possible transformations as well as what each one
        does using the following code:
        `ft.primitives.list_primitives()[ft.primitives.list_primitives()["type"]=="transform"]`
        make sure to `import featuretools as ft`.

    Returns
    -------
    pd.DataFrame
        a dataframe of the brand new features.
    """
    # Imports used inside the function
    import pandas as pd
    import featuretools as ft
    from sklearn.feature_selection import mutual_info_classif

    selected_feature_df = X.copy()
    for i in range(num_depth_steps):

        # Perform feature engineering
        es = ft.EntitySet()
        es.entity_from_dataframe(entity_id='data', dataframe=selected_feature_df,
                                 make_index=True, index='index')
        feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='data', trans_primitives=transformatives)

        # Remove features that are the same (identical correlation with the first column)
        feature_corrs = feature_matrix.corr()[list(feature_matrix.keys())[0]]

        existing_corrs = []
        good_keys = []
        for key in feature_corrs.to_dict().keys():
            if feature_corrs[key] not in existing_corrs:
                existing_corrs.append(feature_corrs[key])
                good_keys.append(key)
        feature_matrix = feature_matrix[good_keys]

        # Remove illegal features (those built from too many raw features for this depth)
        legal_features = list(feature_matrix.columns)
        for feature in list(feature_matrix.columns):
            raw_feature_list = []
            for j in range(len(feature.split(" "))):
                if j % 2 == 0:
                    raw_feature_list.append(feature.split(" ")[j])
            if len(raw_feature_list) > i + 2:  # num_depth_steps = 1, means max_num_raw_features_in_feature = 2
                legal_features.remove(feature)
        feature_matrix = feature_matrix[legal_features]

        # Perform feature selection
        if int(selection_percent) != 1:
            if selection_strategy == "best":
                corrs = mutual_info_classif(feature_matrix.reset_index(drop=True), y)
                corrs = pd.Series(corrs, name="")
                selected_corrs = corrs[corrs >= corrs.quantile(1 - selection_percent)]
                selected_feature_df = feature_matrix.iloc[:, list(selected_corrs.keys())].reset_index(drop=True)
            elif selection_strategy == "random":
                selected_feature_df = feature_matrix.sample(frac=(selection_percent), axis=1).reset_index(drop=True)
            else:
                raise Exception("selection_strategy can be either 'best' or 'random', got '"+str(selection_strategy)+"'.")
        else:
            selected_feature_df = feature_matrix.reset_index(drop=True)

        # Wrap every column name in parentheses so the names built in the
        # next depth step stay unambiguous
        if num_depth_steps != 1:
            rename_dict = {}
            for col in list(selected_feature_df.columns):
                rename_dict[col] = "(" + col + ")"
            selected_feature_df = selected_feature_df.rename(columns=rename_dict)

    # Strip the outermost parentheses that were added during the depth steps
    if num_depth_steps != 1:
        rename_dict = {}
        for feature_name in list(selected_feature_df.columns):
            rename_dict[feature_name] = feature_name[int(num_depth_steps-1):-int(num_depth_steps-1)]
        selected_feature_df = selected_feature_df.rename(columns=rename_dict)

    return selected_feature_df
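
Before the full example below, here is a quick hypothetical call just to illustrate the selection path, keeping the best 20% of new features (by mutual information with y) at each of two depth steps; the parameter values are only examples:

# Hypothetical call; X and y are a feature DataFrame and a single-column target DataFrame
new_X = auto_feature_engineering(X, y,
                                 selection_percent=0.2,
                                 selection_strategy="best",
                                 num_depth_steps=2,
                                 transformatives=['divide_numeric', 'multiply_numeric'])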

Here is an example of using it:

# Imports
>>> import seaborn as sns
>>> import pandas as pd
>>> import numpy as np
>>> from sklearn.preprocessing import OrdinalEncoder

# Load the penguins dataset
>>> penguins = sns.load_dataset("penguins")
>>> penguins.head()

species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female

# Fill in NaN values of features using the distribution of the feature
>>> for feature in ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g", "sex"]:
...     s = penguins[feature].value_counts(normalize=True)
...     dist = penguins[feature].value_counts(normalize=True).values
...     missing = penguins[feature].isnull()
...     penguins.loc[missing, feature] = np.random.choice(s.index, size=len(penguins[missing]), p=s.values)

# Make X and y
>>> X = penguins[["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]]
>>> y = penguins[["sex"]]

# Encode "sex" so that "Male" is 1 and "Female" is 0
>>> ord_enc = OrdinalEncoder()
>>> y = pd.DataFrame(ord_enc.fit_transform(y).astype(np.int8), columns=["sex"])

# Generate new dataset with more features
>>> penguins_with_more_features = auto_feature_engineering(X, y, selection_percent=1.)

# Correlations of the raw features
>>> find_correlations(X, y)
body_mass_g 0.422959
bill_depth_mm 0.353526
bill_length_mm 0.342109
flipper_length_mm 0.246944
Name: sex, dtype: float64

# Top 10% correlations of new features
>>> summarize_corr_series(find_top_percent(find_correlations(penguins_with_more_features, y), 0.1))
(flipper_length_mm / bill_depth_mm) / (body_mass_g): 0.7241123396175027
(bill_depth_mm * body_mass_g) / (flipper_length_mm): 0.7237223914820166
(bill_depth_mm * body_mass_g) * (bill_depth_mm): 0.7222108721971968
(bill_depth_mm * body_mass_g): 0.7202272416625914
(bill_depth_mm * body_mass_g) * (flipper_length_mm): 0.6425813490692588
(bill_depth_mm * bill_length_mm) * (body_mass_g): 0.6398235593646668
(bill_depth_mm * flipper_length_mm) * (flipper_length_mm): 0.6360645935216128
(bill_depth_mm * flipper_length_mm): 0.6083364815975281
(bill_depth_mm * body_mass_g) * (body_mass_g): 0.5888925994060027

In this example, we want to predict the sex of a penguin from its attributes body_mass_g, bill_depth_mm, bill_length_mm and flipper_length_mm.

You may also notice the other mysterious functions used in the example, namely find_correlations, summarize_corr_series and find_top_percent. Those are other convenient functions I created to help summarize the results of auto_feature_engineering. Here is the code for them (note that they are not documented):

# These helpers assume pandas is imported as pd (as in the example above)
def summarize_corr_series(feature_corr_series):
    # Find the longest feature name so the printout can be aligned
    max_feature_name_size = 0
    for key in feature_corr_series.to_dict().keys():
        if len(key) > max_feature_name_size:
            max_feature_name_size = len(key)

    max_new_feature_corr = feature_corr_series.max()

    # Print every feature name padded with whitespace, followed by its correlation
    for key in feature_corr_series.to_dict().keys():
        whitespace = []
        for i in range(max_feature_name_size-len(key)):
            whitespace.append(" ")
        whitespace = "".join(whitespace)
        print(key+": "+whitespace+str(abs(feature_corr_series[key])))

def find_top_percent(series, percent):
    # Keep only the entries above the (1 - percent) quantile of the series
    return series[series > series.quantile(1 - percent)]

def find_correlations(X, y):
    # Absolute correlation of every column of X with the single target column of y,
    # sorted from strongest to weakest
    return abs(pd.concat([X.reset_index(drop=True), y.reset_index(drop=True)], axis=1).corr())[y.columns[0]].drop(y.columns[0]).sort_values(ascending=False)
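
(One last side note in case you are on a newer Featuretools release: in Featuretools 1.x, entity_from_dataframe and target_entity were replaced, as far as I know, by add_dataframe and target_dataframe_name. A rough sketch of the first code block under that assumption, to be double-checked against the version you have installed, would be:)

import featuretools as ft

# Rough Featuretools 1.x equivalent of the EntitySet setup above
# (parameter names assumed from the 1.x API; verify locally)
es = ft.EntitySet()
es = es.add_dataframe(dataframe_name='data', dataframe=data,
                      make_index=True, index='index')

feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name='data',
                                      trans_primitives=['add_numeric', 'multiply_numeric'])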

Regarding data-science - Deep feature synthesis depth for transformation primitives | featuretools, there is a similar question on Stack Overflow: https://stackoverflow.com/questions/65448806/
