
data-science - Deep feature synthesis depth for transformation primitives | featuretools


I am trying to create new features on a simple dataset using the featuretools library; however, whenever I try a larger max_depth, nothing happens... Here is my code so far:

# imports
import featuretools as ft

# creating the EntitySet
es = ft.EntitySet()
es.entity_from_dataframe(entity_id='data', dataframe=data, make_index=True, index='index')

# Run deep feature synthesis with transformation primitives
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='data', max_depth=3,
                                      trans_primitives=['add_numeric', 'multiply_numeric'])

When I look at the features that were created, I only get the basic ones like f1*f2 and f1+f2, but I would like more complex engineered features such as f2*(f1+f2) or f1+(f2+f1). I thought increasing max_depth would do that, but apparently it does not.
How can I do this, if at all?

Best answer

I have managed to answer my own question, so I will post it here.
You can create deeper features by running deep feature synthesis on the already generated features. Here is an example:

# imports
import featuretools as ft

# creating the EntitySet
es = ft.EntitySet()
es.entity_from_dataframe(entity_id='data', dataframe=data, make_index=True, index='index')

# Run deep feature synthesis with transformation primitives
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='data',
                                      trans_primitives=['add_numeric', 'multiply_numeric'])

# creating an EntitySet from the new features
deep_es = ft.EntitySet()
deep_es.entity_from_dataframe(entity_id='data', index='index', dataframe=feature_matrix)

# Run deep feature synthesis with transformation primitives
deep_feature_matrix, deep_feature_defs = ft.dfs(entityset=deep_es, target_entity='data',
                                                trans_primitives=['add_numeric', 'multiply_numeric'])

Now, looking at the columns of deep_feature_matrix, this is what we see (assuming the dataset has 2 features):
"f1", "f2", "f1+f2", "f1*f2", "f1+f1*f2", "f1+f1+f2", "f1*f2+f1+f2", "f1*f2+f2", "f1+f2+f2", "f1*f1*f2", "f1*f1+f2", "f1*f2*f1+f2", "f1*f2*f2", "f1+f2*f2"
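
(For reference, the snippets above assume a pandas DataFrame named data already exists; a minimal toy frame with two numeric columns, called f1 and f2 here purely as placeholders, could be built like this:)

import numpy as np
import pandas as pd

# Minimal stand-in for the `data` frame used above; f1/f2 are placeholder names
data = pd.DataFrame({
    "f1": np.random.rand(100),
    "f2": np.random.rand(100),
})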

I also made a function that does this automatically (complete with a full docstring):

def auto_feature_engineering(X, y, selection_percent=0.1, selection_strategy="best", num_depth_steps=2, transformatives=['divide_numeric', 'multiply_numeric']):
    """
    Automatically perform deep feature engineering and
    feature selection.

    Parameters
    ----------
    X : pd.DataFrame
        Data to perform automatic feature engineering on.
    y : pd.DataFrame
        Target variable to find correlations of all
        features at each depth step to perform feature
        selection, y is not needed if selection_percent=1.
    selection_percent : float, optional
        Defines what percent of all the new features to
        keep for the next depth step.
    selection_strategy : {'best', 'random'}, optional
        Strategy used for feature selection, if 'best',
        it will select the best features for the next depth
        step, if 'random', it will select features at random.
    num_depth_steps : integer, optional
        The number of depth steps. Every depth step, the model
        generates brand new features from the features made in
        the last step, then selects a percent of these new
        features.
    transformatives : list, optional
        List of all possible transformations of the data to use
        when feature engineering, you can find the full list
        of possible transformations as well as what each one
        does using the following code:
        `ft.primitives.list_primitives()[ft.primitives.list_primitives()["type"]=="transform"]`
        make sure to `import featuretools as ft`.

    Returns
    -------
    pd.DataFrame
        a dataframe of the brand new features.
    """
    # Imports used inside the function
    import pandas as pd
    import featuretools as ft
    from sklearn.feature_selection import mutual_info_classif

    selected_feature_df = X.copy()
    for i in range(num_depth_steps):

        # Perform feature engineering
        es = ft.EntitySet()
        es.entity_from_dataframe(entity_id='data', dataframe=selected_feature_df,
                                 make_index=True, index='index')
        feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='data', trans_primitives=transformatives)

        # Remove features that are the same (identical correlation with the first column)
        feature_corrs = feature_matrix.corr()[list(feature_matrix.keys())[0]]

        existing_corrs = []
        good_keys = []
        for key in feature_corrs.to_dict().keys():
            if feature_corrs[key] not in existing_corrs:
                existing_corrs.append(feature_corrs[key])
                good_keys.append(key)
        feature_matrix = feature_matrix[good_keys]

        # Remove illegal features (those built from too many raw features for this depth)
        legal_features = list(feature_matrix.columns)
        for feature in list(feature_matrix.columns):
            raw_feature_list = []
            for j in range(len(feature.split(" "))):
                if j % 2 == 0:
                    raw_feature_list.append(feature.split(" ")[j])
            if len(raw_feature_list) > i + 2:  # num_depth_steps = 1, means max_num_raw_features_in_feature = 2
                legal_features.remove(feature)
        feature_matrix = feature_matrix[legal_features]

        # Perform feature selection
        if int(selection_percent) != 1:
            if selection_strategy == "best":
                corrs = mutual_info_classif(feature_matrix.reset_index(drop=True), y)
                corrs = pd.Series(corrs, name="")
                selected_corrs = corrs[corrs >= corrs.quantile(1 - selection_percent)]
                selected_feature_df = feature_matrix.iloc[:, list(selected_corrs.keys())].reset_index(drop=True)
            elif selection_strategy == "random":
                selected_feature_df = feature_matrix.sample(frac=(selection_percent), axis=1).reset_index(drop=True)
            else:
                raise Exception("selection_strategy can be either 'best' or 'random', got '"+str(selection_strategy)+"'.")
        else:
            selected_feature_df = feature_matrix.reset_index(drop=True)

        # Wrap every column name in parentheses so the names built in the
        # next depth step stay unambiguous
        if num_depth_steps != 1:
            rename_dict = {}
            for col in list(selected_feature_df.columns):
                rename_dict[col] = "(" + col + ")"
            selected_feature_df = selected_feature_df.rename(columns=rename_dict)

    # Strip the outermost parentheses that were added during the depth steps
    if num_depth_steps != 1:
        rename_dict = {}
        for feature_name in list(selected_feature_df.columns):
            rename_dict[feature_name] = feature_name[int(num_depth_steps-1):-int(num_depth_steps-1)]
        selected_feature_df = selected_feature_df.rename(columns=rename_dict)

    return selected_feature_df
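
Before the full example below, here is a quick hypothetical call just to illustrate the selection path, keeping the best 20% of new features (by mutual information with y) at each of two depth steps; the parameter values are only examples:

# Hypothetical call; X and y are a feature DataFrame and a single-column target DataFrame
new_X = auto_feature_engineering(X, y,
                                 selection_percent=0.2,
                                 selection_strategy="best",
                                 num_depth_steps=2,
                                 transformatives=['divide_numeric', 'multiply_numeric'])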

Here is an example of using it:

# Imports
>>> import seaborn as sns
>>> import pandas as pd
>>> import numpy as np
>>> from sklearn.preprocessing import OrdinalEncoder

# Load the penguins dataset
>>> penguins = sns.load_dataset("penguins")
>>> penguins.head()

species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female

# Fill in NaN values of features using the distribution of the feature
>>> for feature in ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g", "sex"]:
...     s = penguins[feature].value_counts(normalize=True)
...     dist = penguins[feature].value_counts(normalize=True).values
...     missing = penguins[feature].isnull()
...     penguins.loc[missing, feature] = np.random.choice(s.index, size=len(penguins[missing]), p=s.values)

# Make X and y
>>> X = penguins[["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]]
>>> y = penguins[["sex"]]

# Encode "sex" so that "Male" is 1 and "Female" is 0
>>> ord_enc = OrdinalEncoder()
>>> y = pd.DataFrame(ord_enc.fit_transform(y).astype(np.int8), columns=["sex"])

# Generate new dataset with more features
>>> penguins_with_more_features = auto_feature_engineering(X, y, selection_percent=1.)

# Correlations of the raw features
>>> find_correlations(X, y)
body_mass_g 0.422959
bill_depth_mm 0.353526
bill_length_mm 0.342109
flipper_length_mm 0.246944
Name: sex, dtype: float64

# Top 10% correlations of new features
>>> summarize_corr_series(find_top_percent(find_correlations(penguins_with_more_features, y), 0.1))
(flipper_length_mm / bill_depth_mm) / (body_mass_g): 0.7241123396175027
(bill_depth_mm * body_mass_g) / (flipper_length_mm): 0.7237223914820166
(bill_depth_mm * body_mass_g) * (bill_depth_mm): 0.7222108721971968
(bill_depth_mm * body_mass_g): 0.7202272416625914
(bill_depth_mm * body_mass_g) * (flipper_length_mm): 0.6425813490692588
(bill_depth_mm * bill_length_mm) * (body_mass_g): 0.6398235593646668
(bill_depth_mm * flipper_length_mm) * (flipper_length_mm): 0.6360645935216128
(bill_depth_mm * flipper_length_mm): 0.6083364815975281
(bill_depth_mm * body_mass_g) * (body_mass_g): 0.5888925994060027

In this example, we want to predict the sex of a penguin from its attributes body_mass_g, bill_depth_mm, bill_length_mm and flipper_length_mm.

You may also notice the other mysterious functions used in the example, namely find_correlations, summarize_corr_series and find_top_percent. Those are other convenient functions I created to help summarize the results of auto_feature_engineering. Here is the code for them (note that they are not documented):

# These helpers assume pandas is imported as pd (as in the example above)
def summarize_corr_series(feature_corr_series):
    # Find the longest feature name so the printout can be aligned
    max_feature_name_size = 0
    for key in feature_corr_series.to_dict().keys():
        if len(key) > max_feature_name_size:
            max_feature_name_size = len(key)

    max_new_feature_corr = feature_corr_series.max()

    # Print every feature name padded with whitespace, followed by its correlation
    for key in feature_corr_series.to_dict().keys():
        whitespace = []
        for i in range(max_feature_name_size-len(key)):
            whitespace.append(" ")
        whitespace = "".join(whitespace)
        print(key+": "+whitespace+str(abs(feature_corr_series[key])))

def find_top_percent(series, percent):
    # Keep only the entries above the (1 - percent) quantile of the series
    return series[series > series.quantile(1 - percent)]

def find_correlations(X, y):
    # Absolute correlation of every column of X with the single target column of y,
    # sorted from strongest to weakest
    return abs(pd.concat([X.reset_index(drop=True), y.reset_index(drop=True)], axis=1).corr())[y.columns[0]].drop(y.columns[0]).sort_values(ascending=False)
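
(One last side note in case you are on a newer Featuretools release: in Featuretools 1.x, entity_from_dataframe and target_entity were replaced, as far as I know, by add_dataframe and target_dataframe_name. A rough sketch of the first code block under that assumption, to be double-checked against the version you have installed, would be:)

import featuretools as ft

# Rough Featuretools 1.x equivalent of the EntitySet setup above
# (parameter names assumed from the 1.x API; verify locally)
es = ft.EntitySet()
es = es.add_dataframe(dataframe_name='data', dataframe=data,
                      make_index=True, index='index')

feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name='data',
                                      trans_primitives=['add_numeric', 'multiply_numeric'])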

Regarding data-science - Deep feature synthesis depth for transformation primitives | featuretools, there is a similar question on Stack Overflow: https://stackoverflow.com/questions/65448806/
