
data-science - Deep feature synthesis depth with transform primitives | featuretools


I am trying to use the featuretools library to create new features on a simple dataset, however, whenever I try to use a bigger max_depth, nothing happens... Here is my code so far:

# imports
import featuretools as ft

# creating the EntitySet
es = ft.EntitySet()
es.entity_from_dataframe(entity_id='data', dataframe=data, make_index=True, index='index')

# Run deep feature synthesis with transformation primitives
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='data', max_depth=3,
                                      trans_primitives=['add_numeric', 'multiply_numeric'])

When I look at the features created, I get the basic stuff like f1*f2 and f1+f2, but I would like more complex engineered features such as f2*(f1+f2) or f1+(f2+f1). I thought increasing max_depth would do this, but apparently not.
How can I do this, if at all?

Best Answer

I have managed to answer my own question, so I'll post it here.
You can create deeper features by running Deep Feature Synthesis again on the features that have already been generated. Here is an example:

# imports
import featuretools as ft

# creating the EntitySet
es = ft.EntitySet()
es.entity_from_dataframe(entity_id='data', dataframe=data, make_index=True, index='index')

# Run deep feature synthesis with transformation primitives
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='data',
                                      trans_primitives=['add_numeric', 'multiply_numeric'])

# creating an EntitySet from the new features
deep_es = ft.EntitySet()
deep_es.entity_from_dataframe(entity_id='data', index='index', dataframe=feature_matrix)

# Run deep feature synthesis with transformation primitives
deep_feature_matrix, deep_feature_defs = ft.dfs(entityset=deep_es, target_entity='data',
                                                trans_primitives=['add_numeric', 'multiply_numeric'])

Now, looking at the columns of deep_feature_matrix, here is what we see (assuming a dataset with 2 features):
"f1", "f2", "f1+f2", "f1*f2", "f1+f1*f2", "f1+f1+f2", "f1*f2+f1+f2", "f1*f2+f2", "f1+f2+f2", "f1*f1*f2", "f1*f1+f2", "f1*f2*f1+f2", "f1*f2*f2", "f1+f2*f2"
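
If you would rather inspect the generated definitions programmatically than read the column list, something along these lines should work with the deep_feature_defs returned by the second ft.dfs call (a small sketch, not part of the original answer):

# Sketch: print the name of every stacked feature from the second DFS pass
for feature in deep_feature_defs:
    print(feature.get_name())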

I have also made a function that does this automatically (including a full docstring):

def auto_feature_engineering(X, y, selection_percent=0.1, selection_strategy="best", num_depth_steps=2, transformatives=['divide_numeric', 'multiply_numeric']):
    """
    Automatically perform deep feature engineering and
    feature selection.

    Parameters
    ----------
    X : pd.DataFrame
        Data to perform automatic feature engineering on.
    y : pd.DataFrame
        Target variable to find correlations of all
        features at each depth step to perform feature
        selection, y is not needed if selection_percent=1.
    selection_percent : float, optional
        Defines what percent of all the new features to
        keep for the next depth step.
    selection_strategy : {'best', 'random'}, optional
        Strategy used for feature selection, if 'best',
        it will select the best features for the next depth
        step, if 'random', it will select features at random.
    num_depth_steps : integer, optional
        The number of depth steps. Every depth step, the model
        generates brand new features from the features made in
        the last step, then selects a percent of these new
        features.
    transformatives : list, optional
        List of all possible transformations of the data to use
        when feature engineering, you can find the full list
        of possible transformations as well as what each one
        does using the following code:
        `ft.primitives.list_primitives()[ft.primitives.list_primitives()["type"]=="transform"]`
        make sure to `import featuretools as ft`.

    Returns
    -------
    pd.DataFrame
        a dataframe of the brand new features.
    """
    # featuretools and pandas are required here (they can also be imported at module level)
    import featuretools as ft
    import pandas as pd
    from sklearn.feature_selection import mutual_info_classif
    selected_feature_df = X.copy()
    for i in range(num_depth_steps):

        # Perform feature engineering
        es = ft.EntitySet()
        es.entity_from_dataframe(entity_id='data', dataframe=selected_feature_df,
                                 make_index=True, index='index')
        feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='data', trans_primitives=transformatives)

        # Remove features that are the same
        feature_corrs = feature_matrix.corr()[list(feature_matrix.keys())[0]]

        existing_corrs = []
        good_keys = []
        for key in feature_corrs.to_dict().keys():
            if feature_corrs[key] not in existing_corrs:
                existing_corrs.append(feature_corrs[key])
                good_keys.append(key)
        feature_matrix = feature_matrix[good_keys]

        # Remove illegal features
        legal_features = list(feature_matrix.columns)
        for feature in list(feature_matrix.columns):
            raw_feature_list = []
            for j in range(len(feature.split(" "))):
                if j%2==0:
                    raw_feature_list.append(feature.split(" ")[j])
            if len(raw_feature_list) > i+2:  # num_depth_steps = 1, means max_num_raw_features_in_feature = 2
                legal_features.remove(feature)
        feature_matrix = feature_matrix[legal_features]

        # Perform feature selection
        if int(selection_percent)!=1:
            if selection_strategy=="best":
                corrs = mutual_info_classif(feature_matrix.reset_index(drop=True), y)
                corrs = pd.Series(corrs, name="")
                selected_corrs = corrs[corrs>=corrs.quantile(1-selection_percent)]
                selected_feature_df = feature_matrix.iloc[:, list(selected_corrs.keys())].reset_index(drop=True)
            elif selection_strategy=="random":
                selected_feature_df = feature_matrix.sample(frac=(selection_percent), axis=1).reset_index(drop=True)
            else:
                raise Exception("selection_strategy can be either 'best' or 'random', got '"+str(selection_strategy)+"'.")
        else:
            selected_feature_df = feature_matrix.reset_index(drop=True)
        if num_depth_steps!=1:
            rename_dict = {}
            for col in list(selected_feature_df.columns):
                rename_dict[col] = "("+col+")"
            selected_feature_df = selected_feature_df.rename(columns=rename_dict)
    if num_depth_steps!=1:
        rename_dict = {}
        for feature_name in list(selected_feature_df.columns):
            rename_dict[feature_name] = feature_name[int(num_depth_steps-1):-int(num_depth_steps-1)]
        selected_feature_df = selected_feature_df.rename(columns=rename_dict)
    return selected_feature_df

Here is an example of using it:

# Imports
>>> import seaborn as sns
>>> import pandas as pd
>>> import numpy as np
>>> from sklearn.preprocessing import OrdinalEncoder

# Load the penguins dataset
>>> penguins = sns.load_dataset("penguins")
>>> penguins.head()

  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    Male
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  Female
2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  Female
3  Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN
4  Adelie  Torgersen            36.7           19.3              193.0       3450.0  Female

# Fill in NaN values of features using the distribution of the feature
>>> for feature in ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g", "sex"]:
...     s = penguins[feature].value_counts(normalize=True)
...     dist = penguins[feature].value_counts(normalize=True).values
...     missing = penguins[feature].isnull()
...     penguins.loc[missing, feature] = np.random.choice(s.index, size=len(penguins[missing]), p=s.values)

# Make X and y
>>> X = penguins[["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]]
>>> y = penguins[["sex"]]

# Encode "sex" so that "Male" is 1 and "Female" is 0
>>> ord_enc = OrdinalEncoder()
>>> y = pd.DataFrame(ord_enc.fit_transform(y).astype(np.int8), columns=["sex"])

# Generate new dataset with more features
>>> penguins_with_more_features = auto_feature_engineering(X, y, selection_percent=1.)

# Correlations of the raw features
>>> find_correlations(X, y)
body_mass_g 0.422959
bill_depth_mm 0.353526
bill_length_mm 0.342109
flipper_length_mm 0.246944
Name: sex, dtype: float64

# Top 10% correlations of new features
>>> summarize_corr_series(find_top_percent(find_correlations(penguins_with_more_features, y), 0.1))
(flipper_length_mm / bill_depth_mm) / (body_mass_g): 0.7241123396175027
(bill_depth_mm * body_mass_g) / (flipper_length_mm): 0.7237223914820166
(bill_depth_mm * body_mass_g) * (bill_depth_mm): 0.7222108721971968
(bill_depth_mm * body_mass_g): 0.7202272416625914
(bill_depth_mm * body_mass_g) * (flipper_length_mm): 0.6425813490692588
(bill_depth_mm * bill_length_mm) * (body_mass_g): 0.6398235593646668
(bill_depth_mm * flipper_length_mm) * (flipper_length_mm): 0.6360645935216128
(bill_depth_mm * flipper_length_mm): 0.6083364815975281
(bill_depth_mm * body_mass_g) * (body_mass_g): 0.5888925994060027

In this example, we want to predict the sex of a penguin based on its attributes body_mass_g, bill_depth_mm, bill_length_mm and flipper_length_mm.
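
As a quick sanity check (a rough sketch on top of the example above, not part of the original answer), you could train a simple scikit-learn classifier on the raw features and on the engineered ones and compare cross-validated accuracy:

# Sketch: compare a simple classifier on the raw vs. engineered features.
# Assumes X, y and penguins_with_more_features from the example above.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

clf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(clf, X, y.values.ravel(), cv=5).mean())                            # raw features
print(cross_val_score(clf, penguins_with_more_features, y.values.ravel(), cv=5).mean())  # engineered features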

You may have noticed the other mysterious functions I used in the example, namely find_correlations, summarize_corr_series and find_top_percent. These are other handy functions I created to help summarize the results of auto_feature_engineering. Here is their code (note that they are not documented):

def summarize_corr_series(feature_corr_series):
    max_feature_name_size = 0
    for key in feature_corr_series.to_dict().keys():
        if len(key) > max_feature_name_size:
            max_feature_name_size = len(key)

    max_new_feature_corr = feature_corr_series.max()

    for key in feature_corr_series.to_dict().keys():
        whitespace = []
        for i in range(max_feature_name_size-len(key)):
            whitespace.append(" ")
        whitespace = "".join(whitespace)
        print(key+": "+whitespace+str(abs(feature_corr_series[key])))

def find_top_percent(series, percent):
    return series[series>series.quantile(1-percent)]

def find_correlations(X, y):
    return abs(pd.concat([X.reset_index(drop=True), y.reset_index(drop=True)], axis=1).corr())[y.columns[0]].drop(y.columns[0]).sort_values(ascending=False)

Regarding data-science - Deep feature synthesis depth with transform primitives | featuretools, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/65448806/
