
python - How to use Imputer in a DataFrameMapper on a dataframe?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 16:10:03

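I want to apply a DataFrameMapper mapping of Imputer + scaler to all float64 columns of a dataframe. My code works with StandardScaler alone, but as soon as I add Imputer, the mapper returns a single row of all zeros.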

I have seen the question "Imputer on some Dataframe columns in Python" and the tutorial at https://github.com/paulgb/sklearn-pandas, and I get this warning:

site-packages\sklearn\utils\validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.

So I understand there is a shape mismatch. How should the sample dataframe below be reshaped?

import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler, Imputer

# just a random dataframe from http://pandas.pydata.org/pandas-docs/stable/10min.html
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

print "Starting with a random dataframe of 6 rows and 4 columns of floats:"
print df.shape
print df

mapping=[('A', [Imputer(), StandardScaler()]), ('C', [Imputer(), StandardScaler()])]
mapper = DataFrameMapper(mapping)

result = mapper.fit_transform(df)

print "I get an unexpected result of all zeroes in just one row."
print result.shape
print result

print "Expected is a dataframe of 2 columns and 6 rows of scaled floats."
print "something like this:"

mapping=[('A', [StandardScaler()]), ('C', [StandardScaler()])]
mapper = DataFrameMapper(mapping)

result_scaler = mapper.fit_transform(df)
print result_scaler.shape
print result_scaler

This is the output:

Starting with a random dataframe of 6 rows and 4 columns of floats.
(6, 4)
A B C D
2013-01-01 -0.070551 0.039074 0.513491 -0.830585
2013-01-02 -0.313069 -1.028936 2.359338 -0.830518
2013-01-03 -1.264926 -0.830575 0.461515 0.427228
2013-01-04 -0.374400 0.619986 0.318128 0.361712
2013-01-05 -0.235587 -1.647786 -0.819940 -1.036435
2013-01-06 1.436073 0.312183 1.566990 -0.272224
Unexpected result is all zeroes in just one row.
(1L, 12L)
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
Expected is a dataframe of 2 columns and 6 rows of scaled floats.
something like this
(6L, 2L)
[[ 0.08306789 -0.21892275]
[-0.21975387 1.61986719]
[-1.40829622 -0.27069922]
[-0.29633508 -0.4135387 ]
[-0.12300572 -1.54725542]
[ 1.964323 0.83054889]]

A follow-up question: my original dataframe is a mix of floats, booleans and objects (labels). So when I have a list

floats = list(df.select_dtypes(include=['float64']).columns)
mapping=[(f, [Imputer(missing_values=0,strategy="mean"), StandardScaler()]) for f in floats]

how do I prepare the dataframe for these columns (shape it for the Imputer)?

Best answer

As of now (version 1.1.0) there are simpler ways to do this, without creating additional wrapper classes.

The first is the form of the column selector, which determines the shape of the array passed to the transformers:

  • a plain string (such as 'A'): a 1-dimensional array is passed
  • a list with one element (such as ['A']): a 2-dimensional array with one column is passed
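
The difference between the two selector forms can be pictured with plain pandas indexing. This is only a sketch of the shapes involved; it does not call DataFrameMapper itself, and it assumes the mapper's selection is analogous to indexing with a string versus a one-element list.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(6, 4), columns=list("ABCD"))

one_d = df["A"].values      # string selector: 1-dimensional array
two_d = df[["A"]].values    # one-element list selector: 2-dimensional array, one column

print(one_d.shape)
print(two_d.shape)

# reshape(-1, 1), as suggested by the warning, gives the same single-feature layout
same_layout = np.array_equal(one_d.reshape(-1, 1), two_d)
print(same_layout)
```

The 2-dimensional form, with one row per sample and one column for the feature, is the layout the deprecation warning asks for.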

So in your case it is enough to change the mapping definition (note the brackets around the column names):

mapping=[(['A'], [Imputer(), StandardScaler()]), (['C'], [Imputer(), StandardScaler()])]
mapper = DataFrameMapper(mapping)
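
To see why the 2-dimensional shape matters, here is a minimal sketch that runs the two transformers directly on a single column, outside of DataFrameMapper. The sample values are hypothetical, and SimpleImputer (from current scikit-learn) stands in for the older Imputer. Current scikit-learn rejects 1-dimensional input with a ValueError; the older release in the question only warned and treated the array as a single sample, which is consistent with the (1, 12) result: two columns of six values each, flattened into one row.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical column with one missing value, shaped (6, 1) like a ['A'] selection
col = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [3.0]])

imputed = SimpleImputer(strategy="mean").fit_transform(col)  # NaN replaced by the column mean
scaled = StandardScaler().fit_transform(imputed)             # zero mean, unit variance

print(scaled.shape)

# The same data as a 1-dimensional array is refused by current scikit-learn
rejected_1d = False
try:
    SimpleImputer(strategy="mean").fit_transform(col.ravel())
except ValueError:
    rejected_1d = True
print(rejected_1d)
```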

If you want to apply the same transformations to all selected columns, another option is the gen_features function. You can do the following:

from sklearn_pandas import DataFrameMapper, gen_features

feature_def = gen_features(columns=[['A'], ['C']], classes=[Imputer, StandardScaler])
mapper = DataFrameMapper(feature_def)

This also answers your second question. Just select your columns, use the right column selector type, and combine it with gen_features.

float_cols = list(df.select_dtypes(include=['float64']).columns)
# Use brackets for every column for 2D input shape
float_cols_2d = [[f] for f in float_cols]
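
As a small illustration of the selection step on a mixed-type frame like the one described in the follow-up question, the sketch below uses hypothetical column names and values; only the select_dtypes call and the bracket wrapping come from the snippet above.

```python
import pandas as pd

# Hypothetical frame mixing floats, a boolean column and an object (label) column
df_mixed = pd.DataFrame({
    "height": [1.5, 2.0, 0.0],
    "weight": [60.0, 0.0, 72.5],
    "active": [True, False, True],
    "label": ["a", "b", "c"],
})

# Keep only the float64 columns, then wrap each name in a list for 2D input
float_cols = list(df_mixed.select_dtypes(include=["float64"]).columns)
float_cols_2d = [[f] for f in float_cols]

print(float_cols)
print(float_cols_2d)
```

The boolean and label columns are left out, so they would need their own entries in the mapping if they are to appear in the output.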

One last "trick": if you prefer DataFrame output to a numpy array, you can pass the df_out=True option to DataFrameMapper. The final example could look like this (note that I replaced Imputer with the current SimpleImputer):

import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper, gen_features
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

float_cols = list(df.select_dtypes(include=['float64']).columns)
float_cols_2d = [[f] for f in float_cols]
feature_def = gen_features(columns=float_cols_2d, classes=[SimpleImputer, StandardScaler])
mapper = DataFrameMapper(feature_def, df_out=True)

result = mapper.fit_transform(df)

Regarding "python - How to use Imputer in a DataFrameMapper on a dataframe?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/39410776/
