
python - How to do proper imputation in Python/Sklearn

Reposted. Author: 行者123. Updated: 2023-11-30 21:57:54

I have the following data. Note that Age contains a NaN. My goal is to impute all columns correctly.

+----+-------------+----------+--------+------+-------+-------+---------+
| ID | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
+----+-------------+----------+--------+------+-------+-------+---------+
| 0 | 1 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 |
| 1 | 2 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 |
| 2 | 3 | 1 | 3 | 26.0 | 0 | 0 | 7.9250 |
| 3 | 4 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 |
| 4 | 5 | 0 | 3 | 35.0 | 0 | 0 | 8.0500 |
| 5 | 6 | 0 | 3 | NaN | 0 | 0 | 8.4583 |
+----+-------------+----------+--------+------+-------+-------+---------+

I have working code that imputes all columns. The result is shown below, and it looks wrong.

+----+-------------+----------+--------+-----------+-------+-------+---------+
| ID | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
+----+-------------+----------+--------+-----------+-------+-------+---------+
| 0 | 1.0 | 0.0 | 3.0 | 22.000000 | 1.0 | 0.0 | 7.2500 |
| 1 | 2.0 | 1.0 | 1.0 | 38.000000 | 1.0 | 0.0 | 71.2833 |
| 2 | 3.0 | 1.0 | 3.0 | 26.000000 | 0.0 | 0.0 | 7.9250 |
| 3 | 4.0 | 1.0 | 1.0 | 35.000000 | 1.0 | 0.0 | 53.1000 |
| 4 | 5.0 | 0.0 | 3.0 | 35.000000 | 0.0 | 0.0 | 8.0500 |
| 5 | 6.0 | 0.0 | 3.0 | 2.909717 | 0.0 | 0.0 | 8.4583 |
+----+-------------+----------+--------+-----------+-------+-------+---------+

My code is as follows:

import pandas as pd
import numpy as np

#https://www.kaggle.com/shivamp629/traincsv/downloads/traincsv.zip/1
data = pd.read_csv("train.csv")

data2 = data[['PassengerId', 'Survived','Pclass','Age','SibSp','Parch','Fare']].copy()

from sklearn.preprocessing import Imputer

fill_NaN = Imputer(missing_values=np.nan, strategy='mean', axis=1)
data2_im = pd.DataFrame(fill_NaN.fit_transform(data2), columns = data2.columns)

data2_im

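Strangely, the imputed Age is 2.909717. Is there a correct way to do simple mean imputation? I could do it column by column, but I am not clear on the syntax or approach. Thanks for your help.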

Best answer

The root of the problem is this line:

fill_NaN = Imputer(missing_values=np.nan, strategy='mean', axis=1)

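With axis=1 you are averaging across each row, which mixes unrelated quantities (apples and oranges): the missing Age is filled with the mean of that passenger's PassengerId, Survived, Pclass, SibSp, Parch and Fare.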
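As a quick check, the odd value 2.909717 can be reproduced by hand from row 5 of the table in the question. This is only an arithmetic sketch in plain Python; the dictionary `row` is a hand-copied version of that row, not something produced by the original code.

```python
import math

# Row 5 from the question's table; Age is the missing value.
row = {
    "PassengerId": 6,
    "Survived": 0,
    "Pclass": 3,
    "Age": float("nan"),
    "SibSp": 0,
    "Parch": 0,
    "Fare": 8.4583,
}

# A row-wise mean averages every non-missing value in the row.
observed = [v for v in row.values() if not math.isnan(v)]
row_mean = sum(observed) / len(observed)

print(round(row_mean, 6))  # 2.909717, the value that ended up in Age
```

The six observed values sum to 17.4583, and 17.4583 / 6 gives the 2.909717 seen in the output table.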

Try changing it to:

fill_NaN = Imputer(missing_values=np.nan, strategy='mean', axis=0) # axis=0

and you will get the expected behavior: each missing value is replaced by the mean of its own column.
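Note that `sklearn.preprocessing.Imputer` was deprecated and is no longer available in recent scikit-learn releases (0.22 and later, to my knowledge). Its replacement, `sklearn.impute.SimpleImputer`, has no `axis` argument and always imputes column by column. The sketch below assumes such a release and, instead of reading `train.csv`, rebuilds the six rows shown in the question by hand, so the resulting mean only reflects those six rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# The six sample rows from the question (an assumption: the real
# train.csv has many more rows, so its column means will differ).
data2 = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
    "SibSp": [1, 1, 0, 1, 0, 0],
    "Parch": [0, 0, 0, 0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
})

# SimpleImputer fills each column with that column's own statistic.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
data2_im = pd.DataFrame(imputer.fit_transform(data2), columns=data2.columns)

print(data2_im.loc[5, "Age"])  # mean of the five known ages
```

On these six rows the missing Age becomes (22 + 38 + 26 + 35 + 35) / 5 = 31.2, a plausible age rather than 2.9.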

strategy='median' may be a better choice, because the median is robust to outliers:

fill_NaN = Imputer(missing_values=np.nan, strategy='median', axis=0)
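For the column-by-column approach mentioned in the question, pandas alone is enough. The following is a minimal sketch, again using only the six Age values from the sample table (an assumption; the full dataset would give different statistics), that compares the mean and the median and fills the gap with the median.

```python
import numpy as np
import pandas as pd

# Age column from the six sample rows in the question.
age = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0, np.nan], name="Age")

# Both statistics skip NaN by default.
age_mean = age.mean()
age_median = age.median()

# Fill only this column's missing values with its own median.
age_filled = age.fillna(age_median)

print(age_mean, age_median)
print(age_filled.tolist())
```

To treat every numeric column of a DataFrame `df` the same way, `df.fillna(df.median(numeric_only=True))` applies each column's own median in one call.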

Regarding python - How to do proper imputation in Python/Sklearn, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55115958/
