
python - ValueError when preprocessing data: Input contains NaN, infinity or a value too large for dtype('float64')

Reposted — Author: 行者123, updated 2023-11-28 20:15:08

I have two CSV files (a Training Set and a Test Set). A few of the columns (status, hedge_value, indicator_code, portfolio_id, desk_id, office_id) contain visible NaN values.

I started the process by replacing the NaN values with a large sentinel value appropriate to each column. Then I performed LabelEncoding to convert the text data into numeric data. Now, when I try to perform OneHotEncoding on the categorical data, I get an error. I tried feeding the inputs into the OneHotEncoding constructor one column at a time, but every column produced the same error.

Basically, my ultimate goal is to predict the return value, but I'm stuck at the data-preprocessing stage. How can I fix this?

I'm using Python 3.6, Pandas and Sklearn for the data processing.

Code

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

test_data = pd.read_csv('test.csv')
train_data = pd.read_csv('train.csv')

# Replacing NaN values here
train_data['status']=train_data['status'].fillna(2.0)
train_data['hedge_value']=train_data['hedge_value'].fillna(2.0)
train_data['indicator_code']=train_data['indicator_code'].fillna(2.0)
train_data['portfolio_id']=train_data['portfolio_id'].fillna('PF99999999')
train_data['desk_id']=train_data['desk_id'].fillna('DSK99999999')
train_data['office_id']=train_data['office_id'].fillna('OFF99999999')

x_train = train_data.iloc[:, :-1].values
y_train = train_data.iloc[:, 17].values

# =============================================================================
# from sklearn.preprocessing import Imputer
# imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)
# imputer.fit(x_train[:, 15:17])
# x_train[:, 15:17] = imputer.fit_transform(x_train[:, 15:17])
#
# imputer.fit(x_train[:, 12:13])
# x_train[:, 12:13] = imputer.fit_transform(x_train[:, 12:13])
# =============================================================================


# Encoding categorical data, i.e. Text data, since calculation happens on numbers only, so having text like
# Country name, Purchased status will give trouble
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
x_train[:, 0] = labelencoder_X.fit_transform(x_train[:, 0])
x_train[:, 1] = labelencoder_X.fit_transform(x_train[:, 1])
x_train[:, 2] = labelencoder_X.fit_transform(x_train[:, 2])
x_train[:, 3] = labelencoder_X.fit_transform(x_train[:, 3])
x_train[:, 6] = labelencoder_X.fit_transform(x_train[:, 6])
x_train[:, 8] = labelencoder_X.fit_transform(x_train[:, 8])
x_train[:, 14] = labelencoder_X.fit_transform(x_train[:, 14])


# =============================================================================
# import numpy as np
# x_train[:, 3] = x_train[:, 3].reshape(x_train[:, 3].size,1)
# x_train[:, 3] = x_train[:, 3].astype(np.float64, copy=False)
# np.isnan(x_train[:, 3]).any()
# =============================================================================


# =============================================================================
# from sklearn.preprocessing import StandardScaler
# sc_X = StandardScaler()
# x_train = sc_X.fit_transform(x_train)
# =============================================================================

onehotencoder = OneHotEncoder(categorical_features=[0,1,2,3,6,8,14])
x_train = onehotencoder.fit_transform(x_train).toarray() # Replace Country Names with One Hot Encoding.
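(Side note: the categorical_features argument used above was deprecated in scikit-learn 0.20 and later removed. With pandas alone, the LabelEncoder + OneHotEncoder pair can be replaced by get_dummies; a minimal sketch on made-up columns, not the question's real schema:)

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['IN', 'US', 'IN'],
    'rate':    [0.1, 0.2, 0.3],
})

# One-hot encode only the categorical column; numeric columns pass through.
encoded = pd.get_dummies(df, columns=['country'])
print(list(encoded.columns))  # → ['rate', 'country_IN', 'country_US']
```

Unlike LabelEncoder applied column by column, get_dummies handles several categorical columns in one call and names the output columns after their values.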

Error

Traceback (most recent call last):

File "<ipython-input-4-4992bf3d00b8>", line 58, in <module>
x_train = onehotencoder.fit_transform(x_train).toarray() # Replace Country Names with One Hot Encoding.

File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 2019, in fit_transform
self.categorical_features, copy=True)

File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 1809, in _transform_selected
X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)

File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 453, in check_array
_assert_all_finite(array)

File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 44, in _assert_all_finite
" or a value too large for %r." % X.dtype)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
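The check_array validation that raises this error is simply scanning the input for NaN and ±inf. A minimal sketch of running the same diagnostic yourself before calling fit_transform (pure NumPy, toy data rather than the question's CSV):

```python
import numpy as np

def non_finite_positions(X):
    """Return (row, col) positions of NaN or infinite entries in X."""
    X = np.asarray(X, dtype=np.float64)
    bad = ~np.isfinite(X)  # True where the entry is NaN, +inf or -inf
    return [(int(r), int(c)) for r, c in zip(*np.where(bad))]

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.inf]])
print(non_finite_positions(X))  # → [(1, 0), (2, 1)]
```

Running a check like this on x_train right before the OneHotEncoder call would point straight at the rows and columns that still hold NaN.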

Best answer

I looked at the dataset again after posting the question and found that yet another column contained NaN. I can't believe how much time I wasted on this, when I could have used a Pandas function to get the list of columns containing NaN. Using the code below, I found that I had missed three columns — I had been scanning for NaN visually when this one-liner would have done it. After handling those remaining NaNs, the code works fine.

pd.isnull(train_data).sum() > 0

Result

portfolio_id      False
desk_id           False
office_id         False
pf_category       False
start_date        False
sold               True
country_code      False
euribor_rate      False
currency          False
libor_rate         True
bought             True
creation_date     False
indicator_code    False
sell_date         False
type              False
hedge_value       False
status            False
return            False
dtype: bool
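Going one step further than the True/False listing above, the offending column names can be pulled out directly. A self-contained sketch with a toy frame standing in for train_data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_data; 'sold' and 'bought' deliberately contain NaN.
df = pd.DataFrame({
    'sold':   [1.0, np.nan, 3.0],
    'bought': [np.nan, 2.0, 3.0],
    'status': [1.0, 2.0, 3.0],
})

nan_columns = df.columns[df.isnull().any()].tolist()
print(nan_columns)  # → ['sold', 'bought']
```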

Regarding "python - ValueError when preprocessing data: Input contains NaN, infinity or a value too large for dtype('float64')", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/47767162/
