python - ValueError: Input contains NaN, infinity or a value too large for dtype('float64') while preprocessing data

Reposted by: 行者123 · Updated: 2023-11-28 20:15:08

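I have two CSV files (a training set and a test set). A few columns contain visible NaN values (status, hedge_value, indicator_code, portfolio_id, desk_id, office_id).

I started by replacing the NaN values in each of those columns with a placeholder value chosen for that column. I then applied LabelEncoding to convert the text data to numeric data. Now, when I try to apply OneHotEncoding to the categorical data, I get an error. I also tried passing the columns to the OneHotEncoder constructor one at a time, but every column produces the same error.

My end goal is to predict the return value, but I am stuck at the data preprocessing step. How can I solve this?

I am using Python 3.6 with Pandas and Sklearn for data processing.

Code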

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

test_data = pd.read_csv('test.csv')
train_data = pd.read_csv('train.csv')

# Replacing Nan values here
train_data['status']=train_data['status'].fillna(2.0)
train_data['hedge_value']=train_data['hedge_value'].fillna(2.0)
train_data['indicator_code']=train_data['indicator_code'].fillna(2.0)
train_data['portfolio_id']=train_data['portfolio_id'].fillna('PF99999999')
train_data['desk_id']=train_data['desk_id'].fillna('DSK99999999')
train_data['office_id']=train_data['office_id'].fillna('OFF99999999')

x_train = train_data.iloc[:, :-1].values
y_train = train_data.iloc[:, 17].values

# =============================================================================
# from sklearn.preprocessing import Imputer
# imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)
# imputer.fit(x_train[:, 15:17])
# x_train[:, 15:17] = imputer.fit_transform(x_train[:, 15:17])
# 
# imputer.fit(x_train[:, 12:13])
# x_train[:, 12:13] = imputer.fit_transform(x_train[:, 12:13])
# =============================================================================


# Encoding categorical data, i.e. Text data, since calculation happens on numbers only, so having text like 
# Country name, Purchased status will give trouble
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
x_train[:, 0] = labelencoder_X.fit_transform(x_train[:, 0])
x_train[:, 1] = labelencoder_X.fit_transform(x_train[:, 1])
x_train[:, 2] = labelencoder_X.fit_transform(x_train[:, 2])
x_train[:, 3] = labelencoder_X.fit_transform(x_train[:, 3])
x_train[:, 6] = labelencoder_X.fit_transform(x_train[:, 6])
x_train[:, 8] = labelencoder_X.fit_transform(x_train[:, 8])
x_train[:, 14] = labelencoder_X.fit_transform(x_train[:, 14])


# =============================================================================
# import numpy as np
# x_train[:, 3] = x_train[:, 3].reshape(x_train[:, 3].size,1)
# x_train[:, 3] = x_train[:, 3].astype(np.float64, copy=False)
# np.isnan(x_train[:, 3]).any()
# =============================================================================


# =============================================================================
# from sklearn.preprocessing import StandardScaler
# sc_X = StandardScaler()
# x_train = sc_X.fit_transform(x_train)
# =============================================================================

onehotencoder = OneHotEncoder(categorical_features=[0,1,2,3,6,8,14])
x_train = onehotencoder.fit_transform(x_train).toarray() # Replace Country Names with One Hot Encoding.

Error

Traceback (most recent call last):

  File "<ipython-input-4-4992bf3d00b8>", line 58, in <module>
    x_train = onehotencoder.fit_transform(x_train).toarray() # Replace Country Names with One Hot Encoding.

  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 2019, in fit_transform
    self.categorical_features, copy=True)

  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 1809, in _transform_selected
    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)

  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 453, in check_array
    _assert_all_finite(array)

  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 44, in _assert_all_finite
    " or a value too large for %r." % X.dtype)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

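Best answer

After posting the question I went through the dataset again and found another column containing NaN. I can't believe I wasted so much time searching for NaN by eye when a Pandas function could have listed the affected columns. Using the code below, I found that I had missed three columns. After handling these new NaNs, the code works properly.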

pd.isnull(train_data).sum() > 0

Result

portfolio_id      False
desk_id           False
office_id         False
pf_category       False
start_date        False
sold               True
country_code      False
euribor_rate      False
currency          False
libor_rate         True
bought             True
creation_date     False
indicator_code    False
sell_date         False
type              False
hedge_value       False
status            False
return            False
dtype: bool
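
The same check can be taken one step further so that it returns only the affected column names and then fills them. The following is a minimal sketch, not the original poster's exact fix: the small DataFrame is a hypothetical stand-in for train.csv (only the column names come from the question), and the median fill and the "MISSING" placeholder are assumed choices that may not suit the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train.csv; only the column names come from the question.
train_data = pd.DataFrame({
    "portfolio_id": ["PF1", None, "PF3", "PF4"],
    "sold": [100.0, np.nan, 300.0, 400.0],
    "libor_rate": [0.5, 0.7, np.nan, 0.9],
    "bought": [90.0, 190.0, 290.0, np.nan],
    "return": [0.1, 0.2, 0.3, 0.4],
})

# Count missing values per column and keep only the columns that have any.
nan_counts = train_data.isnull().sum()
nan_columns = nan_counts[nan_counts > 0].index.tolist()
print(nan_columns)

# Assumed strategy: median for numeric columns, a placeholder string for text columns.
for col in nan_columns:
    if pd.api.types.is_numeric_dtype(train_data[col]):
        train_data[col] = train_data[col].fillna(train_data[col].median())
    else:
        train_data[col] = train_data[col].fillna("MISSING")

# Confirm that no missing values remain before encoding.
remaining = int(train_data.isnull().sum().sum())
print(remaining)
```

Running a check like this on the full DataFrame before calling the encoders would catch columns such as sold, libor_rate and bought without inspecting the file by eye.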

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/47767162/
