
python - pd.to_csv saves, but apparently the wrong data (according to the print function)

Reposted. Author: 行者123. Updated: 2023-11-28 18:06:02

This is my first post on Stack Overflow. Please go easy on me if I have not followed the usual style guidelines correctly.

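I am working on the Kaggle challenge "Predicting House Prices". My first step is to preprocess the dataset. Some cells are empty and show up as NaN. Using df["Headline"].fillna("NA") I change them to "NA", which this challenge defines as "not further described".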

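The print function shows that this works. Finally, I want to save the modified DataFrame to a .csv file (you can see the path and filename in the code). However, although the .csv file is saved, the data in it is apparently wrong. So I assume I must have made a mistake in the syntax of pd.to_csv.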

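First, here is my code. After that, you will find the console output for the modified DataFrame "maindf" and for "csvdf", the DataFrame read back from my .csv file. Sorry for the poor formatting of the console output.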

import os
import pandas as pd
import numpy as np

#Variables
PRICE = []
CRIT = []

#Directories
DATADIR = r"C:\Users\Hp\Desktop\Project_Arcus\house_price\data"
DATA = "train.csv"
path = os.path.join(DATADIR, DATA)
MODFILE = "train_modified.csv"
mod_path = os.path.join(DATADIR, MODFILE)

print(f"Training Data is {path}")
print(f"Modified Training Data is {mod_path}")

# Goal: Open the file at the chosen path and extract data (e.g. the header row).
df = pd.read_csv(path)
maindf = df # this step is unnecessary, but it helped me to better understand.

# Goal: Check for empty cells. Replace them with a fitting value, so the neural network can
# treat them accordingly. Save the .csv under a new name.
maindf["PoolQC"] = df["PoolQC"].fillna("NA")
maindf["MiscFeature"] = df["MiscFeature"].fillna("NA")
maindf["Alley"] = df["Alley"].fillna("NA")
maindf["Fence"] = df["Fence"].fillna("NA")
maindf["FireplaceQu"] = df["FireplaceQu"].fillna("NA")
maindf.to_csv(mod_path, index=True) # index=True also writes the row index as an extra column; index=False would omit it.

# Next Goal: Save the dataframe df into a csv document "train_modified.csv" WORKS
# Check if the new file is correct. Not correct! NaN included...!

#print(df.isnull().sum())
csvdf = pd.read_csv(mod_path)
#print(csvdf.isnull().sum())
print(maindf["PoolQC"].head(10))
print(csvdf["PoolQC"].head(10))

Training Data is C:\Users\Hp\Desktop\Project_Arcus\house_price\data\train.csv
Modified Training Data is C:\Users\Hp\Desktop\Project_Arcus\house_price\data\train_modified.csv
0    NA
1    NA
2    NA
3    NA
4    NA
5    NA
6    NA
7    NA
8    NA
9    NA
Name: PoolQC, dtype: object
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
6    NaN
7    NaN
8    NaN
9    NaN
Name: PoolQC, dtype: object

Best answer

The problem is not with to_csv but with read_csv. Its documentation states:

na_values : scalar, str, list-like, or dict, default None

By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
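The effect can be reproduced without the Kaggle files. The following is only a minimal sketch: the two-row DataFrame is made up for illustration, and an in-memory buffer stands in for train_modified.csv. It suggests that the text written by to_csv really does contain NA, and that the value only turns into NaN when read_csv parses it with its default settings.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the Kaggle data.
df = pd.DataFrame({"Id": [1, 2], "PoolQC": ["NA", "Gd"]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# The written text holds the literal string NA ...
print(csv_text)

# ... but the default parser converts it to NaN on the way back in.
roundtrip = pd.read_csv(io.StringIO(csv_text))
print(roundtrip["PoolQC"].isna().tolist())  # [True, False]
```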

Instead, define the keep_default_na and na_values arguments when using read_csv:

csvdf = pd.read_csv(mod_path, keep_default_na=False, na_values='')

You may want to supply a list of values for na_values: when used together with keep_default_na=False, pandas will treat only those values as NaN.
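As a rough illustration of passing a list, the sketch below assumes a small made-up CSV in which "NA" is a genuine category while an empty field and "?" mark missing data. The sample text and the "?" marker are assumptions chosen for the example, not part of the Kaggle dataset.

```python
import io

import pandas as pd

# Hypothetical CSV text: "NA" is a real category here,
# while the empty field and "?" stand for missing data.
csv_text = "Id,PoolQC,Fence\n1,NA,MnPrv\n2,,?\n3,Gd,NA\n"

csvdf = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,  # switch off the built-in list of NaN strings
    na_values=["", "?"],    # only these markers are read as NaN
)

print(csvdf["PoolQC"].tolist())
print(csvdf.isna().sum().to_dict())
```

With these settings the "NA" entries should come back as ordinary strings, and only the empty field and the "?" are counted as missing.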

A better idea is to use a string more explicit than 'NA' to represent data that you do not want to be read as NaN.
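One way this could look is sketched below. The placeholder "NotPresent" is an arbitrary name picked for the example; any string that is not on the default NaN list of read_csv should behave the same way. The three-row DataFrame is likewise made up.

```python
import io

import pandas as pd

# Hypothetical placeholder; assumed not to be on pandas' default NaN list.
PLACEHOLDER = "NotPresent"

# Made-up data with two missing PoolQC entries.
df = pd.DataFrame({"Id": [1, 2, 3], "PoolQC": [None, "Gd", None]})
df["PoolQC"] = df["PoolQC"].fillna(PLACEHOLDER)

# Round trip through an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default read_csv settings: the placeholder survives as an ordinary string.
csvdf = pd.read_csv(io.StringIO(buffer.getvalue()))
print(csvdf["PoolQC"].tolist())  # ['NotPresent', 'Gd', 'NotPresent']
print(int(csvdf["PoolQC"].isna().sum()))  # 0
```

The trade-off is that the column no longer uses the dataset's own "NA" label, so any later step that expects "NA" would need to map the placeholder back.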

For "python - pd.to_csv saves, but apparently the wrong data (according to the print function)", see the corresponding question on Stack Overflow: https://stackoverflow.com/questions/53411787/
