
python - How to drop bad lines and read the rest of a CSV file using pandas or numpy?

Reposted. Author: 太空宇宙. Updated: 2023-11-04 08:25:03

I cannot read my dataset.csv file because of the following parser error.

Error tokenizing data. C error: Expected 1 fields in line 8, saw 4

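The CSV file is generated by another program. Basically, I want to skip the text lines that recur at regular intervals and keep only the integer and float values in my dataset. I tried this: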

df = pd.read_csv('Dataset.csv')

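I also tried this, but I only get the bad lines as output. I want to skip all of those bad lines and show only the remaining values in the dataset.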

df = pd.read_csv('Dataset.csv',error_bad_lines=False, engine='python')

Dataset:

The pch2csv utility program
This file contains the pch2csv


$TITLE =
$SUBTITLE=
$LABEL = FX
1,0.000000E+00,3.792830E-06,-1.063093E-06
2,0.000000E+00,-1.441319E-06,4.711234E-06
3,0.000000E+00,2.950290E-06,-5.669502E-07
4,0.000000E+00,3.706791E-06,-1.094726E-06
5,0.000000E+00,3.689831E-06,-1.107476E-06

$TITLE =
$SUBTITLE=
$LABEL = FY
1,0.000000E+00,-5.878803E-06,1.127179E-06
2,0.000000E+00,2.782207E-06,-8.840886E-06
3,0.000000E+00,-1.574296E-06,3.867732E-07
4,0.000000E+00,-6.227912E-06,1.864081E-06
5,0.000000E+00,-3.113227E-05,9.339538E-06


Expected dataset:

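*If possible, blank lines should also be removed. The first column should be set as the index, and the final dataset must contain only the first and third columns, as shown in the image. The column label must be set to "1".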

[Image: expected dataset]

Best answer

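You can pass the names parameter to read_csv to supply new column names. The text lines then become rows with missing values, so DataFrame.dropna is added to remove them: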

import pandas as pd
from io import StringIO


temp="""The pch2csv utility program
This file contains the pch2csv


$TITLE =
$SUBTITLE=
$LABEL = FX
1,0.000000E+00,3.792830E-06,-1.063093E-06
2,0.000000E+00,-1.441319E-06,4.711234E-06
3,0.000000E+00,2.950290E-06,-5.669502E-07
4,0.000000E+00,3.706791E-06,-1.094726E-06
5,0.000000E+00,3.689831E-06,-1.107476E-06

$TITLE =
$SUBTITLE=
$LABEL = FY
1,0.000000E+00,-5.878803E-06,1.127179E-06
2,0.000000E+00,2.782207E-06,-8.840886E-06
3,0.000000E+00,-1.574296E-06,3.867732E-07
4,0.000000E+00,-6.227912E-06,1.864081E-06
5,0.000000E+00,-3.113227E-05,9.339538E-06"""

# after testing, replace StringIO(temp) with 'Dataset.csv'
# on_bad_lines='skip' needs pandas >= 1.3; older versions used error_bad_lines=False
df = pd.read_csv(StringIO(temp),
                 on_bad_lines='skip',
                 engine='python',
                 names=['a','b','c','d'])

df = df.dropna(subset=['b','c','d'])
print (df)
    a    b         c             d
0   1  0.0  0.000004 -1.063093e-06
1   2  0.0 -0.000001  4.711234e-06
2   3  0.0  0.000003 -5.669502e-07
3   4  0.0  0.000004 -1.094726e-06
4   5  0.0  0.000004 -1.107476e-06
8   1  0.0 -0.000006  1.127179e-06
9   2  0.0  0.000003 -8.840886e-06
10  3  0.0 -0.000002  3.867732e-07
11  4  0.0 -0.000006  1.864081e-06
12  5  0.0 -0.000031  9.339538e-06
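
Why this helps: without names, pandas infers the number of columns from the first line of the file, which here has a single field, so the first four-field data line triggers the tokenizing error. With four names supplied, shorter lines are padded with NaN instead. The following is a minimal sketch of that behaviour on a shortened sample; the `sample` string and the variable names are illustrative and not from the original post. It leaves out the bad-lines argument, on the assumption that no line has more than four fields.

```python
import pandas as pd
from io import StringIO

# Shortened, illustrative copy of the file layout from the question
sample = """The pch2csv utility program
This file contains the pch2csv

$TITLE =
$LABEL = FX
1,0.000000E+00,3.792830E-06,-1.063093E-06
2,0.000000E+00,-1.441319E-06,4.711234E-06
"""

# Without names: the column count is inferred from the first line (1 field),
# so the first 4-field line raises a ParserError
raised = False
try:
    pd.read_csv(StringIO(sample))
except pd.errors.ParserError as exc:
    raised = True
    print("ParserError:", exc)

# With four names: text lines are kept as rows, padded with NaN in b, c, d
raw = pd.read_csv(StringIO(sample), names=['a', 'b', 'c', 'd'])

# Dropping rows with NaN in the numeric columns removes the text lines
clean = raw.dropna(subset=['b', 'c', 'd'])
print(len(raw), len(clean))
```

Blank lines are skipped by read_csv by default, so they need no separate handling here.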

Edit:

To set the first column as the index and give the other columns new names:

# after testing, replace StringIO(temp) with 'Dataset.csv'
# on_bad_lines='skip' needs pandas >= 1.3; older versions used error_bad_lines=False
df = pd.read_csv(StringIO(temp),
                 on_bad_lines='skip',
                 engine='python',
                 index_col=[0],
                 names=['idx','col1','col2','col3'])

# dropna checks all columns; the first column is now the index, so it is not checked
df = df.dropna()

# alternatively, drop a row only if all of its values are NaN
#df = df.dropna(how='all')
print (df)
     col1      col2          col3
idx
1     0.0  0.000004 -1.063093e-06
2     0.0 -0.000001  4.711234e-06
3     0.0  0.000003 -5.669502e-07
4     0.0  0.000004 -1.094726e-06
5     0.0  0.000004 -1.107476e-06
1     0.0 -0.000006  1.127179e-06
2     0.0  0.000003 -8.840886e-06
3     0.0 -0.000002  3.867732e-07
4     0.0 -0.000006  1.864081e-06
5     0.0 -0.000031  9.339538e-06
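
Because the first column also held the text lines before dropna ran, the resulting index may contain strings such as '1' rather than integers. If integer labels are needed, a conversion along these lines should work. This is a sketch on a shortened sample; the `sample` string is illustrative and not from the original post.

```python
import pandas as pd
from io import StringIO

# Shortened, illustrative sample with two labelled blocks
sample = """$LABEL = FX
1,0.000000E+00,3.792830E-06,-1.063093E-06
2,0.000000E+00,-1.441319E-06,4.711234E-06

$LABEL = FY
1,0.000000E+00,-5.878803E-06,1.127179E-06
2,0.000000E+00,2.782207E-06,-8.840886E-06
"""

df = pd.read_csv(StringIO(sample),
                 index_col=0,
                 names=['idx', 'col1', 'col2', 'col3'])

# Remove the text rows, which have NaN in every data column
df = df.dropna()

# Convert the index labels to integers now that the text rows are gone
df.index = df.index.astype(int)
print(df.index.tolist())
```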

Edit 1:

If you only need to remove the columns that consist entirely of zeros:

df = df.loc[:, df.ne(0).any()]
print (df)
         col2          col3
idx
1    0.000004 -1.063093e-06
2   -0.000001  4.711234e-06
3    0.000003 -5.669502e-07
4    0.000004 -1.094726e-06
5    0.000004 -1.107476e-06
1   -0.000006  1.127179e-06
2    0.000003 -8.840886e-06
3   -0.000002  3.867732e-07
4   -0.000006  1.864081e-06
5   -0.000031  9.339538e-06
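
The question asks for a final dataset with the first column as the index and only the third column of the file kept, labelled "1". The expected-output image is not available, so what follows is only one reading of that request. It reuses the names and dropna approach from the answer on a shortened sample (the `sample` string and the `result` name are illustrative), selects col2, which holds the file's third column, and renames it.

```python
import pandas as pd
from io import StringIO

# Shortened, illustrative sample with two labelled blocks
sample = """$LABEL = FX
1,0.000000E+00,3.792830E-06,-1.063093E-06
2,0.000000E+00,-1.441319E-06,4.711234E-06

$LABEL = FY
1,0.000000E+00,-5.878803E-06,1.127179E-06
2,0.000000E+00,2.782207E-06,-8.840886E-06
"""

df = pd.read_csv(StringIO(sample),
                 index_col=0,
                 names=['idx', 'col1', 'col2', 'col3']).dropna()

# col2 is the third column of the file; keep only it and label it '1'
result = df[['col2']].rename(columns={'col2': '1'})
print(result)
```

Selecting with double brackets keeps the result a DataFrame, so the column label survives; single brackets would return a Series instead.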

Regarding "python - How to drop bad lines and read the rest of a CSV file using pandas or numpy?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58162433/
