gpt4 book ai didi

python - 修补 Python 中 CSV 文件中缺失的行

转载 作者:太空宇宙 更新时间:2023-11-03 14:57:11 26 4
gpt4 key购买 nike

我有一个 CSV 文件,其中包含多天一天中每一分钟的行。它是由有时会遗漏几行的数据采集系统生成的。

数据看起来像这样 - 一个日期时间字段后跟一些整数

"2017-01-07 03:00:02","7","3","2","13","0"
"2017-01-07 03:01:02","7","3","2","13","0"
"2017-01-07 03:02:02","7","3","2","12","0"
"2017-01-07 03:07:02","7","3","2","12","0"
"2017-01-07 03:08:02","6","3","2","12","1"
"2017-01-07 03:09:02","7","3","2","12","0"
"2017-01-07 03:10:02","6","3","2","11","1"

上述(真实数据)示例中缺少行。由于样本之间的数据变化不大,我只想将最后一个有效数据复制到缺失的行中。我遇到的问题是检测缺少哪些行。

我正在使用我拼凑的 python 程序处理 CSV(我对 python 很陌生)。这可以处理我拥有的数据。

import csv
import datetime

with open("minutedata.csv", 'rb') as f:
reader = csv.reader(f, delimiter=',')
for row in reader:
date = datetime.datetime.strptime (row [0],"%Y-%m-%d %H:%M:%S")
v1 = int(row[1])
v2 = int(row[2])
v3 = int(row[3])
v4 = int(row[4])
v5 = int(row[5])
...(process values)...

...(save data)...

我不确定如何检查当前行是按顺序排列的下一行,还是在一些缺失的行之后。

编辑添加:

感谢 jeremycg 提供的指针,我现在正在尝试使用 Pandas。

我在 CSV 中添加了一个标题行,现在它看起来像:

time,v1,v2,v3,v4,v5
"2017-01-07 03:00:02","7","3","2","13","0"
"2017-01-07 03:01:02","7","3","2","13","0"
"2017-01-07 03:02:02","7","3","2","12","0"
"2017-01-07 03:07:02","7","3","2","12","0"
"2017-01-07 03:08:02","6","3","2","12","1"
"2017-01-07 03:09:02","7","3","2","12","0"
"2017-01-07 03:10:02","6","3","2","11","1"

现在的处理代码是:

import pandas as pd
import io
z = pd.read_csv('minutedata.csv')
z['time'] = pd.to_datetime(z['time'])
z.set_index('time').reindex(pd.date_range(min(z['time']), max(z['time']),freq="1min")).ffill()
for row in z:
date = datetime.datetime.strptime (row [0],"%Y-%m-%d %H:%M:%S")
v1 = int(row[1])
v2 = int(row[2])
v3 = int(row[3])
v4 = int(row[4])
v5 = int(row[5])
...(process values)...

...(save data)...

但是这个错误:

Traceback (most recent call last):
File "process_day.py", line 14, in <module>
z.set_index('time').reindex(pd.date_range(min(z['time']), max(z['time']), freq="1min")).ffill()
File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 2821, in reindex
**kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/core/generic.py", line 2259, in reindex fill_value, copy).__finalize__(self)
File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 2767, in _reindex_axes
fill_value, limit, tolerance)
File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 2778, in _reindex_index allow_dups=False)
File "/usr/local/lib/python2.7/site-packages/pandas/core/generic.py", line 2371, in _reindex_with_indexers copy=copy)
File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 3839, in reindex_indexer self.axes[axis]._can_reindex(indexer)
File "/usr/local/lib/python2.7/site-packages/pandas/indexes/base.py", line 2494, in _can_reindex raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis

我不知道它现在声称已损坏的内容。

请参阅下面的评论以了解此修复程序。

现在的工作代码是:

import pandas as pd
import datetime

z = pd.read_csv('minutedata1.csv')
z = z[~z.time.duplicated()]
z['time'] = pd.to_datetime(z['time'])
z.set_index('time').reindex(pd.date_range(min(z['time']), max(z['time']),freq="1min")).ffill()
for index,row in z.iterrows():
date = datetime.datetime.strptime (row [0],"%Y-%m-%d %H:%M:%S")
v1 = int(row[1])
v2 = int(row[2])
v3 = int(row[3])
v4 = int(row[4])
v5 = int(row[5])
...(process values)...

...(save data)...

衷心感谢所有提供帮助的人。 - 大卫

最佳答案

你可能应该为此使用 pandas,因为它是为这类东西而生的。

首先阅读 csv:

import pandas as pd
import io
x = '''
time,a,b,c,d,e
"2017-01-07 03:00:02","7","3","2","13","0"
"2017-01-07 03:01:02","7","3","2","13","0"
"2017-01-07 03:02:02","7","3","2","12","0"
"2017-01-07 03:07:02","7","3","2","12","0"
"2017-01-07 03:08:02","6","3","2","12","1"
"2017-01-07 03:09:02","7","3","2","12","0"
"2017-01-07 03:10:02","6","3","2","11","1"''' #your data, with added headers
z = pd.read_csv(io.StringIO(x)) #you can use your file name here

现在 z 是一个 pandas 数据框:

z.head()

time a b c d e
0 2017-01-07 03:00:02 7 3 2 13 0
1 2017-01-07 03:01:02 7 3 2 13 0
2 2017-01-07 03:02:02 7 3 2 12 0
3 2017-01-07 03:07:02 7 3 2 12 0
4 2017-01-07 03:08:02 6 3 2 12 1

我们想要:将“时间”列转换为 pd.datetime:

z['time'] = pd.to_datetime(z['time'])

将数据帧的“索引”设置为时间,然后在我们的范围内重新索引:

z = z.set_index('time').reindex(pd.date_range(min(z['time']), max(z['time']), freq="1min"))
z

a b c d e
2017-01-07 03:00:02 7.0 3.0 2.0 13.0 0.0
2017-01-07 03:01:02 7.0 3.0 2.0 13.0 0.0
2017-01-07 03:02:02 7.0 3.0 2.0 12.0 0.0
2017-01-07 03:03:02 NaN NaN NaN NaN NaN
2017-01-07 03:04:02 NaN NaN NaN NaN NaN
2017-01-07 03:05:02 NaN NaN NaN NaN NaN
2017-01-07 03:06:02 NaN NaN NaN NaN NaN
2017-01-07 03:07:02 7.0 3.0 2.0 12.0 0.0
2017-01-07 03:08:02 6.0 3.0 2.0 12.0 1.0
2017-01-07 03:09:02 7.0 3.0 2.0 12.0 0.0
2017-01-07 03:10:02 6.0 3.0 2.0 11.0 1.0

然后使用.ffill()从之前的值开始填充:

z.ffill()

a b c d e
2017-01-07 03:00:02 7.0 3.0 2.0 13.0 0.0
2017-01-07 03:01:02 7.0 3.0 2.0 13.0 0.0
2017-01-07 03:02:02 7.0 3.0 2.0 12.0 0.0
2017-01-07 03:03:02 7.0 3.0 2.0 12.0 0.0
2017-01-07 03:04:02 7.0 3.0 2.0 12.0 0.0
2017-01-07 03:05:02 7.0 3.0 2.0 12.0 0.0
2017-01-07 03:06:02 7.0 3.0 2.0 12.0 0.0
2017-01-07 03:07:02 7.0 3.0 2.0 12.0 0.0
2017-01-07 03:08:02 6.0 3.0 2.0 12.0 1.0
2017-01-07 03:09:02 7.0 3.0 2.0 12.0 0.0
2017-01-07 03:10:02 6.0 3.0 2.0 11.0 1.0

或者,一起:

z = pd.read_csv(io.StringIO(x))
z['time'] = pd.to_datetime(z['time'])
z.set_index('time').reindex(pd.date_range(min(z['time']), max(z['time']), freq="1min")).ffill()

关于python - 修补 Python 中 CSV 文件中缺失的行,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/41622526/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com