
python - Time series data analysis with scientific Python: continuous analysis over multiple files


Question

I am in the middle of a time series analysis. The measured data come from sampling the voltage output of a sensor at 50 kHz and then dumping that data to disk as hourly files. Data are saved to HDF5 files using pytables as a CArray. This format was chosen to maintain interoperability with MATLAB.

The full data set is now several TB, far too large to load into memory.

Some of my analysis requires me to iterate over the entire data set. For analyses that only need me to grab chunks of data, I can see a path forward by writing a generator method. I am less sure how to proceed with analyses that require a continuous time series.

Example

For example, suppose I want to find and classify transients using some moving-window process (e.g. wavelet analysis) or by applying a FIR filter. How do I handle the boundaries, either at the end or beginning of a file or at chunk boundaries? I would like the data to appear as one continuous data set.
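
(A minimal sketch of the continuity I am after, purely as an illustration: assuming some iterable of consecutive NumPy chunks, a causal FIR filter can be carried across chunk boundaries by threading scipy.signal.lfilter's state through, which reproduces filtering the whole series at once. The filter length and cutoff below are placeholder values.)

import numpy as np
from scipy.signal import firwin, lfilter

# Stand-in for the real per-file reader: any iterable of consecutive chunks.
chunks = (np.random.randn(4096) for _ in range(8))

taps = firwin(129, 0.1)                  # placeholder FIR design
zi = np.zeros(len(taps) - 1)             # filter state carried across chunks

pieces = []
for chunk in chunks:
    y, zi = lfilter(taps, 1.0, chunk, zi=zi)   # zi makes the output seamless
    pieces.append(y)
filtered = np.concatenate(pieces)        # identical to filtering all samples at once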

Request

I would like to:

  • Keep the memory footprint low by loading data only as necessary.
  • Keep a map of the entire data set in memory so that I can address the data set as I would a regular pandas Series object, e.g. data[time1:time2].

I am using scientific Python (the Enthought distribution) with all the regular stuff: numpy, scipy, pandas, matplotlib, etc. I only recently started incorporating pandas into my workflow, and I am still unfamiliar with all of its capabilities.

I have looked over related Stack Exchange threads and did not see anything that exactly addresses my issue.

EDIT: Final solution.

Based on the helpful hints, I built an iterator that steps over the files and returns chunks of arbitrary size, i.e. a moving window that will hopefully handle file boundaries gracefully. I added the option of padding the front and back of each window with data (overlapping windows). I can then apply a succession of filters to the overlapping windows and remove the overlaps at the end. This, I hope, gives me continuity; a sketch of that overlap-and-trim step follows.
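
(Hedged sketch of the overlap-and-trim idea; the FIR taps, window length, and pad size are placeholder values, and the real windows come from the chunk() generator below. The point is that with enough padding, trimming the pads after filtering leaves samples free of boundary artifacts, so consecutive trimmed windows line up.)

import numpy as np
from scipy.signal import firwin, lfilter

def filter_and_trim(window, taps, pad):
    """FIR-filter one overlap-padded window, then drop 'pad' samples from each
    end. With pad >= len(taps) - 1, the retained samples are free of the
    filter's start-up transient, so consecutive trimmed windows are continuous."""
    filtered = lfilter(taps, 1.0, window)
    return filtered[pad:-pad] if pad else filtered

# Toy usage with a placeholder window; real windows come from chunk() below.
taps = firwin(129, 0.1)
window = np.random.randn(512 + 4096 + 512)   # backpad + chunk + frontpad
core = filter_and_trim(window, taps, pad=512)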

I have not implemented __getitem__ yet, but it is on my list of things to do.

Here is the final code. Some details are omitted for brevity.

# Imports were omitted in the original post; these are the ones the class uses.
from __future__ import print_function

import os
import re
import fnmatch
import datetime
from collections import OrderedDict

import numpy as np
import tables

import readdata


class FolderContainer(readdata.DataContainer):

    def __init__(self, startdir):
        readdata.DataContainer.__init__(self, startdir)

        self.filelist = None
        self.fs = None
        self.nsamples_hour = None
        # Build the file list
        self._build_filelist(startdir)

    def _build_filelist(self, startdir):
        """
        Populate the filelist dictionary with active files and their associated
        file date (YYYY,MM,DD) and hour.

        Each entry in 'filelist' has the form (abs. path : datetime) where the
        datetime object contains the complete date and hour information.
        """
        print('Building file list....', end='')
        # Use the full file path instead of a relative path so that we don't
        # run into problems if we change the current working directory.
        filelist = {os.path.abspath(f): self._datetime_from_fname(f)
                    for f in os.listdir(startdir)
                    if fnmatch.fnmatch(f, 'NODE*.h5')}

        # If we haven't found any files, raise an error
        if not filelist:
            msg = "Input directory does not contain Illionix h5 files."
            raise IOError(msg)
        # Filelist is an ordered dictionary. Sort before saving.
        self.filelist = OrderedDict(sorted(filelist.items(),
                                           key=lambda t: t[0]))
        print('done')

    def _datetime_from_fname(self, fname):
        """
        Return the year, month, day, and hour from a filename as a datetime
        object.
        """
        # Filename has the prototype: NODE##-YY-MM-DD-HH.h5. Split this up and
        # take only the date parts. Convert the year from YY to YYYY.
        (year, month, day, hour) = [int(d) for d in re.split(r'-|\.', fname)[1:-1]]
        year += 2000
        return datetime.datetime(year, month, day, hour)

    def chunk(self, tstart, dt, **kwargs):
        """
        Generator returning consecutive chunks of data, with optional overlaps,
        from the entire set of Illionix data files.

        Parameters
        ----------
        Arguments:
            tstart: UTC start time [provided as a datetime or date string]
            dt: Chunk size [integer number of samples]

        Keyword arguments:
            tend: UTC end time [provided as a datetime or date string].
            frontpad: Padding in front of sample [integer number of samples].
            backpad: Padding in back of sample [integer number of samples].

        Yields:
            chunk: numpy array of voltage samples, including any requested padding
        """
        # PARSE INPUT ARGUMENTS

        # Ensure 'tstart' is a datetime object.
        tstart = self._to_datetime(tstart)
        # Find the offset, in samples, of the starting position of the window
        # in the first data file.
        tstart_samples = self._to_samples(tstart)

        # Convert dt to samples. Because dt may be a timedelta object, we can't
        # use '_to_samples' for conversion.
        if isinstance(dt, int):
            dt_samples = dt
        elif isinstance(dt, datetime.timedelta):
            dt_samples = np.int64((dt.days * 24 * 3600 + dt.seconds +
                                   dt.microseconds * 1e-6) * self.fs)
        else:
            # FIXME: Pandas 0.13 includes a 'to_timedelta' function. Change
            # below when EPD pushes the update.
            t = self._parse_date_str(dt)
            dt_samples = np.int64((t.minute * 60 + t.second) * self.fs)

        # Read keyword arguments. 'tend' defaults to the end of the last file
        # if a time is not provided.
        default_tend = list(self.filelist.values())[-1] + datetime.timedelta(hours=1)
        tend = self._to_datetime(kwargs.get('tend', default_tend))
        tend_samples = self._to_samples(tend)

        frontpad = kwargs.get('frontpad', 0)
        backpad = kwargs.get('backpad', 0)

        # CREATE FILE LIST

        # Build the list of data files we will iterate over based upon the
        # start and stop times.
        print('Pruning file list...', end='')
        tstart_floor = datetime.datetime(tstart.year, tstart.month, tstart.day,
                                         tstart.hour)
        filelist_pruned = OrderedDict([(k, v) for k, v in self.filelist.items()
                                       if tstart_floor <= v <= tend])
        print('done.')
        # Check to ensure that we're not missing files by enforcing that there
        # is exactly an hour offset between all files.
        if not all(diff == datetime.timedelta(hours=1)
                   for diff in np.diff(np.array(list(filelist_pruned.values())))):
            raise readdata.DataIntegrityError("Hour gap(s) detected in data")

        # MOVING WINDOW GENERATOR ALGORITHM

        # Keep two files open: the current file and the next in line (queue file).
        fname_generator = self._file_iterator(filelist_pruned)
        fname_current = next(fname_generator)
        fname_next = next(fname_generator)

        # Iterate over all the files. 'lastfile' indicates when we're
        # processing the last file in the queue.
        lastfile = False
        i = tstart_samples
        while True:
            with tables.openFile(fname_current) as fcurrent, \
                    tables.openFile(fname_next) as fnext:
                # Point to the data
                data_current = fcurrent.getNode('/data/voltage/raw')
                data_next = fnext.getNode('/data/voltage/raw')
                # Process all data windows associated with the current pair of
                # files. Avoid unnecessary file access operations as we move
                # the sliding window.
                while True:
                    # Conditionals that depend on whether our slice is:
                    # (1) completely into the next hour
                    # (2) partially spilling into the next hour
                    # (3) completely within the current hour.
                    if i - backpad >= self.nsamples_hour:
                        # If we're already on the last file in the processing
                        # queue, we can't continue to the next. The generator
                        # is finished.
                        if lastfile:
                            return
                        # Advance the active and queue file names.
                        fname_current = fname_next
                        try:
                            fname_next = next(fname_generator)
                        except StopIteration:
                            # We've reached the end of our file processing
                            # queue. Indicate this is the last file so that if
                            # we try to pull data across the next file
                            # boundary, we'll exit.
                            lastfile = True
                        # Our data slice has completely moved into the next
                        # hour.
                        i -= self.nsamples_hour
                        # Return the data
                        yield data_next[i - backpad:i + dt_samples + frontpad]
                        # Move window by amount dt
                        i += dt_samples
                        # We've completely moved on to the next pair of files.
                        # Move to the outer scope to grab the next set of
                        # files.
                        break
                    elif i + dt_samples + frontpad >= self.nsamples_hour:
                        if lastfile:
                            return
                        # Slice spills over into the next hour
                        yield np.r_[data_current[i - backpad:],
                                    data_next[:i + dt_samples + frontpad - self.nsamples_hour]]
                        i += dt_samples
                    else:
                        if lastfile:
                            # Exit once our slice crosses the boundary of the
                            # last file.
                            if i + dt_samples + frontpad > tend_samples:
                                return
                        # Slice is completely within the current hour
                        yield data_current[i - backpad:i + dt_samples + frontpad]
                        i += dt_samples

    def _to_samples(self, input_time):
        """Convert input time, if not already in samples, to samples."""
        if isinstance(input_time, int):
            # Input time is already in samples
            return input_time
        elif isinstance(input_time, datetime.datetime):
            # Input time is a datetime object
            return self.fs * (input_time.minute * 60 + input_time.second)
        else:
            raise ValueError("Invalid input 'tstart' parameter")

    def _to_datetime(self, input_time):
        """Return the passed time as a datetime object."""
        if isinstance(input_time, datetime.datetime):
            converted_time = input_time
        elif isinstance(input_time, str):
            converted_time = self._parse_date_str(input_time)
        else:
            raise TypeError("A datetime object or string date/time were "
                            "expected")
        return converted_time

    def _file_iterator(self, filelist):
        """Generator for iterating over file names."""
        for fname in filelist:
            yield fname
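
(Hedged usage sketch only: the directory, window sizes, and the omitted readdata helpers such as _parse_date_str are assumptions, and filter_and_trim/taps refer to the sketch earlier in this post.)

import datetime

container = FolderContainer('/path/to/node_files')    # hypothetical directory of NODE*.h5 files
start = datetime.datetime(2013, 12, 16, 0)

for window in container.chunk(start, dt=4096, frontpad=512, backpad=512):
    core = filter_and_trim(window, taps, pad=512)      # drop the overlap padding
    # ... downstream transient detection / classification on 'core' ...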

Best Answer

@Sean here's my 2c.

Take a look at this question here, which I created a while back. It is essentially what you are trying to do, and it is somewhat non-trivial.

Without knowing more details, I would offer a couple of suggestions:

  • HDFStore can read a standard CArray type of format, see here.

  • You can easily create a 'Series'-like object that has the following nice properties: a) it knows where each file is and its extents, and it uses __getitem__ to 'select' those files, e.g. s[time1:time2]. From a top-level view this may be a very nice abstraction, and you can then dispatch operations.

For example:

class OutOfCoreSeries(object):

    def __init__(self, dir):
        .... load a list of the files in the dir where you have them ...

    def __getitem__(self, key):
        .... map the selection key (say its a slice, which 'time1:time2' resolves) ...
        .... to the files that make it up .... , then return a new Series that only
        .... those file pointers ....

    def apply(self, func, **kwargs):
        """ apply a function to the files """
        results = []
        for f in self.files:
            results.append(func(self.read_file(f)))
        return Results(results)

This can easily get quite complicated. For instance, if you apply a reduction operation that fits in memory, Results can simply be a pandas.Series (or Frame). If, however, you are doing a transformation that requires you to write out a new set of transformed data files, then you will have to handle that yourself.
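
(One hedged way to flesh out the __getitem__ piece, reusing the hourly file layout and the '/data/voltage/raw' node from the question; the constructor arguments here are made up, and the PyTables spelling follows the question's openFile/getNode, which newer PyTables renames open_file/get_node.)

import datetime
import numpy as np
import tables

class OutOfCoreSeries(object):
    def __init__(self, file_map, fs):
        # file_map: OrderedDict of {absolute path: datetime of the file's hour}
        self.file_map = file_map
        self.fs = fs                      # sampling rate in Hz

    def __getitem__(self, key):
        # Assume key is a slice of datetimes, e.g. s[time1:time2].
        t1, t2 = key.start, key.stop
        pieces = []
        for path, hour_start in self.file_map.items():
            hour_end = hour_start + datetime.timedelta(hours=1)
            if hour_end <= t1 or hour_start >= t2:
                continue                  # file lies entirely outside the request
            with tables.openFile(path) as f:
                node = f.getNode('/data/voltage/raw')
                # Sample offsets of the requested span within this hourly file
                lo = max(0, int((t1 - hour_start).total_seconds() * self.fs))
                hi = min(node.shape[0],
                         int((t2 - hour_start).total_seconds() * self.fs))
                pieces.append(node[lo:hi])
        return np.concatenate(pieces) if pieces else np.empty(0)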

A couple more suggestions:

  • You may want to keep your data in multiple, potentially useful forms. For instance, you say you are saving multiple values in 1-hour slices. It may be that you can instead split those 1-hour files into one file per variable you are saving, but covering a much longer span, so that each file becomes memory-readable.

  • You may want to resample the data to lower frequencies and work on those, loading the data in a particular slice only when you need the detail.

  • You may want to create a data set that is queryable across time, e.g. high-low peaks at varying frequencies, perhaps using the Table format, see here (a sketch combining this and the previous point follows this list).
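
(Hedged sketch combining the last two bullets, assuming the raw hourly files and the '/data/voltage/raw' node from the question; the store name 'coarse.h5' and the 1-second reduction are made-up choices.)

import numpy as np
import pandas as pd
import tables

def hourly_summary(path, hour_start, fs=50000):
    """Reduce one raw hourly file to per-second low/high/mean values."""
    lows, highs, means = [], [], []
    with tables.openFile(path) as f:
        node = f.getNode('/data/voltage/raw')
        for s in range(3600):                       # one pass per second of data
            sec = node[s * fs:(s + 1) * fs]
            lows.append(sec.min())
            highs.append(sec.max())
            means.append(sec.mean())
    index = pd.date_range(hour_start, periods=3600, freq='S')
    return pd.DataFrame({'low': lows, 'high': highs, 'mean': means}, index=index)

# Append each hour to a table-format store, which can then be queried by time:
# store = pd.HDFStore('coarse.h5')
# store.append('voltage_1s', hourly_summary(path, hour_start))
# recent = store.select('voltage_1s', where='index >= "2013-12-17 10:00:00"')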

So you may end up with multiple variations of the same data. Disk space is usually much cheaper and easier to manage than main memory; it makes a lot of sense to take advantage of that.

This question, python - Time series data analysis with scientific Python: continuous analysis over multiple files, was originally asked on Stack Overflow: https://stackoverflow.com/questions/20639339/
