
python - Time series data analysis with scientific Python: continuous analysis over multiple files


Question

I am in the middle of a time series analysis. The measured data come from sampling the voltage output of a sensor at 50 kHz and then dumping that data to disk as hourly files. Data are saved to HDF5 files using pytables as a CArray. This format was chosen to maintain interoperability with MATLAB.

The full data set is now several TB, far too large to load into memory.

Some of my analysis requires me to iterate over the entire data set. For analyses that only need me to grab chunks of data, I can see a path forward by writing a generator method. I am less sure how to proceed with analyses that require a continuous time series.

Example

For example, suppose I want to find and classify transients using some moving-window process (e.g. wavelet analysis) or by applying a FIR filter. How do I handle the boundaries, either at the end or beginning of a file or at chunk boundaries? I would like the data to appear as one continuous data set.
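
(A minimal sketch of the continuity I am after, purely as an illustration: assuming some iterable of consecutive NumPy chunks, a causal FIR filter can be carried across chunk boundaries by threading scipy.signal.lfilter's state through, which reproduces filtering the whole series at once. The filter length and cutoff below are placeholder values.)

import numpy as np
from scipy.signal import firwin, lfilter

# Stand-in for the real per-file reader: any iterable of consecutive chunks.
chunks = (np.random.randn(4096) for _ in range(8))

taps = firwin(129, 0.1)                  # placeholder FIR design
zi = np.zeros(len(taps) - 1)             # filter state carried across chunks

pieces = []
for chunk in chunks:
    y, zi = lfilter(taps, 1.0, chunk, zi=zi)   # zi makes the output seamless
    pieces.append(y)
filtered = np.concatenate(pieces)        # identical to filtering all samples at once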

Request

I would like to:

  • Keep the memory footprint low by loading data only as necessary.
  • Keep a map of the entire data set in memory so that I can address the data set as I would a regular pandas Series object, e.g. data[time1:time2].

I am using scientific Python (the Enthought distribution) with all the regular stuff: numpy, scipy, pandas, matplotlib, etc. I only recently started incorporating pandas into my workflow, and I am still unfamiliar with all of its capabilities.

I have looked over related Stack Exchange threads and did not see anything that exactly addresses my issue.

EDIT: Final solution.

Based on the helpful hints, I built an iterator that steps over the files and returns chunks of arbitrary size, i.e. a moving window that will hopefully handle file boundaries gracefully. I added the option of padding the front and back of each window with data (overlapping windows). I can then apply a succession of filters to the overlapping windows and remove the overlaps at the end. This, I hope, gives me continuity; a sketch of that overlap-and-trim step follows.
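
(Hedged sketch of the overlap-and-trim idea; the FIR taps, window length, and pad size are placeholder values, and the real windows come from the chunk() generator below. The point is that with enough padding, trimming the pads after filtering leaves samples free of boundary artifacts, so consecutive trimmed windows line up.)

import numpy as np
from scipy.signal import firwin, lfilter

def filter_and_trim(window, taps, pad):
    """FIR-filter one overlap-padded window, then drop 'pad' samples from each
    end. With pad >= len(taps) - 1, the retained samples are free of the
    filter's start-up transient, so consecutive trimmed windows are continuous."""
    filtered = lfilter(taps, 1.0, window)
    return filtered[pad:-pad] if pad else filtered

# Toy usage with a placeholder window; real windows come from chunk() below.
taps = firwin(129, 0.1)
window = np.random.randn(512 + 4096 + 512)   # backpad + chunk + frontpad
core = filter_and_trim(window, taps, pad=512)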

I have not implemented __getitem__ yet, but it is on my list of things to do.

Here is the final code. Some details are omitted for brevity.

# Imports were omitted in the original post; these are the ones the class uses.
from __future__ import print_function

import os
import re
import fnmatch
import datetime
from collections import OrderedDict

import numpy as np
import tables

import readdata


class FolderContainer(readdata.DataContainer):

    def __init__(self, startdir):
        readdata.DataContainer.__init__(self, startdir)

        self.filelist = None
        self.fs = None
        self.nsamples_hour = None
        # Build the file list
        self._build_filelist(startdir)

    def _build_filelist(self, startdir):
        """
        Populate the filelist dictionary with active files and their associated
        file date (YYYY,MM,DD) and hour.

        Each entry in 'filelist' has the form (abs. path : datetime) where the
        datetime object contains the complete date and hour information.
        """
        print('Building file list....', end='')
        # Use the full file path instead of a relative path so that we don't
        # run into problems if we change the current working directory.
        filelist = {os.path.abspath(f): self._datetime_from_fname(f)
                    for f in os.listdir(startdir)
                    if fnmatch.fnmatch(f, 'NODE*.h5')}

        # If we haven't found any files, raise an error
        if not filelist:
            msg = "Input directory does not contain Illionix h5 files."
            raise IOError(msg)
        # Filelist is an ordered dictionary. Sort before saving.
        self.filelist = OrderedDict(sorted(filelist.items(),
                                           key=lambda t: t[0]))
        print('done')

    def _datetime_from_fname(self, fname):
        """
        Return the year, month, day, and hour from a filename as a datetime
        object.
        """
        # Filename has the prototype: NODE##-YY-MM-DD-HH.h5. Split this up and
        # take only the date parts. Convert the year from YY to YYYY.
        (year, month, day, hour) = [int(d) for d in re.split(r'-|\.', fname)[1:-1]]
        year += 2000
        return datetime.datetime(year, month, day, hour)

    def chunk(self, tstart, dt, **kwargs):
        """
        Generator returning consecutive chunks of data, with optional overlaps,
        from the entire set of Illionix data files.

        Parameters
        ----------
        Arguments:
            tstart: UTC start time [provided as a datetime or date string]
            dt: Chunk size [integer number of samples]

        Keyword arguments:
            tend: UTC end time [provided as a datetime or date string].
            frontpad: Padding in front of sample [integer number of samples].
            backpad: Padding in back of sample [integer number of samples].

        Yields:
            chunk: numpy array of voltage samples, including any requested padding
        """
        # PARSE INPUT ARGUMENTS

        # Ensure 'tstart' is a datetime object.
        tstart = self._to_datetime(tstart)
        # Find the offset, in samples, of the starting position of the window
        # in the first data file.
        tstart_samples = self._to_samples(tstart)

        # Convert dt to samples. Because dt may be a timedelta object, we can't
        # use '_to_samples' for conversion.
        if isinstance(dt, int):
            dt_samples = dt
        elif isinstance(dt, datetime.timedelta):
            dt_samples = np.int64((dt.days * 24 * 3600 + dt.seconds +
                                   dt.microseconds * 1e-6) * self.fs)
        else:
            # FIXME: Pandas 0.13 includes a 'to_timedelta' function. Change
            # below when EPD pushes the update.
            t = self._parse_date_str(dt)
            dt_samples = np.int64((t.minute * 60 + t.second) * self.fs)

        # Read keyword arguments. 'tend' defaults to the end of the last file
        # if a time is not provided.
        default_tend = list(self.filelist.values())[-1] + datetime.timedelta(hours=1)
        tend = self._to_datetime(kwargs.get('tend', default_tend))
        tend_samples = self._to_samples(tend)

        frontpad = kwargs.get('frontpad', 0)
        backpad = kwargs.get('backpad', 0)

        # CREATE FILE LIST

        # Build the list of data files we will iterate over based upon the
        # start and stop times.
        print('Pruning file list...', end='')
        tstart_floor = datetime.datetime(tstart.year, tstart.month, tstart.day,
                                         tstart.hour)
        filelist_pruned = OrderedDict([(k, v) for k, v in self.filelist.items()
                                       if tstart_floor <= v <= tend])
        print('done.')
        # Check to ensure that we're not missing files by enforcing that there
        # is exactly an hour offset between all files.
        if not all(diff == datetime.timedelta(hours=1)
                   for diff in np.diff(np.array(list(filelist_pruned.values())))):
            raise readdata.DataIntegrityError("Hour gap(s) detected in data")

        # MOVING WINDOW GENERATOR ALGORITHM

        # Keep two files open: the current file and the next in line (queue file).
        fname_generator = self._file_iterator(filelist_pruned)
        fname_current = next(fname_generator)
        fname_next = next(fname_generator)

        # Iterate over all the files. 'lastfile' indicates when we're
        # processing the last file in the queue.
        lastfile = False
        i = tstart_samples
        while True:
            with tables.openFile(fname_current) as fcurrent, \
                    tables.openFile(fname_next) as fnext:
                # Point to the data
                data_current = fcurrent.getNode('/data/voltage/raw')
                data_next = fnext.getNode('/data/voltage/raw')
                # Process all data windows associated with the current pair of
                # files. Avoid unnecessary file access operations as we move
                # the sliding window.
                while True:
                    # Conditionals that depend on whether our slice is:
                    # (1) completely into the next hour
                    # (2) partially spilling into the next hour
                    # (3) completely within the current hour.
                    if i - backpad >= self.nsamples_hour:
                        # If we're already on the last file in the processing
                        # queue, we can't continue to the next. The generator
                        # is finished.
                        if lastfile:
                            return
                        # Advance the active and queue file names.
                        fname_current = fname_next
                        try:
                            fname_next = next(fname_generator)
                        except StopIteration:
                            # We've reached the end of our file processing
                            # queue. Indicate this is the last file so that if
                            # we try to pull data across the next file
                            # boundary, we'll exit.
                            lastfile = True
                        # Our data slice has completely moved into the next
                        # hour.
                        i -= self.nsamples_hour
                        # Return the data
                        yield data_next[i - backpad:i + dt_samples + frontpad]
                        # Move window by amount dt
                        i += dt_samples
                        # We've completely moved on to the next pair of files.
                        # Move to the outer scope to grab the next set of
                        # files.
                        break
                    elif i + dt_samples + frontpad >= self.nsamples_hour:
                        if lastfile:
                            return
                        # Slice spills over into the next hour
                        yield np.r_[data_current[i - backpad:],
                                    data_next[:i + dt_samples + frontpad - self.nsamples_hour]]
                        i += dt_samples
                    else:
                        if lastfile:
                            # Exit once our slice crosses the boundary of the
                            # last file.
                            if i + dt_samples + frontpad > tend_samples:
                                return
                        # Slice is completely within the current hour
                        yield data_current[i - backpad:i + dt_samples + frontpad]
                        i += dt_samples

    def _to_samples(self, input_time):
        """Convert input time, if not already in samples, to samples."""
        if isinstance(input_time, int):
            # Input time is already in samples
            return input_time
        elif isinstance(input_time, datetime.datetime):
            # Input time is a datetime object
            return self.fs * (input_time.minute * 60 + input_time.second)
        else:
            raise ValueError("Invalid input 'tstart' parameter")

    def _to_datetime(self, input_time):
        """Return the passed time as a datetime object."""
        if isinstance(input_time, datetime.datetime):
            converted_time = input_time
        elif isinstance(input_time, str):
            converted_time = self._parse_date_str(input_time)
        else:
            raise TypeError("A datetime object or string date/time were "
                            "expected")
        return converted_time

    def _file_iterator(self, filelist):
        """Generator for iterating over file names."""
        for fname in filelist:
            yield fname
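
(Hedged usage sketch only: the directory, window sizes, and the omitted readdata helpers such as _parse_date_str are assumptions, and filter_and_trim/taps refer to the sketch earlier in this post.)

import datetime

container = FolderContainer('/path/to/node_files')    # hypothetical directory of NODE*.h5 files
start = datetime.datetime(2013, 12, 16, 0)

for window in container.chunk(start, dt=4096, frontpad=512, backpad=512):
    core = filter_and_trim(window, taps, pad=512)      # drop the overlap padding
    # ... downstream transient detection / classification on 'core' ...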

Best Answer

@Sean here's my 2c.

Take a look at this question here, which I created a while back. It is essentially what you are trying to do, and it is somewhat non-trivial.

Without knowing more details, I would offer a couple of suggestions:

  • HDFStore can read a standard CArray type of format, see here.

  • You can easily create a 'Series'-like object that has the following nice properties: a) it knows where each file is and its extents, and it uses __getitem__ to 'select' those files, e.g. s[time1:time2]. From a top-level view this may be a very nice abstraction, and you can then dispatch operations.

For example:

class OutOfCoreSeries(object):

    def __init__(self, dir):
        .... load a list of the files in the dir where you have them ...

    def __getitem__(self, key):
        .... map the selection key (say its a slice, which 'time1:time2' resolves) ...
        .... to the files that make it up .... , then return a new Series that only
        .... those file pointers ....

    def apply(self, func, **kwargs):
        """ apply a function to the files """
        results = []
        for f in self.files:
            results.append(func(self.read_file(f)))
        return Results(results)

This can easily get quite complicated. For instance, if you apply a reduction operation that fits in memory, Results can simply be a pandas.Series (or Frame). If, however, you are doing a transformation that requires you to write out a new set of transformed data files, then you will have to handle that yourself.
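
(One hedged way to flesh out the __getitem__ piece, reusing the hourly file layout and the '/data/voltage/raw' node from the question; the constructor arguments here are made up, and the PyTables spelling follows the question's openFile/getNode, which newer PyTables renames open_file/get_node.)

import datetime
import numpy as np
import tables

class OutOfCoreSeries(object):
    def __init__(self, file_map, fs):
        # file_map: OrderedDict of {absolute path: datetime of the file's hour}
        self.file_map = file_map
        self.fs = fs                      # sampling rate in Hz

    def __getitem__(self, key):
        # Assume key is a slice of datetimes, e.g. s[time1:time2].
        t1, t2 = key.start, key.stop
        pieces = []
        for path, hour_start in self.file_map.items():
            hour_end = hour_start + datetime.timedelta(hours=1)
            if hour_end <= t1 or hour_start >= t2:
                continue                  # file lies entirely outside the request
            with tables.openFile(path) as f:
                node = f.getNode('/data/voltage/raw')
                # Sample offsets of the requested span within this hourly file
                lo = max(0, int((t1 - hour_start).total_seconds() * self.fs))
                hi = min(node.shape[0],
                         int((t2 - hour_start).total_seconds() * self.fs))
                pieces.append(node[lo:hi])
        return np.concatenate(pieces) if pieces else np.empty(0)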

A couple more suggestions:

  • You may want to keep your data in multiple, potentially useful forms. For instance, you say you are saving multiple values in 1-hour slices. It may be that you can instead split those 1-hour files into one file per variable you are saving, but covering a much longer span, so that each file becomes memory-readable.

  • You may want to resample the data to lower frequencies and work on those, loading the data in a particular slice only when you need the detail.

  • You may want to create a data set that is queryable across time, e.g. high-low peaks at varying frequencies, perhaps using the Table format, see here (a sketch combining this and the previous point follows this list).
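
(Hedged sketch combining the last two bullets, assuming the raw hourly files and the '/data/voltage/raw' node from the question; the store name 'coarse.h5' and the 1-second reduction are made-up choices.)

import numpy as np
import pandas as pd
import tables

def hourly_summary(path, hour_start, fs=50000):
    """Reduce one raw hourly file to per-second low/high/mean values."""
    lows, highs, means = [], [], []
    with tables.openFile(path) as f:
        node = f.getNode('/data/voltage/raw')
        for s in range(3600):                       # one pass per second of data
            sec = node[s * fs:(s + 1) * fs]
            lows.append(sec.min())
            highs.append(sec.max())
            means.append(sec.mean())
    index = pd.date_range(hour_start, periods=3600, freq='S')
    return pd.DataFrame({'low': lows, 'high': highs, 'mean': means}, index=index)

# Append each hour to a table-format store, which can then be queried by time:
# store = pd.HDFStore('coarse.h5')
# store.append('voltage_1s', hourly_summary(path, hour_start))
# recent = store.select('voltage_1s', where='index >= "2013-12-17 10:00:00"')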

So you may end up with multiple variations of the same data. Disk space is usually much cheaper and easier to manage than main memory; it makes a lot of sense to take advantage of that.

This question, python - Time series data analysis with scientific Python: continuous analysis over multiple files, was originally asked on Stack Overflow: https://stackoverflow.com/questions/20639339/
