
python - Efficient way to create a Numpy array from a binary file


I have some very large datasets stored in binary files on the hard disk. Here is an example of the file structure:

File Header

149 Byte ASCII Header

Record Start

4 Byte Int - Record Timestamp

Sample Start

2 Byte Int - Data Stream 1 Sample
2 Byte Int - Data Stream 2 Sample
2 Byte Int - Data Stream 3 Sample
2 Byte Int - Data Stream 4 Sample

Sample End

There are 122,880 samples per record and 713 records per file, which gives a total size of 700,910,521 bytes. The sample rate and number of records do sometimes vary, so I have to write code that detects both for each file.
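As a quick sanity check on those figures (a sketch using only the numbers quoted above), one record is a 4-byte timestamp plus 122,880 samples for each of the 4 streams at 2 bytes each:

# Sanity check of the sizes quoted above (all values from the question).
header_bytes = 149
samples_per_stream = 122880                    # per record, per data stream
record_size = 4 + samples_per_stream * 4 * 2   # = 983,044 bytes per record
print header_bytes + 713 * record_size         # 700910521 -- matches the total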

The code I currently use to import this data into arrays works like this:

from time import clock
from numpy import zeros , int16 , int32 , hstack , array , savez
from struct import unpack
from os.path import getsize

start_time = clock()
file_size = getsize(input_file)

with open(input_file,'rb') as openfile:
    input_data = openfile.read()

header = input_data[:149]
record_size = int(header[23:31])
number_of_records = ( file_size - 149 ) / record_size
sample_rate = ( ( record_size - 4 ) / 4 ) / 2

time_series = zeros(0,dtype=int32)
t_series = zeros(0,dtype=int16)
x_series = zeros(0,dtype=int16)
y_series = zeros(0,dtype=int16)
z_series = zeros(0,dtype=int16)

for record in xrange(number_of_records):

    # unpack the 4-byte timestamp, then every sample in this record at once
    time_stamp = array( unpack( '<l' , input_data[ 149 + (record * record_size) : 149 + (record * record_size) + 4 ] ) , dtype = int32 )
    unpacked_record = unpack( '<' + str(sample_rate * 4) + 'h' , input_data[ 149 + (record * record_size) + 4 : 149 + ( (record + 1) * record_size ) ] )

    record_t = zeros(sample_rate , dtype=int16)
    record_x = zeros(sample_rate , dtype=int16)
    record_y = zeros(sample_rate , dtype=int16)
    record_z = zeros(sample_rate , dtype=int16)

    # de-interleave the four data streams sample by sample
    for sample in xrange(sample_rate):

        record_t[sample] = unpacked_record[ ( sample * 4 ) + 0 ]
        record_x[sample] = unpacked_record[ ( sample * 4 ) + 1 ]
        record_y[sample] = unpacked_record[ ( sample * 4 ) + 2 ]
        record_z[sample] = unpacked_record[ ( sample * 4 ) + 3 ]

    # append this record's data to the growing output arrays
    time_series = hstack ( ( time_series , time_stamp ) )
    t_series = hstack ( ( t_series , record_t ) )
    x_series = hstack ( ( x_series , record_x ) )
    y_series = hstack ( ( y_series , record_y ) )
    z_series = hstack ( ( z_series , record_z ) )

savez(output_file, t=t_series , x=x_series ,y=y_series, z=z_series, time=time_series)
end_time = clock()
print 'Total Time',end_time - start_time,'seconds'

Currently it takes about 250 seconds per 700 MB file, which seems very high to me. Is there a more efficient way I could do this?
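Most of those 250 seconds go into the per-sample Python loop and the repeated hstack calls: every hstack copies all previously collected samples again, so the work grows quadratically with the number of records. A minimal sketch of the preallocation alternative for a single stream (an assumed fix with example values, not code from the question); the best answer below removes the Python loop entirely:

from numpy import zeros, int16

# Sketch (assumed fix): allocate the full output once, then write each
# record into its slice instead of rebuilding the array with hstack.
number_of_records, sample_rate = 713, 122880    # example values from above
t_series = zeros(number_of_records * sample_rate, dtype=int16)
for record in xrange(number_of_records):
    record_t = zeros(sample_rate, dtype=int16)  # stands in for unpacked data
    t_series[record * sample_rate : (record + 1) * sample_rate] = record_t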

Final Solution

Using numpy's fromfile method with a custom dtype cut the runtime to 9 seconds, 27x faster than the original code above. The final code is below.

from numpy import savez, dtype , fromfile 
from os.path import getsize
from time import clock

start_time = clock()
file_size = getsize(input_file)

openfile = open(input_file,'rb')
header = openfile.read(149)
record_size = int(header[23:31])
number_of_records = ( file_size - 149 ) / record_size
sample_rate = ( ( record_size - 4 ) / 4 ) / 2

record_dtype = dtype( [ ( 'timestamp' , '<i4' ) , ( 'samples' , '<i2' , ( sample_rate , 4 ) ) ] )

data = fromfile(openfile , dtype = record_dtype , count = number_of_records )
time_series = data['timestamp']
t_series = data['samples'][:,:,0].ravel()
x_series = data['samples'][:,:,1].ravel()
y_series = data['samples'][:,:,2].ravel()
z_series = data['samples'][:,:,3].ravel()

savez(output_file, t=t_series , x=x_series ,y=y_series, z=z_series, fid=time_series)

end_time = clock()

print 'It took',end_time - start_time,'seconds'
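If a file were too large to read in one go, the same structured dtype would also work with a memory map. A short sketch under that assumption, reusing record_dtype, number_of_records, and the 149-byte header offset from above (np.memmap is standard NumPy):

from numpy import memmap

# Sketch: map the records instead of reading them all up front; bytes are
# paged in from disk only when a slice is actually touched.
data = memmap(input_file, dtype=record_dtype, mode='r',
              offset=149, shape=(number_of_records,))
t_series = data['samples'][:, :, 0].ravel()    # copies just this one stream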

Best Answer

A few hints: don't unpack sample by sample with the struct module; define a structured dtype that matches one record and let numpy's fromfile read all the records in a single call.

Something like this (untested, but you get the idea):

import numpy as np

file = open(input_file, 'rb')
header = file.read(149)
# ... parse the header as you did ...

record_dtype = np.dtype([
    ('timestamp', '<i4'),
    ('samples', '<i2', (sample_rate, 4))
])

data = np.fromfile(file, dtype=record_dtype, count=number_of_records)
# NB: count can be omitted -- it just reads the whole file then

time_series = data['timestamp']
t_series = data['samples'][:,:,0].ravel()
x_series = data['samples'][:,:,1].ravel()
y_series = data['samples'][:,:,2].ravel()
z_series = data['samples'][:,:,3].ravel()
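One cheap validation of such a dtype (a sketch, not part of the original answer): its itemsize should account for every byte of a record, so it must equal the record size parsed from the header.

# Sketch: validate the layout before reading; for the example file the
# record size is 983,044 bytes.
assert record_dtype.itemsize == record_size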

This question, python - Efficient way to create a Numpy array from a binary file, originally appeared on Stack Overflow: https://stackoverflow.com/questions/7569563/
