
python - Scraping data faster in Python


I am scraping data from 25 GB of bz2 files. Right now I process the zip files one by one: open each one, pull out the sensor data, take the medians, and once all the files have been processed, write the results to an Excel file. Processing these files takes an entire day, which is unbearable.

I want to make the process faster, so I would like to spawn as many threads as possible, but how should I approach this? Pseudocode for the idea would be nice.

The issue I am thinking about is that I have a timestamp for each day's zip files. For example, my day 1 is at 20:00: I need to process its files and save the result in a list, while other threads can work on other days, but I need the data kept synchronized in the file that gets written to disk.

Basically, I just want this to run much faster.

Here is the pseudocode of the proc_file routine referred to in the answer:

def proc_file(directoary_names):
    i = 0
    try:
        for idx in range(len(directoary_names)):
            print(directoary_names[idx])
            process_data(directoary_names[idx], i, directoary_names)
            i = i + 1
    except KeyboardInterrupt:
        pass

    print("writing data")
    general_pd['TimeStamp'] = timeStamps
    general_pd['S_strain_HOY'] = pd.Series(S1)
    general_pd['S_strain_HMY'] = pd.Series(S2)
    general_pd['S_strain_HUY'] = pd.Series(S3)
    general_pd['S_strain_ROX'] = pd.Series(S4)
    general_pd['S_strain_LOX'] = pd.Series(S5)
    general_pd['S_strain_LMX'] = pd.Series(S6)
    general_pd['S_strain_LUX'] = pd.Series(S7)
    general_pd['S_strain_VOY'] = pd.Series(S8)
    general_pd['S_temp_HOY'] = pd.Series(T1)
    general_pd['S_temp_HMY'] = pd.Series(T2)
    general_pd['S_temp_HUY'] = pd.Series(T3)
    general_pd['S_temp_LOX'] = pd.Series(T4)
    general_pd['S_temp_LMX'] = pd.Series(T5)
    general_pd['S_temp_LUX'] = pd.Series(T6)

    writer = pd.ExcelWriter(r'c:\ahmed\median_data_meter_12.xlsx', engine='xlsxwriter')
    # Convert the dataframe to an XlsxWriter Excel object.
    general_pd.to_excel(writer, sheet_name='Sheet1')
    # Close the Pandas Excel writer and output the Excel file.
    writer.save()

Sx through Tx are the sensor values.

Best answer

Use multiprocessing; your task looks quite simple.

from multiprocessing import Pool, Manager

manager = Manager()
l = manager.list()

def proc_file(file):
    # Process it
    l.append(median)

p = Pool(4)  # however many processes you want to spawn
p.map(proc_file, your_file_list)

# somehow save l to excel.
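
For the last step, a minimal sketch of one way to write the collected medians out, assuming each entry in l is a single number; list(l) copies the Manager proxy into an ordinary list before handing it to pandas, and the output path is simply the one from the question:

import pandas as pd

# Copy the shared Manager list into a plain list, then write one median per row.
medians = pd.Series(list(l), name='median')
writer = pd.ExcelWriter(r'c:\ahmed\median_data_meter_12.xlsx', engine='xlsxwriter')
medians.to_excel(writer, sheet_name='Sheet1')
writer.save()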

Update: since you want to keep the file names (perhaps as a pandas column), here is how:

import pandas as pd
from multiprocessing import Pool, Manager

manager = Manager()
d = manager.dict()

def proc_file(file):
    # Process it
    d[file] = median  # assuming file is given as a string. If your median (or whatever you want) is a list, this works as well.

p = Pool(4)  # however many processes you want to spawn
p.map(proc_file, your_file_list)

s = pd.Series(d)
# if your 'median' is a list
# s = pd.DataFrame(d).T

writer = pd.ExcelWriter(path)
s.to_excel(writer, 'sheet1')
writer.save()  # to excel format.

If each file produces several values, you can build a dictionary in which each entry is a list containing those values; a sketch of that variant follows.
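
As an illustration only, here is a minimal sketch of that multi-value case, collecting the results through Pool.map's return values instead of the Manager dict so the rows stay aligned with the input file order. read_medians and your_file_list are hypothetical placeholders for the existing per-file logic and the file list, and the column names mirror the ones used in the question:

import pandas as pd
from multiprocessing import Pool

COLUMNS = ['S_strain_HOY', 'S_strain_HMY', 'S_strain_HUY', 'S_strain_ROX',
           'S_strain_LOX', 'S_strain_LMX', 'S_strain_LUX', 'S_strain_VOY',
           'S_temp_HOY', 'S_temp_HMY', 'S_temp_HUY',
           'S_temp_LOX', 'S_temp_LMX', 'S_temp_LUX']

def proc_file(file):
    # read_medians is a hypothetical stand-in for the per-file work:
    # it should return one median per sensor, in the same order as COLUMNS.
    return read_medians(file)

if __name__ == '__main__':
    files = your_file_list              # placeholder, as in the snippets above
    with Pool(4) as p:                  # however many processes you want to spawn
        rows = p.map(proc_file, files)  # one list of medians per file, in input order

    # One row per file, one column per sensor; the file name becomes the index.
    df = pd.DataFrame(rows, index=files, columns=COLUMNS)
    writer = pd.ExcelWriter(r'c:\ahmed\median_data_meter_12.xlsx', engine='xlsxwriter')
    df.to_excel(writer, sheet_name='Sheet1')
    writer.save()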

Regarding python - Scraping data faster in Python, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53191015/
