
python - Synchronous reading faster than asynchronous reading with a thread pool on moderately sized JSON files


The answers to asynchronous slower than synchronous do not cover the scenario I am dealing with, hence this question.

I am using Python 3.6.0 on Windows 10 to read 11 identical JSON files, named k80.json through k90.json, each 18.1 MB.

First, I tried reading all 11 files synchronously and sequentially. It took 5.07s to complete.

from json import load
from os.path import join
from time import time


def read_config(fname):
    # Load the JSON file and return the number of top-level entries.
    with open(fname) as json_fp:
        json_data = load(json_fp)

    return len(json_data)


if __name__ == '__main__':
    in_files = [join('C:\\', 'Users', 'userA', 'Documents', f'k{idx}.json') for idx in range(80, 91)]

    print('Starting sequential run.')
    start_time1 = time()

    for fname in in_files:
        print(f'Reading file: {fname}')
        print(f'The JSON file size is {read_config(fname)}')

    read_duration1 = round(time() - start_time1, 2)

    print('Ending sequential run.')
    print(f'Synchronous reading took {read_duration1}s')
    print('\n' * 3)

Results

Starting sequential run.
Reading file: C:\Users\userA\Documents\k80.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k81.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k82.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k83.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k84.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k85.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k86.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k87.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k88.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k89.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k90.json
The JSON file size is 5
Ending sequential run.
Synchronous reading took 5.07s

Next, I tried running it with a ThreadPoolExecutor and a map call, using 12 threads. This took 5.69s.

from concurrent.futures import ThreadPoolExecutor
from json import load
from os.path import join
from time import time


def read_config(fname):
    # Load the JSON file and return the number of top-level entries.
    with open(fname) as json_fp:
        json_data = load(json_fp)

    return len(json_data)


if __name__ == '__main__':
    NUM_THREADS = 12
    in_files = [join('C:\\', 'Users', 'userA', 'Documents', f'k{idx}.json') for idx in range(80, 91)]

    th_pool = ThreadPoolExecutor(max_workers=NUM_THREADS)
    print(f'Starting mapped pre-emptive threaded pool run with {NUM_THREADS} threads.')
    start_time2 = time()

    # Exiting the with block waits for all mapped tasks to finish.
    with th_pool:
        map_iter = th_pool.map(read_config, in_files, timeout=10)

    read_duration2 = round(time() - start_time2, 2)

    for map_res in map_iter:
        print(f'The JSON file size is {map_res}')

    print('Ending mapped pre-emptive threaded pool run.')
    print(f'Mapped asynchronous pre-emptive threaded pool reading took {read_duration2}s')
    print('\n' * 3)

Results

Starting mapped pre-emptive threaded pool run with 12 threads.
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
Ending mapped pre-emptive threaded pool run.
Mapped asynchronous pre-emptive threaded pool reading took 5.69s

Finally, I tried running it with a ThreadPoolExecutor and submit calls, using 12 threads. This took 5.73s.

from concurrent.futures import ThreadPoolExecutor
from json import load
from os.path import join
from time import time


def read_config(fname):
    # Load the JSON file and return the number of top-level entries.
    with open(fname) as json_fp:
        json_data = load(json_fp)

    return len(json_data)


if __name__ == '__main__':
    NUM_THREADS = 12
    in_files = [join('C:\\', 'Users', 'userA', 'Documents', f'k{idx}.json') for idx in range(80, 91)]

    th_pool = ThreadPoolExecutor(max_workers=NUM_THREADS)
    results = []
    print(f'Starting submitted pre-emptive threaded pool run with {NUM_THREADS} threads.')
    start_time3 = time()

    # Exiting the with block waits for all submitted futures to finish.
    with th_pool:
        for fname in in_files:
            results.append(th_pool.submit(read_config, fname))

    read_duration3 = round(time() - start_time3, 2)

    for result in results:
        print(f'The JSON file size is {result.result(timeout=10)}')

    print('Ending submitted pre-emptive threaded pool run.')
    print(f'Submitted asynchronous pre-emptive threaded pool reading took {read_duration3}s')

Results

Starting submitted pre-emptive threaded pool run with 12 threads.
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
Ending submitted pre-emptive threaded pool run.
Submitted asynchronous pre-emptive threaded pool reading took 5.73s

Questions

  1. Why does synchronous reading perform faster than threaded reading for fairly large JSON files like these? Given the file sizes and the number of files being read, I expected the threaded versions to be faster.

  2. Would it take JSON files much larger than these for the threaded versions to outperform synchronous reading? If not, what other factors should be considered?

Thanks in advance for your time and help.

Postscript

Thanks to the answer below, I changed the read_config method slightly to introduce a 3s sleep delay (simulating an operation that waits on I/O), and now the threaded versions really shine (38.81s versus 9.36s and 9.39s).

from time import sleep


def read_config(fname):
    with open(fname) as json_fp:
        json_data = load(json_fp)

    sleep(3)  # Simulate an activity that waits on I/O.

    return len(json_data)

Results

Starting sequential run.
Reading file: C:\Users\userA\Documents\k80.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k81.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k82.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k83.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k84.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k85.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k86.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k87.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k88.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k89.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k90.json
The JSON file size is 5
Ending sequential run.
Synchronous reading took 38.81s




Starting mapped pre-emptive threaded pool run with 12 threads.
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
Ending mapped pre-emptive threaded pool run.
Mapped asynchoronous pre-emptive threaded pool reading took 9.36s




Starting submitted pre-emptive threaded pool run with 12 threads.
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
Ending submitted pre-emptive threaded pool run.
Submitted asynchoronous pre-emptive threaded pool reading took 9.39s

Best Answer

I'm not an expert, but in general, for threading to be useful for speed, your program needs to be waiting on IO. Threading doesn't give you access to parallel CPU threads; it just allows operations to run concurrently, sharing the same CPU time and the same Python interpreter (if you want access to more CPUs, you should look at ProcessPoolExecutor).
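
For reference, here is a minimal sketch of the ProcessPoolExecutor variant mentioned above, reusing the file list from the question (the worker count of 4 is an arbitrary choice, not taken from the question). Because JSON parsing is CPU-bound, separate processes can parse on separate cores instead of contending for the GIL:

from concurrent.futures import ProcessPoolExecutor
from json import load
from os.path import join


def read_config(fname):
    # Runs in a worker process, so parsing does not contend for the
    # parent interpreter's GIL.
    with open(fname) as json_fp:
        return len(load(json_fp))


if __name__ == '__main__':
    in_files = [join('C:\\', 'Users', 'userA', 'Documents', f'k{idx}.json') for idx in range(80, 91)]

    # On Windows, worker processes re-import this module, so the pool
    # must be created under the __main__ guard.
    with ProcessPoolExecutor(max_workers=4) as proc_pool:  # 4 workers: arbitrary
        for size in proc_pool.map(read_config, in_files):
            print(f'The JSON file size is {size}')

Whether this actually beats the sequential version depends on how much of the per-file time is spent parsing rather than waiting on the disk; each result must also be pickled back to the parent process, which adds overhead of its own.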

For example, if you were reading from several remote databases instead of local files, your program would spend a lot of time waiting on IO without using local resources. In that case threading might help, because you could wait in parallel, or process one item while waiting on another. But since all of your data comes from local files, you have probably already maxed out your local disk IO, so you cannot read multiple files at once (or at least no faster than reading them sequentially). Your machine still has to perform all the same work with the same resources, and it has essentially no "downtime" in either variant, which is why they take almost the same time.
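
One way to check where the time actually goes is to time the raw file reads separately from the JSON parsing; a rough sketch along those lines, reusing the paths from the question:

from json import loads
from os.path import join
from time import time

if __name__ == '__main__':
    in_files = [join('C:\\', 'Users', 'userA', 'Documents', f'k{idx}.json') for idx in range(80, 91)]

    read_time = 0.0
    parse_time = 0.0
    for fname in in_files:
        t0 = time()
        with open(fname) as fp:
            raw = fp.read()  # disk (and OS cache) I/O only
        t1 = time()
        loads(raw)  # parsing only; the parser holds the GIL throughout
        t2 = time()
        read_time += t1 - t0
        parse_time += t2 - t1

    print(f'Total read time:  {read_time:.2f}s')
    print(f'Total parse time: {parse_time:.2f}s')

If most of the roughly 5 seconds shows up as parse time, threads cannot help much regardless of disk speed: only one thread can parse at a time while the GIL is held.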

Regarding python - synchronous reading faster than asynchronous reading with a thread pool on moderately sized JSON files, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58088038/
