
python - How to parallelize or make a faster Python script

Reposted. Author: 行者123. Updated: 2023-11-28 21:31:58

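I have code that performs operations on a text file. The text file is very large, however, and by my estimate the current code would need 30 days to finish.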

If multiprocessing is the only way, I have a 40-core server available.

Cell_line_final2.bed:

chr1    778704  778912  MSPC_Peak_37509  8.43   cell_line   GM12878  CTCF   ENCSR000AKB CNhs12333   132
chr1 778704 778912 MSPC_Peak_37509 8.43 cell_line GM12878 CTCF ENCSR000AKB CNhs12331 132
chr1 778704 778912 MSPC_Peak_37509 8.43 cell_line GM12878 CTCF ENCSR000AKB CNhs12332 132
chr1 869773 870132 MSPC_Peak_37508 74.0 cell_line GM12878 CTCF ENCSR000AKB CNhs12333 132
...
...

tf_TPM2.bed:

CNhs12333   2228319     4.41    CTCF
CNhs12331 6419919 0.0 HES2
CNhs12332 6579994 0.78 ZBTB48
CNhs12333 8817465 0.0 RERE
...
...

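The desired output adds a column to "Cell_line_final2.bed" wherever columns 1 and 4 of "tf_TPM2.bed" simultaneously match columns 10 and 8 of "Cell_line_final2.bed". The added value is column 3 of "tf_TPM2.bed".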

chr1    778704  778912  MSPC_Peak_37509  8.43   cell_line   GM12878  CTCF   ENCSR000AKB CNhs12333   132   4.41
chr1 778704 778912 MSPC_Peak_37509 8.43 cell_line GM12878 HES2 ENCSR000AKB CNhs12331 132 0.0
chr1 778704 778912 MSPC_Peak_37509 8.43 cell_line GM12878 CTCF ENCSR000AKB CNhs12332 132 0.78
chr1 869773 870132 MSPC_Peak_37508 74.0 cell_line GM12878 RERE ENCSR000AKB CNhs12333 132 0.0
...
...

My code so far:

def read_file(file):
    with open(file) as f:
        current = []
        for line in f: # read rest of lines
            current.append([x for x in line.split()])
    return(current)


inputfile = "/home/lside/Desktop/database_files/Cell_line_final2.bed" # 2.7GB text file
outpufile = "/home/lside/Desktop/database_files/Cell_line_final3.bed"

file_in = read_file("/home/lside/Desktop/tf_TPM2.csv") # 22.5MB text file
new_line = ""
with open(inputfile, 'r') as infile:
    with open(outpufile, 'w') as outfile:
        for line in infile:
            line = line.split("\t")
            for j in file_in:
                if j[0] == line[9] and j[3] == line[7]:
                    new_line = new_line + '{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}\t{8}\t{9}\t{10}\t{11}\n'.format(line[0], line[1], line[2],line[3], line[4], line[5],line[6], line[7], line[8], line[9], line[10].rstrip(), j[2])
                    continue
        outfile.write(new_line)

Best answer

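I agree with the comments saying this should not take 30 days to run, so the bottleneck must be somewhere else. Perhaps the biggest culprit is the huge string you are building, instead of dumping each line to the file at every iteration (^).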

Note

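(^) The biggest culprit is more likely the continue statement in the inner loop, because it always forces the code to compare the current line against every element of the lookup file instead of stopping at the first match. Replacing it with break should be the right fix.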
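To make the difference concrete, here is a minimal sketch that counts how many lookup rows are inspected for a single input line with and without an early exit. The toy `table` and the helper `count_comparisons` are hypothetical and exist only for illustration; the real lookup file is 22.5MB, so the gap per input line would be far larger.

```python
# Hypothetical toy data: (sample_id, tpm, tf_name) tuples standing in for rows of the lookup file.
table = [
    ("CNhs12333", "4.41", "CTCF"),
    ("CNhs12331", "0.0", "HES2"),
    ("CNhs12332", "0.78", "ZBTB48"),
    ("CNhs12333", "0.0", "RERE"),
]


def count_comparisons(table, sample_id, tf_name, stop_at_first_match):
    """Return how many rows are inspected while searching for one (sample_id, tf_name) pair."""
    inspected = 0
    for row_id, _tpm, row_tf in table:
        inspected += 1
        if row_id == sample_id and row_tf == tf_name:
            if stop_at_first_match:
                break      # leave the inner loop as soon as the match is found
            continue       # keep scanning the rest of the table anyway
    return inspected


full_scan = count_comparisons(table, "CNhs12333", "CTCF", stop_at_first_match=False)
early_exit = count_comparisons(table, "CNhs12333", "CTCF", stop_at_first_match=True)
print(full_scan, early_exit)
```

With continue the whole table is scanned even though the match sits in the first row; with break the scan stops there.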

Here is what I would do; see how fast it runs:

def read_file(filename):
    with open(filename) as f:
        current = []
        for line in f:  # read every line of the lookup file
            fields = line.split()  # split once instead of three times
            current.append((fields[0], fields[2], fields[3]))  # you only use these three elements
    return current


inputfile = "/home/lside/Desktop/database_files/Cell_line_final2.bed" # 2.7GB text file
outpufile = "/home/lside/Desktop/database_files/Cell_line_final3.bed"

file_in = read_file("/home/lside/Desktop/tf_TPM2.csv") # 22.5MB text file

with open(inputfile, 'r') as infile:
    with open(outpufile, 'w') as outfile:
        for line in infile:
            fields = line.split("\t")  # keep `line` as a string so it can be written back out
            for e0, e2, e3 in file_in:
                if e0 == fields[9] and e3 == fields[7]:
                    new_line = '{0}\t{1}\n'.format(line.rstrip(), e2) # just append the column to the entire line
                    outfile.write(new_line) # dump to file, don't linger around with an ever-growing string
                    break

Lookup table

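If we want to go a step further, we can build a lookup table from file_in. The idea is that, instead of looping over every element extracted from file_in, we prepare a dictionary whose keys are built from j[0], j[3] (the fields you compare) and whose values are j[2]. That way each lookup is nearly instantaneous and the inner loop is no longer needed.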

The modified code using this logic looks like this:

def make_lookup_table(filename):
    lookup = {}
    with open(filename) as f:
        for line in f:  # read every line of the lookup file
            fields = line.split()  # split once instead of three times
            lookup[(fields[0], fields[3])] = fields[2] # use (e0,e3) as key, and e2 as value
    return lookup


inputfile = "/home/lside/Desktop/database_files/Cell_line_final2.bed" # 2.7GB text file
outpufile = "/home/lside/Desktop/database_files/Cell_line_final3.bed"

lookup = make_lookup_table("/home/lside/Desktop/tf_TPM2.csv") # 22.5MB text file

with open(inputfile, 'r') as infile:
    with open(outpufile, 'w') as outfile:
        for line in infile:
            fields = line.split("\t")  # keep `line` as a string so it can be written back out
            value = lookup.get((fields[9], fields[7]))
            if value is None:
                continue  # no match: skip the line, as the loop version does
            new_line = '{0}\t{1}\n'.format(line.rstrip(), value) # just append the column to the entire line
            outfile.write(new_line) # dump to file, don't linger around with an ever-growing string
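Before running this against the 2.7GB file, the lookup-table idea can be checked on a few in-memory rows. The sketch below is only an illustration: the helper names `build_lookup` and `annotate` are hypothetical, the rows are adapted from the samples in the question, and it assumes that lines without a match should be skipped rather than raise an error.

```python
def build_lookup(tpm_lines):
    """Map (sample_id, tf_name) -> TPM value from whitespace-separated rows."""
    lookup = {}
    for line in tpm_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        lookup[(fields[0], fields[3])] = fields[2]
    return lookup


def annotate(bed_lines, lookup):
    """Yield each BED line with the matching TPM appended; unmatched lines are skipped."""
    for line in bed_lines:
        stripped = line.rstrip("\n")
        fields = stripped.split("\t")
        value = lookup.get((fields[9], fields[7]))  # columns 10 and 8, zero-based
        if value is not None:
            yield "{0}\t{1}\n".format(stripped, value)


# Sample rows adapted from the question (tab-separated BED, whitespace-separated TPM table).
tpm_rows = [
    "CNhs12333 2228319 4.41 CTCF\n",
    "CNhs12331 6419919 0.0 HES2\n",
]
common = ["chr1", "778704", "778912", "MSPC_Peak_37509", "8.43",
          "cell_line", "GM12878", "CTCF", "ENCSR000AKB"]
bed_rows = [
    "\t".join(common + ["CNhs12333", "132"]) + "\n",  # matches (CNhs12333, CTCF)
    "\t".join(common + ["CNhs12331", "132"]) + "\n",  # no (CNhs12331, CTCF) entry
]

result = list(annotate(bed_rows, build_lookup(tpm_rows)))
for row in result:
    print(row, end="")
```

Because `annotate` is a generator, the same function could be fed an open file object and its output passed to `outfile.writelines`, so the large file is never held in memory.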

Regarding python - How to parallelize or make a faster Python script, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57280634/
