
python - In Python, how should I efficiently sort one large file to match the common elements in another large file, when neither file is sorted?

Reposted. Author: 行者123. Updated: 2023-12-01 07:52:49

I'm fairly new to Python, and I'm writing a script that is supposed to take two fairly large text files (~10 MB) and create a new file from each, following the rules described below.

File A contains tab-separated values on each line; in file B, one line holds an ID and the next line holds data. The IDs in file B also appear in file A, but not all IDs from file A are in file B, and vice versa.

Neither file has its IDs in alphanumeric order, and the two files are ordered differently. I don't need them sorted alphanumerically; I only need the two output files to be in the same order as each other, and to contain only the items whose IDs appear in both files.

This is what file A looks like: File A

This is what file B looks like: File B

As you can see, the items in column B of file A provide identifiers that may or may not be present in file B.

Here is a naive script I wrote. For every line in file A, it iterates through the whole of file B until it either finds a matching ID or reaches the end.

The script works fine, but because it contains a nested loop it is roughly O(n^2) (strictly O(m*n), where m is the size of file A and n is the size of file B, but they are usually of similar size). That is likely to become a problem once I use it on real data (hundreds of MB, or GB).

import re
import os

def spectrosingle(inputline):
    """Extract the ID from a line of the spectro file, or return None."""
    if (len(inputline) > 0) and (not inputline[0] == "\t"):
        # The ID in the spectro file is always followed by 3 empty columns,
        # which is the only such occurrence in the whole line
        resline = re.findall(r'\d\t(.+?)\t\t\t\t|$', inputline)[0]
        return resline
    else:
        return None

try:
    fastafile = open('fastaseq.fasta', "r")
except OSError:
    print("FASTA file corrupted or not found!\n")

try:
    spectrometry = open('spectro.txt', "r")
except OSError:
    print("Spectro file corrupted or not found!\n")


missingarr = []  # IDs that are in the spectro file but not in the FASTA file
misnum = 0       # counter for those IDs

with open('MAIN.fasta', mode='w') as output_handle:
    # Nested loop: for every sequence in the spectrometry file, we search the
    # unsorted FASTA until we find the corresponding record. Any sequence in
    # the spectrometry file that is nowhere in the FASTA is marked so that it
    # doesn't get copied into the final spectrometry file.
    for line in spectrometry:
        fastaline1 = 'temp'  # temporary init so we can enter the while loop that checks for remaining lines
        missbool = True      # flag for IDs that are missing from the FASTA file
        speccheck = spectrosingle(line)  # extract the ID from the spectro line
        if not speccheck:
            continue  # spectrosingle returns None for lines without a sequence; skip them
        while fastaline1:
            fastaline1 = fastafile.readline()
            fastaline1 = fastaline1.partition(">")[2]
            fastaline1 = fastaline1.partition("\n")[0]  # shave the ">" header and newline from the ID
            fastaline2 = fastafile.readline()
            if fastaline1 == speccheck:  # does the FASTA ID match the one from the spectro file?
                print("Sorted sequence ID %s." % (fastaline1))
                output_handle.write('>' + fastaline1 + '\n')  # write the header
                output_handle.write(fastaline2)               # write the sequence
                missbool = False
                fastafile.seek(0)  # return to the start of the file for the next cycle
                break
        if missbool:  # fires only when the whole FASTA was searched without a match
            misnum = misnum + 1           # count the discarded sequence
            missingarr.append(speccheck)  # remember it so it is excluded from the new spectro file
            fastafile.seek(0)

print("Sorting finished!\n")
fastafile.close()
spectrometry.close()

num = 0
if misnum != 0:  # are there any sequences marked for deletion?
    blackbool = True
    blackword = missingarr[num]
else:
    blackbool = False  # no marked sequences available

# Write the final spectrometry file, dropping sequences that would cause a
# mismatch during the final merger of the data. (finpath and prepid are
# defined elsewhere in the original script.)
with open('spectro.txt', "r") as spectrometry, \
     open(os.path.splitext(finpath)[0] + '\\' + prepid + 'FINAL_spectrometry.txt', mode='w') as final_output:
    fullspec = spectrometry.readlines()  # might be memory-heavy, but still probably the most efficient way to do this
    if not blackbool:  # nothing is marked, so the whole file is copied
        for line in fullspec:
            final_output.write(line)
    else:
        for line in fullspec:
            if re.search(blackword, line) is None:  # unmarked IDs are transferred to the new file
                final_output.write(line)
            else:
                num = num + 1
                if num < len(missingarr):    # guard instead of a bare try/except,
                    blackword = missingarr[num]  # which silently aborted the loop early

print("There were %i redundant sequences in the spectro file, which have been filtered out.\n" % (num))
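The parsing rule described above (the ID is the only field followed by three empty tab-separated columns) can be checked in isolation. This is a small sketch with a made-up line; the ID Q9XYZ1 is invented:

```python
import re

def spectrosingle(inputline):
    # Same extraction rule as in the script: the ID is the only field
    # followed by three empty tab-separated columns
    if len(inputline) > 0 and inputline[0] != "\t":
        return re.findall(r'\d\t(.+?)\t\t\t\t|$', inputline)[0]
    return None

line = "1\tQ9XYZ1\t\t\t\tother\tdata\n"  # made-up spectro line
print(spectrosingle(line))       # the ID between the digit column and the empty columns
print(spectrosingle("\tno id"))  # a line starting with a tab gives None
```

Note that the trailing `|$` alternative makes `findall` return an empty string (rather than raising an IndexError) when a line has no ID, which is why the caller can simply test the result for truthiness.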

Is there a more efficient way to do this? I suspect the way I'm doing it isn't very Pythonic, but I can't really put my finger on what's wrong with it.

Best Answer

Your code will indeed not be very efficient. Instead, I suggest using a dictionary that maps each ID in file B to its data. To get the data line, you can just call next on the same iterator you are reading the file with (provided the number of lines is even). Something like this (untested):

data = {}
with open("fileb") as fb:
    for line_id in fb:
        the_id = line_id.strip()[1:]  # remove newline and ">"
        line_data = next(fb)          # get the next line from the file
        data[the_id] = line_data.strip()

Then, when you read file A, you can just look up the data for the current ID in that dictionary, instead of iterating over the whole of file B again and again.
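The pairing trick — calling next on the same iterator the for loop is consuming — can be exercised on a small in-memory example (io.StringIO stands in for the real file B; the IDs and sequences are invented):

```python
import io

# Simulated file B: ">ID" header lines alternating with data lines
fileb = io.StringIO(">ID1\nACGT\n>ID2\nTTGA\n")

data = {}
for line_id in fileb:
    the_id = line_id.strip()[1:]   # strip the newline, drop the leading ">"
    line_data = next(fileb)        # the data line paired with this ID
    data[the_id] = line_data.strip()

print(data)  # {'ID1': 'ACGT', 'ID2': 'TTGA'}
```

Because the loop and next share one iterator, each pass of the loop consumes two lines, so the file is read in a single O(n) sweep.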

Also, and less related: you can just split("\t") the line, or use the csv module. Something like this (also untested):

with open("filea") as fa:
    for line in fa:
        num, the_id, more, stuff, dont, know, what = line.split("\t")
        if the_id in data:
            the_data = data[the_id]
            ... do stuff with the data ...

You can also use *_ to capture any remaining fields, instead of enumerating all the columns:

        num, the_id, *other_stuff_we_do_not_care_about = line.split("\t")
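Putting the two snippets together gives a single-pass pipeline: index file B once, then stream file A and keep only the rows whose ID is in the index, in file A's order. A minimal sketch with made-up in-memory data (io.StringIO stands in for the real files; all IDs are invented):

```python
import io

# Hypothetical stand-ins for file B (FASTA-style) and file A (tab-separated)
fileb = io.StringIO(">ID2\nTTGA\n>ID1\nACGT\n>ID9\nGGGG\n")
filea = io.StringIO("1\tID1\tx\ty\n2\tID7\tx\ty\n3\tID2\tx\ty\n")

# One pass over file B: index its data by ID
data = {}
for line_id in fileb:
    the_id = line_id.strip()[1:]
    data[the_id] = next(fileb).strip()

# One pass over file A: keep only rows whose ID has data, in file A's order
matched = []
for line in filea:
    num, the_id, *rest = line.rstrip("\n").split("\t")
    if the_id in data:
        matched.append((the_id, data[the_id]))

print(matched)  # ID7 has no FASTA entry and ID9 never appears in file A, so both drop out
```

This runs in O(m + n) time instead of the original O(m * n), at the cost of holding file B's IDs and data in memory.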

Regarding "python - In Python, how should I efficiently sort one large file to match the common elements in another large file, when neither file is sorted?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/56108276/
