gpt4 book ai didi

python - 使用python查找并删除公共(public)子字符串

转载 作者:太空宇宙 更新时间:2023-11-03 17:59:31 24 4
gpt4 key购买 nike

所以我有两个引物序列:

fiveprime = "GATTCGAAGTCCACTATC"threeprime = "TGAGTAGGACGGCACTATC"

What I need to do is when I have another sequence I need to check if it contains a part of one these primer sequences and if it does I need to remove the matching part leaving the non-matching part for further analysis.

For example my sequence: CACTATCAAAAAAAhas a part of the fiveprimer in it. I need to locate this common substring and then remove it leaving only the AAAAA.

The problem I run in to is when you have a sequence like this CACTATCGAAG where GAAG is also in the primer, but not part of the primer sequence it's still removed. I tried to fix this by making sure that the found structure is at the left side of the primer and with the threeprimer on the right side so example:

With the CACTATCGAAG we have 2 common structures CACTATC and GAAGnow I can compare CACTATC with the end of the fiveprime GATTCGAAGTCCACTATC and tell it's part of the primer when it's a match and then remove it. So when we compare GAAG with the length of the last end of the fiveprimer it will give us this GATTCGAAGTCCACTATC which is not a match so GAAG can go on for further analysis.

For some reason my script bugs or doesn't work properly. Are there any other solutions to this problem or suggestions?

def longestSubstringFinder(string1, string2):
answer = ""
len1, len2 = len(string1), len(string2)
for i in range(len1):
match = ""
for j in range(len2):
if i + j < len1 and string1[i + j] == string2[j]:
match += string2[j]
else:
if len(match) > len(answer): answer = match
match = ""
return answer

def get_sequence(test, fiveprime, threeprime):
if test == fiveprime:
pass
elif test == threeprime:
pass
elif test in fiveprime:
pass
elif test in threeprime:
pass

# find out if there is a matching part between the primers and the found
# single stranded region, then calculates what part that is, and removes it
# from the single stranded region
else:
overlap = longestSubstringFinder(fiveprime, test)
l = len(overlap)

if fiveprime[-l:] == test[:l]:
check = test[l:]
else:
check = test

overlap2 = longestSubstringFinder(check, threeprime)
l = len(overlap2)

if threeprime[:l] == check[-l:]:
check2 = check[:-l]
structure.append(check2)
else:
structure.append(check)

return structure

最佳答案

我认为,如果您选择适当的数据结构来表示您要查找的数据,您的问题会更容易处理。我能想到的最好的一个是 trie .

这种结构的好处是,它允许您在给定初始字母序列的情况下表示所有可能的匹配,因此,如果您有序列 AABAB,它将允许从初始 A 遍历到 A 和 B,但不允许从 A 到 G 或 T 的遍历。这使得查找部分匹配变得高效,因为 trie 中的任何点都代表了那么多字母的匹配。

这个数据结构将类似于:

class Trie(object):
def __init__(self):
self.children = {}

def add_child(self, letter):
if letter in self.children:
return self.children[letter]
else:
child = Trie()
self.children[letter] = child
return child

def traverse(self, letter):
return self.children.get(letter, None)

然后您可以像这样填充:

root = Trie()
current_positions = []
for letter in letters:
current_positions = [
position.add_child(letter)
for position in current_positions
]
current_positions.append(root.add_child(letter))

一旦所有这些都设置完毕,您就应该能够遍历这个结构,直到遍历返回 null。这将指示当前最长的匹配。字母的初始化将每个字母视为匹配的潜在起点,您也应该如此。

然后您可以像这样搜索最长的子字符串匹配:

class TrieSearch(object):
def __init__(self, trie, starting_index):
self.trie = trie
self.starting_index = starting_index
self.ending_index = starting_index + 1

def update(self, letter):
""" This returns a boolean indicating
if the search can accept the letter """
self.trie = self.trie.traverse(letter)
if self.trie is not None:
self.ending_index = self.ending_index + 1
return True
return False

def get_match(self, letters):
return letters[self.starting_index:self.ending_index]

def find_matches(root, letters):
completed_matches = []
current_matches = []

for index, letter in enumerate(letters):
new_current = []

for current in current_matches:
if current.update(letter):
new_current.append(current)
else:
completed_matches.append(current)

new_search_trie = root.traverse(letter)
if new_search_trie is not None:
new_current.append(TrieSearch(new_search_trie, index))

current_matches = new_current

all_matches = completed_matches + current_matches
return [match.get_match(letters) for match in all_matches]

我已将所有这些放在 a gist 中当使用 thirdprime Fiveprime 值初始化 trie 且输入数据为 CACTATCAAAAAAA 时,结果为:

['CACTATC', 'ACTATC', 'CTATC', 'TATC', 'ATC', 'TC', 'CA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'A']

由于您无疑要处理大量字符串,因此您可能需要查看一些更有效的通用字符串替换算法。 Aho-Corasick algorithm是此处概述的 trie 方法的扩展。还有Knuth-Morris-Pratt algorithm它使用表而不是特里树。这两种算法都是线性复杂度算法,与您的 longestSubstringFinder 方法使用的二次方法相比,这将是一个重大改进。

关于python - 使用python查找并删除公共(public)子字符串,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/27963222/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com