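I have a few hundred (fairly simple) regular expressions and their matches in a large number of sequences. I want to know which part of each regular expression matches which position in the target sequence. For example, the regular expression "[DSTE][^P][^DEWHFYC]D[GSAN]" matches positions 4 to 8 in the following sequence:
ABCSGADAZZZ
What I would like to get (programmatically) is, for each regular expression, 1) each "part" of the regular expression and 2) the positions in the target sequence that it matches: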
[DSTE] -- (3, 4),
[^P] -- (4, 5),
[^DEWHFYC] -- (5, 6),
D -- (6, 7),
[GSAN] -- (7, 8)
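I found this website, which does basically what I want: https://regex101.com/ , but I'm not sure how deep I would need to dig into regex parsing to do this in my own code (I'm using Python and R).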
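For patterns as simple as the one above (a flat sequence of character classes, literals and dots, each optionally quantified), one possible approach is to wrap every part in its own capturing group and read the spans off the match object. The sketch below uses only the standard `re` module; the `PART` tokenizer and the `part_spans` name are hypothetical helpers, not part of any library, and patterns containing groups or alternation are rejected rather than handled.

```python
import re

# One "part": a bracket class, an escaped character, or a single plain
# character, optionally followed by a quantifier ({m}, {m,n}, *, + or ?).
# Assumption: no nested groups, no alternation, no classes starting with ']'.
PART = re.compile(
    r'(?:\[\^?[^\]]+\]|\\.|[^\[\]()|*+?{}\\])'
    r'(?:\{\d+(?:,\d*)?\}|[*+?])?'
)


def part_spans(pattern, sequence):
    """Return [(part, (start, end)), ...] for the first match, or None."""
    parts = PART.findall(pattern)
    # findall silently skips text it cannot tokenize, so check that
    # the parts rebuild the pattern exactly.
    if ''.join(parts) != pattern:
        raise ValueError('unsupported construct in pattern: %r' % pattern)
    # Wrap each part in its own capturing group: group i <-> parts[i - 1].
    wrapped = ''.join('(%s)' % p for p in parts)
    m = re.search(wrapped, sequence)
    if m is None:
        return None
    return [(p, m.span(i)) for i, p in enumerate(parts, start=1)]


if __name__ == '__main__':
    for part, span in part_spans('[DSTE][^P][^DEWHFYC]D[GSAN]', 'ABCSGADAZZZ'):
        print(part, span)
```

The spans are 0-based and end-exclusive, the same convention as the listing above. Because the regex engine itself does the backtracking, a bounded quantifier such as `[KR]{1,4}` gets whatever span the engine settled on.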
Best answer
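It's still not 100%, but it returns output for 3365 of the 3510 entries in my dataset. The few I checked lined up :)
The inputs and outputs are included in my GitHub (links below) in csv and txt format, respectively.
Please ignore the global variables; I was considering restructuring the code to see whether there was a noticeable speed improvement, but never got around to it.
This version currently has problems with the order of operations for alternation and the start/end-of-line operators (^ $) when they appear as alternation options at the start or end of the string. I'm fairly confident this has to do with precedence, but I wasn't able to organize it well enough.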
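On the precedence point: alternation binds more loosely than concatenation and anchors, so `^A|B$` reads as "`^A` or `B$`", and each anchor belongs to its own branch. A minimal sketch of splitting a pattern on top-level `|` only (ignoring any `|` inside parentheses or a character class) is shown below. `split_top_level` is a hypothetical helper written for illustration, not the code from the notebook, and it assumes character classes do not start with a literal `]`.

```python
def split_top_level(pattern):
    """Split a regex on '|' characters that are not nested in (...) or [...]."""
    branches, current = [], []
    depth = 0          # current parenthesis nesting depth
    in_class = False   # True while inside a [...] character class
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == '\\' and i + 1 < len(pattern):
            # Keep an escaped pair together so '\|' or '\(' is never
            # interpreted as syntax.
            current.append(pattern[i:i + 2])
            i += 2
            continue
        if in_class:
            if c == ']':
                in_class = False
        elif c == '[':
            in_class = True
        elif c == '(':
            depth += 1
        elif c == ')':
            depth -= 1
        elif c == '|' and depth == 0:
            # Top-level alternation: close the current branch.
            branches.append(''.join(current))
            current = []
            i += 1
            continue
        current.append(c)
        i += 1
    branches.append(''.join(current))
    return branches


if __name__ == '__main__':
    print(split_top_level('^M[KR]|(A|B)C$|[|]D'))
```

Each returned branch keeps its own `^` or `$`, so the branches can then be examined one at a time.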
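The function calls are in the last cell of the notebook. Instead of running the whole DataFrame with
for x in range(len(df)):
    try:
        df_expression = df.iloc[x, 2]
        df_subsequence = df.iloc[x, 1]
        # call function
        identify_submatches(df_expression, df_subsequence)
        print(dataframe_counting)
        dataframe_counting += 1
    except Exception:
        # skip rows the parser cannot handle yet
        pass
you can easily test one case at a time by passing a pattern and the corresponding sequence to the function: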
p = ''
s = ''
identify_submatches(p, s)
Code: https://github.com/jameshollisandrew/just_for_fun/blob/master/motif_matching/motif_matching_02.ipynb
Input: https://github.com/jameshollisandrew/just_for_fun/blob/master/motif_matching/elm_compiled_ss_re.csv
# third-party `regex` module: the recursive (?R) pattern below is not supported by the standard `re`
import regex as r

def identify_submatches(exp_a, sub_a):
    """exp_a as input expression
    sub_a as input subject string"""
    input_exp = exp_a
    input_sub = sub_a
    m_gro = r'\^*\((?:[^()]+|(?R))*+\)({.+?})*\$*'
    m_set = r'\^*\[.+?\]({.+?})*\$*'
    m_alt = r'\|'
    m_lit = r'\^*[.\w]({.+?})*\$*|\$'
    # PRINTOUT
    if (print_type == 1):
        print('\nExpression Input: {}\nSequence Input: {}'.format(exp_a, sub_a))
    if (print_type == 3):
        print('\n\nSTART ITERATION\nINPUTS\n exp: {}\n seq: {}'.format(exp_a, sub_a))
    # return the pattern match (USE IF SUB IS NOT MATCHED PRIMARY)
    m = r.search(exp_a, sub_a)
    if m is None:
        print('Search expression: {} in sequence: {} returned no matches.\n\n'.format(exp_a, sub_a))
        return None
    sub_a = m.group()
    # >>>PRINTOUT<<<
    if print_type == 3:
        print('\nSEQUENCE TYPE M\n exp: {}\n seq: {}'.format(exp_a, sub_a))
    if (print_type == 1):
        print('Subequence Match: {}'.format(sub_a))
    # check if main expression has unnested alternation
    if len(alt_states(exp_a)) > 0:
        # returns matching alternative
        exp_a = alt_evaluation(exp_a, sub_a)
        # >>>PRINTOUT<<<
        if print_type == 3:
            print('\nALTERNATION RETURN\n exp: {}\n seq: {}'.format(exp_a, sub_a))
    # get initial expression list
    exp_list = get_states(exp_a)
    # count possible expression constructions
    status, matched_tuples = finite_state(exp_list, sub_a)
    # >>>PRINTOUT<<<
    if print_type == 3:
        print('\nCONFIRM EXPRESSION\n exp: {}'.format(matched_tuples))
    # index matches
    indexer(input_exp, input_sub, matched_tuples)

def indexer(exp_a, sub_a, matched_tuples):
    sub_length = len(sub_a)
    sub_b = r.search(exp_a, sub_a)
    # offset of the overall match within the full sequence
    adj = sub_b.start()
    sub_b = sub_b.group()
    print('')
    for pair in matched_tuples:
        pattern, match = pair
        start = adj
        adj = adj + len(match)
        end = adj
        index_pos = (start, end)
        sub_b = slice_string(match, sub_b)
        print('\t{}\t{}'.format(pattern, index_pos))

def strip_nest(s):
    # drop the first and last character (the enclosing parentheses)
    s = s[1:]
    s = s[:-1]
    return s

def slice_string(p, s):
    pat = p
    string = s
    # handles escapes
    p = r.escape(p)
    # slice the input string on input pattern
    s = r.split(pattern = p, string = s, maxsplit = 1)[1]
    # >>>PRINTOUT<<<
    if print_type == 4:
        print('\nSLICE STRING\n pat: {}\n str: {}\n slice: {}'.format(pat, string, s))
    return s

def alt_states(exp):
    # check each character in string
    idx = 0 # index tracker
    op = 0 # open parenth
    cp = 0 # close parenth
    free_alt = [] # amend with index position of unnested alt
    for c in exp:
        if c == '(':
            op += 1
        elif c == ')':
            cp += 1
        elif c == '|':
            if op == cp:
                free_alt.append(idx)
        if idx < len(exp)-1:
            idx+=1
    # split string if found
    alts = []
    if free_alt:
        _ = 0
        for i in free_alt:
            alts.append(exp[_:i])
            alts.append(exp[i+1:])
    # the truth value of this check can be checked against the length of the return
    # len(free_alt) > 0 means unnested "|" found
    return alts

def alt_evaluation(exp, sub):
    # >>>PRINTOUT<<<
    if print_type == 3:
        print('\nALTERNATION SELECTION\n EXP: {}\n SEQ: {}'.format(exp, sub))
    # gets alt index position
    alts = alt_states(exp)
    # variables for eval
    a_len = 0 # length of alternate match
    keep_len = 0 # length of return match
    keep = '' # return match string
    # evaluate alternatives
    for alt in alts:
        m = r.search(alt, sub)
        if m is not None:
            a_len = len(m.group()) # length of match string
            # >>>PRINTOUT<<<
            if print_type == 3:
                print(' pat: {}\n str: {}\n len: {}'.format(alt, m.group(0), len(m.group(0))))
            if a_len >= keep_len:
                keep_len = a_len # sets alternate length to keep length
                exp = alt # sets alt as keep variable
    # >>>PRINTOUT<<<
    if print_type == 3:
        print(' OUT: {}'.format(exp))
    return exp

def get_states(exp):
    """counts number of subexpressions to be checked
    creates FSM"""
    # >>>PRINTOUT<<<
    if print_type == 3:
        print('\nGET STATES\n EXP: {}'.format(exp))
    # List of possible subexpression regex matches
    m_gro = r'\^*\((?:[^()]+|(?R))*+\)({.+?})*\$*'
    m_set = r'\^*\[.+?\]({.+?})*\$*'
    m_alt = r'\|'
    m_lit = r'\^*[.\w]({.+?})*\$*|\$'
    # initialize capture list
    exp_list = []
    # loop through first level of subexpressions:
    while exp != '':
        if r.match(m_gro, exp):
            _ = r.match(m_gro, exp).group(0)
            exp_list.append(_)
            exp = slice_string(_, exp)
        elif r.match(m_set, exp):
            _ = r.match(m_set, exp).group(0)
            exp_list.append(_)
            exp = slice_string(_, exp)
        elif r.match(m_alt, exp):
            # NOTE: exp is not advanced in this branch
            _ = ''
        elif r.match(m_lit, exp):
            _ = r.match(m_lit, exp).group(0)
            exp_list.append(_)
            exp = slice_string(_, exp)
        else:
            print('ERROR getting states')
            break
    n_states = len(exp_list)
    # >>>PRINTOUT<<<
    if print_type == 3:
        print('GET STATES OUT\n states:\n {}\n # of states: {}'.format(exp_list, n_states))
    return exp_list

def finite_state(exp_list, seq, level = 0, pattern_builder = '', iter_count = 0, pat_match = [], seq_match = []):
    # >>>PRINTOUT<<<
    if (print_type == 3):
        print('\nSTARTING MACHINE\n EXP: {}\n SEQ: {}\n LEVEL: {}\n matched: {}\n pat_match: {}'.format(exp_list, seq, level, pattern_builder, pat_match))
    # patterns
    m_gro = r'\^*\((?:[^()]+|(?R))*+\)({.+?})*\$*'
    m_set = r'\^*\[.+?\]({.+?})*\$*'
    m_alt = r'\|'
    m_squ = r'\{(.),(.)\}'
    m_lit = r'\^*[.\w]({.+?})*\$*|\$'
    # set state, n_state
    state = 0
    n_states = len(exp_list)
    #save_state = []
    #save_expression = []
    # temp exp
    local_seq = seq
    # >>>PRINTOUT<<<
    if print_type == 3:
        print('\n >>>MACHINE START')
    # set failure cap so no endless loop
    failure_cap = 1000
    # since len(exp_list) returns + 1 over iteration (0 index) use the last 'state' as success state
    while state != n_states:
        for exp in exp_list:
            # iterations
            iter_count+=1
            # >>>PRINTOUT<<<
            if print_type == 3:
                print(' iteration count: {}'.format(iter_count))
            # >>>PRINTOUT<<<
            if print_type == 3:
                print('\n evaluating: {}\n for string: {}'.format(exp, local_seq))
            # alternation reset
            if len(alt_states(exp)) > 0:
                # get operand options
                operands = alt_states(exp)
                # create temporary exp list
                temp_list = exp_list[state+1:]
                # add level
                level = level + 1
                # >>>PRINTOUT<<<
                if print_type == 3:
                    print(' ALT MATCH: {}\n state: {}\n opers returned: {}\n level in: {}'.format(exp, state, operands, level))
                # compile local alternation
                for oper in operands:
                    # get substates
                    _ = get_states(oper)
                    # compile list
                    oper_list = _ + temp_list
                    # send to finite_state, sublevel
                    alt_status, pats = finite_state(oper_list, local_seq, level = level, pattern_builder=pattern_builder, iter_count=iter_count, pat_match=pat_match)
                    if alt_status == 'success':
                        return alt_status, pats
            # group cycle
            elif r.match(m_gro, exp) is not None:
                # get operand options
                operands = group_states(exp)
                # create temporary exp list
                temp_list = exp_list[state+1:]
                # add level
                level = level + 1
                # >>>PRINTOUT<<<
                if print_type == 3:
                    print(' GROUP MATCH: {}\n state: {}\n opers returned: {}\n level in: {}'.format(exp, state, operands, level))
                # compile local
                oper_list = operands + temp_list
                # send to finite_state, sublevel
                group_status, pats = finite_state(oper_list, local_seq, level=level, pattern_builder=pattern_builder, iter_count=iter_count, pat_match=pat_match)
                if group_status == 'success':
                    return group_status, pats
            # quantifier reset
            elif r.search(m_squ, exp) is not None:
                # get operand options
                operands = quant_states(exp)
                # create temporary exp list
                temp_list = exp_list[state+1:]
                # add level
                level = level + 1
                # >>>PRINTOUT<<<
                if print_type == 3:
                    print(' QUANT MATCH: {}\n state: {}\n opers returned: {}\n level in: {}'.format(exp, state, operands, level))
                # compile local, trying the largest repeat count first
                for oper in reversed(operands):
                    # compile list
                    oper_list = [oper] + temp_list
                    # send to finite_state, sublevel
                    quant_status, pats = finite_state(oper_list, local_seq, level=level, pattern_builder=pattern_builder, iter_count=iter_count, pat_match=pat_match)
                    if quant_status == 'success':
                        return quant_status, pats
            # record literal
            elif r.match(exp, local_seq) is not None:
                # add to local pattern
                m = r.match(exp, local_seq).group(0)
                local_seq = slice_string(m, local_seq)
                # >>>PRINTOUT<<<
                if print_type == 3:
                    print(' state transition: {}\n state {} ==> {} of {}'.format(exp, state, state+1, n_states))
                # iterate state for match
                pattern_builder = pattern_builder + exp
                pat_match = pat_match + [(exp, m)]
                state += 1
            elif r.match(exp, local_seq) is None:
                # >>>PRINTOUT<<<
                if print_type == 3:
                    print(' Return FAIL on {}, level: {}, state: {}'.format(exp, level, state))
                status = 'fail'
                return status, pattern_builder
            # machine success
            if state == n_states:
                # >>>PRINTOUT<<<
                if print_type == 3:
                    print(' MACHINE SUCCESS\n level: {}\n state: {}\n exp: {}'.format(level, state, pattern_builder))
                status = 'success'
                return status, pat_match
            # timeout
            if iter_count == failure_cap:
                state = n_states
                # >>>PRINTOUT<<<
                if print_type == 3:
                    print('===============\nFAILURE CAP MET\n level: {}\n exp state: {}\n==============='.format(level, state))
                break

def group_states(exp):
    # patterns
    m_gro = r'\^*\((?:[^()]+|(?R))*+\)({.+?})*\$*'
    m_set = r'\^*\[.+?\]({.+?})*\$*'
    m_alt = r'\|'
    m_squ = r'\{(.),(.)\}'
    m_lit = r'\^*[.\w]({.+?})*\$*'
    ret_list = []
    # iterate over groups
    groups = r.finditer(m_gro, exp)
    for gr in groups:
        _ = strip_nest(gr.group())
        # alternation reset
        if r.search(m_alt, _):
            ret_list.append(_)
        else:
            _ = get_states(_)
            for thing in _:
                ret_list.append(thing)
    return ret_list

def quant_states(exp):
    # >>>PRINTOUT<<<
    if print_type == 4:
        print('\nGET QUANT STATES\n EXP: {}'.format(exp))
    squ_opr = r'(.+)\{.,.\}'
    m_squ = r'\{(.),(.)\}'
    # create states
    states_list = []
    # get operand
    operand_obj = r.finditer(squ_opr, exp)
    for match in operand_obj:
        operand = match.group(1)
    # get repetitions
    fa = r.findall(m_squ, exp)
    for m, n in fa:
        # loop through range
        for x in range(int(m), (int(n)+1)):
            # construct string
            _ = operand + '{' + str(x) + '}'
            # append to list
            states_list.append(_)
    # >>>PRINTOUT<<<
    if print_type == 4:
        print(' QUANT OUT: {}\n'.format(states_list))
    return states_list

%%time
print_type = 1
"""0:
1: includes input
2:
3: all output prints on """
dataframe_counting = 0
for x in range(len(df)):
    try:
        df_expression = df.iloc[x, 2]
        df_subsequence = df.iloc[x, 1]
        # call function
        identify_submatches(df_expression, df_subsequence)
        print(dataframe_counting)
        dataframe_counting += 1
    except Exception:
        # skip rows the parser cannot handle yet
        pass
Sample output
The output values (i.e. the subexpression and its index pair) are tab-delimited.
Expression Input: [KR]{1,4}[KR].[KR]W.
Sequence Input: TRQARRNRRRRWRERQRQIH
Subequence Match: RRRRWR
[KR]{1} (7, 8)
[KR] (8, 9)
. (9, 10)
[KR] (10, 11)
W (11, 12)
. (12, 13)
2270
Expression Input: [KR]{1,4}[KR].[KR]W.
Sequence Input: TASQRRNRRRRWKRRGLQIL
Subequence Match: RRRRWK
[KR]{1} (7, 8)
[KR] (8, 9)
. (9, 10)
[KR] (10, 11)
W (11, 12)
. (12, 13)
2271
Expression Input: [KR]{1,4}[KR].[KR]W.
Sequence Input: TRKARRNRRRRWRARQKQIS
Subequence Match: RRRRWR
[KR]{1} (7, 8)
[KR] (8, 9)
. (9, 10)
[KR] (10, 11)
W (11, 12)
. (12, 13)
2272
Expression Input: [KR]{1,4}[KR].[KR]W.
Sequence Input: LDFPSKKRKRSRWNQDTMEQ
Subequence Match: KKRKRSRWN
[KR]{4} (5, 9)
[KR] (9, 10)
. (10, 11)
[KR] (11, 12)
W (12, 13)
. (13, 14)
2273
Expression Input: [KR]{1,4}[KR].[KR]W.
Sequence Input: ASQPPSKRKRRWDQTADQTP
Subequence Match: KRKRRWD
[KR]{2} (6, 8)
[KR] (8, 9)
. (9, 10)
[KR] (10, 11)
W (11, 12)
. (12, 13)
2274
Expression Input: [KR]{1,4}[KR].[KR]W.
Sequence Input: GGATSSARKNRWDETPKTER
Subequence Match: RKNRWD
[KR]{1} (7, 8)
[KR] (8, 9)
. (9, 10)
[KR] (10, 11)
W (11, 12)
. (12, 13)
2275
Expression Input: [KR]{1,4}[KR].[KR]W.
Sequence Input: PTPGASKRKSRWDETPASQM
Subequence Match: KRKSRWD
[KR]{2} (6, 8)
[KR] (8, 9)
. (9, 10)
[KR] (10, 11)
W (11, 12)
. (12, 13)
2276
Expression Input: [VMILF][MILVFYHPA][^P][TASKHCV][AVSC][^P][^P][ILVMT][^P][^P][^P][LMTVI][^P][^P][LMVCT][ILVMCA][^P][^P][AIVLMTC]
Sequence Input: LLNAATALSGSMQYLLNYVN
Subequence Match: LLNAATALSGSMQYLLNYV
[VMILF] (0, 1)
[MILVFYHPA] (1, 2)
[^P] (2, 3)
[TASKHCV] (3, 4)
[AVSC] (4, 5)
[^P] (5, 6)
[^P] (6, 7)
[ILVMT] (7, 8)
[^P] (8, 9)
[^P] (9, 10)
[^P] (10, 11)
[LMTVI] (11, 12)
[^P] (12, 13)
[^P] (13, 14)
[LMVCT] (14, 15)
[ILVMCA] (15, 16)
[^P] (16, 17)
[^P] (17, 18)
[AIVLMTC] (18, 19)
2277
Expression Input: [VMILF][MILVFYHPA][^P][TASKHCV][AVSC][^P][^P][ILVMT][^P][^P][^P][LMTVI][^P][^P][LMVCT][ILVMCA][^P][^P][AIVLMTC]
Sequence Input: IFEASKKVTNSLSNLISLIG
Subequence Match: IFEASKKVTNSLSNLISLI
[VMILF] (0, 1)
[MILVFYHPA] (1, 2)
[^P] (2, 3)
[TASKHCV] (3, 4)
[AVSC] (4, 5)
[^P] (5, 6)
[^P] (6, 7)
[ILVMT] (7, 8)
[^P] (8, 9)
[^P] (9, 10)
[^P] (10, 11)
[LMTVI] (11, 12)
[^P] (12, 13)
[^P] (13, 14)
[LMVCT] (14, 15)
[ILVMCA] (15, 16)
[^P] (16, 17)
[^P] (17, 18)
[AIVLMTC] (18, 19)
2278
Expression Input: [VMILF][MILVFYHPA][^P][TASKHCV][AVSC][^P][^P][ILVMT][^P][^P][^P][LMTVI][^P][^P][LMVCT][ILVMCA][^P][^P][AIVLMTC]
Sequence Input: IYEKAKEVSSALSKVLSKID
Subequence Match: IYEKAKEVSSALSKVLSKI
[VMILF] (0, 1)
[MILVFYHPA] (1, 2)
[^P] (2, 3)
[TASKHCV] (3, 4)
[AVSC] (4, 5)
[^P] (5, 6)
[^P] (6, 7)
[ILVMT] (7, 8)
[^P] (8, 9)
[^P] (9, 10)
[^P] (10, 11)
[LMTVI] (11, 12)
[^P] (12, 13)
[^P] (13, 14)
[LMVCT] (14, 15)
[ILVMCA] (15, 16)
[^P] (16, 17)
[^P] (17, 18)
[AIVLMTC] (18, 19)
2279
Expression Input: [VMILF][MILVFYHPA][^P][TASKHCV][AVSC][^P][^P][ILVMT][^P][^P][^P][LMTVI][^P][^P][LMVCT][ILVMCA][^P][^P][AIVLMTC]
Sequence Input: IYKAAKDVTTSLSKVLKNIN
Subequence Match: IYKAAKDVTTSLSKVLKNI
[VMILF] (0, 1)
[MILVFYHPA] (1, 2)
[^P] (2, 3)
[TASKHCV] (3, 4)
[AVSC] (4, 5)
[^P] (5, 6)
[^P] (6, 7)
[ILVMT] (7, 8)
[^P] (8, 9)
[^P] (9, 10)
[^P] (10, 11)
[LMTVI] (11, 12)
[^P] (12, 13)
[^P] (13, 14)
[LMVCT] (14, 15)
[ILVMCA] (15, 16)
[^P] (16, 17)
[^P] (17, 18)
[AIVLMTC] (18, 19)
2280
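In the `[KR]{1,4}` rows above, the bounded quantifier is reported with a single fixed count ({1}, {4}, {2}). A minimal sketch of that idea follows, assuming only one `{m,n}` quantifier per pattern and using the standard `re` module: find the overall match, then try fixed counts from the largest down until the fixed pattern matches the whole matched substring. `resolve_bounded` is a hypothetical name chosen for illustration; this is not the code from the notebook.

```python
import re

# A bounded quantifier such as {1,4}; assumes at most one per pattern.
BOUNDED = re.compile(r'\{(\d+),(\d+)\}')


def resolve_bounded(pattern, sequence):
    """Return the pattern with {m,n} replaced by the fixed {k} that fits.

    Returns None if the pattern has no {m,n} quantifier or does not match.
    """
    q = BOUNDED.search(pattern)
    overall = re.search(pattern, sequence)
    if q is None or overall is None:
        return None
    low, high = int(q.group(1)), int(q.group(2))
    # Try the largest repeat count first, mirroring greedy matching.
    for count in range(high, low - 1, -1):
        fixed = pattern[:q.start()] + '{%d}' % count + pattern[q.end():]
        # The fixed pattern must account for the entire matched substring.
        if re.fullmatch(fixed, overall.group()):
            return fixed
    return None


if __name__ == '__main__':
    print(resolve_bounded('[KR]{1,4}[KR].[KR]W.', 'TRQARRNRRRRWRERQRQIH'))
    print(resolve_bounded('[KR]{1,4}[KR].[KR]W.', 'LDFPSKKRKRSRWNQDTMEQ'))
```

For the three sample sequences checked in the test below, the resolved counts agree with the `{1}`, `{4}` and `{2}` rows in the output above.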
Data from: ELM (the Eukaryotic Linear Motif resource for Functional Sites in Proteins) 2020. Retrieved from http://elm.eu.org/searchdb.html
For "python - Find out what each part of a regular expression matches", see the original question on Stack Overflow: https://stackoverflow.com/questions/61196282/