
Python: find duplicates and preserve comment strings

Reposted. Author: 行者123. Updated: 2023-12-01 04:48:44

The input looks like this:

assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note string 1
assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note string 2
assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note string 3
assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4
assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5

Notice that lines 4 and 5 are duplicates; only (resid 44 and name H ) and (resid 53 and name H ) are swapped. My ideal output would look like this:

assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note string 1
assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note string 2
assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note string 3
assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! DUPLICATE ! note string 4 ! note string 5

So I started with the typical way of reading a file in Python.

with open(filename) as txt:
    lines = txt.readlines()

print(lines[0])

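I obviously need to capture the strings between the parentheses and then do some kind of search. Capturing them with a regular expression was child's play. My idea was to use match[0] and match[1] in a nested loop and search for them. My failed attempt was: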

for i in lines:
    # match = re.search("\\(.*?\\)", i)
    match = re.findall('\\(.*?\\)', i)
    for x in i:
        mm = re.search("match[0] match[1]", lines)
        print ( mm )

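Printing match[0] and match[1] gives me what I want. What is the best way to do this search so that I can keep and carry over the comment flags? I imagine adding DUPLICATE to the comment string would be trivial.

I am really only interested in a Python solution. I also need to use this in a 400-line program I have been writing.

Thanks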

Best answer

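Someone more proficient with regular expressions may be able to point you to a better way of getting the key, but storing a tuple as the key and checking whether it, or its reverse, already exists should work: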

lines = """assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note string 1
assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note string 2
assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note string 3
assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4
assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5"""

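import re
from pprint import pprint as pp

d = {}

# r2 grabs everything from the first "(" to the last ")" on the line;
# r1 splits that text on the whitespace that directly follows a ")".
r1 = re.compile(r"(?<=\))\s")
r2 = re.compile(r"\(.*\)")

for line in lines.splitlines():
    key = tuple(r1.split(r2.findall(line)[0]))
    # ("foo","bar") == ("bar","foo") , also check current key is not in d
    if tuple(reversed(key)) not in d and key not in d:
        d[key] = line

pp(list(d.values()))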

Output:

['assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note '
'string 3',
'assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note '
'string 2',
'assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note '
'string 1',
'assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note '
'string 4']
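
As a possible simplification of the key lookup above, the two selections can be pulled out directly and sorted, so that a swapped pair produces the same key and a single `in` check is enough. This is only a sketch: the names `SELECTION` and `pair_key` are hypothetical, and it assumes the selections never contain nested parentheses and that everything after the first "!" is a comment.

```python
import re

# Matches one parenthesised selection, e.g. "(resid 44 and name H )".
# Assumption: selections do not contain nested parentheses.
SELECTION = re.compile(r"\([^)]*\)")


def pair_key(line):
    """Return an order-independent key for the selections on a line."""
    # Only look at the text before the first "!" so that parentheses
    # inside a comment are ignored.
    body = line.partition("!")[0]
    # Sorting makes (A, B) and (B, A) produce the same tuple.
    return tuple(sorted(SELECTION.findall(body)))


a = "assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4"
b = "assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5"

print(pair_key(a) == pair_key(b))  # True
```

Note that the comparison is purely textual, so selections that differ only in spacing would still count as different keys.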

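If order matters, use collections.OrderedDict (on Python 3.7 and later a plain dict also preserves insertion order). I am not sure exactly what you want to add to the string, but this appends DUPLICATE followed by the duplicate line's comment to the value already stored for the key: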

import re
from collections import OrderedDict
from pprint import pprint as pp

d = OrderedDict()

r1 = re.compile(r"(?<=\))\s")
r2 = re.compile(r"\(.*\)")
for line in lines.splitlines():
    key = tuple(r1.split(r2.findall(line)[0]))
    # (resid 44 and name H ) (resid 53 and name H ) -> (resid 53 and name H ) (resid 44 and name H )
    rev_k = tuple(reversed(key))
    # The last four whitespace-separated tokens are the comment, e.g. "! note string 5".
    if rev_k in d:
        d[rev_k] += " DUPLICATE " + " ".join(line.rsplit(None, 4)[1:])
    elif key in d:
        d[key] += " DUPLICATE " + " ".join(line.rsplit(None, 4)[1:])
    else:
        d[key] = line

pp(list(d.values()))

Output:

['assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note '
'string 1',
'assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note '
'string 2',
'assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note '
'string 3',
'assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note '
'string 4 DUPLICATE ! note string 5']
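
The expression `line.rsplit(None, 4)[1:]` used above keeps the last four whitespace-separated tokens, which fits comments shaped like "! note string 5" but not comments of other lengths. The sketch below compares it with `str.partition("!")`, assuming the first "!" on a line always starts the comment; the `longer` line is a made-up example, not part of the original input.

```python
line = "assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5"

# rsplit(None, 4) splits on whitespace at most four times, starting from
# the right, so the last four tokens come back as separate items.
by_count = " ".join(line.rsplit(None, 4)[1:])

# partition("!") splits at the first "!" however long the comment is.
_, bang, comment = line.partition("!")
by_marker = bang + comment

print(by_count)   # ! note string 5
print(by_marker)  # ! note string 5

# Hypothetical line with a five-token comment instead of four.
longer = "assign (resid 1 and name H ) (resid 2 and name H ) 2.5 2.5 2.5 ! a much longer note"
long_by_count = " ".join(longer.rsplit(None, 4)[1:])
long_by_marker = "!" + longer.partition("!")[2]

print(long_by_count)   # a much longer note   (the "!" is lost)
print(long_by_marker)  # ! a much longer note
```

If the comments in the real file vary in length, splitting on the marker is probably the safer of the two.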

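Depending on what you want to do, you could instead store a list per key and append each duplicate line together with DUPLICATE and its comment, so the original line seen before any duplicate is the first element and the rest are the DUPLICATE entries: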

lines = """assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note string 1
assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note string 2
assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note string 3
assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4
assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5
assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 6"""

import re
from collections import defaultdict
from pprint import pprint as pp

d = defaultdict(list)
r1 = re.compile(r"(?<=\))\s")
r2 = re.compile(r"\(.*\)")

for line in lines.splitlines():
    key = tuple(r1.split(r2.findall(line)[0]))
    rev_k = tuple(reversed(key))
    if rev_k in d:
        d[rev_k].append(line + " DUPLICATE " + " ".join(line.rsplit(None, 4)[1:]))
    elif key in d:
        # append, not +=: adding a string to a list with += would extend it character by character
        d[key].append(line + " DUPLICATE " + " ".join(line.rsplit(None, 4)[1:]))
    else:
        d[key].append(line)

pp(list(d.values()))

Output:

[['assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note '
'string 1'],
['assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note '
'string 4',
'assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note '
'string 5 DUPLICATE ! note string 5',
'assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note '
'string 6 DUPLICATE ! note string 6'],
['assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note '
'string 3'],
['assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note '
'string 2']]
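
For the exact format requested in the question ("! DUPLICATE ! note string 4 ! note string 5"), one possible sketch follows. The function name `merge_duplicates` is hypothetical, and the code makes several assumptions: everything after the first "!" is the comment, blank lines can be dropped, the restraint values of the first occurrence are the ones to keep, and Python 3.7+ is used so that a plain dict preserves insertion order.

```python
import re

# One "( ... )" selection; assumes no nested parentheses.
SELECTION = re.compile(r"\([^)]*\)")


def merge_duplicates(lines):
    """Merge lines whose selections match in either order, keeping all comments."""
    merged = {}  # key -> [restraint text, list of comments, occurrence count]
    for line in lines:
        if not line.strip():
            continue  # assumption: blank lines can be dropped
        body, bang, comment = line.rstrip("\n").partition("!")
        # Sorted tuple: (A, B) and (B, A) map to the same key.
        key = tuple(sorted(SELECTION.findall(body)))
        entry = merged.setdefault(key, [body.rstrip(), [], 0])
        entry[2] += 1
        note = (bang + comment).strip()
        if note:
            entry[1].append(note)

    result = []
    for body, notes, count in merged.values():
        parts = [body]
        if count > 1:
            parts.append("! DUPLICATE")
        parts.extend(notes)
        result.append(" ".join(parts))
    return result


sample = """assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note string 1
assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4
assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5"""

result = merge_duplicates(sample.splitlines())
for row in result:
    print(row)
```

Because the function accepts any iterable of lines, it could also be fed the list returned by `readlines()` on an open file.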

Regarding "Python: find duplicates and preserve comment strings", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/28863863/
