python - Regex to remove fields from a BibTeX file

Reposted. Author: 太空宇宙. Updated: 2023-11-03 11:35:51

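I'm trying to slim down a BibTeX text file that I get from my reference manager, because it leaves extra fields that end up getting mangled when I put the file into LaTeX.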

A typical entry that I want to clean up is:

@Article{Kholmurodov:2001p113,
author = {K Kholmurodov and I Puzynin and W Smith and K Yasuoka and T Ebisuzaki},
journal = {Computer Physics Communications},
title = {MD simulation of cluster-surface impacts for metallic phases: soft landing, droplet spreading and implantation},
abstract = {Lots of text here. Even more text.},
affiliation = {RIKEN, Inst Phys {\&} Chem Res, Computat Sci Div, Adv Comp Ctr, Wako, Saitama 3510198, Japan},
number = {1},
pages = {1--16},
volume = {141},
year = {2001},
month = {Dec},
language = {English},
keywords = {Ethane, molecular dynamics, Clusters, Dl_Poly Code, solid surface, metal, Hydrocarbon Thin-Films, Adsorption, impact, Impact Processes, solid surface, Molecular Dynamics Simulation, Large Systems, DL_POLY, Beam Deposition, Package, Collision-Induced Desorption, Diamond Films, Vapor-Deposition, Transition-Metals, Molecular-Dynamics Simulation},
date-added = {2008-06-27 08:58:25 -0500},
date-modified = {2009-03-24 15:40:27 -0500},
pmid = {000172275000001},
local-url = {file://localhost/User/user/Papers/2001/Kholmurodov/Kholmurodov-MD%20simulation%20of%20cluster-surface%20impacts-2001.pdf},
uri = {papers://B08E511A-2FA9-45A0-8612-FA821DF82090/Paper/p113},
read = {Yes},
rating = {0}
}

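I want to get rid of fields such as month, abstract, and keywords. Some of them span a single line, some span several lines.

I have tried this in Python, like so: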

import re

# f is the path to the .bib file
fOpen = open(f,'r')
start_text = fOpen.read()
fOpen.close()

# regex
out_text = re.sub(r'^(month).*,\n','',start_text)
out_text = re.sub(r'^(annote)((.|\n)*?)\},\n','',out_text)
out_text = re.sub(r'^(note)((.|\n)*?)\},\n','',out_text)
out_text = re.sub(r'^(abstract)((.|\n)*?)\},\n','',out_text)

# write the result back to the same file
fNew = open(f,'w')
fNew.write(out_text)
fNew.close()

I tried running these regexes in TextMate to check that they work before trying them in Python, and they seemed fine there.

Any suggestions?

Thanks.

Best answer

How about this regex (applied with the multiline and dotall flags):

^(?:month|annote|note|abstract)\s*=\s*\{(?:(?!\},$).)*\},[\r\n]+

Explanation:

^                             # start-of-line
(?:                           # non-capturing group 1
  month|annote|note|abstract  #   one of these terms
)                             # end non-capturing group 1
\s*=\s*                       # whitespace, an equals sign, whitespace
\{                            # a literal curly brace
(?:                           # non-capturing group 2
  (?!                         #   negative look-ahead (if not followed by...)
    \},$                      #     a curly brace, a comma and the end-of-line
  )                           #   end negative look-ahead
  .                           #   ...then match next character, whatever it is
)*                            # end non-capturing group 2, repeat
\},                           # a literal curly brace and a comma
[\r\n]+                       # at least one end-of-line character

This single expression removes all of the affected lines in one step.
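For illustration, here is a minimal sketch of how the expression above could be applied in Python. In Python, `^` and `$` only match at line boundaries when `re.MULTILINE` is set, and `.` only matches newlines when `re.DOTALL` is set; the `re.sub` calls in the question pass no flags, so their `^` can only match at the very start of the file. The names `FIELD_RE` and `drop_fields` and the shortened sample entry are assumptions made for this example, not part of the original answer.

```python
import re

# The answer's pattern, compiled with the two flags it asks for.
FIELD_RE = re.compile(
    r"^(?:month|annote|note|abstract)\s*=\s*\{(?:(?!\},$).)*\},[\r\n]+",
    re.MULTILINE | re.DOTALL,
)


def drop_fields(bibtex_text):
    """Return bibtex_text with every matching field removed."""
    return FIELD_RE.sub("", bibtex_text)


# Shortened, hypothetical version of the entry from the question.
sample = (
    "@Article{Kholmurodov:2001p113,\n"
    "author = {K Kholmurodov},\n"
    "abstract = {Lots of text here.\n"
    "Even more text.},\n"
    "year = {2001},\n"
    "month = {Dec},\n"
    "language = {English}\n"
    "}\n"
)

print(drop_fields(sample))
```

If you prefer calling `re.sub` directly, pass the flags by keyword (`flags=re.MULTILINE | re.DOTALL`), because the fourth positional argument of `re.sub` is `count`, not `flags`.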


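EDIT/WARNING: note that this can fail as soon as a value contains a nested brace followed by a comma, as in the following (the match ends too early whenever such an inner "}," happens to fall at the end of a line):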

affiliation = {RIKEN, Inst Phys {\&},Computat Sci Div, Adv Comp Ctr, Wako, Saitama 3510198, Japan},

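Arbitrarily nested structures cannot be handled with regular expressions. No pure-regex solution is correct in every case here; the best you can get is a good approximation.

The question is whether you are 100% sure that the situation above will never occur (I don't think you are), or whether you are willing to take the risk. If you are not completely sure that it won't be a problem, use or write a parser.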
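As a rough sketch of the "write a parser" route, the function below finds the start of each unwanted field with a regex but locates its end by counting brace depth, so nested braces inside a value do not cut the removal short. It assumes that values are delimited by braces (not quotes), that braces are balanced, and that each field name starts a line. The name `strip_fields` and the sample entry are hypothetical, and this is not a complete BibTeX parser.

```python
import re


def strip_fields(text, unwanted=("month", "annote", "note", "abstract")):
    """Remove brace-delimited fields, using brace counting to find their end."""
    names = "|".join(re.escape(name) for name in unwanted)
    start_re = re.compile(
        r"^[ \t]*(?:%s)\s*=\s*\{" % names, re.MULTILINE | re.IGNORECASE
    )
    # Optional comma and the rest of the line after the closing brace.
    tail_re = re.compile(r"[ \t]*,?[ \t]*\r?\n?")

    kept = []
    pos = 0
    while True:
        match = start_re.search(text, pos)
        if match is None:
            kept.append(text[pos:])
            break
        kept.append(text[pos:match.start()])

        # Walk forward until the brace opened by the field is closed again.
        depth = 1
        i = match.end()
        while i < len(text) and depth > 0:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1

        pos = tail_re.match(text, i).end()
    return "".join(kept)


# Hypothetical entry whose abstract has an inner "}," at the end of a line.
sample = (
    "@Article{key,\n"
    "author = {A Author},\n"
    "abstract = {Phys {\\&},\n"
    "Chem Res.},\n"
    "month = {Dec},\n"
    "year = {2001}\n"
    "}\n"
)

print(strip_fields(sample))
```

For anything beyond a one-off cleanup, an existing BibTeX parsing library is likely a safer choice than a hand-rolled scanner like this one.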

This post is based on a similar question found on Stack Overflow, "python - Regex to remove fields from a BibTeX file": https://stackoverflow.com/questions/3558691/
