Python regex to extract citations from a paper

Reposted. Author: 行者123. Updated: 2023-12-02 19:16:28

I am adapting this code to extract citations from text:

#!/usr/bin/env python3
# https://stackoverflow.com/a/16826935

import re
from sys import stdin

text = stdin.read()

author = "(?:[A-Z][A-Za-z'`-]+)"
etal = "(?:et al.?)"
additional = "(?:,? (?:(?:and |& )?" + author + "|" + etal + "))"
year_num = "(?:19|20)[0-9][0-9]"
page_num = "(?:, p.? [0-9]+)?" # Always optional
year = r"(?:, *"+year_num+page_num+r"| *\("+year_num+page_num+r"\))"
regex = "(" + author + additional+"*" + year + ")"

matches = re.findall(regex, text)
matches = list( dict.fromkeys(matches) )
matches.sort()

#print(matches)
print ("\n".join(matches))

However, it picks up some capitalized words as author names. For example, given the text:

Although James (2020) recognized blablabla, Smith et al. (2020) found mimimi. 
Those inconsistent results are a sign of lalala (Green, 2010; Grimm, 1990).
Also James (2020) ...

the output will be

Also James (2020)
Although James (2020)
Green, 2010
Grimm, 1990
Smith et al. (2020)

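Is there a way to "blacklist" certain words in the code above without discarding the whole match? I want it to still recognize James's work, but with "Also" and "Although" removed from the citation.

Thanks in advance.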

Best answer

You can use

author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"
additional = f"(?:,? (?:(?:and |& )?{author}|{etal}))"
year_num = "(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?" # Always optional
year = fr"(?:, *{year_num}{page_num}| *\({year_num}{page_num}\))"
regex = fr'\b(?!(?:Although|Also)\b){author}{additional}*{year}'
matches = re.findall(regex, text)

See the Python demo and the resulting regex demo.

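The main difference is the line regex = fr'\b(?!(?:Although|Also)\b){author}{additional}*{year}'. The \b(?!(?:Although|Also)\b) part makes the match fail at any word boundary where the word immediately to the right is Although or Also, so the match has to start at the next capitalized word instead.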
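To see the lookahead in isolation, here is a minimal sketch that applies the same guard to a simplified capitalized-word pattern. The sample string and the variable name `guard` are made up for illustration and are not part of the answer.

```python
import re

# Same guard as in the answer: a word boundary followed by a negative
# lookahead that rejects the whole words "Although" and "Also".
guard = r"\b(?!(?:Although|Also)\b)"

# Simplified stand-in for the author pattern: one capitalized word.
words = re.findall(guard + r"[A-Z][a-z]+", "Also James, Although Smith, Alsop")
print(words)  # ['James', 'Smith', 'Alsop']
```

The trailing \b inside the lookahead is what keeps a surname such as Alsop, which merely begins with a blacklisted word, from being rejected.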

Also, note that I escaped the dots that are meant to match literal dots, and used f-strings to make the code more compact.
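Putting the pieces together, the following is a sketch of a complete script built from the answer's pattern and the question's de-duplicate-and-sort step. The names `BLACKLIST`, `build_citation_regex` and `extract_citations` are hypothetical helpers introduced here, and building the lookahead from a list is an assumption about how one might generalize the two hard-coded words.

```python
import re

# Hypothetical blacklist; extend it with other sentence-initial words.
BLACKLIST = ["Although", "Also"]


def build_citation_regex(blacklist):
    """Compile the answer's citation pattern with a lookahead guard."""
    author = r"(?:[A-Z][A-Za-z'`-]+)"
    etal = r"(?:et al\.?)"
    additional = f"(?:,? (?:(?:and |& )?{author}|{etal}))"
    year_num = r"(?:19|20)[0-9][0-9]"
    page_num = r"(?:, p\.? [0-9]+)?"  # always optional
    year = fr"(?:, *{year_num}{page_num}| *\({year_num}{page_num}\))"
    # With an empty blacklist the lookahead would reject every word,
    # so fall back to a plain word boundary in that case.
    if blacklist:
        blocked = "|".join(re.escape(word) for word in blacklist)
        guard = fr"\b(?!(?:{blocked})\b)"
    else:
        guard = r"\b"
    return re.compile(fr"{guard}{author}{additional}*{year}")


def extract_citations(text, blacklist=BLACKLIST):
    """Return the unique citations found in text, sorted alphabetically."""
    matches = build_citation_regex(blacklist).findall(text)
    return sorted(dict.fromkeys(matches))


sample = (
    "Although James (2020) recognized blablabla, "
    "Smith et al. (2020) found mimimi.\n"
    "Those inconsistent results are a sign of lalala "
    "(Green, 2010; Grimm, 1990).\n"
    "Also James (2020) ..."
)
print("\n".join(extract_citations(sample)))
```

On the sample text from the question this prints Green, 2010, Grimm, 1990, James (2020) and Smith et al. (2020), one per line. Because the pattern contains no capturing group, findall returns the whole match for each citation.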

Regarding "Python regex to extract citations from a paper", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63632861/
