python - How do I extract the text inside markup tags?

Reposted. Author: 行者123. Updated: 2023-12-04 07:47:59

I have the following document and want to extract all of the category tags.
Input: a variable named doc that holds unstructured text.

doc = """Like APC , <category="Modifier">APC2</category> regulates the formation of active betacatenin-Tcf
complexes , as demonstrated using transient transcriptional activation assays in APC - / -
<category="Modifier">colon carcinoma</category> cells. Human APC2 maps to chromosome 19p13 . 3 .
APC and APC2 may therefore have comparable functions in development
and <category="SpecificDisease">cancer</category>"""
Output: it should look like this:
{
'Modifier': ['APC2', 'colon carcinoma'],
'SpecificDisease': ['cancer']
}
This should be automated so that all category tags in a corpus can be extracted.

I tried the following code:
from bs4 import BeautifulSoup

soup = BeautifulSoup(doc, 'html.parser')
contents = soup.find_all('category')
but I don't know how to extract each tag.

Best answer

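BeautifulSoup cannot parse this kind of document, because a tag such as <category="Modifier"> attaches a value directly to the tag name and is therefore not well-formed HTML or XML. As a workaround, you can use the re module, for example: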

import re

doc = """Like APC , <category="Modifier">APC2</category> regulates the formation of active betacatenin-Tcf
complexes , as demonstrated using transient transcriptional activation assays in APC - / -
<category="Modifier">colon carcinoma</category> cells. Human APC2 maps to chromosome 19p13 . 3 .
APC and APC2 may therefore have comparable functions in development
and <category="SpecificDisease">cancer</category>"""

# Map each category label to the list of texts tagged with it
out = {}
for c, t in re.findall(r'<category="(.*?)">(.*?)</category>', doc):
    out.setdefault(c, []).append(t)

print(out)
Prints:
{'Modifier': ['APC2', 'colon carcinoma'], 'SpecificDisease': ['cancer']}
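
The question also asks for this to work across a whole corpus. Below is a minimal sketch of how the same regular expression might be wrapped for that purpose. The function name extract_categories, the sample corpus list, and the choice to collapse whitespace inside tagged text are all assumptions made for illustration, not part of the original answer. The re.DOTALL flag is added so that tagged text spanning a line break is still matched.

```python
import re
from collections import defaultdict

# Assumed tag shape, taken from the question: <category="LABEL">TEXT</category>
CATEGORY_RE = re.compile(r'<category="([^"]*)">(.*?)</category>', re.DOTALL)


def extract_categories(docs):
    """Collect tagged text per category label across an iterable of strings."""
    out = defaultdict(list)
    for doc in docs:
        for label, text in CATEGORY_RE.findall(doc):
            # Collapse line breaks and runs of spaces into single spaces
            out[label].append(" ".join(text.split()))
    return dict(out)


# Hypothetical corpus: three fragments of the document from the question
corpus = [
    'Like APC , <category="Modifier">APC2</category> regulates the formation',
    'in APC - / - <category="Modifier">colon\ncarcinoma</category> cells.',
    'development and <category="SpecificDisease">cancer</category>',
]

print(extract_categories(corpus))
# {'Modifier': ['APC2', 'colon carcinoma'], 'SpecificDisease': ['cancer']}
```

Compiling the pattern once avoids re-parsing it for every document, and defaultdict plays the same role as setdefault in the answer above. A regular expression like this does not handle nested category tags.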

Regarding "python - How do I extract the text inside markup tags?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/67115910/