
python - XML tag text to string, ignoring child tags but including their text

Reposted. Author: 太空宇宙. Updated: 2023-11-04 03:24:08

I am parsing XML data that looks like this:

<title-group><article-title>Leucine to proline substitution by SNP at position 197 in Caspase-9 gene expression leads to neuroblastoma: a bioinformatics analysis</article-title></title-group>

Sometimes, though, the title contains italic tags:

<title-group><article-title><italic>Interferon regulatory factor 5</italic> genetic variants are associated with cardiovascular disease in patients with rheumatoid arthritis</article-title></title-group>

The following Python code returns the correctly concatenated title string, but only when the italic tag is not at the start of the title (as it is in the example above):

    # Get titles
    for node in tree.iter('title-group'):
        for subnode in node.iter('article-title'):
            try:
                title = remove_control_characters(subnode.text)
                if len(title) == 0:
                    for subsubnode in node.iter('italic'):
                        italic = subsubnode.text
                        tail = remove_control_characters(subsubnode.tail)
                        title += italic + tail
                        title = str(title)
                    break
            except:
                continue
            for subsubnode in node.iter('italic'):
                italic = subsubnode.text
                tail = remove_control_characters(subsubnode.tail)
                title += italic + tail
                title = str(title)

When the italic tag is at the start of the string, nothing is returned.

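Is there a simpler approach (other than lxml) that I could use? Alternatively, suggested changes to the Python code would also be much appreciated. Suggestions are welcome, and have a nice day.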

Edit [Solved]

    # Get titles
    for node in tree.iter('title-group'):
        for subnode in node.iter('article-title'):
            # Start from an empty string for every title
            title = ''
            whole = subnode.itertext()
            for parts in whole:
                title += parts
            print(remove_control_characters(title))
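The snippet above depends on a `tree` object and a `remove_control_characters` helper that are not shown in the post. The following is a minimal runnable sketch of the same idea using only the standard library. The helper body, the `<article>` wrapper element, and the `extract_titles` function name are assumptions made for illustration, not the original author's code.

```python
import unicodedata
import xml.etree.ElementTree as ET


def remove_control_characters(text):
    """Hypothetical stand-in for the helper used in the post.

    Drops every character whose Unicode category starts with 'C'
    (control, format, and similar), which includes newlines and tabs.
    """
    return "".join(ch for ch in text if unicodedata.category(ch)[0] != "C")


# The two sample titles from the question, wrapped in an assumed root element.
SAMPLE = (
    "<article>"
    "<title-group><article-title>Leucine to proline substitution by SNP at "
    "position 197 in Caspase-9 gene expression leads to neuroblastoma: "
    "a bioinformatics analysis</article-title></title-group>"
    "<title-group><article-title><italic>Interferon regulatory factor 5"
    "</italic> genetic variants are associated with cardiovascular disease "
    "in patients with rheumatoid arthritis</article-title></title-group>"
    "</article>"
)


def extract_titles(root):
    """Collect the full text of every article-title, ignoring child tags."""
    titles = []
    for node in root.iter("title-group"):
        for subnode in node.iter("article-title"):
            # itertext() yields the element's text and all nested text in order
            title = "".join(subnode.itertext())
            titles.append(remove_control_characters(title))
    return titles


root = ET.fromstring(SAMPLE)
titles = extract_titles(root)
for t in titles:
    print(t)
```

Joining the pieces with `"".join(...)` avoids having to initialize and accumulate a `title` variable by hand.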

Best answer

Use the itertext() method on the <article-title> element and you should be fine.
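As a small illustration of why this works where the original loop did not, the sketch below inspects a shortened version of the second title with `xml.etree.ElementTree`. When a child tag comes first, the parent's `.text` is `None`, and the rest of the title is stored in the child's `.tail`; `itertext()` walks both in document order. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring(
    "<article-title><italic>Interferon regulatory factor 5</italic>"
    " genetic variants</article-title>"
)
italic = elem.find("italic")

# No text precedes the first child, so the parent's .text is None.
leading_text = elem.text

# The italic content is the child's .text; what follows it is the child's .tail.
italic_text = italic.text
italic_tail = italic.tail

# itertext() yields every text fragment in document order, skipping the tags.
pieces = list(elem.itertext())
full_title = "".join(pieces)

print(repr(leading_text))
print(pieces)
print(full_title)
```

This also suggests why the original code returned nothing: passing `None` to a string-processing helper would most likely raise an exception, which the bare `except: continue` then silently swallowed.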

Regarding "python - XML tag text to string, ignoring child tags but including their text", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33631061/
