
python - Extracting HTML comments in Python using regular expressions or lxml?

Reposted. Author: 行者123. Updated: 2023-11-30 22:52:21

How can I extract all HTML-style comments from a document using Python?

I tried using a regular expression:

import re

text = 'hello, world <!-- comment -->'
re.match('<!--(.*?)-->', text)

But it returns nothing. I don't understand this, because the same regular expression works fine on the same string at https://regex101.com/.

Update: my document is actually an XML file, and I am parsing it with pyquery (which is based on lxml), but I don't think lxml can extract comments that aren't inside a node. The document looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<clinical_study rank="220398">
<intervention_browse>
<!-- CAUTION: The following MeSH terms are assigned with an imperfect algorithm -->
<mesh_term>Freund's Adjuvant</mesh_term>
<mesh_term>Keyhole-limpet hemocyanin</mesh_term>
</intervention_browse>
<!-- Results have not yet been posted for this study -->
</clinical_study>

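Update 2: Thanks for the other answers, but I have already used lxml extensively to parse the document and don't want to rewrite everything with BeautifulSoup. I have updated the title accordingly.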

Best answer

This seems to print the comment for me:

from lxml import etree

# Bytes literal: on Python 3, lxml rejects a str that carries an encoding declaration.
txt = b"""<?xml version="1.0" encoding="UTF-8"?>
<clinical_study rank="220398">
<intervention_browse>
<!-- CAUTION: The following MeSH terms are assigned with an imperfect algorithm -->
<mesh_term>Freund's Adjuvant</mesh_term>
<mesh_term>Keyhole-limpet hemocyanin</mesh_term>
</intervention_browse>
<!-- Results have not yet been posted for this study -->
</clinical_study>"""
root = etree.XML(txt)
# root[0] is <intervention_browse>; its first child is the comment node.
print(root[0][0])
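Indexing with root[0][0] relies on knowing exactly where the comment sits. As a minimal sketch, assuming every comment of interest lives somewhere under the root element, root.iter(etree.Comment) walks the whole tree and yields only comment nodes. The names xml_bytes and comment_texts are illustrative, and the document is shortened from the one in the question.

```python
from lxml import etree

# Shortened copy of the question's document, as bytes so that the
# encoding declaration is accepted on Python 3.
xml_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<clinical_study rank="220398">
<intervention_browse>
<!-- CAUTION: The following MeSH terms are assigned with an imperfect algorithm -->
<mesh_term>Freund's Adjuvant</mesh_term>
</intervention_browse>
<!-- Results have not yet been posted for this study -->
</clinical_study>"""

root = etree.XML(xml_bytes)

# iter() with etree.Comment as the tag filter visits comment nodes at any
# depth below the root, in document order.
comment_texts = [comment.text.strip() for comment in root.iter(etree.Comment)]

for text in comment_texts:
    print(text)
```

This avoids hard-coding positions, so it keeps working if elements are added or reordered around the comments.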


To get the last comment:

# Collect the comment nodes that are direct children of the root element.
comments = [itm for itm in root if itm.tag is etree.Comment]
if comments:
    print(comments[-1])
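As for the regular expression in the question: re.match only tries to match at the very beginning of the string, and the sample text starts with "hello, world", so it returns None. Online testers such as regex101 presumably scan the whole string, which would explain the difference. A minimal sketch of the same pattern with re.search and re.findall follows; COMMENT_RE, first and all_comments are illustrative names, and re.DOTALL is an added assumption so that comments spanning several lines are matched too.

```python
import re

text = 'hello, world <!-- comment -->'

# Same pattern as in the question; DOTALL lets "." cross line breaks.
COMMENT_RE = re.compile(r'<!--(.*?)-->', re.DOTALL)

# match() is anchored at position 0, where the text is "hello", so no match.
anchored = COMMENT_RE.match(text)

# search() finds the first occurrence anywhere in the string.
first = COMMENT_RE.search(text)

# findall() returns the captured group for every occurrence.
all_comments = COMMENT_RE.findall(text)

print(anchored)
print(first.group(1))
print(all_comments)
```

For documents already parsed with lxml, working with the comment nodes is generally more robust than a regex, since the pattern knows nothing about CDATA sections or other markup that may contain comment-like text.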

Regarding "python - Extracting HTML comments in Python using regular expressions or lxml?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38616592/
