python - How can I extract an HTML tag pattern multiple times from a string?

Reposted. Author: 太空宇宙. Last updated: 2023-11-03 20:29:04

I already have a pattern and want to search a string for every match of it. When I use findall(), only the last match is returned.

The string I want to process is:

'<inventor sequence="001" designation="us-only"><addressbook><last-name>Li</last-name><first-name>Shuo</first-name><address><city>Beijing</city><country>CN</country></address></addressbook></inventor><inventor sequence="002" designation="us-only"><addressbook><last-name>Liu</last-name><first-name>Xin Peng</first-name><address><city>Beijing</city><country>CN</country></address></addressbook></inventor><inventor sequence="003" designation="us-only"><addressbook><last-name>Sun</last-name><first-name>Sheng Yan</first-name><address><city>Beijing</city><country>CN</country></address></addressbook></inventor><inventor sequence="004" designation="us-only"><addressbook><last-name>Wang</last-name><first-name>Hua</first-name><address><city>Littleton</city><state>MA</state><country>US</country></address></addressbook></inventor><inventor sequence="005" designation="us-only"><addressbook><last-name>Wang</last-name><first-name>Jun</first-name><address><city>Littleton</city><state>MA</state><country>US</country></address></addressbook></inventor>'

I tried to extract all inventors from the string with the following code.

INVENTORS_CONTENT_PATTERN = re.compile('<inventor sequence=".*" designation=".*">(.*?)</inventor>')

re.findall(INVENTORS_CONTENT_PATTERN, data)

The result I get is only the last match, not all of the inventors in the data:

['<addressbook><last-name>Wang</last-name><first-name>Jun</first-name><address><city>Littleton</city><state>MA</state><country>US</country></address></addressbook>']

Best answer

This expression may be closer to what you have in mind:

<inventor sequence="[^"]*" designation="[^"]*">(.*?)<\/inventor>
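The likely reason the original pattern returns a single item is that `.*` is greedy and can run across quote characters and tag boundaries, so the first `.*` stretches from the first opening tag to the last one. Replacing it with `[^"]*` keeps each attribute value inside its own quotes. A minimal sketch on a shortened sample (the `sample` string and the variable names are illustrative, not taken from the question):

```python
import re

# Shortened, hypothetical sample with two inventors (not the full data from the question).
sample = (
    '<inventor sequence="001" designation="us-only"><last-name>Li</last-name></inventor>'
    '<inventor sequence="002" designation="us-only"><last-name>Liu</last-name></inventor>'
)

# Greedy attribute values: ".*" may swallow earlier <inventor> elements.
greedy = re.compile(r'<inventor sequence=".*" designation=".*">(.*?)</inventor>')
# Bounded attribute values: '[^"]*' cannot cross a closing quote.
bounded = re.compile(r'<inventor sequence="[^"]*" designation="[^"]*">(.*?)</inventor>')

greedy_matches = greedy.findall(sample)
bounded_matches = bounded.findall(sample)

print(greedy_matches)   # only the last inventor's content
print(bounded_matches)  # one entry per inventor
```

Writing the attribute values as `.*?` would also stop the first match early on this data, but the negated character class states the intent more directly and avoids extra backtracking.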

Test

import re

regex = r'<inventor sequence="[^"]*" designation="[^"]*">(.*?)<\/inventor>'
test_str = """
<inventor sequence="001" designation="us-only"><addressbook><last-name>Li</last-name><first-name>Shuo</first-name><address><city>Beijing</city><country>CN</country></address></addressbook></inventor><inventor sequence="002" designation="us-only"><addressbook><last-name>Liu</last-name><first-name>Xin Peng</first-name><address><city>Beijing</city><country>CN</country></address></addressbook></inventor><inventor sequence="003" designation="us-only"><addressbook><last-name>Sun</last-name><first-name>Sheng Yan</first-name><address><city>Beijing</city><country>CN</country></address></addressbook></inventor><inventor sequence="004" designation="us-only"><addressbook><last-name>Wang</last-name><first-name>Hua</first-name><address><city>Littleton</city><state>MA</state><country>US</country></address></addressbook></inventor><inventor sequence="005" designation="us-only"><addressbook><last-name>Wang</last-name><first-name>Jun</first-name><address><city>Littleton</city><state>MA</state><country>US</country></address></addressbook></inventor>

"""
print(re.findall(regex, test_str))

Output

['<addressbook><last-name>Li</last-name><first-name>Shuo</first-name><address><city>Beijing</city><country>CN</country></address></addressbook>', '<addressbook><last-name>Liu</last-name><first-name>Xin Peng</first-name><address><city>Beijing</city><country>CN</country></address></addressbook>', '<addressbook><last-name>Sun</last-name><first-name>Sheng Yan</first-name><address><city>Beijing</city><country>CN</country></address></addressbook>', '<addressbook><last-name>Wang</last-name><first-name>Hua</first-name><address><city>Littleton</city><state>MA</state><country>US</country></address></addressbook>', '<addressbook><last-name>Wang</last-name><first-name>Jun</first-name><address><city>Littleton</city><state>MA</state><country>US</country></address></addressbook>']
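If the goal is the individual fields rather than the raw inner markup, one possible follow-up is to run a second, smaller pattern over each captured block. The sketch below assumes the tag names seen in the question's data; `parse_inventors`, `INVENTOR`, `FIELD`, and the shortened `sample` are hypothetical names introduced here for illustration.

```python
import re

# Outer pattern: capture the sequence attribute and the inner markup of each inventor.
INVENTOR = re.compile(r'<inventor sequence="([^"]*)" designation="[^"]*">(.*?)</inventor>')
# Inner pattern: simple leaf fields; \1 requires the closing tag to match the opening tag.
FIELD = re.compile(r'<(last-name|first-name|city|state|country)>([^<]*)</\1>')

def parse_inventors(data):
    """Return one dict per <inventor> element found in data."""
    rows = []
    for match in INVENTOR.finditer(data):
        row = {"sequence": match.group(1)}
        # findall yields (tag, text) pairs, which dict() turns into key/value entries.
        row.update(dict(FIELD.findall(match.group(2))))
        rows.append(row)
    return rows

# Shortened sample in the same shape as the question's data.
sample = (
    '<inventor sequence="001" designation="us-only"><addressbook>'
    '<last-name>Li</last-name><first-name>Shuo</first-name>'
    '<address><city>Beijing</city><country>CN</country></address>'
    '</addressbook></inventor>'
    '<inventor sequence="004" designation="us-only"><addressbook>'
    '<last-name>Wang</last-name><first-name>Hua</first-name>'
    '<address><city>Littleton</city><state>MA</state><country>US</country></address>'
    '</addressbook></inventor>'
)

rows = parse_inventors(sample)
for row in rows:
    print(row)
```

Fields that are absent for an inventor, such as `state` in the first record, are simply missing from that dict.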

If you wish to explore, simplify, or modify the expression, it is explained in the top-right panel of regex101.com, where you can also see how it matches against sample inputs.
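As an alternative to a regular expression, and not part of the original answer, the same data can be read with an XML parser, since the fragment appears to be well-formed apart from lacking a single root element. A minimal sketch using the standard library's `xml.etree.ElementTree`; the wrapper element name `root` and the shortened `sample` are assumptions made for this example:

```python
import xml.etree.ElementTree as ET

# Shortened sample in the same shape as the question's data.
sample = (
    '<inventor sequence="001" designation="us-only"><addressbook>'
    '<last-name>Li</last-name><first-name>Shuo</first-name>'
    '<address><city>Beijing</city><country>CN</country></address>'
    '</addressbook></inventor>'
    '<inventor sequence="004" designation="us-only"><addressbook>'
    '<last-name>Wang</last-name><first-name>Hua</first-name>'
    '<address><city>Littleton</city><state>MA</state><country>US</country></address>'
    '</addressbook></inventor>'
)

# The fragment has several top-level elements, so wrap it in a synthetic
# root element (the name is arbitrary) before parsing.
root = ET.fromstring("<root>" + sample + "</root>")

# Collect (sequence, last name, first name) for every inventor.
names = [
    (
        inv.get("sequence"),
        inv.findtext("addressbook/last-name"),
        inv.findtext("addressbook/first-name"),
    )
    for inv in root.findall("inventor")
]
print(names)
```

A parser does not depend on attribute order or spacing inside the tags, which a hand-written pattern would have to account for explicitly.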

Regarding "python - How can I extract an HTML tag pattern multiple times from a string?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57634546/
