
Python BeautifulSoup - ignoring child tags and IDs

Reposted. Author: 太空宇宙. Updated: 2023-11-03 20:18:56

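I have a file with the following structure.

I want to find all the parent tags, that is, all tags whose ID consists only of digits, together with the text they contain. However, what I currently get is a flat structure of all the a tags (parents and children alike).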

<A ID=101>
<a id=”A1”>Today is a nice day.
<a id=”A2”>Today is a very nice day.
<a id=”A3”>Today is a very very nice day.
</A>

<A ID=102>
<a id=”A1”>Today is a nice day2.
<a id=”A2”>Today is a very nice day2.
<a id=”A3”>Today is a very very nice day2.
</A>

I want only the following, ignoring all child tags and their IDs. How can I extract it like this?

<A ID=101>
Today is a nice day.
Today is a very nice day.
Today is a very very nice day.
</A>

<A ID=102>
Today is a nice day2.
Today is a very nice day2.
Today is a very very nice day2.
</A>

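Best answer

The code below does what you want, provided that the child and parent tags have different names and are not merely upper- and lowercase versions of each other (here the parent tags have been renamed to B):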

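from bs4 import BeautifulSoup

html = """
<B ID=101>
<a id=”A1”>Today is a nice day.
<a id=”A2”>Today is a very nice day.
<a id=”A3”>Today is a very very nice day.
</B>

<B ID=102>
<a id=”A1”>Today is a nice day2.
<a id=”A2”>Today is a very nice day2.
<a id=”A3”>Today is a very very nice day2.
</B>
"""
# Tags to strip while keeping their contents. "html" and "body" are the
# wrapper tags that the lxml parser adds around the fragment.
invalid_tags = ["a", "html", "body"]
soup = BeautifulSoup(html, "lxml")
for tag in invalid_tags:
    # find_all / unwrap are the current names for findAll / replaceWithChildren
    for match in soup.find_all(tag):
        match.unwrap()  # remove the tag itself but keep its children
print(soup)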
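
As a complementary sketch (not part of the original answer), the parents with purely numeric IDs can also be selected directly and their text collected per parent. It assumes the renamed B parents from the snippet above, straight quotes in place of the curly ones, and the built-in "html.parser" backend; the helper name text_by_numeric_id is hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Sample data modelled on the answer's snippet (straight quotes are an assumption).
html = """
<B ID=101>
<a id="A1">Today is a nice day.
<a id="A2">Today is a very nice day.
<a id="A3">Today is a very very nice day.
</B>

<B ID=102>
<a id="A1">Today is a nice day2.
<a id="A2">Today is a very nice day2.
<a id="A3">Today is a very very nice day2.
</B>
"""

soup = BeautifulSoup(html, "html.parser")


def text_by_numeric_id(soup):
    """Map each digits-only ID to the non-empty text lines inside that tag."""
    result = {}
    # Tag names are lowercased by the HTML parser, so the parent is "b" here.
    for parent in soup.find_all("b", id=re.compile(r"^\d+$")):
        lines = [line.strip() for line in parent.get_text().splitlines()]
        result[parent["id"]] = [line for line in lines if line]
    return result


grouped = text_by_numeric_id(soup)
for parent_id, lines in grouped.items():
    print(parent_id, lines)
```

This leaves the parsed tree untouched and only reads text from it, which may be preferable when the child tags are still needed elsewhere.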

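This restriction exists because BeautifulSoup treats the input as HTML by default. HTML is case-insensitive, so all tag names are lowercased during parsing.

If you need tag matching to be case-sensitive, parse the document as XML instead. Install lxml (via pip) and tell BeautifulSoup to use that parser in XML mode, for example: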
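
The lowercasing is easy to observe. The following minimal sketch uses a shortened, hypothetical one-line variant of the question's data and the built-in "html.parser" backend; after parsing, the parent A and the child a carry the same tag name, so they can no longer be told apart by name.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical variant of the question's data.
snippet = "<A ID=101><a id=A1>Today is a nice day.</a></A>"
soup = BeautifulSoup(snippet, "html.parser")

# Collect every tag name as BeautifulSoup sees it after parsing.
names = [tag.name for tag in soup.find_all(True)]
outer_attrs = soup.find_all(True)[0].attrs

print(names)        # both tags are reported with the lowercase name "a"
print(outer_attrs)  # the attribute name ID is lowercased as well
```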

soup = BeautifulSoup(source, 'xml')
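
A small sketch of what XML mode changes, assuming lxml is installed. Note that the question's sample is not well-formed XML (unquoted attributes, unclosed a tags), so this example uses a cleaned-up, hypothetical version with a single root element; how the parser would cope with the original markup is not shown here.

```python
from bs4 import BeautifulSoup

# Hypothetical well-formed version of the question's data.
source = """<root>
<A ID="101"><a id="A1">Today is a nice day.</a></A>
<A ID="102"><a id="A1">Today is a nice day2.</a></A>
</root>"""

soup = BeautifulSoup(source, "xml")

# In XML mode tag and attribute names keep their case,
# so "A" and "a" are distinct tag names.
parents = soup.find_all("A")
children = soup.find_all("a")
parent_ids = [parent["ID"] for parent in parents]

for parent in parents:
    print(parent["ID"], parent.get_text())
```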

This post, Python BeautifulSoup - ignoring child tags and IDs, is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/58273254/
