
python - BeautifulSoup: classifying parent and child elements

Reposted. Author: 行者123. Updated: 2023-12-03 23:08:36

I have a question about BeautifulSoup in Python 3. I have spent several hours on it and still have not solved it.

Here is my soup:

print(soup.prettify())
# REMEMBER THIS SOUP IS DYNAMIC
# <html>
# <body>
# <div class="title" itemtype="http://schema.org/FoodEstablishment">
# <div class="address" itemtype="http://schema.org/PostalAddress">
# <div class="address-inset">
# <p itemprop="name">33 San Francisco</p>
# </div>
# </div>
# <div class="image">
# <img src=""/>
# <span class="subtitle">image subtitle</span>
# </div>
# <a itemprop="name">The Dormouse's story</a>
# </div>
# </body>
# </html>

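I need to extract the two texts marked with itemprop="name", "The Dormouse's story" and "33 San Francisco", but I also want a way to determine which itemtype the enclosing parent element declares for each of them.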

Expected output:
{
"FoodEstablishment": "The Dormouse's story",
"PostalAddress": "33 San Francisco"
}

Keep in mind that the soup is always dynamic and contains many child elements.

Best answer

I get the itemtype and the contents of each tag, then build a dictionary from them with update.

from bs4 import BeautifulSoup

html = """<html>
<body>
<div class="title" itemtype="http://schema.org/FoodEstablishment">
<div class="address" itemtype="http://schema.org/PostalAddress">
<p itemprop="name">33 San Francisco</p>
</div>
<p itemprop="name">The Dormouse's story</p>
</div>

</body>
</html>
"""
d = {}
soup = BeautifulSoup(html, 'html.parser')
# only visit divs that actually carry an itemtype attribute
for item in soup.find_all("div", itemtype=True):
    # take the last path segment of the itemtype URL
    item_type = item["itemtype"].split('/')[-1]
    # drop the bare newline strings from the tag's direct children
    children = [child for child in item.contents if child != '\n']
    # the last remaining child holds the name; map itemtype -> its text
    d[item_type] = children[-1].text

print(d)

Result: {'FoodEstablishment': "The Dormouse's story", 'PostalAddress': '33 San Francisco'}
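
The loop above relies on the name being the last direct child of each itemtype div. That holds for the simplified HTML in the answer, but not necessarily for the dynamic markup in the question, where the name can sit inside extra wrapper divs. A possible alternative, offered only as a sketch, is to start from every itemprop="name" tag and walk up to the nearest ancestor that has an itemtype attribute. The helper name names_by_itemtype is hypothetical, and the HTML is the question's soup with the subtitle span closed properly.

```python
from bs4 import BeautifulSoup

# The question's markup, including wrapper divs that have no itemtype.
html = """<html>
<body>
<div class="title" itemtype="http://schema.org/FoodEstablishment">
  <div class="address" itemtype="http://schema.org/PostalAddress">
    <div class="address-inset">
      <p itemprop="name">33 San Francisco</p>
    </div>
  </div>
  <div class="image">
    <img src=""/>
    <span class="subtitle">image subtitle</span>
  </div>
  <a itemprop="name">The Dormouse's story</a>
</div>
</body>
</html>
"""


def names_by_itemtype(soup):
    """Map each itemtype (last URL segment) to the text of its name tag.

    Hypothetical helper: assumes one itemprop="name" per itemtype scope;
    a later match for the same itemtype overwrites an earlier one.
    """
    result = {}
    # Any tag name qualifies (<p>, <a>, ...) as long as itemprop="name".
    for tag in soup.find_all(attrs={"itemprop": "name"}):
        # Nearest ancestor that carries an itemtype attribute at all.
        parent = tag.find_parent(attrs={"itemtype": True})
        if parent is None:
            continue  # name outside any itemtype scope: skip it
        key = parent["itemtype"].rstrip("/").split("/")[-1]
        result[key] = tag.get_text(strip=True)
    return result


soup = BeautifulSoup(html, "html.parser")
print(names_by_itemtype(soup))
```

Because the lookup goes from child to parent, intermediate wrappers such as address-inset do not matter, and divs without an itemtype (such as image) are never visited.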

Regarding python - BeautifulSoup: classifying parent and child elements, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60604719/
