python - Find all end nodes that contain text using BeautifulSoup4

Reposted. Author: 太空宇宙. Updated: 2023-11-04 02:07:49

I am new to Python and BeautifulSoup4.

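I am trying to extract (only) the text content of all tags that are either "div", "p", or "li", and only from the direct nodes, not from their children. Hence the two options text=True, recursive=False.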

These are my attempts:

content = soup.find_all("b", "div", "p", text=True, recursive=False)

tags = ["div", "p", "li"]
content = soup.find_all(tags, text=True, recursive=False)

Neither of these produces any output. Do you know what I am doing wrong?

Edit: adding more code and the sample document I am testing against; print(content) comes out empty.

import requests
from bs4 import BeautifulSoup

url = "https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-list"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})

soup = BeautifulSoup(response.text, "html.parser")

tags = ["div", "p", "li"]
content = soup.find_all(tags, text=True, recursive=False)

print(content)

Best answer

Based on your question and your comments on the previous answer, I think you are trying to find:

  • the innermost tags

  • that are either 'p' or 'li' or 'div'

  • and that contain some text

import requests
from bs4 import BeautifulSoup
from bs4 import NavigableString

url = "https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-list"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})

soup = BeautifulSoup(response.text, "html.parser")

def end_node(tag):
    if tag.name not in ["div", "p", "li"]:
        return False
    if isinstance(tag, NavigableString):  # bare strings are never end nodes
        return False
    if not tag.text:  # no text at all
        return False
    elif len(tag.find_all(text=False)) > 0:  # contains other tags, so not an end node
        return False
    return True  # every check passed

content = soup.find_all(end_node)
print(content)  # all end nodes matching our criteria
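
The same idea can be tried offline on a small inline document, which makes it easier to see which tags pass each check. This is only a sketch under stated assumptions: SAMPLE_HTML, TARGET_TAGS and is_text_end_node are hypothetical names that are not part of the original answer. It uses tag.find(True) to detect child tags instead of find_all(text=False), since newer Beautiful Soup releases document that keyword as string= rather than text=. Unlike the answer's check, it also treats whitespace-only tags as empty.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample markup, chosen only to exercise each rule.
SAMPLE_HTML = """
<div>
  <p>Plain paragraph</p>
  <p>Paragraph with <b>bold</b> child</p>
  <ul>
    <li>First item</li>
    <li><a href="#x">Linked item</a></li>
    <li>   </li>
  </ul>
  <div></div>
  <span>Not a target tag</span>
</div>
"""

TARGET_TAGS = {"div", "p", "li"}


def is_text_end_node(tag):
    """Return True for a div/p/li that holds text and has no child tags."""
    if not isinstance(tag, Tag) or tag.name not in TARGET_TAGS:
        return False
    # Assumption: whitespace-only content counts as "no text".
    if not tag.get_text(strip=True):
        return False
    # find(True) returns the first descendant tag, or None if there is none.
    return tag.find(True) is None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
end_nodes = soup.find_all(is_text_end_node)

# The question asked for the text content, so pull it out of each match.
texts = [tag.get_text(strip=True) for tag in end_nodes]
print(texts)
```

With this sample, only the plain paragraph and the first list item should qualify: the outer div and the other candidates either contain child tags or contain no text.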

Sample output

[<p>These instructions illustrate all major features of Beautiful Soup 4,
with examples. I show you what the library is good for, how it works,
how to use it, how to make it do what you want, and what to do when it
violates your expectations.</p>, <p>The examples in this documentation should work the same way in Python
2.7 and Python 3.2.</p>, <p>This documentation has been translated into other languages by
Beautiful Soup users:</p>, <p>Here are some simple ways to navigate that data structure:</p>, <p>One common task is extracting all the URLs found within a page’s &lt;a&gt; tags:</p>, <p>Another common task is extracting all the text from a page:</p>, <p>Does this look like what you need? If so, read on.</p>, <p>If you’re using a recent version of Debian or Ubuntu Linux, you can
install Beautiful Soup with the system package manager:</p>, <p>I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it
should work with other recent versions.</p>, <p>Beautiful Soup is packaged as Python 2 code. When you install it for
use with Python 3, it’s automatically converted to Python 3 code. If
you don’t install the package, the code won’t be converted. There have
also been reports on Windows machines of the wrong version being
installed.</p>, <p>In both cases, your best bet is to completely remove the Beautiful
Soup installation from your system (including any directory created
when you unzipped the tarball) and try the installation again.</p>, <p>This table summarizes the advantages and disadvantages of each parser library:</p>, <li>Batteries included</li>, <li>Decent speed</li>,
....
]
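
As for why the attempts in the question print an empty list, one likely factor is recursive=False. With that option, find_all only inspects the direct children of the object it is called on. When it is called on the soup of a full page, those children are typically just the doctype and the html element, so no div, p or li can match. A separate problem in the first attempt is that extra positional arguments are not treated as additional tag names: the second positional argument is interpreted as an attribute filter (a CSS class when given a string). The snippet below is a minimal sketch on a hypothetical one-line document, not a diagnosis of the original page.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document standing in for a full page.
html = "<html><body><div><p>Hello</p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")
tags = ["div", "p", "li"]

# The only direct child of the soup object here is the <html> element.
top_level = [child.name for child in soup.children]

# recursive=False restricts the search to those direct children.
shallow = soup.find_all(tags, recursive=False)

# Without recursive=False, every descendant is searched.
deep = soup.find_all(tags)

print(top_level)
print(len(shallow))
print([t.name for t in deep])
```

Dropping recursive=False makes the search reach the nested tags, but it then also returns containers such as the outer div, which is why the answer filters with a function instead.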

About "python - Find all end nodes that contain text using BeautifulSoup4": we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54265391/
