python - Getting text without tags using BeautifulSoup

Reposted. Author: 行者123. Updated: 2023-12-01 03:49:55

I'm trying to use BeautifulSoup to get some text that is not wrapped in a tag of its own. I tried .string, .contents, .text, .find(text=True), and .next_sibling, all of which are listed below.

Edit: Never mind, I just noticed that .next_sibling works for me. Either way, this question can serve as a record that collects ways of handling similar cases.

from bs4 import BeautifulSoup
s = """
<p>
<a>
Something I can fetch but don't want
</a>
I want to fetch this line.
<a>
Something else I can fetch but don't want
</a>
</p>
"""

p = BeautifulSoup(s, 'html.parser')
print(p.contents)
# ['\n', <p>
# <a>
# Something
# </a>
# I want to fetch this line.
# <a>
# Something else
# </a>
# </p>, '\n']

# p is the whole document here, so step from the first <a> to the text node after it.
print(p.a.next_sibling.string)
# I want to fetch this line. (with surrounding newlines)
print(p.string)
# None
print(p.text)
# All the text, including the parts I can fetch but don't want.
print(p.find(text=True))
# Returns a blank line (the leading '\n') of type bs4.element.NavigableString
print(p.find(text=True)[0])
# Returns a blank line of type str

Is there an easier way to get the line I want than manually parsing the string s?
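
The .next_sibling route mentioned in the edit above can be sketched as follows. This is a minimal illustration, assuming the wanted text sits directly after the first <a> tag; the variable names html, soup, node and line are chosen here for the example and are not from the original post.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = """
<p>
<a>
Something I can fetch but don't want
</a>
I want to fetch this line.
<a>
Something else I can fetch but don't want
</a>
</p>
"""

soup = BeautifulSoup(html, "html.parser")

# soup.a is the first <a> tag; its next sibling is the bare text node after </a>.
node = soup.a.next_sibling

# Guard against the sibling being another tag (or None) rather than text.
line = node.strip() if isinstance(node, NavigableString) else None
print(line)
```

The strip() call removes the newlines that surround the text node in this markup. If the layout differs (for example, another tag directly follows the first <a>), the guard leaves line as None instead of raising.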

Best answer

Try this. It is still rough, but at least it does not require you to parse the string manually.

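# Collect every non-empty string in the document, with surrounding whitespace stripped.
texts = [x.strip() for x in p.strings if x.strip()]

# Collect the stripped text of every tag; for the <a> tags this is exactly the unwanted text.
unwanted_text = [x.text.strip() for x in p.find_all()]

# Take the difference: what remains is the text that no tag wraps on its own.
set(texts).difference(unwanted_text)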

This produces:

In [87]: set(texts).difference(unwanted_text)
Out[87]: {'I want to fetch this line.'}
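
One caveat with the set difference is that a set drops ordering and duplicates, and an untagged line would be lost if its text happened to equal the text of some tag. Where the wanted text is a direct child of a known tag (here <p>), a possible alternative is to walk that tag's children and keep only the text nodes. The sketch below assumes that layout; direct_text is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = """
<p>
<a>
Something I can fetch but don't want
</a>
I want to fetch this line.
<a>
Something else I can fetch but don't want
</a>
</p>
"""


def direct_text(tag):
    """Return the stripped, non-empty text nodes that are direct children of tag."""
    return [
        child.strip()
        for child in tag.children
        # Text nodes are NavigableString instances; nested tags such as <a> are skipped.
        if isinstance(child, NavigableString) and child.strip()
    ]


soup = BeautifulSoup(html, "html.parser")
result = direct_text(soup.p)
print(result)
```

Because the result is a list built in document order, several untagged lines inside the same parent would come back in the order they appear.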

Regarding "python - Getting text without tags using BeautifulSoup", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/38423542/
