
python - BeautifulSoup - search by text inside a tag

Reposted. Author: IT老高. Updated: 2023-10-28 22:03:11

Consider the following problem:

import re
from bs4 import BeautifulSoup as BS

soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
Edit
</a>
""")

# This returns the <a> element
soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*")
)

soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""")

# This returns None
soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*")
)

For some reason, BeautifulSoup will not match the text when the <i> tag is there as well. Finding the tag and showing its text produces

>>> a2 = soup.find(
...     'a',
...     href="/customer-menu/1/accounts/1/update"
... )
>>> print(repr(a2.text))
'\n Edit\n'

Right. According to the docs, soup uses the regular expression's match function, not its search function. So I need to provide the DOTALL flag:

pattern = re.compile('.*Edit.*')
pattern.match('\n Edit\n') # Returns None

pattern = re.compile('.*Edit.*', flags=re.DOTALL)
pattern.match('\n Edit\n') # Returns MatchObject
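To make the difference between matching and searching concrete, here is a small standard-library sketch (the variable names are only illustrative). Without DOTALL, `.` does not match a newline, so `match` fails at the leading `\n`, while `search` is free to start later in the string.

```python
import re

text = '\n Edit\n'
pattern = re.compile('.*Edit.*')

# match() is anchored at position 0, where '.' cannot consume the newline.
anchored = pattern.match(text)

# search() may begin at any position, so it finds ' Edit' on the second line.
scanned = pattern.search(text)

print(anchored)          # None
print(scanned.group(0))  # ' Edit'
```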

Good. Looks fine. Let's try it with soup:

soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""")

soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*", flags=re.DOTALL)
)  # Still returns None... Why?!

EDIT

My solution, based on geckon's answer: I implemented these helpers:

import re

MATCH_ALL = r'.*'


def like(string):
    """
    Return a compiled regular expression that matches the given
    string with any prefix and postfix, e.g. if string = "hello",
    the returned regex matches r".*hello.*"
    """
    string_ = string
    if not isinstance(string_, str):
        string_ = str(string_)
    regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
    return re.compile(regex, flags=re.DOTALL)


def find_by_text(soup, text, tag, **kwargs):
    """
    Find the tag in soup that matches all provided kwargs, and contains the
    text.

    If no match is found, return None.
    If more than one match is found, raise ValueError.
    """
    elements = soup.find_all(tag, **kwargs)
    matches = []
    for element in elements:
        # Search the text nodes inside each candidate tag.
        if element.find(text=like(text)):
            matches.append(element)
    if len(matches) > 1:
        # Tags must be converted to str before they can be joined.
        raise ValueError("Too many matches:\n" + "\n".join(str(m) for m in matches))
    elif len(matches) == 0:
        return None
    else:
        return matches[0]

Now, when I want to find the element above, I just run find_by_text(soup, 'Edit', 'a', href='/customer-menu/1/accounts/1/update')

Best answer

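The problem is that your <a> tag, with the <i> tag inside it, doesn't have the string attribute you expect it to have. But first, let's look at what the text="" argument of find() does.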

NOTE: The text argument is an old name; since BeautifulSoup 4.4.0 it is called string.
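As a minimal sketch of the newer spelling, the snippet below passes the regular expression through `string=` instead of `text=`. The one-line document and the `/x` href are made up for illustration, and `html.parser` is chosen only so the example does not depend on an extra parser being installed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: a link whose only child is a text node.
soup = BeautifulSoup('<a href="/x">Edit</a>', "html.parser")

# `string=` is the current name of the older `text=` keyword.
link = soup.find("a", string=re.compile("Edit"))
print(link["href"])  # /x
```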

From the docs:

Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string. This code finds the tags whose .string is “Elsie”:

soup.find_all("a", string="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

Now let's look at what a Tag's string attribute is (again from the docs):

If a tag has only one child, and that child is a NavigableString, the child is made available as .string:

title_tag.string
# u'The Dormouse's story'

(...)

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None:

print(soup.html.string)
# None

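That is exactly your case. Your <a> tag contains both a text node and an <i> tag, so its .string is None. When find() tries to match your pattern against that string it gets None, and therefore nothing matches.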
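The behaviour described above can be checked directly. This is a small sketch with made-up one-line documents (the `/x` href and the variable names are assumptions, not part of the question); it compares `.string` on a link with a single text child against a link that also contains an <i> tag.

```python
from bs4 import BeautifulSoup

# One child (a text node): .string is that text.
simple = BeautifulSoup('<a href="/x">Edit</a>', "html.parser").a

# Two children (an <i> tag and a text node): .string is None.
nested = BeautifulSoup(
    '<a href="/x"><i class="fa fa-edit"></i> Edit</a>', "html.parser"
).a

print(simple.string)            # Edit
print(nested.string)            # None
print(repr(nested.get_text()))  # ' Edit'
```

Note that `get_text()` still returns the combined text of the nested link, which is why printing `a2.text` in the question looked correct even though the search failed.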

How to solve this?

Maybe there is a better solution, but I would probably go with something like this:

import re
from bs4 import BeautifulSoup as BS

soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""")

links = soup.find_all('a', href="/customer-menu/1/accounts/1/update")

# Keep the first link that contains a text node matching "Edit".
thelink = None
for link in links:
    if link.find(text=re.compile("Edit")):
        thelink = link
        break

print(thelink)

I don't think there are many links pointing to /customer-menu/1/accounts/1/update, so it should be fast enough.
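One possible alternative, which is not part of the original answer, is to pass a function to find() and test the tag's full text with get_text(). The predicate name `is_edit_link` is hypothetical, and `html.parser` is used here only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""", "html.parser")


def is_edit_link(tag):
    """Hypothetical predicate: an <a> with the expected href whose text contains 'Edit'."""
    return (
        tag.name == "a"
        and tag.get("href") == "/customer-menu/1/accounts/1/update"
        and "Edit" in tag.get_text()
    )


# find() calls the predicate on each tag and returns the first one it accepts.
link = soup.find(is_edit_link)
print(link["href"])
```

Because get_text() concatenates all descendant text, this check does not depend on how many children the tag has.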

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/31958637/
