
python - bs4 in Python cannot find text

Reposted. Author: 太空宇宙. Updated: 2023-11-03 15:35:24

I have an HTML document that I scraped with Beautiful Soup. An excerpt of the HTML is at the bottom of this question. I am using BeautifulSoup and Selenium.

The site tells me that I can only pull so much data per hour, and that when I get this page I need to wait a while (a good hour).

This is how I am trying to extract the data:

# Imports assumed from the names used below; the original post does not show them.
from bs4 import BeautifulSoup as bs4
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options

def get_page_data(self):
    opts = Options()
    opts.headless = True
    assert opts.headless  # Operating in headless mode
    browser_detail = Firefox(options=opts)
    url = self.base_url.format(str(self.tracking_id))
    print(url)
    browser_detail.get(url)
    self.page_data = bs4(browser_detail.page_source, 'html.parser')
    Error_Check = 1 if len(self.page_data.findAll(text='Error Report Number')) > 0 else 0
    Error_Check = 2 if len(self.page_data.findAll(text='exceeded the maximum number of sessions per hour allowed')) > 0 else Error_Check
    print(self.page_data.findAll(text='waiting an hour and trying your query again'))  ##<<--- The Problem is this line.
    print(self.page_data)
    return Error_Check

The problem is this line:

print(self.page_data.findAll(text='waiting an hour and trying your query again'))  ##<<--- The Problem is this line.

The code does not find that line in the page. What am I missing? Thanks.

<html><head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="/CMPL/styles/ogm_style.css;jsessionid=rw9pc8-bncrIy_4KSZmJ8BxN2Z2hnKVwcr79Vho4-99gxTPrxNbo!-68716939" rel="stylesheet" type="text/css"/>
<body>
<!-- Content Area -->
<table style="width:100%; margin:auto;">
<tbody><tr valign="top">
<td class="ContentArea" style="width:100%;">
<span id="messageArea">
<!-- /tiles/messages.jsp BEGIN -->
<ul>
</ul><b>
</b><table style="width:100%; margin:auto; white-space: pre-wrap; text-align: left;">
<tbody><tr><td align="left"><b><li><font color="red"></font></li></b></td>
<td align="left"><font color="red">You have exceeded the maximum number of sessions per hour allowed for the public queries. You may still access the public</font></td>
</tr>
<tr><td><font color="red"><li style="list-style: none;"></li></font></td>
<td align="left"><font color="red">queries by waiting an hour and trying your query again. The RRC public queries are provided to facilitate online research and are not intended to be accessed by automated tools or scripts. For questions or concerns please contact the RRC HelpDesk at helpdesk@rrc.state.tx.us or 512-463-7229</font></td>
</tr>
</tbody></table>
<p>....more html...</p>
</body></html>

Best answer

You can use the following CSS selector:

tr:last-child:not([valign])

from bs4 import BeautifulSoup as bs
html = '''yourHTML'''
soup = bs(html, 'lxml')
item = soup.select_one('tr:last-child:not([valign])')
print(item.text)

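If this returns multiple items, you can loop over the list and keep only the items that contain the string of interest. You could also restrict the selector to td elements and do something similar.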

items = soup.select('tr:last-child:not([valign])')
for item in items:
    if 'queries by waiting an hour' in item.text:
        print(item.text)
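The td-restricted variant mentioned above could look roughly like the sketch below. It is not part of the original answer: the helper name `cells_containing` is hypothetical, the HTML is a trimmed copy of the excerpt from the question, and `html.parser` is used instead of `lxml` only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the error table from the question's HTML excerpt.
HTML = """
<table>
<tbody><tr><td align="left"><b><li><font color="red"></font></li></b></td>
<td align="left"><font color="red">You have exceeded the maximum number of sessions per hour allowed for the public queries. You may still access the public</font></td>
</tr>
<tr><td><font color="red"><li style="list-style: none;"></li></font></td>
<td align="left"><font color="red">queries by waiting an hour and trying your query again. The RRC public queries are provided to facilitate online research.</font></td>
</tr>
</tbody></table>
"""


def cells_containing(soup, needle):
    """Return the stripped text of every <td> whose text contains `needle`.

    Hypothetical helper for illustration; it does a plain substring test
    on each cell's text instead of relying on the row position.
    """
    return [
        td.get_text(strip=True)
        for td in soup.select("td")
        if needle in td.get_text()
    ]


soup = BeautifulSoup(HTML, "html.parser")
matches = cells_containing(soup, "waiting an hour and trying your query again")
print(matches)
```

Selecting every td and filtering by substring is less tied to the table layout than `tr:last-child`, at the cost of scanning more cells.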

BeautifulSoup 4.7.1
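As a side note that the original answer does not cover: passing a plain string to `find_all(text=...)` (spelled `string=` in newer Beautiful Soup releases) only matches text nodes that are exactly equal to that string, which is a likely reason the question's lookup returns an empty list. A compiled regular expression is matched with a search, so it also finds the phrase inside a longer text node. The sketch below assumes a shortened copy of the error cell from the question; the variable names are illustrative only.

```python
import re

from bs4 import BeautifulSoup

# Shortened copy of the error cell from the question's HTML excerpt.
HTML = (
    '<td align="left"><font color="red">queries by waiting an hour and '
    "trying your query again. The RRC public queries are provided to "
    "facilitate online research.</font></td>"
)

soup = BeautifulSoup(HTML, "html.parser")
phrase = "waiting an hour and trying your query again"

# A plain string must equal the whole text node, so this finds nothing.
exact = soup.find_all(string=phrase)

# A compiled pattern is searched within each text node, so this finds it.
partial = soup.find_all(string=re.compile(re.escape(phrase)))

rate_limited = len(partial) > 0
print(len(exact), len(partial), rate_limited)
```

The same substitution would apply to the other two checks in `get_page_data`, since those phrases also sit inside longer text nodes in the excerpt.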

Regarding "python - bs4 in Python cannot find text", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55092543/
