
python - Beautiful Soup does not return the full table from an HTML object

Reposted. Author: 太空宇宙. Updated: 2023-11-04 00:48:09

I have this web page: http://waterdata.usgs.gov/nwis/wys_rpt?dd_parm_cds=002_00060&wys_water_yr=2015&site_no=06935965

I want to scrape this information from it:

[image]

The information is stored in a table; here are its first two rows with the header expanded:

[image]

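So I set things up, learned a bit of BeautifulSoup, and found my table (it is the last table on the page, hence tables[-1]). But it does not pick up the whole table: it stops after the "Annual total" row name/entry.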

Code:

from bs4 import BeautifulSoup
import requests


base_url = 'http://waterdata.usgs.gov/nwis/wys_rpt?dd_parm_cds=002_00060&wys_water_yr=2015&site_no='
site = '06935965'

url = base_url + site
r = requests.get(url)

soup = BeautifulSoup(r.text,'html.parser')
tables = soup.find_all('table')

table = tables[-1]

print(table.text)

Output:

SUMMARY STATISTICS



Water Year 2015
Water Years 2000 - 2015




Annual total

That's all! Yet the whole table is present in the response returned by the requests call:

<table class='tables'>
<caption class='table_caption'>SUMMARY STATISTICS</caption>
<thead class='thead'>
<tr>
<th class='tables_th'></th>
<th class='tables_th' colspan='2'>Water Year 2015</th>
<th class='tables_th' colspan='2'>Water Years 2000 - 2015</th>
</tr>
</thead>
<tbody>
<tr>
<td class='tables_date'>Annual total</th>
<td>41,170,000<span class='padding'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td></td>
<td><span class='padding'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td></td>
</tr>
<tr>
<td class='tables_date'>Annual mean</th>
<td>112,800<span class='padding'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td></td>
<td>87,520<span class='padding'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td></td>
</tr>
<tr>
<td class='tables_date'>Highest annual mean</th>
<td><span class='padding'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td></td>
<td>154,900<span class='padding'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td>2010</td>
</tr>
<tr>
<td class='tables_date'>Lowest annual mean</th>
<td><span class='padding'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td></td>
<td>42,090<span class='padding'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td>2006</td>
</tr>
<tr>
<td class='tables_date'>Highest daily mean</th>
<td>342,000<span class='padding'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td>Jun 20</td>
<td>398,000<span class='padding'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td>Jun 02, 2013</td>
</tr>
<tr>
<td class='tables_date'>Lowest daily mean</th>
<td>40,600<span class='padding'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td>Jan 19</td>
<td>22,900<span class='padding'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td>Jan 26, 2003</td>
</tr>
<tr>
<td class='tables_date'>Annual 7-day minimum</th>
<td>41,410.0<span class='padding'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td>Jan 18</td>
<td>23,630.0<span class='padding'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td>Jan 24, 2003</td>
</tr>
<tr>
<td class='tables_date'>Maximum peak flow</th>
<td>344,000<span class='padding'><sup>a</sup>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td>Jun 20</td>
<td>409,000<span class='padding'><sup>a</sup>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td>Jun 02, 2013</td>
</tr>
<tr>
<td class='tables_date'>Maximum peak stage</th>
<td>31.76<span class='padding'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td>Jun 20</td>
<td>33.80<span class='padding'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td>Jun 02, 2013</td>
</tr>
<tr>
<td class='tables_date'>Annual runoff (cfsm)</th>
<td>0.215<span class='padding'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td></td>
<td>0.165<span class='padding'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td></td>
</tr>
<tr>
<td class='tables_date'>Annual runoff (inches)</th>
<td>2.92<span class='padding'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td></td>
<td>2.25<span class='padding'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td></td>
</tr>
<tr>
<td class='tables_date'>10 percent exceeds</th>
<td>249,000<span class='padding'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td></td>
<td>173,000<span class='padding'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td></td>
</tr>
<tr>
<td class='tables_date'>50 percent exceeds</th>
<td>76,800<span class='padding'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td></td>
<td>63,100<span class='padding'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td></td>
</tr>
<tr>
<td class='tables_date'>90 percent exceeds</th>
<td>50,500<span class='padding'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td></td>
<td>35,700<span class='padding'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></td>
<td></td>
</tr>
</tbody>
</table>

Can anyone see why Beautiful Soup ignores the rest of the table?

Best answer

You just need to change the parser to a more lenient one, for example html5lib (a separate package that has to be installed):

soup = BeautifulSoup(r.text, 'html5lib')
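
A likely reason a lenient parser is needed is visible in the markup from the question: the first cell of every body row opens with <td> but is closed with </th>. The sketch below uses only the standard library to flag such end tags; it is a diagnostic aid, not part of Beautiful Soup, and the names EndTagChecker and find_mismatched_end_tags are made up for this illustration. It also assumes that an end tag with no matching open element was meant to close the innermost open element, which is a heuristic.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they are not tracked as open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class EndTagChecker(HTMLParser):
    """Records end tags that do not match the innermost open element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []    # stack of currently open tag names
        self.mismatches = []   # (line, innermost_open_tag, end_tag) tuples

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        innermost = self.open_tags[-1] if self.open_tags else None
        if tag == innermost:
            self.open_tags.pop()
            return
        line, _ = self.getpos()
        self.mismatches.append((line, innermost, tag))
        if tag in self.open_tags:
            # The element is open further up: unwind the stack down to it.
            while self.open_tags.pop() != tag:
                pass
        elif self.open_tags:
            # Heuristic (assumption): treat the stray end tag as a
            # misspelled closer for the innermost open element.
            self.open_tags.pop()


def find_mismatched_end_tags(html):
    checker = EndTagChecker()
    checker.feed(html)
    checker.close()
    return checker.mismatches


# A reduced copy of one row from the page in the question.
sample = """<table>
<tr>
<td class='tables_date'>Annual total</th>
<td>41,170,000</td>
</tr>
</table>"""

for line, open_tag, end_tag in find_mismatched_end_tags(sample):
    print(f"line {line}: </{end_tag}> found while <{open_tag}> is open")
# prints: line 3: </th> found while <td> is open
```

Running the same check on r.text from the question should list one such line per body row, which matches the point where the html.parser output stops.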

lxml can also handle this case:

soup = BeautifulSoup(r.text, 'lxml')
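
To make "more lenient" concrete, here is a dependency-free sketch of one simple recovery rule: every <td> or <th> start tag implicitly closes the previous cell, and any cell end tag closes the current cell even if its name does not match. This is not the actual code of html5lib, lxml, or Beautiful Soup, only an illustration of the idea; LenientTableRows and extract_rows are hypothetical names, and the sample is a shortened copy of the table from the question.

```python
from html.parser import HTMLParser


class LenientTableRows(HTMLParser):
    """Collects table rows as lists of cell strings, tolerating bad end tags."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def _finish_cell(self):
        if self._cell is not None and self._row is not None:
            # strip() also removes the non-breaking spaces used as padding
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _finish_row(self):
        self._finish_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._finish_row()
            self._row = []
        elif tag in ("td", "th"):
            self._finish_cell()   # implicit close of the previous cell
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "tr":
            self._finish_row()
        elif tag in ("td", "th"):
            self._finish_cell()   # a </th> closing a <td> is accepted

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def extract_rows(html):
    parser = LenientTableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = """<table>
<thead><tr><th></th><th colspan='2'>Water Year 2015</th></tr></thead>
<tbody>
<tr>
<td class='tables_date'>Annual total</th>
<td>41,170,000<span class='padding'>&nbsp;&nbsp;</span></td>
<td></td>
</tr>
<tr>
<td class='tables_date'>Annual mean</th>
<td>112,800<span class='padding'>&nbsp;&nbsp;</span></td>
<td></td>
</tr>
</tbody>
</table>"""

for row in extract_rows(sample):
    print(row)
```

In practice, switching Beautiful Soup to html5lib or lxml as shown above is the simpler route; the sketch only shows why rows after "Annual total" become reachable once the mismatched end tags are tolerated.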

Regarding "python - Beautiful Soup does not return the full table from an HTML object", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/38365799/
