
python - How to scrape a structured table from a web page using BeautifulSoup?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 19:55:37

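I have the following code to scrape a website ( https://www.vesselfinder.com/vessels/STENAWECO-ENERGY-IMO-9683984-MMSI-538005270 ). Several tables on the page share similar class names, so it is hard to pinpoint the class of the table I need in order to scrape its data into a CSV file. How can I make sure I am scraping the right information?

My code is: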

import requests
from bs4 import BeautifulSoup

agent = {"User-Agent":'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
urlFile = requests.get('https://www.vesselfinder.com/vessels/STENAWECO-ENERGY-IMO-9683984-MMSI-538005270', headers = agent)

soupHtml = BeautifulSoup(urlFile.content, 'lxml')

AISVessel = []  # values collected for one vessel
AIStable = []   # nested list, one inner list per vessel

# Every table with class "tparams" is matched, not only the AIS one
rowsFind = soupHtml.find_all("table",{"class": "tparams"})
print(rowsFind)
for i in rowsFind:
    z = i.find_all("tr")
    for r in z:
        cols = r.find_all('td' , 'v3')
        cols = [x.text.strip() for x in cols]
        print(cols)
        AISVessel.append(cols[0])  # fails when a row has no td.v3 cell

AIStable.append(AISVessel)

Now I get this error:

IndexError: list index out of range
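
A likely cause, assuming the page contains rows without a td.v3 cell (header rows, for example): find_all returns an empty list for such a row, and cols[0] on an empty list raises IndexError. The sketch below reproduces the situation on a small hypothetical HTML fragment (not the real page) and skips empty rows before indexing.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the real page: the first row has
# no td.v3 cell, the other two rows have one each.
html = """
<table class="tparams">
  <tr><th colspan="2">AIS Data</th></tr>
  <tr><td class="n3">Ship type</td><td class="v3">Tanker</td></tr>
  <tr><td class="n3">Flag</td><td class="v3"> Marshall Islands </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

values = []
for table in soup.find_all("table", {"class": "tparams"}):
    for row in table.find_all("tr"):
        cols = [td.get_text(strip=True) for td in row.find_all("td", "v3")]
        if not cols:
            # No td.v3 in this row, so cols[0] would raise IndexError.
            continue
        values.append(cols[0])

print(values)
```

The guard removes the exception, but it does not tell the AIS table apart from the other tparams tables on the page.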

The desired output would be:

[['Tanker' , 'Marshall Islands' , 'USHOU > DOSPM' , 'Jan 3, 19:00' , '9683984 / 538005270' , '  V7CJ5', '183 / 32 m' ,  '11.4 m' ,' 115.4° / 13.5 kn ' , '19.60436 N/80.84751 W' , 'Jan 1, 2020 07:38 UTC']]

I want to append the relevant data shown above to a nested list so that it can be written to a CSV file.

Best answer

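To find the right table, you can use the CSS selector h2:contains("AIS Data") ~ table.tparams td.v3. It selects every <td> with class v3 inside a table.tparams that follows, as a sibling, an <h2> whose text contains "AIS Data":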

import requests
from bs4 import BeautifulSoup

agent = {"User-Agent":'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
urlFile = requests.get('https://www.vesselfinder.com/vessels/STENAWECO-ENERGY-IMO-9683984-MMSI-538005270', headers = agent)

soupHtml = BeautifulSoup(urlFile.content, 'lxml')

out = [td.get_text(strip=True) for td in soupHtml.select('h2:contains("AIS Data") ~ table.tparams td.v3')]

print(out)

Prints:

['Tanker', 'Marshall Islands', 'USHOU > DOSPM', 'Jan 3, 19:00', '9683984 / 538005270', 'V7CJ5', '183 / 32 m', '11.4 m', '115.4° / 13.5 kn', '19.60436 N/80.84751 W', 'Jan 1, 2020 07:38 UTC']
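
The question also asks for a nested list that can be written to CSV. A minimal sketch of that last step follows, run against a hypothetical HTML fragment rather than the live page. It assumes Soup Sieve 2.1 or later, where :contains is deprecated in favour of :-soup-contains. The fragment, its placeholder values, and the in-memory buffer are illustrative assumptions, not part of the original answer.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical fragment: two tables share the class "tparams", but only
# the second one follows an <h2> containing "AIS Data".
html = """
<section>
  <h2>Vessel Particulars</h2>
  <table class="tparams">
    <tr><td class="n3">Some field</td><td class="v3">placeholder</td></tr>
  </table>
</section>
<section>
  <h2>AIS Data</h2>
  <table class="tparams">
    <tr><td class="n3">Ship type</td><td class="v3">Tanker</td></tr>
    <tr><td class="n3">Flag</td><td class="v3">Marshall Islands</td></tr>
  </table>
</section>
"""

soup = BeautifulSoup(html, "html.parser")

# Same selector as in the answer, with the non-deprecated pseudo-class.
selector = 'h2:-soup-contains("AIS Data") ~ table.tparams td.v3'
row = [td.get_text(strip=True) for td in soup.select(selector)]

# One inner list per vessel, as requested in the question.
AIStable = [row]

# Write to an in-memory buffer so that the sketch has no side effects.
buffer = io.StringIO()
csv.writer(buffer).writerows(AIStable)
print(buffer.getvalue())
```

To write a real file instead, the buffer can be replaced with something like open("ais.csv", "w", newline="", encoding="utf-8"), where the file name is only an example.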

Regarding python - How to scrape a structured table from a web page using BeautifulSoup?, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/59556147/
