
python - A question about web scraping with py bs4

Reposted · Author: 行者123 · Updated: 2023-12-01 07:45:35

I am trying to scrape weather data from the web to learn the basics of scraping, but I am running into problems with the HTML structure the site uses.

I have stepped through the nested structure inside the HTML page, and I can display the first row of data by printing d["precip"]. I don't know why the iteration stops reading it from the second loop onward; printing i on every pass shows the loop itself is still running normally.

Result of the first loop:

{'date': '19:30', 'hourly-date': 'Thu', 
'hidden-cell-sm description': 'Mostly Cloudy',
'temp': '26°', 'feels': '30°', 'precip': '15%',
'humidity': '84%', 'wind': 'SSE 12 km/h '}

After the first loop:

{'date': 'None', 'hourly-date': 'None', 
'hidden-cell-sm description': 'None',
'temp': 'None', 'feels': 'None', 'precip': 'None',
'humidity': 'None', 'wind': 'None'}

HTML side: the values "10" and "%" are what I want to scrape. I got them on the first iteration, but I don't know why they turn into "None" on the second.

<td class="precip" headers="precip" data-track-string="ls_hourly_ls_hourly_toggle" classname="precip">
  <div>
    <span class="icon icon-font iconset-weather-data icon-drop-1" classname="icon icon-font iconset-weather-data icon-drop-1"></span>
    <span class="">
      <span>
        10
        <span class="Percentage__percentSymbol__2Q_AR">%</span>
      </span>
    </span>
  </div>
</td>
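To confirm where the value sits, the fragment above can be parsed on its own. This is a standalone sketch that only uses the HTML copied above (non-essential attributes trimmed), with no network access:

```python
from bs4 import BeautifulSoup

# The <td> fragment copied from the page (non-essential attributes trimmed).
fragment = """
<td class="precip" headers="precip">
  <div>
    <span class="icon icon-font iconset-weather-data icon-drop-1"></span>
    <span class="">
      <span>
        10
        <span class="Percentage__percentSymbol__2Q_AR">%</span>
      </span>
    </span>
  </div>
</td>
"""

td = BeautifulSoup(fragment, "html.parser").td
# Collapse all text nodes in the cell: the "10" and the "%" live in
# nested <span>s under the single <div>.
value = " ".join(td.get_text().split())
print(value)  # -> 10 %
```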

Python code:

import requests
import pandas
from bs4 import BeautifulSoup

page = requests.get("https://weather.com/en-IN/weather/hourbyhour/l/0fcc6b573ec19677819071ea104e175b6dfc8f942f59554bc99d10c5cd0dbfe8")
content = page.content
soup = BeautifulSoup(content, "html.parser")
total = []
container = []
#all = soup.find("div", {"class": "locations-title hourly-page-title"}).find("h1").text
table = soup.find_all("table", {"class": "twc-table"})
for items in table:
    for i in range(len(items.find_all("tr")) - 1):
        d = {}
        try:
            d["date"] = items.find_all("span", {"class": "dsx-date"})[i].text
            d["hourly-date"] = items.find_all("div", {"class": "hourly-date"})[i].text
            d["hidden-cell-sm description"] = items.find_all("td", {"class": "hidden-cell-sm description"})[i].text
            d["temp"] = items.find_all("td", {"class": "temp"})[i].text
            d["feels"] = items.find_all("td", {"class": "feels"})[i].text

            #issue starts from here
            inclass = items.find_all("td", {"class": "precip"})[i]
            realtext = inclass.find_all("div", "")[i]
            d["precip"] = realtext.find_all("span", {"class": ""})[i].text
            #issue end

            d["humidity"] = items.find_all("td", {"class": "humidity"})[i].text
            d["wind"] = items.find_all("td", {"class": "wind"})[i].text

        except:
            d["date"] = "None"
            d["hourly-date"] = "None"
            d["hidden-cell-sm description"] = "None"
            d["temp"] = "None"
            d["precip"] = "None"
            d["feels"] = "None"
            d["precip"] = "None"
            d["humidity"] = "None"
            d["wind"] = "None"

        total.append(d)

df = pandas.DataFrame(total)
df = df.rename(index=str, columns={"date": "Date", "hourly-date": "weekdays", "hidden-cell-sm description": "Description"})
df = df.reindex(columns=['Date', 'weekdays', 'Description', 'temp', 'feels', 'percip', 'humidity', 'wind'])
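A side note on the marked section: the bare except hides the real failure. Replaying the lookup against a minimal two-row table (hypothetical markup, not the live page, and with the "div" filter simplified) shows that each precip cell contains exactly one div, so reusing the row counter i as an index inside the cell raises IndexError for every i >= 1, and the except branch then fills the whole row with "None":

```python
from bs4 import BeautifulSoup

# Two rows shaped like the real table (hypothetical minimal markup).
html = """
<table class="twc-table">
  <tr><td class="precip"><div><span class=""><span>10<span>%</span></span></span></div></td></tr>
  <tr><td class="precip"><div><span class=""><span>15<span>%</span></span></span></div></td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").table

results = []
for i in range(2):
    inclass = table.find_all("td", {"class": "precip"})[i]  # i-th row: fine
    try:
        # Bug: inclass is one <td> holding ONE <div>, yet the row counter
        # i is reused as an index inside it.
        realtext = inclass.find_all("div")[i]
        results.append((i, "ok"))
    except IndexError:
        results.append((i, "IndexError"))
print(results)  # -> [(0, 'ok'), (1, 'IndexError')]
```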

I expect to scrape all of the data, but as shown above "precip" is missing while the other fields come through fine. For more information, please see the result:

      Date weekdays    Description temp feels percip humidity         wind
0    19:30      Thu  Mostly Cloudy  26°   30°    NaN      84%  SSE 12 km/h
1    20:00      Thu  Mostly Cloudy  26°   30°    NaN      86%  SSE 11 km/h
2    21:00      Thu  Mostly Cloudy  26°   30°    NaN      86%  SSE 12 km/h
3    22:00      Thu  Mostly Cloudy  26°   29°    NaN      86%  SSE 12 km/h
4    23:00      Thu         Cloudy  26°   29°    NaN      87%  SSE 12 km/h
5    00:00      Fri         Cloudy  26°   29°    NaN      87%    S 12 km/h
6    01:00      Fri         Cloudy  26°   26°    NaN      88%    S 12 km/h
7    02:00      Fri         Cloudy  26°   26°    NaN      87%    S 12 km/h
8    03:00      Fri         Cloudy  29°   35°    NaN      87%    S 12 km/h
9    04:00      Fri  Mostly Cloudy  29°   35°    NaN      87%    S 12 km/h
10   05:00      Fri  Mostly Cloudy  28°   35°    NaN      87%  SSW 11 km/h
11   06:00      Fri  Mostly Cloudy  28°   34°    NaN      88%  SSW 11 km/h
12   07:00      Fri  Mostly Cloudy  29°   35°    NaN      87%  SSW 10 km/h
13   08:00      Fri  Mostly Cloudy  29°   36°    NaN      84%  SSW 12 km/h
14   09:00      Fri  Mostly Cloudy  29°   37°    NaN      82%  SSW 13 km/h
15   10:00      Fri  Partly Cloudy  30°   37°    NaN      81%  SSW 14 km/h

I'm new to this and eager to learn. Please tell me how I can improve the structure of my code. Many thanks.

Best answer

Your precip lookup finds nothing, which is exactly what your result shows. To fix it, you can target the class Percentage__percentSymbol__2Q_AR and then use its previous_sibling to extract the content you want. I have tried to show you only the part you were having trouble with below.

import requests
import pandas
from bs4 import BeautifulSoup

page = requests.get("https://weather.com/en-IN/weather/hourbyhour/l/0fcc6b573ec19677819071ea104e175b6dfc8f942f59554bc99d10c5cd0dbfe8")
soup = BeautifulSoup(page.text, "html.parser")
total = []
for tr in soup.find("table", class_="twc-table").tbody.find_all("tr"):
    d = {}
    d["date"] = tr.find("span", class_="dsx-date").text
    d["precip"] = tr.find("span", class_="Percentage__percentSymbol__2Q_AR").previous_sibling
    total.append(d)

df = pandas.DataFrame(total, columns=['date', 'precip'])
print(df)
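One caveat on this approach: Percentage__percentSymbol__2Q_AR looks like a build-generated class name, so it may change whenever the site redeploys. A more change-tolerant sketch (my assumption, not part of the original answer) collapses the text of the whole precip cell instead of navigating to the hashed class:

```python
from bs4 import BeautifulSoup

# One row shaped like the live table (hypothetical minimal markup).
row = """
<tr>
  <td class="precip">
    <div>
      <span class="icon"></span>
      <span class=""><span>10<span class="Percentage__percentSymbol__2Q_AR">%</span></span></span>
    </div>
  </td>
</tr>
"""
tr = BeautifulSoup(row, "html.parser").tr

# Join all text in the cell and drop whitespace: no hashed class needed.
precip = "".join(tr.find("td", class_="precip").get_text().split())
print(precip)  # -> 10%
```

The same expression drops into the accepted answer's loop as d["precip"] for each tr.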

On python - a question about web scraping with py bs4, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56476561/
