
Python BeautifulSoup: extracting text

Reposted. Author: 太空宇宙. Updated: 2023-11-03 18:55:51

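I want to extract the bold text, which is the latest PSI reading, from this page: http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours . Does anyone know how to extract it using the code below?

I also need the two values immediately before the current PSI reading for a calculation: the total of the three values (the latest one and the two before it).

Example: the current value (in bold) is 5AM: 51, so I also need the 3AM and 4AM values. Can anyone help me with this? Thanks in advance!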

    # Python 2 (urllib2)
    from pprint import pprint
    import urllib2
    from bs4 import BeautifulSoup as soup


    url = "http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours"
    web_soup = soup(urllib2.urlopen(url))

    # The readings table is the first table inside the third nested div of div.c1
    table = web_soup.find(name="div", attrs={'class': 'c1'}).find_all(name="div")[2].find_all('table')[0]

    # Collect the stripped cell text of every table row
    table_rows = []
    for row in table.find_all('tr'):
        table_rows.append([td.text.strip() for td in row.find_all('td')])

    # Even rows hold the time labels, the following odd rows hold the readings
    data = {}
    for tr_index, tr in enumerate(table_rows):
        if tr_index % 2 == 0:
            for td_index, td in enumerate(tr):
                data[td] = table_rows[tr_index + 1][td_index]

    pprint(data)

This prints:

    {'10AM': '49',
     '10PM': '-',
     '11AM': '52',
     '11PM': '-',
     '12AM': '76',
     '12PM': '54',
     '1AM': '70',
     '1PM': '59',
     '2AM': '64',
     '2PM': '65',
     '3AM': '59',
     '3PM': '72',
     '4AM': '54',
     '4PM': '79',
     '5AM': '51',
     '5PM': '82',
     '6AM': '48',
     '6PM': '79',
     '7AM': '47',
     '7PM': '-',
     '8AM': '47',
     '8PM': '-',
     '9AM': '47',
     '9PM': '-',
     'Time': '3-hr PSI'}
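
The dictionary above is keyed by label strings, so pprint lists it alphabetically ('10AM' before '1AM') rather than in time order, and it still contains the 'Time' header entry. As a minimal sketch, assuming Python 3 and an English/C locale for AM/PM parsing, the labels could be put in chronological order like this. The helper name `hour_of` is hypothetical, and the sample values are a subset copied from the output above:

```python
from datetime import datetime

# Subset of the printed output above, including the header entry.
readings = {"10AM": "49", "3AM": "59", "12AM": "76",
            "5AM": "51", "4AM": "54", "Time": "3-hr PSI"}


def hour_of(label):
    """Return the 24-hour clock hour (0-23) for a label such as '5AM'."""
    return datetime.strptime(label, "%I%p").hour


# Drop the header entry, then sort by time of day instead of alphabetically.
labels = [key for key in readings if key != "Time"]
ordered = sorted(labels, key=hour_of)
print(ordered)
```

With the labels in time order, "the two values before 5AM" becomes a matter of looking at the two preceding positions in the list.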

Best answer

Make sure you understand what is going on here:

    # Python 2 (urllib2, print statement)
    import urllib2
    import datetime

    from bs4 import BeautifulSoup as soup


    url = "http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours"
    web_soup = soup(urllib2.urlopen(url))

    table = web_soup.find(name="div", attrs={'class': 'c1'}).find_all(name="div")[2].find_all('table')[0]

    data = {}
    bold_time = ''
    # Readings start at 12AM and advance one hour per cell
    cur_time = datetime.datetime.strptime("12AM", "%I%p")
    for tr_index, tr in enumerate(table.find_all('tr')):
        # Skip the header rows that contain the time labels
        if 'Time' in tr.text:
            continue
        for td_index, td in enumerate(tr.find_all('td')):
            # Skip the first cell of the row (the '3-hr PSI' row label)
            if not td_index:
                continue
            data[cur_time] = td.text.strip()
            # The latest reading is wrapped in <strong>
            if td.find('strong'):
                bold_time = cur_time
            cur_time += datetime.timedelta(hours=1)

    print data.get(bold_time)  # bold
    print data.get(bold_time - datetime.timedelta(hours=1))  # before bold
    print data.get(bold_time - datetime.timedelta(hours=2))  # before before bold

This prints the 3-hr PSI value marked in bold, along with the two values before it (if they exist).

Hope that helps.
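
The question also asks for the total of the three values. A minimal sketch of that sum, assuming Python 3, an English/C locale for AM/PM formatting, and readings keyed by label strings as in the question's output (not by datetime as in the answer's code). The function name `total_last_three` is hypothetical, and the sample values are copied from the question:

```python
from datetime import datetime, timedelta

# Sample readings copied from the question's output; "-" marks a missing value.
readings = {"2AM": "64", "3AM": "59", "4AM": "54", "5AM": "51", "6AM": "48"}


def total_last_three(readings, latest_label):
    """Sum the reading at latest_label and the two hourly readings before it.

    Returns None if any of the three readings is absent or not numeric.
    """
    latest = datetime.strptime(latest_label, "%I%p")
    total = 0
    for hours_back in range(3):
        moment = latest - timedelta(hours=hours_back)
        # Rebuild the label in the table's format, e.g. "05AM" -> "5AM".
        label = moment.strftime("%I%p").lstrip("0")
        value = readings.get(label, "-")
        if not value.isdigit():
            return None
        total += int(value)
    return total


print(total_last_three(readings, "5AM"))  # 51 + 54 + 59 = 164
```

Returning None instead of a partial sum is a design choice made for this sketch: it avoids a total silently built from fewer than three readings when a value is missing or shown as "-".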

Regarding Python BeautifulSoup text extraction, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/17280145/
