
python - Parsing HTML and complex tables with BeautifulSoup

Reposted. Author: 行者123. Updated: 2023-12-01 05:55:22

I am trying to create a CSV file from the NOAA data at http://www.srh.noaa.gov/data/obhistory/PAFA.html .

I tried using the table tag, but that failed. So I am trying to do it by identifying the <tr> on each row. Here is my code:

#This script should take table context from URL and save new data into a CSV file.
noaa = urllib2.urlopen("http://www.srh.noaa.gov/data/obhistory/PAFA.html").read()
soup = BeautifulSoup(noaa)

#Iterate from lines 7 to 78 and extract the text in each line. I probably would like
#space delimited between each text
#for i in range(7, 78, 1):
rows = soup.findAll('tr')[i]
for tr in rows:
    for n in range(0, 15, 1):
        cols = rows.findAll('td')[n]
        for td in cols[n]:
            print td.find(text=true)....(match.group(0), match.group(2), match.group(3), ...
            match.group(15)

Right now some parts work as expected and some do not, and I am not sure how to stitch the last part together the way I want.

OK, I took the suggestion from "That1guy" and tried to extend it to the CSV part. So:

import urllib2 as urllib
from bs4 import BeautifulSoup
from time import localtime, strftime
import csv
url = 'http://www.srh.noaa.gov/data/obhistory/PAFA.html'
file_pointer = urllib.urlopen(url)
soup = BeautifulSoup(file_pointer)

table = soup('table')[3]
table_rows = table.findAll('tr')
row_count = 0
for table_row in table_rows:
    row_count += 1
    if row_count < 4:
        continue

    date = table_row('td')[0].contents[0]
    time = table_row('td')[1].contents[0]
    wind = table_row('td')[2].contents[0]

    print date, time, wind
    with open("/home/eyalak/Documents/weather/weather.csv", "wb") as f:
        writer = csv.writer(f)
        print date, time, wind
        writer.writerow( ('Title 1', 'Title 2', 'Title 3') )
        writer.writerow(str(time)+str(wind)+str(date)+'\n')
    if row_count == 74:
        print "74"
        break

The printed output is fine; the file is the problem. I get:

Title   1,Title 2,Title 3
0,5,:,5,3,C,a,l,m,0,8,"

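The problems with the CSV file that gets created are:

  1. The titles are split into the wrong columns; column 2 contains "1,Title" rather than "Title 2".
  2. The data is comma-separated in the wrong places.
  3. When the script writes a new row, it overwrites the previous one instead of appending at the bottom.

Any ideas?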
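
Problems 2 and 3 are consistent with how Python's csv module and file modes behave: `writerow` treats each element of its argument as one column, so a single concatenated string is written one character per column, and opening the file in write mode inside the loop truncates it on every pass. The following is a minimal sketch rather than the asker's actual fix, assuming Python 3; the `write_weather_csv` helper, the column names, and the sample rows are hypothetical stand-ins for the scraped values.

```python
import csv
import io


def write_weather_csv(out_file, rows):
    """Write a header line and one CSV line per (date, time, wind) tuple."""
    writer = csv.writer(out_file)
    writer.writerow(["date", "time", "wind"])
    for date, time, wind in rows:
        # Pass a list of fields, not one concatenated string:
        # writerow() treats each element of its argument as a column.
        writer.writerow([date, time, wind])


# Made-up sample values standing in for the scraped cells.
sample_rows = [("08", "05:53", "Calm"), ("08", "04:53", "NE 3")]

# io.StringIO stands in for a real file. With a file on disk, open it once
# before the loop, e.g. open(path, "w", newline="") in Python 3.
buffer = io.StringIO()
write_weather_csv(buffer, sample_rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Because the writer is created once and every row goes through the same open file object, later rows are appended after earlier ones instead of replacing them.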

Best answer

This worked for me:

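# Python 2 (urllib2, print statement), same imports as in the question
import urllib2 as urllib
from bs4 import BeautifulSoup

url = 'http://www.srh.noaa.gov/data/obhistory/PAFA.html'
file_pointer = urllib.urlopen(url)
soup = BeautifulSoup(file_pointer)

# The observation data is in the fourth table on the page
table = soup('table')[3]
table_rows = table.findAll('tr')
row_count = 0
for table_row in table_rows:
    row_count += 1
    # Skip the first three (header) rows
    if row_count < 4:
        continue

    date = table_row('td')[0].contents[0]
    time = table_row('td')[1].contents[0]
    wind = table_row('td')[2].contents[0]

    print date, time, wind

    # Stop after the last data row
    if row_count == 74:
        break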

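This code obviously only returns the first 3 cells of each row, but you get the idea. Also, note that some cells are empty. To make sure a cell is populated (otherwise .contents[0] may raise an IndexError), I would check the length of each cell before grabbing its .contents, i.e.: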

if len(table_row('td')[offset]) > 0:
    variable = table_row('td')[offset].contents[0]

This ensures the cell is populated, and you will avoid IndexErrors.
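
The same guard can be wrapped in a small helper so that every cell goes through one check. This is a minimal sketch, assuming Python 3 with the `bs4` package installed; the `cell_text` helper and the inline HTML sample are hypothetical stand-ins for the NOAA page. It also swaps `.contents[0]` for `get_text`, which is a substitution and not the approach used in the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; the real page has more columns and header rows.
html = """
<table>
  <tr><td>08</td><td>05:53</td><td>Calm</td></tr>
  <tr><td>08</td><td>04:53</td><td></td></tr>
</table>
"""


def cell_text(cells, offset, default=""):
    """Return the stripped text of cells[offset], or default if missing or empty."""
    if offset >= len(cells):
        return default
    text = cells[offset].get_text(strip=True)
    return text if text else default


soup = BeautifulSoup(html, "html.parser")
rows = []
for table_row in soup.find_all("tr"):
    cells = table_row.find_all("td")
    # Take the first three cells; empty or missing ones become "".
    rows.append([cell_text(cells, n) for n in range(3)])

print(rows)
```

Returning a default for empty or missing cells keeps every row the same length, which is convenient if the rows are later written to CSV.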

Regarding python - Parsing HTML and complex tables with BeautifulSoup, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/12826983/
