
python - Scraping HTML into a CSV file


The code below scrapes data from this page: http://www.gbgb.org.uk/resultsMeeting.aspx?id=136005

It grabs all the relevant fields and prints them to the screen. However, I would like to write the data in tabular form to a csv file so it can be exported to a spreadsheet or database.

In the site's HTML source, the track, date, datetime (race time), grade, distance and prize come from the div class "resultsBlockHeader" and form the top section of each race card on the page.

The body of each race in the source HTML comes from the div class "resultsBlock", which includes the finishing position (Fin), Greyhound, Trap, SP, Time/Sec and Time distance.

The final output would look like this:

track,date,datetime,grade,distance,prize,fin,greyhound,trap,SP,timeSec,time distance

Is this possible, or do I have to print it to the screen in tabular form first before I can export it to csv?
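For reference, csv.DictWriter can take rows of dicts straight to a CSV stream, so there is no need to print to the screen first. A minimal sketch with hypothetical rows (the field names and values here are made up for illustration):

```python
import csv
import io

# Hypothetical rows standing in for the scraped fields
rows = [
    {"track": "Sheffield", "date": "02/02/16", "greyhound": "Miss Eastwood"},
    {"track": "Sheffield", "date": "02/02/16", "greyhound": "Sapphire Man"},
]

buf = io.StringIO()  # any writable file object works, e.g. open("out.csv", "w", newline="")
writer = csv.DictWriter(buf, fieldnames=["track", "date", "greyhound"])
writer.writeheader()      # header row is optional
writer.writerows(rows)    # one CSV line per dict

print(buf.getvalue())
```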

from urllib.request import urlopen  # Python 3; on Python 2 this was: from urllib import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.gbgb.org.uk/resultsMeeting.aspx?id=136005")
bsObj = BeautifulSoup(html, 'lxml')

nameList = bsObj.findAll("div", {"class": "track"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("div", {"class": "distance"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("div", {"class": "prizes"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "first essential fin"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "essential greyhound"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "trap"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "sp"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "timeSec"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "timeDistance"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("li", {"class": "essential trainer"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("li", {"class": "first essential comment"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("div", {"class": "resultsBlockFooter"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("li", {"class": "first essential"})
for name in nameList:
    print(name.get_text())
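The repeated findAll/print blocks above can be collapsed into one loop over (tag, class) pairs. A minimal sketch against an inline HTML fragment - the class names come from the question, but the sample markup itself is invented for illustration:

```python
from bs4 import BeautifulSoup

# Tiny inline snippet standing in for the real page (markup is an assumption)
html = """
<div class="track">Sheffield</div>
<div class="distance">500m</div>
<ul class="line1">
  <li class="essential greyhound">Miss Eastwood</li>
  <li class="trap">1</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# One loop replaces a dozen copy-pasted findAll/print blocks
fields = [("div", "track"), ("div", "distance"),
          ("li", "greyhound"), ("li", "trap")]
for tag, cls in fields:
    # class_ matches elements whose class list contains cls,
    # so "essential greyhound" is found by class_="greyhound"
    for el in soup.find_all(tag, class_=cls):
        print(cls, el.get_text(strip=True))
```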

Best answer

Not sure why you didn't follow the code suggested in this answer to your previous question - it actually solves the problem of grouping the fields together.

Here is a follow-up snippet that dumps the track, date and greyhound fields into a csv:

import csv

from bs4 import BeautifulSoup
import requests


html = requests.get("http://www.gbgb.org.uk/resultsMeeting.aspx?id=135754").text
soup = BeautifulSoup(html, 'lxml')

rows = []
for header in soup.find_all("div", class_="resultsBlockHeader"):
    # strip whitespace, non-ascii characters and the "|" separators
    track = header.find("div", class_="track").get_text(strip=True).encode('ascii', 'ignore').decode().strip("|")
    date = header.find("div", class_="date").get_text(strip=True).encode('ascii', 'ignore').decode().strip("|")

    results = header.find_next_sibling("div", class_="resultsBlock").find_all("ul", class_="line1")
    for result in results:
        greyhound = result.find("li", class_="greyhound").get_text(strip=True)

        rows.append({
            "track": track,
            "date": date,
            "greyhound": greyhound
        })


with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, ["track", "date", "greyhound"])

    for row in rows:
        writer.writerow(row)

Contents of results.csv after running the code:

Sheffield,02/02/16,Miss Eastwood
Sheffield,02/02/16,Sapphire Man
Sheffield,02/02/16,Swift Millican
...
Sheffield,02/02/16,Geelo Storm
Sheffield,02/02/16,Reflected Light
Sheffield,02/02/16,Boozed Flame

Note that I am using requests here, but you can keep using urllib if you prefer.
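The same pattern extends to the remaining columns the asker listed (fin, trap, SP and so on). A hedged sketch against an inline fragment that mimics the resultsBlockHeader/resultsBlock structure - the class names come from the question, but the exact markup is an assumption, so it would need checking against the real page:

```python
import csv
from bs4 import BeautifulSoup

# Inline sample mimicking the structure described in the question
# (class names from the question; the nesting is an assumption)
html = """
<div class="resultsBlockHeader">
  <div class="track">Sheffield</div>
  <div class="date">02/02/16</div>
  <div class="distance">500m</div>
  <div class="prizes">100</div>
</div>
<div class="resultsBlock">
  <ul class="line1">
    <li class="first essential fin">1st</li>
    <li class="essential greyhound">Miss Eastwood</li>
    <li class="trap">1</li>
    <li class="sp">5/2</li>
    <li class="timeSec">29.96</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for header in soup.find_all("div", class_="resultsBlockHeader"):
    # header-level fields shared by every race line in this block
    meta = {cls: header.find("div", class_=cls).get_text(strip=True)
            for cls in ("track", "date", "distance", "prizes")}
    block = header.find_next_sibling("div", class_="resultsBlock")
    for line in block.find_all("ul", class_="line1"):
        row = dict(meta)
        for cls in ("fin", "greyhound", "trap", "sp", "timeSec"):
            cell = line.find("li", class_=cls)
            row[cls] = cell.get_text(strip=True) if cell else ""
        rows.append(row)

with open("results_full.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
```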

About python - scraping html into a csv file: there is a similar question on Stack Overflow: https://stackoverflow.com/questions/35384813/
