
python - Extract and format site data in Python

Reposted. Author: 太空宇宙. Updated: 2023-11-03 16:22:51

This is for Python 3.5.x. What I'm looking for is a way to find the headline that follows a piece of HTML like this:

<h3 class = "title-link__title"><span class="title=link__text">News Here</span>

import urllib.request

with urllib.request.urlopen('http://www.bbc.co.uk/news') as r:
    HTML = r.read()
HTML = list(HTML)
for i in range(len(HTML)):
    HTML[i] = chr(HTML[i])

How can I get it so that I extract only the headlines, since that is all I need? I'll do my best to provide as much detail as I can.

Best answer

Getting information from a web page is called web scraping.

One of the best tools for this job is the BeautifulSoup library.

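from bs4 import BeautifulSoup
import urllib.request

# open the page and read the raw HTML (Python 3: urlopen lives in urllib.request)
r = urllib.request.urlopen('http://www.bbc.co.uk/news').read()
# create the soup; naming the parser explicitly avoids a bs4 warning
soup = BeautifulSoup(r, 'html.parser')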

# useful for understanding the layout of the page
# print(soup.prettify())

# build a ResultSet of all h3 tags that have the class 'title-link__title'
a = soup.find_all("h3", {"class": "title-link__title"})

# count the occurrences
len(a)
# result when this answer was written: 44

# get the text of the first header
a[0].text
# result: '\nMay v Leadsom to be next UK PM\n'

# get the text of the second header
a[1].text
# result: '\nVideo shows US police shooting aftermath\n'
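
The answer above depends on the third-party BeautifulSoup package. If installing it is not an option, roughly the same extraction can be sketched with the standard library's html.parser module. This is only a minimal sketch, not the answerer's method: the names TitleParser and extract_titles and the sample markup are made up for illustration, and it assumes the headlines sit in non-nested h3 tags carrying the class 'title-link__title', as in the question.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside every <h3> whose class list contains TARGET_CLASS."""

    TARGET_CLASS = "title-link__title"  # class name taken from the question

    def __init__(self):
        super().__init__()
        self.titles = []      # finished headline strings
        self._inside = False  # True while inside a matching <h3>
        self._chunks = []     # text pieces of the headline being read

    def handle_starttag(self, tag, attrs):
        if tag == "h3" and not self._inside:
            classes = (dict(attrs).get("class") or "").split()
            if self.TARGET_CLASS in classes:
                self._inside = True
                self._chunks = []

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._inside:
            self._inside = False
            # strip() drops the surrounding newlines seen in the answer's output
            self.titles.append("".join(self._chunks).strip())


def extract_titles(html_text):
    """Return the stripped text of every matching headline in html_text."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.titles


# Hypothetical sample markup; a real page would be fetched with
# urllib.request.urlopen(url).read().decode("utf-8") instead.
sample = """
<h3 class="title-link__title"><span class="title-link__text">News Here</span></h3>
<h3 class="other">Not a headline</h3>
<h3 class="title-link__title big"><span class="title-link__text">
More News</span></h3>
"""

titles = extract_titles(sample)
print(titles)  # ['News Here', 'More News']
```

Note that the parser expects a str, so the bytes returned by read() need to be decoded first, rather than converted character by character as in the question's loop.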

Regarding "python - Extract and format site data in Python", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/38254553/
