
python - How to get href links from an href using python/pandas

Reposted · Author: 行者123 · Updated: 2023-12-01 08:46:38

I need to get the href links that are present inside an href (which I already have), so I need to follow that href link and collect the further hrefs it contains. With the code below I only get the first level of hrefs; I want to follow each of those and collect the hrefs present on the linked pages. How can I do that? This is what I tried:

from bs4 import BeautifulSoup
import requests

url = 'https://www.iea.org/oilmarketreport/reports/'
page = requests.get(url)

soup = BeautifulSoup(page.text, 'html.parser')
#soup.prettify()
#table = soup.find("table")
#print(table)
links = []
for href in soup.find_all(class_='omrlist'):
    #print(href)
    links.append(href.find('a').get('href'))
print(links)
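A side note on joining links: the hrefs collected above are relative paths, and the standard library's `urllib.parse.urljoin` resolves them against the page URL (including any `../` segments), which avoids manual string concatenation and cleanup. A minimal sketch; the pdf path below is an invented example, not a real report URL:

```python
from urllib.parse import urljoin

base = 'https://www.iea.org/oilmarketreport/reports/'

# urljoin resolves relative paths (including '../' segments) against the base,
# so a '../../../media/...' href becomes an absolute URL without manual fixes
print(urljoin(base, '../../../media/omrreports/fullissues/2018-10-12.pdf'))
# → https://www.iea.org/media/omrreports/fullissues/2018-10-12.pdf
```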

Best Answer

Here is how to loop over the year pages to get the report URLs:

import requests
from bs4 import BeautifulSoup

root_url = 'https://www.iea.org'

def getLinks(url):
    all_links = []
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    for href in soup.find_all(class_='omrlist'):
        all_links.append(root_url + href.find('a').get('href'))  # prepend 'https://www.iea.org'
    return all_links

yearLinks = getLinks(root_url + '/oilmarketreport/reports/')

# get the report URLs from each year page
reportLinks = []
for url in yearLinks:
    links = getLinks(url)
    reportLinks.extend(links)

print(reportLinks)
for url in reportLinks:
    if '.pdf' in url:
        url = url.replace('../../..', '')
        # download the pdf file here
        ...
    else:
        # extract the pdf url from the html page, then download it
        ...

Now you can loop over reportLinks to get the pdf URLs.
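The download step elided above can be sketched as follows. This is a minimal illustration, not part of the original answer: the `pdf_filename` helper and the `omr_reports` directory name are assumptions.

```python
import os
import requests

def pdf_filename(url):
    # derive a local file name from the last path segment of the URL
    return url.rstrip('/').rsplit('/', 1)[-1]

def download_pdf(url, dest_dir='omr_reports'):
    # fetch the pdf and save it under dest_dir; raises on HTTP errors
    os.makedirs(dest_dir, exist_ok=True)
    path = os.path.join(dest_dir, pdf_filename(url))
    resp = requests.get(url)
    resp.raise_for_status()
    with open(path, 'wb') as f:
        f.write(resp.content)
    return path
```

Calling `download_pdf(url)` for each pdf URL in `reportLinks` would save the reports locally.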

Regarding "python - How to get href links from an href using python/pandas", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53280170/
