
python - How to get the URLs of all pages?

Reposted · Author: 行者123 · Updated: 2023-12-04 07:25:14

I have code that collects all the page URLs from the "oddsportal" website:

from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
source = requests.get("https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/", headers=headers)

soup = BeautifulSoup(source.text, 'html.parser')

# Collect every link inside the season navigation menu
main_div = soup.find("div", class_="main-menu2 main-menu-gray")
a_tags = main_div.find_all("a")
for a in a_tags:
    print(a['href'])
which returns these results:
/soccer/africa/africa-cup-of-nations/results/
/soccer/africa/africa-cup-of-nations-2019/results/
/soccer/africa/africa-cup-of-nations-2017/results/
/soccer/africa/africa-cup-of-nations-2015/results/
/soccer/africa/africa-cup-of-nations-2013/results/
/soccer/africa/africa-cup-of-nations-2012/results/
/soccer/africa/africa-cup-of-nations-2010/results/
/soccer/africa/africa-cup-of-nations-2008/results/
I would like the URLs to be returned as:
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/2/
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/3/
for every parent URL generated under results.
From Inspect Element I can see that these page fragments can be appended to the URL, as shown in the div id="pagination" element.
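As an aside, the relative hrefs collected by the code above can be turned into absolute URLs with urllib.parse.urljoin (a minimal sketch; this alone does not produce the #/page/N/ fragments):

```python
from urllib.parse import urljoin

BASE = "https://www.oddsportal.com"

relative_urls = [
    "/soccer/africa/africa-cup-of-nations/results/",
    "/soccer/africa/africa-cup-of-nations-2019/results/",
]

# urljoin resolves each site-relative path against the base URL
absolute_urls = [urljoin(BASE, path) for path in relative_urls]
print(absolute_urls[0])
# → https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/
```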

Best Answer

The data under id="pagination" is loaded dynamically, so requests alone won't see it.
However, you can fetch the table for each of those pages (1-3) by sending a GET request to:

https://fb.oddsportal.com/ajax-sport-country-tournament-archive/1/MN8PaiBs/X0/1/0/{page}/?_={timestamp}

where {page} is the page number (1-3) and {timestamp} is the current timestamp.
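For illustration, the per-page request URLs can be built like this (a sketch; the tournament id MN8PaiBs is taken from the endpoint above):

```python
from datetime import datetime

TEMPLATE = (
    "https://fb.oddsportal.com/ajax-sport-country-tournament-archive"
    "/1/MN8PaiBs/X0/1/0/{page}/?_={ts}"
)

# One URL per results page, stamped with the current time as a cache-buster
urls = [TEMPLATE.format(page=page, ts=datetime.now().timestamp()) for page in range(1, 4)]
for url in urls:
    print(url)
```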
You also need to add:

"Referer": "https://www.oddsportal.com/"

to your headers. Additionally, use the lxml parser instead of html.parser to avoid a RecursionError.
import re
import requests
from datetime import datetime
from bs4 import BeautifulSoup

headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
    "Referer": "https://www.oddsportal.com/",
}

with requests.Session() as session:
    session.headers.update(headers)
    for page in range(1, 4):
        response = session.get(
            f"https://fb.oddsportal.com/ajax-sport-country-tournament-archive/1/MN8PaiBs/X0/1/0/{page}/?_={datetime.now().timestamp()}"
        )

        # The endpoint wraps the table HTML in a {"html":"..."} JSON object;
        # pull the fragment out with a regex and parse it with lxml
        table_data = re.search(r'{"html":"(.*)"}', response.text).group(1)
        soup = BeautifulSoup(table_data, "lxml")
        print(soup.prettify())
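To see what the regex step does without hitting the network, here is an offline illustration using a made-up response body in the {"html":"..."} shape described above:

```python
import re

# Hypothetical response text mimicking the endpoint's JSON wrapper
sample = '{"html":"<table><tr><td>Algeria - Senegal<\\/td><\\/tr><\\/table>"}'

# Same extraction as above: grab the escaped HTML fragment from the wrapper
table_data = re.search(r'{"html":"(.*)"}', sample).group(1)

# Undo the JSON \/ escaping before handing the fragment to a parser
html = table_data.replace("\\/", "/")
print(html)
# → <table><tr><td>Algeria - Senegal</td></tr></table>
```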

Regarding "python - How to get the URLs of all pages?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/68241008/
