
python - How to web scrape from two websites in one script?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 20:30:43

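I'm currently working on a model and need to collect more than just the match results (this link: https://www.hltv.org/stats/teams/matches/4991/fnatic?startDate=2019-01-01&endDate=2019-12-31 ). I would also like the script to open another link that appears in the HTML source. That link leads to a page with the detailed results of each match (such as who won which round, https://www.hltv.org/stats/matches/mapstatsid/89458/cr4zy-vs-fnatic?startDate=2019-01-01&endDate=2019-12-31&contextIds=4991&contextTypes=team ). My main goal is to find out who won the match (from the first link) and who won the first round of each match (from the second link). Is this possible? Here is my current script: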

import requests
import pandas as pd
from bs4 import BeautifulSoup

r = requests.get('https://www.hltv.org/stats/teams/maps/6665/Astralis')
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('tr')
AstralisResults = []

# Skip the header row, then read each cell by its position in the row
for result in results[1:]:
    date = result.contents[1].text
    event = result.contents[3].text
    opponent = result.contents[7].text
    Map = result.contents[9].text
    Score = "'" + result.contents[11].text
    WinorLoss = result.contents[13].text
    AstralisResults.append((date, event, opponent, Map, Score, WinorLoss))

df5 = pd.DataFrame(AstralisResults, columns=['date', 'event', 'opponent', 'Map', 'Score', 'WinorLoss'])
df5.to_csv('AstralisResults.csv', index=False, encoding='utf-8')

So I would be looking for the following information:

Date | Opponent | Map | Score | Result | Round1Result |

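Best answer

It looks like the site blocks you if you scrape too fast, so I had to add a time delay. There are ways to make this code more efficient, but overall I think it gets what you asked for: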

from bs4 import BeautifulSoup
import requests
import pandas as pd
import time


headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}

r = requests.get('https://www.hltv.org/stats/teams/matches/4991/fnatic?startDate=2019-01-01&endDate=2019-12-31', headers=headers)
print(r)


soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('tr')
frames = []  # one single-row DataFrame per match

cnt = 1
for result in results[1:]:
    print('%s of %s' % (cnt, len(results) - 1))
    date = result.contents[1].text
    event = result.contents[3].text
    opponent = result.contents[7].text
    Map = result.contents[9].text
    Score = "'" + result.contents[11].text
    WinorLoss = result.contents[13].text

    # The 'time' cell holds the relative link to the detailed stats page
    round_results = result.find('td', {'class': 'time'})
    link = round_results.find('a')['href']

    r2 = requests.get('https://www.hltv.org' + link, headers=headers)
    soup2 = BeautifulSoup(r2.text, 'html.parser')
    round_history = soup2.find('div', {'class': 'standard-box round-history-con'})

    teams = round_history.find_all('img', {'class': 'round-history-team'})
    teams_list = [x['title'] for x in teams]

    rounds_winners = {}
    n = 1
    row = round_history.find('div', {'class': 'round-history-team-row'})
    for each in row.find_all('img', {'class': 'round-history-outcome'}):
        # An 'emptyHistory' icon in the first team's row is treated as a
        # round won by the second team; anything else as a win for the first
        if 'emptyHistory' in each['src']:
            winner = teams_list[1]
        else:
            winner = teams_list[0]

        rounds_winners['Round%02dResult' % n] = winner
        n += 1

    round_row_df = pd.DataFrame.from_dict(rounds_winners, orient='index').T

    temp_df = pd.DataFrame([[date, event, opponent, Map, Score, WinorLoss]], columns=['date', 'event', 'opponent', 'Map', 'Score', 'WinorLoss'])
    temp_df = temp_df.merge(round_row_df, left_index=True, right_index=True)

    frames.append(temp_df)
    time.sleep(.5)  # pause between requests so the site does not block us
    cnt += 1

# DataFrame.append was removed in pandas 2.0, so combine the rows with concat
df5 = pd.concat(frames, sort=True).reset_index(drop=True)

df5 = df5[['date', 'event', 'opponent', 'Map', 'Score', 'WinorLoss', 'Round01Result']]
df5 = df5.rename(columns={'date': 'Date',
                          'event': 'Event',
                          'WinorLoss': 'Result',
                          'Round01Result': 'Round1Result'})

df5.to_csv('AstralisResults.csv', index=False, encoding='utf-8')
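
To see the round-winner lookup from the script above in isolation, here is a minimal offline sketch. It applies the same 'emptyHistory' rule to a hand-written HTML fragment. The fragment is an assumption: it only mimics the class names the script searches for, and the real page markup may differ. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages, and unlike the script it does not limit the search to the 'round-history-con' container. The names RoundHistoryParser and round_winners are hypothetical.

```python
from html.parser import HTMLParser

# Hand-written stand-in for the round-history block. It only mimics the class
# names the script looks for; it is NOT copied from the real site.
SAMPLE_HTML = """
<div class="standard-box round-history-con">
  <div class="round-history-team-row">
    <img class="round-history-team" title="CR4ZY" src="/team/a.png">
    <img class="round-history-outcome" src="/img/emptyHistory.svg">
    <img class="round-history-outcome" src="/img/ct_win.svg">
  </div>
  <div class="round-history-team-row">
    <img class="round-history-team" title="fnatic" src="/team/b.png">
    <img class="round-history-outcome" src="/img/t_win.svg">
    <img class="round-history-outcome" src="/img/emptyHistory.svg">
  </div>
</div>
"""


class RoundHistoryParser(HTMLParser):
    """Collect team names and the outcome icons of the first team row."""

    def __init__(self):
        super().__init__()
        self.teams = []           # titles of the 'round-history-team' images
        self.rows_seen = 0        # 'round-history-team-row' divs opened so far
        self.first_row_srcs = []  # icon sources from the first team row only

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if tag == 'div' and 'round-history-team-row' in classes:
            self.rows_seen += 1
        elif tag == 'img' and 'round-history-team' in classes:
            self.teams.append(attrs.get('title'))
        elif tag == 'img' and 'round-history-outcome' in classes and self.rows_seen == 1:
            self.first_row_srcs.append(attrs.get('src') or '')


def round_winners(html):
    """Map 'RoundNNResult' to the winning team, following the script's rule."""
    parser = RoundHistoryParser()
    parser.feed(html)
    if len(parser.teams) < 2:
        return {}
    winners = {}
    for n, src in enumerate(parser.first_row_srcs, start=1):
        # An 'emptyHistory' icon in the first team's row is read as a win for
        # the second team; any other icon as a win for the first team.
        team = parser.teams[1] if 'emptyHistory' in src else parser.teams[0]
        winners['Round%02dResult' % n] = team
    return winners


winners = round_winners(SAMPLE_HTML)
print(winners['Round01Result'])
```

Only the 'Round01Result' entry is needed for the Round1Result column; the later rounds are kept because the script collects them too. The listing after this note comes from the answer's full script, not from this sketch.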

Output:

print(df5.head(10).to_string())
       Date                                 Event     opponent       Map     Score  Result  Round1Result
0  20/07/19  Europe Minor - StarLadder Major 2019        CR4ZY     Dust2  '13 - 16       L        fnatic
1  20/07/19  Europe Minor - StarLadder Major 2019        CR4ZY     Train  '13 - 16       L        fnatic
2  19/07/19  Europe Minor - StarLadder Major 2019  mousesports   Inferno   '8 - 16       L   mousesports
3  19/07/19  Europe Minor - StarLadder Major 2019  mousesports     Dust2  '13 - 16       L        fnatic
4  17/07/19  Europe Minor - StarLadder Major 2019        North     Train   '16 - 9       W        fnatic
5  17/07/19  Europe Minor - StarLadder Major 2019        North      Nuke   '16 - 2       W        fnatic
6  17/07/19  Europe Minor - StarLadder Major 2019      Ancient    Mirage   '16 - 7       W        fnatic
7  04/07/19                  ESL One Cologne 2019     Vitality  Overpass  '17 - 19       L      Vitality
8  04/07/19                  ESL One Cologne 2019     Vitality    Mirage  '16 - 19       L        fnatic
9  03/07/19                  ESL One Cologne 2019     Astralis      Nuke   '6 - 16       L        fnatic
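
Since the answer reports that the site blocks requests that arrive too quickly, one possible refinement is to pause before every request and wait longer after a refusal. The sketch below is an illustration under assumptions, not part of the original answer. The helper name fetch_with_backoff is hypothetical, any non-200 status is treated as a refusal (the status the site actually returns when it blocks is not stated in the answer), and the fetcher is passed in as a parameter so the example runs offline.

```python
import time


def fetch_with_backoff(fetch, url, delay=0.5, retries=3, sleep=time.sleep):
    """Call fetch(url), pausing before each attempt and backing off on failure.

    fetch -- callable returning (status_code, text); a hypothetical wrapper,
             e.g. around requests.get(url, headers=headers)
    delay -- base pause in seconds (0.5 matches the script's time.sleep(.5))
    sleep -- injectable so the pauses can be observed without really waiting
    """
    wait = delay
    for _ in range(retries):
        sleep(wait)
        status, text = fetch(url)
        if status == 200:
            return text
        wait *= 2  # double the pause after every non-200 response
    raise RuntimeError('gave up on %s after %d attempts' % (url, retries))


# Offline demonstration: a fake fetcher that is refused once, then succeeds.
# The 429 status is an assumed example of a refusal.
responses = iter([(429, ''), (200, '<html>ok</html>')])
pauses = []
page = fetch_with_backoff(lambda url: next(responses),
                          'https://example.invalid/page',
                          sleep=pauses.append)
print(page, pauses)
```

In the script, the fetcher could be a small function that calls requests.get with the same headers and returns the response's status_code and text. Reusing a single requests.Session for all requests is another common way to reduce overhead.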

Source: a similar question, "python - How to web scrape from two websites in one script?", was found on Stack Overflow: https://stackoverflow.com/questions/57528087/
