
python - Scraping data tables from a website using names

Reposted. Author: 行者123. Updated: 2023-12-02 16:16:15

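I've run into an unusual situation while scraping a website. I search for several hundred names through the site's search bar and then scrape the tables on the resulting pages. However, some names are unusual and are spelled differently in my list than on the site. When I looked up a few of these manually on the site, the search still took me straight to the player's individual page. In other cases, when several people have the same or a similar name, the search lands on a list of names instead (in that case I want the one who played in the NBA; I've already accounted for this, but thought it was worth mentioning).

How can I still reach these players' individual pages, rather than running the script each time and hitting an error to find out which player's name is spelled slightly differently? Again, even when the spelling differs slightly, searching for a name from my array leads either directly to the individual page or to a list of names (where I need the NBA one). Some examples are Georgios Papagiannis (listed on the site as George Papagiannis), Ognjen Kuzmic (listed as Ognen Kuzmic), and Nene (listed as Maybyner Nene, but the search leads to a list of names: https://basketball.realgm.com/search?q=nene). This seems hard, but I feel it should be possible.

Also, instead of all the scraped data being written to the csv, it seems to be overwritten by the next player each time. Thanks very much.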

The error I get: AttributeError: 'NoneType' object has no attribute 'text'
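
For context, this error typically means that `soup.find` returned `None` because no `<a>` element had exactly the given text, and `.text` was then read from `None`. The following is a minimal sketch of guarding against that. It uses a made-up HTML fragment rather than the real site's markup, and the helper name `find_player_href` is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up fragment standing in for a search results page (not the real markup).
html = '<table><tr><td><a href="/player/example/1">George Papagiannis</a></td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')


def find_player_href(soup, name):
    """Return the href of the link whose text equals name, or None if absent."""
    link = soup.find('a', string=name)  # None when no link text matches exactly
    return link['href'] if link is not None else None


print(find_player_href(soup, 'George Papagiannis'))    # exact spelling: link found
print(find_player_href(soup, 'Georgios Papagiannis'))  # different spelling: None
```

Checking for `None` before touching `.text` or `['href']` lets the loop skip or log a name instead of stopping on the first mismatch.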

import requests
from bs4 import BeautifulSoup
import pandas as pd


playernames=['Carlos Delfino', 'Nene', 'Yao Ming', 'Marcus Vinicius', 'Raul Neto', 'Timothe Luwawu-Cabarrot']

result = pd.DataFrame()
for name in playernames:

    fname=name.split(" ")[0]
    lname=name.split(" ")[1]
    url="https://basketball.realgm.com/search?q={}+{}".format(fname,lname)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    if soup.find('a',text=name).text==name:
        url="https://basketball.realgm.com"+soup.find('a',text=name)['href']
        print(url)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'lxml')

    try:
        table1 = soup.find('h2',text='International Regular Season Stats - Per Game').findNext('table')
        table2 = soup.find('h2',text='International Regular Season Stats - Advanced Stats').findNext('table')

        df1 = pd.read_html(str(table1))[0]
        df2 = pd.read_html(str(table2))[0]

        commonCols = list(set(df1.columns) & set(df2.columns))
        df = df1.merge(df2, how='left', on=commonCols)
        df['Player'] = name
        print(df)
    except:
        print ('No international table for %s.' %name)
        df = pd.DataFrame([name], columns=['Player'])

    result = result.append(df, sort=False).reset_index(drop=True)

cols = list(result.columns)
cols = [cols[-1]] + cols[:-1]
result = result[cols]
result.to_csv('international players.csv', index=False)
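
One detail worth noting in the code above: `name.split(" ")[1]` raises an IndexError for a single-word name such as 'Nene', and names with more than two words lose everything after the second word. As a small sketch, the search URL could be built from the whole name with the standard library's `quote_plus`. The helper name `build_search_url` is hypothetical, and the `/search?q=` pattern is taken from the URLs quoted in the question.

```python
from urllib.parse import quote_plus

BASE_URL = 'https://basketball.realgm.com'


def build_search_url(name):
    """Build the search URL for a full name; spaces become '+'."""
    return f'{BASE_URL}/search?q={quote_plus(name)}'


print(build_search_url('Nene'))
print(build_search_url('Timothe Luwawu-Cabarrot'))
```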

Best answer

I loop over the NBA players with similar names. The CSS selector below picks the NBA players out of the search results table:

.tablesaw tr:has(a[href*="/nba/teams/"]) a[href*="/player/"]

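Meaning of the CSS selector: find the table with class tablesaw, then the tr elements inside it that contain an a whose href contains /nba/teams/, and within those rows select the a elements whose href contains /player/.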
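
The selector can be tried offline on a small fragment. The sketch below uses made-up rows that only imitate the shape of a search results table (the real page's markup may differ), and shows that only the player link from the row with an NBA team link is selected.

```python
from bs4 import BeautifulSoup

# Made-up rows imitating a search results table; hrefs and names are illustrative.
html = """
<table class="tablesaw">
  <tr>
    <td><a href="/player/example-a/1">Player A</a></td>
    <td><a href="/nba/teams/example-team/1">Example Team</a></td>
  </tr>
  <tr>
    <td><a href="/player/example-b/2">Player B</a></td>
    <td><a href="/international/example-club/2">Other Club</a></td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

selector = '.tablesaw tr:has(a[href*="/nba/teams/"]) a[href*="/player/"]'
nba_links = soup.select(selector)

# Only the first row contains an /nba/teams/ link, so only Player A is kept.
names = [a.text for a in nba_links]
print(names)
```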

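I added Search Player Name and Real Player Name columns so you can see how each player was found. They are placed as the first and second columns using insert (see the comments in the code).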
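
As a quick illustration of how `DataFrame.insert` positions columns, here is a sketch on a made-up one-row frame. The `Season` and `GP` columns and all values are illustrative only, not taken from the site.

```python
import pandas as pd

# Made-up stats row; column names and values are illustrative only.
df = pd.DataFrame({'Season': ['2015-16'], 'GP': [30]})

# insert(loc, column, value) modifies df in place; a scalar value is
# repeated for every row of the frame.
df.insert(0, 'Search Player Name', 'Player A')
df.insert(1, 'Real Player Name', 'Player A')

print(list(df.columns))
```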

from io import StringIO

import requests
from bs4 import BeautifulSoup
import pandas as pd

base_url = 'https://basketball.realgm.com'
player_names = ['Carlos Delfino', 'Nene', 'Yao Ming', 'Marcus Vinicius', 'Raul Neto', 'Timothe Luwawu-Cabarrot']


def get_player_stats(search_name=None, real_name=None, player_soup=None):
    # Locate the two section headings; find() returns None if a heading is missing
    table_per_game = player_soup.find('h2', string='International Regular Season Stats - Per Game')
    table_advanced_stats = player_soup.find('h2', string='International Regular Season Stats - Advanced Stats')

    if table_per_game and table_advanced_stats:
        print('International table for %s.' % search_name)

        # read_html expects a file-like object on recent pandas, hence StringIO
        df1 = pd.read_html(StringIO(str(table_per_game.find_next('table'))))[0]
        df2 = pd.read_html(StringIO(str(table_advanced_stats.find_next('table'))))[0]

        common_cols = list(set(df1.columns) & set(df2.columns))
        df = df1.merge(df2, how='left', on=common_cols)

        # insert name columns for the first positions
        df.insert(0, 'Search Player Name', search_name)
        df.insert(1, 'Real Player Name', real_name)
    else:
        print('No international table for %s.' % search_name)
        df = pd.DataFrame([[search_name, real_name]], columns=['Search Player Name', 'Real Player Name'])

    return df


frames = []  # one DataFrame per scraped player page

for name in player_names:
    url = f'{base_url}/search?q={name.replace(" ", "+")}'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    if url == response.url:
        # No redirect happened, so this is a results list: get all NBA players
        for player in soup.select('.tablesaw tr:has(a[href*="/nba/teams/"]) a[href*="/player/"]'):
            response = requests.get(base_url + player['href'])
            player_soup = BeautifulSoup(response.content, 'lxml')
            frames.append(get_player_stats(search_name=player.text, real_name=name, player_soup=player_soup))
    else:
        # The search redirected straight to the player's own page
        frames.append(get_player_stats(search_name=name, real_name=name, player_soup=soup))

# DataFrame.append was removed in pandas 2.0; concatenate all frames once instead
result = pd.concat(frames, sort=False, ignore_index=True)
result.to_csv('international players.csv', index=False)
# Append to existing file
# result.to_csv('international players.csv', index=False, mode='a')
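
A note on accumulating results: `DataFrame.append` was removed in pandas 2.0, and the usual replacement is to collect the per-player frames in a list and call `pd.concat` once, then write the file a single time after the loop. The sketch below uses two made-up frames (names and numbers are illustrative) and shows that a player without a stats table simply ends up with NaN in the stats columns.

```python
import pandas as pd

# Made-up per-player frames; the second player has no stats table.
frames = [
    pd.DataFrame({'Search Player Name': ['Player A'], 'Season': ['2015-16'], 'GP': [30]}),
    pd.DataFrame({'Search Player Name': ['Player B']}),
]

# Concatenate once; ignore_index renumbers the rows 0..n-1.
result = pd.concat(frames, sort=False, ignore_index=True)
print(result)
```

If you do append to an existing file with `mode='a'`, passing `header=False` as well avoids writing the header row a second time.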

About python - Scraping data tables from a website using names: a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59903722/
