python - Click a button, then scrape data from a seemingly static web page?

Reposted. Author: 行者123. Updated: 2023-12-01 03:49:29

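I am trying to scrape the player statistics in the Totals table at the following link: http://www.basketball-reference.com/players/j/jordami01.html . The data is much harder to scrape in the form it takes when you first land on the page, so the site gives you the option to click "CSV" right above the table. That format would be much easier to work with.

I am having trouble: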

import urllib2
from bs4 import BeautifulSoup
from selenium import webdriver

player_link = "http://www.basketball-reference.com/players/j/jordami01.html"

browser = webdriver.Firefox()
browser.get(player_link)
elem = browser.find_element_by_xpath("//span[@class='tooltip' and @onlick='table2csv('totals')']")
elem.click()

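When I run this, a Firefox window pops up, but the code never switches the table from its original format to CSV. The CSV table only appears in the page source after I click CSV (obviously). How can I get selenium to click that CSV button and then have BS scrape the data?

Best answer

There is no need for BeautifulSoup here. Click the CSV button with selenium, extract the contents of the pre element that appears with the CSV data, and parse it with the built-in csv module: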

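# Written for Python 2; on Python 3 use `from io import StringIO`
# and drop the .encode("utf-8") call further down.
import csv
from StringIO import StringIO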

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

player_link = "http://www.basketball-reference.com/players/j/jordami01.html"

browser = webdriver.Firefox()
wait = WebDriverWait(browser, 10)
browser.set_page_load_timeout(10)

# stop load after a timeout
try:
    browser.get(player_link)
except TimeoutException:
    browser.execute_script("window.stop();")

# click "CSV"
elem = wait.until(EC.presence_of_element_located((By.XPATH, "//div[@class='table_heading']//span[. = 'CSV']")))
elem.click()

# get CSV data
csv_data = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "pre#csv_totals"))).text.encode("utf-8")
browser.close()

# read CSV
reader = csv.reader(StringIO(csv_data))
for line in reader:
    print(line)
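
On Python 3 the parsing step looks slightly different, because StringIO lives in the io module and the text returned by selenium is already a str. Below is a minimal sketch of just that step, assuming the CSV text has already been pulled out of the pre element. The helper name parse_csv_text and the sample rows are hypothetical and made up for illustration; they are not part of the original answer or taken from the site.

```python
import csv
from io import StringIO


def parse_csv_text(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(StringIO(csv_text))
    return list(reader)


# Hypothetical stand-in for the text of the <pre> element.
# Column names and values are invented for this example.
sample = "Season,Team,G\n2000-01,AAA,82\n2001-02,BBB,60\n"

rows = parse_csv_text(sample)
for row in rows:
    print(row["Season"], row["G"])
```

Using csv.DictReader instead of csv.reader is a design choice here: each row can be looked up by column name, which tends to be more convenient when the table has many columns.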

Regarding "python - Click a button, then scrape data from a seemingly static web page?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/38470838/
