javascript - Using BeautifulSoup to get the "View Element" code instead of the "View Source" code

Reposted. Author: 行者123. Last updated: 2023-11-28 05:22:51

I am using the following code to get the contents of every <script>...</script> element on a web page (see the url in the code):

import urllib2
from bs4 import BeautifulSoup
import re
import imp

url = "http://racing4everyone.eu/2015/10/25/formula-e-201516formula-e-201516-round01-china-race/"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

script = soup.find_all("script")
print script #just to check the output of script

However, BeautifulSoup searches the page's source code (Ctrl+U in Chrome). I want BeautifulSoup to search the page's element code instead (Ctrl+Shift+I in Chrome).

I want this because the snippet I am actually interested in appears in the element code, not in the source code.

Best answer

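The first thing to understand is that neither BeautifulSoup nor urllib2 is a browser. urllib2 only fetches the initial "static" page for you; it cannot execute JavaScript the way a real browser does. As a result, you will always get the "View Page Source" content.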
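To make the difference concrete, here is a minimal sketch that parses a made-up static page. The HTML string and the "player" element are assumptions invented for illustration; they are not taken from the page in the question. The script is visible to BeautifulSoup as text, but the element it would create in a browser is not there, because nothing ran the script.

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML, as urllib2 would download it.
# The <div id="player"> only exists after a browser runs the script.
STATIC_HTML = """
<html>
  <body>
    <div id="content"></div>
    <script>
      var player = document.createElement("div");
      player.id = "player";
      document.getElementById("content").appendChild(player);
    </script>
  </body>
</html>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")

# The script is present as plain text in the parsed tree...
script_count = len(soup.find_all("script"))

# ...but it was never executed, so the element it creates is missing.
player = soup.find(id="player")

print(script_count)  # 1
print(player)        # None
```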

To solve your problem, start a real browser through selenium, wait for the page to load, get .page_source, and pass it to BeautifulSoup for parsing:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Firefox()
driver.get("http://racing4everyone.eu/2015/10/25/formula-e-201516formula-e-201516-round01-china-race/")

# wait for the page to load
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".fluid-width-video-wrapper")))

# get the page source
page_source = driver.page_source

# quit() ends the whole session, not just the current window
driver.quit()

# parse the HTML
soup = BeautifulSoup(page_source, "html.parser")
script = soup.find_all("script")
print(script)
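
If the wait times out, the code above raises before the browser is shut down. One possible way to guard against that is sketched below. The function name rendered_scripts and its wait_until parameter are hypothetical helpers, not part of selenium or BeautifulSoup; the function only relies on the driver having get(), page_source and quit(), which selenium drivers provide.

```python
from bs4 import BeautifulSoup


def rendered_scripts(driver, url, wait_until=None):
    """Load url in an existing Selenium driver and return its <script> tags.

    wait_until (hypothetical parameter) is an optional callable that takes
    the driver and blocks until the page is ready, for example:
        lambda d: WebDriverWait(d, 10).until(
            EC.visibility_of_element_located(
                (By.CSS_SELECTOR, ".fluid-width-video-wrapper")))
    """
    try:
        driver.get(url)
        if wait_until is not None:
            wait_until(driver)
        # Read the rendered DOM while the browser is still open.
        page_source = driver.page_source
    finally:
        # Always shut the browser down, even if the wait raised.
        driver.quit()

    soup = BeautifulSoup(page_source, "html.parser")
    return soup.find_all("script")
```

With a real browser this would be called as rendered_scripts(webdriver.Firefox(), url, wait_until), reusing the imports from the code above.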

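That is the general approach, but your case is slightly different: the video player sits inside an iframe element. If you want to access the script elements inside the iframe, you need to switch to it and then get .page_source: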

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Firefox()
driver.get("http://racing4everyone.eu/2015/10/25/formula-e-201516formula-e-201516-round01-china-race/")

# wait for the page to load, switch to iframe
wait = WebDriverWait(driver, 10)
frame = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "iframe[src*=video]")))
driver.switch_to.frame(frame)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".controls")))

# get the page source
page_source = driver.page_source

# quit() ends the whole session, not just the current window
driver.quit()

# parse the HTML
soup = BeautifulSoup(page_source, "html.parser")
script = soup.find_all("script")
print(script)
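
The switch is needed because the outer document's .page_source contains only the iframe tag itself, not the document loaded inside it. The sketch below illustrates this on a made-up outer page; the HTML string, the example.invalid URL, and the helper name iframe_sources are assumptions for illustration only. Listing the src attributes this way can help when choosing a selector for the frame to switch to.

```python
from bs4 import BeautifulSoup

# Hypothetical outer page: the player is a separate document loaded
# by the iframe, so its scripts are not part of this HTML.
OUTER_HTML = """
<html><body>
  <div class="fluid-width-video-wrapper">
    <iframe src="https://example.invalid/video/embed/123"></iframe>
  </div>
  <script>var outer = true;</script>
</body></html>
"""


def iframe_sources(html):
    """Return the src attribute of every iframe in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [frame.get("src") for frame in soup.find_all("iframe")]


sources = iframe_sources(OUTER_HTML)
outer_scripts = BeautifulSoup(OUTER_HTML, "html.parser").find_all("script")

print(sources)             # ['https://example.invalid/video/embed/123']
print(len(outer_scripts))  # 1, only the outer page's own script
```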

For "javascript - Using BeautifulSoup to get the "View Element" code instead of the "View Source" code", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/36129963/
