
python - Web scraping a page whose td contains an href with javascript __doPostBack

Reposted · Author: 行者123 · Updated: 2023-12-01 07:23:01

I want to scrape a website with Selenium, namely https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27= , but I can only scrape the first page and cannot get to the other pages.

Here is my Selenium code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

chromeOptions = webdriver.ChromeOptions()
chromeOptions.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(executable_path='C:/Users/ptiwar34/Documents/chromedriver.exe', chrome_options=chromeOptions, desired_capabilities=chromeOptions.to_capabilities())
driver.get('https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27=')
WebDriverWait(driver, 20).until(EC.staleness_of(driver.find_element_by_xpath("//td/a[text()='2']")))
driver.find_element_by_xpath("//td/a[text()='2']").click()

numLinks = len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//td/a[text()='2']"))))
print(numLinks)
for i in range(numLinks):
    print("Perform your scraping here on page {}".format(str(i+1)))
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//td/a[text()='2']/span//following::span[1]"))).click()
driver.quit()

Here is the HTML content:

<td><span>1</span></td>
<td>
    <a href="javascript:__doPostBack(&#39;dnn$ctr1535$UNSPSCSearch$gvDetailsSearchView&#39;,&#39;Page$2&#39;)"
       style="color:#333333;">2</a>
</td>

This raises an error:

raise TimeoutException(message, screen, stacktrace)
TimeoutException

Best answer

To scrape the website https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27= using Selenium, you can use the following Locator Strategy:

  • Code block:

      from selenium import webdriver
      from selenium.common.exceptions import TimeoutException
      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support import expected_conditions as EC

      chrome_options = webdriver.ChromeOptions()
      chrome_options.add_argument("start-maximized")
      driver = webdriver.Chrome(options=chrome_options, executable_path=r'C:\WebDrivers\chromedriver.exe')
      driver.get("https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27=%27")
      while True:
          try:
              # The current page number is a <span>; the first <a> after it is the next page.
              WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//table[contains(@id, 'UNSPSCSearch_gvDetailsSearchView')]//tr[last()]//table//span//following::a[1]"))).click()
              print("Clicked for next page")
          except TimeoutException:
              # No clickable next-page link appeared within 20 seconds.
              print("No more pages")
              break
      driver.quit()
  • Console output:

      Clicked for next page
      Clicked for next page
      Clicked for next page
      .
      .
      .
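  • Explanation: If you observe the HTML DOM, the page numbers are within a <table> whose dynamic id attribute contains the text UNSPSCSearch_gvDetailsSearchView. Further, the page numbers are within the last <tr>, which has a child <table>. Within the child table, the current page number is within a <span>, which holds the key. So to click() on the next page number you just need to identify the following <a> tag with index [1]. Finally, as the element has javascript:__doPostBack(), you have to induce WebDriverWait for the desired element_to_be_clickable().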

You can find a detailed discussion in How do I wait for a JavaScript __doPostBack call through Selenium and WebDriver
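As a side note to the explanation above, the pager link's href carries the two arguments that __doPostBack() receives. The following is only a minimal sketch, not part of the original answer: the helper name parse_postback_href is hypothetical, and it assumes the href always has the two-argument, single-quoted form shown in the question's HTML.

```python
import html
import re

# Matches javascript:__doPostBack('<target>','<argument>'), tolerating whitespace.
# Assumption: both arguments are single-quoted and contain no single quotes.
_POSTBACK_RE = re.compile(
    r"javascript:\s*__doPostBack\s*\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)"
)


def parse_postback_href(href):
    """Return (event_target, event_argument) from a __doPostBack href, or None.

    Hypothetical helper for illustration only.
    """
    # Decode HTML entities such as &#39; in case the raw attribute text is passed in.
    match = _POSTBACK_RE.search(html.unescape(href))
    if match is None:
        return None
    return match.group(1), match.group(2)


# The href of the page-2 link, copied from the HTML in the question.
href = ("javascript:__doPostBack"
        "(&#39;dnn$ctr1535$UNSPSCSearch$gvDetailsSearchView&#39;,&#39;Page$2&#39;)")
target, argument = parse_postback_href(href)
print(target)    # dnn$ctr1535$UNSPSCSearch$gvDetailsSearchView
print(argument)  # Page$2
```

With a live driver, the extracted values could in principle be passed to driver.execute_script("__doPostBack(arguments[0], arguments[1]);", target, argument) instead of clicking the link. Whether that behaves the same as a click on this particular site is an untested assumption, and you would still need a WebDriverWait afterwards.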

Regarding "python - Web scraping a page whose td contains an href with javascript __doPostBack", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57594334/
