javascript - Python Crawling Pastebin (JavaScript-rendered web pages)

Reposted. Author: 行者123. Updated: 2023-11-30 16:09:04

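I'm having trouble scraping a page that is rendered by JavaScript.

I'm using the python-qt4 module, following this tutorial: https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/

In the tutorial, everything works perfectly with the example page: http://pycoders.com/archive

But I'm trying this with pastebin, using this URL:

http://pastebin.com/search?q=ssh

What I'm trying to do is get all the links so that I can visit them, and also be able to follow the result pages (I don't know yet what I'll use, maybe Scrapy, but I'd like to look at other options).

The problem is that I can't extract the links. Here is my code: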

import sys  
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from lxml import html

# Take this class for granted. Just use the result of rendering.
class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

url = 'http://pastebin.com/search?q=ssh'
r = Render(url)
result = r.frame.toHtml()
formatted_result = str(result.toAscii())
tree = html.fromstring(formatted_result)
archive_links = tree.xpath('//a[@class="gs-title"]/@data-ctoring')
for i in archive_links:
    print i

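The result: I get nothing.

Best answer

Ideally, you should consider using the Pastebin API; a Python wrapper is available for it.

An alternative approach is browser automation via selenium. Working code that prints the search result links: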

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Firefox()
driver.get("http://pastebin.com/search?q=ssh")

# wait for the search results to be loaded
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".gsc-result-info")))

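# get all search result links
# (find_elements with By.CSS_SELECTOR replaces find_elements_by_css_selector,
# which was removed in Selenium 4)
for link in driver.find_elements(By.CSS_SELECTOR, ".gsc-results .gsc-result a.gs-title"):
    print(link.get_attribute("href"))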
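
Once the rendered HTML is available as a string (for example from Selenium's `driver.page_source`), the links can also be extracted offline. Below is a minimal sketch using only the standard library's `html.parser`. It assumes the result links are `<a>` tags carrying the `gs-title` class, as in the selectors above. The names `TitleLinkParser`, `extract_title_links` and `SAMPLE_HTML` are hypothetical, and the sample markup is invented for illustration rather than copied from Pastebin.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect href values of <a> tags whose class list contains 'gs-title'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # The class attribute may be missing or empty, so fall back to "".
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if "gs-title" in classes and href:
            self.links.append(href)


def extract_title_links(html_text):
    """Return the hrefs of all 'gs-title' links found in html_text, in document order."""
    parser = TitleLinkParser()
    parser.feed(html_text)
    return parser.links


# Hypothetical sample markup, standing in for already-rendered page source.
SAMPLE_HTML = """
<div class="gsc-results">
  <div class="gsc-result">
    <a class="gs-title" href="http://pastebin.com/abc123">first</a>
  </div>
  <div class="gsc-result">
    <a class="gs-title extra" href="http://pastebin.com/def456">second</a>
    <a class="other" href="http://example.com/ignored">ignored</a>
  </div>
</div>
"""

if __name__ == "__main__":
    for link in extract_title_links(SAMPLE_HTML):
        print(link)
```

This only helps after the JavaScript has finished populating the results, so the explicit wait in the selenium code is still needed before reading the page source.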

Regarding "javascript - Python Crawling Pastebin (JavaScript-rendered web pages)", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36547264/