javascript - Scraping a JavaScript page with PyQt5 and QWebEngineView

Reposted. Author: 行者123. Updated: 2023-12-03 07:14:04

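I am trying to render a JavaScript-driven web page into populated HTML for scraping. Researching different solutions (Selenium, reverse-engineering the page, etc.) led me to this technique, but I can't get it to work. By the way, I'm new to Python and basically at the cut/paste/experiment stage. I had installation and indentation problems earlier, but now I'm stuck.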

In the test code below, print(sample_html) works and returns the raw HTML of the target page, but print(render(sample_html)) always returns the word "None".

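Interestingly, if you run it against amazon.com, they detect that it isn't a real browser and return HTML with a warning about automated access. The other test pages, however, serve real HTML that should render, but it doesn't.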

How can I fix the result always being "None"?

def render(source_html):
    """Fully render HTML, JavaScript and all."""

    import sys
    from PyQt5.QtWidgets import QApplication
    from PyQt5.QtWebEngineWidgets import QWebEngineView

    class Render(QWebEngineView):
        def __init__(self, html):
            self.html = None
            self.app = QApplication(sys.argv)
            QWebEngineView.__init__(self)
            self.loadFinished.connect(self._loadFinished)
            self.setHtml(html)
            self.app.exec_()

        def _loadFinished(self, result):
            # This is an async call, you need to wait for this
            # to be called before closing the app
            self.page().toHtml(self.callable)

        def callable(self, data):
            self.html = data
            # Data has been stored, it's safe to quit the app
            self.app.quit()

    return Render(source_html).html

import requests
#url = 'http://webscraping.com'
#url='http://www.amazon.com'
url='https://www.ncbi.nlm.nih.gov/nuccore/CP002059.1'
sample_html = requests.get(url).text
print(sample_html)
print(render(sample_html))

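Edit: Thanks for the reply, which I have incorporated into the code. But now it raises an error and the script hangs until I kill the Python launcher, which then causes a segfault: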

Here is the revised code:

def render(source_url):
    """Fully render HTML, JavaScript and all."""

    import sys
    from PyQt5.QtWidgets import QApplication
    from PyQt5.QtCore import QUrl
    from PyQt5.QtWebEngineWidgets import QWebEngineView

    class Render(QWebEngineView):
        def __init__(self, url):
            self.html = None
            self.app = QApplication(sys.argv)
            QWebEngineView.__init__(self)
            self.loadFinished.connect(self._loadFinished)
            # self.setHtml(html)
            self.load(QUrl(url))
            self.app.exec_()

        def _loadFinished(self, result):
            # This is an async call, you need to wait for this
            # to be called before closing the app
            self.page().toHtml(self._callable)

        def _callable(self, data):
            self.html = data
            # Data has been stored, it's safe to quit the app
            self.app.quit()

    return Render(source_url).html

# url = 'http://webscraping.com'
# url='http://www.amazon.com'
url = "https://www.ncbi.nlm.nih.gov/nuccore/CP002059.1"
print(render(url))

It throws these errors:

$ python3 -tt fees-pkg-v2.py
Traceback (most recent call last):
  File "fees-pkg-v2.py", line 30, in _callable
    self.html = data
AttributeError: 'method' object has no attribute 'html'
None (hangs here until force-quit python launcher)
Segmentation fault: 11
$

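I've started reading up on Python classes to fully understand what I'm doing (always a good thing). I suspect there may be a problem in my environment (OSX Yosemite, Python 3.4.3, Qt5.4.1, sip-4.16.6). Any other suggestions?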
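
For comparison, the asynchronous pattern described in the `_loadFinished` comments (wait for `loadFinished`, call `toHtml` with a callback, and leave the event loop only once the callback has stored the data) can also be written without subclassing the view. The following is a minimal sketch, not the asker's code and not a fix confirmed by this thread: the names `render_url` and `timeout_ms` are made up for illustration, and it assumes a working PyQt5 installation that includes QtWebEngine.

```python
import sys


def render_url(url, timeout_ms=30000):
    """Load `url` in QtWebEngine and return the rendered HTML, or None on failure."""
    # Imported inside the function so that merely defining it does not need PyQt5.
    # QtWebEngineWidgets has to be imported before the QApplication is created.
    from PyQt5.QtCore import QEventLoop, QTimer, QUrl
    from PyQt5.QtWebEngineWidgets import QWebEnginePage
    from PyQt5.QtWidgets import QApplication

    # Reuse an existing application object if there is one; keep a reference alive.
    app = QApplication.instance() or QApplication(sys.argv)

    page = QWebEnginePage()
    loop = QEventLoop()
    result = {"html": None}

    def store_html(html):
        # toHtml() is asynchronous: it calls back here once the markup is ready.
        result["html"] = html
        loop.quit()

    def on_load_finished(ok):
        if ok:
            page.toHtml(store_html)
        else:
            # The load failed; stop waiting and return None.
            loop.quit()

    page.loadFinished.connect(on_load_finished)
    # Hypothetical safety net: give up after timeout_ms instead of hanging forever.
    QTimer.singleShot(timeout_ms, loop.quit)
    page.load(QUrl(url))
    loop.exec_()
    return result["html"]
```

Calling `print(render_url("https://www.ncbi.nlm.nih.gov/nuccore/CP002059.1"))` would mirror the test at the end of the revised code. Using a local `QEventLoop` rather than `app.exec_()` inside `__init__` is a design choice of this sketch: construction and waiting stay separate, and the timeout gives the script a way out if the callback never fires.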

Best answer

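The problem was the environment. I had installed Python 3.4.3, Qt5.4.1, and sip-4.16.6 manually and must have botched something. After installing Anaconda, the script started working. Thanks again.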
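
Since the fix turned out to be environmental, it can help to check which interpreter and which PyQt5/sip installation a script actually picks up before debugging the code itself. Below is a small sketch of such a check; `describe_environment` is a hypothetical helper name, and the default module list is only an assumption based on the packages named in the question.

```python
import importlib.util
import platform
import sys


def describe_environment(modules=("PyQt5", "sip")):
    """Return a dict describing the interpreter and where each module would load from."""
    report = {
        "python": platform.python_version(),
        "executable": sys.executable,
        "platform": platform.platform(),
    }
    for name in modules:
        try:
            # find_spec locates a top-level module without importing it.
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            spec = None
        report[name] = str(spec.origin) if spec is not None else "not found"

    try:
        # These two constants report the Qt and PyQt versions in use.
        from PyQt5.QtCore import PYQT_VERSION_STR, QT_VERSION_STR
        report["Qt"] = QT_VERSION_STR
        report["PyQt"] = PYQT_VERSION_STR
    except ImportError:
        pass  # PyQt5 is missing or broken; the entries above already show that.
    return report


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running this with the same `python3` that runs the scraper shows whether the paths point at the manual installation or at the Anaconda one.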

Regarding "javascript - Scraping a JavaScript page with PyQt5 and QWebEngineView", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/45265143/
