python - Scraping an onclick page with scrapyjs and Splash


I'm trying to get the URL from a page that uses JavaScript like this:

<span onclick="go1()">click here </span>
<script>
function go1(){
    window.location = "../innerpages/" + myname + ".php";
}
</script>

Here is my code, using scrapyjs with Splash:

def start_requests(self):
    for url in self.start_urls:
        yield Request(url, self.parse, meta={
            'splash': {
                'endpoint': 'render.html',
                'args': {
                    'wait': 4,
                    'html': 1,
                    'png': 1,
                    'render_all': 1,
                    'js_source': 'document.getElementsByTagName("span")[0].click()',
                },
            }
        })

If I write

'js_source': 'document.title="hello world"'

it works.

It seems I can manipulate text inside the page, but I can't get the URL produced by go1().

What should I do if I want to get the URL from go1()?

Thanks!

Best Answer

You can use the /execute endpoint:

import json

import scrapy


class MySpider(scrapy.Spider):
    ...

    def start_requests(self):
        script = """
        function main(splash)
            local url = splash.args.url
            assert(splash:go(url))
            assert(splash:wait(1))

            assert(splash:runjs('document.getElementsByTagName("span")[0].click()'))
            assert(splash:wait(1))

            -- return result as a JSON object
            return {
                html = splash:html()
            }
        end
        """
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse_result, meta={
                'splash': {
                    'args': {'lua_source': script},
                    'endpoint': 'execute',
                }
            })

    def parse_result(self, response):
        # fetch the base URL, because response.url is the Splash endpoint
        baseurl = response.meta["_splash_processed"]["args"]["url"]

        # decode the JSON response (body_as_unicode() is spelled
        # response.text in newer Scrapy versions)
        splash_json = json.loads(response.body_as_unicode())

        # and build a new selector from the "html" key of that object
        selector = scrapy.Selector(text=splash_json["html"], type="html")

        ...
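
Since go1() navigates via a relative window.location, any links you extract from the rendered HTML will usually be relative as well; that is what baseurl is for. A minimal sketch of that last step, continuing parse_result (the a::attr(href) selector and the self.parse callback are illustrative assumptions, not part of the original answer):

from urllib.parse import urljoin

# hypothetical continuation of parse_result: resolve relative hrefs
# against the original page URL, since response.url here is the
# Splash endpoint rather than the scraped page
for href in selector.css('a::attr(href)').extract():
    yield scrapy.Request(urljoin(baseurl, href), self.parse)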

Regarding python - scraping an onclick page with scrapyjs and Splash, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/35052999/
