r - What is the correct XPath to scrape this web page?

Reposted. Author: 行者123. Updated: 2023-12-03 15:49:30

I am trying to get the list of selectors shown on this page:

$("#Lastname"),$(".intro"),....

Here is my attempt using xpathSApply:

library(XML)
library(RCurl)
a <- getURL('http://www.w3schools.com/jquery/trysel.asp')
doc <- htmlParse(a)
xpathSApply(doc,'//*[@id="selectorOptions"]') ## I can't get the right xpath

I also tried this, without success:

xpathSApply(doc,'//*[@id="selectorOptions"]/div[i]')

Edit: I added the python tag because I would also accept a Python solution.

Best answer

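Here is an R approach to scraping a JavaScript-driven page like this one. As @Peyton pointed out, you will need to use a browser. A Selenium server is a good way to control the browser. I have written some R bindings for Selenium Server: https://github.com/johndharrison/RSelenium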
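To see why the XPath attempts in the question come back empty, here is a minimal sketch. It assumes the selectorOptions container is present but empty in the HTML as served and is only filled in by JavaScript once the page runs in a browser. The inline HTML string is a made-up stand-in for illustration, not the real page source.

```r
library(XML)

# Hypothetical stand-in for the HTML as served: the container exists,
# but its children would only be added later by JavaScript in a browser.
static_html <- '<html><body><div id="selectorOptions"></div></body></html>'

doc <- htmlParse(static_html, asText = TRUE)

# The container itself is found ...
container <- getNodeSet(doc, '//*[@id="selectorOptions"]')
length(container)

# ... but it has no child <div> elements, so there is nothing to extract.
children <- getNodeSet(doc, '//*[@id="selectorOptions"]/div')
length(children)
```

Under that assumption the XPath is fine; the content simply is not in the document that getURL downloads, which is why a real browser is needed.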

The following gives access to the post-JavaScript source, that is, the page as it stands after its scripts have run:

require(devtools)
devtools::install_github("johndharrison/RSelenium") # current "user/repo" form of install_github
library(RSelenium)
library(RJSONIO)

# one needs to have an active server running
# the following commented out lines source the latest java binary
# RSelenium::checkForServer()
# RSelenium::startServer()
# a selenium server is assumed to be running now

remDR <- remoteDriver$new()
remDR$open() # opens a browser usually firefox with default settings
remDR$navigate('http://www.w3schools.com/jquery/trysel.asp') # navigate to your page
webElem <- remDR$findElements(using = "xpath", value = "//*[@id='selectorOptions']") # find your elements

# display the appropriate quantities
cat(fromJSON(webElem[[1]]$getElementText())$value)
> cat(fromJSON(webElem[[1]]$getElementText())$value)
$("#Lastname")
$(".intro")
$(".intro, #Lastname")
$("h1")
$("h1, p")
$("p:first")
$("p:last")
$("tr:even")
$("tr:odd")
$("p:first-child")
$("p:first-of-type")
$("p:last-child")
$("p:last-of-typ
.....................

Update:

A simpler way to get at the information in this case is to use the executeScript method:

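library(RSelenium)
RSelenium::startServer()     # start a Selenium server
remDr <- remoteDriver$new()  # create the driver object before using it
remDr$open()
remDr$navigate('http://www.w3schools.com/jquery/trysel.asp')
remDr$executeScript("return w3Sels;")[[1]] # w3Sels is a JavaScript variable defined by the page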

> remDr$executeScript("return w3Sels;")[[1]]
[1] "#Lastname" ".intro"
[3] ".intro, #Lastname" "h1"
[5] "h1, p" "p:first"
[7] "p:last" "tr:even"
[9] "tr:odd" "p:first-child"
[11] "p:first-of-type" "p:last-child"
[13] "p:last-of-type" "li:nth-child(1)"
[15] "li:nth-last-child(1)" "li:nth-of-type(2)"
[17] "li:nth-last-of-type(2)" "b:only-child"
[19] "h3:only-of-type" "div > p"
[21] "div p" "ul + h3"
[23] "ul ~ table" "ul li:eq(0)"
[25] "ul li:gt(0)" "ul li:lt(2)"
[27] ":header" ":header:not(h1)"
[29] ":animated" ":focus"
[31] ":contains(Duck)" "div:has(p)"
[33] ":empty" ":parent"
[35] "p:hidden" "table:visible"
[37] ":root" "p:lang(it)"
[39] "[id]" "[id=my-Address]"
[41] "p[id!=my-Address]" "[id$=ess]"
[43] "[id|=my]" "[id^=L]"
[45] "[title~=beautiful]" "[id*=s]"
[47] ":input" ":text"
[49] ":password" ":radio"
[51] ":checkbox" ":submit"
[53] ":reset" ":button"
[55] ":image" ":file"
[57] ":enabled" ":disabled"
[59] ":selected" ":checked"
[61] "*"
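The question asked for the selectors in the form $("#Lastname"), while executeScript returns the bare selector strings. A minimal sketch of the conversion follows. It uses the first few values copied from the output above, so it runs without a browser; the names sels and jq are introduced here for illustration only.

```r
# First few selectors copied from the output above. In a live session this
# would instead come from remDr$executeScript("return w3Sels;")[[1]].
sels <- c("#Lastname", ".intro", ".intro, #Lastname", "h1")

# unlist() is a precaution in case the result arrives as a list
# rather than a character vector.
sels <- unlist(sels)

# Wrap each selector in $("...") to match the form asked for in the question.
jq <- sprintf('$("%s")', sels)
cat(jq, sep = "\n")
```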

Regarding "r - What is the correct XPath to scrape this web page?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/20206146/
