gpt4 book ai didi

r - 如何使用 httr 或 rvest 提交似乎由 JavaScript 处理的表单?

转载 作者:行者123 更新时间:2023-12-04 09:49:03 25 4
gpt4 key购买 nike

我正在尝试以编程方式搜索网站,但提交按钮功能似乎主要由 JavaScript 提供支持。我不太熟悉这是如何工作的,所以我可能是错的。

这是我正在使用的代码:

library(rvest)

BASE_URL = 'https://mdocweb.state.mi.us/otis2/otis2.aspx'
PARAMS = list(txtboxLName='Smith',
drpdwnGender='Either',
drpdwnRace='All',
drpdwnStatus='All',
submit='btnSearch')

# rvest approach
s = html_session(BASE_URL)
form = html_form(s)[[1]]
form = set_values(form, PARAMS)
resp = submit_form(s, form, submit='btnSearch') # This gives an error

# httr approach
resp = httr::POST(BASE_URL, body=PARAMS, encode='form')
html = httr::content(resp) # This just returns that same page I was on

按钮的 HTML 如下所示:
<input type="submit" name="btnSearch" value="Search" onclick="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(&quot;btnSearch&quot;, &quot;&quot;, true, &quot;&quot;, &quot;&quot;, false, false))" language="javascript" id="btnSearch" style="width:100px;">

鉴于 onclick属性,我未受过教育的假设是 JavaScript 的使用干扰了我的方法。但同样,我并不完全理解这一切是如何运作的,所以我可能是错的。

无论哪种方式,我如何实现我的目标,如果有的话,使用 rvesthttr ,但不是 RSelenium ?另外,如果这可以在 Python 中实现,我也会接受。

最佳答案

我们首先需要获取原始搜索页面,因为这是一个共享站点(或类似站点),我们需要一些隐藏的表单字段以供稍后使用:

library(httr)
library(rvest)
library(tidyverse)

pre_pg <- read_html("https://mdocweb.state.mi.us/otis2/otis2.aspx")

setNames(
html_nodes(pre_pg, "input[type='hidden']") %>% html_attr("value"),
html_nodes(pre_pg, "input[type='hidden']") %>% html_attr("name")
) -> hidden

str(hidden)
## Named chr [1:3] "x62pLbphYWUDXsdoNdBBNrxqyHHI+K06BzjFwdP3Uooafgey2uG1gLWxzh07djRxiQR724uplZFAI8klbq6HCSkmrp8jP15EMwvkDM/biUEuQrf"| __truncated__ ...
## - attr(*, "names")= chr [1:3] "__VIEWSTATE" "__VIEWSTATEGENERATOR" "__EVENTVALIDATION"

现在,我们需要像表单一样使用 HTTP POST提交:
POST(
url = "https://mdocweb.state.mi.us/otis2/otis2.aspx",
add_headers(
Origin = "https://mdocweb.state.mi.us",
`User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.52 Safari/537.36",
Referer = "https://mdocweb.state.mi.us/otis2/otis2.aspx"
),
body = list(
`__EVENTTARGET` = "",
`__EVENTARGUMENT` = "",
`__VIEWSTATE` = hidden["__VIEWSTATE"],
`__VIEWSTATEGENERATOR` = hidden["__VIEWSTATEGENERATOR"],
`__EVENTVALIDATION` = hidden["__EVENTVALIDATION"],
txtboxLName = "Smith",
txtboxFName = "",
txtboxMDOCNum = "",
drpdwnGender = "Either",
drpdwnRace = "All",
txtboxAge = "",
drpdwnStatus = "All",
txtboxMarks = "",
btnSearch = "Search"
),
encode = "form"
) -> res

一分钟后我们将需要这个辅助函数:
mcga <- function(x) {
x <- tolower(x)
x <- gsub("[[:punct:][:space:]]+", "_", x)
x <- gsub("_+", "_", x)
x <- gsub("(^_|_$)", "", x)
make.unique(x, sep = "_")
}

现在,我们需要结果页面中的 HTML:
pg <- content(res, as="parsed")

不幸的是,“表”真的是一套 <div> s。但是,它以编程方式生成并且非常统一。我们不想输入太多,所以让我们先获取稍后将使用的列名:
col_names <- html_nodes(pg, "a.headings") %>% html_text(trim=TRUE) %>% mcga()
## [1] "offender_number" "last_name" "first_name"
## [4] "date_of_birth" "sex" "race"
## [7] "mcl_number" "location" "status"
## [10] "parole_board_jurisdiction_date" "maximum_date" "date_paroled"

该网站非常好,因为它通过提供屏幕阅读器提示来容纳残疾人。不幸的是,这给抓取带来了麻烦,因为我们要么必须冗长地使用值定位标签,要么稍后清理文本。值得庆幸的是, xml2 📦 现在可以移除节点:
xml_find_all(pg, ".//div[@class='screenReaderOnly']") %>% xml_remove()
xml_find_all(pg, ".//span[@class='visible-phone']") %>% xml_remove()

我们现在可以收集所有的罪犯记录 <div> “行”:
records <- html_nodes(pg, "div.offenderRow")

并且,简洁地将它们放入一个数据框中:
map(sprintf(".//div[@class='span1 searchCol%s']", 1:12), ~{
html_nodes(records, xpath=.x) %>% html_text(trim=TRUE)
}) %>%
set_names(col_names) %>%
bind_cols() %>%
readr::type_convert() -> xdf

xdf
## # A tibble: 25 x 12
## offender_number last_name first_name date_of_birth sex race mcl_number location status
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 544429 SMITH AARICK 12/03/1967 M White 333.74012D3 Gladwin Parole
## 2 210262 SMITH AARON 05/27/1972 M Black <NA> <NA> Dischrg
## 3 372965 SMITH AARON 09/16/1973 M White <NA> <NA> Dischrg
## 4 413411 SMITH AARON 07/13/1973 M Black <NA> <NA> Dischrg
## 5 618210 SMITH AARON 10/12/1984 M Black <NA> <NA> Dischrg
## 6 675823 SMITH AARON 05/19/1989 M Black 333.74032A5 Det Lahser Prob Prob
## 7 759548 SMITH AARON 06/19/1990 M Black <NA> <NA> Dischrg
## 8 763189 SMITH AARON 07/15/1976 M White 333.74032A5 Mt. Pleasant Prob
## 9 854557 SMITH AARON 12/27/1973 M White <NA> <NA> Dischrg
## 10 856804 SMITH AARON 02/24/1989 M White 750.110A2 Harrison CF Prison
## # ... with 15 more rows, and 3 more variables: parole_board_jurisdiction_date <chr>, maximum_date <chr>,
## # date_paroled <chr>

glimpse(xdf)
## Observations: 25
## Variables: 12
## $ offender_number <int> 544429, 210262, 372965, 413411, 618210, 675823, 759548, 763189, 854557, 85...
## $ last_name <chr> "SMITH", "SMITH", "SMITH", "SMITH", "SMITH", "SMITH", "SMITH", "SMITH", "S...
## $ first_name <chr> "AARICK", "AARON", "AARON", "AARON", "AARON", "AARON", "AARON", "AARON", "...
## $ date_of_birth <chr> "12/03/1967", "05/27/1972", "09/16/1973", "07/13/1973", "10/12/1984", "05/...
## $ sex <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M",...
## $ race <chr> "White", "Black", "White", "Black", "Black", "Black", "Black", "White", "W...
## $ mcl_number <chr> "333.74012D3", NA, NA, NA, NA, "333.74032A5", NA, "333.74032A5", NA, "750....
## $ location <chr> "Gladwin", NA, NA, NA, NA, "Det Lahser Prob", NA, "Mt. Pleasant", NA, "Har...
## $ status <chr> "Parole", "Dischrg", "Dischrg", "Dischrg", "Dischrg", "Prob", "Dischrg", "...
## $ parole_board_jurisdiction_date <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "11/28/2024", "03/25/2016", NA, NA, NA...
## $ maximum_date <chr> NA, "09/03/2015", "06/29/2016", "10/02/2017", "05/19/2017", "07/18/2019", ...
## $ date_paroled <chr> "11/15/2016", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...

我曾希望 type_convert wld 提供了更好的转换,尤其是日期列,但它没有并且可能会被消除。

现在,您需要对结果页面做更多的工作,因为结果是分页的。谢天谢地,你知道页面信息:
xml_integer(html_nodes(pg, "span#lblPgCurrent"))
## [1] 1

xml_integer(html_nodes(pg, "span#lblTotalPgs"))
## [1] 101

你将不得不再次做“隐藏”的舞蹈:
html_nodes(pg, "input[type='hidden']")

(按照上面的引用来处理)并重新调整一个新的 POST只有那些隐藏字段和一个表单元素的调用: btnNext = 'Next' .您需要在分页结果集中的所有单个页面上重复此操作,然后最后 bind_rows()一切。

我应该补充一点,当您弄清楚分页工作流程时,请从新的空白搜索页面抓取开始。共享点服务器似乎配置了一个非常小的 View 状态 session 缓存超时,如果在迭代之间等待太长时间,代码将中断。

更新

我有点想确保最后一点建议有效,所以是这样的:
library(httr)
library(rvest)
library(tidyverse)

mcga <- function(x) {
x <- tolower(x)
x <- gsub("[[:punct:][:space:]]+", "_", x)
x <- gsub("_+", "_", x)
x <- gsub("(^_|_$)", "", x)
make.unique(x, sep = "_")
}

start_search <- function(last_name) {

pre_pg <- read_html("https://mdocweb.state.mi.us/otis2/otis2.aspx")

setNames(
html_nodes(pre_pg, "input[type='hidden']") %>% html_attr("value"),
html_nodes(pre_pg, "input[type='hidden']") %>% html_attr("name")
) -> hidden

POST(
url = "https://mdocweb.state.mi.us/otis2/otis2.aspx",
add_headers(
Origin = "https://mdocweb.state.mi.us",
`User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.52 Safari/537.36",
Referer = "https://mdocweb.state.mi.us/otis2/otis2.aspx"
),
body = list(
`__EVENTTARGET` = "",
`__EVENTARGUMENT` = "",
`__VIEWSTATE` = hidden["__VIEWSTATE"],
`__VIEWSTATEGENERATOR` = hidden["__VIEWSTATEGENERATOR"],
`__EVENTVALIDATION` = hidden["__EVENTVALIDATION"],
txtboxLName = last_name,
txtboxFName = "",
txtboxMDOCNum = "",
drpdwnGender = "Either",
drpdwnRace = "All",
txtboxAge = "",
drpdwnStatus = "All",
txtboxMarks = "",
btnSearch = "Search"
),
encode = "form"
) -> res

content(res, as="parsed")

}

extract_results <- function(results_pg) {

col_names <- html_nodes(results_pg, "a.headings") %>% html_text(trim=TRUE) %>% mcga()

xml_find_all(results_pg, ".//div[@class='screenReaderOnly']") %>% xml_remove()

xml_find_all(results_pg, ".//span[@class='visible-phone']") %>% xml_remove()

records <- html_nodes(results_pg, "div.offenderRow")

map(sprintf(".//div[@class='span1 searchCol%s']", 1:12), ~{
html_nodes(records, xpath=.x) %>% html_text(trim=TRUE)
}) %>%
set_names(col_names) %>%
bind_cols()

}

current_page_number <- function(results_pg) {
xml_integer(html_nodes(results_pg, "span#lblPgCurrent"))
}

last_page_number <- function(results_pg) {
xml_integer(html_nodes(results_pg, "span#lblTotalPgs"))
}

scrape_status <- function(results_pg) {

cur <- current_page_number(results_pg)
tot <- last_page_number(results_pg)

message(sprintf("%s of %s", cur, tot))

}

next_page <- function(results_pg) {

cur <- current_page_number(results_pg)
tot <- last_page_number(results_pg)

if (cur == tot) return(NULL)

setNames(
html_nodes(results_pg, "input[type='hidden']") %>% html_attr("value"),
html_nodes(results_pg, "input[type='hidden']") %>% html_attr("name")
) -> hidden

POST(
url = "https://mdocweb.state.mi.us/otis2/otis2.aspx",
add_headers(
Origin = "https://mdocweb.state.mi.us",
`User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.52 Safari/537.36",
Referer = "https://mdocweb.state.mi.us/otis2/otis2.aspx"
),
body = list(
`__EVENTTARGET` = hidden["__EVENTTARGET"],
`__EVENTARGUMENT` = hidden["__EVENTARGUMENT"],
`__VIEWSTATE` = hidden["__VIEWSTATE"],
`__VIEWSTATEGENERATOR` = hidden["__VIEWSTATEGENERATOR"],
`__EVENTVALIDATION` = hidden["__EVENTVALIDATION"],
btnNext = 'Next'
),
encode = "form"
) -> res

content(res, as="parsed")

}

curr_pg <- start_search("smith")
results_df <- extract_results(curr_pg)

pb <- progress_estimated(last_page_number(curr_pg)-1)

repeat{

scrape_status(curr_pg) # optional esp since we have a progress bar

pb$tick()$print()

curr_pg <- next_page(curr_pg)

if (is.null(curr_pg)) break

results_df <- bind_rows(results_df, extract_results(next_pg))

Sys.sleep(5) # be kind

}

希望您能跟上,但 shd 会为您获取给定搜索词的所有页面。

关于r - 如何使用 httr 或 rvest 提交似乎由 JavaScript 处理的表单?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/46224378/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com