- html - 出于某种原因,IE8 对我的 Sass 文件中继承的 html5 CSS 不友好?
- JMeter 在响应断言中使用 span 标签的问题
- html - 在 :hover and :active? 上具有不同效果的 CSS 动画
- html - 相对于居中的 html 内容固定的 CSS 重复背景?
我想从以下位置抓取瑞士政府为大学研究项目提供的药物信息:
http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue=
该页面确实提供了一个robotx.txt 文件,但是,它的内容对公众免费提供,我认为抓取这些数据是不受禁止的。
这是更新 of this question ,因为我取得了一些进展。
到目前为止我取得了什么
# opens the first results page
# opens the first link as a table at the end of the page
library("rvest")
library("dplyr")
url <- "http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue="
pgsession<-html_session(url)
pgform<-html_form(pgsession)[[1]]
page<-rvest:::request_POST(pgsession,url,
body=list(
`ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=1,
`__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
`__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
`__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
`__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
`ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
`__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
`__EVENTARGUMENT`=""
),
encode="form")
# makes a table of all results of the first page
read_html(page) %>%
html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
html_table(fill=TRUE) %>%
bind_rows %>%
tibble()
# gives the desired informations (=additional data) of the first drug (not yet very structured)
read_html(page) %>%
html_nodes(xpath = '//*[@id="ctl00_cphContent_fvwPreparation"]') %>%
html_text
# if I open the second search page
page<-rvest:::request_POST(pgsession,url,
body=list(
`ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=2,
`__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
`__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
`__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
`__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
`ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
`__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
`__EVENTARGUMENT`=""
),
encode="form")
# I get easily a table with the new results
read_html(page) %>%
html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
html_table(fill=TRUE) %>%
bind_rows %>%
tibble()
# does not give the desired output:
read_html(page) %>%
html_nodes(xpath = '//*[@id="ctl00_cphContent_fvwPreparation"]') %>%
html_text
__VIEWSTATE
那可能request_POST
? 最佳答案
我认为你只是想多了这个问题。问题出在 xpath
.本质上是 xpath
您用于数据提取的所有页面都相同。是的,//*[@id="ctl00_cphContent_gvwPreparations"]
您的代码中唯一发生变化的组件是 txtPageNumber
.在下面的代码中,我更改了 txtPageNumber
至 3
,喜欢,txtPageNumber=3
我建议你的重点应该放在类似的东西上,如何自动化页码以进行数据提取? .这样,您就不必手动更改 txtPageNumber
在
page<-rvest:::request_POST(pgsession,url,
body=list(
`ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=3,
`__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
`__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
`__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
`__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
`ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
`__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
`__EVENTARGUMENT`=""
),
encode="form")
library(rvest)
library(dplyr)
url <- "http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue="
pgsession<-html_session(url)
pgform<-html_form(pgsession)[[1]]
page<-rvest:::request_POST(pgsession,url,
body=list(
`ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=3,
`__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
`__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
`__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
`__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
`ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
`__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
`__EVENTARGUMENT`=""
),
encode="form")
# makes a table of all results of the first page
read_html(page) %>%
html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
html_table(fill=TRUE) %>%
bind_rows %>%
tibble()
# A tibble: 11 x 1
.$`` $Präparat $`Galen. Form /~ $Packung $FAP $PP $SB $`Lim-Pkt` $Lim
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 21. Accolate Tabl 20 mg 60 Stk 29.75 50.55 "" "" ""
2 22. Accupaque Inj Lös 300 mg Plast F~ 32.00 53.10 "" "" ""
3 23. Accupaque Inj Lös 300 mg Plast F~ 61.15 86.60 "" "" ""
4 24. Accupaque Inj Lös 300 mg Plast F~ 120.~ 154.~ "" "" ""
5 25. Accupaque Inj Lös 350 mg Plast F~ 33.97 55.35 "" "" ""
6 26. Accupaque Inj Lös 350 mg Plast F~ 66.88 93.20 "" "" ""
7 27. Accupaque Inj Lös 350 mg Plast F~ 129.~ 164.~ "" "" ""
8 28. Accupro ~ Filmtabl 10 mg 30 Stk 8.56 18.00 "" "" ""
9 29. Accupro ~ Filmtabl 10 mg 100 Stk 26.60 46.90 "" "" ""
10 30. Accupro ~ Filmtabl 20 mg 30 Stk 14.02 28.35 "" "" ""
11 "Ein~ "Einträg~ "Einträge pro S~ "Einträ~ "Ein~ "Ein~ "Ein~ "Einträge~ "Ein~
# ... with 9 more variables: $`Swissmedic-Code` <chr>, $Zulassungsinhaberin <chr>,
# $Wirkstoff <chr>, $`BAG-Dossier` <chr>, $Aufnahme <chr>, $`Befr. AufnahmeBefr.
# Limitation` <chr>, $`O/G` <chr>, $`IT-Code` <chr>, $`ATC-Code` <chr>
# gives the desired informations of the first drug (not yet very structured)
read_html(page) %>%
html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
html_text %>%
head(10)
[1] " PräparatGalen. Form / DosierungPackungFAPPPSBLim-PktLimSwissmedic-CodeZulassungsinhaberinWirkstoffBAG-DossierAufnahmeBefr. AufnahmeBefr. LimitationO/GIT-CodeATC-Code\r\n\t\t\t\t\r\n 21.\r\n \r\n Accolate\r\n \r\n Tabl 20 mg \r\n \r\n 60 Stk\r\n \r\n 29.75\r\n \r\n 50.55\r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n 53750036\r\n \r\n AstraZeneca AG\r\n \r\n Zafirlukastum\r\n \r\n 17053\r\n \r\n 15.03.1998\r\n \r\n \r\n \r\n \r\n \r\n \r\n 03.04.50.\r\n \r\n R03DC01\r\n \r\n\t\t\t\t\r\n 22.\r\n \r\n Accupaque\r\n \r\n
关于R:POST 后抓取附加数据仅适用于第一页,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/56068532/
SCENARIO 有两页,第一页是HomePage,它在flutter_bloc软件包的帮助下自动获取api数据。在首页(第一页)中,还有一个按钮,可在此代码Navigator.push(contex
我检查过类似的问题,但由其他人发布,但我仍然看不到我的代码有什么问题。 我刚刚从文档中复制了它 - https://symfony.com/doc/3.4/page_creation.html Luc
我已经编写了一段代码,使用Python中的Scrapy来抓取页面。下面我粘贴了 main.py 代码。但是,每当我运行我的蜘蛛时,它仅从第一页抓取(DEBUG:从抓取),这也是请求中的Referer标
我创建了一个 ios 图书阅读器应用程序。在这个应用程序中,我集成了 google drive 和 skydrive 。现在我可以从 google drive 和 skydrive 登录和检索数据了。
效果图: 功能简介:可使用上下键选中行,选中后点击修改,textbox获得gridview中的代码的数据。对你有帮助的话,请记得要点击“好文要顶”哦!!!不懂的,请留言。废话不多说了,贴码如下
我是一名优秀的程序员,十分优秀!