R: Scraping additional data after POST only works for first page


Reposted by: 行者123. Updated: 2023-12-04 12:37:23

For a university research project, I would like to scrape drug information provided by the Swiss government from:

http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue=

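The page does provide a robots.txt file; however, its content is made freely available to the public, and I assume that scraping this data is not prohibited.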

This is an update of this question, since I have made some progress.

What I have achieved so far

# opens the first results page 
# opens the first link as a table at the end of the page

library("rvest")
library("dplyr")


url <- "http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue="
pgsession<-html_session(url)
pgform<-html_form(pgsession)[[1]]

page<-rvest:::request_POST(pgsession,url,
                           body=list(
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=1,
                             `__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
                             `__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
                             `__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
                             `__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
                             `__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
                             `__EVENTARGUMENT`=""

                             ),
                           encode="form")

Next: get the basic data

# makes a table of all results of the first page

read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
  html_table(fill=TRUE) %>% 
  bind_rows %>%
  tibble()

Next step: get the additional data

# gives the desired information (= additional data) for the first drug (not yet very structured)

read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_fvwPreparation"]') %>%
  html_text

My problem:

# if I open the second  search page

page<-rvest:::request_POST(pgsession,url,
                           body=list(
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=2,
                             `__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
                             `__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
                             `__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
                             `__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
                             `__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
                             `__EVENTARGUMENT`=""

                             ),
                           encode="form")

Next: get the new basic data

# I get easily a table with the new results

read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
  html_table(fill=TRUE) %>% 
  bind_rows %>%
  tibble()

However, if I try to get the new additional data, I again get the results from page 1:

# does not give the desired output:

read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_fvwPreparation"]') %>%
  html_text

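What I am looking for: the details of the first drug on page 2

Questions:

Why do I get duplicate results? Is it because the __VIEWSTATE may change during request_POST?

Is there a way to solve this?

Is there a better way to get both the basic data and the additional data? If so, how?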

Best answer

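I think you are just overthinking the problem. The issue is with the xpath. Essentially, the xpath you are using for data extraction is the same for all pages. Yes, it is //*[@id="ctl00_cphContent_gvwPreparations"]. The only component that changes in your code is txtPageNumber. In the code below, I changed txtPageNumber to 3, that is, txtPageNumber=3. I would suggest that your focus should be on something like "How can I automate the page number for data extraction?". That way, you would not have to change txtPageNumber manually in the request below: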

page<-rvest:::request_POST(pgsession,url,
                           body=list(
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=3,
                             `__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
                             `__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
                             `__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
                             `__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
                             `__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
                             `__EVENTARGUMENT`=""

                           ),
                           encode="form")
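
One way to avoid editing txtPageNumber by hand is to build the request body in a small function and call it once per page. The following is only a sketch under assumptions: `build_page_body` and the placeholder values in `hidden` are hypothetical names introduced here for illustration, and the field names are copied from the request above. In real use, the hidden values would come from `pgform$fields` instead of the placeholders.

```r
# Hypothetical helper (not part of rvest): builds the POST body for one results page.
# `hidden` is a named list holding the hidden ASP.NET field values read from the form.
build_page_body <- function(hidden, page_number, page_size = "10") {
  prefix <- "ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$"
  # The two paging controls share a long prefix, so their names are built with paste0().
  paging <- setNames(
    list(page_number, page_size),
    paste0(prefix, c("txtPageNumber", "ddlPageSize"))
  )
  fixed <- list(
    `__VIEWSTATE`          = hidden[["__VIEWSTATE"]],
    `__VIEWSTATEGENERATOR` = hidden[["__VIEWSTATEGENERATOR"]],
    `__VIEWSTATEENCRYPTED` = hidden[["__VIEWSTATEENCRYPTED"]],
    `__EVENTVALIDATION`    = hidden[["__EVENTVALIDATION"]],
    `__EVENTTARGET`        = "ctl00$cphContent$gvwPreparations$ctl02$ctl00",
    `__EVENTARGUMENT`      = ""
  )
  c(paging, fixed)
}

# Placeholder values for illustration only; they are not real tokens from the site.
hidden <- list(
  `__VIEWSTATE`          = "vs",
  `__VIEWSTATEGENERATOR` = "gen",
  `__VIEWSTATEENCRYPTED` = "",
  `__EVENTVALIDATION`    = "ev"
)

# One body per page; each element could be passed as `body =` to the POST call.
bodies <- lapply(1:3, function(p) build_page_body(hidden, p))
```

Each element of `bodies` differs only in the txtPageNumber entry, so looping over it and sending one POST per element replaces the manual edit.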

The following code works for me:

library(rvest)
library(dplyr)


url <- "http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue="
pgsession<-html_session(url)
pgform<-html_form(pgsession)[[1]]

page<-rvest:::request_POST(pgsession,url,
                           body=list(
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=3,
                             `__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
                             `__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
                             `__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
                             `__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
                             `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
                             `__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
                             `__EVENTARGUMENT`=""

                           ),
                           encode="form")
# makes a table of all results of the requested page (here page 3)

read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
  html_table(fill=TRUE) %>% 
  bind_rows %>%
  tibble()

# A tibble: 11 x 1
   .$``  $Präparat $`Galen. Form /~ $Packung $FAP  $PP   $SB   $`Lim-Pkt` $Lim 
   <chr> <chr>     <chr>            <chr>    <chr> <chr> <chr> <chr>      <chr>
 1 21.   Accolate  Tabl 20 mg       60 Stk   29.75 50.55 ""    ""         ""   
 2 22.   Accupaque Inj Lös 300 mg   Plast F~ 32.00 53.10 ""    ""         ""   
 3 23.   Accupaque Inj Lös 300 mg   Plast F~ 61.15 86.60 ""    ""         ""   
 4 24.   Accupaque Inj Lös 300 mg   Plast F~ 120.~ 154.~ ""    ""         ""   
 5 25.   Accupaque Inj Lös 350 mg   Plast F~ 33.97 55.35 ""    ""         ""   
 6 26.   Accupaque Inj Lös 350 mg   Plast F~ 66.88 93.20 ""    ""         ""   
 7 27.   Accupaque Inj Lös 350 mg   Plast F~ 129.~ 164.~ ""    ""         ""   
 8 28.   Accupro ~ Filmtabl 10 mg   30 Stk   8.56  18.00 ""    ""         ""   
 9 29.   Accupro ~ Filmtabl 10 mg   100 Stk  26.60 46.90 ""    ""         ""   
10 30.   Accupro ~ Filmtabl 20 mg   30 Stk   14.02 28.35 ""    ""         ""   
11 "Ein~ "Einträg~ "Einträge pro S~ "Einträ~ "Ein~ "Ein~ "Ein~ "Einträge~ "Ein~
# ... with 9 more variables: $`Swissmedic-Code` <chr>, $Zulassungsinhaberin <chr>,
#   $Wirkstoff <chr>, $`BAG-Dossier` <chr>, $Aufnahme <chr>, $`Befr. AufnahmeBefr.
#   Limitation` <chr>, $`O/G` <chr>, $`IT-Code` <chr>, $`ATC-Code` <chr>

# gives the desired information of the first drug (not yet very structured)

read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
  html_text %>%
  head(10)


[1] " PräparatGalen. Form / DosierungPackungFAPPPSBLim-PktLimSwissmedic-CodeZulassungsinhaberinWirkstoffBAG-DossierAufnahmeBefr. AufnahmeBefr. LimitationO/GIT-CodeATC-Code\r\n\t\t\t\t\r\n                        21.\r\n                    \r\n                        Accolate\r\n                    \r\n                        Tabl 20 mg \r\n                    \r\n                        60 Stk\r\n                    \r\n                        29.75\r\n                    \r\n                        50.55\r\n                    \r\n                                                \r\n                    \r\n                        \r\n                    \r\n                      \r\n                    \r\n                        53750036\r\n                    \r\n                        AstraZeneca AG\r\n                    \r\n                        Zafirlukastum\r\n                    \r\n                        17053\r\n                    \r\n                        15.03.1998\r\n                    \r\n                        \r\n                        \r\n                    \r\n                        \r\n                    \r\n                        03.04.50.\r\n                    \r\n                        R03DC01\r\n                    \r\n\t\t\t\t\r\n                        22.\r\n                    \r\n                        Accupaque\r\n                    \r\n
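
On the question about __VIEWSTATE: ASP.NET WebForms pages emit their hidden fields (such as __VIEWSTATE and __EVENTVALIDATION) with every response, so if the values read once from the first page turn out to be stale, they could be re-read from the most recent response before the next POST. Whether that is needed for this site is an assumption, not something confirmed above. The sketch below uses xml2 on a made-up miniature page; `read_hidden_fields` is a hypothetical helper name.

```r
library(xml2)

# Hypothetical helper: collects all hidden <input> fields of a page into a named list.
read_hidden_fields <- function(doc) {
  nodes <- xml_find_all(doc, "//input[@type='hidden']")
  values <- xml_attr(nodes, "value")
  values[is.na(values)] <- ""   # inputs without a value attribute become empty strings
  as.list(setNames(values, xml_attr(nodes, "name")))
}

# Made-up miniature page for illustration; this is not the real site's markup.
demo <- read_html('<html><body><form>
  <input type="hidden" name="__VIEWSTATE" value="abc123">
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789">
  <input type="text" name="q" value="ignored">
</form></body></html>')

hidden <- read_hidden_fields(demo)
names(hidden)
```

With the real site, `doc` would be the parsed response of the previous request, for example `read_html(page)`.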

Regarding "R: Scraping additional data after POST only works for first page", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56068532/
