R: Scraping additional data after POST only works for the first page

Reposted. Author: 行者123. Updated: 2023-12-04 12:37:23

For a university research project, I would like to scrape the drug information published by the Swiss government at:

http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue=

The site does provide a robots.txt file. However, its content is freely available to the public, and I assume that scraping this data is not prohibited.

This is an update of this question, as I have made some progress.

What I have achieved so far

# opens the first results page 
# opens the first link as a table at the end of the page

library("rvest")
library("dplyr")


url <- "http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue="
pgsession<-html_session(url)
pgform<-html_form(pgsession)[[1]]

page<-rvest:::request_POST(pgsession,url,
body=list(
`ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=1,
`__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
`__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
`__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
`__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
`ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
`__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
`__EVENTARGUMENT`=""

),
encode="form")

Next: get the basic data
# makes a table of all results of the first page

read_html(page) %>%
html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
html_table(fill=TRUE) %>%
bind_rows %>%
tibble()

Next: get the additional data
# gives the desired information (= additional data) of the first drug (not yet very structured)

read_html(page) %>%
html_nodes(xpath = '//*[@id="ctl00_cphContent_fvwPreparation"]') %>%
html_text

My problem:
# if I open the second search page

page<-rvest:::request_POST(pgsession,url,
body=list(
`ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=2,
`__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
`__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
`__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
`__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
`ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
`__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
`__EVENTARGUMENT`=""

),
encode="form")

Next: get the new basic data
# I easily get a table with the new results

read_html(page) %>%
html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
html_table(fill=TRUE) %>%
bind_rows %>%
tibble()

However, if I try to get the new additional data, I get the results from page 1 again:
# does not give the desired output:

read_html(page) %>%
html_nodes(xpath = '//*[@id="ctl00_cphContent_fvwPreparation"]') %>%
html_text

What I am looking for: the detailed information of the first drug on page 2.

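Questions:
  • Why do I get duplicated results? Is it because __VIEWSTATE possibly
    changes during request_POST?
  • Is there a way to solve this?
  • Is there a better way to get the basic and the additional data? If so, how?

Best answer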

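    I think you are overthinking this. The issue is with the xpath. Essentially, the xpath you use for data extraction is the same for every page, namely //*[@id="ctl00_cphContent_gvwPreparations"]. The only component of your code that changes is txtPageNumber. In the code below I set it to 3, i.e. txtPageNumber=3.

    I suggest you focus on a different question: how can the page number be automated for data extraction? That way you would not have to change txtPageNumber by hand.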

    page<-rvest:::request_POST(pgsession,url,
    body=list(
    `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=3,
    `__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
    `__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
    `__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
    `__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
    `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
    `__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
    `__EVENTARGUMENT`=""

    ),
    encode="form")
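
One way to automate the page number, as suggested above, is to generate the request body per page instead of editing it by hand. The following is only a minimal sketch in base R: `build_page_body` is a hypothetical helper (not part of rvest), `hidden_fields` stands for a named list of the hidden form values, and the values in `demo_fields` are placeholders. It assumes the field names from the question stay the same on every page, which has not been verified here.

```r
# Hypothetical helper (not part of rvest): builds the POST body for one
# results page from the hidden form values and a page number.
build_page_body <- function(hidden_fields, page_number, page_size = "10") {
  # Common prefix of the two pager fields used in the question
  pager <- "ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$"
  body <- list(
    `__VIEWSTATE`          = hidden_fields[["__VIEWSTATE"]],
    `__VIEWSTATEGENERATOR` = hidden_fields[["__VIEWSTATEGENERATOR"]],
    `__VIEWSTATEENCRYPTED` = hidden_fields[["__VIEWSTATEENCRYPTED"]],
    `__EVENTVALIDATION`    = hidden_fields[["__EVENTVALIDATION"]],
    `__EVENTTARGET`        = "ctl00$cphContent$gvwPreparations$ctl02$ctl00",
    `__EVENTARGUMENT`      = ""
  )
  body[[paste0(pager, "txtPageNumber")]] <- page_number
  body[[paste0(pager, "ddlPageSize")]]   <- page_size
  body
}

# Placeholder values, for illustration only (assumption: real values would
# come from the form of the live session)
demo_fields <- list(
  `__VIEWSTATE`          = "placeholder-viewstate",
  `__VIEWSTATEGENERATOR` = "placeholder-generator",
  `__VIEWSTATEENCRYPTED` = "",
  `__EVENTVALIDATION`    = "placeholder-validation"
)

# One body per page, here for pages 1 to 3
bodies <- lapply(1:3, function(n) build_page_body(demo_fields, n))
```

With the session from the question, `hidden_fields` could presumably be taken from `pgform$fields` (the question reads them as `pgform$fields$...$value`), and each element of `bodies` passed as the `body` argument of the POST call shown above.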

    The following code worked for me:
    library(rvest)
    library(dplyr)


    url <- "http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue="
    pgsession<-html_session(url)
    pgform<-html_form(pgsession)[[1]]

    page<-rvest:::request_POST(pgsession,url,
    body=list(
    `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber`=3,
    `__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
    `__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
    `__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
    `__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value,
    `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize`="10",
    `__EVENTTARGET`="ctl00$cphContent$gvwPreparations$ctl02$ctl00",
    `__EVENTARGUMENT`=""

    ),
    encode="form")
    # makes a table of all results of the requested page (here page 3)

    read_html(page) %>%
    html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
    html_table(fill=TRUE) %>%
    bind_rows %>%
    tibble()

    # A tibble: 11 x 1
    .$`` $Präparat $`Galen. Form /~ $Packung $FAP $PP $SB $`Lim-Pkt` $Lim
    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
    1 21. Accolate Tabl 20 mg 60 Stk 29.75 50.55 "" "" ""
    2 22. Accupaque Inj Lös 300 mg Plast F~ 32.00 53.10 "" "" ""
    3 23. Accupaque Inj Lös 300 mg Plast F~ 61.15 86.60 "" "" ""
    4 24. Accupaque Inj Lös 300 mg Plast F~ 120.~ 154.~ "" "" ""
    5 25. Accupaque Inj Lös 350 mg Plast F~ 33.97 55.35 "" "" ""
    6 26. Accupaque Inj Lös 350 mg Plast F~ 66.88 93.20 "" "" ""
    7 27. Accupaque Inj Lös 350 mg Plast F~ 129.~ 164.~ "" "" ""
    8 28. Accupro ~ Filmtabl 10 mg 30 Stk 8.56 18.00 "" "" ""
    9 29. Accupro ~ Filmtabl 10 mg 100 Stk 26.60 46.90 "" "" ""
    10 30. Accupro ~ Filmtabl 20 mg 30 Stk 14.02 28.35 "" "" ""
    11 "Ein~ "Einträg~ "Einträge pro S~ "Einträ~ "Ein~ "Ein~ "Ein~ "Einträge~ "Ein~
    # ... with 9 more variables: $`Swissmedic-Code` <chr>, $Zulassungsinhaberin <chr>,
    # $Wirkstoff <chr>, $`BAG-Dossier` <chr>, $Aufnahme <chr>, $`Befr. AufnahmeBefr.
    # Limitation` <chr>, $`O/G` <chr>, $`IT-Code` <chr>, $`ATC-Code` <chr>

    # gives the desired information of the first drug (not yet very structured)

    read_html(page) %>%
    html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
    html_text %>%
    head(10)


    [1] " PräparatGalen. Form / DosierungPackungFAPPPSBLim-PktLimSwissmedic-CodeZulassungsinhaberinWirkstoffBAG-DossierAufnahmeBefr. AufnahmeBefr. LimitationO/GIT-CodeATC-Code\r\n\t\t\t\t\r\n 21.\r\n \r\n Accolate\r\n \r\n Tabl 20 mg \r\n \r\n 60 Stk\r\n \r\n 29.75\r\n \r\n 50.55\r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n 53750036\r\n \r\n AstraZeneca AG\r\n \r\n Zafirlukastum\r\n \r\n 17053\r\n \r\n 15.03.1998\r\n \r\n \r\n \r\n \r\n \r\n \r\n 03.04.50.\r\n \r\n R03DC01\r\n \r\n\t\t\t\t\r\n 22.\r\n \r\n Accupaque\r\n \r\n
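
The question also asks whether __VIEWSTATE might change between requests. The code above does not test that. If it turns out to matter, one option is to carry the hidden fields forward, so that each request uses the values returned by the previous response. The sketch below shows only that control flow: `crawl_pages`, `post_page` and `read_fields` are hypothetical names, and the two stand-in callbacks make no HTTP request. In practice they would wrap the POST call and the form parsing from the question.

```r
# Sketch only: post_page and read_fields are hypothetical callbacks supplied
# by the caller; nothing here performs a real HTTP request.
crawl_pages <- function(page_numbers, initial_fields, post_page, read_fields) {
  fields <- initial_fields
  responses <- vector("list", length(page_numbers))
  for (i in seq_along(page_numbers)) {
    # Send the request using the most recent hidden fields
    responses[[i]] <- post_page(fields, page_numbers[[i]])
    # Refresh the hidden fields from the response just received
    fields <- read_fields(responses[[i]])
  }
  responses
}

# Stand-in callbacks that only record which state was sent
fake_post <- function(fields, n) {
  list(page = n,
       sent_state = fields$state,
       new_state = paste0("state-after-", n))
}
fake_read <- function(response) list(state = response$new_state)

demo <- crawl_pages(1:3, list(state = "initial"), fake_post, fake_read)
```

In `demo`, each simulated request after the first is sent with the state produced by the request before it, which is the behaviour a real `read_fields` would need to reproduce from the returned HTML.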

    Source: a similar question on Stack Overflow, "R: Scraping additional data after POST only works for the first page": https://stackoverflow.com/questions/56068532/
