
r - Scraping from an aspx website using R

Reposted. Author: 行者123. Updated: 2023-12-04 15:00:27

I am trying to use R to scrape data from a website.

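  • I would like to go through each link on the following page:
    http://capitol.hawaii.gov/advreports/advreport.aspx?year=2013&report=deadline&rpt_type=&measuretype=hb&title=House Bills
  • Select only the items whose Current Status shows "Transmitted to Governor". For example, http://capitol.hawaii.gov/measure_indiv.aspx?billtype=HB&billnumber=17&year=2013
  • Then scrape the cells within STATUS TEXT that contain the clause "Passed Final Reading". For example: Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0).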

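    I have tried adapting earlier examples that use the RCurl and XML packages (in R), but I don't know how to apply them correctly to aspx sites. So what I would like is:

      1. Some suggestions on how to structure such code.
      2. Advice on how to learn what is needed to carry out this kind of task.

    Thanks for your help,

    Tom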

    Best answer

    library(httr)
    library(XML)

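    basePage <- "http://capitol.hawaii.gov"

    # one handle reused for every request, so cookies persist across calls
    h <- handle(basePage)

    # initial request to the base page
    GET(handle = h)

    # fetch the report page that lists the House bills
    res <- GET(handle = h, path = "/advreports/advreport.aspx?year=2013&report=deadline&rpt_type=&measuretype=hb&title=House")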

    # parse content for "Transmitted to Governor" text
    resXML <- htmlParse(content(res, as = "text"))
    # the third column of the report table is tested for the status text
    resTable <- getNodeSet(resXML, '//*/table[@id="GridViewReports"]/tr/td[3]')
    appRows <- sapply(resTable, xmlValue)
    include <- grepl("Transmitted to Governor", appRows)
    # the second column supplies the link to each bill page
    resUrls <- xpathSApply(resXML, '//*/table[@id="GridViewReports"]/tr/td[2]//@href')

    # keep only the links whose status matched
    appUrls <- resUrls[include]

    # look at just the first matching bill page
    res <- GET(handle = h, path = appUrls[1])
    resXML <- htmlParse(content(res, as = "text"))

    # return the text of every element that contains "Passed Final Reading"
    xpathSApply(resXML, '//*[text()[contains(.,"Passed Final Reading")]]', xmlValue)

    [1] "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan,
    Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro,
    Tokioka voting no (4) and none excused (0)."
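The two XPath steps above can be tried without touching the network. The following is only a minimal sketch on hand-written HTML: the table id `GridViewReports` and the column positions are taken from the answer's code, while the two toy pages (`listing_html`, `bill_html`) are assumptions made up for illustration, and the real site's markup may differ.

```r
library(XML)

# Hand-written stand-in for the report page (an assumption, not the real markup).
listing_html <- '
<html><body>
<table id="GridViewReports">
  <tr><td>1</td><td><a href="/measure_indiv.aspx?billtype=HB&amp;billnumber=17&amp;year=2013">HB17</a></td><td>Transmitted to Governor</td></tr>
  <tr><td>2</td><td><a href="/measure_indiv.aspx?billtype=HB&amp;billnumber=18&amp;year=2013">HB18</a></td><td>Some other status</td></tr>
</table>
</body></html>'

listing <- htmlParse(listing_html, asText = TRUE)

# Step 1: read the status column and the link column, then filter the links.
status <- xpathSApply(listing, '//table[@id="GridViewReports"]/tr/td[3]', xmlValue)
hrefs  <- xpathSApply(listing, '//table[@id="GridViewReports"]/tr/td[2]//@href')
keep   <- grepl("Transmitted to Governor", status)
kept_urls <- as.character(unname(unlist(hrefs)))[keep]

# Hand-written stand-in for a single bill page (also an assumption).
bill_html <- '
<html><body>
<table>
  <tr><td>Some other status line.</td></tr>
  <tr><td>Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0).</td></tr>
</table>
</body></html>'

bill <- htmlParse(bill_html, asText = TRUE)

# Step 2: select every element that has a direct text node containing the phrase.
final_reading <- xpathSApply(
  bill, '//*[text()[contains(.,"Passed Final Reading")]]', xmlValue
)

print(kept_urls)
print(final_reading)
```

Because `contains()` is applied to the element's own text nodes, only the cell holding the phrase is returned, not its enclosing row or table.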

    Let the httr package handle all the background work by setting up a handle.

    If you want to run over all 92 links:
    # get all the links returned as a list (will take some time)
    # print statement included for sanity
    res <- lapply(appUrls, function(x) {
      print(sprintf("Got url no. %d", which(appUrls %in% x)))
      GET(handle = h, path = x)
    })
    resXML <- lapply(res, function(x) htmlParse(content(x, as = "text")))
    appString <- sapply(resXML, function(x) {
      xpathSApply(x, '//*[text()[contains(.,"Passed Final Reading")]]', xmlValue)
    })


    head(appString)

    > head(appString)
    $href
    [1] "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0)."

    $href
    [1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none."
    [2] "Passed Final Reading as amended in CD 1 with Representative(s) Cullen, Har voting aye with reservations; Representative(s) McDermott voting no (1) and none excused (0)."

    $href
    [1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none."
    [2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; Representative(s) Hashem, McDermott voting no (2) and none excused (0)."

    $href
    [1] "Passed Final Reading, as amended (CD 1). 24 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 1 Excused: Ige."
    [2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; none voting no (0) and Representative(s) Say excused (1)."

    $href
    [1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none."
    [2] "Passed Final Reading as amended in CD 1 with Representative(s) Johanson voting aye with reservations; none voting no (0) and none excused (0)."

    $href
    [1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none."
    [2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; none voting no (0) and none excused (0)."
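As the output shows, the result is a list whose elements hold one or two strings each, and every element carries the same name, `href`. One possible way to make this easier to work with is to flatten it into a data frame with one row per matched string. The sketch below uses only base R; `flatten_matches` is a hypothetical helper, and `app_string` is made-up stand-in data with invented URLs rather than real scraped output.

```r
# Stand-in for `appString`: one character vector per bill page.
# The names and texts here are invented for illustration only.
app_string <- list(
  "/bill?id=1" = "Passed Final Reading as amended in SD 2.",
  "/bill?id=2" = c("Passed Final Reading, as amended (CD 1).",
                   "Passed Final Reading as amended in CD 1."),
  "/bill?id=3" = character(0)  # a page with no matching status line
)

# Hypothetical helper: one row per matched string, keyed by the list names.
flatten_matches <- function(x) {
  n <- lengths(x)  # number of matches found on each page
  data.frame(
    url      = rep(names(x), times = n),
    match_no = sequence(n),  # 1, 2, ... within each page
    text     = as.character(unlist(x, use.names = FALSE)),
    stringsAsFactors = FALSE
  )
}

matches <- flatten_matches(app_string)
print(matches)
```

With the real result, the names would first need to be replaced by the URLs, for example `flatten_matches(setNames(appString, appUrls))`, since otherwise every row would be labelled `href`.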

    Regarding "r - Scraping from an aspx website using R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/16826379/
