html - Web scraping HTML in R

Reposted. Author: 数据小太阳. Updated: 2023-10-29 03:02:42

I want to scrape http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm and get a list of the URLs it links to, like this:

[1] "P-Obama-Inaugural-Speech-Inauguration.htm"
[2] "E11-Barack-Obama-Election-Night-Victory-Speech-Grant-Park-Illinois-November-4-2008.htm"

Here is my code:

library(XML)

url = "http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm"
doc = htmlTreeParse(url, useInternalNodes = T)
url.list = xpathSApply(doc, "//a[contains(@href, 'htm')]")

The problem is that I want to unlist() url.list so that I can strsplit() it, but unlist() does not flatten it.

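Best answer

You need one more step. xpathSApply() returns the matched nodes themselves, so you still have to extract the href attribute from each one: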

library(XML)

url <- "http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm"
doc <- htmlTreeParse(url, useInternalNodes=TRUE)

url.list <- xpathSApply(doc, "//a[contains(@href, 'htm')]")
hrefs <- gsub("^/", "", sapply(url.list, xmlGetAttr, "href"))

head(hrefs, 6)

## [1] "P-Obama-Inaugural-Speech-Inauguration.htm"
## [2] "E11-Barack-Obama-Election-Night-Victory-Speech-Grant-Park-Illinois-November-4-2008.htm"
## [3] "E11-Barack-Obama-Election-Night-Victory-Speech-Grant-Park-Illinois-November-4-2008.htm"
## [4] "E-Barack-Obama-Speech-Manassas-Virgina-Last-Rally-2008-Election.htm"
## [5] "E10-Barack-Obama-The-American-Promise-Acceptance-Speech-at-the-Democratic-Convention-Mile-High-Stadium--Denver-Colorado-August-28-2008.htm"
## [6] "E10-Barack-Obama-The-American-Promise-Acceptance-Speech-at-the-Democratic-Convention-Mile-High-Stadium--Denver-Colorado-August-28-2008.htm"

free(doc)
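
The same idea can be tried without network access. The following is only a minimal sketch: the inline HTML and its file names are made up for illustration and stand in for the real page. It passes xmlGetAttr directly to xpathSApply(), which applies the function to every matched node, so a character vector comes back instead of a list of node objects.

```r
library(XML)

# Hypothetical stand-in for the real page: two .htm links and one unrelated link.
page <- '<html><body>
  <a href="/first-speech.htm">First</a>
  <a href="second-speech.htm">Second</a>
  <a href="mailto:someone@example.com">Contact</a>
</body></html>'

# asText = TRUE tells the parser that `page` is markup, not a file name or URL.
doc <- htmlParse(page, asText = TRUE)

# The extra arguments after the XPath expression are a function and its
# arguments; xpathSApply() calls xmlGetAttr(node, "href") on each match.
hrefs <- xpathSApply(doc, "//a[contains(@href, 'htm')]", xmlGetAttr, "href")

# Drop a leading slash, if any, so only the file name remains.
hrefs <- sub("^/", "", hrefs)

print(hrefs)
print(is.character(hrefs))

free(doc)
```

Because hrefs is already a character vector here, it can be handed to strsplit() without calling unlist() first.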

Update: the obligatory rvest + dplyr way:

library(rvest)
library(dplyr)

# read_html() is the replacement for html(), which rvest deprecated
speeches <- read_html("http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm")
speeches %>% html_nodes("a[href*=htm]") %>% html_attr("href") %>% head(6)

## same output as above
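
For comparison, here is a small offline sketch of the rvest pipeline. The inline HTML is again a made-up stand-in for the real page, and the sub() call is an added assumption: html_attr() returns each href exactly as written in the markup, so a leading slash has to be stripped separately if the bare file names are wanted.

```r
library(rvest)

# Hypothetical stand-in for the real page.
page <- '<html><body>
  <a href="/first-speech.htm">First</a>
  <a href="second-speech.htm">Second</a>
  <a href="mailto:someone@example.com">Contact</a>
</body></html>'

# read_html() accepts literal markup as well as a URL or file path.
links <- read_html(page) %>%
  html_nodes("a[href*=htm]") %>%   # CSS: <a> whose href contains "htm"
  html_attr("href")                # character vector, one entry per node

# Strip a leading slash so the values are bare file names.
links <- sub("^/", "", links)

print(links)
```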

This post on "html - Web scraping HTML in R" is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/22837441/
