
r - Extracting meta descriptions from web pages using R

Reposted. Author: 行者123. Updated: 2023-12-04 12:33:03

Hello, I am trying to retrieve the meta descriptions of these web pages from their page source:

Data<-data.frame(Pages=c(
"http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html",
"http://boingboing.net/2016/06/16/omg-the-japanese-trump-commer.html",
"http://boingboing.net/2016/06/16/omar-mateen-posted-to-facebook.html"))

Desired output:

Data$Meta_Description<-data.frame(Extracted=c(
"Sanford Wallace gets 2.5 years in prison for 27 million Facebook",
"OMG, this Japanese Trump Commercial is everything",
"Omar Mateen posted to Facebook during Orlando mass shooting"))

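I tried to do this with httr, but I can't get the result into the desired output format, and I can't work out how to extract the description from the content that GET retrieves: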

library (httr)
resp<-GET ("http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html")
str(resp)
List of 10
$ url : chr "http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html"
$ status_code: int 200
$ headers :List of 22
..$ server : chr "Apache/2.2"

The field I need to extract from the page source comes right after this string:

<meta itemprop="description" content="

Like this:

<meta itemprop="description" content="&#039;Spam King&#039; 
Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"

Best answer

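You really only need rvest. Since the descriptions you want are all the pages' <h1> headlines, you can iterate over the list of URLs and select the headline from each: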

library(rvest)

sapply(Data$Pages, function(url){
    url %>%
        as.character() %>%    # in case strings are stored as factors
        read_html() %>%
        html_nodes('h1') %>%
        html_text()
})

# [1] "'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"
# [2] "OMG, this Japanese Trump Commercial is everything"
# [3] "Omar Mateen posted to Facebook during Orlando mass shooting"
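
If the goal is the Meta_Description column from the question, a type-stable variant of the loop above may be easier to work with. The following is a minimal sketch, not part of the original answer: `first_h1` and `Data_demo` are hypothetical names, and the inline HTML strings are stand-ins for the real pages so the example runs without network access. It assumes `html_node()` (singular), which yields one result per page and `NA` when no <h1> exists.

```r
library(rvest)

# Hypothetical helper: returns exactly one string per page,
# or NA when the page has no <h1>.
first_h1 <- function(src) {
    src %>%
        as.character() %>%    # in case strings are stored as factors
        read_html() %>%       # accepts a URL or a literal HTML string
        html_node("h1") %>%   # first match only; a missing node gives NA below
        html_text()
}

# Inline HTML stand-ins (assumption); with the real data these would be URLs.
Data_demo <- data.frame(
    Pages = c("<html><body><h1>First headline</h1></body></html>",
              "<html><body><p>No headline here</p></body></html>"),
    stringsAsFactors = FALSE
)

# vapply() enforces one character value per page, so the result
# fits directly into a data frame column.
Data_demo$Meta_Description <- vapply(Data_demo$Pages, first_h1,
                                     character(1), USE.NAMES = FALSE)
```

With the real data, the same call on `Data$Pages` should fill the column in place, under the assumption that each page is reachable and parses as HTML.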

Or, if you really want to scrape the <meta> tag, you can do it the same way, though the selector is more painful:

sapply(Data$Pages, function(url){
    url %>%
        as.character() %>%
        read_html() %>%
        html_nodes(xpath = '//meta[@itemprop="description"]') %>%
        html_attr('content')
})

Either way, you get the same result.
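
To check the two selectors against each other without hitting the site, both can be run on a small inline document. This is only a sketch: the markup below is an assumption modeled on the snippet quoted in the question, not the live page, and the variable names are illustrative.

```r
library(rvest)

# Inline stand-in for one page (assumption, modeled on the question's snippet).
page <- read_html(
    "<html><head>
       <meta itemprop=\"description\" content=\"&#039;Spam King&#039; Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages\">
     </head><body>
       <h1>'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages</h1>
     </body></html>"
)

# Route 1: the visible headline.
from_h1 <- page %>%
    html_node("h1") %>%
    html_text()

# Route 2: the content attribute of the <meta> tag.
# The parser decodes the &#039; entities into plain apostrophes.
from_meta <- page %>%
    html_node(xpath = '//meta[@itemprop="description"]') %>%
    html_attr("content")

# Compare the two routes.
same_result <- identical(from_h1, from_meta)
```

If you already have an httr response, as in the question, `content(resp, as = "text", encoding = "UTF-8")` should give a string that `read_html()` accepts, so the same selectors apply to it.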

This post, "r - Extracting meta descriptions from web pages using R", is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/37871556/
