
r - Looping rvest::follow_link() to scrape the HTML pages behind links

Reposted. Author: 行者123. Updated: 2023-12-01 10:42:31

How can I call rvest::follow_link() in a loop to scrape the pages that a set of links point to?

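Use case:

  • Identify all cast members of The Lego Movie
  • Follow the link for every cast member
  • For every cast member, grab a table of each movie (+ year)

  • The selectors I need are as follows: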
    library(rvest)
    # read_html() replaces html(), which is deprecated in rvest
    lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
    lego_movie <- lego_movie %>%
      html_nodes(".itemprop , .character a") %>%
      html_text()

    # follow cast links
    (".itemprop .itemprop")

    # grab tables of all movies and dates for each cast member
    (".year_column , b a")

    Expected output:
    castMember       movie     year
    Will Arnett      Lego      2017
    Will Arnett      BoJack    2014
    Will Arnett      Wander    2014
    ............
    Elizabeth Banks  Moonbeam  2015
    Elizabeth Banks  Wet Hot   2015
    ............
    Alison Brie      Get Hard  2015
    Alison Brie      GetaJob   2015
    .....etc.....

    Best answer

    Maybe something like this could work.

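    library(rvest)
    library(stringr)
    library(data.table)

    # Names of the cast members listed on the title page
    lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
    cast <- lego_movie %>%
      html_nodes("#titleCast .itemprop span") %>%
      html_text()
    cast

    # follow_link() navigates from a session, not from a parsed document.
    # In rvest >= 1.0.0 these are named session() and session_follow_link().
    s <- html_session("http://www.imdb.com/title/tt1490017/")

    cast_movies <- list()

    # Only the first three cast members, and at most ten entries for each
    for (i in cast[1:3]) {
      # A string argument follows the first link whose text contains it
      actorpage <- s %>% follow_link(i) %>% read_html()
      cast_movies[[i]]$movies <- actorpage %>%
        html_nodes("b a") %>% html_text() %>% head(10)
      cast_movies[[i]]$years <- actorpage %>%
        html_nodes("#filmography .year_column") %>% html_text() %>%
        head(10) %>% str_extract("[0-9]{4}")
      cast_movies[[i]]$name <- rep(i, length(cast_movies[[i]]$years))
    }

    cast_movies                       # nested list, one element per actor
    as.data.frame(cast_movies[[1]])   # a single actor as a data frame
    rbindlist(cast_movies)            # all actors stacked into one table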
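
The last step above, turning the per-actor list into the long table the question asks for, can be sketched offline in base R. This is only a minimal sketch under stated assumptions: `actor_rows()` is a hypothetical helper (it is not part of rvest or data.table), the sample values are copied from the question's expected output rather than scraped, and the whitespace around the years is made up to imitate raw cell text. It assumes that truncating the movie and year vectors to a common length is acceptable when the two selectors return different numbers of matches.

```r
# Hypothetical helper: build long-format rows for one actor from the
# movie titles and the raw text of the year cells.
actor_rows <- function(name, movies, year_text) {
  # First four-digit run in each cell; NA where no such run exists
  m <- regexpr("[0-9]{4}", year_text)
  years <- rep(NA_character_, length(year_text))
  years[m > 0] <- regmatches(year_text, m)

  # Guard against the two selectors returning different numbers of matches
  n <- min(length(movies), length(years))
  data.frame(
    castMember = rep(name, n),
    movie = movies[seq_len(n)],
    year = as.integer(years[seq_len(n)]),
    stringsAsFactors = FALSE
  )
}

# Sample input shaped like cast_movies; values come from the question's
# expected output, and the padding around the years is invented.
scraped <- list(
  "Will Arnett" = list(movies = c("Lego", "BoJack", "Wander"),
                       years = c(" 2017 ", " 2014 ", " 2014 ")),
  "Elizabeth Banks" = list(movies = c("Moonbeam", "Wet Hot"),
                           years = c(" 2015 ", " 2015 ")),
  "Alison Brie" = list(movies = c("Get Hard", "GetaJob"),
                       years = c(" 2015 ", " 2015 "))
)

# One data frame per actor, then stack them
cast_table <- do.call(
  rbind,
  Map(function(nm, x) actor_rows(nm, x$movies, x$years),
      names(scraped), scraped)
)
rownames(cast_table) <- NULL
print(cast_table)
```

Building one data frame per actor before binding keeps each name attached to its own rows, so a failed or partial page for one actor would not shift the rows of another.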

    Regarding "r - Looping rvest::follow_link() to scrape the HTML pages behind links", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/28863775/
