r - How do I get the links in an html_table using rvest?

Reposted. Author: 行者123. Updated: 2023-12-01 14:00:55

library("rvest")
url <- "myurl.com"  # placeholder URL from the question
tables <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="pageContainer"]/table[1]') %>%
  html_table(fill = TRUE)
tables[[1]]

The HTML content of a cell looks like this:
<td><a href="http://somelink.com" target="_blank">Click Here</a></td>

But in the scraped result, I only get:

Click Here

Best answer

You can handle this by editing rvest::html_table with trace.

Example of the existing behavior:

library(rvest)
x <- "https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture" %>%
  read_html() %>%
  html_nodes("#mw-content-text > table:nth-child(55)")

html_table(x)
#[[1]]
#   Film                        Production company(s)    Producer(s)
#1  The Great Ziegfeld          Metro-Goldwyn-Mayer      Hunt Stromberg
#2  Anthony Adverse             Warner Bros.             Henry Blanke
#3  Dodsworth                   Goldwyn, United Artists  Samuel Goldwyn and Merritt Hulbert
#4  Libeled Lady                Metro-Goldwyn-Mayer      Lawrence Weingarten
#5  Mr. Deeds Goes to Town      Columbia                 Frank Capra
#6  Romeo and Juliet            Metro-Goldwyn-Mayer      Irving Thalberg
#7  San Francisco               Metro-Goldwyn-Mayer      John Emerson and Bernard H. Hyman
#8  The Story of Louis Pasteur  Warner Bros.             Henry Blanke
#9  A Tale of Two Cities        Metro-Goldwyn-Mayer      David O. Selznick
#10 Three Smart Girls           Universal                Joe Pasternak and Charles R. Rogers

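html_table essentially extracts the cells of the HTML table and runs html_text on them. All we need to do is replace that step: extract the <a> tag from each cell and run html_attr(., "href") on it instead.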
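# Patch step 14 of html_table.xml_node: replace each cell's text with the
# href of its first <a> (NA when the cell has no link).
# Note: `at = 14` indexes into the function body, so it depends on the rvest version.
trace(rvest:::html_table.xml_node, quote({
  values <- lapply(lapply(cells, html_node, "a"), html_attr, name = "href")
  values[[1]] <- html_text(cells[[1]])  # keep the first (header) row as text
}), at = 14)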

New behavior:
html_table(x)
#Tracing html_table.xml_node(X[[i]], ...) step 14
#[[1]]
#   Film                                    Production company(s)  Producer(s)
#1  /wiki/The_Great_Ziegfeld                NA                     /wiki/Hunt_Stromberg
#2  /wiki/Anthony_Adverse                   NA                     /wiki/Henry_Blanke
#3  /wiki/Dodsworth_(film)                  NA                     /wiki/Samuel_Goldwyn
#4  /wiki/Libeled_Lady                      NA                     /wiki/Lawrence_Weingarten
#5  /wiki/Mr._Deeds_Goes_to_Town            NA                     /wiki/Frank_Capra
#6  /wiki/Romeo_and_Juliet_(1936_film)      NA                     /wiki/Irving_Thalberg
#7  /wiki/San_Francisco_(1936_film)         NA                     /wiki/John_Emerson_(filmmaker)
#8  /wiki/The_Story_of_Louis_Pasteur        NA                     /wiki/Henry_Blanke
#9  /wiki/A_Tale_of_Two_Cities_(1935_film)  NA                     /wiki/David_O._Selznick
#10 /wiki/Three_Smart_Girls                 NA                     /wiki/Joe_Pasternak
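
The same idea can also be sketched without patching rvest internals, by walking the rows and cells yourself. This is only a minimal illustration and not part of the original answer: the inline HTML, the column names, and the variable names (page, tbl, link_df) are assumptions made up so the snippet runs without a network connection.

```r
library(rvest)

# Hypothetical inline page standing in for the real one (an assumption).
page <- read_html('
  <table>
    <tr><th>Name</th><th>Link</th></tr>
    <tr><td>First</td><td><a href="http://somelink.com" target="_blank">Click Here</a></td></tr>
    <tr><td>Second</td><td>No link</td></tr>
  </table>')

tbl <- html_node(page, "table")

# Text view of the table, as html_table() returns it; used here for column names.
text_df <- html_table(tbl)

# For every body row, take its <td> cells and pull the href of the first <a>
# in each cell; cells without a link give NA.
rows <- html_nodes(tbl, "tr")[-1]  # drop the header row
link_rows <- lapply(rows, function(row) {
  cells <- html_nodes(row, "td")
  html_attr(html_node(cells, "a"), "href")
})

# Bind the per-row character vectors into a data frame of links.
link_df <- as.data.frame(do.call(rbind, link_rows), stringsAsFactors = FALSE)
names(link_df) <- names(text_df)
link_df
```

Because this only uses html_nodes, html_node and html_attr, it does not depend on a step number inside html_table, at the cost of not handling rowspan or colspan cells.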

Regarding "r - How do I get the links in an html_table using rvest?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42119851/
