r - Scraping a complex HTML table into a data.frame in R

Reposted. Author: 行者123. Updated: 2023-12-04 13:32:01

I am trying to load Wikipedia data on US Supreme Court Justices into R:

library(rvest)

html = html("http://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States")
judges = html_table(html_nodes(html, "table")[[2]])
head(judges[,2])

[1] "Wilson, JamesJames Wilson" "Jay, JohnJohn Jay†"
[3] "Cushing, WilliamWilliam Cushing" "Blair, JohnJohn Blair, Jr."
[5] "Rutledge, JohnJohn Rutledge" "Iredell, JamesJames Iredell"

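The problem is that the data is malformed. Instead of the name as it appears in the rendered HTML table ("James Wilson"), each name appears twice: once as "Lastname, Firstname" and then again as "Firstname Lastname".

The reason is that each <td> actually contains an invisible <span>: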
<td style="text-align:left;" class="">
<span style="display:none" class="">Wilson, James</span>
<a href="/wiki/James_Wilson" title="James Wilson">James Wilson</a>
</td>

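The same is true for the columns with numeric data. I guess this extra markup is needed for sorting the HTML table. However, it is not clear to me how to remove these spans when trying to create a data.frame from the table in R.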

Best answer

Maybe like this:

library(XML)
library(rvest)
html = html("http://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States")
judges = html_table(html_nodes(html, "table")[[2]])
head(judges[,2])
# [1] "Wilson, JamesJames Wilson" "Jay, JohnJohn Jay†" "Cushing, WilliamWilliam Cushing" "Blair, JohnJohn Blair, Jr."
# [5] "Rutledge, JohnJohn Rutledge" "Iredell, JamesJames Iredell"

removeNodes(getNodeSet(html, "//table/tr/td[2]/span"))
judges = html_table(html_nodes(html, "table")[[2]])
head(judges[,2])
# [1] "James Wilson" "John Jay†" "William Cushing" "John Blair, Jr." "John Rutledge" "James Iredell"
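
The removeNodes() call above comes from the XML package and operates on the parsed document in place. Newer rvest releases are built on xml2 rather than XML, so the same idea may need xml2 functions there. Below is a minimal sketch of that variant, not the original answer's code. It assumes an rvest version that re-exports read_html(), and it uses a small inline table (modeled on the <td> shown in the question) instead of the live Wikipedia page so that it runs offline. The names page and hidden are made up for illustration.

```r
library(rvest)
library(xml2)

# Inline stand-in for one row of the table; the cell structure is copied
# from the question, the surrounding table is an assumption.
page <- read_html('
<table>
  <tr><th>#</th><th>Judge</th></tr>
  <tr>
    <td>1</td>
    <td style="text-align:left;">
      <span style="display:none">Wilson, James</span>
      <a href="/wiki/James_Wilson" title="James Wilson">James Wilson</a>
    </td>
  </tr>
</table>')

# Select every span inside a table cell that is hidden via display:none.
hidden <- html_nodes(page, xpath = "//table//td/span[contains(@style, 'display:none')]")

# Remove those nodes from the document in place, before building the table.
xml_remove(hidden)

# Convert the cleaned table as before.
judges <- html_table(html_nodes(page, "table")[[1]])
judges$Judge
```

Matching on the style attribute instead of a column position (td[2]) removes the hidden sort keys from every column at once, which also covers the numeric columns mentioned in the question. It would miss spans hidden through a CSS class instead of an inline style.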

Regarding "r - Scraping a complex HTML table into a data.frame in R", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/27843659/
