
Running rapply on a list of data frames

Reposted. Author: 行者123. Updated: 2023-12-04 11:41:14

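Following up on two rapply questions, here and here, from a few years back, it seems that rapply works only for simple classes (i.e. vectors, matrices) and not for the multi-faceted data.frame class.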

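In most cases, and as demonstrated below, the rapply equivalent is a nested lapply or one of its wrapper variants (vapply/sapply), where the number of nested calls matches the number of list levels. Below are my test scenarios comparing nested lapply with rapply across vector, matrix, and data frame types. The two approaches give equal results in every case except data frames.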

Question

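Is there a use case in base R for rapply() to recursively run an operation on a list of data frames and return a list of data frames, as it does for lists of vectors or matrices? If not, is this a bug, or should a warning be added to the ?rapply base R documentation? Most tutorials do not show rapply examples with data frames.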

One-dimensional (character vector)

The following shows that rapply is equivalent to a nested lapply when counting characters in simple character vectors, and that rapply is also noticeably faster:

library(microbenchmark)

ScriptLists <- list(R = list.files(path="/path/to/Scripts", pattern="\\.R"),
                    Python = list.files(path="/path/to/Scripts", pattern="\\.py"),
                    SQL = list.files(path="/path/to/Scripts", pattern="\\.sql"),
                    PHP = list.files(path="/path/to/Scripts", pattern="\\.php"),
                    XSLT = list.files(path="/path/to/Scripts", pattern="\\.xsl"))

microbenchmark(
  ScriptsLists1 <- lapply(ScriptLists, function(i){
    unname(vapply(i, function(x){
      nchar(x)
    }, numeric(1)))
  })
)
# Unit: microseconds
# min lq mean median uq max neval
# 384 408.782 524.1363 434.7675 678.016 886.377 100

microbenchmark(
  ScriptsLists2 <- rapply(ScriptLists, function(x){
    nchar(x)
  }, how="list")
)
# Unit: microseconds
# min lq mean median uq max neval
# 110.196 112.8425 131.6141 114.5265 123.91 352.722 100

all.equal(ScriptsLists1, ScriptsLists2)
# [1] TRUE
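
For readers without a script folder at hand, the same comparison can be sketched on literal character vectors. The file names below are made up for illustration and are not from the original post.

```r
# Hypothetical file names standing in for the list.files() results
script_lists <- list(
  R      = c("clean.R", "model.R"),
  Python = c("etl.py"),
  SQL    = c("schema.sql", "views.sql", "seed.sql")
)

# Nested apply: outer lapply over languages, inner vapply over file names
nested <- lapply(script_lists, function(files) {
  unname(vapply(files, nchar, integer(1)))
})

# rapply: f receives each leaf (a whole character vector) in a single call
recursive <- rapply(script_lists, nchar, how = "list")

all.equal(nested, recursive)
```

Because nchar is vectorised, rapply needs one call per leaf, while the nested version makes one call per file name. That is a plausible reason for the timing gap above, though the post itself does not investigate it.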

Two-dimensional types (matrix and data frame)

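The input data frame (pulled from Stack Overflow's top users ranked by yearly reputation) is used to build a list of top-user data frames by language tag (C#, Python, R, etc.).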

df <- structure(list(user = structure(c(12L, 14L, 19L, 35L, 22L, 32L, 
1L, 36L, 7L, 9L, 2L, 18L, 27L, 6L, 30L, 20L, 10L, 24L, 29L, 23L,
5L, 3L, 4L, 15L, 25L, 17L, 11L, 8L, 33L, 13L, 34L, 16L, 21L,
26L, 28L, 31L), .Label = c("akrun", "alecxe", "Alexey Mezenin",
"BalusC", "Barmar", "CommonsWare", "Darin Dimitrov", "dasblinkenlight",
"Eric Duminil", "Felix Kling", "Frank van Puffelen", "Gordon Linoff",
"Greg Hewgill", "Günter Zöchbauer", "GurV", "Hans Passant", "JB Nizet",
"Jean-François Fabre", "jezrael", "Jon Skeet", "Jonathan Leffler",
"Martijn Pieters", "Martin R", "matt", "Nina Scholz", "paxdiablo",
"piRSquared", "Pranav C Balan", "Psidom", "Quentin", "Suragch",
"T.J. Crowder", "Tim Biegeleisen", "unutbu", "VonC", "Wiktor Stribi?ew"
), class = "factor"), link = structure(c(2L, 17L, 21L, 31L, 1L,
10L, 27L, 28L, 22L, 33L, 35L, 34L, 20L, 3L, 15L, 19L, 18L, 25L,
29L, 4L, 8L, 5L, 11L, 32L, 6L, 30L, 16L, 24L, 13L, 36L, 14L,
12L, 9L, 7L, 23L, 26L), .Label = c("http://www.stackoverflow.com//users/100297/martijn-pieters",
"http://www.stackoverflow.com//users/1144035/gordon-linoff",
"http://www.stackoverflow.com//users/115145/commonsware", "http://www.stackoverflow.com//users/1187415/martin-r",
"http://www.stackoverflow.com//users/1227923/alexey-mezenin",
"http://www.stackoverflow.com//users/1447675/nina-scholz", "http://www.stackoverflow.com//users/14860/paxdiablo",
"http://www.stackoverflow.com//users/1491895/barmar", "http://www.stackoverflow.com//users/15168/jonathan-leffler",
"http://www.stackoverflow.com//users/157247/t-j-crowder", "http://www.stackoverflow.com//users/157882/balusc",
"http://www.stackoverflow.com//users/17034/hans-passant", "http://www.stackoverflow.com//users/1863229/tim-biegeleisen",
"http://www.stackoverflow.com//users/190597/unutbu", "http://www.stackoverflow.com//users/19068/quentin",
"http://www.stackoverflow.com//users/209103/frank-van-puffelen",
"http://www.stackoverflow.com//users/217408/g%c3%bcnter-z%c3%b6chbauer",
"http://www.stackoverflow.com//users/218196/felix-kling", "http://www.stackoverflow.com//users/22656/jon-skeet",
"http://www.stackoverflow.com//users/2336654/pirsquared", "http://www.stackoverflow.com//users/2901002/jezrael",
"http://www.stackoverflow.com//users/29407/darin-dimitrov", "http://www.stackoverflow.com//users/3037257/pranav-c-balan",
"http://www.stackoverflow.com//users/335858/dasblinkenlight",
"http://www.stackoverflow.com//users/341994/matt", "http://www.stackoverflow.com//users/3681880/suragch",
"http://www.stackoverflow.com//users/3732271/akrun", "http://www.stackoverflow.com//users/3832970/wiktor-stribi%c5%bcew",
"http://www.stackoverflow.com//users/4983450/psidom", "http://www.stackoverflow.com//users/571407/jb-nizet",
"http://www.stackoverflow.com//users/6309/vonc", "http://www.stackoverflow.com//users/6348498/gurv",
"http://www.stackoverflow.com//users/6419007/eric-duminil", "http://www.stackoverflow.com//users/6451573/jean-fran%c3%a7ois-fabre",
"http://www.stackoverflow.com//users/771848/alecxe", "http://www.stackoverflow.com//users/893/greg-hewgill"
), class = "factor"), location = structure(c(17L, 15L, 8L, 12L,
10L, 26L, 1L, 28L, 23L, 1L, 17L, 25L, 6L, 29L, 26L, 19L, 24L,
1L, 5L, 13L, 4L, 2L, 3L, 1L, 7L, 20L, 21L, 27L, 22L, 11L, 1L,
16L, 9L, 1L, 18L, 14L), .Label = c("", "??????", "Amsterdam, Netherlands",
"Arlington, MA", "Atlanta, GA, United States", "Bellevue, WA, United States",
"Berlin, Deutschland", "Bratislava, Slovakia", "California, USA",
"Cambridge, United Kingdom", "Christchurch, New Zealand", "France",
"Germany", "Hohhot, China", "Linz, Austria", "Madison, WI", "New York, United States",
"Ramanthali, Kannur, Kerala, India", "Reading, United Kingdom",
"Saint-Etienne, France", "San Francisco, CA", "Singapore", "Sofia, Bulgaria",
"Sunnyvale, CA", "Toulouse, France", "United Kingdom", "United States",
"Warsaw, Poland", "Who Wants to Know?"), class = "factor"), year_rep = structure(c(36L,
35L, 34L, 33L, 32L, 31L, 30L, 29L, 28L, 27L, 26L, 25L, 24L, 23L,
22L, 21L, 20L, 19L, 18L, 17L, 16L, 15L, 14L, 13L, 12L, 11L, 10L,
9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L), .Label = c("3,580", "3,604",
"3,636", "3,649", "3,688", "3,735", "3,796", "3,814", "3,886",
"3,920", "3,923", "3,950", "4,016", "4,046", "4,142", "4,179",
"4,195", "4,236", "4,313", "4,324", "4,348", "4,464", "4,475",
"4,482", "4,526", "4,723", "4,854", "4,936", "4,948", "5,188",
"5,258", "5,337", "5,577", "5,740", "5,835", "5,985"), class = "factor"),
total_rep = structure(c(18L, 2L, 34L, 27L, 22L, 20L, 5L,
3L, 31L, 1L, 6L, 9L, 13L, 25L, 21L, 36L, 14L, 4L, 11L, 7L,
8L, 10L, 30L, 29L, 24L, 15L, 35L, 17L, 33L, 23L, 12L, 28L,
16L, 19L, 26L, 32L), .Label = c("12,557", "154,439", "158,134",
"220,515", "229,553", "233,368", "269,380", "289,989", "30,027",
"31,602", "36,950", "401,595", "41,183", "411,535", "418,780",
"455,157", "475,813", "499,408", "507,043", "508,310", "509,365",
"525,176", "529,137", "61,135", "616,135", "64,476", "651,397",
"672,118", "7,932", "703,046", "709,683", "71,032", "77,211",
"83,237", "86,520", "921,690"), class = "factor"), tag1 = structure(c(15L,
2L, 10L, 6L, 11L, 8L, 12L, 13L, 4L, 14L, 11L, 11L, 10L, 1L,
8L, 4L, 8L, 16L, 11L, 16L, 8L, 9L, 7L, 15L, 8L, 7L, 5L, 4L,
15L, 6L, 11L, 4L, 3L, 3L, 8L, 16L), .Label = c("android",
"angular2", "c", "c#", "firebase", "git", "java", "javascript",
"laravel", "pandas", "python", "r", "regex", "ruby", "sql",
"swift"), class = "factor"), tag2 = structure(c(23L, 24L,
19L, 8L, 20L, 14L, 6L, 13L, 3L, 21L, 22L, 20L, 19L, 12L,
10L, 12L, 14L, 11L, 17L, 11L, 18L, 18L, 15L, 16L, 2L, 9L,
7L, 12L, 16L, 19L, 17L, 1L, 4L, 5L, 14L, 11L), .Label = c(".net",
"arrays", "asp.net-mvc", "bash", "c++", "dplyr", "firebase-database",
"github", "hibernate", "html", "ios", "java", "javascript",
"jquery", "jsf", "mysql", "pandas", "php", "python", "python-3.x",
"ruby-on-rails", "selenium", "sql-server", "typescript"), class = "factor"),
tag3 = structure(c(20L, 17L, 11L, 12L, 24L, 15L, 11L, 8L,
5L, 4L, 23L, 24L, 11L, 3L, 10L, 1L, 6L, 31L, 25L, 28L, 18L,
19L, 26L, 27L, 22L, 16L, 2L, 9L, 15L, 13L, 21L, 30L, 29L,
7L, 14L, 2L), .Label = c(".net", "android", "android-intent",
"arrays", "asp.net-mvc-3", "asynchronous", "bash", "c#",
"c++", "css", "dataframe", "docker", "git-pull", "html",
"java", "java-8", "javascript", "jquery", "laravel-5.3",
"mysql", "numpy", "object", "protractor", "python-2.7", "r",
"servlets", "sql-server", "swift3", "unix", "winforms", "xcode"
), class = "factor")), .Names = c("user", "link", "location",
"year_rep", "total_rep", "tag1", "tag2", "tag3"), class = "data.frame", row.names = c(NA,
-36L))

R code

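The following approaches average the year_rep and total_rep columns for either type, matrix or data frame. These were columns 5 and 6 in the data used for the original run, which evidently carried an extra leading X column, visible in the output further below. Be sure to change the return statement in the setup block, swapping in the commented line to switch types. Note that rapply() matches the nested lapply for the matrix return, but not for the data frame return.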

# NESTED LIST SETUP ------------------------------------
LangLists <- list(`c#`=list(), python=list(), sql=list(), php=list(), r=list(),
                  java=list(), javascript=list(), ruby=list(), `c++`=list())

LangLists <- setNames(mapply(function(i, j){

  df <- subset(df, tag1 == j | tag2 == j | tag3 == j)
  df$year_rep <- as.numeric(as.character(gsub(",", "", df$year_rep)))
  df$total_rep <- as.numeric(as.character(gsub(",", "", df$total_rep)))

  return(list(as.matrix(df)))    # MATRIX TYPE
  # return(list(df))             # DF TYPE

}, LangLists, names(LangLists), SIMPLIFY=FALSE), names(LangLists))
# -----------------------------------------------------

# MATRIX RETURN
LangLists1 <- lapply(LangLists, function(i){
  lapply(i, function(df){
    # select by name: positions 5/6 only hold for data with a leading X column
    cbind(mean(as.numeric(df[, "year_rep"])),
          mean(as.numeric(df[, "total_rep"])))
  })
})

LangLists2 <- rapply(LangLists, function(i){
  cbind(mean(as.numeric(i[, "year_rep"])),
        mean(as.numeric(i[, "total_rep"])))
}, classes="matrix", how="list")

all.equal(LangLists1, LangLists2)
# [1] TRUE


# DATA FRAME RETURN
LangLists1 <- lapply(LangLists, function(i){
  lapply(i, function(df){
    data.frame(year_rep=mean(df$year_rep),
               total_rep=mean(df$total_rep))
  })
})

LangLists2 <- rapply(LangLists, function(i){
  data.frame(year_rep=mean(i$year_rep),
             total_rep=mean(i$total_rep))
}, classes="data.frame", how="list")

all.equal(LangLists1, LangLists2)

# [1] "Component “c#”: Component 1: Names: 2 string mismatches"
# [2] "Component “c#”: Component 1: Attributes: < names for target but not for current >"
# [3] "Component “c#”: Component 1: Attributes: < Length mismatch: comparison on first 0 components >"
# [4] "Component “c#”: Component 1: Length mismatch: comparison on first 2 components"
# [5] "Component “c#”: Component 1: Component 1: Modes: numeric, NULL"
...

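In fact, while the nested lapply still returns a list of complete two-column data frames holding the rep averages, rapply on data frames turns each underlying data frame into a list of NULLs. So why is rapply unable to return a list of data frames, as it does for vectors and matrices?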

# $`c#`
# $`c#`[[1]]
# $`c#`[[1]]$X
# NULL

# $`c#`[[1]]$user
# NULL

# $`c#`[[1]]$link
# NULL

# $`c#`[[1]]$location
# NULL

# $`c#`[[1]]$year_rep
# NULL

# $`c#`[[1]]$total_rep
# NULL

# $`c#`[[1]]$tag1
# NULL

# $`c#`[[1]]$tag2
# NULL

# $`c#`[[1]]$tag3
# NULL

# $python
# $python[[1]]
# $python[[1]]$X
# NULL

# $python[[1]]$user
# NULL

# $python[[1]]$link
# NULL

# $python[[1]]$location
# NULL

# $python[[1]]$year_rep
# NULL

# $python[[1]]$total_rep
# NULL

# $python[[1]]$tag1
# NULL

# $python[[1]]$tag2
# NULL

# $python[[1]]$tag3
# NULL

Best answer

rapply does not appear to be designed to handle lists of data frames.

The Details section of ?rapply says that if

how = "list" or how = "unlist", the list is copied, all non-list elements which have a class included in classes are replaced by the result of applying f to the element and all others are replaced by deflt.

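Since data.frames are lists, they do not fall into the first category. rapply recurses into them as it would into any other list, and their columns, none of which has class data.frame, land in the "all others" catch-all and are replaced by deflt, whose default value is NULL. This explains the result of the last line of code in the question.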
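
The claim that data frames are themselves lists can be checked directly, and it suggests one way to still use rapply: match on the class of the columns rather than on data.frame. The snippet below is a sketch on made-up data, not code from the original answer. The choice of NA_real_ for deflt is only there to make skipped columns visible.

```r
# Toy nested list (made up) with one double column and one non-numeric column
lang_lists <- list(
  r = list(data.frame(year_rep = c(10, 20), tag = c("r", "dplyr")))
)

# A data frame is a list of columns, so rapply recurses into it
is.list(lang_lists$r[[1]])

# Match columns rather than data frames: average the numeric columns and
# let deflt mark every column that was skipped
col_means <- rapply(lang_lists, mean, classes = "numeric",
                    deflt = NA_real_, how = "list")

col_means$r[[1]]$year_rep  # mean of c(10, 20)
col_means$r[[1]]$tag       # the deflt value
```

The result is a nested list of per-column values rather than a list of data frames, so this only covers column-wise summaries.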

As for the remaining alternative for the how argument, "replace", it appears that data.frames are simply ignored in this "mode":

If how = "replace", each element of the list which is not itself a list and has a class included in classes is replaced by the result of applying f to the element.

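No mention is made of elements that are themselves lists, and running the code above with how="replace" appears to return a nested list in which the data.frames are now simple lists. So it seems that rapply has gone through them and stripped the class attribute.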
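
If a list of data frames is what is needed, one option besides the nested lapply is a small recursive function that stops at data frames instead of descending into them. The helper below, apply_to_dfs, is hypothetical (it is not part of base R or of the original answer), and the data are made up.

```r
# apply_to_dfs is a hypothetical helper, not a base R function
apply_to_dfs <- function(x, f, ...) {
  if (is.data.frame(x)) {
    f(x, ...)                                         # data frame: treat as a leaf
  } else if (is.list(x)) {
    lapply(x, function(el) apply_to_dfs(el, f, ...))  # plain list: recurse
  } else {
    x                                                 # anything else: leave as is
  }
}

lang_lists <- list(
  r      = list(data.frame(year_rep = c(10, 20), total_rep = c(100, 300))),
  python = list(data.frame(year_rep = c(30, 50), total_rep = c(200, 400)))
)

result <- apply_to_dfs(lang_lists, function(d) {
  data.frame(year_rep = mean(d$year_rep), total_rep = mean(d$total_rep))
})

str(result)
```

The is.data.frame test has to come before the is.list test, because a data frame passes both. Unlike the nested lapply, the helper does not need to know how deep the list is.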

The original question about running rapply on a list of data frames can be found on Stack Overflow: https://stackoverflow.com/questions/41813353/
