
R: data.table on subsets excluded by value

Reposted. Author: 行者123. Updated: 2023-12-04 14:46:19

Using data.table in R, I am trying to operate on subsets that exclude a selected element. I am using the by operator, but I don't know whether this is the right approach.

Here is an example. The value for Delta in IAH:SNA is (3+3)/2, which is the mean of Stops in IAH:SNA once Delta has been excluded.

library(data.table)
s1 <- "Market Carrier Stops
IAH:SNA Delta 1
IAH:SNA Delta 1
IAH:SNA Southwest 3
IAH:SNA Southwest 3
MSP:CLE Southwest 2
MSP:CLE Southwest 2
MSP:CLE American 2
MSP:CLE JetBlue 1"

d <- data.table(read.table(textConnection(s1), header=TRUE))

setkey(d, Carrier, Market)

## Mean of Stops in market y, over every carrier except x
f <- function(x, y) {
  subset(d, !(Carrier %in% x) & Market == y, Stops)[, mean(Stops)]
}

d[, s := f(.BY[[1]], .BY[[2]]), by=list(Carrier, Market)]

##     Market   Carrier Stops        s
## 1: MSP:CLE  American     2 1.666667
## 2: IAH:SNA     Delta     1 3.000000
## 3: IAH:SNA     Delta     1 3.000000
## 4: MSP:CLE   JetBlue     1 2.000000
## 5: IAH:SNA Southwest     3 1.000000
## 6: IAH:SNA Southwest     3 1.000000
## 7: MSP:CLE Southwest     2 1.500000
## 8: MSP:CLE Southwest     2 1.500000

The solution above performs very poorly on large data sets (it is essentially an mapply), but I'm not sure how to do this in a fast, data.table-like way.

Perhaps one could (dynamically) generate a factor to do this? I'm just not sure how...

Is there a way to improve it?

Edit: just for the heck of it, here is a way to get a larger version of the above:
library(data.table)
dl.dta <- function(...){
## input years ..
years <- gsub("\\.", "_", c(...))
baseurl <- "http://www.transtats.bts.gov/Download/"
names <- paste("Origin_and_Destination_Survey_DB1BMarket", years, sep="_")
info <- t(sapply(names, function(x) file.exists(paste(x, c("zip", "csv"), sep="."))))
to.download <- paste(baseurl, names, ".zip", sep="")[!apply(info, 1, any)]
if (length(to.download) > 0){
message("starting download...")
sapply(to.download,
function(x) download.file(x, rev(strsplit(x, "/")[[1]])[1]))}

to.unzip <- paste(names, "zip", sep=".")[!info[, 2]]
if (length(to.unzip) > 0){
message("starting to unzip...")
sapply(to.unzip, unzip)}
paste(names, "csv", sep=".")}

countWords.split <- function(x, s=":"){
## Faster on my machine than grep for some reason
sapply(strsplit(as.character(x), s), length)}

countWords.grep <- function(x){
sapply(gregexpr("\\W+", x), length)+1}

fname <- dl.dta(2013.1)
cols <- rep("NULL", 41)
## Columns to keep: 9 is Origin, 18 is Dest, 24 is groups of airports in travel
## 30 is RPcarrier (reporting carrier).
## For more columns: 35 is market fare and 36 is distance.
cols[9] <- cols[18] <- cols[24] <- cols[30] <- NA
d <- data.table(read.csv(file=fname, colClasses=cols))
d[, Market := paste(Origin, Dest, sep=":")]
## should probably
d[, Stops := -2 + countWords.split(AirportGroup)]
d[, Carrier := RPCarrier]
d[, c("RPCarrier", "Origin", "Dest", "AirportGroup") := NULL]

Best answer

Using a little basic math:

d[, c("tmp.mean", "N") := list(mean(Stops), .N), by = Market]
d[, exep.mean := (tmp.mean * N - sum(Stops)) / (N - .N), by = list(Market,Carrier)]

#     Market   Carrier Stops tmp.mean N exep.mean
# 1: IAH:SNA     Delta     1     2.00 4  3.000000
# 2: IAH:SNA     Delta     1     2.00 4  3.000000
# 3: IAH:SNA Southwest     3     2.00 4  1.000000
# 4: IAH:SNA Southwest     3     2.00 4  1.000000
# 5: MSP:CLE Southwest     2     1.75 4  1.500000
# 6: MSP:CLE Southwest     2     1.75 4  1.500000
# 7: MSP:CLE  American     2     1.75 4  1.666667
# 8: MSP:CLE   JetBlue     1     1.75 4  2.000000
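
The arithmetic behind this: the mean of Stops over a market with one carrier left out equals (market sum minus that carrier's sum) divided by (market row count minus that carrier's row count). Since tmp.mean * N is just the market sum, the same idea can be sketched by storing the sum directly. What follows is a minimal sketch, not the answerer's code: the column names mkt.sum, mkt.n and excl.mean are made up for illustration, and the sample data is retyped by hand from the question. The brute-force check at the end is only there to confirm the result on small data.

```r
library(data.table)

# Sample data from the question, typed in directly
d <- data.table(
  Market  = c("IAH:SNA", "IAH:SNA", "IAH:SNA", "IAH:SNA",
              "MSP:CLE", "MSP:CLE", "MSP:CLE", "MSP:CLE"),
  Carrier = c("Delta", "Delta", "Southwest", "Southwest",
              "Southwest", "Southwest", "American", "JetBlue"),
  Stops   = c(1, 1, 3, 3, 2, 2, 2, 1)
)

# Market-level totals, computed once per market
# (mkt.sum and mkt.n are hypothetical helper names)
d[, c("mkt.sum", "mkt.n") := list(sum(Stops), .N), by = Market]

# Subtract each carrier's own contribution from the market totals
d[, excl.mean := (mkt.sum - sum(Stops)) / (mkt.n - .N),
  by = list(Market, Carrier)]

# Drop the helper columns again
d[, c("mkt.sum", "mkt.n") := NULL]

# Brute-force reference, one subset per row; for checking on small data only
brute <- mapply(
  function(m, cr) d[Market == m & Carrier != cr, mean(Stops)],
  d$Market, d$Carrier
)
stopifnot(isTRUE(all.equal(d$excl.mean, unname(brute))))
```

One case to watch: if a market contains only a single carrier, both the numerator and the denominator are zero and the result is NaN, which matches what taking the mean of an empty subset would give.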

Regarding "R: data.table on subsets excluded by value", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/17893613/
