
r - How to split a data table by group and subset by occurrence in a column?

Reposted. Author: 行者123. Last updated: 2023-12-04 10:12:51

I have a large data set, 287046 x 18, that looks like this (partial view only):

tdf
geneSymbol peaks
16 AK056486 Pol2_only
13 AK310751 no_peak
7 BC036251 no_peak
10 DQ575786 no_peak
4 DQ597235 no_peak
5 DQ599768 no_peak
11 DQ599872 no_peak
12 DQ599872 no_peak
2 FAM138F no_peak
15 FAM41C no_peak
34116 GAPDH both
283034 GAPDH Pol2_only
6 LOC100132062 no_peak
9 LOC100133331 no_peak
14 LOC100288069 both
8 M37726 no_peak
3 OR4F5 no_peak
17 SAMD11 both
18 SAMD11 both
19 SAMD11 both
20 SAMD11 both
21 SAMD11 both
22 SAMD11 both
23 SAMD11 both
24 SAMD11 both
25 SAMD11 both
1 WASH7P Pol2_only

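What I want to do is extract the gene symbols that (1) are "Pol2_only" or "both", and then (2) are only "Pol2_only" and never "both". For example, GAPDH satisfies condition 1 but not condition 2.

I have tried plyr along these lines (there is an extra condition in there; please ignore it):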
## grab genes with both peaks 
pol2.peaks <- ddply(filem, .(geneSymbol), function(dfrm) subset(dfrm, peaks == "both" | (peaks == "Pol2_only" & peaks == "CBP20_only")), .parallel=TRUE)

## grab genes pol2 only peaks
pol2.only.peaks <- ddply(tdf, .(geneSymbol), function(dfrm) subset(dfrm, peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only"), .parallel=TRUE)

But it takes a long time and still returns the wrong answer. For example, the answer for (2) is:
pol2.only.peaks
geneSymbol peaks
1 AK056486 Pol2_only
2 GAPDH Pol2_only
3 WASH7P Pol2_only

As you can see, GAPDH should not be there. My implementation in data.table (which I would prefer) produces the same result:
filem.dt <- as.data.table(tdf)
setkey(filem.dt, "geneSymbol")
test.dt <- filem.dt[ , .SD[ peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only"]]
test.dt
geneSymbol peaks
1: AK056486 Pol2_only
2: GAPDH Pol2_only
3: WASH7P Pol2_only

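The problem seems to be that the subsetting works row by row, whereas I need it applied to the whole subgroup for each geneSymbol.

Can anyone help me with the grouping? A data.table solution would be welcome because it is faster, but plyr (or even base R) is fine. A solution that adds an extra column recording the nature of the peak would be perfect. This is what I mean: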
tdf
geneSymbol peaks newCol
16 AK056486 Pol2_only Pol2_only
13 AK310751 no_peak no_peak
7 BC036251 no_peak no_peak
10 DQ575786 no_peak no_peak
4 DQ597235 no_peak no_peak
5 DQ599768 no_peak no_peak
11 DQ599872 no_peak no_peak
12 DQ599872 no_peak no_peak
2 FAM138F no_peak no_peak
15 FAM41C no_peak no_peak
34116 GAPDH both both
283034 GAPDH Pol2_only both
6 LOC100132062 no_peak no_peak
9 LOC100133331 no_peak no_peak
14 LOC100288069 both both
8 M37726 no_peak no_peak
3 OR4F5 no_peak no_peak
17 SAMD11 both both
18 SAMD11 both both
19 SAMD11 both both
20 SAMD11 both both
21 SAMD11 both both
22 SAMD11 both both
23 SAMD11 both both
24 SAMD11 both both
25 SAMD11 both both
1 WASH7P Pol2_only Pol2_only

Note again that GAPDH is now "both" in both of its rows. Here is the data:
dput(tdf)
structure(list(geneSymbol = c("AK056486", "AK310751", "BC036251",
"DQ575786", "DQ597235", "DQ599768", "DQ599872", "DQ599872", "FAM138F",
"FAM41C", "GAPDH", "GAPDH", "LOC100132062", "LOC100133331", "LOC100288069",
"M37726", "OR4F5", "SAMD11", "SAMD11", "SAMD11", "SAMD11", "SAMD11",
"SAMD11", "SAMD11", "SAMD11", "SAMD11", "WASH7P"), peaks = c("Pol2_only",
"no_peak", "no_peak", "no_peak", "no_peak", "no_peak", "no_peak",
"no_peak", "no_peak", "no_peak", "both", "Pol2_only", "no_peak",
"no_peak", "both", "no_peak", "no_peak", "both", "both", "both",
"both", "both", "both", "both", "both", "both", "Pol2_only")), .Names = c("geneSymbol",
"peaks"), row.names = c(16L, 13L, 7L, 10L, 4L, 5L, 11L, 12L,
2L, 15L, 34116L, 283034L, 6L, 9L, 14L, 8L, 3L, 17L, 18L, 19L,
20L, 21L, 22L, 23L, 24L, 25L, 1L), class = "data.frame")

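Thanks!

EDIT
I found a way around the problem. The selection happens row by row. All that is needed is a hack: require that every value in the returned logical vector be TRUE. So here is what I did with the plyr function: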
ddply(tdf, .(geneSymbol), function(dfrm) subset(dfrm, all(peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only")), .parallel=TRUE)
geneSymbol peaks
1 AK056486 Pol2_only
2 WASH7P Pol2_only

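Note the use of all around the condition. Now the result is as expected, i.e. only the "Pol2_only" genes (redundancy alert) :) What remains is an implementation in data.table, which I tried but could not get working. Any help?

I have not posted this as an answer to my own question, hoping that someone will propose a better solution in data.table.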
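To illustrate why wrapping the condition in all() changes the result, here is a minimal base R sketch on GAPDH's two rows only. The names grp, rowwise and groupwise are made up for this illustration and do not appear in the original code.

```r
# GAPDH's two rows from the data above (illustrative subset)
grp <- data.frame(
  geneSymbol = c("GAPDH", "GAPDH"),
  peaks      = c("both", "Pol2_only"),
  stringsAsFactors = FALSE
)

# Row-wise condition: one TRUE/FALSE per row, so the Pol2_only row survives
rowwise <- subset(grp, peaks == "Pol2_only")

# Group-wise condition: all() collapses the vector to a single value,
# which is recycled over every row, so the group is kept or dropped as a whole
groupwise <- subset(grp, all(peaks == "Pol2_only"))

nrow(rowwise)    # number of rows kept by the row-wise filter
nrow(groupwise)  # number of rows kept by the group-wise filter
```

Because ddply calls the function once per geneSymbol, the single value returned by all() acts as a keep-or-drop switch for that gene.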

Best answer

Since you asked for a data.table solution:

# set the key to geneSymbol, then peaks
TDF <- data.table(tdf, key = c('geneSymbol','peaks'))

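# use unique to get unique combinations, then for each geneSymbol take the first
# match (this relies on both < Pol2_only < no_peak within each geneSymbol;
# data.table sorts keys in C-locale, where uppercase "P" sorts before
# lowercase "b", so check the key order on your version),
# then exclude the rows with peaks == "no_peak"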

unique(TDF)[.(unique(geneSymbol)), mult = 'first'][!peaks =='no_peak']

# geneSymbol peaks
# 1: AK056486 Pol2_only
# 2: GAPDH both
# 3: LOC100288069 both
# 4: SAMD11 both
# 5: WASH7P Pol2_only
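
For the extra column requested in the question, a variant that does not depend on how the key is sorted may be easier to reason about. The following is a minimal sketch rather than the answerer's method: it assumes an explicit priority order (the vector prio is a name introduced here) and assumes that a gene with both Pol2_only and no_peak rows should count as Pol2_only, which the question does not state.

```r
library(data.table)

# Small excerpt of the question's data (illustrative subset)
dt <- data.table(
  geneSymbol = c("AK056486", "GAPDH", "GAPDH", "OR4F5",
                 "SAMD11", "SAMD11", "WASH7P"),
  peaks      = c("Pol2_only", "both", "Pol2_only", "no_peak",
                 "both", "both", "Pol2_only")
)

# Assumed priority: the highest-ranked label present in a gene's rows wins
prio <- c("both", "Pol2_only", "no_peak")

# One label per gene, recycled onto every row of that gene;
# labels not listed in prio (e.g. "CBP20_only") would yield NA here
dt[, newCol := prio[min(match(peaks, prio))], by = geneSymbol]

# Condition 1: genes with a Pol2 peak of either kind
cond1 <- dt[newCol %in% c("both", "Pol2_only"), unique(geneSymbol)]

# Condition 2: genes that are Pol2_only and never "both"
cond2 <- dt[newCol == "Pol2_only", unique(geneSymbol)]
```

With this excerpt, GAPDH is labelled "both" on both of its rows, so it appears in cond1 but not in cond2.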

Regarding "r - How to split a data table by group and subset by occurrence in a column?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/17943623/
