gpt4 book ai didi

r - R 中 data.frames 与 sapply 之间的高效坐标匹配

转载 作者:行者123 更新时间:2023-12-01 01:02:58 26 4
gpt4 key购买 nike

我正在尝试获取一个向量,告诉我 data.frame 中的哪些行(transcriptcoords)

              chr  start  end
NONHSAT000001 chr1 11868 14409
NONHSAT000002 chr1 11871 14412
NONHSAT000003 chr1 11873 14409
NONHSAT000004 chr1 12009 13670
NONHSAT000005 chr1 14777 16668
NONHSAT000006 chr1 15602 29370

在另一个 data.frame (genecoords) 中松散地包含开始/结束坐标(具有 +/- 10 容差)
              chr  start  end
NONHSAG000001 chr1 11869 14412
NONHSAG000002 chr1 14778 29370
NONHSAG000003 chr1 29554 31109
NONHSAG000004 chr1 34554 36081
NONHSAG000005 chr1 36273 50281
NONHSAG000006 chr1 62948 63887

为此,我在第一个 data.frame 的行 indeces 上使用 sapply 循环,将坐标与第二个 data.frame 中的任何行匹配。我有一个解决方案(如下所述),但它似乎相当慢(大约 6 秒,2000 行):
   user  system elapsed 
6.02 0.00 6.04

我试图了解 sapply 的哪些部分可以优化。是 if/else 块吗?还是比较行(==、<=、>=)?或者更简单地说,它是一种本质上很慢的算法吗?

谢谢!我生成的代码如下:
load(url("http://www.giorgilab.org/stuff/data.rda"))

# Pre-vectorize the data frames
g0<-rownames(genecoords)
g1<-genecoords[,1]
g2<-as.integer(genecoords[,2])
g3<-as.integer(genecoords[,3])

t0<-rownames(transcriptcoords)
t1<-transcriptcoords[,1]
t2<-as.integer(transcriptcoords[,2])
t3<-as.integer(transcriptcoords[,3])

system.time(gs<-sapply(1:2000,function(i){
t<-t0[i]
chr<-t1[i]
start<-t2[i]
end<-t3[i]

# Find a match (loose boundaries +/- 10)
right1<-which(g1==chr)
right2<-which(g2<=start+10)
right3<-which(g3>=end-10)
right<-intersect(right3,intersect(right1,right2))
right<-g0[right]

if(length(right)==1){
g<-right
} else if(length(right)>1){
# Get the smallest match
heregenecoords<-genecoords[right,]
size<-apply(heregenecoords,1,function(x){abs(as.numeric(x[3])-as.numeric(x[2]))})
g<-names(which.min(size))
} else {
g<-t
}
return(g)
}
))

最佳答案

用你的数据

tx0 <- read.table(textConnection("chr  start  end
NONHSAT000001 chr1 11868 14409
NONHSAT000002 chr1 11871 14412
NONHSAT000003 chr1 11873 14409
NONHSAT000004 chr1 12009 13670
NONHSAT000005 chr1 14777 16668
NONHSAT000006 chr1 15602 29370"))

gene0 <- read.table(textConnection("chr start end
NONHSAG000001 chr1 11869 14412
NONHSAG000002 chr1 14778 29370
NONHSAG000003 chr1 29554 31109
NONHSAG000004 chr1 34554 36081
NONHSAG000005 chr1 36273 50281
NONHSAG000006 chr1 62948 63887"))

GenomicRanges包裹在 Bioconductor轻松有效地做到这一点(对于数百万次重叠)。
library(GenomicRanges)
tx <- with(tx0, GRanges(chr, IRanges(start, end)))
gene <- with(gene0, GRanges(chr, IRanges(start, end)))

## increase width by 10 on both sides of the center of the gene range
gene <- resize(gene, width=width(gene) + 20, fix="center")
## find overlaps of 'query' tx and 'subject' gene, where query is within subject
olaps <- findOverlaps(tx, gene, type="within")

例如,显示 'query' (tx) 1、2、3 和 4 在 'subject' (gene) 1 内。
> findOverlaps(tx, gene, type="within")
Hits of length 6
queryLength: 6
subjectLength: 6
queryHits subjectHits
<integer> <integer>
1 1 1
2 2 1
3 3 1
4 4 1
5 5 2
6 6 2

并且基因 1 与 4 个转录本重叠,基因 2 与 2 个转录本重叠。
> table(subjectHits(olaps))

1 2
4 2

另见此 publication .使用更大的数据集:
tx <- with(transcriptcoords, GRanges(V1, IRanges(V2, V3, names=rownames(tx0))))
gene <- with(genecoords, GRanges(V1, IRanges(V2, V3, names=rownames(gene0))))

有一些时间:
system.time(gene <- resize(gene, width=width(gene) + 20, fix="center"))
## user system elapsed
## 0.056 0.000 0.057
system.time(findOverlaps(tx, gene, type="within"))
## user system elapsed
## 2.248 0.000 2.250

我认为这大约是来自 @danas.zuokos 的 data.table 解决方案的时间,只有 1000 个成绩单
system.time({
dt <- genecoords[transcriptcoords, allow.cartesian = TRUE];
res <- dt[start <= start.1 + tol & end >= end.1 - tol,
list(gene = gene[which.min(size)]), by = transcript]
})
## user system elapsed
## 2.148 0.244 2.400

关于r - R 中 data.frames 与 sapply 之间的高效坐标匹配,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/21472153/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com