gpt4 book ai didi

r - 比较数据框中行的更快方法

转载 作者:行者123 更新时间:2023-12-04 01:54:24 24 4
gpt4 key购买 nike

考虑下面的数据框。我想将每一行与下面的行进行比较,然后取超过 3 个值相等的行。

我写了下面的代码,但是如果你有一个大的数据框,它会很慢。

我怎么能做得更快?

data <- as.data.frame(matrix(c(10,11,10,13,9,10,11,10,14,9,10,10,8,12,9,10,11,10,13,9,13,13,10,13,9), nrow=5, byrow=T))
rownames(data)<-c("sample_1","sample_2","sample_3","sample_4","sample_5")

>data
V1 V2 V3 V4 V5
sample_1 10 11 10 13 9
sample_2 10 11 10 14 9
sample_3 10 10 8 12 9
sample_4 10 11 10 13 9
sample_5 13 13 10 13 9

output <- data.frame(sample = NA, duplicate = NA, matches = NA)
dfrow <- 1
for(i in 1:nrow(data)) {
sample <- data[i, ]
for(j in (i+1):nrow(data)) if(i+1 <= nrow(data)) {
matches <- 0
for(V in 1:ncol(data)) {
if(data[j,V] == sample[,V]) {
matches <- matches + 1
}
}
if(matches > 3) {
duplicate <- data[j, ]
pair <- cbind(rownames(sample), rownames(duplicate), matches)
output[dfrow, ] <- pair
dfrow <- dfrow + 1
}
}
}

>output
sample duplicate matches
1 sample_1 sample_2 4
2 sample_1 sample_4 5
3 sample_2 sample_4 4

最佳答案

这是一个 Rcpp 解决方案。但是,如果结果矩阵变得太大(即点击次数过多),则会引发错误。我运行循环两次,首先获得结果矩阵的必要大小,然后填充它。可能有更好的可能性。此外,显然,这仅适用于整数。如果您的矩阵是数字,则必须处理浮点精度。

library(Rcpp)
library(inline)

#C++ code:
body <- '
const IntegerMatrix M(as<IntegerMatrix>(MM));
const int m=M.ncol(), n=M.nrow();
long count1;
int count2;
count1 = 0;
for (int i=0; i<(n-1); i++)
{
for (int j=(i+1); j<n; j++)
{
count2 = 0;
for (int k=0; k<m; k++) {
if (M(i,k)==M(j,k)) count2++;
}
if (count2>3) count1++;
}
}
IntegerMatrix R(count1,3);
count1 = 0;
for (int i=0; i<(n-1); i++)
{
for (int j=(i+1); j<n; j++)
{
count2 = 0;
for (int k=0; k<m; k++) {
if (M(i,k)==M(j,k)) count2++;
}
if (count2>3) {
count1++;
R(count1-1,0) = i+1;
R(count1-1,1) = j+1;
R(count1-1,2) = count2;
}
}
}
return wrap(R);
'

fun <- cxxfunction(signature(MM = "matrix"),
body,plugin="Rcpp")

#with your data
fun(as.matrix(data))
# [,1] [,2] [,3]
# [1,] 1 2 4
# [2,] 1 4 5
# [3,] 2 4 4

#Benchmarks
set.seed(42)
mat1 <- matrix(sample(1:10,250*26,TRUE),ncol=26)
mat2 <- matrix(sample(1:10,2500*26,TRUE),ncol=26)
mat3 <- matrix(sample(1:10,10000*26,TRUE),ncol=26)
mat4 <- matrix(sample(1:10,25000*26,TRUE),ncol=26)
library(microbenchmark)
microbenchmark(
fun(mat1),
fun(mat2),
fun(mat3),
fun(mat4),
times=3
)
# Unit: milliseconds
# expr min lq median uq max neval
# fun(mat1) 2.675568 2.689586 2.703603 2.732487 2.761371 3
# fun(mat2) 272.600480 274.680815 276.761151 276.796217 276.831282 3
# fun(mat3) 4623.875203 4643.634249 4663.393296 4708.067638 4752.741979 3
# fun(mat4) 29041.878164 29047.151348 29052.424532 29235.839275 29419.254017 3

关于r - 比较数据框中行的更快方法,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/19720488/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com