I am not sure what the correct word for this would be, so apologies if I get the terminology horribly wrong. Basically, I have about 1000 data points, and I want to randomly subsample 100 of them 999 times, run the same function (a generalised least squares model) on each subsample, and see how often the correlation is significant.
I am also adding some more context, in case it helps. My data is in a data frame with various columns. I am testing whether there is a relationship between altitude and dichromatism, and whether that relationship differs depending on whether dichromatism is measured with a spectrophotometer or by human scoring. I also include the latitude centroid of each species' range in these models, so the PGLS for each looks as follows:
PGLS_VO_Score <- gls(Colour_discriminability_Absolute ~ Altitude_Reported*Centroid.Abs,
                     correlation = corPagel(1, phy = AvianTreeEdge, form = ~Species),
                     data = VO_HumanScores_Merged, method = "ML")
PGLS_Human_Score <- gls(Human_Score ~ Altitude_Reported*Centroid.Abs,
                        correlation = corPagel(1, phy = AvianTreeEdge, form = ~Species),
                        data = VO_HumanScores_Merged, method = "ML")
The data frame VO_HumanScores_Merged includes columns for species names, human scores, spectrophotometer scores, altitude, and latitude, plus some transformed versions of those (log-transformed, etc.), which I created at the start in case I needed to transform the data to meet the assumptions of the PGLS.
Splitting the work into a pipeline (draw the samples first, then apply a function to each one) helps show what can be done here:
myfun <- function(x) cor(x[[1]], x[[3]])
set.seed(42)
replicate(5, mtcars[sample(nrow(mtcars), 10),], simplify=FALSE) |>
lapply(myfun)
# [[1]]
# [1] -0.8130999
# [[2]]
# [1] -0.8633841
# [[3]]
# [1] -0.7967049
# [[4]]
# [1] -0.901294
# [[5]]
# [1] -0.8761853
(My 5 is your 999, my 10 is your 100.)
The simplify=FALSE is required, since otherwise replicate will reduce the result to a (nested) matrix, which is not what we want. My myfun is contrived; use whatever function you want.
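To make the simplify=FALSE point concrete, here is a small sketch that compares the two return shapes. It uses mtcars as a stand-in for your data, and the names m and s are just for illustration.

```r
set.seed(42)

# Default simplify: the sampled data frames are flattened into one list-matrix.
m <- replicate(3, mtcars[sample(nrow(mtcars), 10), ])
is.matrix(m)   # TRUE
dim(m)         # 11 3: one row per mtcars column, one column per replicate

# simplify = FALSE keeps each sample as its own data frame.
s <- replicate(3, mtcars[sample(nrow(mtcars), 10), ], simplify = FALSE)
class(s[[1]])  # "data.frame"
dim(s[[1]])    # 10 11
```

With the list of data frames, each element can be handed straight to a function that expects a data frame.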
The (perhaps only) advantage to breaking it out into two (or more) steps in a pipeline is that if you want to go back to revisit the random sampling, it's much simpler if you save that random sampling. For example,
set.seed(42)
sampdat <- replicate(5, mtcars[sample(nrow(mtcars), 10),], simplify=FALSE)
lapply(sampdat, myfun)
# [[1]]
# [1] -0.8130999
# [[2]]
# [1] -0.8633841
# [[3]]
# [1] -0.7967049
# [[4]]
# [1] -0.901294
# [[5]]
# [1] -0.8761853
If you later realize you need to do something else with the sample data (another metric or whatever) and you don't want to rerun all of the other sample aggregations (for time, memory, or convenience), you can re-use sampdat.
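As a rough illustration of that re-use, the sketch below computes a second, made-up metric (mean mpg) on the same saved samples. The column choices and the names cors and mean_mpg are only for the example.

```r
set.seed(42)
sampdat <- replicate(5, mtcars[sample(nrow(mtcars), 10), ], simplify = FALSE)

# First pass: correlation between mpg and disp in each sample.
cors <- sapply(sampdat, function(x) cor(x$mpg, x$disp))

# Later: a second metric on the *same* samples, with no resampling.
mean_mpg <- sapply(sampdat, function(x) mean(x$mpg))

# Both metrics line up row by row because they share the samples.
data.frame(cor_mpg_disp = cors, mean_mpg = mean_mpg)
```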
You can take a random sample from your data points using sample. Then you can run your function n times using replicate.
An example that takes a random sample of n=100 and computes its mean, repeated 10 times:
> set.seed(1)
> datapoints <- runif(1000, max = 10000)
> result <- replicate(10, mean(sample(datapoints, 100)))
> result
5194.298 5063.320 5064.992 4681.281 5008.011 4849.998 5320.206 5012.931 4900.636 4776.135
Thank you for your comment. I think I did something wrong, though I am not sure what, because every output I got was exactly the same, which I do not believe is meant to happen. This is what I ran:
myfun <- function(PGLS_VO_Scores) cor(VO_HumanScores_Merged$Colour_discriminability_Absolute, VO_HumanScores_Merged$Altitude_Reported)
BirdReplicationAttempt <- replicate(999, VO_HumanScores_Merged[sample(nrow(VO_HumanScores_Merged), 100),], simplify=FALSE) |> lapply(myfun)
I have added more context to the original question, in case that helps in understanding where the error occurred.
You wrote a function that accepts PGLS_VO_Scores as its sole argument but never uses it, instead choosing to breach scope and grab data from something else entirely. The function is supposed to take the sample data and do something with that sample data, not with data that might (or might not) be in some calling environment.
Try changing your function to myfun <- function(x) cor(x$Colour_discriminability_Absolute, x$Altitude_Reported) and rerun your replication.
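Here is a minimal reproduction of that scoping problem on mtcars, in case it helps to see it in isolation. The names badfun and goodfun are hypothetical: badfun ignores its argument the way the function in the comment does, and goodfun uses it.

```r
# Buggy: the argument x is ignored; the full mtcars is used every time.
badfun <- function(x) cor(mtcars$mpg, mtcars$disp)

# Fixed: the correlation is computed on the sample that was passed in.
goodfun <- function(x) cor(x$mpg, x$disp)

set.seed(42)
sampdat <- replicate(5, mtcars[sample(nrow(mtcars), 10), ], simplify = FALSE)

bad  <- sapply(sampdat, badfun)
good <- sapply(sampdat, goodfun)

length(unique(bad))   # 1: every replicate returns the same value
length(unique(good))  # more than 1: the value changes with the sample
```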
Thanks, that seems to have worked. Just to confirm: is the output the p-values of the correlation, or the correlation itself?
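For what it's worth, cor() returns only the correlation coefficient. If a p-value is wanted, one option is cor.test(), whose result has a p.value element. A small sketch on mtcars follows; the 0.05 cutoff and the names r and p are assumptions for illustration.

```r
set.seed(42)
sampdat <- replicate(5, mtcars[sample(nrow(mtcars), 10), ], simplify = FALSE)

# cor() gives the coefficient only.
r <- sapply(sampdat, function(x) cor(x$mpg, x$disp))

# cor.test() also runs the significance test; pull out its p-value.
p <- sapply(sampdat, function(x) cor.test(x$mpg, x$disp)$p.value)

# Share of subsamples that are significant at the 5% level.
mean(p < 0.05)
```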
Thank you for your comment. I tried to do this with the PGLS that I want to rerun, substituting it for "mean" and substituting my data set for "datapoints", so it reads as follows: replicate(999, PGLS_VO_Score(sample(VO_HumanScores_Merged, 100))), but I only got an error: Error in PGLS_VO_Score(sample(VO_HumanScores_Merged, 100)) : could not find function "PGLS_VO_Score"
Is there a way to resolve this so that it recognises the model I fitted to the entire dataset as the function I want to apply to each subset?
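Two things seem to be going on in that attempt. PGLS_VO_Score is the fitted model object returned by gls(), not a function, so it cannot be called. And sample() on a data frame samples columns, not rows, so the row indices need to be sampled instead. Below is a minimal sketch of one way to wrap the fit in a function. It uses mtcars and a plain gls() with no correlation structure as a stand-in; fit_one is a hypothetical helper, and the sizes are kept small. For the real PGLS, the tree passed to corPagel would presumably also need pruning to the sampled species (for example with ape's keep.tip), which is not shown here.

```r
library(nlme)  # provides gls()

# Hypothetical helper: fit the model on one subsample and return the
# p-value of a single term. No phylogenetic correlation in this stand-in.
fit_one <- function(d) {
  fit <- gls(mpg ~ disp, data = d, method = "ML")
  summary(fit)$tTable["disp", "p-value"]
}

set.seed(1)
pvals <- replicate(20, {
  rows <- sample(nrow(mtcars), 15)  # sample row indices, not the data frame
  fit_one(mtcars[rows, ])
})

# Proportion of subsamples in which the slope is significant at 5%.
mean(pvals < 0.05)
```

Returning a single number from the helper keeps the replicate() result a plain numeric vector, which makes the "how often is it significant" step a one-liner.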