
Is there a way to create a loop where I provide a function and dataframe and subsample it, and repeat the function with a subsample N times?

Reposted. Author: bug小助手. Updated: 2023-10-28 11:06:17



I am not sure what the correct word for this would be, so apologies for getting the terminology horribly wrong. Basically I have about 1000 datapoints, and I want to randomly subsample 100 data points 999 times and perform the same function (a generalised least squares model) on each subsample, and see how often the correlation would be significant.

I am also adding some more context, in case it helps. My data is in a data frame with various columns. I am testing whether there is a relationship between altitude and dichromatism, and whether that relationship varies depending on whether dichromatism is measured using a spectrophotometer or human scoring. I also include the latitude centroid of each species' range in these models, so the PGLS for each looks as follows:

library(nlme)  # provides gls()
library(ape)   # provides corPagel()

PGLS_VO_Score <- gls(Colour_discriminability_Absolute ~ Altitude_Reported * Centroid.Abs,
                     correlation = corPagel(1, phy = AvianTreeEdge, form = ~Species),
                     data = VO_HumanScores_Merged, method = "ML")

PGLS_Human_Score <- gls(Human_Score ~ Altitude_Reported * Centroid.Abs,
                        correlation = corPagel(1, phy = AvianTreeEdge, form = ~Species),
                        data = VO_HumanScores_Merged, method = "ML")

The data frame VO_HumanScores_Merged includes a column each for species names, human scores, spectrophotometer scores, altitude, and latitude, plus some transformed versions of those (log-transformed, etc.), which I created at the start in case I needed to transform the data to meet the assumptions of the PGLS.

Recommended answers


Sampling inside a pipeline shows what can be done here:

myfun <- function(x) cor(x[[1]], x[[3]])
set.seed(42)
replicate(5, mtcars[sample(nrow(mtcars), 10),], simplify=FALSE) |>
lapply(myfun)
# [[1]]
# [1] -0.8130999
# [[2]]
# [1] -0.8633841
# [[3]]
# [1] -0.7967049
# [[4]]
# [1] -0.901294
# [[5]]
# [1] -0.8761853

(My 5 is your 999, my 10 is your 100.)


simplify=FALSE is required because otherwise replicate will reduce the result to a (nested) matrix, which is not what we want. My myfun is contrived; use whatever function you want.
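
To see the difference, a small sketch (using mtcars as in the example above; the names simplified and kept are only for illustration) compares the two calls:

```r
set.seed(42)
# Default simplify=TRUE: the five data frames are collapsed into one
# list-matrix (one row per column of mtcars, one column per replicate).
simplified <- replicate(5, mtcars[sample(nrow(mtcars), 10), ])
is.matrix(simplified)   # TRUE
dim(simplified)         # 11 5

# simplify=FALSE: a plain list of five data frames, ready for lapply().
kept <- replicate(5, mtcars[sample(nrow(mtcars), 10), ], simplify = FALSE)
length(kept)                        # 5
all(sapply(kept, is.data.frame))    # TRUE
```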


The (perhaps only) advantage to breaking it out into two (or more) steps in a pipeline is that if you want to go back to revisit the random sampling, it's much simpler if you save that random sampling. For example,


set.seed(42)
sampdat <- replicate(5, mtcars[sample(nrow(mtcars), 10),], simplify=FALSE)
lapply(sampdat, myfun)
# [[1]]
# [1] -0.8130999
# [[2]]
# [1] -0.8633841
# [[3]]
# [1] -0.7967049
# [[4]]
# [1] -0.901294
# [[5]]
# [1] -0.8761853

If you later realize you need to do something else with the sample data (another metric or whatever) and you don't (for time, memory, or convenience) want to have to rerun all of the other sample-aggregations, you can re-use sampdat.
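
As a sketch of that re-use, and of the "how often is it significant" part of the question, the saved samples can be passed to a second function that returns a p-value. Here lm() on mtcars stands in for the real model; p_of_wt and the 0.05 cutoff are illustrative assumptions, not part of the original answer. For a gls() fit the coefficient table is in summary(fit)$tTable, and a phylogenetic model may also need the tree restricted to the sampled species.

```r
set.seed(42)
sampdat <- replicate(5, mtcars[sample(nrow(mtcars), 10), ], simplify = FALSE)

# Hypothetical helper: fit a model to one subsample and return the
# p-value of the wt coefficient (lm() is a stand-in for the real model).
p_of_wt <- function(x) {
  fit <- lm(mpg ~ wt, data = x)
  coef(summary(fit))["wt", "Pr(>|t|)"]
}

pvals <- sapply(sampdat, p_of_wt)   # one p-value per subsample
mean(pvals < 0.05)                  # share of subsamples significant at 0.05
```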



You can take a random sample from your datapoints using sample. Then you can run your function n times using replicate.
An example that takes a random sample of n=100 and computes the mean 10 times:


> set.seed(1)
> datapoints <- runif(1000, max = 10000)
> result <- replicate(10, mean(sample(datapoints, 100)))
> result
5194.298 5063.320 5064.992 4681.281 5008.011 4849.998 5320.206 5012.931 4900.636 4776.135

Comments

Thank you for your comment. I think I did something wrong, though I am not sure what, because every output I got was exactly the same, which I do not believe is what is meant to happen. This is what I ran:

myfun <- function(PGLS_VO_Scores) cor(VO_HumanScores_Merged$Colour_discriminability_Absolute, VO_HumanScores_Merged$Altitude_Reported)
BirdReplicationAttempt <- replicate(999, VO_HumanScores_Merged[sample(nrow(VO_HumanScores_Merged), 100),], simplify=FALSE) |> lapply(myfun)

I have added more context to the original question in case that helps in understanding where the error occurred.

You wrote a function that accepts PGLS_VO_Scores as its sole argument but never uses it, instead breaching scope and grabbing data from something else entirely. The function is supposed to take the sample data and do something with that sample data, not with data that might (or might not) be in some calling environment.

Try changing your function to myfun <- function(x) cor(x$Colour_discriminability_Absolute, x$Altitude_Reported) and rerun your replication.

Thanks, that seems to have worked. Just to confirm: is the output the p-values of the correlation, or the correlation itself?
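
For reference, cor() returns the correlation coefficient itself, not a p-value. A minimal sketch, assuming a plain Pearson test is acceptable for illustration (it ignores phylogeny, so it is not a substitute for the PGLS), uses cor.test() to get both. cor_and_p is a hypothetical helper name and mtcars stands in for the real data:

```r
# Hypothetical helper returning both the coefficient and its p-value.
cor_and_p <- function(x) {
  ct <- cor.test(x$mpg, x$wt)
  c(r = unname(ct$estimate), p = ct$p.value)
}

set.seed(42)
sampdat <- replicate(5, mtcars[sample(nrow(mtcars), 10), ], simplify = FALSE)
res <- t(sapply(sampdat, cor_and_p))  # 5 x 2 matrix: columns r and p
res
```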

Thank you for your comment. I tried to do this with the PGLS function that I want to rerun, substituting it for mean and substituting my data set for datapoints, so that it reads: replicate(999, PGLS_VO_Score(sample(VO_HumanScores_Merged, 100))). But I only got an error: Error in PGLS_VO_Score(sample(VO_HumanScores_Merged, 100)) : could not find function "PGLS_VO_Score". Is there a way to resolve this so that the function I used for the entire dataset is recognised as the function I want to apply to each subset?
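
Two things seem to go wrong in that attempt: PGLS_VO_Score is a fitted model object, not a function, and sample() applied to a data frame samples its columns, not its rows. A minimal sketch of one way around both, with lm() and mtcars standing in for the gls() call and the real data (fit_model is a hypothetical name):

```r
# Wrap the model call in a function of the data, so it can be re-run
# on any subsample (lm() is a stand-in for the gls() call).
fit_model <- function(d) lm(mpg ~ wt * hp, data = d)

set.seed(1)
fits <- replicate(
  5,                                              # 999 in the question
  fit_model(mtcars[sample(nrow(mtcars), 20), ]),  # sample rows, not columns
  simplify = FALSE
)
length(fits)        # 5 fitted models
coef(fits[[1]])     # coefficients from the first subsample
```

Keeping the fitted models in a list means that coefficients or p-values can be extracted afterwards with lapply() or sapply().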
