
r - How to calculate the median of a grouped dataset?

Reposted. Author: 行者123. Updated: 2023-12-04 03:44:15

My dataset is as follows:

salary  number
1500-1600 110
1600-1700 180
1700-1800 320
1800-1900 460
1900-2000 850
2000-2100 250
2100-2200 130
2200-2300 70
2300-2400 20
2400-2500 10
How can I calculate the median of this dataset? Here is what I have tried:
x <- c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)
colnames <- "numbers"
rownames <- c("[1500-1600]", "(1600-1700]", "(1700-1800]", "(1800-1900]",
"(1900-2000]", "(2000,2100]", "(2100-2200]", "(2200-2300]",
"(2300-2400]", "(2400-2500]")
y <- matrix(x, nrow=length(x), dimnames=list(rownames, colnames))
data.frame(y, "cumsum"=cumsum(y))

numbers cumsum
[1500-1600] 110 110
(1600-1700] 180 290
(1700-1800] 320 610
(1800-1900] 460 1070
(1900-2000] 850 1920
(2000,2100] 250 2170
(2100-2200] 130 2300
(2200-2300] 70 2370
(2300-2400] 20 2390
(2400-2500] 10 2400
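Here you can see that the halfway frequency is 2400/2 = 1200. It lies between 1070 and 1920, so the median class is the (1900-2000] group. You can use the following formula to get this result: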

Median = L + h/f (n/2 - c)


where:

L is the lower class boundary of median class
h is the size of the median class i.e. difference between upper and lower class boundaries of median class
f is the frequency of median class
c is previous cumulative frequency of the median class
n/2 is total no. of observations divided by 2 (i.e. sum f / 2)
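
As a worked check, the formula can be evaluated with the values read off the cumulative table above. This is only a sketch: the variable names are chosen here to mirror the symbols in the formula and do not come from the original post.

```r
# Values read from the cumulative-frequency table (names mirror the formula).
L <- 1900          # lower boundary of the median class (1900-2000]
h <- 2000 - 1900   # width of the median class
f <- 850           # frequency of the median class
c_prev <- 1070     # cumulative frequency before the median class ("c")
n <- 2400          # total number of observations

median_est <- L + h / f * (n / 2 - c_prev)
median_est
# [1] 1915.294
```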


Alternatively, the median class is defined by the following method:

Locate n/2 in the column of cumulative frequency.

Get the class in which this lies.
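
One possible way to carry out these two steps in R is sketched below. The vectors `freq`, `lower` and `upper` are illustrative names, filled in from the table in the question; using `which()` to pick the class is an assumption of this sketch, not something stated in the post.

```r
# Class frequencies and boundaries from the question's table.
freq  <- c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)
lower <- seq(1500, 2400, by = 100)
upper <- lower + 100

cf   <- cumsum(freq)     # cumulative frequencies
half <- sum(freq) / 2    # n/2 = 1200

# First class whose cumulative frequency reaches n/2.
median_row <- which(cf >= half)[1]
c(lower = lower[median_row], upper = upper[median_row])
# lower upper
#  1900  2000
```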


And in code:
> 1900 + (1200 - 1070) / (1920 - 1070) * (2000 - 1900)    
[1] 1915.294
What I want now is to make the expression above more elegant, i.e. 1900+(1200-1070)/(1920-1070)*(2000-1900). How can I do that?

Best answer

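Since you already know the formula, it should be easy enough to create a function to do the calculation for you.

Here, I've created a basic function to get you started. The function takes four arguments:

  • frequencies: a vector of frequencies ("number" in your first example).
  • intervals: a 2-row matrix with the same number of columns as the length of frequencies, where the first row holds the lower class boundaries and the second row the upper class boundaries. Alternatively, "intervals" may be a column in your data.frame, and you can specify sep (and possibly trim) to have the function create the required matrix for you automatically.
  • sep: the separator character in the "intervals" column of your data.frame.
  • trim: a regular expression of characters that need to be removed before trying to coerce to a numeric matrix. One pattern is built into the function: trim = "cut". This sets the regular expression pattern to remove (, ), [, and ] from the input.

Here is the function (with comments showing how I put it together from your description):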
    GroupedMedian <- function(frequencies, intervals, sep = NULL, trim = NULL) {
      # If "sep" is specified, the function will try to create the
      # required "intervals" matrix. "trim" removes any unwanted
      # characters before attempting to convert the ranges to numeric.
      if (!is.null(sep)) {
        if (is.null(trim)) pattern <- ""
        else if (trim == "cut") pattern <- "\\[|\\]|\\(|\\)"
        else pattern <- trim
        intervals <- sapply(strsplit(gsub(pattern, "", intervals), sep), as.numeric)
      }

      cf <- cumsum(frequencies)
      Midrow <- findInterval(max(cf)/2, cf) + 1
      L <- intervals[1, Midrow]      # lower class boundary of median class
      h <- diff(intervals[, Midrow]) # size of median class
      f <- frequencies[Midrow]       # frequency of median class
      # cumulative frequency of the class before the median class
      # (0 when the median class is the first class)
      cf2 <- if (Midrow > 1) cf[Midrow - 1] else 0
      n_2 <- max(cf)/2               # total observations divided by 2

      unname(L + (n_2 - cf2)/f * h)
    }
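
To make the `sep`/`trim` step more concrete, the sketch below runs the same `gsub()`/`strsplit()`/`sapply()` pipeline on two labels written in the style of `cut()`. The `labels` vector is made up for illustration and is not taken from the post.

```r
# Hypothetical interval labels in the style produced by cut().
labels <- c("(1.9,11.7]", "(11.7,21.5]")

# Same pattern the function uses for trim = "cut": drop (, ), [ and ].
pattern <- "\\[|\\]|\\(|\\)"
cleaned <- gsub(pattern, "", labels)

# Split on the separator and convert; sapply() binds the pieces column-wise,
# giving a 2-row matrix: row 1 = lower bounds, row 2 = upper bounds.
bounds <- sapply(strsplit(cleaned, ","), as.numeric)
bounds
#      [,1] [,2]
# [1,]  1.9 11.7
# [2,] 11.7 21.5
```

Each column of `bounds` corresponds to one class, which is why the function indexes it as `intervals[1, Midrow]` and `intervals[, Midrow]`.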

Here is a sample data.frame to work with:
    mydf <- structure(list(salary = c("1500-1600", "1600-1700", "1700-1800", 
    "1800-1900", "1900-2000", "2000-2100", "2100-2200", "2200-2300",
    "2300-2400", "2400-2500"), number = c(110L, 180L, 320L, 460L,
    850L, 250L, 130L, 70L, 20L, 10L)), .Names = c("salary", "number"),
    class = "data.frame", row.names = c(NA, -10L))
    mydf
    # salary number
    # 1 1500-1600 110
    # 2 1600-1700 180
    # 3 1700-1800 320
    # 4 1800-1900 460
    # 5 1900-2000 850
    # 6 2000-2100 250
    # 7 2100-2200 130
    # 8 2200-2300 70
    # 9 2300-2400 20
    # 10 2400-2500 10

Now we can simply do the following:
    GroupedMedian(mydf$number, mydf$salary, sep = "-")
    # [1] 1915.294

Here is an example of the function at work on some grouped data:
    set.seed(1)
    x <- sample(100, 100, replace = TRUE)
    y <- data.frame(table(cut(x, 10)))
    y
    # Var1 Freq
    # 1 (1.9,11.7] 8
    # 2 (11.7,21.5] 8
    # 3 (21.5,31.4] 8
    # 4 (31.4,41.2] 15
    # 5 (41.2,51] 13
    # 6 (51,60.8] 5
    # 7 (60.8,70.6] 11
    # 8 (70.6,80.5] 15
    # 9 (80.5,90.3] 11
    # 10 (90.3,100] 6

    ### Here's GroupedMedian's output on the grouped data.frame...
    GroupedMedian(y$Freq, y$Var1, sep = ",", trim = "cut")
    # [1] 49.49231

    ### ... and the output of median on the original vector
    median(x)
    # [1] 49.5

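By the way, in the sample data you provided, I think there is a mistake in one of your ranges (all are separated by dashes except one, which is separated by a comma). Since strsplit uses a regular expression by default to split on, you can use the function like this: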
    x<-c(110,180,320,460,850,250,130,70,20,10)
    colnames<-c("numbers")
    rownames<-c("[1500-1600]","(1600-1700]","(1700-1800]","(1800-1900]",
    "(1900-2000]"," (2000,2100]","(2100-2200]","(2200-2300]",
    "(2300-2400]","(2400-2500]")
    y<-matrix(x,nrow=length(x),dimnames=list(rownames,colnames))
    GroupedMedian(y[, "numbers"], rownames(y), sep="-|,", trim="cut")
    # [1] 1915.294
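
As an independent cross-check (not part of the original answer), the same value can be obtained by linearly interpolating the cumulative frequency curve with base R's `approx()`. This sketch assumes contiguous classes and strictly increasing cumulative counts (no empty classes); the variable names are illustrative.

```r
freq   <- c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)
breaks <- seq(1500, 2500, by = 100)   # 11 class boundaries for 10 classes

# Cumulative count at each boundary, starting from 0 at the lowest boundary.
cum_at_break <- c(0, cumsum(freq))

# Interpolate the boundary value at which the cumulative count reaches n/2.
median_interp <- approx(x = cum_at_break, y = breaks, xout = sum(freq) / 2)$y
median_interp
# [1] 1915.294
```

The agreement is expected, since the grouped-median formula is itself a linear interpolation within the median class.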

Regarding "r - How to calculate the median of a grouped dataset?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/18887382/
