R: Is there an equivalent to Stata's codebookout command?

Reposted by: bug小助手 · Updated: 2023-10-22 13:50:50



In Stata I am able to use the codebookout command to create an Excel workbook that saves name, label, and storage type of all the variables in the existing dataset with their corresponding values and value labels.



I would like to find an equivalent function in R. So far, I've encountered the memisc library which has a function called codebook, but it does not do the same thing as in Stata.



For example, in Stata the output of the codebook would look like this (see below; this is what I want):



Variable Name   Variable Label   Answer Label   Answer Code   Variable Type
hhid            hhid             Open ended                   String
inter_month     inter_month      Open ended                   long
year            year             Open ended                   long
org_unit        org_unit                                      long
                                 Balaka         1
                                 Blantyre       2
                                 Chikwawa       3
                                 Chiradzulu     4


That is, each column in the data frame is evaluated to produce values for five output columns:


  • Variable Name, which is the name of the column

  • Variable Label, which is the name of the column

  • Answer Label, which is the unique values in the column. If there are no unique values, it is considered open ended

  • Answer Code, which is the numerical assignment to each category in the Answer Label. Blank if the Answer Label is not categorical.

  • Variable Type: int, str, long (date)...
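
As a rough illustration of where each of these five pieces can be read off an R column, here is a sketch on made-up data (it is not part of the original question). `levels()` gives the answer labels of a factor, `seq_along()` over those levels gives the integer codes, and `class()` gives the type. The variable label is read from a `"label"` attribute, which is an assumption: that attribute is set by some import packages such as haven, but plain data frames do not have it.

```r
# Toy column standing in for org_unit (hypothetical data)
org_unit <- factor(c("Blantyre", "Balaka", "Chikwawa", "Balaka"))
attr(org_unit, "label") <- "Organisational unit"   # hypothetical variable label

answer_label   <- levels(org_unit)           # categories, sorted alphabetically by default
answer_code    <- seq_along(answer_label)    # integer codes 1, 2, 3 behind the levels
variable_type  <- class(org_unit)            # "factor"
variable_label <- attr(org_unit, "label", exact = TRUE)

# A non-factor column has no levels, so it would be treated as open ended
hhid <- c("A001", "A002")
is.factor(hhid)
```

Using `is.factor()` as the test for "categorical" is a design choice; a numeric column with few distinct values would still count as open ended under it.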



Here is my attempt:



CreateCodebook <- function(dF){
  numbercols <- length(colnames(dF))

  table <- data.frame()

  for (i in 1:length(colnames(dF))){
    AnswerCode <- if (sapply(dF, is.factor)[i]) 1:nrow(unique(dF[i])) else ""
    AnswerLabel <- if (sapply(dF, is.factor)[i]) unique(dF[order(dF[i]),][i]) else "Open ended"
    VariableName <- if (length(AnswerCode) - 1 > 1) c(colnames(dF)[i],
                      rep("",length(AnswerCode) - 1)) else colnames(dF)[i]
    VariableLabel <- if (length(AnswerCode) - 1 > 1) c(colnames(dF)[i],
                      rep("",length(AnswerCode) - 1)) else colnames(dF)[i]
    VariableType <- if (length(AnswerCode) - 1 > 1) c(sapply(dF, class)[i],
                      rep("",length(AnswerCode) - 1)) else sapply(dF, class)[i]

    df = data.frame(VariableName, VariableLabel, AnswerLabel, AnswerCode, VariableType)
    names(df) <- c("Variable Name", "Variable Label", "Variable Type", "Answer Code", "Answer Label")
    table <- rbind(table, df)

  }
  return(table)
}


Unfortunately, I am getting the following warning message:



Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = 1:3) :
  invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, ri, value = 1:2) :
  invalid factor level, NA generated


In the resulting output, the Answer Code column comes out wrong (NA for every factor level):



              Variable Name Variable Label Variable Type Answer Code Answer Label
hhid                   hhid           hhid    Open ended                character
month                 month          month    Open ended                  integer
year                   year           year    Open ended                  integer
org_unit           org_unit       org_unit    Open ended                character
v000                   v000           v000    Open ended                character
v001                   v001           v001    Open ended                  integer
v002                   v002           v002    Open ended                  integer
v003                   v003           v003    Open ended                  integer
v005                   v005           v005    Open ended                  integer
v006                   v006           v006    Open ended                  integer
v007                   v007           v007    Open ended                  integer
v021                   v021           v021    Open ended                  numeric
2285                   v024           v024       central        <NA>       factor
1                                                  north        <NA>
7119                                               south        <NA>
11                     v025           v025         rural        <NA>       factor
1048                   v025           v025         urban        <NA>       factor
district_name district_name  district_name    Open ended                character
coords_x1         coords_x1      coords_x1    Open ended                  numeric
coords_x2         coords_x2      coords_x2    Open ended                  numeric
itn_color         itn_color      itn_color    Open ended                  numeric
piped                 piped          piped    Open ended                  numeric
sanit                 sanit          sanit    Open ended                  numeric
sanit_cd           sanit_cd       sanit_cd    Open ended                  numeric
water                 water          water    Open ended                  numeric

Comments

can you show how you've tried to answer this question so far? You could start writing some code ... (Otherwise, this is either "find an off-site resource" (off-topic) or "write code for me" (off-topic) ...)

I basically have a DataFrame (it can be any dataframe, doesn't matter) and I applied codebook to that df. But the output isn't what I want.

I'm sorry, I read too quickly and didn't see that you'd mentioned memisc::codebook in the original version of your question. Nevertheless, I'm afraid that (if you can't make more headway on your own) this question may not be suitable for SO, since you basically want a customized/very specific output.

It doesn't help that codebookout is not even a base Stata command: it's an extension (that outputs an Excel file). OP, you need to include sample data and output to allow for equivalent R code to be offered.

You don't need to post your own data to create a good reproducible example (see link) ...

Recommended answers

I decided to take a crack at this for my own amusement, using the built-in Titanic data set. I had an issue with one of your definitions, though: you say "If there are no unique values, it is considered open ended". But every variable of length > 0 has some unique values: did you mean "if every value is unique"? Even this definition doesn't necessarily work as expected: in the Titanic data set, the responses (Freq) are integer-valued, and there happen to be only 22 unique values out of 32 total values. I didn't think that one would really want this to be enumerated, so I tested whether the column is a factor instead (but you could swap in the commented-out length(u)==length(x) test below if you really want).
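
The counts mentioned above can be checked directly on the built-in data set. This is only a quick sanity check of the reasoning, using base R functions:

```r
xx <- as.data.frame.table(Titanic)   # built-in data set, one row per cell of the 4-way table

length(xx$Freq)            # 32 values in total
length(unique(xx$Freq))    # 22 distinct values, so "every value is unique" does not hold
is.factor(xx$Freq)         # FALSE: open ended under the factor test
sapply(xx, is.factor)      # TRUE for Class, Sex, Age and Survived
```

With the uniqueness rule, Freq would be enumerated as 22 "categories"; with the factor rule it collapses to a single open-ended row.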



## utility function: pad vector with blanks to specified length
pad <- function(x,n,p="") {
  return(c(x,rep(p,n-length(x))))
}
## process a single column
proc_col <- function(x,nm) {
  u <- unique(x)
  ## if (length(u)==length(x)) {
  if (!is.factor(x)) {
    n <- 1
    u <- "open ended"
    cc <- ""
  } else {
    cc <- as.numeric(u)
    n <- length(u)
  }
  dd <- data.frame(`Variable Name`=pad(nm,n),
                   `Variable Label`=pad(nm,n),
                   `Answer Label`=u,
                   `Answer Code`=cc,
                   `Variable Type`=pad(class(x),n),
                   stringsAsFactors=FALSE)
  return(dd)
}
## process all columns
proc_df <- function(x) {
  L <- Map(proc_col,x,names(x))
  dd <- do.call(rbind,L)
  rownames(dd) <- NULL
  return(dd)
}


Example:



xx <- as.data.frame.table(Titanic)
proc_df(xx)

##    Variable.Name Variable.Label Answer.Label Answer.Code Variable.Type
## 1          Class          Class          1st           1        factor
## 2                                        2nd           2
## 3                                        3rd           3
## 4                                       Crew           4
## 5            Sex            Sex         Male           1        factor
## 6                                     Female           2
## 7            Age            Age        Child           1        factor
## 8                                      Adult           2
## 9       Survived       Survived           No           1        factor
## 10                                       Yes           2
## 11          Freq           Freq   open ended                   numeric


I didn't leave blank spaces before the lists of code values etc., but you can make those adjustments yourself ...
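
Since codebookout writes its result to a file, one remaining step is getting the data frame out of R. The sketch below uses only base R and writes a CSV that Excel can open; the small `cb` data frame and the `out_file` path are made up for illustration. A real .xlsx workbook would need an add-on package (for example writexl or openxlsx), which is not shown here.

```r
# Small hand-built codebook in the same layout as the output above (illustrative values)
cb <- data.frame(
  Variable.Name  = c("Sex", "", "Freq"),
  Variable.Label = c("Sex", "", "Freq"),
  Answer.Label   = c("Male", "Female", "open ended"),
  Answer.Code    = c("1", "2", ""),
  Variable.Type  = c("factor", "", "numeric"),
  stringsAsFactors = FALSE
)

out_file <- file.path(tempdir(), "codebook.csv")   # hypothetical output path
write.csv(cb, out_file, row.names = FALSE)

# Read it back as text to confirm the layout survived the round trip
cb2 <- read.csv(out_file, colClasses = "character")
```

In practice `cb` would be replaced by the result of `proc_df()` on the real data.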



Here is my crack at a solution:



CreateCodebook <- function(dF){
  numbercols <- length(colnames(dF))

  table <- data.frame()

  for (i in 1:length(colnames(dF))){
    # factors get one row per level; every other column is a single "Open ended" row
    AnswerCode <- if (sapply(dF, is.factor)[i]) 1:nrow(unique(dF[i])) else ""
    AnswerLabel <- if (sapply(dF, is.factor)[i]) unique(dF[order(dF[i]),][i]) else "Open ended"
    # name, label and type appear on the first row only; the rest is padded with blanks
    VariableName <- if (length(AnswerCode) > 1) c(colnames(dF)[i],
                      rep("",length(AnswerCode) - 1)) else colnames(dF)[i]
    VariableLabel <- if (length(AnswerCode) > 1) c(colnames(dF)[i],
                      rep("",length(AnswerCode) - 1)) else colnames(dF)[i]
    VariableType <- if (length(AnswerCode) > 1) c(sapply(dF, class)[i],
                      rep("",length(AnswerCode) - 1)) else sapply(dF, class)[i]

    # stringsAsFactors = FALSE keeps the text columns as character rather than factor
    df = data.frame(VariableName, VariableLabel, AnswerLabel, AnswerCode, VariableType, stringsAsFactors = FALSE)
    # note: the third and fifth names are swapped relative to the column order above,
    # so the header shows "Variable Type" over the answer labels and "Answer Label" over the types
    names(df) <- c("Variable Name", "Variable Label", "Variable Type", "Answer Code", "Answer Label")
    table <- rbind(table, df)

  }
  rownames(table) <- 1:nrow(table)
  return(table)
}


Output:



   Variable Name Variable Label Variable Type Answer Code Answer Label
1           brid           brid    Open ended                character
2          month          month    Open ended                  integer
3           year           year    Open ended                  integer
4       org_unit       org_unit    Open ended                character
5           v000           v000    Open ended                character
6           v001           v001    Open ended                  integer
7           v002           v002    Open ended                  integer
8           v003           v003    Open ended                  integer
9           v005           v005    Open ended                  integer
10          v006           v006    Open ended                  integer
11          v007           v007    Open ended                  integer
12          v021           v021    Open ended                  numeric
13          v024           v024       central           1       factor
14                                      north           2
15                                      south           3
16          v025           v025         rural           1       factor
17                                      urban           2
18          bidx           bidx    Open ended                  integer
19 district_name  district_name    Open ended                character
20     coords_x1      coords_x1    Open ended                  numeric
21     coords_x2      coords_x2    Open ended                  numeric
22          anc4           anc4    Open ended                  numeric
23    antimal_48     antimal_48    Open ended                  numeric
24         carep          carep    Open ended                  numeric
25          csec           csec    Open ended                  numeric
26          dptv           dptv    Open ended                  numeric
27       ebreast        ebreast    Open ended                  numeric
28       fans_48        fans_48    Open ended                  numeric
29        ideliv         ideliv    Open ended                  numeric
30          iptp           iptp    Open ended                  numeric
31        iron90         iron90    Open ended                  numeric
32      measlesv       measlesv    Open ended                  numeric
33           ors            ors    Open ended                  numeric
34           ort            ort    Open ended                  numeric
35         pncwm          pncwm    Open ended                  numeric
36       sstools        sstools    Open ended                  numeric
37            tt             tt    Open ended                  numeric
38          vita           vita    Open ended                  numeric
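
For readers wondering about the "invalid factor level, NA generated" warning from the earlier attempt: R raises it when a value that is not among a factor's levels is assigned into that factor. One plausible reading (an assumption, since the original data is not available) is that under the old `data.frame()` default of `stringsAsFactors = TRUE`, used before R 4.0, the text columns became factors and `rbind()` then tried to push new values into them. A minimal reproduction of the mechanism on made-up values:

```r
# What a text column looks like once it has been turned into a factor
f <- factor(c("Open ended", "Open ended"))

# Capture the warning raised when a non-level value is assigned into the factor
msg <- tryCatch({ f[2] <- "1"; "no warning" },
                warning = function(w) conditionMessage(w))

# Let the assignment go through (warning silenced) to see the NA it leaves behind
suppressWarnings(f[2] <- "1")
is.na(f[2])

# The same assignment on a plain character vector just works
x <- c("Open ended", "Open ended")
x[2] <- "1"
```

Keeping the columns as character, as `stringsAsFactors = FALSE` does, sidesteps the problem because any string can be stored in a character vector.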

Comments

Thank you so much Ben! I will definitely upvote this and accept this as the answer. For my own benefit, I have also come up with my own attempt at a solution. I am very close, but I'm getting a warning message.
