
Reading .dat and .dct directly from R

Reposted. Author: 行者123. Updated: 2023-12-04 13:06:06

I need to read a .dat file using a .dct file. Has anyone done this in R?

The format is:

dictionary {
# how many lines per record
_lines(1)
# start defining the first line
_line(1)

# starting column / storage type / variable name / read format / variable label
_column(1) str8 aid %8s "respondent identifier"
...
}

The "read format" looks like this:
%2f      2-column integer variable
%12s     12-column string variable
%8.2f    8-column number with 2 implied decimal places
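
To make the read formats concrete, here is a minimal sketch that splits a format string into its width, implied decimals, and type letter. The helper name `parse_read_format` is hypothetical (it is not part of R or Stata), and it assumes formats have the simple shape `%<width>[.<decimals>]<letter>` shown above.

```r
# Hypothetical helper: split a Stata-style read format such as "%8.2f"
# into total width, implied decimals, and type letter.
# Assumes the simple shape %<width>[.<decimals>]<letter>.
parse_read_format <- function(fmt) {
  pattern <- "^%([0-9]+)(\\.([0-9]+))?([a-z])$"
  stopifnot(all(grepl(pattern, fmt)))
  decimals <- sub(pattern, "\\3", fmt)  # "" when no decimals are given
  data.frame(
    format   = fmt,
    width    = as.integer(sub(pattern, "\\1", fmt)),
    decimals = ifelse(decimals == "", 0L, as.integer(decimals)),
    type     = sub(pattern, "\\4", fmt),
    stringsAsFactors = FALSE
  )
}

formats <- parse_read_format(c("%2f", "%12s", "%8.2f"))
formats
```

The `width` column is what a fixed-width reader such as `read.fwf` would need; the `decimals` column is only relevant for numeric formats.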

The storage types are described here: http://www.stata.com/help.cgi?datatypes

Other sites with background information:

http://library.columbia.edu/indiv/dssc/technology/stata_write.html

http://www.stata.com/support/faqs/data-management/reading-fixed-format-data/

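The .dat file is a string of digits corresponding to the variables specified in the .dct file. (Presumably the data are in fixed-width columns.)

Here is a real example:

.dct file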
http://goo.gl/qHZOk

Data
http://goo.gl/FRGRF

A concrete example from the Stata site:
.dat file (in this case "test.raw")
C1245A101George Costanza
B1223B011Cosmo Kramer
.dct file
dictionary using test2.raw {
_column(1) str5 code %5s
_column(2) int call %4f
_column(6) str1 city %1s
_column(7) int neigh %3f
_column(10) str16 name %16s
}

Resulting data file:
     +-----------------------------------------------+
     | code    call   city   neigh   name            |
     |-----------------------------------------------|
  1. | C1245   1245   A        101   George Costanza |
  2. | B1223   1223   B         11   Cosmo Kramer    |
     +-----------------------------------------------+

Best answer

@thelatemail is spot on about how to proceed. Here is a small function I cobbled together to get you started on a more robust solution:

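read.dat.dct <- function(dat, dct) {
  temp <- readLines(dct)
  # Capture groups: 1 = start column, 2 = storage type, 3 = variable name, 4 = width
  pattern <- "_column\\(([0-9]+)\\)\\s+([a-z0-9]+)\\s+([a-z0-9_]+)\\s+%([0-9]+).*"
  classes <- c("numeric", "character", "character", "numeric")
  metadata <- setNames(lapply(1:4, function(x) {
    # Keep only capture group x from each _column line
    out <- gsub(pattern, paste("\\", x, sep = ""), temp)
    # Trim whitespace, blank out the "dictionary ... {" and "}" lines, drop empty entries
    out <- gsub("^\\s+|\\s+$|.*\\{|\\}", "", out)
    out <- out[out != ""]
    class(out) <- classes[x] ; out }),
    c("StartPos", "Str", "ColName", "ColWidth"))
  # Read the fixed-width file using the widths and names taken from the dictionary
  read.fwf(dat, widths = metadata[["ColWidth"]],
           col.names = metadata[["ColName"]])
}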

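You still have a lot of work to do on error checking, generalizing the function, and so on. For example, this function does not work with overlapping columns, as in the example @thelatemail added to your question. A check that "StartPos[n] + ColWidth[n]" equals "StartPos[n+1]" could be used to stop reading the file and raise an error message when the columns do not line up. The classes of the resulting data could also be extracted from the "metadata" list generated by the function and assigned in read.fwf using the colClasses argument.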
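
As a rough sketch of those two suggestions, the helpers below check that consecutive columns are contiguous and map Stata storage types to R classes. Both function names (`check_contiguous`, `stata_to_r_class`) are hypothetical, the type mapping is an assumption rather than an official correspondence, and the `metadata` list is written out by hand to match the shape the function above builds.

```r
# Hypothetical helper: stop unless StartPos[n] + ColWidth[n] == StartPos[n + 1]
# for every pair of consecutive columns.
check_contiguous <- function(metadata) {
  start <- metadata[["StartPos"]]
  width <- metadata[["ColWidth"]]
  n <- length(start)
  if (n < 2) return(invisible(TRUE))
  expected <- start[-n] + width[-n]
  bad <- which(expected != start[-1])
  if (length(bad) > 0) {
    stop("Columns overlap or leave a gap after: ",
         paste(metadata[["ColName"]][bad], collapse = ", "))
  }
  invisible(TRUE)
}

# Hypothetical mapping (an assumption): strN -> character,
# byte/int/long -> integer, anything else (float, double) -> numeric.
stata_to_r_class <- function(storage) {
  ifelse(grepl("^str[0-9]+$", storage), "character",
         ifelse(storage %in% c("byte", "int", "long"), "integer", "numeric"))
}

# Hand-written metadata in the same shape as the list built inside read.dat.dct
metadata <- list(
  StartPos = c(1, 2, 6, 7, 10),
  Str      = c("str1", "int", "str1", "int", "str16"),
  ColName  = c("code", "call", "city", "neigh", "name"),
  ColWidth = c(1, 4, 1, 3, 16)
)

check_contiguous(metadata)                        # passes silently
col_classes <- stata_to_r_class(metadata[["Str"]])
col_classes
```

The resulting `col_classes` vector could then be passed to `read.fwf` through `colClasses`, which it forwards to `read.table`.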

Here are a dat file and a dct file to demonstrate:

Copy and paste the following two lines into a text editor and save them as "test.dat" in your working directory.
C1245A101George Costanza
B1223B011Cosmo Kramer

Copy and paste the following lines into a text editor and save them as "test.dct" in your working directory.
dictionary using test.dat {
_column(1) str1 code %1s
_column(2) int call %4f
_column(6) str1 city %1s
_column(7) int neigh %3f
_column(10) str16 name %16s
}

Now, run the function:
read.dat.dct(dat = "test.dat", dct = "test.dct")
#   code call city neigh            name
# 1    C 1245    A   101 George Costanza
# 2    B 1223    B    11    Cosmo Kramer

Update: an improved function (still with plenty of room for improvement)
read.dat.dct <- function(dat, dct, labels.included = "no") {
  temp <- readLines(dct)
  # Keep only the lines that define a column
  temp <- temp[grepl("_column", temp)]
  switch(labels.included,
         yes = {
           # Five capture groups: start, storage type, name, width, label
           pattern <- "_column\\(([0-9]+)\\)\\s+([a-z0-9]+)\\s+(.*)\\s+%([0-9]+)[a-z]\\s+(.*)"
           classes <- c("numeric", "character", "character", "numeric", "character")
           N <- 5
           NAMES <- c("StartPos", "Str", "ColName", "ColWidth", "ColLabel")
         },
         no = {
           # Four capture groups: start, storage type, name, width
           pattern <- "_column\\(([0-9]+)\\)\\s+([a-z0-9]+)\\s+(.*)\\s+%([0-9]+).*"
           classes <- c("numeric", "character", "character", "numeric")
           N <- 4
           NAMES <- c("StartPos", "Str", "ColName", "ColWidth")
         })
  metadata <- setNames(lapply(1:N, function(x) {
    out <- gsub(pattern, paste("\\", x, sep = ""), temp)
    out <- gsub("^\\s+|\\s+$", "", out)
    # Strip the double quotes around variable labels
    out <- gsub('\"', "", out, fixed = TRUE)
    class(out) <- classes[x] ; out }), NAMES)

  # Turn the variable names into syntactically valid R names
  metadata[["ColName"]] <- make.names(gsub("\\s", "", metadata[["ColName"]]))

  myDF <- read.fwf(dat, widths = metadata[["ColWidth"]],
                   col.names = metadata[["ColName"]])
  if (labels.included == "yes") {
    # Store the variable labels as an attribute of the data frame
    attr(myDF, "col.label") <- metadata[["ColLabel"]]
  }
  myDF
}

How does it do with your data?
temp <- read.dat.dct(dat = "http://dl.getdropbox.com/u/18116710/21600-0009-Data.txt",
                     dct = "http://dl.getdropbox.com/u/18116710/21600-0009-Setup.dct",
                     labels.included = "yes")
dim(temp) # How big is the dataset?
# [1] 180  40
head(temp[, 1:6]) # What do the first few columns & rows look like?
#   CASEID      AID RRELNO RPREGNO H3PC1.H3PC1 H3PC2.H3PC2
# 1      1 57118381      5       1           1           1
# 2      2 57134970      1       2           1           1
# 3      3 57135078      1       1           1           1
# 4      4 57135078      5       1           1           1
# 5      5 57164981      1       1           7           3
# 6      6 57191909      1       3           1           1
head(attr(temp, "col.label")) # What are the variable labels?
# [1] "CASE IDENTIFICATION NUMBER"       "RESPONDENT IDENTIFIER"
# [3] "ROMANTIC RELATIONSHIP NUMBER"     "RELATIONSHIP PREGNANCY NUMBER"
# [5] "S23Q1 1 TOLD PARTNER PREGNANT-W3" "S23Q2 MONTHS PREG WHEN TOLD PARTNER-W3"

What about the original example?
read.dat.dct("test.dat", "test.dct", labels.included = "no")
#   code call city neigh            name
# 1    C 1245    A   101 George Costanza
# 2    B 1223    B    11    Cosmo Kramer
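
Neither version handles the overlapping columns in the Stata example from the question, because read.fwf only takes widths. One possible workaround, sketched here under the assumption that each field is fully described by its start column and width, is to slice every line with `substring`. The function `read_by_position` and its `is_numeric` argument are hypothetical names for illustration only.

```r
# Hypothetical alternative reader: cut each field out of every line by start
# position and width, so columns may overlap or leave gaps.
read_by_position <- function(lines, start, width, col.names,
                             is_numeric = rep(FALSE, length(start))) {
  cols <- lapply(seq_along(start), function(i) {
    field <- trimws(substring(lines, start[i], start[i] + width[i] - 1))
    if (is_numeric[i]) as.numeric(field) else field
  })
  names(cols) <- col.names
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# The two records and the overlapping layout from the Stata example
lines <- c("C1245A101George Costanza", "B1223B011Cosmo Kramer")
overlap_df <- read_by_position(
  lines,
  start      = c(1, 2, 6, 7, 10),
  width      = c(5, 4, 1, 3, 16),
  col.names  = c("code", "call", "city", "neigh", "name"),
  is_numeric = c(FALSE, TRUE, FALSE, TRUE, FALSE)
)
overlap_df
```

In practice `lines` would come from `readLines(dat)`, and the start, width, and type vectors from the parsed dictionary.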

For more on reading .dat and .dct directly from R, see the similar question on Stack Overflow: https://stackoverflow.com/questions/14224321/
