
r - How to read only certain lines of a data file into R

Reposted. Author: 行者123. Updated: 2023-12-03 21:48:10

I have a large data file with more than 40,000 lines. It is a list of log entries that looks something like this:

D 20160602 14:15:43.559 F7982D62 Req Agr:131 Mra:0 Exp:0 Mxr:0 Mnr:0 Mxd:0 Mnd:0 Nro:0
D 20160602 14:15:43.559 F7982D62 Set Agr:130 Mra:0 Exp:0 Mxr:0 Mnr:0 Mxd:0 Mnd:0 Nro:0 I 20160602 14:15:43.559 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" ""
M 20160602 14:15:43.595 DOC1: F7982D62 Request for unencrypted meta data on encrypted transaction
M 20160602 14:15:48.353 DOC1: F7982D62 Transaction has been acknowledged at 722875647
F 20160602 14:15:48.398 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" "" 50725464 (4,32) "Remote Application: Session Aborted: Aborted by user interrupt"
M 20160602 14:15:48.780 DOC1: F7982D63 New download request D 20160602 14:15:48.780 F7982D63 META: 134 Path: /pcgc/public/CTD/exome/fastq/PCGC0033175_HS_EX__1-00304-01__v1_FCBC0RE4ACXX_L3_p32of96_P2.fastq.gz user: xqixh8sl pack: arg: feat: cE,s

Because the file is so large, I don't want to read the whole thing into memory. I only need the lines that begin with the line identifier "F" and have a (0,0) error code, like this one:
F 20160602 14:25:11.321 F7982D50 GET 156.145.15.85:37525 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0077248_HS_EX__1-06808__v3_FCC49HJACXX_L7_p1of1_P1.fastq.gz" "" 3322771022 (0,0) "1499.61 seconds (17.7 megabits/sec)"

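Everything else I can ignore. My question is: I want a way to read this file line by line and decide, for each line, whether it needs to be kept for import. At the moment I am using a for loop over every line together with the readLines() function. It looks like this: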
library(stringr)
con <- file("dataSet.txt", open = "r")
Fdata <- data.frame
i <- 1
j <- 1
lineLength <- length(readLines(con))
for (i in 1:lineLength){
line <- readLines("dataSet.txt", 1)
if (str_sub(line, 1, 1) == 'F' && grepl("\\(0\\,0\\)", line)[i]){
print(line)
Fdata[j,] <- rbind(line)
i <- i + 1
j <- j + 1
}
i <- i + 1
}
print(Fdata)

It runs without errors, but the output is not what I want. It just prints the first line of the file over and over:
[1] "C 20160525 05:27:47.915 Rotated log file: /var/log/servedat-201605250527.log"
[1] "C 20160525 05:27:47.915 Rotated log file: /var/log/servedat-201605250527.log"
[1] "C 20160525 05:27:47.915 Rotated log file: /var/log/servedat-201605250527.log"
[1] "C 20160525 05:27:47.915 Rotated log file: /var/log/servedat-201605250527.log"

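How can I make it evaluate whether I need each line, and how do I store the line properly (as a vector, data frame, or matrix, it doesn't really matter) so that I can print it outside the for loop?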

Update

I have changed the code to:
library(stringr)
con <- file("dataSet.txt", open = "r")
Fdata <- data.frame
i <- 1
j <- 1
lineLength <- length(readLines(con))
for (i in 1:lineLength){
line <- readLines(con, 1)
print(line)
if (str_sub(line, 1, 1) == 'F' && grepl("\\(0\\,0\\)", line)[i]){
print(line)
Fdata[j,] <- rbind(line)
i <- i + 1
j <- j + 1
}
i <- i + 1
}
print(Fdata)

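However, when I check the value stored in line, it says it is empty. I don't understand why that changed. It also tells me that the if statement does not have a proper TRUE/FALSE condition, which confuses me as well, because grepl() should return a TRUE/FALSE value.

Update

I managed to get rid of that error, but I still get nothing when I call Fdata. I checked my variables, and R says that line is empty, with no characters. Am I assigning it incorrectly? I want line to be the line of the data file I am currently parsing, so that I can evaluate whether it needs to be stored. Here is my updated code: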

library(stringr)
con <- file("dataSet.txt", open = "r")
Fdata <- data.frame
i <- 1
j <- 1
lineLength <- length(readLines("dataSet.txt"))
for (i in 1:lineLength){
line <- readLines(con, 1)
print(line)
if (str_sub(line, 1, 1) == 'F' && grepl("\\(0\\,0\\)", line)){
print(line)
Fdata[j,] <- rbind(line)
i <- i + 1
j <- j + 1
}
i <- i + 1
}
print(Fdata)

Best answer

Take a look at this:

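con <- file("test1.txt", "r")    # open once so successive reads advance through the file
lines <- c()
while (TRUE) {
  line <- readLines(con, 1)      # read the next single line from the connection
  if (length(line) == 0) break   # character(0) signals the end of the file
  # keep lines that start with "F" (optionally after whitespace) and contain the literal "(0,0)"
  if (grepl("^\\s*F{1}", line) && grepl("(0,0)", line, fixed = TRUE)) lines <- c(lines, line)
}
close(con)                       # release the file handle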

lines
# [1] "F 20160602 14:25:11.321 F7982D50 GET 156.145.15.85:37525 xqixh8sl AES \"/pcgc/public/Other/exome/fastq/PCGC0077248_HS_EX__1-06808__v3_FCC49HJACXX_L7_p1of1_P1.fastq.gz\" \"\" 3322771022 (0,0) \"1499.61 seconds (17.7 megabits/sec)\""

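Pass the file connection to readLines so that it can read line by line. The regular expression ^\\s*F{1} matches lines that start with the letter F, possibly preceded by whitespace, where ^ denotes the start of the string. Use fixed = TRUE to match (0,0) literally rather than as a regular expression. If both checks are TRUE, the line is appended to lines.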
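As for why the question's loops misbehaved, the following is a minimal sketch using a hypothetical two-line temporary file (not the asker's data). It illustrates three things: calling readLines() with a file name reopens the file on every call and so always returns the first line; calling readLines(con) to count lines reads the connection to its end, so a later readLines(con, 1) returns character(0); and indexing the single value returned by grepl() with an index greater than 1 yields NA, which is most likely the source of the TRUE/FALSE complaint in the first update.

```r
# Hypothetical two-line log written to a temporary file for illustration.
path <- tempfile(fileext = ".txt")
writeLines(c("C first line", "F second line (0,0)"), path)

# 1. Passing a file name: each call opens the file afresh, so both
#    calls return the first line.
by_name_1 <- readLines(path, 1)
by_name_2 <- readLines(path, 1)

# 2. Counting lines through the connection consumes it entirely.
con <- file(path, open = "r")
n_lines <- length(readLines(con))   # reads to the end of the connection
after_count <- readLines(con, 1)    # nothing left to read
close(con)

# 3. A fresh connection advances one line per call.
con <- file(path, open = "r")
first_line <- readLines(con, 1)
second_line <- readLines(con, 1)
close(con)

# 4. grepl() on one string returns one value; element 2 does not exist.
second_element <- grepl("F", "F second line (0,0)")[2]

print(identical(by_name_1, by_name_2))
print(length(after_count))
print(second_element)
```

This is why the answer's loop opens the connection once, never counts the lines in advance, and stops when length(line) == 0.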

Data:
D 20160602 14:15:43.559 F7982D62 Req Agr:131 Mra:0 Exp:0 Mxr:0 Mnr:0 Mxd:0 Mnd:0 Nro:0      
D 20160602 14:15:43.559 F7982D62 Set Agr:130 Mra:0 Exp:0 Mxr:0 Mnr:0 Mxd:0 Mnd:0 Nro:0 I 20160602 14:15:43.559 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" ""
M 20160602 14:15:43.595 DOC1: F7982D62 Request for unencrypted meta data on encrypted transaction
M 20160602 14:15:48.353 DOC1: F7982D62 Transaction has been acknowledged at 722875647
F 20160602 14:15:48.398 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" "" 50725464 (4,32) "Remote Application: Session Aborted: Aborted by user interrupt"
M 20160602 14:15:48.780 DOC1: F7982D63 New download request D 20160602 14:15:48.780 F7982D63 META: 134 Path: /pcgc/public/CTD/exome/fastq/PCGC0033175_HS_EX__1-00304-01__v1_FCBC0RE4ACXX_L3_p32of96_P2.fastq.gz user: xqixh8sl pack: arg: feat: cE,s
F 20160602 14:25:11.321 F7982D50 GET 156.145.15.85:37525 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0077248_HS_EX__1-06808__v3_FCC49HJACXX_L7_p1of1_P1.fastq.gz" "" 3322771022 (0,0) "1499.61 seconds (17.7 megabits/sec)"
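Growing a vector one line at a time works, but the same filter can also be applied to blocks of lines. The sketch below is one possible variation rather than part of the original answer: read_matching_lines and its chunk_size parameter are hypothetical names, and the demo lines are shortened, invented stand-ins for the log entries above. It relies on grepl() being vectorised over a character vector.

```r
# Hypothetical helper: keep lines that start with "F" (optionally after
# whitespace) and contain the literal text "(0,0)", reading in chunks.
read_matching_lines <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))                  # close the connection even on error
  kept <- list()
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0) break      # end of file reached
    keep <- grepl("^\\s*F", chunk) & grepl("(0,0)", chunk, fixed = TRUE)
    kept[[length(kept) + 1L]] <- chunk[keep]
  }
  as.character(unlist(kept))           # character(0) if nothing matched
}

# Invented, shortened demo lines written to a temporary file.
demo_path <- tempfile(fileext = ".txt")
writeLines(c(
  "M 20160602 14:15:43.595 DOC1: example message",
  "F 20160602 14:15:48.398 example (4,32) \"aborted\"",
  "F 20160602 14:25:11.321 example (0,0) \"ok\"",
  "  F 20160602 14:30:00.000 indented example (0,0) \"ok\""
), demo_path)

result <- read_matching_lines(demo_path, chunk_size = 2L)
print(result)
```

Only one chunk of lines is held in memory at a time, in addition to the matches kept so far. For a file of roughly 40,000 lines, reading everything with readLines() and applying the same two grepl() calls would probably also be fine; chunking matters more as files grow.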

Regarding r - How to read only certain lines of a data file into R, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/37923041/
