r - How to use the R "readLines" command to read selected rows from a large file and write them to a data frame?


Reposted by: 行者123. Updated: 2023-12-04 19:08:12


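I do data cleaning. I have a function that identifies the bad rows in a large input file (too large to read in one go, given my memory size) and returns their row numbers as a vector, badRows. This function seems to work.

I am now trying to read just the bad rows into a data frame, so far without success.

My current approach is to call read.table on an open connection to my file, using a vector that gives the number of rows to skip before each row that is read. For consecutive bad rows, this number is zero.

I compute skipVec as: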

(badRowNumbers - c(0, badRowNumbers[1:(length(badRowNumbers) - 1)])) - 1
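As a small worked example of this gap calculation, assuming badRowNumbers is sorted in ascending order, the same quantity can be written with diff. The row numbers below are made up for illustration. Unlike the 1:(length - 1) indexing, this form also gives the right answer when there is only one bad row.

```r
# Hypothetical bad-row positions, sorted ascending (illustration only).
badRowNumbers <- c(2, 3, 7)

# Gap before each bad row: rows to skip since the previous bad row
# (or since the start of the data for the first one).
skipVec <- diff(c(0, badRowNumbers)) - 1
skipVec
# skip 1 row before row 2, 0 rows before row 3, 3 rows before row 7

# A single bad row also works: 4 rows are skipped before row 5.
singleSkip <- diff(c(0, 5)) - 1
```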

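For the moment, though, I am just passing my function a skipVec of all zeros.

If my logic is right, this should return all the rows. It does not. Instead, I get an error:

"Error in read.table(con, skip = pass, nrow = 1, header = TRUE, sep = "") : no lines available in input"

My current function is loosely based on a function by Miron Kursa ("mbq"), which I found here.

My question is somewhat of a duplicate of that one, but I assume his function works, so I must have broken it somehow. I am still trying to understand the difference between opening a file and opening a connection to a file, and I suspect the problem lies somewhere there, or in my use of lapply.

I am running R 3.0.1 under RStudio 0.97.551 on a cranky old Windows XP SP3 machine with 3 GB of RAM. Stone Age, I know.

Here is the code that produces the error message above: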

# Make a small test data frame, write it to a file, and read it back in
# a row at a time.
testThis.DF <- data.frame(nnn=c(2,3,5), fff=c("aa", "bb", "cc"))  
testThis.DF 

# This function will work only if the number of bad rows is not too big for memory
write.table(testThis.DF, "testThis.DF")
con<-file("testThis.DF")
open(con)
skipVec <- c(0,0,0)
badRows.DF  <- lapply(skipVec, FUN=function(pass){
  read.table(con, skip=pass, nrow=1, header=TRUE, sep="") })
close(con)

The error occurs before the close command. If I take the read.table call out of lapply and the function and run it on its own, I still get the same error.

Best answer

If, instead of running read.table through lapply, you run the first few iterations by hand, you will see what is happening:

> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
  nnn fff
1   2  aa
> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
  X2 X3 bb
1  3  5 cc

Because header = TRUE, each iteration reads not one line but two, so you run out of lines sooner than you expect, on the third iteration:

> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
Error in read.table(con, skip = 0, nrow = 1, header = TRUE, sep = "") : 
  no lines available in input
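These calls keep consuming lines because con is an open connection, and an open connection remembers its read position between calls. A minimal sketch of the difference between reading by file name and reading from an open connection (the temporary file and the variable names are only for illustration):

```r
tmp <- tempfile()
writeLines(c("line 1", "line 2", "line 3"), tmp)

# Reading by file name opens and closes the file on every call,
# so each call starts again from the top.
by_name_1 <- readLines(tmp, n = 1)
by_name_2 <- readLines(tmp, n = 1)

# Reading from an already open connection continues where the
# previous read stopped.
con <- file(tmp, open = "r")
by_con_1 <- readLines(con, n = 1)
by_con_2 <- readLines(con, n = 1)
close(con)
unlink(tmp)
```

Here both by_name values hold the first line, while by_con_2 holds the second line. The same mechanism is why every read.table call on con starts from wherever the previous call left off.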

This may still not be a very efficient way to solve your problem, but here is how to fix your current code:

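write.table(testThis.DF, "testThis.DF")
con <- file("testThis.DF")
open(con)
# Read the header line once, so that later reads can use header = FALSE.
header <- scan(con, what = character(), nlines = 1, quiet = TRUE)
# Here skipVec holds one flag per data row: 0 = keep the row, 1 = discard it.
skipVec <- c(0,1,0)
badRows <- lapply(skipVec, function(pass){
  # Each call consumes exactly one line; column 1 holds the row names.
  line <- read.table(con, nrow = 1, header = FALSE, sep = "",
                     row.names = 1)
  if (pass) NULL else line
})
# Stack the kept rows and restore the column names.
badRows.DF <- setNames(do.call(rbind, badRows), header)
close(con)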
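If you would rather keep skipVec in the sense used in the question (the number of rows to skip before each row that is read), a sketch along the same lines might look like the following. read_rows_by_gap is a hypothetical helper name, and the sketch assumes the file was written by write.table with default settings, so that the first field of every data line is the row name.

```r
# Hypothetical helper (not from the original answer): read one row per
# element of skipVec, skipping skipVec[i] data rows before the i-th read.
read_rows_by_gap <- function(path, skipVec) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Consume the header once; every later read uses header = FALSE.
  header <- scan(con, what = character(), nlines = 1, quiet = TRUE)
  rows <- lapply(skipVec, function(gap) {
    # skip discards `gap` lines, then exactly one line is read.
    read.table(con, skip = gap, nrow = 1, header = FALSE, sep = "",
               row.names = 1)
  })
  setNames(do.call(rbind, rows), header)
}

# Usage on a small temporary file: skip 1 row, read row 2, then read row 3.
tmp <- tempfile()
write.table(data.frame(nnn = c(2, 3, 5), fff = c("aa", "bb", "cc")), tmp)
out <- read_rows_by_gap(tmp, c(1, 0))
unlink(tmp)
```

With header = FALSE each call consumes only the skipped lines plus the one line it reads, so the gaps line up with the row numbers.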

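Some pointers for getting more speed:

Use scan instead of read.table. Read the data as character, and only at the end, once the data is in a character matrix or data.frame, apply type.convert to each column.

Instead of looping over skipVec, loop over its rle if that is shorter. That way you can read or skip whole blocks of lines at a time.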
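The two pointers above can be combined roughly as follows. This is only a sketch under stated assumptions: keep is a hypothetical logical vector with one flag per data row, the file comes from write.table with default settings (so each data line has one more field than the header, the row name), and every line has the same number of fields.

```r
tmp <- tempfile()
write.table(data.frame(nnn = c(2, 3, 5, 7, 11),
                       fff = c("aa", "bb", "cc", "dd", "ee")), tmp)

keep <- c(TRUE, TRUE, FALSE, FALSE, TRUE)  # hypothetical per-row flags
runs <- rle(keep)                          # 3 runs instead of 5 rows

con <- file(tmp, open = "r")
header <- scan(con, what = character(), nlines = 1, quiet = TRUE)
n_fields <- length(header) + 1L            # extra field is the row name
chunks <- Map(function(len, wanted) {
  if (!wanted) {
    readLines(con, n = len)                # discard the whole run at once
    return(NULL)
  }
  # Read the whole run as character fields, one matrix row per line.
  fields <- scan(con, what = character(), nlines = len, quiet = TRUE)
  matrix(fields, ncol = n_fields, byrow = TRUE)
}, runs$lengths, runs$values)
close(con)
unlink(tmp)

mat <- do.call(rbind, chunks)              # rbind ignores the NULL chunks
badRows.DF <- as.data.frame(mat[, -1, drop = FALSE],
                            stringsAsFactors = FALSE)
names(badRows.DF) <- header
rownames(badRows.DF) <- mat[, 1]
# Convert column types only once, at the very end.
badRows.DF[] <- lapply(badRows.DF, type.convert, as.is = TRUE)
```

Everything stays character until the final type.convert step, and the number of read calls depends on the number of runs in keep rather than on the number of rows.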

A similar question on Stack Overflow: https://stackoverflow.com/questions/19204917/
