
r - Memory problems loading a large dataset with bigmemory in R


I have a large text file (> 10 million lines, > 1 GB) that I want to process one line at a time, to avoid loading the whole file into memory. After processing each line I want to save some variables into a big.matrix object. Here is a simplified example:

library(bigmemory)
library(pryr)

con <- file('x.csv', open = "r")
x <- big.matrix(nrow = 5, ncol = 1, type = 'integer')

for (i in 1:5){
  print(c(address(x), refs(x)))            # memory address and reference count of x
  y <- readLines(con, n = 1, warn = FALSE) # read one line
  x[i] <- 2L*as.integer(y)                 # process it and store it in the big.matrix
}

close(con)

where x.csv contains

4
18
2
14
16

Following the advice here http://adv-r.had.co.nz/memory.html I have printed the memory address of the big.matrix object, and it appears to change with every loop iteration:

[1] "0x101e854d8" "2"          
[1] "0x101d8f750" "2"
[1] "0x102380d80" "2"
[1] "0x105a8ff20" "2"
[1] "0x105ae0d88" "2"
  1. Can big.matrix objects be modified in place?

  2. Is there a better way to load, process and then save these data? The current method is slow!

Best Answer

  1. is there a better way to load, process and then save these data? The current method is slow!

The slowest part of your approach appears to be the call that reads each line individually. We can "chunk" the data, i.e. read in several lines at a time, so that we stay within the memory limit while potentially speeding things up.

The plan is:

  1. Figure out how many lines the file has
  2. Read in a chunk of those lines
  3. Perform some operation on that chunk
  4. Push that chunk back out to a new file to save for later

    library(readr)

    # Make a test file: 100,000 rows x 10 columns
    # (the rnorm() vector is recycled to fill the matrix)
    x <- data.frame(matrix(rnorm(10000), 100000, 10))

    write_csv(x, "./test_set2.csv")

    # Create a function that reads one variable from a file and doubles it
    calcDouble <- function(calc.file, outputFile = "./outPut_File.csv",
                           read.size = 500000, variable = "X1") {
      # Set up variables
      num.lines <- 0
      lines.per <- NULL
      i <- 1L

      # Gather the column names and the position of the target column
      connection.names <- file(calc.file, open = "r")
      data.names <- read.table(connection.names, sep = ",", header = TRUE, nrows = 1)
      close(connection.names)
      col.name <- which(colnames(data.names) == variable)

      # Find the length of the file, counting read.size lines at a time
      connection.len <- file(calc.file, open = "r")
      while ((linesread <- length(readLines(connection.len, read.size))) > 0) {
        lines.per[i] <- linesread
        num.lines <- num.lines + linesread
        i <- i + 1L
      }
      close(connection.len)

      # Loop through the file chunk by chunk and double the chosen variable
      connection.double <- file(calc.file, open = "r")
      for (j in 1:length(lines.per)) {
        # skip = 1 on the first chunk keeps read.table from choking on the header
        if (j == 1) {
          data <- read.table(connection.double, sep = ",", header = FALSE,
                             skip = 1, nrows = lines.per[j], comment.char = "")
        } else {
          data <- read.table(connection.double, sep = ",", header = FALSE,
                             nrows = lines.per[j], comment.char = "")
        }
        # Grab the column we need and double it
        double <- data[, I(col.name)] * 2
        # Write the first chunk with a header, then append the rest
        if (j != 1) {
          write_csv(data.frame(double), outputFile, append = TRUE)
        } else {
          write_csv(data.frame(double), outputFile)
        }

        message(paste0("Reading from Chunk: ", j, " of ", length(lines.per)))
      }
      close(connection.double)
    }

    calcDouble("./test_set2.csv", read.size = 50000, variable = "X1")

So we end up with a .csv file of the processed data. You can change double <- data[,I(col.name)] * 2 to whatever operation you need to perform on each chunk.
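To tie this back to the big.matrix example from the question, here is a minimal sketch (my own illustration, not part of the answer above) of the same chunking idea applied to the five-line x.csv: lines are read a few at a time and each processed chunk is assigned into the big.matrix in one go. The chunk size and the single readLines() pass used to count rows are assumptions chosen for this tiny file; for a > 1 GB file you would count the lines chunk by chunk, as calcDouble() does.

    # Sketch (illustrative): chunked fill of a big.matrix
    library(bigmemory)

    chunk.size <- 2                          # read a couple of lines per chunk
    n <- length(readLines("x.csv"))          # one pass just to count the rows
    x <- big.matrix(nrow = n, ncol = 1, type = 'integer')

    con <- file("x.csv", open = "r")
    row.start <- 1
    repeat {
      y <- readLines(con, n = chunk.size, warn = FALSE)
      if (length(y) == 0) break              # end of file
      vals <- 2L * as.integer(y)             # process the whole chunk at once
      x[row.start:(row.start + length(vals) - 1), 1] <- vals
      row.start <- row.start + length(vals)
    }
    close(con)

Reading and assigning a whole chunk at a time avoids the per-line readLines() call and per-element assignment from the question, which is where most of the time was going.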

Regarding r - Memory problems loading a large dataset with bigmemory in R, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/31407452/
