r - Incrementally growing an ffdf data frame on disk

Reposted by: 行者123 · Updated: 2023-12-02 03:25:05

From the documentation of save.ffdf:

Using ‘save.ffdf’ automagically sets the ‘finalizer’s of the ‘ff’ vectors to ‘"close"’. This means that the data will be preserved on disk when the object is removed or the R sessions is closed. Data can be deleted either using ‘delete’ or by removing the directory where the object were saved (‘dir’).

I want to start with a small ffdf data frame, add a little new data at a time, and grow it on disk. So I ran a small experiment:

# in R
library(ff)
library(ffbase)
ffiris = as.ffdf(iris)
save.ffdf(ffiris, dir = "~/Desktop/iris")

# in bash
ls ~/Desktop/iris/
## ffiris$Petal.Length.ff ffiris$Petal.Width.ff  ffiris$Sepal.Length.ff ffiris$Sepal.Width.ff  ffiris$Species.ff

# in R
# add a new column
ffiris = transform(ffiris, new1 = rep(99, nrow(iris)))
rm(ffiris)

# in bash
ls ~/Desktop/iris/
## ffiris$Petal.Length.ff ffiris$Petal.Width.ff  ffiris$Sepal.Length.ff ffiris$Sepal.Width.ff  ffiris$Species.ff

It turns out that removing ffiris does not automatically update the ff data on disk. What about saving manually?

# in R
# add a new column
ffiris = transform(ffiris, new1 = rep(99, nrow(iris)))
save.ffdf(ffiris, "~/Desktop/iris")

# in bash
ls ~/Desktop/iris/
## ffiris$Petal.Length.ff ffiris$Petal.Width.ff  ffiris$Sepal.Length.ff ffiris$Sepal.Width.ff  ffiris$Species.ff

Hmm, still no luck. Why?

What about deleting the folder before saving?

# in R
ffiris = as.ffdf(iris)
unlink("~/Desktop/iris", recursive = TRUE, force = TRUE)
save.ffdf(ffiris, "~/Desktop/iris", overwrite = TRUE)
ffiris = transform(ffiris, new1 = rep(99, nrow(iris)))
unlink("~/Desktop/iris", recursive = TRUE, force = TRUE)
save.ffdf(ffiris, "~/Desktop/iris", overwrite = TRUE)

# in bash
ls ~/Desktop/iris/
# ls: /Users/ky/Desktop/iris/: No such file or directory

Even stranger. And even if all of this worked, it would still be very inefficient. I am looking for something like:

updateOnDisk(ffiris)

Can anyone help?

Best answer

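ff and ffbase provide out-of-memory R vectors (the data lives on disk rather than in RAM), but they introduce reference semantics, which can cause problems with R idioms.

R is a functional programming language, which means that functions do not change their arguments or other objects but return modified copies instead. In ffbase we implemented the functions the R way, so transform returns a copy of the original ffdf data.frame. You can see this by looking at the file names: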

ffiris = as.ffdf(iris)
save.ffdf(ffiris, dir = "~/Desktop/iris")
filename(ffiris) # show contents of ~/Desktop/iris

ffiris = transform(ffiris, new1 = 99) # this creates a copy of the whole data.frame!
filename(ffiris)  

ffiris$new2 <- ff(rep(99, nrow(iris)))  # this creates a new column, but not yet in the right directory
filename(ffiris)

save.ffdf(ffiris, dir="~/Desktop/iris", overwrite=TRUE) # this fixes that.

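transform is currently inefficient for adding a new column because it copies the whole data frame (that is, R semantics). The reason is that a transformation may be a temporary result, in which case you would not want the original data to change.

We are addressing this in ffbase2.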
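
The difference between copy semantics and reference semantics that this answer relies on can be seen without ff at all. The following is a minimal base R sketch, not ff code: the names df, df2, e1 and e2 are made up for illustration, and an environment is used only as a stand-in for an object with reference semantics.

```r
# Value (copy) semantics: transform() returns a modified copy and
# leaves its argument untouched.
df  <- data.frame(x = 1:3)
df2 <- transform(df, y = x * 2)
names(df)    # still only "x"
names(df2)   # "x" and "y"

# Reference semantics: an environment is modified in place, so every
# name bound to the same environment sees the change.
e1 <- new.env()
e2 <- e1
assign("y", 99, envir = e2)
exists("y", envir = e1, inherits = FALSE)   # TRUE
```

An ffdf mixes the two: the R object follows the usual copy rules, while the files behind it behave like shared state. That is why the result of transform points at new files, and why an explicit save.ffdf call is needed to bring those files into the target directory.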

Regarding "r - Incrementally growing an ffdf data frame on disk", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/30834967/
