r - Is there a faster way than fread() to read big data?


Hi. First of all, I have already searched Stack Overflow and Google and found posts such as Quickly reading very large tables as dataframes. While those are helpful and well answered, I am looking for more information.

I am looking for the best way to read/import "big" data of up to 50-60 GB.
I am currently using the fread() function from data.table, which is the fastest function I know of so far. The PC/server I work on has a good CPU (a workstation) and 32 GB of RAM, but data over 10 GB, sometimes approaching billions of observations, still takes a long time to read.
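For context, a minimal sketch of the kind of call I currently use ("data.txt" stands in for the real, much larger files):

library(data.table)

# nThread defaults to getDTthreads(); written out here to make the
# thread count explicit
DT <- fread("data.txt", nThread = getDTthreads())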

We already have SQL databases, but for certain reasons we have to work with the big data in R.
Is there a way to speed up R, or a better option than fread(), when it comes to large files like this?

Thank you.

Edit: fread("data.txt", verbose = TRUE)

omp_get_max_threads() = 2
omp_get_thread_limit() = 2147483647
DTthreads = 0
RestoreAfterFork = true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 2 threads (omp_get_max_threads()=2, nth=2)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as integer
[02] Opening the file
Opening file C://somefolder/data.txt
File opened, size = 1.083GB (1163081280 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<ID,Dat,No,MX,NOM_TX>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep automatically ...
sep=',' with 100 lines of 5 fields using quote rule 0
Detected 5 columns on line 1. This line is either column names or first data row. Line starts as: <<ID,Dat,No,MX,NOM_TX>>
Quote rule picked = 0
fill=false and the most number of columns found is 5
[07] Detect column types, good nrow estimate and whether first row is column names
Number of sampling jump points = 100 because (1163081278 bytes from row 1 to eof) / (2 * 5778 jump0size) == 100647
Type codes (jump 000) : 5A5AA Quote rule 0
Type codes (jump 100) : 5A5AA Quote rule 0
'header' determined to be true due to column 1 containing a string on row 1 and a lower type (int32) in the rest of the 10054 sample rows
=====
Sampled 10054 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 2 to the end of last row: 1163081249
Line length: mean=56.72 sd=20.65 min=25 max=128
Estimated number of rows: 1163081249 / 56.72 = 20506811
Initial alloc = 41013622 rows (20506811 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 5A5AA
[10] Allocate memory for the datatable
Allocating 5 column slots (5 - 0 dropped) with 41013622 rows
[11] Read the data
jumps=[0..1110), chunk_size=1047820, total_size=1163081249
|--------------------------------------------------|
|==================================================|
Read 20935277 rows x 5 columns from 1.083GB (1163081280 bytes) file in 00:31.484 wall clock time
[12] Finalizing the datatable
Type counts:
2 : int32 '5'
3 : string 'A'
=============================
0.007s ( 0%) Memory map 1.083GB file
0.739s ( 2%) sep=',' ncol=5 and header detection
0.001s ( 0%) Column type detection using 10054 sample rows
1.809s ( 6%) Allocation of 41013622 rows x 5 cols (1.222GB) of which 20935277 ( 51%) rows used
28.928s ( 92%) Reading 1110 chunks (0 swept) of 0.999MB (each chunk 18860 rows) using 2 threads
+ 26.253s ( 83%) Parse to row-major thread buffers (grown 0 times)
+ 2.639s ( 8%) Transpose
+ 0.035s ( 0%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
31.484s Total

Best Answer

Assuming you want to read the file fully into R, using a database or selecting a subset of columns/rows won't help much.

What can help in such cases is:
- make sure you are using a recent version of data.table
- make sure the optimal number of threads is set: use setDTthreads(0L) to use all available threads; by default data.table uses 50% of the available threads (see the sketch after this list)
- check the output of fread(..., verbose=TRUE), and possibly add it to your question here
- put your file on a fast disk, or a RAM disk, and read it from there
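A minimal sketch putting the two data.table-related points above together ("data.txt" is a placeholder):

# update first if needed: install.packages("data.table")
library(data.table)

getDTthreads(verbose = TRUE)   # report how many threads data.table will use
setDTthreads(0L)               # 0L means: use all available threads

DT <- fread("data.txt", verbose = TRUE)   # verbose output reports threads and timings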

If your data has many distinct character variables, you may not get great speed, because populating R's internal global character cache is single-threaded; the parsing itself can go fast, but creating the character vectors becomes the bottleneck.
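If you do not actually need every column, one way to sidestep that bottleneck is to not read the character columns at all. A sketch, assuming the column types shown in the verbose log above (ID and No are int32; Dat, MX and NOM_TX are strings):

library(data.table)
setDTthreads(0L)

# Read only the two integer columns; the three string columns are
# skipped entirely, so the single-threaded character cache is never hit.
DT <- fread("data.txt", select = c("ID", "No"))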

Regarding r - Is there a faster way than fread() to read big data?, the original question can be found on Stack Overflow: https://stackoverflow.com/questions/56396770/
