
r - fread() run time is much longer than the reported time

Reposted. Author: 行者123. Updated: 2023-12-02 02:59:17

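I am trying to read a large file into R on an EC2 instance. However, after some of the data has been read, the actual run time is far longer than the time fread reports.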

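For example, below is the verbose=TRUE output from fread when I read in only the first row of data from my csv file. As you can see, the reported run time (about 26 seconds) is much shorter than the actual elapsed time (about 9.7 minutes). Do you know why this happens? Is there any way to speed up the process so that it is closer to the run time fread reports once the data has been read in?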

> start_time <- Sys.time()
> fread(file_name_1, nrows=1, verbose=TRUE)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 68.770914 GB.
Memory mapping ... ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... ','
Detected 55 columns. Longest stretch was from line 1 to line 30
Starting data input on line 1 (either column names or first row of data). First 10 characters: bank_num,b
All the fields on line 1 are character fields. Treating as the column names.
nrow set to nrows passed in (1)
Type codes (point 0): 1114434134111034444411333333333333333333333333333311111
Type codes: 1114434134111034444411333333333333333333333333333311111 (after applying colClasses and integer64)
Type codes: 1114434134111034444411333333333333333333333333333311111 (after applying drop or select (if supplied)
Allocating 55 column slots (55 - 0 dropped)
Read 1 rows and 55 (of 55) columns from 68.771 GB file in 00:00:27
Read 1 rows. Exactly what was estimated and allocated up front
26.480s (100%) Memory map (rerun may be quicker)
0.000s ( 0%) sep and header detection
0.000s ( 0%) Count rows (wc -l)
0.000s ( 0%) Column type detection (100 rows at 10 points)
0.000s ( 0%) Allocation of 1x55 result (xMB) in RAM
0.000s ( 0%) Reading data
0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered
0.000s ( 0%) Coercing data already read in type bumps (if any)
0.000s ( 0%) Changing na.strings to NA
26.480s Total
> end_time <- Sys.time()
> end_time - start_time
Time difference of 9.695263 mins

Best answer

Please always state the version number; e.g. the output of sessionInfo(). But I can tell that you are probably using the CRAN version.
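As a minimal sketch of what "state the version number" could look like in practice, the snippet below collects the R version and the installed data.table version before printing the full sessionInfo(). The variable names `r_version` and `dt_version` are illustrative only and do not come from the original post.

```r
# Collect the version details worth pasting into a question.
# `r_version` and `dt_version` are illustrative names, not from the post.
r_version <- R.version.string

dt_version <- if (requireNamespace("data.table", quietly = TRUE)) {
  as.character(utils::packageVersion("data.table"))
} else {
  NA_character_  # data.table is not installed in this library
}

cat("R:", r_version, "\n")
cat("data.table:", dt_version, "\n")

# sessionInfo() adds platform, locale and attached packages.
print(sessionInfo())
```

Pasting this output alongside the verbose=TRUE log lets readers see at once whether the CRAN release or a development build is in use.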

Please always check NEWS before asking on Stack Overflow.

Item 3 (among many other fread improvements):

Memory maps lazily; e.g. reading just the first 10 rows with nrow=10 is 12s down to 0.01s from cold for a 9GB file. Large files close to your RAM limit may work more reliably too. The progress meter will commence sooner and more consistently.

The latest version from dev can easily be tried using this install command. You wrote EC2, so presumably Linux, but any Windows users can use the Windows.zip from dev, with no tools needed.

Since you have a 68GB csv, you will certainly benefit a great deal from data.table v1.10.5+. Please update here with how you get on.
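One way to confirm that an upgrade took effect, and to compare wall-clock time before and after, is sketched below. `dt_at_least` and `time_first_rows` are hypothetical helper names introduced here, and the small temporary CSV only stands in for the real 68GB file. The sketch assumes nothing about data.table beyond `fread(path, nrows = n)`, which the question already uses.

```r
# Hypothetical helper: is the installed data.table at least `minimum`?
dt_at_least <- function(minimum = "1.10.5") {
  if (!requireNamespace("data.table", quietly = TRUE)) return(FALSE)
  utils::packageVersion("data.table") >= package_version(minimum)
}

# Hypothetical helper: wall-clock seconds to read the first `n` rows of `path`.
time_first_rows <- function(path, n = 1L) {
  system.time(data.table::fread(path, nrows = n))[["elapsed"]]
}

# Small stand-in file; replace `tmp` with the real path (file_name_1 in the
# question) to reproduce the measurement on the large csv.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(bank_num = 1:1000, value = rnorm(1000)),
          tmp, row.names = FALSE)

if (dt_at_least("1.10.5")) {
  cat("Seconds to read 1 row:", time_first_rows(tmp), "\n")
} else {
  cat("data.table is missing or older than 1.10.5; upgrade before timing.\n")
}

unlink(tmp)
```

Wrapping the call in system.time() reports elapsed wall-clock time directly, which is the same figure the question derives from two Sys.time() calls.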

Regarding "r - fread() run time is much longer than the reported time", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47317285/
