gpt4 book ai didi

r - 合并 data.tables 使用超过 10 GB RAM

转载 作者:行者123 更新时间:2023-12-04 10:12:52 25 4
gpt4 key购买 nike

我有两个数据表:DTmeta .当我使用 DT[meta] 合并它们时, 内存使用量增加了 10 GB 以上(合并非常慢)。怎么了?好像合并成功了,但我只能看单行,否则内存不足。 DT它本身是通过合并两个 data.tables 创建的,没有任何问题。

编辑:

key 好像有问题。我可以毫无问题地执行以下操作:

DT[,id:=1:nrow(DT)]
meta[,id:=1:nrow(DT)]
setkey(DT,id)
setkey(meta,id)

DT2<-DT[meta] # Comment from Matthew Dowle:
# X[Y] (or merge) on a key of 1:nrow(DT) is just a cbind, isn't it?

unique(DT2[,"Moor_ID",with=F]==DT2[,"Moor_ID.1",with=F])
Moor_ID
[1,] TRUE

第一个数据表:
str(DT)
Classes ‘data.table’ and 'data.frame': 10212 obs. of 55 variables:
$ DWD_ID : chr "Bremerhav" "Bremerhav" "Bremerhav" "Bremerhav" ...
$ numdays : int 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 ...
$ days : Date, format: "2009-09-01" "2009-09-02" "2009-09-03" "2009-09-04" ...
$ TBoden_dayAnzahl : int 0 0 0 0 0 0 0 0 0 0 ...
$ TBoden_dayMin : num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayMax : num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayMeanAR : num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayStabw : num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayMedian : num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayMeanMM : num NA NA NA NA NA NA NA NA NA NA ...
$ T2m_dayAnzahl : int 0 0 0 0 0 0 0 0 0 0 ...
$ T2m_dayMin : num 15.6 13.8 13.7 12.8 13.5 13.1 13.3 13.8 15.9 13.7 ...
$ T2m_dayMax : num 25.6 19.9 18.1 18.1 16.9 18.6 21 25.7 19.3 17.6 ...
$ T2m_dayMeanAR : num 19 16.9 15.6 15.2 14.8 ...
$ T2m_dayStabw : num 3.409 2.048 1.334 1.726 0.965 ...
$ T2m_dayMedian : num 17.2 16.8 15.2 14.8 14.5 ...
$ T2m_dayMeanMM : num 20.6 16.9 15.9 15.4 15.2 ...
$ T10cm_dayAnzahl : int 0 0 0 0 0 0 0 0 0 0 ...
$ T10cm_dayMin : num 14.3 12.6 12.9 12.2 12.7 12 12.8 11.7 15.1 12.2 ...
$ T10cm_dayMax : num 27.7 20.9 18.7 18.7 17.4 19.8 22.4 25.9 21.8 18.6 ...
$ T10cm_dayMeanAR : num 18.7 16.5 14.9 15.1 14.5 ...
$ T10cm_dayStabw : num 4.36 2.84 1.73 2.36 1.54 ...
$ T10cm_dayMedian : num 16.1 15.6 14.3 14.2 14 ...
$ T10cm_dayMeanMM : num 21 16.8 15.8 15.4 15.1 ...
$ RF_dayAnzahl : int 0 0 0 0 0 0 0 0 0 0 ...
$ RF_dayMin : num 45 58 73 56 68 62 63 44 65 58 ...
$ RF_dayMax : num 94 94 94 93 94 92 84 84 89 84 ...
$ RF_dayMean : num 68.6 76.3 78.9 74.4 86.5 ...
$ RF_dayStabw : num 17.09 12.53 5.88 9.83 5.62 ...
$ RF_dayMedian : num 64.5 74 77.5 76 87.5 77.5 75 63 77 76 ...
$ Luftdruck_dayMean : num 100.8 101 99.7 99.9 101.1 ...
$ es_day : num 2.53 1.95 1.82 1.78 1.74 ...
$ ea_day : num 1.57 1.42 1.49 1.27 1.38 ...
$ defi_day : num 0.956 0.535 0.327 0.509 0.355 ...
$ Nebel_dayAnteil : num 0 0 0 0 0 0 0 0 0 0 ...
$ Sonnenscheind_dayAnzahl: int 18 18 18 18 18 18 18 18 18 18 ...
$ Sonnenscheind_daySum : num 6.63 4.93 1.05 5.82 3.27 ...
$ julian_day : int 244 245 246 247 248 249 250 251 252 253 ...
$ zeta_day : num 2.81 2.82 2.84 2.86 2.88 ...
$ maxSonnenscheind : num 13.9 13.8 13.7 13.6 13.5 ...
$ R0_day : num 2920 2890 2860 2830 2799 ...
$ Globalstrahlung_dayMean: num NA NA NA NA NA NA NA NA NA NA ...
$ RG_day : num 13.24 11.19 6.64 12.02 9.03 ...
$ lambdaET_day : num 2.45 2.46 2.46 2.46 2.47 ...
$ sAnstieg_day : num 0.15 0.122 0.116 0.113 0.111 ...
$ gamma_day : num 0.067 0.0669 0.0659 0.0661 0.0668 ...
$ ETp_TW_day : num 2.71 2.15 1.28 2.24 1.68 ...
$ Moor_ID : chr "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" ...
$ Distanz_in_km : num 24 24 24 24 24 ...
$ North : num 53.5 53.5 53.5 53.5 53.5 ...
$ East : num 8.58 8.58 8.58 8.58 8.58 ...
$ Hoehe_in_m : num 7 7 7 7 7 7 7 7 7 7 ...
$ Kueste_km : num 20 20 20 20 20 ...
$ peatland : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ diffmaxt2m : num -1.6 0 -0.2 0.1 -0.4 ...
- attr(*, "sorted")= chr "Moor_ID"
- attr(*, ".internal.selfref")=<externalptr>

第二个数据表:
str(meta)
Classes ‘data.table’ and 'data.frame': 10212 obs. of 6 variables:
$ Moor_ID : chr "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" ...
$ Hoehe_Moor : num 2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 ...
$ Kueste_km : num 15.7 15.7 15.7 15.7 15.7 ...
$ WSPsommer_muGOK: num 0.699 0.699 0.699 0.699 0.699 ...
$ WSPwinter_muGOK: num 0.446 0.446 0.446 0.446 0.446 ...
$ Moorgroesse_km2: num 59 59 59 59 59 59 59 59 59 59 ...
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, "sorted")= chr "Moor_ID"

session 信息:
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252

attached base packages:
[1] grDevices datasets splines graphics stats tcltk utils methods base

other attached packages:
[1] reshape_0.8.4 plyr_1.7.1 data.table_1.8.0 svSocket_0.9-53 TinnR_1.0-5 R2HTML_2.2 Hmisc_3.9-3
[8] survival_2.36-14

loaded via a namespace (and not attached):
[1] cluster_1.14.2 grid_2.15.1 lattice_0.20-6 svMisc_0.9-65 tools_2.15.1

最佳答案

我的错。问题是键不是唯一的:

a<-data.table(x=c(1,1),y=c(1,2))
b<-data.table(x=c(1,1),y=c(3,4))
setkey(a,x)
setkey(b,x)
a[b]
x y y.1
[1,] 1 1 3
[2,] 1 2 3
[3,] 1 1 4
[4,] 1 2 4

如果 data.table 可以对此发出警告,那就太好了。

马修更新

此警告现已在 v1.8.7 中实现:

New argument allow.cartesian ( default FALSE) added to X[Y] and merge(X,Y), #2464. Prevents large allocations due to misspecified joins; e.g., duplicate key values in Y joining to the same group in X over and over again. The word cartesian is used loosely for when more than max(nrow(X),nrow(Y)) rows would be returned. The error message is verbose and includes advice.

关于r - 合并 data.tables 使用超过 10 GB RAM,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/11610562/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com