
r - Why is "character is often preferred to factor" for a key in data.table?

Reposted. Author: 行者123. Updated: 2023-12-04 11:25:52

From the data.table manual:

In fact we like it so much that data.table contains a counting sort algorithm for character vectors using R’s internal global string cache. This is particularly fast for character vectors containing many duplicates, such as grouped data in a key column. This means that character is often preferred to factor. Factors are still fully supported, in particular ordered factors (where the levels are not in alphabetic order).


Isn't a factor just an integer vector, which should be easier to counting sort than a character vector?

Best answer

Isn't factor just integer which should be easier to do counting sort than character?



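Yes, if you are given the factor already. But the time to create that factor can be significant, and that is what setkey (and ad hoc by) aim to beat. Try timing factor() on a randomly ordered character vector, say 1e6 long with 1e4 levels. Then compare that to setkey or ad hoc by on the original randomly ordered character vector.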
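The comparison suggested above can be sketched roughly as follows. This is only an illustration under assumptions: the "id" string prefix, the dummy column v, and the variable names are made up for the example, and timings will vary with machine and with R and data.table versions, so no expected numbers are shown.

```r
library(data.table)

set.seed(1)
n_rows   <- 1e6   # vector length suggested in the answer
n_levels <- 1e4   # number of distinct values suggested in the answer

# Randomly ordered character vector with many duplicates.
# The "id" prefix is an arbitrary choice for this sketch.
x <- sample(sprintf("id%05d", seq_len(n_levels)), n_rows, replace = TRUE)

# Cost of building the factor first.
t_factor <- system.time(f <- factor(x))

# Cost of keying directly on the character column.
dt_key <- data.table(x = x, v = 1L)
t_setkey <- system.time(setkey(dt_key, x))

# Cost of an ad hoc by on an unkeyed character column.
dt_by <- data.table(x = x, v = 1L)
t_by <- system.time(res <- dt_by[, .(total = sum(v)), by = x])

# Elapsed seconds for each approach.
print(c(factor = t_factor[["elapsed"]],
        setkey = t_setkey[["elapsed"]],
        by     = t_by[["elapsed"]]))
```

If factor() turns out to dominate on your machine, that is the cost the answer refers to: the integer codes only help once someone has already paid to create them.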

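agstudy's comment is correct too; that is, character vectors (being pointers to R's cached strings) are quite similar to factors anyway. On 32-bit systems a character vector is the same size as the factor's integer vector, but the factor also has the levels attribute to store (and sometimes copy). On 64-bit systems the pointers are twice as big. On the other hand, R's string cache can be looked up directly from the character vector's pointers, whereas the factor needs an extra hop via levels. (The levels attribute is itself a character vector of R string cache pointers.)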
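The structural similarity can be inspected from base R. The snippet below is a minimal sketch with made-up data (the "id" prefix and sizes are assumptions for illustration); the sizes reported by object.size() depend on the platform, so none are quoted here.

```r
set.seed(1)
x <- sample(sprintf("id%05d", seq_len(1e4)), 1e6, replace = TRUE)
f <- factor(x)

typeof(x)          # storage type of the character vector
typeof(f)          # storage type of the factor's codes
class(levels(f))   # the levels attribute is itself a character vector

# Approximate memory use; compare the two on your own platform.
print(object.size(x))
print(object.size(f))
print(object.size(levels(f)))

# Recovering the strings from a factor goes through levels (the extra hop),
# whereas x already holds the strings directly.
recovered <- levels(f)[as.integer(f)]
identical(recovered, x)
```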

Regarding r - Why is "character is often preferred to factor" for a key in data.table?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/18304760/
