Python bcolz: how to merge two ctables

Reposted by 太空狗. Updated: 2023-10-29 18:08:12

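I have been playing with bcolz, following the in-memory compression example in this notebook.

So far I am really amazed by this library. I think it is a great tool for anyone who needs to load larger files into a smaller amount of memory (nice work, Francesc, if you are reading this!).

I was wondering whether anyone has experience joining two ctables the way pandas.merge() does, and how to do it memory-efficiently.

Thanks for sharing your thoughts :-)!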

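Best answer

I figured it out just in time. Many thanks to @mdurant for the itertoolz suggestion! Here is some pseudocode, since the example I was actually working with is very ugly.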

# generic pandas merge, for comparison (df1 and df2 are existing DataFrames)
import pandas as pd
df_new = pd.merge(df1, df2)

# the same idea with itertoolz and bcolz
from toolz.itertoolz import join as joinz
import bcolz

# convert the DataFrames to ctables
zdf1 = bcolz.ctable.fromdataframe(df1)
zdf2 = bcolz.ctable.fromdataframe(df2)

# join on the second column of df1 (index 1) and the first column of df2 (index 0);
# joinz yields (row_from_zdf1, row_from_zdf2) pairs
merged = list(joinz(1, zdf1.iter(), 0, zdf2.iter()))

# new_dtypes must list the dtypes of all output fields, left columns first;
# in my case: new_dtypes = '|S8,|S8,|S8,|S8,|S8'
# a[0] + a[1] concatenates each pair of row tuples into one flat row
zdf3 = bcolz.fromiter((a[0] + a[1] for a in merged), dtype=new_dtypes, count=len(merged))

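Obviously there are probably smarter ways to do this, and this example is not very specific, but it works and can serve as a base for others to build on.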
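To make the shape of the join output concrete, here is a minimal sketch that uses plain tuples in place of ctable rows. The sample rows are made up for illustration; only `toolz.itertoolz.join` comes from the answer above. It shows that the join yields (left_row, right_row) pairs, that rows without a matching key are dropped, and that adding the two tuples gives one flat row, which is what the generator passed to bcolz.fromiter relies on.

```python
from toolz.itertoolz import join as joinz

# Hypothetical sample rows standing in for ctable rows (not from the original post).
left_rows = [("a", 1), ("b", 2), ("c", 3)]    # join key is at index 1
right_rows = [(1, "x"), (3, "y"), (3, "z")]   # join key is at index 0

# joinz yields (left_row, right_row) pairs for every matching key (inner join).
pairs = list(joinz(1, left_rows, 0, right_rows))

# Tuple concatenation flattens each pair into a single output row.
# Sorting only makes the printed order independent of the join's internals.
flat_rows = sorted(left + right for left, right in pairs)

for row in flat_rows:
    print(row)
```

The left row with key 2 has no partner on the right, so it does not appear in the result, just as with the default inner join of pandas.merge().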

Edit with a usage example (Oct 21, 7 PM EST)

# download the MovieLens data files from http://grouplens.org/datasets/movielens/
# I'm using the 1M dataset
import os
from time import time

import bcolz
import pandas as pd
from toolz.itertoolz import join as joinz

t0 = time()
dset = '/Path/To/Your/Data/'
udata = os.path.join(dset, 'users.dat')
# users.dat in the 1M dataset is laid out as UserID::Gender::Age::Occupation::Zip-code
u_cols = ['user_id', 'sex', 'age', 'occupation', 'zip_code']
# a multi-character separator such as '::' needs the Python parser engine
users = pd.read_csv(udata, sep='::', names=u_cols, engine='python')

rdata = os.path.join(dset, 'ratings.dat')
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv(rdata, sep='::', names=r_cols, engine='python')

print("Time for parsing the data: %.2f" % (time() - t0,))
#Time for parsing the data: 4.72

t0 = time()
users_ratings = pd.merge(users, ratings)
print("Time for merging the data: %.2f" % (time() - t0,))
#Time for merging the data: 0.14

t0 = time()
zratings = bcolz.ctable.fromdataframe(ratings)
zusers = bcolz.ctable.fromdataframe(users)
print("Time for ctable conversion: %.2f" % (time() - t0,))
#Time for ctable conversion: 0.05

# output dtype string in column order: all user fields, then all rating fields
# (taken from .names so the order does not depend on dict ordering)
new_dtypes = ','.join([zusers.dtype[n].str for n in zusers.names] +
                      [zratings.dtype[n].str for n in zratings.names])

# Do the merge with a list stored intermediately
t0 = time()
merged = list(joinz(0, zusers.iter(), 0, zratings.iter()))
zuser_zrating1 = bcolz.fromiter((a[0] + a[1] for a in merged), dtype=new_dtypes, count=len(merged))
print("Time for intermediate list bcolz merge: %.2f" % (time() - t0,))
#Time for intermediate list bcolz merge: 3.16

# Do the merge ONLY using iterators to limit memory consumption
t0 = time()
# first pass only counts the joined rows so that fromiter knows the final length
n_rows = sum(1 for _ in joinz(0, zusers.iter(), 0, zratings.iter()))
# second pass streams the joined rows straight into the new ctable
zuser_zrating2 = bcolz.fromiter(
    (a[0] + a[1] for a in joinz(0, zusers.iter(), 0, zratings.iter())),
    dtype=new_dtypes, count=n_rows)
print("Time for 2x iters of merged bcolz: %.2f" % (time() - t0,))
#Time for 2x iters of merged bcolz: 3.31

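As you can see, the version I put together is more than 15 times slower than pandas (3.16 s and 3.31 s versus 0.14 s for the merge step in the run above), but the iterator-only variant never stores the intermediate list of joined rows, which should save a lot of memory. Feel free to comment and/or expand on this. bcolz looks like a great package to build on.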
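The memory trade-off can be illustrated without bcolz. The sketch below is a stdlib-only hash join written as a generator. `hash_join` and the sample rows are hypothetical names introduced for illustration, and the function only approximates how a join such as the one in toolz behaves; it is not that library's implementation. One side is indexed in a dictionary, the other side is streamed, and no list of joined rows is ever held in memory. Knowing the row count up front then costs a second pass over the join, which matches the two-iteration timing above.

```python
from collections import defaultdict
from operator import itemgetter


def hash_join(left_key, left_rows, right_key, right_rows):
    """Lazily yield flat joined rows (inner join) from two iterables of tuples."""
    get_left, get_right = itemgetter(left_key), itemgetter(right_key)
    # Index the left side once; this is the only part held fully in memory.
    index = defaultdict(list)
    for row in left_rows:
        index[get_left(row)].append(row)
    # Stream the right side and emit one flat row per match.
    for row in right_rows:
        for match in index.get(get_right(row), ()):
            yield match + row


# Hypothetical sample data: (user_id, sex) and (user_id, movie_id, rating).
users = [(1, "F"), (2, "M")]
ratings = [(1, 1193, 5), (2, 661, 3), (1, 914, 3), (9, 1, 1)]

# First pass counts the rows, second pass consumes them; neither builds a list
# of joined rows, at the cost of running the join twice.
n_rows = sum(1 for _ in hash_join(0, users, 0, ratings))
joined = hash_join(0, iter(users), 0, iter(ratings))
first = next(joined)
print(n_rows, first)
```

Putting the smaller table on the indexed side keeps the dictionary small; in the MovieLens example that is the users table, which is also the side passed first to joinz.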

This post is based on a similar question about merging two ctables with Python bcolz, found on Stack Overflow: https://stackoverflow.com/questions/25741898/
