python - How to create a pivot table on extremely large dataframes in Pandas

Reposted by: 太空狗 | Updated: 2023-10-29 17:07:05

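I need to create a pivot table of 2000 columns by around 30-50 million rows from a dataset of around 60 million rows. I've tried pivoting in chunks of 100,000 rows, and that works, but when I try to recombine the DataFrames by doing a .append() followed by .groupby('someKey').sum(), all my memory is taken up and Python eventually crashes.

How can I pivot data this large with a limited amount of RAM?

EDIT: adding sample code

The following code includes various test outputs along the way, but the last print is what we're really interested in. Note that if we change segMax to 3 instead of 4, the code produces a false positive: the output looks correct. The main issue is that if a shipmentid entry is not in each and every chunk that sum(wawa) looks at, it doesn't show up in the output.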

import pandas as pd
import numpy as np
import random
from pandas.io.pytables import *
import os

pd.set_option('io.hdf.default_format','table') 

# create a small dataframe to simulate the real data.
def loadFrame():
    frame = pd.DataFrame()
    frame['shipmentid']=[1,2,3,1,2,3,1,2,3] #evenly distributing shipmentid values for testing purposes
    frame['qty']= np.random.randint(1,5,9) #random quantity is ok for this test
    frame['catid'] = np.random.randint(1,5,9) #random category is ok for this test
    return frame

def pivotSegment(segmentNumber,passedFrame):
    segmentSize = 3 #take 3 rows at a time
    frame = passedFrame[(segmentNumber*segmentSize):(segmentNumber*segmentSize + segmentSize)] #slice the input DF

    # ensure that all chunks are identically formatted after the pivot by appending a dummy DF with all possible category values
    span = pd.DataFrame() 
    span['catid'] = range(1,5+1)
    span['shipmentid']=1
    span['qty']=0

    frame = pd.concat([frame, span])  # DataFrame.append was removed in pandas 2.0

    return frame.pivot_table(['qty'],index=['shipmentid'],columns='catid', \
                             aggfunc='sum',fill_value=0).reset_index()

def createStore():

    store = pd.HDFStore('testdata.h5')
    return store

segMin = 0
segMax = 4

store = createStore()
frame = loadFrame()

print('Printing Frame')
print(frame)
print(frame.info())

for i in range(segMin,segMax):
    segment = pivotSegment(i,frame)
    store.append('data',frame[(i*3):(i*3 + 3)])
    store.append('pivotedData',segment)

print('\nPrinting Store')   
print(store)
print('\nPrinting Store: data') 
print(store['data'])
print('\nPrinting Store: pivotedData') 
print(store['pivotedData'])

print('**************')
print(store['pivotedData'].set_index('shipmentid').groupby('shipmentid',level=0).sum())
print('**************')
print('$$$')
for df in store.select('pivotedData',chunksize=3):
    print(df.set_index('shipmentid').groupby('shipmentid',level=0).sum())

print('$$$')
store['pivotedAndSummed'] = sum((df.set_index('shipmentid').groupby('shipmentid',level=0).sum() for df in store.select('pivotedData',chunksize=3)))
print('\nPrinting Store: pivotedAndSummed') 
print(store['pivotedAndSummed'])

store.close()
os.remove('testdata.h5')
print('closed')

Best answer

You could do the appending with HDF5/pytables. This keeps the data out of RAM.

Use the table format:

store = pd.HDFStore('store.h5')
for ...:
    ...
    chunk  # the chunk of the DataFrame (which you want to append)
    store.append('df', chunk)

Now you can read it in as a DataFrame in one go (assuming this DataFrame can fit in memory!):

df = store['df']

You can also query the store to get only subsections of the DataFrame.
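As a rough sketch of appending in chunks and then querying a subsection, the snippet below writes three small chunks to a temporary HDF5 file and selects the rows for one shipmentid. The column names, the chunk contents, and the temporary path are illustrative assumptions, and it assumes PyTables is installed. Passing data_columns is what makes a column usable in a where clause.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "store.h5")  # hypothetical file name
    with pd.HDFStore(path) as store:
        # Append three small chunks; store.append writes in table format.
        for start in range(0, 9, 3):
            chunk = pd.DataFrame({
                "shipmentid": [1, 2, 3],
                "qty": [start, start + 1, start + 2],
            })
            # data_columns makes 'shipmentid' queryable on disk
            store.append("df", chunk, data_columns=["shipmentid"])

        # Row count without loading the data into memory
        nrows = store.get_storer("df").nrows
        # Read back only the rows that match the query
        subset = store.select("df", where="shipmentid == 2")

print(nrows)
print(subset)
```

Only the matching rows are materialized as a DataFrame, so the full table never has to fit in memory.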

Additionally: you should also buy more RAM, it's cheap.

EDIT: You can do the groupby/sum from the store iteratively, since this "map-reduces" over the chunks:

# note: this doesn't work, see below
sum(df.groupby().sum() for df in store.select('df', chunksize=50000))
# equivalent to (but doesn't read in the entire frame)
store['df'].groupby().sum()

Edit2: Using sum as above doesn't actually work in pandas 0.16 (I thought it did in 0.15.2); instead you can use reduce with add:

reduce(lambda x, y: x.add(y, fill_value=0),
       (df.groupby().sum() for df in store.select('df', chunksize=50000)))

In Python 3 you have to import reduce from functools.
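To illustrate the difference, here is a small sketch that uses an in-memory list of chunks as a stand-in for store.select('df', chunksize=...). The chunk contents and the choice of shipmentid as the group key are assumptions for the example. In current pandas, the built-in sum relies on +, which aligns on the index and yields NaN for any key missing from one side, whereas add with fill_value=0 treats a missing key as zero.

```python
from functools import reduce

import pandas as pd

# Hypothetical chunks; in practice these would come from store.select(...)
chunks = [
    pd.DataFrame({"shipmentid": [1, 2], "qty": [1, 2]}),
    pd.DataFrame({"shipmentid": [2, 3], "qty": [3, 4]}),
]

# "Map" step: aggregate each chunk on its own
partials = [c.groupby("shipmentid").sum() for c in chunks]

# Naive combination: keys absent from any chunk become NaN
naive = sum(partials)

# Combination with fill_value: absent keys count as 0
combined = reduce(lambda x, y: x.add(y, fill_value=0), partials)

print(naive)
print(combined)
```

Here shipmentid 1 and 3 each appear in only one chunk, so they are NaN in naive but keep their totals in combined.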

Perhaps it's more pythonic/readable to write this as:

chunks = (df.groupby().sum() for df in store.select('df', chunksize=50000))
res = next(chunks)  # will raise if there are no chunks!
for c in chunks:
    res = res.add(c, fill_value=0)

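If this performs poorly, or if there are a huge number of new groups, it may be preferable to start res as zeros of the correct size (by getting the unique group keys, e.g. by looping through the chunks), and then add in place.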
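One possible way to do this is sketched below, under stated assumptions: iter_chunks is a hypothetical stand-in for store.select('df', chunksize=...), and the key and qty column names are made up for the example. A first pass collects the unique group keys, res is allocated once as zeros, and a second pass adds each chunk's partial sums into the matching rows.

```python
import pandas as pd


def iter_chunks():
    """Hypothetical stand-in for store.select('df', chunksize=...)."""
    yield pd.DataFrame({"key": [1, 2, 2], "qty": [1.0, 2.0, 3.0]})
    yield pd.DataFrame({"key": [2, 3], "qty": [4.0, 5.0]})
    yield pd.DataFrame({"key": [1, 4], "qty": [6.0, 7.0]})


# First pass: collect every group key that occurs in any chunk
keys = set()
for chunk in iter_chunks():
    keys.update(chunk["key"].unique())

# Allocate the result once, filled with zeros
res = pd.DataFrame(0.0, index=pd.Index(sorted(keys), name="key"),
                   columns=["qty"])

# Second pass: add each chunk's partial sums into the matching rows only
for chunk in iter_chunks():
    part = chunk.groupby("key").sum()
    res.loc[part.index, part.columns] += part

print(res)
```

The trade-off is reading the data twice in exchange for a result frame that never has to be re-aligned or grown while combining.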

Regarding "python - How to create a pivot table on extremely large dataframes in Pandas", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/29439589/
