
python - How to estimate the real size of a dataframe in pyspark?


How can I determine the size of a dataframe?

Right now I estimate the real size of a dataframe as follows:

# Sum the header-name lengths plus the string length of every value in every row.
headers_size = sum(len(key) for key in df.first().asDict())
rows_size = df.rdd.map(lambda row: sum(len(str(v)) for v in row.asDict().values())).sum()
total_size = headers_size + rows_size

It is too slow, and I'm looking for a better way.

Best answer

From a great post by Tamas Szuromi: http://metricbrew.com/how-to-estimate-rdd-or-dataframe-real-size-in-pyspark/

from pyspark.serializers import PickleSerializer, AutoBatchedSerializer

def _to_java_object_rdd(rdd):
    """Return a JavaRDD of Object by unpickling.

    It will convert each Python object into a Java object via Pyrolite,
    whether or not the RDD is serialized in batches.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)

# Convert the DataFrame's underlying RDD into a JavaRDD of plain Java objects...
JavaObj = _to_java_object_rdd(df.rdd)

# ...then let Spark's SizeEstimator walk the object graph and report its size in bytes.
nbytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(JavaObj)
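
For reference, a minimal end-to-end sketch of calling the helper above. The SparkSession setup and the test DataFrame are illustrative assumptions, not part of the original answer; on recent PySpark releases PickleSerializer may be exposed under a different name (e.g. CPickleSerializer):

from pyspark.sql import SparkSession

# A minimal sketch, assuming a local SparkSession; df and its contents are illustrative.
spark = SparkSession.builder.master("local[*]").appName("size-estimate").getOrCreate()
sc = spark.sparkContext

df = spark.createDataFrame([(i, str(i) * 10) for i in range(1000)], ["id", "payload"])

JavaObj = _to_java_object_rdd(df.rdd)
nbytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(JavaObj)
print("estimated size: %d bytes" % nbytes)

Note that SizeEstimator reports the in-memory footprint of the unpickled objects on the JVM heap, which can differ substantially from the serialized or on-disk size.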

A similar question on Stack Overflow: https://stackoverflow.com/questions/37077432/
