
python - How to find the memory usage of a PySpark DataFrame?

Reposted · Author: 太空狗 · Updated: 2023-10-29 22:28:50

For a pandas DataFrame in Python, the info() function reports memory usage. Is there an equivalent in PySpark? Thanks.
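For reference, this is the pandas behavior the question refers to; a minimal sketch with a hypothetical sample DataFrame:

import pandas as pd

df = pd.DataFrame({"a": range(1000), "b": ["x"] * 1000})  # hypothetical sample data
df.info(memory_usage="deep")  # prints per-column dtypes and the total memory usage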

Best answer

Try the _to_java_object_rdd() function:

# Helper to convert a Python RDD into a JavaRDD of Java objects
from pyspark.serializers import AutoBatchedSerializer, PickleSerializer
# Note: in recent PySpark releases (3.3+) this class may be named CPickleSerializer

def _to_java_object_rdd(rdd):
    """Return a JavaRDD of Object by unpickling.

    Each Python object is converted into a Java object by Pyrolite,
    whether or not the RDD is serialized in batches.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)

# First convert the DataFrame you want to measure (df) to an RDD,
# then to a JavaRDD of Java objects
java_obj = _to_java_object_rdd(df.rdd)

# Now run Spark's SizeEstimator on it; sc is the active SparkContext
# (e.g. spark.sparkContext) and the result is in bytes
size_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(java_obj)
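For completeness, a minimal end-to-end sketch reusing the helper above; the local SparkSession setup and the spark.range sample DataFrame are assumptions for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("size-estimate").getOrCreate()
sc = spark.sparkContext

df = spark.range(100000)  # hypothetical sample DataFrame

java_obj = _to_java_object_rdd(df.rdd)
size_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(java_obj)
print("Estimated size: %.2f MiB" % (size_bytes / (1024 ** 2)))

Note that SizeEstimator reports an approximation of the in-memory footprint as seen by the JVM, so the result can differ from the serialized or on-disk size.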

Regarding "python - How to find the memory usage of a PySpark DataFrame?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46228138/
