
python - What does pyspark need psutil for? (faced "UserWarning: Please install psutil to have better support with spilling")?


I have started learning Spark with pyspark and would like to know what the following log message means:

UserWarning: Please install psutil to have better support with spilling

The operation that leads to the spilling is a join between two RDDs:

print(user_types.join(user_genres).collect())

This may sound a bit obvious, but my first question is: what does "spilling" actually mean here?

I did install psutil and the warning went away, but I would like to understand what exactly is happening. There is a very similar question here, but the OP mainly asks how to install psutil.

Best Answer

Spilling here means writing the in-memory dataframes to disk, which reduces pyspark's performance, since writing to disk is slow.

Why psutil is used

To check how much memory the node has already used.

This is the original snippet from the pyspark source file shuffle.py, taken from here, that raises the warning. The code below defines a function that returns the used memory when psutil is available, or when the system is Linux.

Importing psutil and defining get_used_memory

try:
    import psutil

    def get_used_memory():
        """ Return the used memory in MB """
        process = psutil.Process(os.getpid())
        if hasattr(process, "memory_info"):
            info = process.memory_info()
        else:
            info = process.get_memory_info()
        return info.rss >> 20

except ImportError:

    def get_used_memory():
        """ Return the used memory in MB """
        if platform.system() == 'Linux':
            for line in open('/proc/self/status'):
                if line.startswith('VmRSS:'):
                    return int(line.split()[1]) >> 10
        else:
            warnings.warn("Please install psutil to have better "
                          "support with spilling")
            if platform.system() == "Darwin":
                import resource
                rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
                return rss >> 20
            # TODO: support windows
        return 0
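
For reference, here is a minimal standalone sketch (not part of pyspark) of what the psutil branch boils down to: read the current process's resident set size and shift it right by 20 bits to convert bytes into MB, just like info.rss >> 20 above.

import os
import psutil

# Standalone illustration: ask psutil for this process's RSS in bytes
# and convert it to MB the same way get_used_memory() does.
process = psutil.Process(os.getpid())
rss_bytes = process.memory_info().rss
print("used memory: %d MB" % (rss_bytes >> 20))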

Writing to disk

The code below triggers the write to disk when the node's used memory exceeds a preset limit.

def mergeCombiners(self, iterator, check=True):
    """ Merge (K,V) pair by mergeCombiner """
    iterator = iter(iterator)
    # speedup attribute lookup
    d, comb, batch = self.data, self.agg.mergeCombiners, self.batch
    c = 0
    for k, v in iterator:
        d[k] = comb(d[k], v) if k in d else v
        if not check:
            continue

        c += 1
        if c % batch == 0 and get_used_memory() > self.memory_limit:
            self._spill()
            self._partitioned_mergeCombiners(iterator, self._next_limit())
            break
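
The preset limit here is self.memory_limit. As far as I can tell, it is derived from the spark.python.worker.memory setting (512m by default), which is the memory budget per Python worker during aggregation, so a hedged sketch of raising that budget before spilling kicks in could look like this:

from pyspark import SparkConf, SparkContext

# Hedged sketch: spark.python.worker.memory is the per-worker memory budget
# used during aggregation in the Python worker; raising it lets more data
# stay in memory before _spill() is triggered.
conf = SparkConf().set("spark.python.worker.memory", "1g")
sc = SparkContext(conf=conf)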

Spilling

This code actually spills, i.e. writes the dataframe to disk, when the used memory exceeds the preset limit.

def _spill(self):
    """
    dump already partitioned data into disks.

    It will dump the data in batch for better performance.
    """
    global MemoryBytesSpilled, DiskBytesSpilled
    path = self._get_spill_dir(self.spills)
    if not os.path.exists(path):
        os.makedirs(path)

    used_memory = get_used_memory()
    if not self.pdata:
        # The data has not been partitioned, it will iterator the
        # dataset once, write them into different files, has no
        # additional memory. It only called when the memory goes
        # above limit at the first time.

        # open all the files for writing
        streams = [open(os.path.join(path, str(i)), 'w')
                   for i in range(self.partitions)]

        for k, v in self.data.iteritems():
            h = self._partition(k)
            # put one item in batch, make it compatitable with load_stream
            # it will increase the memory if dump them in batch
            self.serializer.dump_stream([(k, v)], streams[h])

        for s in streams:
            DiskBytesSpilled += s.tell()
            s.close()
        self.data.clear()
        self.pdata = [{} for i in range(self.partitions)]
    else:
        for i in range(self.partitions):
            p = os.path.join(path, str(i))
            with open(p, "w") as f:
                # dump items in batch
                self.serializer.dump_stream(self.pdata[i].iteritems(), f)
            self.pdata[i].clear()
            DiskBytesSpilled += os.path.getsize(p)

    self.spills += 1
    gc.collect()  # release the memory as much as possible
    MemoryBytesSpilled += (used_memory - get_used_memory()) << 20
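
The bit shifts in these snippets are just unit conversions: info.rss >> 20 turns bytes into MB, the VmRSS value >> 10 turns kB into MB, and (used_memory - get_used_memory()) << 20 turns the freed MB back into bytes for the MemoryBytesSpilled counter. A small illustration with made-up numbers:

# Unit conversions used above: one shift by 20 bits is a factor of 2**20 (1 MiB).
rss_bytes = 734003200       # hypothetical RSS reported in bytes
print(rss_bytes >> 20)      # 700 -> bytes to MB, as in info.rss >> 20

vmrss_kb = 716800           # /proc/self/status reports VmRSS in kB
print(vmrss_kb >> 10)       # 700 -> kB to MB, as in int(line.split()[1]) >> 10

freed_mb = 350              # MB released by a spill
print(freed_mb << 20)       # 367001600 -> MB back to bytes for MemoryBytesSpilled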

Regarding python - What does pyspark need psutil for? (faced "UserWarning: Please install psutil to have better support with spilling")?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51226469/
