
python - What does pyspark need psutil for? (faced "UserWarning: Please install psutil to have better support with spilling")?


I have started learning Spark with pyspark and would like to know what the following log message means:

UserWarning: Please install psutil to have better support with spilling

The operation that leads to the spilling is a join between two RDDs:

print(user_types.join(user_genres).collect())

This may sound a bit obvious, but my first question is: what does "spilling" actually mean here?

I did install psutil and the warning went away, but I would like to understand what exactly is happening. There is a very similar question here, but the OP mainly asks how to install psutil.

Best Answer

Spilling here means writing the in-memory dataframes to disk, which reduces pyspark's performance, since writing to disk is slow.

Why psutil is used

To check how much memory the node has already used.

This is the original snippet from the pyspark source file shuffle.py, taken from here, that raises the warning. The code below defines a function that returns the used memory when psutil is available, or when the system is Linux.

Importing psutil and defining get_used_memory

try:
    import psutil

    def get_used_memory():
        """ Return the used memory in MB """
        process = psutil.Process(os.getpid())
        if hasattr(process, "memory_info"):
            info = process.memory_info()
        else:
            info = process.get_memory_info()
        return info.rss >> 20

except ImportError:

    def get_used_memory():
        """ Return the used memory in MB """
        if platform.system() == 'Linux':
            for line in open('/proc/self/status'):
                if line.startswith('VmRSS:'):
                    return int(line.split()[1]) >> 10
        else:
            warnings.warn("Please install psutil to have better "
                          "support with spilling")
            if platform.system() == "Darwin":
                import resource
                rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
                return rss >> 20
            # TODO: support windows
        return 0
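
For reference, here is a minimal standalone sketch (not part of pyspark) of what the psutil branch boils down to: read the current process's resident set size and shift it right by 20 bits to convert bytes into MB, just like info.rss >> 20 above.

import os
import psutil

# Standalone illustration: ask psutil for this process's RSS in bytes
# and convert it to MB the same way get_used_memory() does.
process = psutil.Process(os.getpid())
rss_bytes = process.memory_info().rss
print("used memory: %d MB" % (rss_bytes >> 20))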

Writing to disk

The code below triggers the write to disk when the node's used memory exceeds a preset limit.

def mergeCombiners(self, iterator, check=True):
    """ Merge (K,V) pair by mergeCombiner """
    iterator = iter(iterator)
    # speedup attribute lookup
    d, comb, batch = self.data, self.agg.mergeCombiners, self.batch
    c = 0
    for k, v in iterator:
        d[k] = comb(d[k], v) if k in d else v
        if not check:
            continue

        c += 1
        if c % batch == 0 and get_used_memory() > self.memory_limit:
            self._spill()
            self._partitioned_mergeCombiners(iterator, self._next_limit())
            break
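
The preset limit here is self.memory_limit. As far as I can tell, it is derived from the spark.python.worker.memory setting (512m by default), which is the memory budget per Python worker during aggregation, so a hedged sketch of raising that budget before spilling kicks in could look like this:

from pyspark import SparkConf, SparkContext

# Hedged sketch: spark.python.worker.memory is the per-worker memory budget
# used during aggregation in the Python worker; raising it lets more data
# stay in memory before _spill() is triggered.
conf = SparkConf().set("spark.python.worker.memory", "1g")
sc = SparkContext(conf=conf)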

Spilling

This code actually spills, i.e. writes the dataframe to disk, when the used memory exceeds the preset limit.

def _spill(self):
    """
    dump already partitioned data into disks.

    It will dump the data in batch for better performance.
    """
    global MemoryBytesSpilled, DiskBytesSpilled
    path = self._get_spill_dir(self.spills)
    if not os.path.exists(path):
        os.makedirs(path)

    used_memory = get_used_memory()
    if not self.pdata:
        # The data has not been partitioned, it will iterator the
        # dataset once, write them into different files, has no
        # additional memory. It only called when the memory goes
        # above limit at the first time.

        # open all the files for writing
        streams = [open(os.path.join(path, str(i)), 'w')
                   for i in range(self.partitions)]

        for k, v in self.data.iteritems():
            h = self._partition(k)
            # put one item in batch, make it compatitable with load_stream
            # it will increase the memory if dump them in batch
            self.serializer.dump_stream([(k, v)], streams[h])

        for s in streams:
            DiskBytesSpilled += s.tell()
            s.close()
        self.data.clear()
        self.pdata = [{} for i in range(self.partitions)]
    else:
        for i in range(self.partitions):
            p = os.path.join(path, str(i))
            with open(p, "w") as f:
                # dump items in batch
                self.serializer.dump_stream(self.pdata[i].iteritems(), f)
            self.pdata[i].clear()
            DiskBytesSpilled += os.path.getsize(p)

    self.spills += 1
    gc.collect()  # release the memory as much as possible
    MemoryBytesSpilled += (used_memory - get_used_memory()) << 20
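
The bit shifts in these snippets are just unit conversions: info.rss >> 20 turns bytes into MB, the VmRSS value >> 10 turns kB into MB, and (used_memory - get_used_memory()) << 20 turns the freed MB back into bytes for the MemoryBytesSpilled counter. A small illustration with made-up numbers:

# Unit conversions used above: one shift by 20 bits is a factor of 2**20 (1 MiB).
rss_bytes = 734003200       # hypothetical RSS reported in bytes
print(rss_bytes >> 20)      # 700 -> bytes to MB, as in info.rss >> 20

vmrss_kb = 716800           # /proc/self/status reports VmRSS in kB
print(vmrss_kb >> 10)       # 700 -> kB to MB, as in int(line.split()[1]) >> 10

freed_mb = 350              # MB released by a spill
print(freed_mb << 20)       # 367001600 -> MB back to bytes for MemoryBytesSpilled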

Regarding python - What does pyspark need psutil for? (faced "UserWarning: Please install psutil to have better support with spilling")?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51226469/
