TensorFlow on shared GPUs: how to automatically select the one that is unused

Reposted by: 行者123. Last updated: 2023-12-03 04:44:11

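I have ssh access to a cluster of n GPUs. TensorFlow automatically names them gpu:0, ..., gpu:(n-1).

Other people have access too, and sometimes they grab GPUs at random. I have not placed any tf.device() calls explicitly, because that is cumbersome, and even if I picked GPU number j, it would be a problem if someone else were already running on GPU number j.

I would like to check the usage of the GPUs, find the first one that is unused, and use only that one. I guess someone could parse the output of nvidia-smi with bash, get a variable i, and pass that variable i to the tensorflow script as the number of the GPU to use.

I have never seen an example of this, although I imagine it is a pretty common problem. What is the simplest way to do it? Is a pure tensorflow solution available?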

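Best answer

I'm not aware of a pure TensorFlow solution. The problem is that the existing place for TensorFlow configuration is the session config. The GPU memory pool, however, is shared by all TensorFlow sessions within a process, so the session config would be the wrong place for such a setting, and there is no mechanism for process-global configuration (there should be one, so that the process-global Eigen thread pool could be configured as well). You therefore need to do this at the process level, using the CUDA_VISIBLE_DEVICES environment variable.

Something like this: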

import subprocess, re

# Nvidia-smi GPU memory parsing.
# Tested on nvidia-smi 370.23

def run_command(cmd):
    """Run command, return output as string."""
    output = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True).communicate()[0]
    return output.decode("ascii")

def list_available_gpus():
    """Returns list of available GPU ids."""
    output = run_command("nvidia-smi -L")
    # lines of the form GPU 0: TITAN X
    gpu_regex = re.compile(r"GPU (?P<gpu_id>\d+):")
    result = []
    for line in output.strip().split("\n"):
        m = gpu_regex.match(line)
        assert m, "Couldn't parse " + line
        result.append(int(m.group("gpu_id")))
    return result

def gpu_memory_map():
    """Returns map of GPU id to memory allocated on that GPU."""

    output = run_command("nvidia-smi")
    gpu_output = output[output.find("GPU Memory"):]
    # lines of the form
    # |    0      8734    C   python                                       11705MiB |
    memory_regex = re.compile(r"[|]\s+?(?P<gpu_id>\d+)\D+?(?P<pid>\d+).+[ ](?P<gpu_memory>\d+)MiB")
    result = {gpu_id: 0 for gpu_id in list_available_gpus()}
    for row in gpu_output.split("\n"):
        m = memory_regex.search(row)
        if not m:
            continue
        gpu_id = int(m.group("gpu_id"))
        gpu_memory = int(m.group("gpu_memory"))
        result[gpu_id] += gpu_memory
    return result

def pick_gpu_lowest_memory():
    """Returns GPU with the least allocated memory"""

    memory_gpu_map = [(memory, gpu_id) for (gpu_id, memory) in gpu_memory_map().items()]
    best_memory, best_gpu = sorted(memory_gpu_map)[0]
    return best_gpu

You can then put this in utils.py and set the GPU in your TensorFlow script before the first tensorflow import, i.e.:

import utils
import os
os.environ["CUDA_VISIBLE_DEVICES"] = str(utils.pick_gpu_lowest_memory())
import tensorflow
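
The parsing code above depends on the layout of nvidia-smi's human-readable table, which it notes was tested on one specific version. As a hedged alternative, the sketch below asks nvidia-smi for machine-readable CSV through its --query-gpu and --format options instead. The helper names (parse_memory_csv, pick_least_used, select_gpu) are hypothetical and not part of the original answer, and the sketch assumes every GPU reports memory.used as a plain integer in MiB.

```python
import os
import subprocess
import sys

# Ask nvidia-smi for one CSV line per GPU: "<index>, <used memory in MiB>".
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Parse lines such as '0, 11705' into {gpu_index: used_memory_mib}."""
    usage = {}
    for line in text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        # Assumption: memory.used is numeric; int() raises ValueError otherwise.
        usage[int(index)] = int(used)
    return usage


def pick_least_used(usage):
    """Return the GPU index with the least used memory (lowest index on ties)."""
    return min(usage, key=lambda gpu_id: (usage[gpu_id], gpu_id))


def select_gpu():
    """Hypothetical helper: restrict this process to the least-used GPU."""
    # Conservative guard: the variable has to be set before TensorFlow is imported.
    if "tensorflow" in sys.modules:
        raise RuntimeError("call select_gpu() before importing tensorflow")
    output = subprocess.check_output(QUERY_CMD).decode("ascii")
    gpu_id = pick_least_used(parse_memory_csv(output))
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return gpu_id
```

Calling select_gpu() at the top of a script, before import tensorflow, plays the same role as the utils.pick_gpu_lowest_memory() line above. Because CUDA renumbers the visible devices, the chosen GPU is then the only one the process sees, and it appears as gpu:0 regardless of its physical index.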

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/41634674/
