python - How does one train multiple models in a single script in TensorFlow when there are GPUs present?

Reposted by: IT老高. Last updated: 2023-10-28 20:54:55

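Say I have access to several GPUs on a single machine (for the sake of argument, assume one machine with some amount of RAM and disk and 8 GPUs, each with at most 8GB of memory). I want to run, in a single script on that one machine, a program that evaluates multiple models in TensorFlow (say 50 or 200), each with a different hyperparameter setting (step size, decay rate, batch size, epochs/iterations, etc.). At the end of training, assume we just record the model's accuracy and get rid of the model. (If you like, assume the model is checkpointed every so often, so it is fine to throw it away and start the next one from scratch. You may also assume some other data is recorded along the way, such as the specific hyperparameters and the training and validation errors.)

Currently I have a (pseudo-)script that looks as follows: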

import tensorflow as tf  # TensorFlow 1.x API

# Pseudo-script: float_type, the get_* helpers, and the data/size names are placeholders.
def train_multiple_models_in_one_script_with_gpu(arg):
    '''
    Trains multiple NN models in one script, each in its own graph and session.

    arg = some obj/struct with the params for training each of the models.
    '''
    #### try multiple models
    for mdl_id in range(100):
        #### define/create graph
        graph = tf.Graph()
        with graph.as_default():
            ### get mdl
            x = tf.placeholder(float_type, get_x_shape(arg), name='x-input')
            y_ = tf.placeholder(float_type, get_y_shape(arg))
            y = get_mdl(arg,x)
            ### get loss and accuracy
            loss, accuracy = get_accuracy_loss(arg,x,y,y_)
            ### get optimizer variables
            global_step = tf.Variable(0, trainable=False, name='global_step')
            opt = get_optimizer(arg)
            train_step = opt.minimize(loss, global_step=global_step)
            init = tf.global_variables_initializer()
        #### run session
        with tf.Session(graph=graph) as sess:
            sess.run(init)
            # train
            for i in range(nb_iterations):
                batch_xs, batch_ys = get_batch_feed(X_train, Y_train, batch_size)
                # minimize() increments global_step on every training step
                sess.run(fetches=train_step, feed_dict={x: batch_xs, y_: batch_ys})
                # report errors (the whole data set is fed in a single batch here)
                if i % report_error_freq == 0:
                    train_error = sess.run(fetches=loss, feed_dict={x: X_train, y_: Y_train})
                    test_error = sess.run(fetches=loss, feed_dict={x: X_test, y_: Y_test})
                    print( 'step %d, train error: %s test_error %s'%(i,train_error,test_error) )

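Essentially, it tries many models in a single run, but it builds each model in a separate graph and runs each one in a separate session.

I guess my main worry is that it is unclear to me how TensorFlow allocates GPU resources under the hood. For instance, does it load (part of) the data set only when a session is run? When I create a graph and a model, is it moved to the GPU immediately, or at what point is it placed on the GPU? Do I need to clear/free the GPU each time I try a new model? I don't really care much whether the models run in parallel across multiple GPUs (that would be a nice addition); I first want everything to run serially without crashing. Is there anything special I need to do for this to work?

Currently I am getting an error that starts as follows: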

I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit:                   340000768
InUse:                   336114944
MaxInUse:                339954944
NumAllocs:                      78
MaxAllocSize:            335665152

W tensorflow/core/common_runtime/bfc_allocator.cc:274] ***************************************************xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 160.22MiB.  See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:975] Resource exhausted: OOM when allocating tensor with shape[60000,700]

and further down it says:

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[60000,700]
         [[Node: standardNN/NNLayer1/Z1/add = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](standardNN/NNLayer1/Z1/MatMul, b1/read)]]

I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:06:00.0)

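However, further down in the output file (where it prints), it seems to print the errors/messages that should appear as training progresses just fine. Does this mean that it didn't run out of resources? Or was it actually able to use the GPU? If it was able to use the CPU instead of the GPU, then why does this error only occur when the GPU is about to be used?

The odd thing is that the data set is really not that big (all 60K points take 24.5M), and when I run a single model locally on my own computer the process seems to use less than 5GB. The GPUs have at least 8GB each, and the machine they sit in has plenty of RAM and disk (at least 16GB). So the errors TensorFlow is throwing at me are quite puzzling. What is it trying to do, and why are they occurring? Any ideas?

After reading the answer that suggests using the multiprocessing library, I came up with the following script: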

from multiprocessing import Process

def train_mdl(args):
    train(mdl,args)

if __name__ == '__main__':
    for mdl_id in range(100):
        # train one model with some specific hyperparams (assume they are chosen randomly inside the function below, read from a config file, or simply passed in)
        p = Process(target=train_mdl, args=(args,))
        p.start()
        p.join()
    print('Done training all models!')

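Honestly, I'm not sure why his answer suggests using a pool, or why there are weird tuple brackets, but this is what makes sense to me. Are TensorFlow's resources re-allocated every time a new process is created in the loop above?

Best answer

I think that running all the models in one single script can be bad practice in the long term (see my suggestion below for a better alternative). However, if you would like to do it, here is a solution: you can encapsulate your TF session in a process using the multiprocessing module, which makes sure TF releases the session memory once the process is done. Here is a code snippet: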

from multiprocessing import Pool
import contextlib

# Note the extra (), required by the pool syntax. Tuple parameters in a
# signature are Python 2 only; on Python 3, take one argument and unpack it.
def my_model((param1, param2, param3)):
    < your code >

num_pool_workers = 1  # can be bigger than 1, to enable parallel execution
with contextlib.closing(Pool(num_pool_workers)) as po:  # This ensures that the processes get closed once they are done
    pool_results = po.map_async(my_model,
                                ((param1, param2, param3)
                                 for param1, param2, param3 in params_list))
    results_list = pool_results.get()

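Note from the OP: if you choose to use the multiprocessing library, the random number generator seed is not reset automatically. For details see: Using python multiprocessing with different random seed for each process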
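One way to deal with the seeding issue in the note above is to give every job its own seed and to seed the generator at the very start of the worker function. The following is a minimal sketch under stated assumptions: `make_jobs`, `run_job`, `run_all`, and the `base_seed + job_id` scheme are hypothetical names and choices, not taken from the original answer, and only Python's built-in `random` module is seeded here. NumPy and TensorFlow keep their own generators and would need their own seeding calls inside the worker.

```python
import random
from multiprocessing import Pool


def make_jobs(base_seed, num_jobs):
    """Pair each job index with its own seed (hypothetical scheme: base_seed + index)."""
    return [(job_id, base_seed + job_id) for job_id in range(num_jobs)]


def run_job(job):
    """Worker body: seed first, then do the (placeholder) random work."""
    job_id, seed = job
    # Seed inside the worker so the process does not reuse RNG state
    # inherited from its parent.
    random.seed(seed)
    # Placeholder for real work, e.g. sampling a learning rate on a log scale.
    learning_rate = 10 ** random.uniform(-5, -1)
    return job_id, seed, learning_rate


def run_all(base_seed, num_jobs, num_workers=2):
    """Run every job in a worker pool and return the results in job order."""
    with Pool(num_workers) as pool:
        return pool.map(run_job, make_jobs(base_seed, num_jobs))
```

In a script, `run_all(1234, 4)` would be called from under an `if __name__ == '__main__':` guard, as in the question's loop. Recording the seed next to each result makes a single run reproducible later.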

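About TF resource allocation: TF usually allocates much more resources than it needs. Many times you can restrict each process to a fraction of the total GPU memory, and find through trial and error the fraction your script requires.

You can do it with the following snippet: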

gpu_memory_fraction = 0.3 # Choose this number through trial and error
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=gpu_memory_fraction,)
session_config = tf.ConfigProto(gpu_options=gpu_options)
sess = tf.Session(config=session_config, graph=graph)

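Note that sometimes TF increases the memory usage in order to accelerate execution. Therefore, reducing the memory usage might make your model run slower.

Answers to the new questions in your edit/comments:

Yes, TensorFlow re-allocates its resources every time a new process is created, and they are released once the process ends.

The for loop in your edit should also do the job. I suggest using Pool instead, because it lets you run several models concurrently on a single GPU. See my notes about setting gpu_memory_fraction and about choosing the maximum number of processes. Also note: (1) Pool.map runs the loop for you, so you don't need an outer for loop once you use it. (2) In your example, you should have something like mdl=get_model(args) before calling train().

The weird tuple brackets: Pool accepts only a single argument, so we use a tuple to pass multiple arguments. See multiprocessing.pool.map and function with two arguments for more details. As suggested in one answer there, you can make it more readable, and compatible with Python 3 (which dropped tuple parameters in function signatures), with: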

def train_mdl(params):
    (x,y)=params
    < your code >

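As @Seven suggested, you can use the CUDA_VISIBLE_DEVICES environment variable to choose which GPU your process uses. You can do it from within your Python script by putting the following at the beginning of the process function (train_mdl):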

import os # the import can be on the top of the python script
os.environ["CUDA_VISIBLE_DEVICES"] = "{}".format(gpu_id)
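When jobs are handed out by index, the GPU id can be derived from the job index. A minimal sketch, assuming a simple round-robin assignment; `pick_gpu`, `pin_to_gpu`, and `num_gpus` are hypothetical names, not part of the original answer. The variable is read when CUDA is initialized in a process, so the sketch is meant to run before TensorFlow touches the GPU.

```python
import os


def pick_gpu(job_id, num_gpus):
    """Round-robin mapping from a job index to a GPU index."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return job_id % num_gpus


def pin_to_gpu(job_id, num_gpus):
    """Restrict the current process to one GPU; call this first in the worker."""
    gpu_id = pick_gpu(job_id, num_gpus)
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return gpu_id
```

With 8 GPUs, job 10 would call `pin_to_gpu(10, 8)` and see only GPU 2, which TensorFlow then addresses as its device 0.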

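Better practice for running experiments: isolate your training/evaluation code from the hyperparameter/model search code.
E.g., have a script named train.py that accepts a specific combination of hyperparameters and references to your data as arguments, and trains a single model.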
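The command-line side of such a train.py could be sketched with argparse as follows. This is only an illustration: every flag name and default below is hypothetical, and the training itself is left as a placeholder.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters for one training run (flag names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Train a single model.")
    parser.add_argument("--learning_rate", type=float, default=0.01)
    parser.add_argument("--decay_rate", type=float, default=0.9)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--nb_iterations", type=int, default=1000)
    parser.add_argument("--data_path", type=str, default="data/")
    return parser.parse_args(argv)


def main(argv=None):
    """Entry point for one run; returns the parsed arguments."""
    args = parse_args(argv)
    # Placeholder: build the model, train it, and log the accuracy here.
    print("training with", vars(args))
    return args
```

Ending the file with `if __name__ == '__main__': main()` lets it be invoked as `python train.py --learning_rate=0.001 --batch_size=128`, which matches the `--flag=value` form used by the job submission code further down.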

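Then, to iterate over all possible parameter combinations, you can use a simple task (job) queue and submit every hyperparameter combination as a separate job. The task queue feeds your machine one job at a time. Usually you can also set the queue to run several processes concurrently (see details below).

Specifically, I use task spooler, which is super easy to install and handy (it doesn't require admin privileges; details below).

Basic usage is (see the notes below about task spooler usage):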

ts <your-command>

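In practice, I have a separate Python script that manages my experiments, sets all the arguments for each specific experiment, and sends the jobs to the ts queue.

Here are some relevant snippets of Python code from my experiment manager:
run_bash executes a bash command: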

import subprocess

def run_bash(cmd):
    p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, executable='/bin/bash')
    out = p.stdout.read().strip()
    return out  # This is the stdout from the shell command (bytes on Python 3)

The next snippet sets the number of concurrent processes to run (see the note below about choosing the maximum number of processes):

max_job_num_per_gpu = 2
run_bash('ts -S %d'%max_job_num_per_gpu)

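The next snippet iterates over a list of all combinations of hyperparameters/model parameters. Each element of the list is a dictionary whose keys are the command-line arguments of the train.py script: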

for combination_dict in combinations_list:
    # .items() works on both Python 2 and 3 (iteritems() is Python 2 only)
    job_cmd = 'python train.py ' + '  '.join(
            ['--{}={}'.format(flag, value) for flag, value in combination_dict.items()])

    submit_cmd = "ts bash -c '%s'" % job_cmd
    run_bash(submit_cmd)
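The answer does not show how the list of combinations is built. One possible way, sketched here with hypothetical names (`make_combinations`, `build_job_cmd`, `grid`), is to expand a grid of candidate values with itertools.product and to quote each value before it goes into the command line.

```python
import itertools
import shlex


def make_combinations(grid):
    """Expand {flag: [values]} into a list of {flag: value} dictionaries."""
    flags = sorted(grid)
    return [dict(zip(flags, values))
            for values in itertools.product(*(grid[f] for f in flags))]


def build_job_cmd(combination, script="train.py"):
    """Build the command line for one job, quoting each value for the shell."""
    flags = ["--{}={}".format(flag, shlex.quote(str(value)))
             for flag, value in sorted(combination.items())]
    return "python {} {}".format(script, " ".join(flags))
```

A grid such as `{"learning_rate": [0.1, 0.01], "batch_size": [32, 64, 128]}` expands to six dictionaries. If the resulting command is then wrapped in `bash -c '...'`, values that themselves need quoting would require extra care; the sketch does not handle that nesting.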

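A note about choosing the maximum number of processes:

If you are short on GPUs, you can use the gpu_memory_fraction you found to set the number of processes as max_job_num_per_gpu=int(1/gpu_memory_fraction)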
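As a worked instance of the formula above, here is a small helper; the name `max_jobs_per_gpu` and the range check are additions for illustration only.

```python
def max_jobs_per_gpu(gpu_memory_fraction):
    """Number of processes that fit on one GPU at the given memory fraction."""
    if not 0.0 < gpu_memory_fraction <= 1.0:
        raise ValueError("gpu_memory_fraction must be in (0, 1]")
    # int() truncates, so the fractions of all jobs together never exceed 1.
    return int(1 / gpu_memory_fraction)
```

With the fraction of 0.3 used earlier, this gives 3 jobs per GPU, which together claim 90% of the GPU memory.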
Notes about task spooler (ts):

You can set the number of concurrent processes to run ("slots") with:
ts -S <number-of-slots>

Installing ts doesn't require admin privileges. You can download it and compile it from source with a simple make, add it to your path, and you're done.

You can set up multiple queues (I use this for multiple GPUs) with
TS_SOCKET=<path_to_queue_name> ts <your-command>
For example:
TS_SOCKET=/tmp/socket-ts.gpu_queue_1 ts <your-command>
TS_SOCKET=/tmp/socket-ts.gpu_queue_2 ts <your-command>

See here for further usage examples.
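From a Python experiment manager, the per-GPU queues could be addressed by setting TS_SOCKET in the environment of the ts call. The following is a rough sketch rather than the author's code: `queue_socket`, `queue_env`, `gpu_job_cmd`, and `submit` are hypothetical helpers, and pairing each queue with one GPU through CUDA_VISIBLE_DEVICES is an assumption about how the queues are meant to be used.

```python
import os
import subprocess


def queue_socket(gpu_id):
    """Socket path for the queue of one GPU (naming follows the example above)."""
    return "/tmp/socket-ts.gpu_queue_{}".format(gpu_id)


def queue_env(gpu_id, base_env=None):
    """Copy of the environment with TS_SOCKET pointing at the queue for gpu_id."""
    env = dict(os.environ if base_env is None else base_env)
    env["TS_SOCKET"] = queue_socket(gpu_id)
    return env


def gpu_job_cmd(job_cmd, gpu_id):
    """Prefix the job command so the job itself only sees the chosen GPU."""
    return "CUDA_VISIBLE_DEVICES={} {}".format(gpu_id, job_cmd)


def submit(job_cmd, gpu_id):
    """Enqueue job_cmd (a shell command string) on the ts queue of gpu_id."""
    return subprocess.check_output(
        ["ts", "bash", "-c", gpu_job_cmd(job_cmd, gpu_id)],
        env=queue_env(gpu_id))
```

Calling `submit("python train.py --batch_size=64", 1)` would then enqueue the job on the second queue. Passing the command as an argument list avoids one layer of shell quoting compared with building a single string.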

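A note about setting path names and file names automatically:
Once you separate your main code from the experiment manager, you will need an efficient way to generate file and directory names from the hyperparameters. I usually keep my important hyperparameters in a dictionary and use the following function to generate a single chained string from the dictionary's key-value pairs.
Here are the functions I use for doing it: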

import re

def build_string_from_dict(d, sep='%'):
    """
    Builds a string from a dictionary.
    Mainly used for formatting hyper-params to file names.
    Key-value pairs are sorted by the key name.

    :param d: input dictionary
    :param sep: separator placed between key=value pairs
    :return: string
    """

    return sep.join(['{}={}'.format(k, _value2str(d[k])) for k in sorted(d.keys())])


def _value2str(val):
    if isinstance(val, float):
        # %g means: "Floating point format.
        # Uses lowercase exponential format if exponent is less than -4 or not less than precision,
        # decimal format otherwise."
        val = '%g' % val
    else:
        val = '{}'.format(val)
    # replace dots so the value is safe to use inside a file name
    val = re.sub(r'\.', '_', val)
    return val

Regarding "python - How does one train multiple models in a single script in TensorFlow when there are GPUs present?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42426960/
