
python - How to perform multiple NN trainings?

Reposted · Author: 行者123 · Updated: 2023-12-05 05:47:29

There are two NVIDIA GPUs in my machine, but I am not making use of them.

I have three neural network trainings running on my machine. When I try to start a fourth one, the script fails with the following error:

my_user@my_machine:~/my_project/training_my_project$ python3 my_project.py
Traceback (most recent call last):
  File "my_project.py", line 211, in <module>
    load_data(
  File "my_project.py", line 132, in load_data
    tx = tf.convert_to_tensor(data_x, dtype=tf.float32)
  File "/home/my_user/.local/lib/python3.8/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/my_user/.local/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 106, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Failed to allocate scratch buffer for device 0
my_user@my_machine:~/my_project/training_my_project$

How can I resolve this issue?

Here is my RAM usage:

my_user@my_machine:~/my_project/training_my_project$ free -m
              total        used        free      shared  buff/cache   available
Mem:          15947        6651        3650          20        5645        8952
Swap:          2047         338        1709
my_user@my_machine:~/my_project/training_my_project$

Here is my CPU usage:

my_user@my_machine:~$ top -i
top - 12:46:12 up 79 days, 21:14,  2 users,  load average: 4,05, 3,82, 3,80
Tasks: 585 total,   2 running, 583 sleeping,   0 stopped,   0 zombie
%Cpu(s): 11,7 us,  1,6 sy,  0,0 ni, 86,6 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
MiB Mem :  15947,7 total,   3638,3 free,   6662,7 used,   5646,7 buff/cache
MiB Swap:   2048,0 total,   1709,4 free,    338,6 used.   8941,6 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM      TIME+ COMMAND
2081821 my_user   20   0   48,9g   2,5g 471076 S 156,1 15,8    1832:54 python3
2082196 my_user   20   0   48,8g   2,6g 467708 S 148,5 16,8    1798:51 python3
2076942 my_user   20   0   47,8g   1,6g 466916 R 147,5 10,3    2797:51 python3
   1594 gdm       20   0 3989336  65816  31120 S   0,7  0,4   38:03.14 gnome-shell
     93 root      rt   0       0      0      0 S   0,3  0,0    0:38.42 migration/13
   1185 root     -51   0       0      0      0 S   0,3  0,0    3925:59 irq/54-nvidia
2075861 root      20   0       0      0      0 I   0,3  0,0    1:30.17 kworker/22:0-events
2076418 root      20   0       0      0      0 I   0,3  0,0    1:38.65 kworker/1:0-events
2085325 root      20   0       0      0      0 I   0,3  0,0    1:17.15 kworker/3:1-events
2093002 root      20   0       0      0      0 I   0,3  0,0    1:00.05 kworker/23:0-events
2100000 root      20   0       0      0      0 I   0,3  0,0    0:45.78 kworker/2:2-events
2104688 root      20   0       0      0      0 I   0,3  0,0    0:33.08 kworker/9:0-events
2106767 root      20   0       0      0      0 I   0,3  0,0    0:25.16 kworker/20:0-events
2115469 root      20   0       0      0      0 I   0,3  0,0    0:01.98 kworker/11:2-events
2115470 root      20   0       0      0      0 I   0,3  0,0    0:01.96 kworker/12:2-events
2115477 root      20   0       0      0      0 I   0,3  0,0    0:01.95 kworker/30:1-events
2116059 my_user   20   0   23560   4508   3420 R   0,3  0,0    0:00.80 top

Here is my TF configuration:

import os

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
# os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
# os.environ["CUDA_VISIBLE_DEVICES"] = "99" # Use both gpus for training.


import sys, random
import time
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import ModelCheckpoint
import numpy as np
from lxml import etree, objectify


# <editor-fold desc="GPU">
# resolve GPU related issues.
try:
    physical_devices = tf.config.list_physical_devices('GPU')
    for gpu_instance in physical_devices:
        tf.config.experimental.set_memory_growth(gpu_instance, True)
except Exception as e:
    pass
# END of try
# </editor-fold>

Please treat the commented lines as actually commented out.
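As an aside, here is a minimal sketch of how those environment variables could be used to pin one training process to a single GPU; the device index "0" is an assumed example, not taken from the original script:

import os

# Hypothetical: expose only the first GPU to this training process, so
# several independent trainings can each be pinned to a different device.
# These variables must be set before TensorFlow initializes CUDA,
# i.e. before TensorFlow is imported.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # order devices by PCI bus ID
os.environ["CUDA_VISIBLE_DEVICES"] = "0"        # "0" is an assumed example index

import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))   # should list a single GPU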

Relevant source code:

def load_data(fname: str, class_index: int, feature_start_index: int, **selection):
    file = open(fname)
    if "top_n_lines" in selection:
        lines = [next(file) for _ in range(int(selection["top_n_lines"]))]
    elif "random_n_lines" in selection:
        tmp_lines = file.readlines()
        lines = random.sample(tmp_lines, int(selection["random_n_lines"]))
    else:
        lines = file.readlines()
    file.close()

    data_x, data_y = [], []
    for l in lines:
        row = l.strip().split()
        x = [float(ix) for ix in row[feature_start_index:]]
        y = encode(row[class_index])  # encode() is defined elsewhere in the script
        data_x.append(x)
        data_y.append(y)
    # END for l in lines

    num_rows = len(data_x)
    given_fraction = selection.get("validation_part", 1.0)
    if given_fraction > 0.9999:
        valid_x, valid_y = data_x, data_y
    else:
        n = int(num_rows * given_fraction)
        # Take the validation slice before dropping it from the training data,
        # so the two sets do not overlap.
        valid_x, valid_y = data_x[:n], data_y[:n]
        data_x, data_y = data_x[n:], data_y[n:]
    # END of if-else block

    tx = tf.convert_to_tensor(data_x, np.float32)
    ty = tf.convert_to_tensor(data_y, np.float32)

    vx = tf.convert_to_tensor(valid_x, np.float32)
    vy = tf.convert_to_tensor(valid_y, np.float32)

    return tx, ty, vx, vy
# END of the function
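For context, a hypothetical call to this function consistent with the traceback above; the file name and argument values are assumptions, not taken from the original script:

# Hypothetical invocation; "data.txt" and the values below are illustrative.
tx, ty, vx, vy = load_data(
    "data.txt",
    class_index=0,
    feature_start_index=1,
    random_n_lines=50000,   # train on a random subset of lines
    validation_part=0.1,    # fraction held out as the validation split
)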

Best Answer

Using multiple GPUs

If you are developing on a system with a single GPU, you can simulate multiple GPUs with virtual devices. This makes it easy to test multi-GPU setups without requiring additional resources.

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Create 2 virtual GPUs with 1GB memory each
    try:
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=1024),
             tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPU,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)

Note: virtual devices cannot be modified once they have been initialized.

Once there are multiple logical GPUs available to the runtime, you can utilize them either with tf.distribute.Strategy or with manual placement.
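For completeness, a minimal sketch of manual placement; it assumes at least two logical GPUs are visible, and the tensor values are purely illustrative:

import tensorflow as tf

tf.debugging.set_log_device_placement(True)

# Place each input tensor on a specific logical GPU by hand.
with tf.device('/GPU:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
with tf.device('/GPU:1'):
    b = tf.constant([[1.0, 1.0], [2.0, 2.0]])

# TensorFlow copies inputs between devices as needed to run the matmul.
c = tf.matmul(a, b)
print(c)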

The best practice for using multiple GPUs is tf.distribute.Strategy. Here is a simple example:

tf.debugging.set_log_device_placement(True)
gpus = tf.config.list_logical_devices('GPU')
strategy = tf.distribute.MirroredStrategy(gpus)
with strategy.scope():
    inputs = tf.keras.layers.Input(shape=(1,))
    predictions = tf.keras.layers.Dense(1)(inputs)
    model = tf.keras.models.Model(inputs=inputs, outputs=predictions)
    model.compile(loss='mse',
                  optimizer=tf.keras.optimizers.SGD(learning_rate=0.2))

This program will run a copy of your model on each GPU, splitting the input data between them, which is also known as "data parallelism".
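To illustrate, here is one possible way to train the compiled model above; the synthetic data and hyperparameters are assumptions for demonstration only:

import numpy as np

# Synthetic regression data, purely for demonstration.
x = np.random.rand(1000, 1).astype(np.float32)
y = 3.0 * x + 2.0

# Under MirroredStrategy, each batch is split across the logical GPUs.
model.fit(x, y, epochs=2, batch_size=64)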

For more information, see the TensorFlow guides on distribution strategies and manual placement.

Regarding "python - How to perform multiple NN trainings?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/71017766/
