gpt4 book ai didi

python - 为什么我的 TensorFlow Convnet(尝试)训练会导致 NaN 梯度?

转载 作者:行者123 更新时间:2023-11-28 19:39:12 25 4
gpt4 key购买 nike

我修改了 MNIST (28x28) Convnet 教程代码以接受更大的图像 (150x150)。但是,当我尝试训练时,我收到此错误(完整堆栈跟踪请参见结尾):

W tensorflow/core/common_runtime/executor.cc:1076] 0x2e97d30 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values

这是我的代码。令人担忧的是,我在使用来自磁盘的图像数据时遇到了与在生成嘈杂的红色/蓝色/绿色方 block 并尝试按颜色对它们进行分类时一样的错误。生成 RGB 数据的代码不同于扫描 JPG 图像数据目录的代码。因此,要么是我加载自己的数据的方式存在系统性错误,要么是我提出的架构有问题。 (我可以包含这些模块,但我担心这会使这篇文章变得冗长难读。)

编辑:我已经尝试过使用较大图像 (30x30) 的相同代码,它确实有效。那么错误可能与 (150x150) 问题的高维度有关?

import tensorflow as tf
import numpy as np
import data.image_loader

###############################
##### Set hyperparameters #####
###############################

num_epochs = 2
width = 150
height = 150
num_categories = 2
num_channels = 3
batch_size = 100 # for my sanity
num_training_examples = 2000
num_test_examples = 200

num_batches = num_training_examples/batch_size

####################################################################################
##### It's convenient to define some methods to perform frequent routine tasks #####
####################################################################################

def weight_variable(shape):
'''
Generates a TensorFlow Tensor. This Tensor gets initialized with values sampled from the truncated normal
distribution. Its purpose will be to store model parameters.
:param shape: The dimensions of the desired Tensor
:return: The initialized Tensor
'''
initial = tf.truncated_normal(shape, stddev=0.1)
return tf.Variable(initial)

def bias_variable(shape):
'''
Generates a TensorFlow Tensor. This Tensor gets initialized with values sampled from <some?> distribution.
Its purpose will be to store bias values.
:param shape: The dimensions of the desired Tensor
:return: The initialized Tensor
'''
initial = tf.constant(0.1, shape=shape)
return tf.Variable(initial)

def conv2d(x, W):
'''
Generates a conv2d TensorFlow Op. This Op flattens the weight matrix (filter) down to 2D, then "strides" across the
input Tensor x, selecting windows/patches. For each little_patch, the Op performs a right multiply:
W . little_patch
and stores the result in the output layer of feature maps.
:param x: a minibatch of images with dimensions [batch_size, height, width, 3]
:param W: a "filter" with dimensions [window_height, window_width, input_channels, output_channels]
e.g. for the first conv layer:
input_channels = 3 (RGB)
output_channels = number_of_desired_feature_maps
:return: A TensorFlow Op that convolves the input x with the filter W.
'''
return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
'''
Genarates a max-pool TensorFlow Op. This Op "strides" a window across the input x. In each window, the maximum value
is selected and chosen to represent that region in the output Tensor. Hence the size/dimensionality of the problem
is reduced.
:param x: A Tensor with dimensions [batch_size, height, width, 3]
:return: A TensorFlow Op that max-pools the input Tensor, x.
'''
return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
strides=[1, 2, 2, 1], padding='SAME')


############################
##### Set up the model #####
############################

x = tf.placeholder("float", shape=[None, height, width, num_channels])
x_image = tf.reshape(x, [-1, width, height, num_channels])
y_ = tf.placeholder("float", shape=[None, num_categories])

#1st conv layer
W_conv1 = weight_variable([5, 5, num_channels, 32]) #5x5 conv window, 3 colour channels, 32 outputted feature maps
b_conv1 = bias_variable([32])
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

#2nd conv layer
W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

#fully connected layer
W_fc1 = weight_variable([38 * 38 * 64, 1024])
b_fc1 = bias_variable([1024])
h_pool2_flat = tf.reshape(h_pool2, [-1, 38*38*64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

#droupout
keep_prob = tf.placeholder("float")
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

#softmax output layer
W_fc2 = weight_variable([1024, num_categories])
b_fc2 = bias_variable([num_categories])
y_conv=tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)

#saving model
saver = tf.train.Saver()

###################################
##### Load data from the disk #####
###################################

dataset = data.image_loader.ImageLoad(base_path="/home/hal9000/Datasets/id_dataset3",
num_categories=num_categories,
width=width,
height=height)

data_training = np.asarray(np.split(dataset.data_training, num_batches))
labels_training = np.asarray(np.split(dataset.labels_training, num_batches))

data_test = np.split(dataset.data_test, 1)
labels_test = np.split(dataset.labels_test, 1)

####################################################
##### Train the model and evaluate performance #####
####################################################

with tf.Session() as sess:
cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv))
#train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)
train_step = tf.train.AdamOptimizer(0.0005).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
sess.run(tf.initialize_all_variables())
for j in range(num_epochs):
for i in range(num_batches):
train_step.run(feed_dict={x: np.asarray(data_training[i]), y_: np.asarray(labels_training[i]), keep_prob: 0.5})

print "=== EPOCH: " + str(j) + " ==="
print "test accuracy: %g \n"%accuracy.eval(feed_dict={x: data_test[i], y_: labels_test[i], keep_prob: 1.0})

saver.save(sess, "saved_models/convnet_image" + str(j) + ".ckpt")

错误:

I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 8
I tensorflow/core/common_runtime/direct_session.cc:58] Direct session inter op parallelism threads: 8
W tensorflow/core/common_runtime/executor.cc:1076] 0xc8991e0 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
[[Node: gradients/Relu_grad/Relu/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/cpu:0"](add)]]
W tensorflow/core/common_runtime/executor.cc:1076] 0xc8991e0 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
[[Node: gradients/Relu_1_grad/Relu_1/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/cpu:0"](add_1)]]
W tensorflow/core/common_runtime/executor.cc:1076] 0xc8991e0 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
[[Node: gradients/Relu_2_grad/Relu_2/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/cpu:0"](add_2)]]
Traceback (most recent call last):
File "/home/hal9000/PycharmProjects/TensorFlow_Experiments_0.4/neural_nets/image_convnet.py", line 137, in <module>
train_step.run(feed_dict={x: np.asarray(data_training[i]), y_: np.asarray(labels_training[i]), keep_prob: 0.5})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1325, in run
_run_using_default_session(self, feed_dict, self.graph, session)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2945, in _run_using_default_session
session.run(operation, feed_dict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 368, in run
results = self._do_run(target_list, unique_fetch_targets, feed_dict_string)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 444, in _do_run
e.code)
tensorflow.python.framework.errors.InvalidArgumentError: ReluGrad input is not finite. : Tensor had NaN values
[[Node: gradients/Relu_grad/Relu/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/cpu:0"](add)]]
Caused by op u'gradients/Relu_grad/Relu/CheckNumerics', defined at:
File "/home/hal9000/PycharmProjects/TensorFlow_Experiments_0.4/neural_nets/image_convnet.py", line 131, in <module>
train_step = tf.train.AdamOptimizer(0.0005).minimize(cross_entropy)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 186, in minimize
aggregation_method=aggregation_method)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 232, in compute_gradients
aggregation_method=aggregation_method)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients.py", line 445, in gradients
in_grads = _AsList(grad_fn(op_wrapper, *out_grads))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_grad.py", line 126, in _ReluGrad
t = _VerifyTensor(op.inputs[0], op.name, "ReluGrad input is not finite.")
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_grad.py", line 119, in _VerifyTensor
verify_input = array_ops.check_numerics(t, message=msg)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 48, in check_numerics
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 664, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1834, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1043, in __init__
self._traceback = _extract_stack()

...which was originally created as op u'Relu', defined at:
File "/home/hal9000/PycharmProjects/TensorFlow_Experiments_0.4/neural_nets/image_convnet.py", line 82, in <module>
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 547, in relu
return _op_def_lib.apply_op("Relu", features=features, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 664, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1834, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1043, in __init__
self._traceback = _extract_stack()


Process finished with exit code 1

最佳答案

一个可能的麻烦来源是 tf.log(y_conv),它将为 y_conv 中的任何零发出 NaN 值。 tf.nn.softmax_cross_entropy_with_logits() operator 提供了一个数值稳定(且更有效)的损失计算版本。以下应该工作得更好:

logits = tf.matmul(h_fc1_drop, W_fc2) + b_fc2
y_conv = tf.nn.softmax(logits)

cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits, y_)

correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(y_,1))

关于python - 为什么我的 TensorFlow Convnet(尝试)训练会导致 NaN 梯度?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/35106101/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com