
python - TensorFlow reading a CSV - what is the best approach


I have been trying different ways to read a CSV file with 97K rows, where each row contains 500 features (about 100 MB in total).

My first approach was to read all of the data into memory with numpy:

raw_data = genfromtxt(filename, dtype=numpy.int32, delimiter=',')

This command takes far too long to run, so I need to find a better way to read my file.
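For comparison, pandas can usually load a purely numeric CSV of this size into memory much faster than genfromtxt (the accepted answer below uses pandas as well). A minimal sketch, assuming an int32-only file; the filename is a placeholder:

import pandas

# Hedged sketch: pandas.read_csv with an explicit dtype is typically much
# faster than numpy.genfromtxt for a purely numeric ~100 MB file.
raw_data = pandas.read_csv("data.csv", header=None, dtype="int32").values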

My second approach was to follow this guide: https://www.tensorflow.org/programmers_guide/reading_data

The first thing I noticed is that each epoch takes much longer to run. Since I am using stochastic gradient descent, this can be explained by the fact that every batch needs to be read from the file.

Is there a way to optimize the second approach?

My code (second approach):

reader = tf.TextLineReader()
filename_queue = tf.train.string_input_producer([filename])
_, csv_row = reader.read(filename_queue) # read one line
data = tf.decode_csv(csv_row, record_defaults=rDefaults) # use defaults for this line (in case of missing data)

labels = data[0]
features = data[labelsSize:labelsSize+featuresSize]

# minimum number of elements in the queue after a dequeue, used to ensure
# that the samples are sufficiently mixed
# I think 10 times the BATCH_SIZE is sufficient
min_after_dequeue = 10 * batch_size

# the maximum number of elements in the queue
capacity = 20 * batch_size

# shuffle the data to generate BATCH_SIZE sample pairs
features_batch, labels_batch = tf.train.shuffle_batch(
    [features, labels], batch_size=batch_size, num_threads=10,
    capacity=capacity, min_after_dequeue=min_after_dequeue)

****

coordinator = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coordinator)

try:
    # And then after everything is built, start the training loop.
    for step in xrange(max_steps):
        global_step = step + offset_step
        start_time = time.time()

        # Run one step of the model. The return values are the activations
        # from the `train_op` (which is discarded) and the `loss` Op. To
        # inspect the values of your Ops or variables, you may include them
        # in the list passed to sess.run() and the value tensors will be
        # returned in the tuple from the call.
        _, __, loss_value, summary_str = sess.run([eval_op_train, train_op, loss_op, summary_op])

except tf.errors.OutOfRangeError:
    print('Done training -- epoch limit reached')
finally:
    coordinator.request_stop()

# Wait for threads to finish.
coordinator.join(threads)
sess.close()

Best answer

One solution is to convert the data to TensorFlow's binary format using TFRecords.

See TensorFlow Data Input (Part 1): Placeholders, Protobufs & Queues

And to convert a CSV file to TFRecords, take a look at this snippet:

import pandas
import tensorflow as tf

csv = pandas.read_csv("your.csv").values
with tf.python_io.TFRecordWriter("csv.tfrecords") as writer:
    for row in csv:
        features, label = row[:-1], row[-1]
        example = tf.train.Example()
        example.features.feature["features"].float_list.value.extend(features)
        example.features.feature["label"].int64_list.value.append(label)
        writer.write(example.SerializeToString())
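Once the data is in csv.tfrecords, the queue-based pipeline from the question can read the binary records instead of parsing CSV text on every step. A minimal sketch of the read side, assuming the TF 1.x queue API and the 500-feature layout from the question; batch_size and the thread/queue settings are assumptions:

import tensorflow as tf

batch_size = 128

# read serialized tf.train.Example records from the TFRecords file
filename_queue = tf.train.string_input_producer(["csv.tfrecords"])
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)

# parse each record back into a fixed-length feature vector and a label
parsed = tf.parse_single_example(
    serialized_example,
    features={
        "features": tf.FixedLenFeature([500], tf.float32),
        "label": tf.FixedLenFeature([], tf.int64),
    })

# shuffle and batch, mirroring the question's shuffle_batch setup
features_batch, labels_batch = tf.train.shuffle_batch(
    [parsed["features"], parsed["label"]],
    batch_size=batch_size, num_threads=10,
    capacity=20 * batch_size, min_after_dequeue=10 * batch_size)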

While this streams (very) large files from the local filesystem, in a more realistic use case they would come from remote storage such as AWS S3 or HDFS. The Gensim smart_open Python library might help there:

import smart_open

# stream lines from an S3 object
for line in smart_open.smart_open('s3://mybucket/mykey.txt'):
    print line
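The two ideas combine naturally: stream the CSV from remote storage line by line and write TFRecords as you go, so the whole file never has to fit in memory. A rough sketch, assuming an all-integer CSV with the label in the last column; the bucket and key names are made up:

import smart_open
import tensorflow as tf

with tf.python_io.TFRecordWriter("csv.tfrecords") as writer:
    # each line is one CSV row; parse it without loading the full file
    for line in smart_open.smart_open('s3://mybucket/data.csv'):
        values = [int(v) for v in line.strip().split(',')]
        features, label = values[:-1], values[-1]
        example = tf.train.Example()
        example.features.feature["features"].float_list.value.extend(features)
        example.features.feature["label"].int64_list.value.append(label)
        writer.write(example.SerializeToString())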

Regarding "python - TensorFlow reading a CSV - what is the best approach", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44052834/
