
python - Creating sequence examples with TensorFlow Transform

Reposted. Author: 太空宇宙. Updated: 2023-11-03 19:53:19

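With TensorFlow Transform, we can preprocess data using Apache Beam. One of the requirements when setting up such a pipeline is to define a DatasetMetadata object. It contains the schema, which holds the information needed to parse the data from its on-disk or in-memory format into tensors.

The official documentation gives an example of the following form: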

raw_data_metadata = dataset_metadata.DatasetMetadata(
    dataset_schema.from_feature_spec({
        's': tf.FixedLenFeature([], tf.string),
        'y': tf.FixedLenFeature([], tf.float32),
        'x': tf.FixedLenFeature([], tf.float32),
    }))

That is all fine if your raw data is a dictionary of the following form:

{
    's': 'example string',
    'y': 32.0,
    'x': 35.0
}
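For reference, here is a minimal sketch of what such a feature spec is used for: parsing a serialized tf.train.Example back into tensors. It assumes TensorFlow 2.x with eager execution, so the spec uses the tf.io.FixedLenFeature spelling rather than the tf.FixedLenFeature form shown above, and it does not involve TensorFlow Transform itself.

```python
import tensorflow as tf

# Feature spec mirroring the schema above (TF 2.x spelling, an assumption).
feature_spec = {
    's': tf.io.FixedLenFeature([], tf.string),
    'y': tf.io.FixedLenFeature([], tf.float32),
    'x': tf.io.FixedLenFeature([], tf.float32),
}

raw = {'s': 'example string', 'y': 32.0, 'x': 35.0}

# Encode the dictionary as a tf.train.Example, i.e. the on-disk format.
example = tf.train.Example(features=tf.train.Features(feature={
    's': tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[raw['s'].encode('utf-8')])),
    'y': tf.train.Feature(float_list=tf.train.FloatList(value=[raw['y']])),
    'x': tf.train.Feature(float_list=tf.train.FloatList(value=[raw['x']])),
}))

# The feature spec tells the parser which keys, dtypes and shapes to expect.
parsed = tf.io.parse_single_example(example.SerializeToString(), feature_spec)
```

Each entry in parsed is a scalar tensor, which is why a flat dictionary like the one above fits this kind of spec so easily.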

However, I am somewhat lost when it comes to defining a schema for a SequenceExample. More specifically, suppose my data has the following format:

{
    # context features
    'length': 5,
    # sequence features
    'tokens': [
        {
            'raw': 'The',
            'ner-tag': 'O'
        },
        {
            'raw': 'European',
            'ner-tag': 'B-org'
        },
        {
            'raw': 'Union',
            'ner-tag': 'I-org'
        },
        {
            'raw': 'is',
            'ner-tag': 'O'
        },
        {
            'raw': 'nice',
            'ner-tag': 'O'
        }
        ...
    ]
}

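Above I have a sentence that contains two sequences:

  • the ner-tag sequence, to be used as the model's labels
  • the raw sequence, to be used as the model's features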
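To make the two sequences concrete, the nested structure can be split into parallel lists with plain Python. This is only a sketch: split_sequences and build_tag_vocab are hypothetical helper names, and mapping tags to integer ids is an assumption about how the labels might later be encoded.

```python
sentence = {
    'length': 5,
    'tokens': [
        {'raw': 'The', 'ner-tag': 'O'},
        {'raw': 'European', 'ner-tag': 'B-org'},
        {'raw': 'Union', 'ner-tag': 'I-org'},
        {'raw': 'is', 'ner-tag': 'O'},
        {'raw': 'nice', 'ner-tag': 'O'},
    ],
}


def split_sequences(sentence):
    """Return the parallel (words, tags) lists for one sentence."""
    words = [token['raw'] for token in sentence['tokens']]
    tags = [token['ner-tag'] for token in sentence['tokens']]
    return words, tags


def build_tag_vocab(tags):
    """Assign integer ids to tags in order of first appearance."""
    vocab = {}
    for tag in tags:
        vocab.setdefault(tag, len(vocab))
    return vocab


words, tags = split_sequences(sentence)
tag_vocab = build_tag_vocab(tags)
tag_ids = [tag_vocab[tag] for tag in tags]
```

Both lists have one entry per token, so their length matches the 'length' context feature.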

How do I create a TFT data schema for examples like this?

The documentation is a bit thin on this. Any help is much appreciated!

Best answer

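OK, after more research, the answer is: you can't.

TensorFlow Transform does not support SequenceExample yet. See this.

For now, it seems the only way to do this is to have the Beam pipeline create the SequenceExamples, serialize them, and write them to TFRecords.

Given the sentence object structure above, you first need to create a Beam DoFn that converts each sentence into a serialized SequenceExample: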

import logging

import apache_beam as beam
import tensorflow as tf


class ConvertJSONSentenceToSerializedSequenceExample(beam.DoFn):

    def make_example(self, sentence):
        # the context features
        sentence_level_details = tf.train.Features(feature={
            'length': tf.train.Feature(int64_list=tf.train.Int64List(value=[sentence['length']]))
        })

        # create sequence data
        word_features = []
        ner_tags_features = []
        for token in sentence['tokens']:
            # create each of the features, then add them to the corresponding feature list
            word_feature = tf.train.Feature(bytes_list=tf.train.BytesList(value=[token['raw'].encode('utf-8')]))
            word_features.append(word_feature)

            # the tags in the sample data are strings, so they are stored as bytes here;
            # map them to integer ids first if you would rather use an int64_list
            ner_tag_feature = tf.train.Feature(bytes_list=tf.train.BytesList(value=[token['ner-tag'].encode('utf-8')]))
            ner_tags_features.append(ner_tag_feature)

        words = tf.train.FeatureList(feature=word_features)
        ner_tags = tf.train.FeatureList(feature=ner_tags_features)

        sentence_sequences = tf.train.FeatureLists(feature_list={
            'words': words,
            'ner-tags': ner_tags
        })

        ex = tf.train.SequenceExample(
            context=sentence_level_details,
            feature_lists=sentence_sequences
        )

        return ex

    def process(self, sentence, **kwargs):
        try:
            ex = self.make_example(sentence)
            yield ex.SerializeToString()
        except Exception as e:
            # skip sentences that cannot be converted instead of failing the pipeline
            logging.warning("JSON sentence could not be converted into SequenceExample: %s", e)
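To check what such a DoFn emits, a serialized SequenceExample can be parsed back with tf.io.parse_single_sequence_example. The following is a minimal round-trip sketch, assuming TensorFlow 2.x with eager execution and assuming the tags are stored as byte strings under 'ner-tags'. make_sequence_example is a hypothetical standalone helper that mirrors the structure built above, so the snippet runs without Beam.

```python
import tensorflow as tf


def bytes_feature(text):
    """Wrap a Python string in a bytes_list Feature."""
    return tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[text.encode('utf-8')]))


def make_sequence_example(words, tags):
    """Build a SequenceExample with a 'length' context and two feature lists."""
    context = tf.train.Features(feature={
        'length': tf.train.Feature(
            int64_list=tf.train.Int64List(value=[len(words)])),
    })
    feature_lists = tf.train.FeatureLists(feature_list={
        'words': tf.train.FeatureList(
            feature=[bytes_feature(w) for w in words]),
        'ner-tags': tf.train.FeatureList(
            feature=[bytes_feature(t) for t in tags]),
    })
    return tf.train.SequenceExample(
        context=context, feature_lists=feature_lists)


serialized = make_sequence_example(
    ['The', 'European', 'Union', 'is', 'nice'],
    ['O', 'B-org', 'I-org', 'O', 'O'],
).SerializeToString()

# Context features are parsed like ordinary Example features;
# sequence features get one entry per step of the feature list.
context_spec = {'length': tf.io.FixedLenFeature([], tf.int64)}
sequence_spec = {
    'words': tf.io.FixedLenSequenceFeature([], tf.string),
    'ner-tags': tf.io.FixedLenSequenceFeature([], tf.string),
}

context, sequences = tf.io.parse_single_sequence_example(
    serialized,
    context_features=context_spec,
    sequence_features=sequence_spec)
```

The two spec dictionaries play the role that the feature spec plays for a plain Example: one describes the per-sentence context, the other the per-token sequences.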

Once that is done, you can use the beam.io.tfrecordio module to write these serialized SequenceExamples to TFRecord file(s):

import os

import apache_beam as beam
from apache_beam.io import tfrecordio

# RUNNER, opts and OUTPUT_DIR are assumed to be defined elsewhere
with beam.Pipeline(RUNNER, options=opts) as p:
    (p
     ...
     | 'Convert sentences to serialized TensorFlow SequenceExamples' >> beam.ParDo(ConvertJSONSentenceToSerializedSequenceExample())
     | 'Write to TFRecord files' >> tfrecordio.WriteToTFRecord(
         os.path.join(OUTPUT_DIR, 'train'),
         file_name_suffix='.gz'
         # default coder is the BytesCoder, which will work since we have serialized the training data
     ))
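On the training side, the resulting files could then be read with tf.data. The sketch below assumes TensorFlow 2.x and gzip-compressed records. It writes one stand-in record with tf.io.TFRecordWriter instead of running the Beam pipeline, and the shard file name is a hypothetical placeholder for whatever the pipeline actually produces.

```python
import os
import tempfile

import tensorflow as tf

# Stand-in for one output shard of the pipeline (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), 'train-00000-of-00001.gz')


def bytes_feature(text):
    """Wrap a Python string in a bytes_list Feature."""
    return tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[text.encode('utf-8')]))


example = tf.train.SequenceExample(
    context=tf.train.Features(feature={
        'length': tf.train.Feature(int64_list=tf.train.Int64List(value=[2])),
    }),
    feature_lists=tf.train.FeatureLists(feature_list={
        'words': tf.train.FeatureList(
            feature=[bytes_feature('European'), bytes_feature('Union')]),
        'ner-tags': tf.train.FeatureList(
            feature=[bytes_feature('B-org'), bytes_feature('I-org')]),
    }),
)

# Write a single gzip-compressed record so there is something to read back.
options = tf.io.TFRecordOptions(compression_type='GZIP')
with tf.io.TFRecordWriter(path, options=options) as writer:
    writer.write(example.SerializeToString())

context_spec = {'length': tf.io.FixedLenFeature([], tf.int64)}
sequence_spec = {
    'words': tf.io.FixedLenSequenceFeature([], tf.string),
    'ner-tags': tf.io.FixedLenSequenceFeature([], tf.string),
}


def parse(serialized):
    """Parse one serialized SequenceExample into (context, sequences)."""
    return tf.io.parse_single_sequence_example(
        serialized,
        context_features=context_spec,
        sequence_features=sequence_spec)


dataset = tf.data.TFRecordDataset([path], compression_type='GZIP').map(parse)
records = list(dataset.as_numpy_iterator())
```

Since sentences differ in length, batching such a dataset would typically need padding (for example with padded_batch) rather than a plain batch call.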

For "python - Creating sequence examples with TensorFlow Transform", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59697286/
