
python - Padding sequence feature columns in TensorFlow

Reposted. Author: 行者123. Updated: 2023-11-30 08:52:56

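How do I pad sequences in a feature column, and what is the dimension argument in feature_column?

I am using TensorFlow 2.0 and implementing a text summarization example. I am fairly new to machine learning, deep learning and TensorFlow.

I came across feature_column and found them useful, since I think they can be embedded in the model's processing pipeline.

In the classic scenario without feature_column, I can preprocess the text, tokenize it, convert it into a sequence of numbers, and then pad the sequences to a maxlen of, say, 100 words. I cannot get this done when using feature_column.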
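For reference, the classic pipeline described above can be sketched in plain Python. This is only an illustration of the steps (tokenize, map to integers, pad to a fixed length); the helper names build_vocab and encode_and_pad are hypothetical, and a real project would more likely use the Keras Tokenizer and pad_sequences utilities.

```python
def build_vocab(texts):
    """Assign an integer ID to every distinct lowercase word; 0 is reserved for padding."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode_and_pad(text, vocab, maxlen, pad_id=0):
    """Tokenize on whitespace, map words to IDs, then truncate or post-pad to maxlen."""
    ids = [vocab[w] for w in text.lower().split() if w in vocab]
    ids = ids[:maxlen]
    return ids + [pad_id] * (maxlen - len(ids))


texts = ["A wiki is run using wiki software", "otherwise known as a wiki engine"]
vocab = build_vocab(texts)
padded = [encode_and_pad(t, vocab, maxlen=8) for t in texts]
print(padded)
```

Every row of padded has the same length, which is what a model with a fixed input shape expects.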

Below is what I have written so far.


train_dataset = tf.data.experimental.make_csv_dataset(
    'assets/train_dataset.csv', label_name=LABEL, num_epochs=1, shuffle=True,
    shuffle_buffer_size=10000, batch_size=1, ignore_errors=True)

vocabulary = ds.get_vocabulary()

def text_demo(feature_column):
    feature_layer = tf.keras.experimental.SequenceFeatures(feature_column)
    article, _ = next(iter(train_dataset.take(1)))

    tokenizer = tf_text.WhitespaceTokenizer()

    tokenized = tokenizer.tokenize(article['Text'])

    sequence_input, sequence_length = feature_layer({'Text': tokenized.to_tensor()})

    print(sequence_input)

def categorical_column(feature_column):
    dense_column = tf.keras.layers.DenseFeatures(feature_column)

    article, _ = next(iter(train_dataset.take(1)))

    lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(
        filters='')
    lang_tokenizer.fit_on_texts(article)

    tensor = lang_tokenizer.texts_to_sequences(article)

    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,
                                                           padding='post', maxlen=50)

    print(dense_column(tensor).numpy())


text_seq_vocab_list = tf.feature_column.sequence_categorical_column_with_vocabulary_list(key='Text', vocabulary_list=list(vocabulary))
text_embedding = tf.feature_column.embedding_column(text_seq_vocab_list, dimension=8)
text_demo(text_embedding)

numerical_voacb_list = tf.feature_column.categorical_column_with_vocabulary_list(key='Text', vocabulary_list=list(vocabulary))
embedding = tf.feature_column.embedding_column(numerical_voacb_list, dimension=8)
categorical_column(embedding)

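I am also confused about what to use here: sequence_categorical_column_with_vocabulary_list or categorical_column_with_vocabulary_list. SequenceFeatures is not explained in the documentation either, although I know it is an experimental feature.

I also cannot understand what the dimension parameter does.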

Best answer

Actually, this

I am also confused as to what to use here, sequence_categorical_column_with_vocabulary_list or categorical_column_with_vocabulary_list.

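should be the first question, because it affects how the title of the topic is interpreted.

Also, it is not quite clear what you mean by text summarization. What kind of model or layers are you going to pass the processed text to?

This matters, by the way, because tf.keras.layers.DenseFeatures and tf.keras.experimental.SequenceFeatures are intended for different network architectures and approaches.

As the documentation of the SequenceFeatures layer says, the output of a SequenceFeatures layer is meant to be fed into a sequence network, such as an RNN.

DenseFeatures produces a dense tensor as output and is therefore suitable for other types of networks.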

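Since you perform tokenization in your code snippet, you are going to use embeddings in your model. You then have two options:

  1. Pass the learned embeddings forward to a dense layer. This means you will not analyze word order.
  2. Pass the learned embeddings to convolutional, recurrent, average pooling or LSTM layers, so that word order is also used for learning.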
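The difference between the two options can be illustrated with a small plain-Python sketch. It assumes the first option averages the token vectors into one fixed-size vector (embedding_column is understood to use a mean combiner by default), while the second keeps one vector per token. The embedding values and the helper names embed_sequence and embed_mean are made up for illustration.

```python
# Hypothetical 2-dimensional embeddings for a toy vocabulary (made-up values).
EMBEDDINGS = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [1.0, 1.0],
}


def embed_sequence(tokens):
    """Option 2 style: keep one vector per token, in the original order."""
    return [EMBEDDINGS[t] for t in tokens]


def embed_mean(tokens):
    """Option 1 style: average the token vectors into one fixed-size vector."""
    vectors = embed_sequence(tokens)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]

# Averaging discards order: both sentences collapse to the same vector.
print(embed_mean(a) == embed_mean(b))
# The per-token sequences differ, so an order-aware layer can tell them apart.
print(embed_sequence(a) == embed_sequence(b))
```

This is why a model that needs to distinguish "dog bites man" from "man bites dog" has to receive the sequence output rather than the averaged one.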

The first option requires:

  • tf.keras.layers.DenseFeatures
  • one of tf.feature_column.categorical_column_*()
  • tf.feature_column.embedding_column()

The second option requires:

  • tf.keras.experimental.SequenceFeatures
  • one of tf.feature_column.sequence_categorical_column_*()
  • tf.feature_column.embedding_column()
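In both lists, the vocabulary-list column is the piece that turns token strings into integer IDs before the embedding lookup. A rough plain-Python sketch of that mapping is shown here. lookup_ids is a hypothetical helper, not a TensorFlow function; the -1 for unknown tokens is assumed to mirror the default_value of those columns, so check the API documentation for your TensorFlow version.

```python
def lookup_ids(tokens, vocabulary_list, default_value=-1):
    """Map each token to its position in vocabulary_list, or default_value if absent."""
    index = {word: i for i, word in enumerate(vocabulary_list)}
    return [index.get(t, default_value) for t in tokens]


vocabulary_list = ["wiki", "is", "run", "software"]
# The trailing empty string stands for a padding token.
tokens = ["wiki", "is", "run", "using", "wiki", "software", ""]

ids = lookup_ids(tokens, vocabulary_list)
print(ids)
```

In this sketch both the out-of-vocabulary word "using" and the empty padding string fall back to the default ID, since neither appears in the vocabulary list.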

Examples follow. The preprocessing and training parts are the same for both options:

import tensorflow as tf
print(tf.__version__)

from tensorflow import feature_column

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import text_to_word_sequence
import tensorflow.keras.utils as ku
from tensorflow.keras.utils import plot_model

import pandas as pd
from sklearn.model_selection import train_test_split

# Raw string, so the backslashes in the Windows path are not treated as escapes
DATA_PATH = r'C:\SoloLearnMachineLearning\Stackoverflow\TextDataset.csv'

# It is just a two-column CSV, like:
# text;label
# A wiki is run using wiki software;0
# otherwise known as a wiki engine.;1

dataframe = pd.read_csv(DATA_PATH, delimiter = ';')
dataframe.head()

# Preprocessing before feature_column includes
# - getting the vocabulary
# - tokenization, which here means only splitting into tokens.
#   Encoding sentences with the vocabulary will be done by feature_column!
# - padding
# - truncating

# Build vocabulary
vocab_size = 100
oov_tok = '<OOV>'

sentences = dataframe['text'].to_list()

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)

tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

# If word_index is shorter than the default value of vocab_size, keep the actual size
vocab_size = len(word_index)
print("vocab_size = word_index = ", len(word_index))

# Split sentences into tokens; here token = word.
# text_to_word_sequence() has a good default filter for
# characters, including basic punctuation, tabs, and newlines
dataframe['text'] = dataframe['text'].apply(text_to_word_sequence)

dataframe.head()

max_length = 6

# Padding and truncating sentences.
# Do that directly with strings, without using tokenizer.texts_to_sequences();
# the feature_column will convert strings into numbers.
# (x + N * [''])[:N] pads with empty strings and truncates to N tokens in one step
dataframe['text'] = dataframe['text'].apply(lambda x, N=max_length: (x + N * [''])[:N])
dataframe.head()

# Define a method to create a tf.data dataset from a Pandas DataFrame
def df_to_dataset(dataframe, label_column, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    #labels = dataframe.pop(label_column)
    labels = dataframe[label_column]

    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds

# Split dataframe into train and validation sets
train_df, val_df = train_test_split(dataframe, test_size=0.2)

print(len(train_df), 'train examples')
print(len(val_df), 'validation examples')

batch_size = 32
ds = df_to_dataset(dataframe, 'label',shuffle=False,batch_size=batch_size)

train_ds = df_to_dataset(train_df, 'label', shuffle=False, batch_size=batch_size)
val_ds = df_to_dataset(val_df, 'label', shuffle=False, batch_size=batch_size)

# and small batch for demo
example_batch = next(iter(ds))[0]
example_batch

# Helper methods to print example outputs for a defined feature_column

def demo(feature_column):
    feature_layer = tf.keras.layers.DenseFeatures(feature_column)
    print(feature_layer(example_batch).numpy())

def seqdemo(feature_column):
    sequence_feature_layer = tf.keras.experimental.SequenceFeatures(feature_column)
    print(sequence_feature_layer(example_batch))

Here is the first option, where we do not use word order for learning.

# Define a categorical column for our text feature,
# which is preprocessed into lists of tokens.
# Note that the key name should be the same as the original column name in the dataframe
text_column = feature_column.categorical_column_with_vocabulary_list(
    key='text',
    vocabulary_list=list(word_index))
# indicator_column produces one-hot encoding. This line is only for comparison with the embedding
#demo(feature_column.indicator_column(text_column))

# The dimension argument here is exactly the dimension of the space in which tokens
# will be represented during the model's learning;
# see the tutorial at https://www.tensorflow.org/beta/tutorials/text/word_embeddings
text_embedding = feature_column.embedding_column(text_column, dimension=8)
demo(text_embedding)  # demo() prints the layer output itself

# Then define the layers and the model itself.
# This example uses the Keras Functional API instead of Sequential, just for more generality

# Define a DenseFeatures layer to pass feature_columns into the Keras model
feature_layer = tf.keras.layers.DenseFeatures(text_embedding)

# Define inputs for each feature column.
# See https://github.com/tensorflow/tensorflow/issues/27416#issuecomment-502218673
feature_layer_inputs = {}

# Here we have just one column.
# It is important to define tf.keras.Input with a shape
# corresponding to the length of our sequence of words
feature_layer_inputs['text'] = tf.keras.Input(shape=(max_length,),
                                              name='text',
                                              dtype=tf.string)
print(feature_layer_inputs)

# Define the outputs of the DenseFeatures layer
# and actually use them as the first layer of the model
feature_layer_outputs = feature_layer(feature_layer_inputs)
print(feature_layer_outputs)

# Add the subsequent layers.
# See https://keras.io/getting-started/functional-api-guide/
x = tf.keras.layers.Dense(256, activation='relu')(feature_layer_outputs)
x = tf.keras.layers.Dropout(0.2)(x)

# This example supposes binary classification, as labels are 0 or 1
x = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.models.Model(inputs=[v for v in feature_layer_inputs.values()],
                              outputs=x)

model.summary()

# This example assumes binary classification, as labels are 0 or 1
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy']
              #run_eagerly=True
              )

# Note that the fit() method looks up features in train_ds and val_ds by the name given in
# tf.keras.Input(shape=(max_length,), name='text')

# This model will of course learn nothing, because the data is fake.

num_epochs = 5
history = model.fit(train_ds,
                    validation_data=val_ds,
                    epochs=num_epochs,
                    verbose=1
                    )

The second option is for when we care about word order and want the model to learn from it.

# Define a sequence categorical column for our text feature,
# which is preprocessed into lists of tokens.
# Note that the key name should be the same as the original column name in the dataframe
text_column = feature_column.sequence_categorical_column_with_vocabulary_list(
    key='text',
    vocabulary_list=list(word_index))

# The dimension argument here is exactly the dimension of the space in
# which tokens will be represented during the model's learning;
# see the tutorial at https://www.tensorflow.org/beta/tutorials/text/word_embeddings
text_embedding = feature_column.embedding_column(text_column, dimension=8)
seqdemo(text_embedding)  # seqdemo() prints the layer output itself

# Then define the layers and the model itself.
# This example uses the Keras Functional API instead of Sequential,
# just for more generality

# Define a SequenceFeatures layer to pass feature_columns into the Keras model
sequence_feature_layer = tf.keras.experimental.SequenceFeatures(text_embedding)

# Define inputs for each feature column.
# See https://github.com/tensorflow/tensorflow/issues/27416#issuecomment-502218673
sequence_feature_layer_inputs = {}

# Here we have just one column

sequence_feature_layer_inputs['text'] = tf.keras.Input(shape=(max_length,),
                                                       name='text',
                                                       dtype=tf.string)
print(sequence_feature_layer_inputs)

# Define the outputs of the SequenceFeatures layer
# and actually use them as the first layer of the model.

# Note that the SequenceFeatures layer produces a tuple of two tensors as output.
# We only need the first one to pass on.
sequence_feature_layer_outputs, _ = sequence_feature_layer(sequence_feature_layer_inputs)
print(sequence_feature_layer_outputs)

# Conv1D and MaxPooling1D will learn features from word order
x = tf.keras.layers.Conv1D(8,4)(sequence_feature_layer_outputs)
x = tf.keras.layers.MaxPooling1D(2)(x)
# Add the subsequent layers. See https://keras.io/getting-started/functional-api-guide/
x = tf.keras.layers.Dense(256, activation='relu')(x)
x = tf.keras.layers.Dropout(0.2)(x)

# This example supposes binary classification, as labels are 0 or 1
x = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.models.Model(inputs=[v for v in sequence_feature_layer_inputs.values()],
                              outputs=x)
model.summary()

# This example assumes binary classification, as labels are 0 or 1
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy']
              #run_eagerly=True
              )

# Note that the fit() method looks up features in train_ds and val_ds by the name given in
# tf.keras.Input(shape=(max_length,), name='text')

# This model will of course learn nothing, because the data is fake.

num_epochs = 5
history = model.fit(train_ds,
                    validation_data=val_ds,
                    epochs=num_epochs,
                    verbose=1
                    )

You can find the complete Jupyter notebook with this example on my GitHub:

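The dimension argument of feature_column.embedding_column() is exactly the dimension of the space in which tokens are represented while the model learns. See the tutorial at https://www.tensorflow.org/beta/tutorials/text/word_embeddings for a detailed explanation.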
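To make the role of dimension concrete, here is a minimal plain-Python sketch of an embedding table: one row per vocabulary entry, each row a vector of length dimension. The random initialisation range and the helper make_embedding_table are assumptions for illustration only; in TensorFlow the table is a trainable variable whose values are learned.

```python
import random


def make_embedding_table(vocab_size, dimension, seed=0):
    """Create a vocab_size x dimension table of small random numbers."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dimension)]
            for _ in range(vocab_size)]


vocab = ["wiki", "is", "run", "software"]
table = make_embedding_table(vocab_size=len(vocab), dimension=8)

# Looking up token IDs replaces each ID with its row of the table.
token_ids = [0, 3, 0]
vectors = [table[i] for i in token_ids]

print(len(table), len(table[0]))      # rows = vocabulary size, columns = dimension
print(len(vectors), len(vectors[0]))  # one vector of length `dimension` per token
```

A larger dimension gives each token more numbers to encode its meaning, at the cost of more parameters to train.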

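Also note that using feature_column.embedding_column() is an alternative to tf.keras.layers.Embedding(). As you can see, feature_column takes the encoding step out of the preprocessing pipeline, but you still have to split, pad and truncate the sentences manually.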
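The manual padding and truncation step can be isolated into a small helper. This sketch mirrors the lambda used in the preprocessing code of this answer; the name pad_or_truncate and the sample sentences are made up for illustration.

```python
def pad_or_truncate(tokens, max_length, pad_token=""):
    """Return exactly max_length tokens: pad short lists, cut long ones."""
    return (tokens + [pad_token] * max_length)[:max_length]


short_tokens = "A wiki engine".lower().split()
long_tokens = "A wiki is run using wiki software today".lower().split()

print(pad_or_truncate(short_tokens, 6))
print(pad_or_truncate(long_tokens, 6))
```

Because the tokens stay as strings, the conversion to integers is left entirely to the feature column.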

Regarding "python - Padding sequence feature columns in TensorFlow", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57346191/
