
word2vec - How do I get the word-embedding matrix from ft_word2vec (sparklyr package)?


I have another question in the word2vec area. I am using the "sparklyr" package, and in this package I call the ft_word2vec() function. I am having some trouble understanding the output: for every sentence/paragraph I pass to ft_word2vec(), I always get back the same number of vectors, even when I have more sentences/paragraphs than words. To me this looks like I am getting paragraph vectors rather than word vectors. Maybe a code example will help to illustrate my problem?

# add your spark_connection here as 'spark_connection = '

# create example data frame
FK_data = data.frame(sentences = c("This is my first sentence",
                                   "It is followed by the second sentence",
                                   "At the end there is the last sentence"))

# move the data to spark
sc_FK_data <- copy_to(spark_connection, FK_data, name = "FK_data", overwrite = TRUE)

# prepare data for ft_word2vec (sentences have to be tokenized [=list of words instead of one string in each row])
sc_FK_data <- ft_tokenizer(sc_FK_data, input_col = "sentences", output_col = "tokens")

# split data into training and test sets
partitions <- sc_FK_data %>%
  sdf_random_split(training = 0.7, test = 0.3, seed = 123456)
FK_train <- partitions$training
FK_test <- partitions$test

# given a training data set (FK_train) with a column "tokens" (each row = a list of strings)
mymodel = ft_word2vec(
  FK_train,
  input_col = "tokens",
  output_col = "word2vec",
  vector_size = 15,
  min_count = 1,
  max_sentence_length = 4444,
  num_partitions = 1,
  step_size = 0.1,
  max_iter = 10,
  seed = 123456,
  uid = random_string("word2vec_"))

# I tried to get the data from spark with:
myemb = mymodel %>% sparklyr::collect()
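
# note: myemb contains exactly one fixed-length vector per row in the
# "word2vec" column, no matter how many tokens the row has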

Has anyone had a similar experience? Can someone explain what the ft_word2vec() function returns? Do you have an example of how to obtain the word-embedding vectors with this function? Or does the returned column really contain paragraph vectors?

Best Answer

My colleague found the solution! Once you know how it works, the documentation really starts to make sense!

# add your spark_connection here as 'spark_connection = '

# create example data frame
FK_data = data.frame(sentences = c("This is my first sentence",
                                   "It is followed by the second sentence",
                                   "At the end there is the last sentence"))

# move the data to spark
sc_FK_data <- copy_to(spark_connection, FK_data, name = "FK_data", overwrite = TRUE)

# prepare data for ft_word2vec (sentences have to be tokenized [=list of words instead of one string in each row])
sc_FK_data <- ft_tokenizer(sc_FK_data, input_col = "sentences", output_col = "tokens")

# split data into training and test sets
partitions <- sc_FK_data %>%
  sdf_random_split(training = 0.7, test = 0.3, seed = 123456)
FK_train <- partitions$training
FK_test <- partitions$test

# CHANGES FOLLOW HERE:
# We have to pass the spark connection instead of the data. For me this was the
# confusing part, since I thought: no data -> no model.
# Maybe we can think of this step as initializing the estimator.
mymodel = ft_word2vec(
  spark_connection,
  input_col = "tokens",
  output_col = "word2vec",
  vector_size = 15,
  min_count = 1,
  max_sentence_length = 4444,
  num_partitions = 1,
  step_size = 0.1,
  max_iter = 10,
  seed = 123456,
  uid = random_string("word2vec_"))

# now that we have our estimator initialized, fitting it on the tokenized
# training data trains the model and computes the word embeddings
w2v_model = ml_fit(mymodel, FK_train)

# now we can collect the embedding vectors (one row per vocabulary word)
emb = w2v_model$vectors %>% collect()
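
As a follow-up, a minimal sketch of using the fitted model on data (assuming the w2v_model fitted above): Spark's Word2VecModel averages the word vectors of each row's tokens, so the transform output column contains one document vector per row. This is exactly the "paragraph vector" behaviour observed in the question.

# sketch, assuming w2v_model was fitted as above
# ml_transform() adds the "word2vec" column: one averaged vector per row
# (sentence/paragraph), not one vector per word
FK_embedded <- ml_transform(w2v_model, FK_train)
doc_vectors <- FK_embedded %>% sparklyr::collect()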

Regarding "word2vec - How do I get the word-embedding matrix from ft_word2vec (sparklyr package)?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/65040039/
