r - TidyText clustering


I want to cluster similar words using R and the tidytext package.

I have created my tokens and now want to convert them into a matrix so that I can cluster them. I would like to try several tokenization techniques and see which one produces the most compact clusters.

My code is below (taken from the widyr package documentation), but I can't work out the next step. Can anyone help?

library(janeaustenr)
library(dplyr)
library(tidytext)
library(widyr)  # provides pairwise_similarity()

# Comparing Jane Austen novels
austen_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word)  # word counts per book; supplies the n column used below

# closest books to each other
closest <- austen_words %>%
  pairwise_similarity(book, word, n) %>%
  arrange(desc(similarity))
I know how to build the clustering itself once I have closest. The code below would get me there, but I don't know how to go from the section above to the matrix m.
d <- dist(m)
kfit <- kmeans(d, 4, nstart=100)

Best Answer

You can create an appropriate matrix for this with one of the cast_ functions from tidytext, such as cast_sparse().
Let's use four example books and cluster the chapters within them:

library(tidyverse)
library(tidytext)
library(gutenbergr)
my_mirror <- "http://mirrors.xmission.com/gutenberg/"

books <- gutenberg_download(c(36, 158, 164, 345),
                            meta_fields = "title",
                            mirror = my_mirror)

books %>%
  count(title)
#> # A tibble: 4 x 2
#>   title                                     n
#> * <chr>                                 <int>
#> 1 Dracula                               15568
#> 2 Emma                                  16235
#> 3 The War of the Worlds                  6474
#> 4 Twenty Thousand Leagues under the Sea 12135

# break apart the chapters
by_chapter <- books %>%
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter ",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter)

glimpse(by_chapter)
#> Rows: 50,315
#> Columns: 3
#> $ gutenberg_id <int> 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, …
#> $ text         <chr> "CHAPTER ONE", "", "THE EVE OF THE WAR", "", "", "No one…
#> $ document     <chr> "The War of the Worlds_1", "The War of the Worlds_1", "T…

words_sparse <- by_chapter %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords(source = "smart")) %>%
  count(document, word, sort = TRUE) %>%
  cast_sparse(document, word, n)
#> Joining, by = "word"

class(words_sparse)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"
dim(words_sparse)
#> [1] 182 18124
The words_sparse object is a sparse matrix created by cast_sparse(). You can learn more about converting back and forth between tidy and non-tidy text formats in this chapter of Text Mining with R: https://www.tidytextmining.com/dtm.html
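For reference, cast_sparse() has siblings for other common non-tidy formats. A minimal sketch, assuming the tm and quanteda packages are installed (the chapter_counts, words_dtm, and words_dfm names are my own, not from the original answer):

# Same tidy counts as above, cast to two other matrix-like formats
chapter_counts <- by_chapter %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords(source = "smart")) %>%
  count(document, word, sort = TRUE)

words_dtm <- chapter_counts %>% cast_dtm(document, word, n)  # tm's DocumentTermMatrix
words_dfm <- chapter_counts %>% cast_dfm(document, word, n)  # quanteda's dfm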
Now that you have a matrix of word counts (i.e., a document-term matrix; you might also consider weighting by tf-idf instead of raw counts, as sketched after the cluster summary below), you can use kmeans(). How many chapters of each book end up clustered together?
kfit <- kmeans(words_sparse, centers = 4)

enframe(kfit$cluster, value = "cluster") %>%
  separate(name, into = c("title", "chapter"), sep = "_") %>%
  count(title, cluster) %>%
  arrange(cluster)
#> # A tibble: 8 x 3
#>   title                                 cluster     n
#>   <chr>                                   <int> <int>
#> 1 Dracula                                     1    26
#> 2 The War of the Worlds                       1     1
#> 3 Dracula                                     2    28
#> 4 Emma                                        2     9
#> 5 The War of the Worlds                       2    26
#> 6 Twenty Thousand Leagues under the Sea       2     9
#> 7 Twenty Thousand Leagues under the Sea       3    37
#> 8 Emma                                        4    46
Created on 2021-02-04 by the reprex package (v1.0.0)
Notice that one cluster is all Emma, one cluster is all Twenty Thousand Leagues under the Sea, and one cluster contains chapters from all four books.
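As mentioned above, weighting by tf-idf rather than raw counts can down-weight words that appear in nearly every chapter. A minimal sketch using tidytext's bind_tf_idf() (the words_tfidf and kfit_tfidf names are my own, not from the original answer):

# Weight by tf-idf instead of raw counts before clustering
words_tfidf <- by_chapter %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords(source = "smart")) %>%
  count(document, word, sort = TRUE) %>%
  bind_tf_idf(word, document, n) %>%   # adds tf, idf, and tf_idf columns
  cast_sparse(document, word, tf_idf)  # fill the matrix with tf-idf weights

kfit_tfidf <- kmeans(words_tfidf, centers = 4)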
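To see which words drive each cluster, you can sort the rows of the fitted model's centers matrix. Note that kmeans() uses random starting points, so exact assignments vary from run to run; a sketch (again my own, with the hypothetical names kfit_seeded and top_words):

set.seed(123)  # seed the random starts so the sketch is reproducible
kfit_seeded <- kmeans(words_sparse, centers = 4)

# top 10 words (largest center value) for each cluster
top_words <- lapply(seq_len(nrow(kfit_seeded$centers)), function(i) {
  head(sort(kfit_seeded$centers[i, ], decreasing = TRUE), 10)
})
top_words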

This Q&A on TidyText clustering is based on a question from Stack Overflow: https://stackoverflow.com/questions/66030942/
