
r - How to calculate the proximity of words to a specific term in a document

Reposted. Author: 行者123. Last updated: 2023-12-02 09:19:22

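I am trying to figure out a way to calculate word proximities to a specific term in a document, as well as the average proximity (by word). I know there are similar questions on SO, but nothing that gives me the answer I need or even points me somewhere helpful. Say I have the following text: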

song <- "Far over the misty mountains cold To dungeons deep and caverns old We 
must away ere break of day To seek the pale enchanted gold. The dwarves of
yore made mighty spells, While hammers fell like ringing bells In places deep,
where dark things sleep, In hollow halls beneath the fells. For ancient king
and elvish lord There many a gleaming golden hoard They shaped and wrought,
and light they caught To hide in gems on hilt of sword. On silver necklaces
they strung The flowering stars, on crowns they hung The dragon-fire, in
twisted wire They meshed the light of moon and sun. Far over the misty
mountains cold To dungeons deep and caverns old We must away, ere break of
day, To claim our long-forgotten gold. Goblets they carved there for
themselves And harps of gold; where no man delves There lay they long, and
many a song Was sung unheard by men or elves. The pines were roaring on the
height, The winds were moaning in the night. The fire was red, it flaming
spread; The trees like torches blazed with light. The bells were ringing in
the dale And men they looked up with faces pale; The dragon’s ire more fierce
than fire Laid low their towers and houses frail. The mountain smoked beneath
the moon; The dwarves they heard the tramp of doom. They fled their hall to
dying fall Beneath his feet, beneath the moon. Far over the misty mountains
grim To dungeons deep and caverns dim We must away, ere break of day,
To win our harps and gold from him!"

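I would like to see which words appear within 15 words on either side (15 to the left and 15 to the right) of the word "fire" every time it appears. Both the window size and the focus word should be interchangeable. For each word, I want the number of times it falls inside that 15-word span across all instances of "fire". For example, "fire" is used 3 times. In 2 of those 3 instances, the word "light" falls within 15 words on either side. I want a table that shows the word, the number of times it appears within the specified proximity of 15, the maximum distance (in this case 12), the minimum distance (7), and the average distance (9.5).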

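I figured I would need several steps and packages to make this work. My first thought was to use the "kwic" function from quanteda, since it lets you choose a "window" around a specific term. A frequency count of terms based on the kwic results is then not that hard (with stop words removed for the frequency count, but not for the word proximity measure). My real problem is finding the maximum, minimum, and average distances from the focus term, and then getting the results into a tidy table with the terms as rows in descending order of frequency and columns for the frequency count, maximum distance, minimum distance, and average distance.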

Here is what I have so far:

library(quanteda)
library(tm)

mysong <- char_tolower(song)

# quanteda >= 2.0 replaced the remove_hyphens argument with split_hyphens
toks <- tokens(mysong, split_hyphens = TRUE, remove_punct = TRUE,
               remove_numbers = TRUE, remove_symbols = TRUE)

mykwic <- kwic(toks, "fire", window = 15, valuetype ="fixed")
thekwic <- as.character(mykwic)

thekwic <- removePunctuation(thekwic)
thekwic <- removeNumbers(thekwic)
thekwic <- removeWords(thekwic, stopwords("en"))

kwicFreq <- termFreq(thekwic)

Any help is much appreciated.

Best answer

I'd suggest solving this with a combination of my tidytext and fuzzyjoin packages.

You can start by tokenizing the text into a data frame with one row per word, adding a position column, and then removing stop words:

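library(tidytext)
library(dplyr)

# tibble() replaces the deprecated data_frame()
all_words <- tibble(text = song) %>%
  unnest_tokens(word, text) %>%
  # number the words before filtering so distances count every original word
  mutate(position = row_number()) %>%
  filter(!word %in% tm::stopwords("en"))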
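One detail worth spelling out: position is assigned before the stop words are filtered out, so the distances computed later count every word of the original text, including the removed ones. Here is a minimal base R sketch of that ordering on a toy sentence. The splitting regex and the two-word toy_stopwords list are simplified stand-ins made up for illustration; they are not what unnest_tokens() or tm::stopwords() actually use.

```r
# Toy input; toy_stopwords is a hypothetical stand-in for a real stop word list
text <- "The dragon-fire in twisted wire"
toy_stopwords <- c("the", "in")

# Lowercase, then split on anything that is not a letter or an apostrophe
words <- strsplit(tolower(text), "[^a-z']+")[[1]]
words <- words[words != ""]

# Number every word first, then filter, so removed words still leave gaps
all_words_toy <- data.frame(word = words,
                            position = seq_along(words),
                            stringsAsFactors = FALSE)
all_words_toy <- all_words_toy[!all_words_toy$word %in% toy_stopwords, ]

print(all_words_toy)
```

Here "fire" keeps position 3 and "twisted" keeps position 5, so their distance is 2 even though the "in" between them was dropped.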

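Then you can find just the rows for the word fire, and use difference_inner_join() from fuzzyjoin to find all the rows within 15 words of those rows. After that, you can use group_by() and summarize() to get the desired statistics for each word.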

library(fuzzyjoin)

# pair every occurrence of "fire" with each word at most 15 positions away
nearby_words <- all_words %>%
  filter(word == "fire") %>%
  select(focus_term = word, focus_position = position) %>%
  difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 15) %>%
  mutate(distance = abs(focus_position - position))

# one row per nearby word, most frequent first
words_summarized <- nearby_words %>%
  group_by(word) %>%
  summarize(number = n(),
            maximum_distance = max(distance),
            minimum_distance = min(distance),
            average_distance = mean(distance)) %>%
  arrange(desc(number))
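To make explicit what the join and summary are doing, here is a rough base R equivalent on toy data. This is only a sketch: proximity_table() is a hypothetical helper name, the toy words and positions are invented, and it assumes the focus term occurs at least once.

```r
# Hypothetical helper: base R version of the windowed join plus summary
proximity_table <- function(words, positions, focus, window = 15) {
  focus_pos <- positions[words == focus]
  # every (focus occurrence, word) pair, like a cross join
  pairs <- expand.grid(focus_position = focus_pos, idx = seq_along(words))
  pairs$word <- words[pairs$idx]
  pairs$distance <- as.numeric(abs(pairs$focus_position - positions[pairs$idx]))
  # keep only pairs inside the window (what max_dist does in the join)
  pairs <- pairs[pairs$distance <= window, ]
  # collect the distances observed for each word
  dist_by_word <- split(pairs$distance, pairs$word)
  out <- data.frame(
    word = names(dist_by_word),
    number = vapply(dist_by_word, length, integer(1)),
    maximum_distance = vapply(dist_by_word, max, numeric(1)),
    minimum_distance = vapply(dist_by_word, min, numeric(1)),
    average_distance = vapply(dist_by_word, mean, numeric(1)),
    stringsAsFactors = FALSE, row.names = NULL
  )
  out[order(-out$number), ]
}

# Toy data: positions have gaps, as they would after stop word removal
toy_words <- c("fire", "light", "moon", "light", "fire", "gold")
toy_positions <- c(1, 3, 5, 7, 10, 14)

res <- proximity_table(toy_words, toy_positions, focus = "fire", window = 4)
print(res)
```

With a window of 4, "light" is matched twice (distance 2 from the first "fire", distance 3 from the second), giving an average distance of 2.5. The cross join here is quadratic in the worst case, which is fine for a short text like the song.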

The output in this case:

# A tibble: 49 × 5
       word number maximum_distance minimum_distance average_distance
      <chr>  <int>            <dbl>            <dbl>            <dbl>
1      fire      3                0                0              0.0
2     light      2               12                7              9.5
3      moon      2               13                9             11.0
4     bells      1               14               14             14.0
5   beneath      1               11               11             11.0
6    blazed      1               10               10             10.0
7    crowns      1                5                5              5.0
8      dale      1               15               15             15.0
9    dragon      1                1                1              1.0
10 dragon’s      1                5                5              5.0
# ... with 39 more rows

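Note that this approach also lets you run the analysis on multiple focus words at once. All you have to do is change filter(word == "fire") to filter(word %in% c("fire", "otherword")), and change group_by(word) to group_by(focus_term, word), where focus_term is the column created in the select() step.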
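As an illustration of the multi-term variant, here is a small sketch with the focus terms and the window size pulled out into variables, since the question asked for both to be interchangeable. The toy_words table, the choice of "gold" as a second focus term, and the window of 3 are all made up for the example; the pipeline itself is the same as above.

```r
library(dplyr)
library(fuzzyjoin)

# Toy token table standing in for all_words (hypothetical data)
toy_words <- tibble(
  word = c("fire", "light", "moon", "gold", "light", "fire"),
  position = c(1, 3, 5, 8, 11, 12)
)

focus_terms <- c("fire", "gold")  # focus words to analyse together
window <- 3                       # words on either side of each focus word

toy_summary <- toy_words %>%
  filter(word %in% focus_terms) %>%
  select(focus_term = word, focus_position = position) %>%
  difference_inner_join(toy_words, by = c(focus_position = "position"),
                        max_dist = window) %>%
  mutate(distance = abs(focus_position - position)) %>%
  # group by focus term as well, so each focus word gets its own statistics
  group_by(focus_term, word) %>%
  summarize(number = n(),
            maximum_distance = max(distance),
            minimum_distance = min(distance),
            average_distance = mean(distance),
            .groups = "drop") %>%
  arrange(focus_term, desc(number))

print(toy_summary)
```

Each focus term still matches itself at distance 0, as in the single-term output; add filter(distance > 0) before group_by() if you would rather drop those rows.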

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/44057639/
