
r - Counting keyword matches in specified categories

Reposted. Author: 行者123. Updated: 2023-12-05 05:43:19

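For a large-scale text analysis problem, I have one data frame containing words that belong to different categories, and another data frame containing a column of strings plus an (empty) count column for each category. I now want to take each individual string, check which of the defined words occur in it, and count them under the appropriate category.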

As a simplified example, given the two data frames below, I want to count how many animals of each type occur in each text cell.

library(tibble)

df_texts <- tibble(
  text = c("the ape and the fox", "the tortoise and the hare",
           "the owl and the the \n grasshopper"),
  mammals = NA,
  reptiles = NA,
  birds = NA,
  insects = NA
)

df_animals <- tibble(
  animals = c("ape", "fox", "tortoise", "hare", "owl", "grasshopper"),
  type = c("mammal", "mammal", "reptile", "mammal", "bird", "insect")
)

So the result I want is:

df_result <- tibble(
  text = c("the ape and the fox", "the tortoise and the hare",
           "the owl and the the \n grasshopper"),
  mammals = c(2, 1, 0),
  reptiles = c(0, 1, 0),
  birds = c(0, 0, 1),
  insects = c(0, 0, 1)
)

Is there a straightforward way to do this keyword matching and counting that would also work for a much larger dataset?

Thanks in advance!

Best answer

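Here is one way to handle it in the tidyverse. First check whether each string in df_texts$text contains each animal, then pivot to long format, attach each animal's type, and sum the matches by text and type.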

library(tidyverse)

# One logical column per animal: does the text contain that animal?
cbind(df_texts[, 1], sapply(df_animals$animals, grepl, df_texts$text)) %>%
  pivot_longer(-text, names_to = "animals") %>%
  # Attach the type of each animal
  left_join(df_animals, by = "animals") %>%
  group_by(text, type) %>%
  summarise(sum = sum(value), .groups = "drop") %>%
  pivot_wider(id_cols = text, names_from = type, values_from = sum)

  text                                  bird insect mammal reptile
  <chr>                                <int>  <int>  <int>   <int>
1 "the ape and the fox"                    0      0      2       0
2 "the owl and the the \n grasshopper"     1      0      0       0
3 "the tortoise and the hare"              0      0      1       1
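Note that grepl() only records whether a keyword is present at all, so a keyword that appears twice in the same text still contributes just 1 to the sum. A minimal sketch of the difference, using a made-up string that is not part of the question's data:

```r
library(stringr)

# Hypothetical example string (not from the question): "ape" occurs twice
text <- "the ape and the ape"

# grepl() answers "does the pattern occur?", so the result is a single TRUE
has_ape <- grepl("ape", text, fixed = TRUE)

# str_count() answers "how many times does the pattern occur?"
n_ape <- str_count(text, fixed("ape"))

sum(has_ape)  # presence contributes at most 1 per keyword
n_ape         # the number of occurrences
```

If repeated mentions should each be counted, the count-based variant is the one to use.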

To account for multiple occurrences of a keyword within each text:

# One integer column per animal: how often does it occur in the text?
cbind(df_texts[, 1], t(sapply(df_texts$text, str_count, df_animals$animals, USE.NAMES = FALSE))) %>%
  setNames(c("text", df_animals$animals)) %>%
  pivot_longer(-text, names_to = "animals") %>%
  # Attach the type of each animal
  left_join(df_animals, by = "animals") %>%
  group_by(text, type) %>%
  summarise(sum = sum(value), .groups = "drop") %>%
  pivot_wider(id_cols = text, names_from = type, values_from = sum)
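For larger data it may be worth skipping the long-format reshape and working on the count matrix directly. The following is only a sketch of that idea, not part of the original answer: it builds a text-by-keyword count matrix with str_count() and then sums the keyword columns by type with base R's rowsum(). The `\\b` word-boundary anchors are an added assumption (so that, for example, "ape" would not match inside "grape"), the third text is written on a single line for simplicity, and keywords containing regex metacharacters would need escaping first.

```r
library(stringr)

# Data from the question, as plain vectors (third text simplified to one line)
texts   <- c("the ape and the fox", "the tortoise and the hare",
             "the owl and the the grasshopper")
animals <- c("ape", "fox", "tortoise", "hare", "owl", "grasshopper")
type    <- c("mammal", "mammal", "reptile", "mammal", "bird", "insect")

# Count matrix: one row per text, one column per keyword.
# The \\b anchors (whole-word matching) are an assumption, not in the original.
counts <- sapply(animals, function(a) str_count(texts, paste0("\\b", a, "\\b")))

# rowsum() sums rows by group, so transpose to group the keyword columns
# by type, then transpose back to one row per text.
by_type <- t(rowsum(t(counts), group = type))

result <- data.frame(text = texts, by_type, check.names = FALSE)
result
```

Because str_count() is vectorised over the texts, this loops only over the keywords, which is usually the much shorter dimension.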

Regarding "r - Counting keyword matches in specified categories", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/71871613/
