
r - Computing a frequency list of utterance-final words in variable-length utterances

Reposted. Author: 行者123. Updated: 2023-12-01 23:24:56

I have a large data frame containing utterances of variable size:

df <- structure(list(size = c(2, 2, 3, 3, 4, 4, 3, 3), 
w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
w2 = c("on","that", "i", "not", "'s", "thanks", "super", "she"),
w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
w4 = c(NA,NA, NA, NA, "on", "can", NA, NA)),
row.names = c(NA, -8L), class = "data.frame")

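I want to compare the utterance-initial words in w1 with all the utterance-final words in the other w columns, using frequency lists with counts and proportions. I can compute the frequency list for the utterance-initial words: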

library(dplyr)
df %>%
group_by(w1) %>%
summarise(n = n()) %>%
mutate(prop = n / sum(n)) %>%
arrange(desc(prop))
# A tibble: 5 x 3
w1 n prop
<chr> <int> <dbl>
1 well 3 0.375
2 er 2 0.25
3 come 1 0.125
4 she 1 0.125
5 why 1 0.125

But how can I compute the list for the utterance-final words, given that they sit in different w columns?

Expected:

# A tibble: 5 x 3
w_last n prop
<chr> <int> <dbl>
1 can 3 0.375
2 on 2 0.25
3 cool 1 0.125
4 that 1 0.125
5 today 1 0.125

In the end, I found another solution myself:

df %>%
mutate(w_last = c(apply(., 1, function(x) tail(na.omit(x), 1)))) %>%
group_by(w_last) %>%
summarise(n = n()) %>%
mutate(prop = n / sum(n)) %>%
arrange(desc(prop))
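
The same lookup can also be sketched in base R with matrix indexing, which avoids the row-by-row apply() call. This is only an illustration, not part of the original question: it assumes that size is always the position of the last word in each row and that the word columns are named w1, w2, and so on. The names word_cols, final_word and freq are made up for the example.

```r
# Sample data from the question
df <- data.frame(
  size = c(2, 2, 3, 3, 4, 4, 3, 3),
  w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
  w2 = c("on", "that", "i", "not", "'s", "thanks", "super", "she"),
  w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
  w4 = c(NA, NA, NA, NA, "on", "can", NA, NA),
  stringsAsFactors = FALSE
)

# Assumption: the word columns are w1 ... w<max size>, in that order
word_cols <- paste0("w", seq_len(max(df$size)))
words <- as.matrix(df[word_cols])

# Index the matrix with (row, column) pairs: for each row, take column `size`
final_word <- words[cbind(seq_len(nrow(df)), df$size)]

# Counts and proportions, most frequent first
tab <- sort(table(final_word), decreasing = TRUE)
freq <- data.frame(
  w_last = names(tab),
  n = as.vector(tab),
  prop = as.vector(tab) / sum(tab)
)
print(freq)
```

If size could ever disagree with the number of non-NA words, the apply()/na.omit() version above would be the safer choice, since it does not rely on size at all.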

Best answer

Three approaches in tidyverse-style syntax

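1 You can extract final_word into a separate column and build a proportion table on it. (The extraction needs only dplyr; janitor::tabyl() does the tabulation.)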

df %>% rowwise() %>%
mutate(final_word = get(paste0('w', size))) %>%
janitor::tabyl(final_word)

final_word n percent
can 3 0.375
cool 1 0.125
on 2 0.250
that 1 0.125
today 1 0.125
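
A variant of the same idea that avoids both rowwise() and get() is dplyr's coalesce(), which returns the first non-missing value among its arguments. Listing the word columns from last to first therefore yields the final word of each utterance. This is a sketch under two assumptions that are not stated in the original answer: unused trailing slots are always NA, and the word columns are exactly w1 to w4. The name final_counts is made up for the example.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  size = c(2, 2, 3, 3, 4, 4, 3, 3),
  w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
  w2 = c("on", "that", "i", "not", "'s", "thanks", "super", "she"),
  w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
  w4 = c(NA, NA, NA, NA, "on", "can", NA, NA),
  stringsAsFactors = FALSE
)

final_counts <- df %>%
  # coalesce() scans w4, then w3, ... and keeps the first non-NA value per row
  mutate(final_word = coalesce(w4, w3, w2, w1)) %>%
  # count() tallies each final word; sort = TRUE puts the largest n first
  count(final_word, sort = TRUE) %>%
  mutate(prop = n / sum(n))

print(final_counts)
```

Because coalesce() is vectorised over whole columns, this should scale better to a large data frame than a row-wise lookup, at the cost of hard-coding the column names.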

2 Reshape the data slightly.

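  • Pivot to long format.
  • Keep only the rows where size matches the word number (the digit in the column name).
  • Use janitor::tabyl() to generate your proportion table (janitor can format it further in useful ways).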
df <- structure(list(size = c(2, 2, 3, 3, 4, 4, 3, 3), 
w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
w2 = c("on","that", "i", "not", "'s", "thanks", "super", "she"),
w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
w4 = c(NA,NA, NA, NA, "on", "can", NA, NA)),
row.names = c(NA, -8L), class = "data.frame")


df
#> size w1 w2 w3 w4
#> 1 2 come on <NA> <NA>
#> 2 2 why that <NA> <NA>
#> 3 3 er i can <NA>
#> 4 3 well not today <NA>
#> 5 4 she 's going on
#> 6 4 well thanks they can
#> 7 3 er super cool <NA>
#> 8 3 well she can <NA>
library(tidyverse)
library(janitor)

df %>% pivot_longer(!size, values_drop_na = TRUE) %>%
filter(as.numeric(substr(name, 2, nchar(name))) == size) %>%
janitor::tabyl(value)
#> value n percent
#> can 3 0.375
#> cool 1 0.125
#> on 2 0.250
#> that 1 0.125
#> today 1 0.125

Created on 2021-05-06 by the reprex package (v2.0.0)
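
If the data had no reliable size column, the long format could still be used by tagging each utterance with an id and keeping the last remaining row per id. The following is a hedged sketch of that variation, not the answerer's code: utterance_id, position and word are names introduced for illustration, and it assumes the word columns match the pattern w followed by digits and appear in positional order.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df <- data.frame(
  size = c(2, 2, 3, 3, 4, 4, 3, 3),
  w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
  w2 = c("on", "that", "i", "not", "'s", "thanks", "super", "she"),
  w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
  w4 = c(NA, NA, NA, NA, "on", "can", NA, NA),
  stringsAsFactors = FALSE
)

final_counts <- df %>%
  # One id per utterance so rows can be regrouped after pivoting
  mutate(utterance_id = row_number()) %>%
  # Long format: one row per non-missing word, in column order within each utterance
  pivot_longer(matches("^w[0-9]+$"),
               names_to = "position", values_to = "word",
               values_drop_na = TRUE) %>%
  group_by(utterance_id) %>%
  # The last remaining row of each utterance holds its final word
  slice_tail(n = 1) %>%
  ungroup() %>%
  count(word, sort = TRUE) %>%
  mutate(prop = n / sum(n))

print(final_counts)
```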


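3 Incidentally, you can also right-align the sequence so that the final word of every utterance lands in the last column, and then count the words in that column, using the unite/separate combination from tidyr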

df %>% unite('W', starts_with('w'), sep = '=', na.rm = TRUE, remove = TRUE) %>%
separate(W, into = paste0('w', seq_len(1 + max(str_count(.$W, '=')))), fill = 'left', sep = '=')

size w1 w2 w3 w4
1 2 <NA> <NA> come on
2 2 <NA> <NA> why that
3 3 <NA> er i can
4 3 <NA> well not today
5 4 she 's going on
6 4 well thanks they can
7 3 <NA> er super cool
8 3 <NA> well she can
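
The output above stops at the right-aligned data frame. One way the remaining counting step might look is sketched below. It splits the pipeline in two so that the number of word columns can be computed without the `.` placeholder. The names united, n_words, last_col, aligned and final_counts are introduced for illustration, and the sketch assumes that no word contains the '=' separator.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Sample data from the question
df <- data.frame(
  size = c(2, 2, 3, 3, 4, 4, 3, 3),
  w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
  w2 = c("on", "that", "i", "not", "'s", "thanks", "super", "she"),
  w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
  w4 = c(NA, NA, NA, NA, "on", "can", NA, NA),
  stringsAsFactors = FALSE
)

# Step 1: glue the non-missing words of each row into one string
united <- df %>%
  unite("W", starts_with("w"), sep = "=", na.rm = TRUE)

# Longest utterance = maximum number of separators + 1
n_words <- 1 + max(str_count(united$W, "="))
last_col <- paste0("w", n_words)

# Step 2: split again, padding short utterances with NA on the left
aligned <- united %>%
  separate(W, into = paste0("w", seq_len(n_words)), sep = "=", fill = "left")

# Step 3: every final word now sits in the last column, so count that column
final_counts <- aligned %>%
  mutate(final_word = .data[[last_col]]) %>%
  count(final_word, sort = TRUE) %>%
  mutate(prop = n / sum(n))

print(final_counts)
```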

Regarding "r - Computing a frequency list of utterance-final words in variable-length utterances", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/67417178/
