
r - Improve performance of computing the sum of word scores over a large vector of strings?


I have a character vector that looks like this:

 [1] "What can we learn from the Mahabharata "                                                                
[2] "What are the most iconic songs associated with the Vietnam War "
[3] "What are some major social faux pas to avoid when visiting Malta "
[4] "Will Ready Boost technology contribute to CFD software usage "
[5] "Who is Jon Snow " ...

and a data frame that assigns a score to each word:
 word score
  the    11
   to     9
 What     9
    I     7
    a     6
  are     6

I want to assign to each of my strings the sum of the scores of the words it contains. My solution is the following function:
score_fun <- function(x) {
  # obtain the vector of words in the string
  z <- unlist(strsplit(x, ' '))
  # return the sum of the matching words' scores
  sum(word_scores$score[word_scores$word %in% z])
}

# apply the function to each string with sapply()
scores <- sapply(my_strings, score_fun, USE.NAMES = FALSE)

# the output will look like
scores
[1] 20 26 24 9 0 0 38 32 30 0

The problem I have is performance: with roughly 500,000 strings and over a million words, the function takes more than an hour on my i7 machine with 16 GB of RAM.
Besides, the solution just feels inelegant and clunky.

Is there a better (more efficient) solution?

Data to reproduce:
 my_strings <- c("What can we learn from the Mahabharata ", "What are the most iconic songs associated with the Vietnam War ", 
"What are some major social faux pas to avoid when visiting Malta ",
"Will Ready Boost technology contribute to CFD software usage ",
"Who is Jon Snow ", "Do weighing scales measure mass or weight ",
"What will happen to the money in foreign banks after demonetizing 500 and 1000 rupee notes ",
"Is it mandatory to stay for 11 months in a rented house if the rental agreement was made for 11 months ",
"What are some really good positive comments to say on a cricket field to your teammates ",
"Is Donald Trump fact free ")


word_scores <- data.frame(word = c("the", "to", "What", "I", "a", "are", "in", "of", "and", "do"),
                          score = c(11L, 9L, 9L, 7L, 6L, 6L, 6L, 6L, 3L, 3L),
                          stringsAsFactors = FALSE)

Best answer

You can use tidytext::unnest_tokens to tokenize into words, then join and aggregate:

library(tidyverse)
library(tidytext)

data_frame(string = my_strings, id = seq_along(string)) %>%
  unnest_tokens(word, string, 'words', to_lower = FALSE) %>%
  distinct() %>%
  left_join(word_scores) %>%
  group_by(id) %>%
  summarise(score = sum(score, na.rm = TRUE))

#> # A tibble: 10 × 2
#>       id score
#>    <int> <int>
#>  1     1    20
#>  2     2    26
#>  3     3    24
#>  4     4     9
#>  5     5     0
#>  6     6     0
#>  7     7    38
#>  8     8    32
#>  9     9    30
#> 10    10     0

You can keep the original strings if you like, or rejoin them by ID at the end.
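
For example, a minimal sketch of that rejoin (the scored name below is only for illustration and is not part of the original answer):

# store the aggregated scores (the pipeline above) in `scored`, then
# rejoin them to the original strings by id
scored <- data_frame(string = my_strings, id = seq_along(string)) %>%
  unnest_tokens(word, string, 'words', to_lower = FALSE) %>%
  distinct() %>%
  left_join(word_scores, by = "word") %>%
  group_by(id) %>%
  summarise(score = sum(score, na.rm = TRUE))

data_frame(string = my_strings, id = seq_along(string)) %>%
  left_join(scored, by = "id")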

On small data it is much slower, but it gets faster at scale, e.g. when my_strings is resampled to a length of 10,000:

Unit: milliseconds
     expr        min         lq      mean     median        uq       max neval
   Reduce 5440.03300 5656.41350 5815.2094 5814.0406 5944.9969 6206.2502   100
   sapply  460.75930  486.94336  511.2762  503.4932  532.2363  746.8376   100
 tidytext   86.92182   94.65745  101.7064  100.1487  107.3289  134.7276   100
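
The Reduce-based variant from the thread is not reproduced here, but as a rough sketch the other two timings could be obtained with the microbenchmark package along these lines (the big_strings name and the resampling step are assumptions, not part of the original answer):

library(microbenchmark)
library(tidyverse)
library(tidytext)

# resample the example strings to length 10,000
set.seed(1)
big_strings <- sample(my_strings, 10000, replace = TRUE)

microbenchmark(
  sapply = sapply(big_strings, score_fun, USE.NAMES = FALSE),
  tidytext = data_frame(string = big_strings, id = seq_along(string)) %>%
    unnest_tokens(word, string, 'words', to_lower = FALSE) %>%
    distinct() %>%
    left_join(word_scores, by = "word") %>%
    group_by(id) %>%
    summarise(score = sum(score, na.rm = TRUE)),
  times = 100
)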

Regarding "r - Improve performance of computing the sum of word scores over a large vector of strings?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43565864/
