r - Improve the performance of calculating the sum of word scores over a large vector of strings?

Reposted by: 行者123 · Updated: 2023-12-04 09:20:32


I have a character vector of strings that looks like this:

 [1] "What can we learn from the Mahabharata "                                                                
 [2] "What are the most iconic songs associated with the Vietnam War "                                        
 [3] "What are some major social faux pas to avoid when visiting Malta "                                      
 [4] "Will Ready Boost technology contribute to CFD software usage "                                          
 [5] "Who is Jon Snow " ...

and a data frame that assigns a score to each word:

   word score
   the    11
    to     9
  What     9
     I     7
     a     6
   are     6

I want to assign to each string the sum of the scores of the words it contains. My solution is the following function:

 # Split one string into words and sum the scores of the words it contains
 score_fun <- function(x) {
   z <- unlist(strsplit(x, ' '))
   sum(word_scores$score[word_scores$word %in% z])
 }

 # Apply the function to every string with sapply()
 scores <- sapply(my_strings, score_fun, USE.NAMES = FALSE)

 # The output looks like this
 scores
 #> [1] 20 26 24  9  0  0 38 32 30  0

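The problem is performance: I have about 500,000 strings and more than a million words, and running the function on my i7 machine with 16 GB of RAM takes over an hour.
Besides, the solution just feels inelegant and clunky.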

Is there a better (more efficient) solution?

Data to reproduce:

 my_strings <- c("What can we learn from the Mahabharata ", "What are the most iconic songs associated with the Vietnam War ", 
"What are some major social faux pas to avoid when visiting Malta ", 
"Will Ready Boost technology contribute to CFD software usage ", 
"Who is Jon Snow ", "Do weighing scales measure mass or weight ", 
"What will happen to the money in foreign banks after demonetizing 500 and 1000 rupee notes ", 
"Is it mandatory to stay for 11 months in a rented house if the rental agreement was made for 11 months ", 
"What are some really good positive comments to say on a cricket field to your teammates ", 
"Is Donald Trump fact free ")


word_scores <- data.frame(
  word = c("the", "to", "What", "I", "a", "are", "in", "of", "and", "do"),
  score = c(11L, 9L, 9L, 7L, 6L, 6L, 6L, 6L, 3L, 3L),
  stringsAsFactors = FALSE
)

Best answer

You can use tidytext::unnest_tokens to tokenize the strings into words, then join and aggregate:

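library(tidyverse)
library(tidytext)

# tibble() replaces the deprecated data_frame()
tibble(string = my_strings, id = seq_along(string)) %>% 
    # one row per word, keeping the original case
    unnest_tokens(word, string, 'words', to_lower = FALSE) %>% 
    # count each word at most once per string
    distinct() %>%
    left_join(word_scores, by = "word") %>% 
    group_by(id) %>%
    # words without a score are NA after the join and are ignored
    summarise(score = sum(score, na.rm = TRUE))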

#> # A tibble: 10 × 2
#>       id score
#>    <int> <int>
#> 1      1    20
#> 2      2    26
#> 3      3    24
#> 4      4     9
#> 5      5     0
#> 6      6     0
#> 7      7    38
#> 8      8    32
#> 9      9    30
#> 10    10     0
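
The distinct() step matters because the question's `%in%` test counts each scored word at most once per string. A minimal base R sketch of the difference, using the ninth string, in which "to" appears twice (the helper `lookup` and the variable names are illustrative, not part of the answer):

```r
word_scores <- data.frame(
  word = c("the", "to", "What", "I", "a", "are"),
  score = c(11L, 9L, 9L, 7L, 6L, 6L),
  stringsAsFactors = FALSE
)

x <- "What are some really good positive comments to say on a cricket field to your teammates "
z <- strsplit(x, " ", fixed = TRUE)[[1]]

# Hypothetical helper: look up each word's score and sum, ignoring unscored words
lookup <- function(words) {
  sum(word_scores$score[match(words, word_scores$word)], na.rm = TRUE)
}

score_in     <- sum(word_scores$score[word_scores$word %in% z])  # the question's method
score_all    <- lookup(z)          # every occurrence counted: "to" is added twice
score_unique <- lookup(unique(z))  # each word counted once, like distinct()

c(score_in, score_all, score_unique)
#> [1] 30 39 30
```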

If you like, you can keep the original strings, or join them back by ID at the end.
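
Joining the strings back by ID could look like the sketch below. It assumes the dplyr and tidytext packages are installed; the names `strings_tbl`, `scores` and `result` are made up for illustration, and only three of the strings are used to keep it short.

```r
library(dplyr)
library(tidytext)

my_strings <- c("What can we learn from the Mahabharata ",
                "Who is Jon Snow ",
                "Will Ready Boost technology contribute to CFD software usage ")

word_scores <- data.frame(
  word = c("the", "to", "What", "I", "a", "are", "in", "of", "and", "do"),
  score = c(11L, 9L, 9L, 7L, 6L, 6L, 6L, 6L, 3L, 3L),
  stringsAsFactors = FALSE
)

# Keep the id/string pairs so they can be joined back later
strings_tbl <- tibble(id = seq_along(my_strings), string = my_strings)

# Same pipeline as in the answer: tokenize, deduplicate, join, aggregate
scores <- strings_tbl %>%
  unnest_tokens(word, string, token = "words", to_lower = FALSE) %>%
  distinct() %>%
  left_join(word_scores, by = "word") %>%
  group_by(id) %>%
  summarise(score = sum(score, na.rm = TRUE))

# Re-attach the original strings by id
result <- left_join(strings_tbl, scores, by = "id")
result
```

The result has one row per original string with the columns id, string and score.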

On small data it is much slower, but it becomes faster at scale, e.g. when my_strings is resampled to a length of 10,000:

Unit: milliseconds
     expr        min         lq      mean    median        uq       max neval
   Reduce 5440.03300 5656.41350 5815.2094 5814.0406 5944.9969 6206.2502   100
   sapply  460.75930  486.94336  511.2762  503.4932  532.2363  746.8376   100
 tidytext   86.92182   94.65745  101.7064  100.1487  107.3289  134.7276   100
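
For comparison, the same tokenize, look up and aggregate idea can be sketched in base R alone. This is not one of the benchmarked methods; `score_vec` is a hypothetical name, and the sketch assumes each word appears only once in word_scores and that words are separated by single spaces. It replaces the per-string `%in%` scan of the whole score table with a single match() call over all tokens.

```r
my_strings <- c("What can we learn from the Mahabharata ",
                "What are the most iconic songs associated with the Vietnam War ",
                "Who is Jon Snow ",
                "Is Donald Trump fact free ")

word_scores <- data.frame(
  word = c("the", "to", "What", "I", "a", "are", "in", "of", "and", "do"),
  score = c(11L, 9L, 9L, 7L, 6L, 6L, 6L, 6L, 3L, 3L),
  stringsAsFactors = FALSE
)

# The question's per-string function, for reference
score_fun <- function(x) {
  z <- unlist(strsplit(x, " "))
  sum(word_scores$score[word_scores$word %in% z])
}

# Hypothetical vectorised version: one match() call for all tokens
score_vec <- function(strings, word_scores) {
  # split every string, keeping each word once per string
  tokens <- lapply(strsplit(strings, " ", fixed = TRUE), unique)
  id <- rep(seq_along(tokens), lengths(tokens))
  s <- word_scores$score[match(unlist(tokens), word_scores$word)]
  s[is.na(s)] <- 0L                      # unscored words contribute nothing
  out <- integer(length(strings))        # strings without tokens stay at 0
  sums <- tapply(s, id, sum)
  out[as.integer(names(sums))] <- sums
  out
}

# Resample to 10,000 strings, as in the benchmark, and compare the two
set.seed(1)
big <- sample(my_strings, 10000, replace = TRUE)
system.time(res_sapply <- sapply(big, score_fun, USE.NAMES = FALSE))
system.time(res_vec <- score_vec(big, word_scores))
all.equal(res_sapply, res_vec)
```

With a ten-word score table the timings say little; any gain would only be expected to show with a score table of realistic size, such as the million words mentioned in the question.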

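Regarding "r - Improve the performance of calculating the sum of word scores over a large vector of strings?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43565864/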
