R: Out of memory, how to loop over rows?


Reposted by: 行者123 · Updated: 2023-12-01 10:54:22


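I have a data frame (myDF) with more than 700,000 rows, each with two columns, id and text. The text is a 140-character string (a tweet), and I want to run a sentiment analysis that I found online on it. However, whatever I try, I run into memory problems on a MacBook with 4 GB of RAM.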

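I was thinking that maybe I could loop over the rows, e.g. process the first 10, then the second 10, and so on. (I run into problems even with batches of 100.) Would this solve the problem? What is the best way to loop like this?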

I am posting my code here:

library(plyr)
library(stringr)

# function score.sentiment
score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
   # Parameters
   # sentences: vector of text to score
   # pos.words: vector of words of positive sentiment
   # neg.words: vector of words of negative sentiment
   # .progress: passed to laply() to control the progress bar

   # create simple array of scores with laply
   scores = laply(sentences,
   function(sentence, pos.words, neg.words)
   {

      # split sentence into words with str_split (stringr package)
      word.list = str_split(sentence, "\\s+")
      words = unlist(word.list)

      # compare words to the dictionaries of positive & negative terms
      pos.matches = match(words, pos.words)
      neg.matches = match(words, neg.words)

      # get the position of the matched term or NA
      # we just want a TRUE/FALSE
      pos.matches = !is.na(pos.matches)
      neg.matches = !is.na(neg.matches)

      # final score
      score = sum(pos.matches) - sum(neg.matches)
      return(score)
      }, pos.words, neg.words, .progress=.progress )

   # data frame with scores for each sentence
   scores.df = data.frame(text=sentences, score=scores)
   return(scores.df)
}

# import positive and negative words
pos = readLines("positive_words.txt")
neg = readLines("negative_words.txt")

# apply function score.sentiment


myDF$scores = score.sentiment(myDF$text, pos, neg, .progress='text')

Best answer

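4 GB sounds like enough memory to hold 700,000 sentences of 140 characters. A different way of calculating the sentiment scores might be more memory- and time-efficient, and/or easier to break into chunks. Instead of processing each sentence separately, break the entire set of sentences into words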

words <- str_split(sentences, "\\s+")

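then work out where each sentence ends in the combined word vector (the cumulative word count), and create a single vector of words

len <- cumsum(sapply(words, length))  # index of the last word of each sentence
words <- unlist(words, use.names=FALSE)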

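By re-using the words variable I free the memory it previously used so that it can be recycled (there is no need to call the garbage collector explicitly, contrary to the suggestion in @cryo111!). You can find whether each word is in pos.words or not, without worrying about NAs, using words %in% pos.words. But we can be a little clever: calculate the cumulative sum of this logical vector, and then subset the cumulative sum at the last word of each sentence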

cumsum(words %in% pos.words)[len]

and calculate the number of matching words in each sentence as the difference between consecutive values of this

pos.match <- diff(c(0, cumsum(words %in% pos.words)[len]))

This is the pos.match part of your score. So

scores <- diff(c(0, cumsum(words %in% pos.words)[len])) - 
          diff(c(0, cumsum(words %in% neg.words)[len]))

That's it.
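To make the cumulative-sum trick concrete, here is a minimal worked sketch on three made-up sentences. The toy sentences and word lists are assumptions for illustration only, `ends` is a hypothetical name for the index of each sentence's last word, and base `strsplit()` stands in for `str_split()` so the snippet runs without extra packages.

```r
# Toy data (made up for illustration)
sentences <- c("good good bad", "awful day", "great")
pos.words <- c("good", "great")
neg.words <- c("bad", "awful")

# Split every sentence into words
words <- strsplit(sentences, "\\s+")

# Index of the last word of each sentence in the flattened vector
ends <- cumsum(lengths(words))
words <- unlist(words, use.names = FALSE)

# Running count of positive words, read off at each sentence end;
# diff() turns the running totals back into per-sentence counts
pos.match <- diff(c(0, cumsum(words %in% pos.words)[ends]))
neg.match <- diff(c(0, cumsum(words %in% neg.words)[ends]))

scores <- pos.match - neg.match
print(scores)
```

Each sentence ends up with its own count of positive words minus its own count of negative words, without ever looping over sentences.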

score_sentiment <-
    function(sentences, pos.words, neg.words)
{
    words <- str_split(sentences, "\\s+")
    len <- cumsum(sapply(words, length))  # index of each sentence's last word
    words <- unlist(words, use.names=FALSE)
    diff(c(0, cumsum(words %in% pos.words)[len])) - 
      diff(c(0, cumsum(words %in% neg.words)[len]))
}

The intention here is to process all the sentences in one go

myDF$scores <- score_sentiment(myDF$text, pos, neg)

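This avoids for loops, which, while not inherently inefficient compared to lapply and friends when implemented correctly (as @joran shows), are very inefficient compared to vectorized solutions. Probably sentences is not copied here, and returning (just) the scores does not waste memory returning information we already know (the sentences). The largest memory consumers will be sentences and words.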
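One way to gain confidence in a vectorized scorer is to compare it against a plain per-sentence scorer on a small random sample. The sketch below does that; the function names `score_vectorized` and `score_per_sentence`, the vocabulary and the sample size are all assumptions for illustration, and base `strsplit()` is used in place of `str_split()` to keep it self-contained.

```r
# Vectorized scorer: one pass over the flattened word vector
score_vectorized <- function(sentences, pos.words, neg.words) {
    words <- strsplit(sentences, "\\s+")
    ends <- cumsum(lengths(words))          # last-word index per sentence
    words <- unlist(words, use.names = FALSE)
    diff(c(0, cumsum(words %in% pos.words)[ends])) -
        diff(c(0, cumsum(words %in% neg.words)[ends]))
}

# Reference scorer: handles one sentence at a time
score_per_sentence <- function(sentences, pos.words, neg.words) {
    vapply(strsplit(sentences, "\\s+"), function(w) {
        sum(w %in% pos.words) - sum(w %in% neg.words)
    }, numeric(1))
}

# Random toy sentences of 1 to 8 words from a tiny vocabulary
set.seed(1)
vocab <- c("good", "great", "bad", "awful", "day", "the")
sentences <- replicate(1000, paste(
    sample(vocab, sample(1:8, 1), replace = TRUE), collapse = " "))

a <- score_vectorized(sentences, c("good", "great"), c("bad", "awful"))
b <- score_per_sentence(sentences, c("good", "great"), c("bad", "awful"))
stopifnot(all(a == b))
```

Wrapping both calls in `system.time()` on a larger sample would give a rough idea of the speed difference on a particular machine.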

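If memory is still a problem, then I would create an index that can be used to split the text into smaller groups, and calculate the scores for each group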

nGroups <- 10 ## i.e., about 70k sentences / group
idx <- seq_along(myDF$text)
grp <- split(idx, cut(idx, nGroups, labels=FALSE))
scorel <- lapply(grp, function(i) score_sentiment(myDF$text[i], pos, neg))
myDF$scores <- unlist(scorel, use.names=FALSE)
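To see what the grouping index looks like, here is a small sketch on ten made-up row numbers split into three groups. The sizes are assumptions chosen only so that the result is easy to inspect.

```r
# Ten hypothetical row indices, split into three contiguous groups
idx <- seq_len(10)
nGroups <- 3

# cut() assigns each index to one of nGroups equal-width bins;
# split() collects the indices belonging to each bin
grp <- split(idx, cut(idx, nGroups, labels = FALSE))
print(grp)

# Groups are contiguous and ordered, so concatenating them
# restores the original row order
restored <- unlist(grp, use.names = FALSE)
stopifnot(identical(restored, idx))
```

Because the groups come back in row order, unlisting the per-group scores lines them up with the rows of the data frame.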

Make sure first that myDF$text is actually a character vector (rather than, say, a factor), e.g., myDF$text <- as.character(myDF$text)
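A quick check of the column type might look like the following sketch. The data frame `toyDF` is hypothetical, and it is built with `stringsAsFactors = TRUE` on purpose to mimic a text column that was read in as a factor.

```r
# Hypothetical data frame whose text column was read in as a factor
toyDF <- data.frame(id = 1:3,
                    text = c("good day", "bad day", "great"),
                    stringsAsFactors = TRUE)

was_character <- is.character(toyDF$text)   # FALSE here: it is a factor
print(class(toyDF$text))

# Convert to character before splitting into words
toyDF$text <- as.character(toyDF$text)
print(class(toyDF$text))

# Approximate memory used by the column
print(object.size(toyDF$text), units = "Kb")
```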

Regarding "R: Out of memory, how to loop over rows?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/16267477/
