
r - Vectorize a for loop to speed up a program in R


I am looking for a simple way to vectorize the for loop in my R program.
I have the following data frame containing sentences, plus two dictionaries of positive and negative words:

# Create a data.frame with sentences
sent <- data.frame(words = c("just right size and i love this notebook",
                             "benefits great laptop",
                             "wouldnt bad notebook",
                             "very good quality",
                             "orgtop",
                             "great improvement for that bad product but overall is not good",
                             "notebook is not good but i love batterytop"),
                   user = c(1, 2, 3, 4, 5, 6, 7),
                   stringsAsFactors = FALSE)

# Create pos/neg word dictionaries
posWords <- c("great", "improvement", "love", "great improvement", "very good",
              "good", "right", "very", "benefits", "extra", "benefit", "top",
              "extraordinarily", "extraordinary", "super", "benefits super",
              "good", "benefits great", "wouldnt bad")
negWords <- c("hate", "bad", "not good", "horrible")

Now I replicate the original data frame to simulate a big data set:
# Replicate the original data.frame - big data simulation (700,000 rows of sentences)
library(zoo)  # needed for coredata()
df.expanded <- as.data.frame(replicate(100000, sent$words))
sent <- coredata(sent)[rep(seq(nrow(sent)), 100000), ]
rownames(sent) <- NULL

In the next step I sort the dictionary words in descending order of length, together with their sentiment score (pos word = 1, neg word = -1), so that longer phrases such as "great improvement" are matched before their component words:
# Order words in pos/negWords by descending length
wordsDF <- data.frame(words = posWords, value = 1, stringsAsFactors = FALSE)
wordsDF <- rbind(wordsDF, data.frame(words = negWords, value = -1))
wordsDF$lengths <- nchar(wordsDF$words)
wordsDF <- wordsDF[order(-wordsDF$lengths), ]
rownames(wordsDF) <- NULL
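
As a quick illustration of why the longest-first ordering matters (a hypothetical mini-example of mine, not from the question): the multi-word phrase is matched and removed first, so its component words cannot be counted a second time.

# Hypothetical illustration: longest-first matching avoids double counting
s <- "great improvement"
grepl("\\<great improvement\\>", s)           # TRUE  - the phrase scores +1
s <- gsub("\\<great improvement\\>", " ", s)  # remove the matched phrase
grepl("\\<great\\>", s)                       # FALSE - "great" can no longer match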

Then I define the following function with a for loop:
# Sentiment score function
library(qdapRegex)  # needed for rm_white()
scoreSentence2 <- function(sentence){
  score <- 0
  for(x in 1:nrow(wordsDF)){
    matchWords <- paste("\\<", wordsDF[x, 1], "\\>", sep = "")  # match exact words
    count <- length(grep(matchWords, sentence))                 # count them
    if(count){
      score <- score + (count * wordsDF[x, 2])  # compute score (count * sentValue)
      # remove matched words so shorter sub-phrases are not counted again
      sentence <- gsub(paste0('\\s*\\b', wordsDF[x, 1], '\\b\\s*', collapse = '|'), ' ', sentence)
      sentence <- rm_white(sentence)            # collapse leftover whitespace
    }
  }
  score
}

I call this function on the sentences in my data frame:
# Apply the scoreSentence2 function to the sentences
SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
# Time consumption for 700,000 sentences in the sent data.frame:
#    user  system elapsed
# 1054.19    0.09 1056.17

# Add the sentiment score to the original sent data.frame
sent <- cbind(sent, SentimentScore2)
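
The timing above was presumably measured with system.time() (the exact call is not shown in the question); a minimal sketch:

# Sketch: how the user/system/elapsed figures above are typically obtained
system.time(
  SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
)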

The desired output is:
words                                                            user  SentimentScore2
just right size and i love this notebook                            1                2
benefits great laptop                                               2                1
wouldnt bad notebook                                                3                1
very good quality                                                   4                1
orgtop                                                              5                0
...

and so on…

Please, could anyone help me reduce the computation time of my original approach? With my beginner-level programming skills in R, I have run out of ideas :-)
Any help or advice will be greatly appreciated. Many thanks in advance.

Best Answer

In the spirit of "teach a man to fish", I will walk you through it:

  • Make a copy of your code: you are going to mess it up!
  • Find the bottleneck:

    1a: Make the problem smaller:
    nRep <- 100
    df.expanded <- as.data.frame(replicate(nRep, sent$words))
    library(zoo)
    sent <- coredata(sent)[rep(seq(nrow(sent)), nRep), ]

    1b: Keep a reference solution: you are going to change your code, and few activities introduce bugs as readily as optimizing code!
    sentRef <- sent

    and add the same line, commented out, at the end of your code to remember where your reference is. To make it even easier to check that you are not messing up your code, you can test it automatically at the end of the code:
    library("testthat")
    expect_equal(sent,sentRef)

    1c: Wrap the profiler around the code you want to inspect:
    Rprof()
    SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
    Rprof(NULL)

    1d: Look at the results, with base R:
    summaryRprof()

    There are better tools: you can check the packages profr or lineprof. lineprof is my tool of choice, and here it adds real value, narrowing the problem down to these two lines:
    matchWords <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words
    count <- length(grep(matchWords,sentence)) # count them
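
    For reference, a minimal lineprof invocation might look like the sketch below (my assumption, not shown in the answer; lineprof installs from GitHub, and line-level output requires the function to be source()d from a file):

    # Minimal lineprof sketch (assumes scoreSentence2 is defined in scoreSentence2.R)
    # devtools::install_github("hadley/lineprof")
    library(lineprof)
    source("scoreSentence2.R")
    prof <- lineprof(unlist(lapply(sent$words, scoreSentence2)))
    prof  # per-line time and memory, pointing at the two lines above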
  • Fix it.

    3.1 Luckily, the main problem is fairly simple: you do not need the first line to be inside the function; move it before the function. By the way, the same applies to your paste0(). Your code becomes:
    matchWords <- paste("\\<", wordsDF[, 1], "\\>", sep = "")   # exact-word patterns, built once
    matchedWords <- paste0('\\s*\\b', wordsDF[, 1], '\\b\\s*')  # removal patterns, built once

    # Sentiment score function
    scoreSentence2 <- function(sentence){
      score <- 0
      for(x in 1:nrow(wordsDF)){
        count <- length(grep(matchWords[x], sentence))     # count matches
        if(count){
          score <- score + (count * wordsDF[x, 2])         # compute score (count * sentValue)
          sentence <- gsub(matchedWords[x], ' ', sentence) # remove matched words
          # sentence <- rm_white(sentence)                 # whitespace cleanup skipped for speed
        }
      }
      score
    }

    This brings the execution time for 1,000 replications down from
    5.64 s to 2.32 s. Not a bad investment!

    3.2 The next bottleneck is the "count <-" line, but I think
    shadow has the right answer :-) Combined, we get:
    matchWords <- paste("\\<", wordsDF[, 1], "\\>", sep = "")   # exact-word patterns, built once
    matchedWords <- paste0('\\s*\\b', wordsDF[, 1], '\\b\\s*')  # removal patterns, built once

    # Sentiment score function
    scoreSentence2 <- function(sentence){
      score <- 0
      for(x in 1:nrow(wordsDF)){
        count <- grepl(matchWords[x], sentence)          # TRUE/FALSE instead of counting
        score <- score + (count * wordsDF[x, 2])         # compute score (count * sentValue)
        sentence <- gsub(matchedWords[x], ' ', sentence) # remove matched words
        # sentence <- rm_white(sentence)
      }
      score
    }

  • Here we are down to 0.18 s, a 31× speed-up…
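
    A side note of mine, not part of the accepted answer: since grepl() and gsub() are vectorized over their input, this last version of scoreSentence2() can be applied to the whole character vector in a single call, dropping the lapply() entirely:

    # Sketch (my assumption): `sentence` can be the full column, because
    # grepl()/gsub() work element-wise and `score` then accumulates a vector.
    SentimentScore2 <- scoreSentence2(sent$words)
    sent <- cbind(sent, SentimentScore2)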

Regarding "r - Vectorize a for loop to speed up a program in R", see the original question on Stack Overflow: https://stackoverflow.com/questions/28670686/
