Remove ngrams containing stopwords using tidytext


Reposted by: 行者123. Updated: 2023-12-04 13:01:49


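Update: Thanks for your input. I have rewritten the question and added a better example to highlight implicit requirements that my first example did not cover.

Question
I am looking for a general tidy solution for removing ngrams that contain stopwords. In short, an ngram is a string of words separated by spaces. A unigram contains 1 word, a bigram contains 2 words, and so on. My goal is to apply this to a data frame after using unnest_tokens(). The solution should work on a data frame containing a mix of ngrams of any length (uni, bi, tri, ...), or at least bigrams, trigrams and above.

For more information on ngrams, see the wiki: https://en.wikipedia.org/wiki/N-gram

I am aware of this question: Remove ngrams with leading and trailing stopwords. However, I am looking for a general solution that does not require the stopwords to be leading or trailing, and that also scales reasonably well.

As pointed out in the comments, a solution for bigrams is documented here: https://www.tidytextmining.com/ngrams.html#counting-and-filtering-n-grams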
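
For context, the bigram approach of that kind splits each bigram into two word columns and filters on both. The following is only a minimal sketch of the idea: the toy `bigram_df` and the `stopwords` vector are made up for illustration and are not taken from the linked page.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, for illustration only
bigram_df <- tibble::tibble(
  name   = 1L,
  ngrams = c("groundwater remediation", "remediation is", "is the", "them into")
)
stopwords <- c("is", "the", "that", "to")

filtered <- bigram_df %>%
  # split every bigram into exactly two word columns
  separate(ngrams, c("word1", "word2"), sep = " ") %>%
  # keep a row only if neither word is a stopword (whole-word comparison)
  filter(!word1 %in% stopwords, !word2 %in% stopwords) %>%
  # glue the two words back together
  unite(ngrams, word1, word2, sep = " ")

filtered
```

Because the number of word columns is hard-coded to two, this pattern does not carry over directly to a data frame that mixes ngrams of different lengths, which is the case this question is about.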

New example data

ngram_df <- tibble::tribble(
  ~Document,                   ~ngram,
          1,                    "the",
          1,              "the basis",
          1,                  "basis",
          1,       "basis of culture",
          1,                "culture",
          1,        "is ground water",
          1,           "ground water",
          1, "ground water treatment"
  )
stopword_df <- tibble::tribble(
  ~word, ~lexicon,
  "the", "custom",
   "of", "custom",
   "is", "custom"
  )
desired_output <- tibble::tribble(
  ~Document,                   ~ngram,
          1,                  "basis",
          1,                "culture",
          1,           "ground water",
          1, "ground water treatment"
  )

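Created on 2019-03-21 by the reprex package (v0.2.1)

Desired behavior

ngram_df should be transformed into desired_output, using the stopwords in the word column of stopword_df.

Every row that contains a stopword should be removed.

Word boundaries should be respected (i.e. looking for "is" should not remove "basis").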
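
One possible way to meet these requirements is to split each ngram back into single words, flag the rows that contain a stopword, and drop the flagged rows. This is a sketch rather than a definitive solution: the temporary `ngram_id` column and the intermediate `with_ids` and `flagged` objects are names introduced here for illustration.

```r
library(dplyr)
library(tidytext)

ngram_df <- tibble::tribble(
  ~Document, ~ngram,
  1, "the",
  1, "the basis",
  1, "basis",
  1, "basis of culture",
  1, "culture",
  1, "is ground water",
  1, "ground water",
  1, "ground water treatment"
)
stopword_df <- tibble::tribble(
  ~word, ~lexicon,
  "the", "custom",
  "of",  "custom",
  "is",  "custom"
)

# Give every ngram a temporary id so rows can be matched up again later
with_ids <- ngram_df %>% mutate(ngram_id = row_number())

# Ids of all ngrams that contain at least one stopword as a whole word
flagged <- with_ids %>%
  unnest_tokens(word, ngram) %>%
  semi_join(stopword_df, by = "word") %>%
  distinct(ngram_id)

# Drop the flagged ngrams and the helper column
result <- with_ids %>%
  anti_join(flagged, by = "ngram_id") %>%
  select(-ngram_id)

result
```

Since the comparison is done on whole tokens, "basis" is not affected by the stopword "is", and the same code handles ngrams of any length.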

My first attempt, from the reprex below:

Example data

library(tidyverse)
library(tidytext)
df <- "Groundwater remediation is the process that is used to treat polluted groundwater by removing the pollutants or converting them into harmless products." %>% 
  enframe() %>% 
  unnest_tokens(ngrams, value, "ngrams", n = 2)
#apply magic here

df
#> # A tibble: 21 x 2
#>     name ngrams                 
#>    <int> <chr>                  
#>  1     1 groundwater remediation
#>  2     1 remediation is         
#>  3     1 is the                 
#>  4     1 the process            
#>  5     1 process that           
#>  6     1 that is                
#>  7     1 is used                
#>  8     1 used to                
#>  9     1 to treat               
#> 10     1 treat polluted         
#> # ... with 11 more rows

Example stopword list

stopwords <- c("is", "the", "that", "to")

Desired output

#> Source: local data frame [9 x 2]
#> Groups: <by row>
#> 
#> # A tibble: 9 x 2
#>    name ngrams                 
#>   <int> <chr>                  
#> 1     1 groundwater remediation
#> 2     1 treat polluted         
#> 3     1 polluted groundwater   
#> 4     1 groundwater by         
#> 5     1 by removing            
#> 6     1 pollutants or          
#> 7     1 or converting          
#> 8     1 them into              
#> 9     1 harmless products

Created on 2019-03-20 by the reprex package (v0.2.1)

(Example sentence from: https://en.wikipedia.org/wiki/Groundwater_remediation)

Best answer

Here is another approach, using the "stopwords_collapsed" pattern from the previous answer:

swc <- paste(stopwords, collapse = "|")
df <- df[str_detect(df$ngrams, swc) == FALSE, ] #select rows without stopwords

df
# A tibble: 8 x 2
   name ngrams                 
  <int> <chr>                  
1     1 groundwater remediation
2     1 treat polluted         
3     1 polluted groundwater   
4     1 groundwater by         
5     1 by removing            
6     1 pollutants or          
7     1 or converting          
8     1 harmless products
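
Note that this output has 8 rows while the desired output in the question has 9: "them into" is dropped because the plain alternation pattern also matches inside longer words ("the" in "them", "to" in "into"). A possible variant, sketched here under the assumption that the stopwords contain no regex metacharacters, wraps the alternation in word boundaries. The `ngrams` vector and the `pattern` and `kept` names are illustrative only.

```r
library(stringr)

stopwords <- c("is", "the", "that", "to")

# \\b anchors the alternation to word boundaries, so only whole words match
pattern <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")

# Hypothetical sample of ngrams of mixed length
ngrams <- c(
  "groundwater remediation",
  "remediation is",
  "is the",
  "them into",
  "basis of culture",
  "used to"
)

# Keep only the ngrams in which no stopword appears as a whole word
kept <- ngrams[!str_detect(ngrams, pattern)]
kept
```

With this pattern, "them into" and "basis of culture" are kept, while "remediation is", "is the" and "used to" are removed. The same `pattern` could be substituted for `swc` in the row selection above.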

Here is a simple benchmark comparing the two approaches:

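#benchmark
library(rbenchmark) # provides benchmark()
# Assumption: txt is the example sentence from the question, and
# stopwords_collapsed is the pattern from the previous answer (same as swc above)
txt <- "Groundwater remediation is the process that is used to treat polluted groundwater by removing the pollutants or converting them into harmless products."
stopwords_collapsed <- swc
txtexp <- rep(txt,1000000)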
dfexp <- txtexp %>% 
    enframe() %>% 
    unnest_tokens(ngrams, value, "ngrams", n = 2)

benchmark("mutate+filter (small text)" = {df1 <- df %>%
        mutate(
            has_stop_word = str_detect(ngrams, stopwords_collapsed)
        ) %>%
        filter(!has_stop_word)},
          "[] row selection (small text)" = {df2 <- df[str_detect(df$ngrams, stopwords_collapsed) == FALSE, ]},
        "mutate+filter (large text)" = {df3 <- dfexp %>%
            mutate(
                has_stop_word = str_detect(ngrams, stopwords_collapsed)
            ) %>%
            filter(!has_stop_word)},
        "[] row selection (large text)" = {df4 <- dfexp[str_detect(dfexp$ngrams, stopwords_collapsed) == FALSE, ]},
          replications = 5,
          columns = c("test", "replications", "elapsed")
)

                           test replications elapsed
4 [] row selection (large text)            5   30.03
2 [] row selection (small text)            5    0.00
3    mutate+filter (large text)            5   30.64
1    mutate+filter (small text)            5    0.00

This post is based on a similar question found on Stack Overflow about removing ngrams containing stopwords using tidytext: https://stackoverflow.com/questions/55264150/
