
Removing ngrams that contain stopwords with tidytext

Reposted. Author: 行者123. Updated: 2023-12-04 13:01:49

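Update: Thank you for your input. I have rewritten the question and added a better example to highlight implicit requirements that my first example did not cover.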

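Question
I am looking for a general tidy solution for removing ngrams that contain stopwords. In short, an ngram is a string of words separated by spaces. A unigram contains 1 word, a bigram contains 2 words, and so on. My goal is to apply this to a data frame after using unnest_tokens(). The solution should work on a data frame containing a mix of ngrams of any length (uni, bi, tri, ...), or at least bigrams, trigrams and above.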

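  • For more information about ngrams, see the wiki: https://en.wikipedia.org/wiki/N-gram
  • I am aware of this question: Remove ngrams with leading and trailing stopwords. However, I am looking for a general solution that does not require the stopwords to be leading or trailing, and that also scales well.
  • As pointed out in the comments, a solution for bigrams is documented here: https://www.tidytextmining.com/ngrams.html#counting-and-filtering-n-grams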
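For context, the bigram approach referenced in the last bullet splits each bigram into two word columns and filters on both. The sketch below follows that pattern; `bigram_df` and the short `stopwords` vector are hypothetical illustration data, not part of the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical bigram data and stopword list, for illustration only.
bigram_df <- tibble::tibble(
  Document = 1,
  ngram = c("the basis", "ground water", "water is")
)
stopwords <- c("the", "of", "is")

# Split each bigram into its two words, keep rows where neither word is a
# stopword, then paste the two words back together.
bigrams_filtered <- bigram_df %>%
  separate(ngram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stopwords, !word2 %in% stopwords) %>%
  unite(ngram, word1, word2, sep = " ")

bigrams_filtered
```

Because `separate()` is told to produce exactly two columns, this pattern is tied to bigrams, which is why it does not directly cover a data frame with ngrams of mixed length.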

New example data

    ngram_df <- tibble::tribble(
    ~Document, ~ngram,
    1, "the",
    1, "the basis",
    1, "basis",
    1, "basis of culture",
    1, "culture",
    1, "is ground water",
    1, "ground water",
    1, "ground water treatment"
    )
    stopword_df <- tibble::tribble(
    ~word, ~lexicon,
    "the", "custom",
    "of", "custom",
    "is", "custom"
    )
    desired_output <- tibble::tribble(
    ~Document, ~ngram,
    1, "basis",
    1, "culture",
    1, "ground water",
    1, "ground water treatment"
    )

    Created on 2019-03-21 by the reprex package (v0.2.1)

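    Desired behaviour
  • ngram_df should be transformed into desired_output, using the stopwords in the word column of stopword_df.
  • Every row that contains a stopword should be removed.
  • Word boundaries should be respected (i.e. looking for is should not remove basis).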
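One way to meet these three requirements, sketched here as an assumption rather than as the accepted answer, is to split every ngram into its words, collect the rows that contain at least one stopword, and remove them with `anti_join()`. Because whole words are compared, `is` cannot match inside `basis`. The helper column `ngram_id` is introduced only for this illustration.

```r
library(dplyr)
library(tidyr)

ngram_df <- tibble::tribble(
  ~Document, ~ngram,
  1, "the",
  1, "the basis",
  1, "basis",
  1, "basis of culture",
  1, "culture",
  1, "is ground water",
  1, "ground water",
  1, "ground water treatment"
)
stopword_df <- tibble::tribble(
  ~word, ~lexicon,
  "the", "custom",
  "of",  "custom",
  "is",  "custom"
)

# Give each ngram a row id so it can be traced after splitting into words.
ngram_ids <- ngram_df %>% mutate(ngram_id = row_number())

# One row per word; keep only the ids of ngrams containing a stopword.
ids_with_stopword <- ngram_ids %>%
  separate_rows(ngram, sep = " ") %>%
  semi_join(stopword_df, by = c("ngram" = "word")) %>%
  distinct(ngram_id)

# Drop those ngrams, whatever their length.
result <- ngram_ids %>%
  anti_join(ids_with_stopword, by = "ngram_id") %>%
  select(-ngram_id)

result
```

This works for any mix of ngram lengths, since nothing in it depends on the number of words per row. It assumes the ngrams are already lower-cased and space-separated, as they are after unnest_tokens().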

    My first attempt was the reprex below:

    Example data

    library(tidyverse)
    library(tidytext)
    df <- "Groundwater remediation is the process that is used to treat polluted groundwater by removing the pollutants or converting them into harmless products." %>%
    enframe() %>%
    unnest_tokens(ngrams, value, "ngrams", n = 2)
    #apply magic here

    df
    #> # A tibble: 21 x 2
    #> name ngrams
    #> <int> <chr>
    #> 1 1 groundwater remediation
    #> 2 1 remediation is
    #> 3 1 is the
    #> 4 1 the process
    #> 5 1 process that
    #> 6 1 that is
    #> 7 1 is used
    #> 8 1 used to
    #> 9 1 to treat
    #> 10 1 treat polluted
    #> # ... with 11 more rows

    Example stopword list
    stopwords <- c("is", "the", "that", "to")

    Desired output

    #> Source: local data frame [9 x 2]
    #> Groups: <by row>
    #>
    #> # A tibble: 9 x 2
    #> name ngrams
    #> <int> <chr>
    #> 1 1 groundwater remediation
    #> 2 1 treat polluted
    #> 3 1 polluted groundwater
    #> 4 1 groundwater by
    #> 5 1 by removing
    #> 6 1 pollutants or
    #> 7 1 or converting
    #> 8 1 them into
    #> 9 1 harmless products

    Created on 2019-03-20 by the reprex package (v0.2.1)

    (Example sentence from: https://en.wikipedia.org/wiki/Groundwater_remediation)

    Best answer

    Here is another approach, using the "stopwords_collapsed" pattern from the previous answer:

    swc <- paste(stopwords, collapse = "|")
    df <- df[!str_detect(df$ngrams, swc), ] # select rows without stopwords

    df
    #> # A tibble: 8 x 2
    #> name ngrams
    #> <int> <chr>
    #> 1 1 groundwater remediation
    #> 2 1 treat polluted
    #> 3 1 polluted groundwater
    #> 4 1 groundwater by
    #> 5 1 by removing
    #> 6 1 pollutants or
    #> 7 1 or converting
    #> 8 1 harmless products
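Note that this result has 8 rows, while the desired output in the question lists 9: "them into" is dropped because the unanchored pattern matches "the" inside "them" and "to" inside "into". A minimal sketch of a word-boundary variant follows; the names `loose` and `strict` and the three test strings are assumptions made for illustration.

```r
library(stringr)

stopwords <- c("is", "the", "that", "to")
ngrams <- c("is the", "them into", "harmless products")

# Unanchored alternation, as in the answer above: also matches inside words.
loose <- paste(stopwords, collapse = "|")

# Anchored variant: \\b requires a word boundary on both sides of the match.
strict <- paste0("\\b(?:", loose, ")\\b")

str_detect(ngrams, loose)
#> [1]  TRUE  TRUE FALSE
str_detect(ngrams, strict)
#> [1]  TRUE FALSE FALSE
```

If a stopword list could contain regex metacharacters, each entry would need escaping before being pasted into the pattern; the plain words used here do not.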

    Here is a simple benchmark comparing the two approaches:
    # benchmark
    library(rbenchmark) # provides benchmark()

    # `txt` and `stopwords_collapsed` are not defined in this answer. `txt`
    # presumably holds the example sentence from the question, and
    # `stopwords_collapsed` is assumed to be built the same way as `swc` above.
    stopwords_collapsed <- paste(stopwords, collapse = "|")

    txtexp <- rep(txt, 1000000)
    dfexp <- txtexp %>%
      enframe() %>%
      unnest_tokens(ngrams, value, "ngrams", n = 2)

    benchmark(
      "mutate+filter (small text)" = {
        df1 <- df %>%
          mutate(has_stop_word = str_detect(ngrams, stopwords_collapsed)) %>%
          filter(!has_stop_word)
      },
      "[] row selection (small text)" = {
        df2 <- df[!str_detect(df$ngrams, stopwords_collapsed), ]
      },
      "mutate+filter (large text)" = {
        df3 <- dfexp %>%
          mutate(has_stop_word = str_detect(ngrams, stopwords_collapsed)) %>%
          filter(!has_stop_word)
      },
      "[] row selection (large text)" = {
        df4 <- dfexp[!str_detect(dfexp$ngrams, stopwords_collapsed), ]
      },
      replications = 5,
      columns = c("test", "replications", "elapsed")
    )

    #>                            test replications elapsed
    #> 4 [] row selection (large text)            5   30.03
    #> 2 [] row selection (small text)            5    0.00
    #> 3    mutate+filter (large text)            5   30.64
    #> 1    mutate+filter (small text)            5    0.00

    This question about removing ngrams that contain stopwords with tidytext comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/55264150/
