
Replacing quanteda tokens with regular expressions

Repost. Author: 行者123. Updated: 2023-12-04 17:20:15

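I want to explicitly replace specific tokens in an object of class tokens from the quanteda package. I have not managed to reproduce the standard approach that works with stringr.

The goal is to replace every token of the form "XXXof" with two tokens of the form c("XXX", "of").

Please see the minimal example below: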

suppressPackageStartupMessages(library(quanteda))
library(stringr)

text = "It was a beautiful day down to the coastof California."

# I would solve this with stringr as follows:
text_stringr = str_replace( text, "(^.*?)(of)", "\\1 \\2" )
text_stringr
#> [1] "It was a beautiful day down to the coast of California."

# I fail to find a similar solution with quanteda that works on objects of class tokens
tok = tokens( text )

# I want to replace "coastof" with the two tokens "coast" and "of"
tokens_replace( tok, "(^.*?)(of)", "\\1 \\2", valuetype = "regex" )
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "It" "was" "a" "beautiful" "day"
#> [6] "down" "to" "the" "\\1 \\2" "California"
#> [11] "."

Is there any workaround?

Created on 2021-03-16 by the reprex package (v1.0.0)

Best answer

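You can use a mixed approach: build a list of the words that need to be separated together with their separated forms, then use tokens_replace() to perform the replacement. The advantage is that you can curate the list before applying it, so you can check that the pattern has not picked up replacements you do not want to apply.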

suppressPackageStartupMessages(library("quanteda"))

toks <- tokens("It was a beautiful day down to the coastof California.")

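# Collect the tokens that match the pattern; these become the replacement keys
keys <- as.character(tokens_select(toks, "(^.*?)(of)", valuetype = "regex"))
# Insert a space before "of", then split each key into its separate tokens
vals <- strsplit(stringr::str_replace(keys, "(^.*?)(of)", "\\1 \\2"), " ")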

keys
## [1] "coastof"
vals
## [[1]]
## [1] "coast" "of"

tokens_replace(toks, keys, vals)
## Tokens consisting of 1 document.
## text1 :
## [1] "It" "was" "a" "beautiful" "day"
## [6] "down" "to" "the" "coast" "of"
## [11] "California" "."

Created on 2021-03-16 by the reprex package (v1.0.0)
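
The curation step mentioned in the answer can be sketched as follows. This is only an illustration under stated assumptions, not part of the original answer: the example sentence, the end-anchored pattern ".+of$", and the hand-made keep_whole vector are all hypothetical. The point is that genuine words ending in "of" (such as "proof" or "roof") also match, so they are removed from the key list before tokens_replace() is called.

```r
library(quanteda)

# Hypothetical example text (not from the original question)
toks <- tokens("The proof was found on the roof near the coastof California.")

# Candidate keys: tokens ending in "of" with at least one character before it.
# The end-anchored pattern is an assumption; the original answer used "(^.*?)(of)".
keys <- unique(as.character(tokens_select(toks, ".+of$", valuetype = "regex")))

# Curate the list: drop real words that only happen to end in "of".
# keep_whole is a hand-made, hypothetical exception list.
keep_whole <- c("proof", "roof")
keys <- setdiff(keys, keep_whole)

# Build the separated form of each remaining key, e.g. "coastof" -> c("coast", "of")
vals <- lapply(keys, function(k) c(sub("of$", "", k), "of"))

# Apply the curated replacements, using the same call as in the answer
result <- tokens_replace(toks, keys, vals)
as.character(result)
```

Inspecting keys and vals before the final call is what makes this approach safer than a blind regex substitution: any unwanted match can be added to the exception list first.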

On the topic of replacing quanteda tokens with regular expressions, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/66651356/
