Removing digits glued to words in a quanteda object of class tokens

Reposted by: 行者123. Updated: 2023-12-03 23:40:11

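A related question can be found here, but it does not directly address the issue I discuss below.
My goal is to remove any digits that occur attached to a token. For example, I want to get rid of the digits in cases such as 13f, 408-k, 10-k, etc. I am using quanteda as my main tool. I have a classic corpus object, which I tokenized with the function tokens(). The argument remove_numbers = TRUE does not seem to help in these cases, since it simply ignores such tokens and leaves them where they are. If I use tokens_remove() with a specific regex, the tokens are removed altogether, which I want to avoid because I am interested in the remaining textual content.
Here is a minimal example showing how I worked around the problem with the function str_remove_all() from stringr. It works, but it can be very slow for large objects.
My question is: is there a way to achieve the same result without leaving quanteda (e.g., on an object of class tokens)?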

library(quanteda)
#> Package version: 2.1.2
#> Parallel computing: 2 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
#> 
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#> 
#>     View
library(stringr)

mytext = c( "This is a sentence with correctly spaced digits like K 16.",
            "This is a sentence with uncorrectly spaced digits like 123asd and well101.")

# Tokenizing
mytokens = tokens(mytext, 
                  remove_punct = TRUE,
                  remove_numbers = TRUE )
mytokens
#> Tokens consisting of 2 documents.
#> text1 :
#>  [1] "This"      "is"        "a"         "sentence"  "with"      "correctly"
#>  [7] "spaced"    "digits"    "like"      "K"        
#> 
#> text2 :
#>  [1] "This"        "is"          "a"           "sentence"    "with"       
#>  [6] "uncorrectly" "spaced"      "digits"      "like"        "123asd"     
#> [11] "and"         "well101"

# the tokens "123asd" and "well101" are still there.
# I can be more specific using a regex but this removes the tokens altogether
# 
mytokens_wrong = tokens_remove( mytokens, pattern = "[[:digit:]]", valuetype = "regex")
mytokens_wrong
#> Tokens consisting of 2 documents.
#> text1 :
#>  [1] "This"      "is"        "a"         "sentence"  "with"      "correctly"
#>  [7] "spaced"    "digits"    "like"      "K"        
#> 
#> text2 :
#>  [1] "This"        "is"          "a"           "sentence"    "with"       
#>  [6] "uncorrectly" "spaced"      "digits"      "like"        "and"

# This is the workaround which seems to be working but can be very slow.
# I am using stringr::str_remove_all() function
# 
mytokens_ok = lapply( mytokens, function(x) str_remove_all( x, "[[:digit:]]" ) )
mytokens_ok
#> $text1
#>  [1] "This"      "is"        "a"         "sentence"  "with"      "correctly"
#>  [7] "spaced"    "digits"    "like"      "K"        
#> 
#> $text2
#>  [1] "This"        "is"          "a"           "sentence"    "with"       
#>  [6] "uncorrectly" "spaced"      "digits"      "like"        "asd"        
#> [11] "and"         "well"

Created on 2021-02-15 by the reprex package (v0.3.0)

Best answer

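The other answer makes clever use of tokens_split(), but it will not always work if you want to remove digits from the middle of a word (since it splits an original word containing inner digits into two).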
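To make the difference concrete, here is a base R analogy only: strsplit() stands in for splitting a token on its digits and gsub() for deleting them in place. The word "abc123def" is a made-up example, not taken from the question.

```r
# Base R analogy, not quanteda code: compare splitting on digits
# with deleting digits for a word that has digits in the middle.
word <- "abc123def"  # hypothetical token with inner digits

split_result  <- strsplit(word, "[[:digit:]]+")[[1]]  # split at the digit run
delete_result <- gsub("[[:digit:]]", "", word)        # drop the digits in place

print(split_result)   # the word comes back as two pieces
print(delete_result)  # the word stays in one piece
```

Splitting turns one word into two, whereas deleting keeps a single token, which is the behaviour asked for in the question.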
Here is an efficient way to remove the digit characters from the types (the unique tokens/words):

library("quanteda")
## Package version: 2.1.2

mytext <- c(
  "This is a sentence with correctly spaced digits like K 16.",
  "This is a sentence with uncorrectly spaced digits like 123asd and well101."
)
toks <- tokens(mytext, remove_punct = TRUE, remove_numbers = TRUE)

# get all types with digits
typesnum <- grep("[[:digit:]]", types(toks), value = TRUE)
typesnum
## [1] "123asd"  "well101"

# replace the types with types without digits
tokens_replace(toks, typesnum, gsub("[[:digit:]]", "", typesnum))
## Tokens consisting of 2 documents.
## text1 :
##  [1] "This"      "is"        "a"         "sentence"  "with"      "correctly"
##  [7] "spaced"    "digits"    "like"      "K"        
## 
## text2 :
##  [1] "This"        "is"          "a"           "sentence"    "with"       
##  [6] "uncorrectly" "spaced"      "digits"      "like"        "asd"        
## [11] "and"         "well"
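
The reason this tends to be faster than the lapply() workaround is that the regex only runs on the unique words rather than on every token. The following base R sketch illustrates that idea on a made-up character vector; it is not how quanteda implements tokens_replace() internally.

```r
# Illustration of the "work on types" idea only; the data are hypothetical.
words <- c("well101", "and", "well101", "123asd", "and", "well101")

# Naive approach: apply the regex to every single token.
naive <- gsub("[[:digit:]]", "", words)

# Type-based approach: clean each unique word once, then map the result back.
vocab       <- unique(words)                   # the "types"
vocab_clean <- gsub("[[:digit:]]", "", vocab)  # regex runs length(vocab) times
by_type     <- vocab_clean[match(words, vocab)]

# Both approaches give the same cleaned sequence.
stopifnot(identical(naive, by_type))
print(by_type)
```

In a real corpus the vocabulary is usually far smaller than the number of tokens, so the saving grows with the size of the object.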

Note that I usually recommend stringi for all regex operations, but base package functions are used here for simplicity.
Created on 2021-02-15 by the reprex package (v1.0.0)
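
For readers who want to follow the stringi recommendation, the same replacement could be sketched as below. This is one possible variant, not the answer's own code: the name toks_clean and the choices valuetype = "fixed" and case_insensitive = FALSE are assumptions added here so that each type is matched literally and exactly.

```r
library("quanteda")
library("stringi")

mytext <- c(
  "This is a sentence with correctly spaced digits like K 16.",
  "This is a sentence with uncorrectly spaced digits like 123asd and well101."
)
toks <- tokens(mytext, remove_punct = TRUE, remove_numbers = TRUE)

# keep only the types that contain at least one digit
typs     <- types(toks)
typesnum <- typs[stri_detect_regex(typs, "[[:digit:]]")]

# replace each of those types by its digit-free version
# (valuetype and case_insensitive are choices made for this sketch)
toks_clean <- tokens_replace(
  toks,
  pattern = typesnum,
  replacement = stri_replace_all_regex(typesnum, "[[:digit:]]", ""),
  valuetype = "fixed",
  case_insensitive = FALSE
)
toks_clean
```

Matching the types as fixed strings avoids any glob characters inside a type being interpreted as wildcards.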

Regarding removing digits glued to words in a quanteda object of class tokens, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/66205204/
