r - How to keep non-alphanumeric symbols when tokenizing words in R?

Reposted. Author: 行者123. Updated: 2023-12-01 13:29:29

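I am using the tokenizers package in R to tokenize text, but non-alphanumeric symbols such as "@" or "&" are lost, and I need to keep them. This is the function I am using: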

library(tokenizers)
tokenize_ngrams("My number & email address user@website.com", lowercase = FALSE, n = 3, n_min = 1, stopwords = character(), ngram_delim = " ", simplify = FALSE)

I know that tokenize_character_shingles has a strip_non_alphanum argument that allows punctuation to be kept, but that tokenization is applied to characters, not words.

Does anyone know how to handle this?

Best answer

If you can use a different package, ngram has two useful functions that keep those non-alphanumeric characters:

> library(ngram)
> print(ngram("My number & email address user@website.com",n = 2), output = 'full')
number & | 1
email {1} |

My number | 1
& {1} |

address user@website.com | 1
NULL {1} |

& email | 1
address {1} |

email address | 1
user@website.com {1} |

> print(ngram_asweka("My number & email address user@website.com",1,3), output = 'full')
[1] "My number &" "number & email"
[3] "& email address" "email address user@website.com"
[5] "My number" "number &"
[7] "& email" "email address"
[9] "address user@website.com" "My"
[11] "number" "&"
[13] "email" "address"
[15] "user@website.com"
>
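
If adding a dependency is not an option, the same idea (split on whitespace only, then join neighbouring words) can be sketched in base R. This is only an illustration of the technique, not how ngram or tokenizers work internally; word_ngrams() and its arguments are hypothetical names chosen to echo the tokenize_ngrams() call in the question.

```r
# Hypothetical helper: word n-grams from whitespace-separated tokens.
# Punctuation and symbols such as "&" or "@" are never stripped because
# the only delimiter is whitespace.
word_ngrams <- function(x, n_min = 1, n = 3, sep = " ") {
  # Split on runs of whitespace; everything else stays inside its token.
  words <- strsplit(trimws(x), "\\s+")[[1]]
  out <- character()
  for (size in n_min:n) {
    # Stop once the requested n-gram is longer than the text itself.
    if (size > length(words)) break
    starts <- seq_len(length(words) - size + 1)
    grams <- vapply(
      starts,
      function(i) paste(words[i:(i + size - 1)], collapse = sep),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

grams <- word_ngrams("My number & email address user@website.com")
print(grams)
```

The sketch assumes n_min <= n and a single input string; tokens are returned shortest n-grams first, which differs from the ordering printed by ngram_asweka above.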

Another nice package, quanteda, offers more flexibility through the remove_punct argument.

> library(quanteda)
> text <- "My number & email address user@website.com"
> tokenize(text, ngrams = 1:3)
tokenizedTexts from 1 document.
Component 1 :
[1] "My" "number"
[3] "&" "email"
[5] "address" "user@website.com"
[7] "My_number" "number_&"
[9] "&_email" "email_address"
[11] "address_user@website.com" "My_number_&"
[13] "number_&_email" "&_email_address"
[15] "email_address_user@website.com"

>
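
The tokenize() call above comes from an older quanteda release. A rough equivalent for more recent versions might look like the sketch below, assuming that tokens() and tokens_ngrams() are available and that what = "fastestword" (splitting on spaces only) suits your text. The variable names are illustrative, and the exact tokens may vary between quanteda versions, so no output is shown.

```r
library(quanteda)

text <- "My number & email address user@website.com"

# Assumption: splitting on spaces only keeps "&" and the e-mail address
# as single tokens; remove_punct = FALSE keeps punctuation tokens.
toks <- tokens(text, what = "fastestword", remove_punct = FALSE)

# Build 1- to 3-grams; "_" matches the concatenator seen in the output above.
grams <- tokens_ngrams(toks, n = 1:3, concatenator = "_")

print(as.character(grams))
```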

Regarding "r - How to keep non-alphanumeric symbols when tokenizing words in R?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/46729981/
