
python - Empty vocabulary for single letter by CountVectorizer

Reposted. Author: 太空宇宙. Updated: 2023-11-03 15:55:24

I am trying to convert a string into a numeric vector:

import re
from sklearn.feature_extraction.text import CountVectorizer

### Clean the string
def names_to_words(names):
    print('a')
    # Replace every non-letter with a space, lowercase, then split on whitespace
    words = re.sub("[^a-zA-Z]", " ", names).lower().split()
    print('b')

    return words


### Vectorization
def Vectorizer():
    Vectorizer = CountVectorizer(
        analyzer = "word",
        tokenizer = None,
        preprocessor = None,
        stop_words = None,
        max_features = 5000)
    return Vectorizer


### Test a string
s = 'abc...'
r = names_to_words(s)
# Each element of r is treated as a separate document
feature = Vectorizer().fit_transform(r).toarray()

But when the cleaned input is:

 ['g', 'o', 'm', 'd']

I get an error:

ValueError: empty vocabulary; perhaps the documents only contain stop words

Single-letter strings like these seem to be the problem. What should I do? Thanks.

Best answer

The default token_pattern regular expression in CountVectorizer selects only words with at least 2 characters, as stated in the documentation:

token_pattern : string

Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'. The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

In the source code of CountVectorizer, the default is r"(?u)\b\w\w+\b".

Change it to r"(?u)\b\w+\b" to include 1-letter words.
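The difference between the two patterns can be checked with the standard `re` module alone, without scikit-learn. This is only a sketch: the names `DEFAULT_PATTERN`, `SINGLE_CHAR_PATTERN`, and `docs` are illustrative and do not come from the original post.

```python
import re

# Default pattern quoted in the answer: a word must have 2 or more characters
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Suggested replacement: a word may have 1 or more characters
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"

# Each string plays the role of one document passed to the vectorizer
docs = ["g", "o", "m", "d", "abc"]

default_tokens = [re.findall(DEFAULT_PATTERN, d) for d in docs]
relaxed_tokens = [re.findall(SINGLE_CHAR_PATTERN, d) for d in docs]

print(default_tokens)  # [[], [], [], [], ['abc']]
print(relaxed_tokens)  # [['g'], ['o'], ['m'], ['d'], ['abc']]
```

With the default pattern every single-letter document yields no tokens, which is why the vocabulary ends up empty when all documents are single letters.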

Change your code to the following (it includes the token_pattern parameter suggested above):

Vectorizer = CountVectorizer(
    analyzer = "word",
    tokenizer = None,
    preprocessor = None,
    stop_words = None,
    max_features = 5000,
    token_pattern = r"(?u)\b\w+\b")
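A minimal end-to-end check of the fix on the failing input from the question, assuming scikit-learn is installed. The variable names `docs`, `vectorizer`, `features`, and `vocabulary` are illustrative and not part of the original post.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The input that triggered "ValueError: empty vocabulary" in the question
docs = ['g', 'o', 'm', 'd']

vectorizer = CountVectorizer(
    analyzer="word",
    stop_words=None,
    max_features=5000,
    token_pattern=r"(?u)\b\w+\b",  # accept 1-character tokens
)

# Each list element is one document; each row of the array is its count vector
features = vectorizer.fit_transform(docs).toarray()

# vocabulary_ maps each learned token to its column index
vocabulary = sorted(vectorizer.vocabulary_)

print(vocabulary)      # ['d', 'g', 'm', 'o']
print(features.shape)  # (4, 4)
```

Each of the four single-letter documents now contributes one token, so the vocabulary is no longer empty and `fit_transform` returns one row per letter.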

Regarding "python - Empty vocabulary for single letter by CountVectorizer", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43601358/
