
python - CountVectorizer converts words to lowercase

Reposted. Author: 太空宇宙. Updated: 2023-11-03 14:46:02

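In my classification model I need to keep uppercase letters, but when I build the vocabulary with sklearn's CountVectorizer, the uppercase letters are converted to lowercase!

To rule out implicit tokenization, I wrote a tokenizer that simply passes the text through without any other processing.

My code: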

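import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

co = dict()

def tokenizeManu(txt):
    # Split on whitespace only; no other processing.
    return txt.split()

def corpDict(x):
    print('1: ', x)
    count = CountVectorizer(ngram_range=(1, 1), tokenizer=tokenizeManu)
    countFit = count.fit_transform(x)
    # get_feature_names() was removed in scikit-learn 1.2.
    vocab = count.get_feature_names_out()
    # Total occurrences of each vocabulary term across all documents.
    dist = np.sum(countFit.toarray(), axis=0)
    for tag, n in zip(vocab, dist):
        co[str(tag)] = int(n)

x = ['I\'m John Dev', 'We are the only']

corpDict(x)
print(co)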

Output:

1:  ["I'm John Dev", 'We are the only'] #<- before building the vocab.
{'john': 1, 'the': 1, 'we': 1, 'only': 1, 'dev': 1, "i'm": 1, 'are': 1} #<- after

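Best answer

As stated in the documentation, CountVectorizer has a parameter lowercase, which defaults to True. The lowercasing is applied to the text before it is tokenized, so a custom tokenizer does not prevent it. To disable this behavior, set lowercase=False as follows: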

count = CountVectorizer(ngram_range=(1, 1), tokenizer=tokenizeManu, lowercase=False)
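
To see the effect side by side, here is a minimal sketch that fits the same two sentences with the default setting and with lowercase=False, then compares the learned vocabularies. The names tokenize_manu, default_vec and cased_vec are illustrative and not from the original post. The vocabulary is read from the fitted vocabulary_ attribute. Recent scikit-learn versions may warn that token_pattern is unused when a custom tokenizer is passed; that warning is harmless here.

```python
from sklearn.feature_extraction.text import CountVectorizer


def tokenize_manu(txt):
    # Whitespace-only tokenizer, same idea as tokenizeManu in the question.
    return txt.split()


docs = ["I'm John Dev", "We are the only"]

# Default behaviour: lowercase=True, applied before the tokenizer runs.
default_vec = CountVectorizer(ngram_range=(1, 1), tokenizer=tokenize_manu)
# Case-preserving variant suggested in the answer.
cased_vec = CountVectorizer(ngram_range=(1, 1), tokenizer=tokenize_manu, lowercase=False)

# vocabulary_ maps each term to its column index; sorting the keys gives the terms.
default_vocab = sorted(default_vec.fit(docs).vocabulary_)
cased_vocab = sorted(cased_vec.fit(docs).vocabulary_)

print(default_vocab)  # ['are', 'dev', "i'm", 'john', 'only', 'the', 'we']
print(cased_vocab)    # ['Dev', "I'm", 'John', 'We', 'are', 'only', 'the']
```

With lowercase=False, "John" and "john" would be counted as two different terms, which is what a case-sensitive classification model needs.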

Regarding "python - CountVectorizer converts words to lowercase", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49380998/
