
python - TF-IDF vectorizer with char_wb has whitespace in its feature words?

Repost. Author: 行者123. Updated: 2023-12-01 08:25:09
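I am using

from sklearn.feature_extraction.text import TfidfVectorizer

# my_stop_words and text are defined elsewhere
singleTFIDF = TfidfVectorizer(
    analyzer='char_wb',
    ngram_range=(4, 6),
    stop_words=my_stop_words,
    max_features=50
).fit([text])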

and I wonder why there is whitespace in my features, for example "chaft".

How can I avoid this? Do I need to tokenize and preprocess the text myself?

Best answer

Use analyzer='word'.

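When we use analyzer='char_wb', the vectorizer pads with whitespace because it does not tokenize by word; it tokenizes by character.

According to the documentation for the analyzer parameter: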

analyzer{‘word’, ‘char’, ‘char_wb’} or callable, default=’word’

Whether the feature should be made of word or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
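The padding behaviour described in this excerpt can be illustrated without scikit-learn. The function below is a simplified sketch, not the library's implementation: the name char_wb_ngrams is hypothetical, it assumes lowercasing and splitting on whitespace, and it simply skips words whose padded form is shorter than n, which may differ from what scikit-learn does.

```python
def char_wb_ngrams(text, min_n, max_n):
    """Simplified sketch of 'char_wb'-style n-gram extraction (hypothetical helper)."""
    ngrams = []
    for word in text.lower().split():
        # Pad each word with one space on both sides, so n-grams at the
        # word edges carry a leading or trailing space.
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            # Slide a window of width n over the padded word only,
            # so no n-gram ever spans two words.
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams


grams = char_wb_ngrams("This is the first document.", 4, 6)
print(sorted(set(grams)))
```

In this sketch, ' this' and 'ment. ' appear among the n-grams because of the padding, while 'is the' cannot appear because each window stays inside a single padded word.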

See the example below:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
# Character n-grams of length 4 to 6, built inside word boundaries
vectorizer = TfidfVectorizer(
    analyzer='char_wb',
    ngram_range=(4, 6))
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print([(len(w), w) for w in vectorizer.get_feature_names_out()])

[(4, ' and'), (5, ' and '), (4, ' doc'), (5, ' docu'), (6, ' docum'),(4, ' fir'), (5, ' firs'), (6, ' first'), (4, ' is '), (4, ' one'),(5, ' one.'), (6, ' one. '), (4, ' sec'), (5, ' seco'), (6, ' secon'),(4, ' the'), (5, ' the '), (4, ' thi'), (5, ' thir'), (6, ' third'),(5, ' this'), (6, ' this '), (4, 'and '), (4, 'cond'), (5, 'cond '),(4, 'cume'), (5, 'cumen'), (6, 'cument'), (4, 'docu'), (5, 'docum'),(6, 'docume'), (4, 'econ'), (5, 'econd'), (6, 'econd '), (4, 'ent '),(4, 'ent.'), (5, 'ent. '), (4, 'ent?'), (5, 'ent? '), (4, 'firs'), (5,'first'), (6, 'first '), (4, 'hird'), (5, 'hird '), (4, 'his '), (4,'ird '), (4, 'irst'), (5, 'irst '), (4, 'ment'), (5, 'ment '), (5,'ment.'), (6, 'ment. '), (5, 'ment?'), (6, 'ment? '), (4, 'ne. '), (4,'nt. '), (4, 'nt? '), (4, 'ocum'), (5, 'ocume'), (6, 'ocumen'), (4,'ond '), (4, 'one.'), (5, 'one. '), (4, 'rst '), (4, 'seco'), (5,'secon'), (6, 'second'), (4, 'the '), (4, 'thir'), (5, 'third'), (6,'third '), (4, 'this'), (5, 'this '), (4, 'umen'), (5, 'ument'), (6,'ument '), (6, 'ument.'), (6, 'ument?')]

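Note:

  • The output/features include ' this' (padded at the beginning with an extra space that is not in the original text; the sentence starts with 'This').
  • The output/features include 'ment. ' (padded at the end with an extra space that is not in the original text; the sentence ends with 'document.').
  • The output/features do not include 'is the', because that n-gram crosses a word boundary, and the 'char_wb' analyzer only creates n-grams "inside word boundaries".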

Regarding "python - TF-IDF vectorizer with char_wb has whitespace in its feature words?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54308898/
