python - CountVectorizer(analyzer='char_wb') not working as expected


Reposted by: 太空狗. Updated: 2023-10-30 00:48:48

I'm trying to count character 2-grams with scikit-learn's CountVectorizer, ignoring spaces. The docs describe the analyzer parameter as follows:

Whether the feature should be made of word or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries.

However, 'char_wb' doesn't seem to work the way I expected. For example:

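from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The blue dog Blue",
    "Green the green cat",
    "The green mouse",
]

# CountVectorizer character 2-grams with word boundaries
vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2), min_df=1)
X = vectorizer.fit_transform(corpus)
# get_feature_names() in older releases; it was removed in scikit-learn 1.2
vectorizer.get_feature_names_out().tolist()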
[' b',
 ' c',
 ' d',
 ' g',
 ' m',
 ' t',
 'at',
 'bl',
 'ca', ....

Notice entries such as ' b', which contain a space. What gives?

Best answer

I think this is a long-standing inaccuracy in the documentation, which you're welcome to help fix. It would be more correct to say:

Option ‘char_wb’ creates character n-grams, but does not generate n-grams that cross word boundaries.

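The change seems to have been made in this commit to ensure exactly that; see the contributor's comment. The output looks especially awkward when you compare bigrams against those from analyzer='char', but once you move up to trigrams you'll see that a space can begin or end an n-gram but can never sit in the middle. This marks a feature as word-initial or word-final without capturing noisy character n-grams that span two words. It also ensures that, unlike before that commit, all extracted n-grams have the same length!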
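To make the comparison concrete, here is a small sketch that runs both analyzers on one short string, then shows one possible way to get what the question originally wanted (2-grams with no spaces at all). The `ngrams` and `within_word_bigrams` helpers are hypothetical names introduced for illustration; they are not part of scikit-learn. The second part relies on CountVectorizer accepting a callable as `analyzer`, in which case the built-in lowercasing is skipped, so the helper lowercases the text itself.

```python
from sklearn.feature_extraction.text import CountVectorizer


def ngrams(analyzer, n, text):
    """Return the character n-grams scikit-learn extracts from `text`.

    build_analyzer() gives the extraction function directly, so no
    fitting is needed just to look at the n-grams.
    """
    vectorizer = CountVectorizer(analyzer=analyzer, ngram_range=(n, n))
    return vectorizer.build_analyzer()(text)


text = "the cat"

char_bigrams = ngrams("char", 2, text)
wb_bigrams = ngrams("char_wb", 2, text)
char_trigrams = ngrams("char", 3, text)
wb_trigrams = ngrams("char_wb", 3, text)

print(char_trigrams)  # contains 'e c', which spans two words
print(wb_trigrams)    # a space only ever appears first or last


def within_word_bigrams(doc):
    """Hypothetical helper: 2-grams taken inside each word, with no padding.

    Splits on whitespace, so no returned bigram can contain a space.
    """
    return [w[i:i + 2] for w in doc.lower().split() for i in range(len(w) - 1)]


corpus = [
    "The blue dog Blue",
    "Green the green cat",
    "The green mouse",
]

# A callable analyzer replaces the whole preprocess/tokenize/n-gram pipeline.
vectorizer = CountVectorizer(analyzer=within_word_bigrams)
X = vectorizer.fit_transform(corpus)
features = sorted(vectorizer.vocabulary_)
print(features)
```

With 'char_wb' every word is padded with one space on each side before the n-grams are cut, which is where features like ' b' come from. If those padded features are unwanted, a callable analyzer like the one above is one option; dropping the columns whose feature name contains a space after fitting would be another.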

Regarding python - CountVectorizer(analyzer='char_wb') not working as expected, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/36188875/

