
python - CountVectorizer, but for groups of text


With the code below, CountVectorizer breaks "Air-dried meat" into 3 different vectors. What I want instead is to keep "Air-dried meat" as 1 vector. How can I do that?

The code I run:

from sklearn.feature_extraction.text import CountVectorizer
food_names = ['Air-dried meat', 'Almonds', 'Amaranth']
count_vect = CountVectorizer(binary=True)
bow_rep = count_vect.fit(food_names)
#Look at the vocabulary mapping
print("Our vocabulary: ", count_vect.vocabulary_)

Current output:

Our vocabulary:  {'air': 0, 'dried': 3, 'meat': 4, 'almonds': 1, 'amaranth': 2}

Desired output:

Our vocabulary:  {'air-dried meat': 3, 'almonds': 1, 'amaranth': 2}

Best answer

You can change the behavior with options of CountVectorizer - i.e. token_pattern or tokenizer.


If you use token_pattern='.+'

CountVectorizer(binary=True, token_pattern='.+')

then it treats every element in the list as a single word.

from sklearn.feature_extraction.text import CountVectorizer

food_names = ['Air-dried meat', 'Almonds', 'Amaranth']

count_vect = CountVectorizer(binary=True, token_pattern='.+')
bow_rep = count_vect.fit(food_names)

print("Our vocabulary:", count_vect.vocabulary_)

Result:

Our vocabulary: {'air-dried meat': 0, 'almonds': 1, 'amaranth': 2}
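One caveat with token_pattern='.+': the whole document becomes a single token, so after fitting, transform() only matches documents that exactly equal a vocabulary entry (after lowercasing) - a minimal sketch reusing count_vect from above:

# quick check: reusing count_vect fitted with token_pattern='.+'
print(count_vect.transform(['Air-dried meat']).toarray())          # [[1 0 0]] - exact match
print(count_vect.transform(['Air-dried meat, smoked']).toarray())  # [[0 0 0]] - whole string is one unseen token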

If you use tokenizer=shlex.split

CountVectorizer(binary=True, tokenizer=shlex.split)

then you can use "" to group words inside a string:

from sklearn.feature_extraction.text import CountVectorizer
import shlex

food_names = ['"Air-dried meat" other words', 'Almonds', 'Amaranth']

count_vect = CountVectorizer(binary=True, tokenizer=shlex.split)
bow_rep = count_vect.fit(food_names)

print("Our vocabulary:", count_vect.vocabulary_)

Result:

Our vocabulary: {'air-dried meat': 0, 'other': 3, 'words': 4, 'almonds': 1, 'amaranth': 2}

BTW: a similar question on the DataScience portal:

how to avoid tokenizing w/ sklearn feature extraction


EDIT:

You can also convert food_names to lower() and use it as the vocabulary

vocabulary = [x.lower() for x in food_names]

count_vect = CountVectorizer(binary=True, vocabulary=vocabulary)

and it will also treat them as single elements in the vocabulary:

from sklearn.feature_extraction.text import CountVectorizer

food_names = ["Air-dried meat", "Almonds", "Amaranth"]
vocabulary = [x.lower() for x in food_names]

count_vect = CountVectorizer(binary=True, vocabulary=vocabulary)

bow_rep = count_vect.fit(food_names)
print("Our vocabulary:", count_vect.vocabulary_)

The problem comes when you want to use these methods with transform(), because only tokenizer=shlex.split splits the text of the transformed documents - but it also needs the "" in the text to catch air-dried meat:

from sklearn.feature_extraction.text import CountVectorizer
import shlex

food_names = ['"Air-dried meat" Almonds Amaranth']

count_vect = CountVectorizer(binary=True, tokenizer=shlex.split)
bow_rep = count_vect.fit(food_names)
print("Our vocabulary:", count_vect.vocabulary_)

text = 'Almonds of Germany'
temp = count_vect.transform([text])
print(text, temp.toarray())  # matches the single-word entry 'almonds'

text = '"Air-dried meat"'
temp = count_vect.transform([text])
print(text, temp.toarray())  # quoted, so shlex.split keeps it as one token - it matches

text = 'Air-dried meat'
temp = count_vect.transform([text])
print(text, temp.toarray())  # unquoted, so it splits into 'air-dried' and 'meat' - no match
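Expected output (added here for reference, assuming scikit-learn defaults, i.e. lowercasing enabled):

Our vocabulary: {'air-dried meat': 0, 'almonds': 1, 'amaranth': 2}
Almonds of Germany [[0 1 0]]
"Air-dried meat" [[1 0 0]]
Air-dried meat [[0 0 0]]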

Regarding python - CountVectorizer, but for groups of text, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/70405314/
