
python - sklearn TfidfVectorizer : How to make few words to only be part of bi gram in the features

Reposted. Author: 太空宇宙. Updated: 2023-11-04 04:17:27

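I want TfidfVectorizer's featurization to treat some predefined words, e.g. "script" and "rule", as usable only inside bigrams.

Suppose I have the text "Script include is a script that has rule which has a business rule".

If I run the text above through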

tfidf = TfidfVectorizer(ngram_range=(1,2),stop_words='english')

I should get

['script include','business rule','include','business']
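
For comparison, here is a minimal sketch of what that vectorizer produces unmodified on the sample sentence. The variable names `doc`, `features` and `unwanted` are illustrative, and `sorted(tfidf.vocabulary_)` is used instead of a feature-name getter only so the snippet does not depend on the scikit-learn version.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

doc = "Script include is a script that has rule which has a business rule"

# Same settings as in the question: unigrams and bigrams, English stop words removed.
tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
tfidf.fit([doc])

# vocabulary_ maps each extracted n-gram to its column index.
features = sorted(tfidf.vocabulary_)
print(features)

# The words that should appear only inside bigrams still show up as unigrams.
unwanted = [f for f in features if f in {"script", "rule"}]
print(unwanted)
```

Stop words are dropped before the n-grams are built, so bigrams such as "script rule" can join words that were not adjacent in the original text.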

Best answer

from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# Given a vocabulary, return the n-grams that contain at least one
# token from include_list and no English stop words
def filter_vocab(full_vocab, include_list):
    b_list = list()
    for x in full_vocab:
        add = False
        for t in x.split():
            if t in text.ENGLISH_STOP_WORDS:
                add = False
                break
            if t in include_list:
                add = True
        if add:
            b_list.append(x)
    return b_list

# Get all the ngrams (one can also use nltk.util.ngrams)
ngrams = TfidfVectorizer(ngram_range=(1,2), norm=None, smooth_idf=False, use_idf=False)
X = ngrams.fit_transform(["Script include is a script that has rule which has a business rule"])
# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
full_vocab = list(ngrams.get_feature_names_out())

# Filter the full n-gram vocabulary
filtered_v = filter_vocab(full_vocab, ["include", "business"])

# Get tf-idf using the new filtered vocabulary
vectorizer = TfidfVectorizer(ngram_range=(1,2), vocabulary=filtered_v)
X = vectorizer.fit_transform(["Script include is a script that has rule which has a business rule"])
v = list(vectorizer.get_feature_names_out())
print(v)

The code is commented to explain what it does.
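
The answer above works from a list of words to keep. A different way to read the question is to list the words that may appear only inside bigrams and drop them as unigrams. The following is a rough sketch of that variant, not the answer author's method: `BIGRAM_ONLY` and `bigram_only_analyzer` are hypothetical names, and it assumes a callable `analyzer` wrapped around the analyzer that `CountVectorizer.build_analyzer()` returns.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical list of words that may appear only as part of a bigram.
BIGRAM_ONLY = {"script", "rule"}

# Reuse scikit-learn's own tokenizing, stop-word removal and n-gram building.
base_analyzer = CountVectorizer(
    ngram_range=(1, 2), stop_words="english"
).build_analyzer()

def bigram_only_analyzer(doc):
    """Return the usual unigrams and bigrams, minus the restricted unigrams."""
    # A bigram is a two-word string, so it never equals a single restricted word.
    return [ngram for ngram in base_analyzer(doc) if ngram not in BIGRAM_ONLY]

doc = "Script include is a script that has rule which has a business rule"

# With a callable analyzer, ngram_range and stop_words on this object are ignored,
# which is why they are configured on base_analyzer instead.
vectorizer = TfidfVectorizer(analyzer=bigram_only_analyzer)
X = vectorizer.fit_transform([doc])

names = sorted(vectorizer.vocabulary_)
print(names)
```

On the sample sentence this keeps "include" and "business" as unigrams and drops "script" and "rule", but it also keeps bigrams formed across removed stop words (for example "script rule"), so it yields a superset of the list in the question.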

Regarding python - sklearn TfidfVectorizer : How to make few words to only be part of bi gram in the features, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55162090/
