
python - How to remove stop phrases / stop ngrams (multi-word strings) using pandas/sklearn?

Reposted — author: 行者123, updated 2023-11-28 21:04:22

I want to prevent certain phrases from entering my model. For example, I want to keep "red roses" out of my analysis. I understand how to add individual stop words, as shown in Adding words to scikit-learn's CountVectorizer's stop list, by doing:

from sklearn.feature_extraction import text
additional_stop_words=['red','roses']

However, this also prevents other ngrams, such as "red tulips" or "blue roses", from being detected.

I am building a TfidfVectorizer as part of my model, and I realize the processing I need may have to happen after that stage, but I am not sure how to do it.

My ultimate goal is to do topic modeling on a piece of text. Here is the code I am working with (borrowed almost directly from https://de.dariah.eu/tatom/topic_model_python.html#index-0):

import numpy as np
from sklearn import decomposition
from sklearn.feature_extraction import text

additional_stop_words = ['red', 'roses']

sw = text.ENGLISH_STOP_WORDS.union(additional_stop_words)
mod_vectorizer = text.TfidfVectorizer(
    ngram_range=(2, 3),
    stop_words=sw,
    norm='l2',
    min_df=5
)

# df[col] is the text column of the dataframe shown below.
dtm = mod_vectorizer.fit_transform(df[col]).toarray()
vocab = np.array(mod_vectorizer.get_feature_names())  # get_feature_names_out() in scikit-learn >= 1.0
num_topics = 5
num_top_words = 5
m_clf = decomposition.LatentDirichletAllocation(
    n_topics=num_topics,  # renamed to n_components in newer scikit-learn versions
    random_state=1
)

doctopic = m_clf.fit_transform(dtm)
topic_words = []

for topic in m_clf.components_:
    word_idx = np.argsort(topic)[::-1][:num_top_words]
    topic_words.append([vocab[i] for i in word_idx])

doctopic = doctopic / np.sum(doctopic, axis=1, keepdims=True)
for t in range(len(topic_words)):
    print("Topic {}: {}".format(t, ','.join(topic_words[t][:5])))

Edit

Sample dataframe (I tried to insert as many edge cases as possible), df:

   Content
0  I like red roses as much as I like blue tulips.
1  It would be quite unusual to see red tulips, but not RED ROSES
2  It is almost impossible to find blue roses
3  I like most red flowers, but roses are my favorite.
4  Could you buy me some red roses?
5  John loves the color red. Roses are Mary's favorite flowers.

Best answer

TfidfVectorizer accepts a custom preprocessor. You can use it to make whatever adjustments you need.

For example, to remove all consecutive occurrences of the "red" + "roses" tokens from your sample corpus (case-insensitively), use:

import re
import numpy as np
from sklearn.feature_extraction import text

cases = ["I like red roses as much as I like blue tulips.",
         "It would be quite unusual to see red tulips, but not RED ROSES",
         "It is almost impossible to find blue roses",
         "I like most red flowers, but roses are my favorite.",
         "Could you buy me some red roses?",
         "John loves the color red. Roses are Mary's favorite flowers."]

# remove_stop_phrases() is our custom preprocessing function.
def remove_stop_phrases(doc):
    # note: this regex considers "... red. Roses ..." fair game for removal.
    # if that's not what you want, use [r"red roses"] instead.
    stop_phrases = [r"red(\s?\.?\s?)roses"]
    for phrase in stop_phrases:
        doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
    return doc

sw = text.ENGLISH_STOP_WORDS
mod_vectorizer = text.TfidfVectorizer(
    ngram_range=(2, 3),
    stop_words=sw,
    norm='l2',
    min_df=1,
    preprocessor=remove_stop_phrases  # plug in our custom preprocessor
)

dtm = mod_vectorizer.fit_transform(cases).toarray()
vocab = np.array(mod_vectorizer.get_feature_names())  # get_feature_names_out() in scikit-learn >= 1.0

Now vocab has all "red roses" references removed. (Note that supplying a custom preprocessor replaces the vectorizer's default preprocessing step, so lowercasing is no longer applied — which is why capitalized tokens such as 'Could buy' appear below.)

print(sorted(vocab))

['Could buy',
'It impossible',
'It impossible blue',
'It quite',
'It quite unusual',
'John loves',
'John loves color',
'Mary favorite',
'Mary favorite flowers',
'blue roses',
'blue tulips',
'color Mary',
'color Mary favorite',
'favorite flowers',
'flowers roses',
'flowers roses favorite',
'impossible blue',
'impossible blue roses',
'like blue',
'like blue tulips',
'like like',
'like like blue',
'like red',
'like red flowers',
'loves color',
'loves color Mary',
'quite unusual',
'quite unusual red',
'red flowers',
'red flowers roses',
'red tulips',
'roses favorite',
'unusual red',
'unusual red tulips']

Update (per the comment thread):

To pass the desired stop words along with custom stop phrases into a wrapper function, use:

desired_stop_phrases = [r"red(\s?\.?\s?)roses"]
desired_stop_words = ['Could', 'buy']

def wrapper(stop_words, stop_phrases):

    def remove_stop_phrases(doc):
        for phrase in stop_phrases:
            doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
        return doc

    sw = text.ENGLISH_STOP_WORDS.union(stop_words)
    mod_vectorizer = text.TfidfVectorizer(
        ngram_range=(2, 3),
        stop_words=sw,
        norm='l2',
        min_df=1,
        preprocessor=remove_stop_phrases
    )

    dtm = mod_vectorizer.fit_transform(cases).toarray()
    vocab = np.array(mod_vectorizer.get_feature_names())  # get_feature_names_out() in scikit-learn >= 1.0

    return vocab

wrapper(desired_stop_words, desired_stop_phrases)
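Since the question mentions pandas, the same phrase removal can also be done directly on a DataFrame column before vectorizing. This is a hedged sketch not taken from the original answer: it applies the answer's regex to a pandas Series via str.replace with a compiled, case-insensitive pattern.

```python
import re
import pandas as pd

# Two of the question's example rows, used here for illustration.
df = pd.DataFrame({"Content": ["Could you buy me some red roses?",
                               "John loves the color red. Roses are Mary's favorite flowers."]})

# Compile the stop-phrase regex once; IGNORECASE also catches "red. Roses".
pattern = re.compile(r"red(\s?\.?\s?)roses", flags=re.IGNORECASE)

# Strip the phrase from every row in one vectorized pass.
df["Content_clean"] = df["Content"].str.replace(pattern, "", regex=True)
print(df["Content_clean"].tolist())
```

The cleaned column can then be fed to the TfidfVectorizer without needing a custom preprocessor, which also keeps the vectorizer's default lowercasing intact.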

Regarding "python - How to remove stop phrases / stop ngrams (multi-word strings) using pandas/sklearn?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/45426215/
