
python - Splitting coef into arrays applicable for multi class

Reposted. Author: 太空宇宙. Updated: 2023-11-03 11:39:44

I use this function to plot the best and worst features (coef) for each label.

def plot_coefficients(classifier, feature_names, top_features=20):
    coef = classifier.coef_.ravel()
    for i in np.split(coef,6):
        top_positive_coefficients = np.argsort(i)[-top_features:]
        top_negative_coefficients = np.argsort(i)[:top_features]
        top_coefficients = np.hstack([top_negative_coefficients, top_positive_coefficients])
    # create plot
    plt.figure(figsize=(15, 5))
    colors = ["red" if c < 0 else "blue" for c in i[top_coefficients]]
    plt.bar(np.arange(2 * top_features), i[top_coefficients], color=colors)
    feature_names = np.array(feature_names)
    plt.xticks(np.arange(1, 1 + 2 * top_features), feature_names[top_coefficients], rotation=60, ha="right")
    plt.show()

Applying it to sklearn.LinearSVC:

if (name == "LinearSVC"):
    print(clf.coef_)
    print(clf.intercept_)
    plot_coefficients(clf, cv.get_feature_names())

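The CountVectorizer in use has a dimension of (15258, 26728). This is a multi-class decision problem with 6 labels. Using .ravel returns a flat array of length 6*26728=160368. This means that all indices higher than 26728 are out of bounds for axis 1. Here are the top and bottom indices for one label: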

i[ 0. 0. 0.07465654 ... -0.02112607  0. -0.13656274]
Top [39336 35593 29445 29715 36418 28631 28332 40843 34760 35887 48455 27753
33291 54136 36067 33961 34644 38816 36407 35781]

i[ 0. 0. 0.07465654 ... -0.02112607 0. -0.13656274]
Bot [39397 40215 34521 39392 34586 32206 36526 42766 48373 31783 35404 30296
33165 29964 50325 53620 34805 32596 34807 40895]

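The first entry in the "Top" list has index 39336. This corresponds to entry 39336-26728=12608 in the vocabulary. What do I need to change in the code to make this applicable?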
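The index arithmetic above can be checked with a short sketch. It assumes the (6, 26728) shape stated in the question and that the array was flattened in NumPy's default row-major order; the variable names are illustrative only.

```python
import numpy as np

# Shape of coef_ as stated in the question: 6 labels, 26728 features.
n_classes, n_features = 6, 26728

# First entry of the "Top" list, an index into the raveled array.
flat_index = 39336

# Map the flat index back to (row, column) = (class row, feature index).
row, col = np.unravel_index(flat_index, (n_classes, n_features))
print(row, col)  # 1 12608

# The same mapping with plain integer arithmetic.
assert (row, col) == divmod(flat_index, n_features)
```

Under these assumptions the entry belongs to the second row of coef_ (row index 1) and to vocabulary entry 12608.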

Edit:

X_train = sparse.hstack([training_sentences,entities1train,predictionstraining_entity1,entities2train,predictionstraining_entity2,graphpath_training,graphpathlength_training])
y_train = DFTrain["R"]


X_test = sparse.hstack([testing_sentences,entities1test,predictionstest_entity1,entities2test,predictionstest_entity2,graphpath_testing,graphpathlength_testing])
y_test = DFTest["R"]

Dimensions: (15258, 26728) (15258, 26728) (0, 0) 1 ... (15257, 0) 1 (15258, 26728) (0, 0) 1 ... (15257, 0) 1 (15258, 26728) (15258L, 1L)

File "TwoFeat.py", line 708, in plot_coefficients
colors = ["red" if c < 0 else "blue" for c in i[top_coefficients]]
MemoryError

Best answer

First, do you really need to use ravel()?

LinearSVC (or in fact any other classifier that has coef_) gives coef_ in the shape:

coef_ : array, shape = [n_features] if n_classes == 2 else [n_classes, n_features]

Weights assigned to the features (coefficients in the primal problem).

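So it has as many rows as there are classes and as many columns as there are features. For each class, you just need to access the right row. The order of the classes is available from the classifier.classes_ attribute.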
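A minimal sketch of that layout, using a tiny synthetic 3-class dataset that is made up purely for illustration (it is not the data from the question):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Made-up data: 30 samples, 5 features, 3 string labels.
rng = np.random.RandomState(0)
X = rng.rand(30, 5)
y = np.repeat(["a", "b", "c"], 10)

clf = LinearSVC(random_state=0).fit(X, y)

# One row per class, one column per feature.
print(clf.coef_.shape)  # (3, 5)
print(clf.classes_)     # ['a' 'b' 'c']

# Row k of coef_ belongs to classes_[k], so no ravel() or np.split() is needed.
for label, class_coef in zip(clf.classes_, clf.coef_):
    print(label, class_coef.shape)
```

Iterating over the rows directly keeps every index within the range of the feature names.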

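Secondly, the indentation of your code is wrong. The plotting code should be inside the for loop so that a plot is drawn for each class. Currently it is outside the scope of the for loop, so it will only plot the last class.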
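The effect of the indentation can be seen in a small sketch with made-up coefficients (two classes, three features), where a simple argmax stands in for the plotting code:

```python
import numpy as np

# Made-up coefficients: 2 classes, 3 features.
coef = np.array([[0.2, -0.5, 0.1],
                 [-0.3, 0.4, 0.0]])

# Work placed after the loop only sees the value from the final iteration.
for class_coef in coef:
    best = int(np.argmax(class_coef))
after_loop = [best]

# Work placed inside the loop runs once per class.
inside_loop = []
for class_coef in coef:
    best = int(np.argmax(class_coef))
    inside_loop.append(best)

print(after_loop)   # [1]
print(inside_loop)  # [0, 1]
```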

Correcting these two things, here is a reproducible example that plots the top and bottom features for each class.

def plot_coefficients(classifier, feature_names, top_features=20):

    # Access the coefficients from classifier
    coef = classifier.coef_

    # Access the classes
    classes = classifier.classes_

    # Iterate the loop for number of classes
    for i in range(len(classes)):

        print(classes[i])

        # Access the row containing the coefficients for this class
        class_coef = coef[i]

        # Below this, I have just replaced 'i' in your code with 'class_coef'
        # Pass this to get top and bottom features
        top_positive_coefficients = np.argsort(class_coef)[-top_features:]
        top_negative_coefficients = np.argsort(class_coef)[:top_features]

        # Concatenate the above two
        top_coefficients = np.hstack([top_negative_coefficients,
                                      top_positive_coefficients])
        # create plot
        plt.figure(figsize=(10, 3))

        colors = ["red" if c < 0 else "blue" for c in class_coef[top_coefficients]]
        plt.bar(np.arange(2 * top_features), class_coef[top_coefficients], color=colors)
        feature_names = np.array(feature_names)

        # Here I corrected the start to 0 (your code has 1, which shifted the labels).
        # The stop is 2 * top_features so there is exactly one tick per label.
        plt.xticks(np.arange(0, 2 * top_features),
                   feature_names[top_coefficients], rotation=60, ha="right")
        plt.show()

Now just use this method as you like:

import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space']

dataset = fetch_20newsgroups(subset='all', categories=categories,
                             shuffle=True, random_state=42)
vectorizer = CountVectorizer()

# Just to replace classes from integers to their actual labels,
# you can use anything as you like in y
y = []
mapping_dict = dict(enumerate(dataset.target_names))
for i in dataset.target:
    y.append(mapping_dict[i])

# Learn the words from data
X = vectorizer.fit_transform(dataset.data)

clf = LinearSVC(random_state=42)
clf.fit(X, y)

# get_feature_names() was removed in scikit-learn 1.2;
# on versions older than 1.0 use vectorizer.get_feature_names() instead.
plot_coefficients(clf, vectorizer.get_feature_names_out())

Output of the above code:

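[Figure: top and bottom features for 'alt.atheism']

[Figure: top and bottom features for 'comp.graphics']

[Figure: top and bottom features for 'sci.space']

[Figure: top and bottom features for 'talk.religion.misc']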
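If a text summary is easier to inspect than the figures, the same per-row selection can be returned as lists of feature names. This is only a sketch: the helper name top_features_per_class and the small coefficient matrix are hypothetical, and a fitted classifier's coef_ and classes_ would be passed in their place.

```python
import numpy as np

def top_features_per_class(coef, classes, feature_names, top_features=3):
    """Return {class: (most negative names, most positive names)}."""
    feature_names = np.asarray(feature_names)
    result = {}
    for label, class_coef in zip(classes, coef):
        order = np.argsort(class_coef)  # ascending by coefficient
        bottom = feature_names[order[:top_features]].tolist()
        # Reverse so the largest coefficient comes first.
        top = feature_names[order[-top_features:]][::-1].tolist()
        result[label] = (bottom, top)
    return result

# Made-up coefficients: 2 classes, 4 features.
coef = np.array([[0.5, -0.2, 0.0, 0.9],
                 [-0.7, 0.3, 0.1, -0.1]])
names = ["w0", "w1", "w2", "w3"]

summary = top_features_per_class(coef, ["x", "y"], names, top_features=2)
for label, (bottom, top) in summary.items():
    print(label, "bottom:", bottom, "top:", top)
```

With a real model the call would look like top_features_per_class(clf.coef_, clf.classes_, feature_names). Note that for a binary problem LinearSVC stores a single row in coef_, so this sketch (like the plotting function) assumes more than two classes.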

Regarding python - Splitting coef into arrays applicable for multi class, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/52042843/
