
python - How to plot text clusters?

Reposted · Author: 行者123 · Updated: 2023-12-04 14:08:20

I have started learning clustering in Python with the sklearn library. I wrote a simple piece of code to cluster text data.
My goal is to find groups/clusters of similar sentences.
I tried to plot them, but I failed.

The problem is the text data; I always get this error:

ValueError: setting an array element with a sequence.

The same approach works for numeric data but not for text data.
Is there a way to plot groups/clusters of similar sentences?
Also, is there a way to tell what these groups are and what they represent? How can I identify them?
I printed labels = kmeans.predict(x), but that is just a list of numbers. What do they represent?
import pandas as pd
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt


x = ['this is very good show' , 'i had a great time on my school trip', 'such a boring movie', 'Springbreak was amazing', 'You are wrong', 'This food is so tasty', 'I had so much fun last night', 'This is crap', 'I had a bad time last month',
'i love this product' , 'this is an amazing item', 'this food is delicious', 'I had a great time last night', 'thats right',
'this is my favourite restaurant' , 'i love this food, its so good', 'skiing is the best sport', 'what is this', 'this product has a lot of bugs',
'I love basketball, its very dynamic' , 'its a shame that you missed the trip', 'game last night was amazing', 'Party last night was so boring',
'such a nice song' , 'this is the best movie ever', 'hawaii is the best place for trip','how that happened','This is my favourite band',
'I cant believe that you did that', 'Why are you doing that, I do not gete it', 'this is tasty', 'this song is amazing']

cv = CountVectorizer(analyzer = 'word', max_features = 5000, lowercase=True, preprocessor=None, tokenizer=None, stop_words = 'english')
x = cv.fit_transform(x)
#x_test = cv.transform(x_test)


my_list = []

for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 0)
    kmeans.fit(x)
    my_list.append(kmeans.inertia_)
    labels = kmeans.predict(x)  # this prints the array of cluster numbers
    print(labels)

plt.plot(range(1,11),my_list)
plt.show()



kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 0)
y_kmeans = kmeans.fit_predict(x)

plt.scatter(x[y_kmeans == 0,0], x[y_kmeans==0,1], s = 15, c= 'red', label = 'Cluster_1')
plt.scatter(x[y_kmeans == 1,0], x[y_kmeans==1,1], s = 15, c= 'blue', label = 'Cluster_2')
plt.scatter(x[y_kmeans == 2,0], x[y_kmeans==2,1], s = 15, c= 'green', label = 'Cluster_3')
plt.scatter(x[y_kmeans == 3,0], x[y_kmeans==3,1], s = 15, c= 'cyan', label = 'Cluster_4')
plt.scatter(x[y_kmeans == 4,0], x[y_kmeans==4,1], s = 15, c= 'magenta', label = 'Cluster_5')

plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], s = 100, c = 'black', label = 'Centroids')
plt.show()
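As a hedged aside (my diagnosis, not stated in the original post): CountVectorizer.fit_transform returns a SciPy sparse matrix, not a NumPy array, and passing it straight to matplotlib's indexing/plotting is the usual source of this ValueError. A minimal sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer
import scipy.sparse as sp

cv = CountVectorizer()
x = cv.fit_transform(['good show', 'bad movie'])

# The vectorizer output is sparse; matplotlib expects dense arrays
print(type(x))
assert sp.issparse(x)

# .toarray() produces the dense NumPy array that plotting code can index
dense = x.toarray()
print(dense.shape)  # (2 sentences, 4 vocabulary terms)
```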

Best Answer

This question has several moving parts:

  • How to vectorize the text into data that k-means clustering can understand
  • How to plot the clusters in two-dimensional space
  • How to label the plot with the source sentences

My solution follows a very common approach: use the k-means labels as the colors of a scatter plot. (After fitting, the k-means values are just 0, 1, 2, 3, 4, indicating which arbitrary group each sentence was assigned to. The output is in the same order as the original samples.) To project the points into two-dimensional space, I use Principal Component Analysis (PCA). Note that I run the k-means clustering on the full data, not on the dimensionality-reduced output. I then use matplotlib's ax.annotate() to decorate the plot with the original sentences. (I also make the figure larger so there is space between the points.) I can comment further on request.
    import pandas as pd
    import re
    from sklearn.decomposition import PCA
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt

    x = ['this is very good show' , 'i had a great time on my school trip', 'such a boring movie', 'Springbreak was amazing', 'You are wrong', 'This food is so tasty', 'I had so much fun last night', 'This is crap', 'I had a bad time last month',
    'i love this product' , 'this is an amazing item', 'this food is delicious', 'I had a great time last night', 'thats right',
    'this is my favourite restaurant' , 'i love this food, its so good', 'skiing is the best sport', 'what is this', 'this product has a lot of bugs',
    'I love basketball, its very dynamic' , 'its a shame that you missed the trip', 'game last night was amazing', 'Party last night was so boring',
    'such a nice song' , 'this is the best movie ever', 'hawaii is the best place for trip','how that happened','This is my favourite band',
    'I cant believe that you did that', 'Why are you doing that, I do not gete it', 'this is tasty', 'this song is amazing']

    cv = CountVectorizer(analyzer = 'word', max_features = 5000, lowercase=True, preprocessor=None, tokenizer=None, stop_words = 'english')
    vectors = cv.fit_transform(x)
    kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 0)
    kmean_indices = kmeans.fit_predict(vectors)

    pca = PCA(n_components=2)
    scatter_plot_points = pca.fit_transform(vectors.toarray())

    colors = ["r", "b", "c", "y", "m" ]

    x_axis = [o[0] for o in scatter_plot_points]
    y_axis = [o[1] for o in scatter_plot_points]
    fig, ax = plt.subplots(figsize=(20,10))

    ax.scatter(x_axis, y_axis, c=[colors[d] for d in kmean_indices])

    for i, txt in enumerate(x):
        ax.annotate(txt, (x_axis[i], y_axis[i]))

    plt.show()

[Figure: PCA scatter plot of the five clusters, each point annotated with its source sentence]

Regarding "python - How to plot text clusters?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57626286/
