
python - How to plot text clusters?

Reposted · Author: 行者123 · Updated: 2023-12-04 14:08:20

I have started learning clustering in Python with the sklearn library. I wrote a simple piece of code to cluster text data.
My goal is to find groups/clusters of similar sentences.
I tried to plot them, but I failed.

The problem is the text data; I always get this error:

ValueError: setting an array element with a sequence.

The same approach works for numeric data but not for text data.
Is there a way to plot groups/clusters of similar sentences?
Also, is there a way to tell what these groups are and what they represent? How can I identify them?
I printed labels = kmeans.predict(x), but that is just a list of numbers. What do they represent?
import pandas as pd
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt


x = ['this is very good show' , 'i had a great time on my school trip', 'such a boring movie', 'Springbreak was amazing', 'You are wrong', 'This food is so tasty', 'I had so much fun last night', 'This is crap', 'I had a bad time last month',
'i love this product' , 'this is an amazing item', 'this food is delicious', 'I had a great time last night', 'thats right',
'this is my favourite restaurant' , 'i love this food, its so good', 'skiing is the best sport', 'what is this', 'this product has a lot of bugs',
'I love basketball, its very dynamic' , 'its a shame that you missed the trip', 'game last night was amazing', 'Party last night was so boring',
'such a nice song' , 'this is the best movie ever', 'hawaii is the best place for trip','how that happened','This is my favourite band',
'I cant believe that you did that', 'Why are you doing that, I do not gete it', 'this is tasty', 'this song is amazing']

cv = CountVectorizer(analyzer = 'word', max_features = 5000, lowercase=True, preprocessor=None, tokenizer=None, stop_words = 'english')
x = cv.fit_transform(x)
#x_test = cv.transform(x_test)


my_list = []

for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 0)
    kmeans.fit(x)
    my_list.append(kmeans.inertia_)
    labels = kmeans.predict(x)  # this prints the array of cluster numbers
    print(labels)

plt.plot(range(1,11),my_list)
plt.show()



kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 0)
y_kmeans = kmeans.fit_predict(x)

plt.scatter(x[y_kmeans == 0,0], x[y_kmeans==0,1], s = 15, c= 'red', label = 'Cluster_1')
plt.scatter(x[y_kmeans == 1,0], x[y_kmeans==1,1], s = 15, c= 'blue', label = 'Cluster_2')
plt.scatter(x[y_kmeans == 2,0], x[y_kmeans==2,1], s = 15, c= 'green', label = 'Cluster_3')
plt.scatter(x[y_kmeans == 3,0], x[y_kmeans==3,1], s = 15, c= 'cyan', label = 'Cluster_4')
plt.scatter(x[y_kmeans == 4,0], x[y_kmeans==4,1], s = 15, c= 'magenta', label = 'Cluster_5')

plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], s = 100, c = 'black', label = 'Centroids')
plt.show()
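As a hedged aside (my diagnosis, not stated in the original post): CountVectorizer.fit_transform returns a SciPy sparse matrix, not a NumPy array, and passing it straight to matplotlib's indexing/plotting is the usual source of this ValueError. A minimal sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer
import scipy.sparse as sp

cv = CountVectorizer()
x = cv.fit_transform(['good show', 'bad movie'])

# The vectorizer output is sparse; matplotlib expects dense arrays
print(type(x))
assert sp.issparse(x)

# .toarray() produces the dense NumPy array that plotting code can index
dense = x.toarray()
print(dense.shape)  # (2 sentences, 4 vocabulary terms)
```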

Best Answer

This question has several moving parts:

  • How to vectorize the text into data that k-means clustering can understand
  • How to plot the clusters in two-dimensional space
  • How to label the plot with the source sentences

My solution follows a very common approach: use the k-means labels as the colors of a scatter plot. (After fitting, the k-means values are just 0, 1, 2, 3, 4, indicating which arbitrary group each sentence was assigned to. The output is in the same order as the original samples.) To project the points into two-dimensional space, I use Principal Component Analysis (PCA). Note that I run the k-means clustering on the full data, not on the dimensionality-reduced output. I then use matplotlib's ax.annotate() to decorate the plot with the original sentences. (I also make the figure larger so there is space between the points.) I can comment further on request.
    import pandas as pd
    import re
    from sklearn.decomposition import PCA
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt

    x = ['this is very good show' , 'i had a great time on my school trip', 'such a boring movie', 'Springbreak was amazing', 'You are wrong', 'This food is so tasty', 'I had so much fun last night', 'This is crap', 'I had a bad time last month',
    'i love this product' , 'this is an amazing item', 'this food is delicious', 'I had a great time last night', 'thats right',
    'this is my favourite restaurant' , 'i love this food, its so good', 'skiing is the best sport', 'what is this', 'this product has a lot of bugs',
    'I love basketball, its very dynamic' , 'its a shame that you missed the trip', 'game last night was amazing', 'Party last night was so boring',
    'such a nice song' , 'this is the best movie ever', 'hawaii is the best place for trip','how that happened','This is my favourite band',
    'I cant believe that you did that', 'Why are you doing that, I do not gete it', 'this is tasty', 'this song is amazing']

    cv = CountVectorizer(analyzer = 'word', max_features = 5000, lowercase=True, preprocessor=None, tokenizer=None, stop_words = 'english')
    vectors = cv.fit_transform(x)
    kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 0)
    kmean_indices = kmeans.fit_predict(vectors)

    pca = PCA(n_components=2)
    scatter_plot_points = pca.fit_transform(vectors.toarray())

    colors = ["r", "b", "c", "y", "m" ]

    x_axis = [o[0] for o in scatter_plot_points]
    y_axis = [o[1] for o in scatter_plot_points]
    fig, ax = plt.subplots(figsize=(20,10))

    ax.scatter(x_axis, y_axis, c=[colors[d] for d in kmean_indices])

    for i, txt in enumerate(x):
        ax.annotate(txt, (x_axis[i], y_axis[i]))

    plt.show()

[Figure: PCA scatter plot of the five clusters, each point annotated with its source sentence]

Regarding "python - How to plot text clusters?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57626286/
