python - Clustering a list of lists of words


Reposted by: 行者123. Updated: 2023-11-30 09:17:08

For example, suppose I have a list of lists of words:

[['apple','banana'],
 ['apple','orange'],
 ['banana','orange'],
 ['rice','potatoes','orange'],
 ['potatoes','rice']]

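The real collection is much larger. I want to cluster words that usually occur together, so that such words end up in the same cluster. In this example, the clusters would be ['apple', 'banana', 'orange'] and ['rice', 'potatoes'].
What is the best way to achieve this kind of clustering?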

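Best answer

I think it is more natural to treat this as a graph problem.

For example, you can treat apple as node 0 and banana as node 1; the first list then says that there is an edge between nodes 0 and 1.

So first convert the labels to numbers: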

from sklearn.preprocessing import LabelEncoder

# Fit the encoder on every distinct word so each one gets an integer id
le = LabelEncoder()
le.fit(['apple', 'banana', 'orange', 'rice', 'potatoes'])
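
The encoder hands out integer codes in sorted label order, which is why potatoes ends up as 3 and rice as 4 even though rice was listed first. A minimal standard-library sketch of the same mapping, in case you would rather not depend on scikit-learn for this step (the names word_to_id and id_to_word are made up for illustration):

```python
# Hypothetical stand-in for LabelEncoder: sorted unique words -> integer ids.
words = ['apple', 'banana', 'orange', 'rice', 'potatoes']

# Sorting first mirrors the sorted order in which LabelEncoder stores its classes.
word_to_id = {word: idx for idx, word in enumerate(sorted(set(words)))}

# Reverse lookup, playing the role of inverse_transform.
id_to_word = {idx: word for word, idx in word_to_id.items()}

print(word_to_id)
# {'apple': 0, 'banana': 1, 'orange': 2, 'potatoes': 3, 'rice': 4}
```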

Now define the word lists, each one reduced to a pair:

# 'orange' is dropped from the fourth list because an edge joins exactly two nodes.
# You could instead expand the triple into 3 pairs, or pick a different approach.
l = [['apple', 'banana'],
     ['apple', 'orange'],
     ['banana', 'orange'],
     ['rice', 'potatoes'],
     ['potatoes', 'rice']]
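
The comment above mentions turning a triple into 3 pairs. One way to do that, sketched with itertools.combinations from the standard library (the helper name to_pairs is hypothetical, and the input is the question's original data with orange kept):

```python
from itertools import combinations

groups = [['apple', 'banana'],
          ['apple', 'orange'],
          ['banana', 'orange'],
          ['rice', 'potatoes', 'orange'],
          ['potatoes', 'rice']]

def to_pairs(groups):
    """Expand each word list into all unordered pairs of its distinct words."""
    pairs = []
    for group in groups:
        # sorted(set(...)) removes repeats and gives every pair a stable order
        pairs.extend(combinations(sorted(set(group)), 2))
    return pairs

pairs = to_pairs(groups)
print(len(pairs))  # 7: one pair from each 2-word list, three from the triple
```

Be aware that keeping orange in the fourth list links it to both rice and potatoes, so a plain connected-components search on these pairs would put all five words in a single cluster. That is presumably why the answer drops it.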

Convert the labels to numbers:

edges = [le.transform(x) for x in l]

>>> edges

[array([0, 1], dtype=int64),
array([0, 2], dtype=int64),
array([1, 2], dtype=int64),
array([4, 3], dtype=int64),
array([3, 4], dtype=int64)]

Now build the graph and add the edges:

import networkx as nx  # graph package

G = nx.Graph()  # create an undirected graph
for e in edges:
    # cast to plain int so the node labels print as 0, 1, 2, ...
    G.add_edge(int(e[0]), int(e[1]))

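Now you can use the connected_components function to find the groups of connected vertices.

# connected_component_subgraphs was removed in NetworkX 2.4;
# connected_components yields one set of nodes per component.
components = nx.connected_components(G)
comp_dict = {idx: sorted(int(n) for n in comp) for idx, comp in enumerate(components)}
print(comp_dict)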

Output:

{0: [0, 1, 2], 1: [3, 4]}

Or, mapped back to the original words:

print([le.inverse_transform(v) for v in comp_dict.values()])

Output:

[array(['apple', 'banana', 'orange']), array(['potatoes', 'rice'])]

Those are your 2 clusters.
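
Since graph nodes do not have to be integers, the same idea can be sketched on the words themselves with only the standard library. This is not the answer's NetworkX code but a breadth-first search over an adjacency dictionary, offered as an illustration; the function name cluster_words is an assumption:

```python
from collections import defaultdict, deque

def cluster_words(pairs):
    """Group words into the connected components of the co-occurrence graph."""
    # Build an undirected adjacency map: word -> set of neighbouring words.
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    clusters = []
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search collects every word reachable from `start`.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            word = queue.popleft()
            component.append(word)
            for other in neighbours[word]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        clusters.append(sorted(component))
    return clusters

pairs = [('apple', 'banana'),
         ('apple', 'orange'),
         ('banana', 'orange'),
         ('rice', 'potatoes'),
         ('potatoes', 'rice')]

clusters = cluster_words(pairs)
print(clusters)
# [['apple', 'banana', 'orange'], ['potatoes', 'rice']]
```

Working on the words directly skips the encode and decode steps, at the cost of writing the traversal by hand.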

Regarding python - clustering a list of lists of words, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/52764682/
