
python-3.x - Extracting the top words for each cluster

Reposted. Author: 行者123. Updated: 2023-11-30 09:16:16

I have run K-means clustering on my text data

#K-means clustering
from sklearn.cluster import KMeans
num_clusters = 4
km = KMeans(n_clusters=num_clusters)
%time km.fit(features)
clusters = km.labels_.tolist()

where features is the tf-idf matrix

#preprocessing text - converting to a tf-idf vector form

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=0.01,max_df=0.75, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
features = tfidf.fit_transform(df.keywrds).toarray()
labels = df.CD

Then I added the cluster labels to the original dataset

df['clusters'] = clusters

and indexed the DataFrame by cluster

pd.DataFrame(df,index = [clusters])

How do I get the top words for each cluster?

Best answer

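This does not really give the most frequent words in each cluster; instead, it orders the words in each row by how common they are. You can then use the first few words as a word group that labels the row, in place of a cluster number.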

Build a dictionary of all feature names and their tfidf scores (here, the idf_ weights)

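from collections import defaultdict
featurenames = defaultdict(list)  # n-gram length -> [(feature, idf), ...]
# get_feature_names() was removed in scikit-learn 1.2; older versions use it instead
for f, w in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    featurenames[len(f.split(' '))].append((f, w))
featurenames = dict(featurenames[1])  # keep single-word features only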

Round the feature idf values, since they are a bit long

featurenames = dict(zip(featurenames.keys(), [round(v, 4) for v in featurenames.values()]))

Convert the dict to a df

dffeatures = pd.DataFrame.from_dict(featurenames, orient='index').reset_index() \
.rename(columns={'index': 'featurename',0:'featureid'})
dffeatures = dffeatures.round(4)

Combine each feature word with its id (the rounded idf value) to create a new dictionary. I did this to accommodate duplicate ids.

dffeatures['combined'] = dffeatures.apply(lambda x:'%s:%s' % (x['featureid'],x['featurename']),axis=1)
featurenamesnew = pd.Series(dffeatures.combined.values, index=dffeatures.featurename).to_dict()

{'cat': '2.3863:cat', 'cow': '3.0794:cow', 'dog': '2.674:dog'....}

Create a new column in the df and replace every word with its idf:feature value

df['temp'] = df['inputdata'].replace(featurenamesnew, regex=True)
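One thing to watch with this step: with regex=True the dictionary keys are treated as regular expressions, so a short word such as cat can also match inside a longer one such as catalog. A possible workaround, sketched here with the standard re module, is a single pattern anchored on word boundaries. The names word_to_tag and tag_words are hypothetical, and the mapping just mirrors the shape of featurenamesnew.

```python
import re

# Hypothetical mapping in the same shape as featurenamesnew.
word_to_tag = {"cat": "2.3863:cat", "cow": "3.0794:cow", "dog": "2.674:dog"}

# One alternation of escaped words, anchored on word boundaries.
alternatives = "|".join(
    re.escape(word) for word in sorted(word_to_tag, key=len, reverse=True)
)
pattern = re.compile(r"\b(" + alternatives + r")\b")

def tag_words(text):
    """Replace each whole word found in word_to_tag with its 'idf:word' tag."""
    return pattern.sub(lambda match: word_to_tag[match.group(1)], text)

print(tag_words("dog chases cat in the catalog"))
```

Applied to the DataFrame, this would look something like df['inputdata'].apply(tag_words), assuming the column holds plain strings.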

Sort the idf:feature values in ascending order so that the most common words come first

df['temp'] = df['temp'].str.split().apply(lambda x: sorted(set(x), reverse=False)).str.join(' ').to_frame()
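Note that sorted() here compares strings, not numbers. That gives the intended order as long as every idf value is below 10, but a tag such as 10.2:zebra would sort ahead of 2.3863:cat. If idf values of 10 or more can occur (for example in a large corpus with very rare terms), sorting on the numeric prefix is safer. A small sketch, with idf_of and sort_tags_by_idf as hypothetical helper names:

```python
def idf_of(token):
    """Return the numeric idf prefix of an 'idf:word' token, or None."""
    prefix, sep, _ = token.partition(":")
    if not sep:
        return None
    try:
        return float(prefix)
    except ValueError:
        return None

def sort_tags_by_idf(tagged_text):
    """Sort unique 'idf:word' tokens by idf, lowest (most common) first.

    Tokens without a numeric idf prefix are dropped.
    """
    tags = {tok for tok in tagged_text.split() if idf_of(tok) is not None}
    return sorted(tags, key=lambda tok: (idf_of(tok), tok))

tagged = "10.2:zebra 2.674:dog 2.3863:cat 2.674:dog"
string_order = sorted(set(tagged.split()))   # plain string comparison
numeric_order = sort_tags_by_idf(tagged)     # comparison on the idf value
print(string_order)
print(numeric_order)
```

The string sort places 10.2:zebra first, while the numeric sort places it last.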

Map the idf:feature values back to the words

inv_map = {v: k for k, v in featurenamesnew.items()}
df['cluster_top_n_words'] = df['temp'].replace(inv_map, regex=True)

Finally, keep only the top n words in the new df column

df['cluster_top_n_words'] = df['cluster_top_n_words'].apply(lambda x: ' '.join(x.split()[:3]))

Regarding python-3.x - Extracting the top words for each cluster, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55293010/
