
python - Find the top n terms with the highest TF-IDF score per class

Reposted. Author: 行者123. Updated: 2023-12-04 12:40:10

Let's suppose that I have a pandas dataframe with two columns, similar to the following:

       text                              label
    0  This restaurant was amazing       Positive
    1  The food was served cold          Negative
    2  The waiter was a bit rude         Negative
    3  I love the view from its balcony  Positive

I am then applying TfidfVectorizer from sklearn to this dataset.
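For reference, the setup described so far might look like the following minimal sketch. The variable names `df`, `vectorizer`, `X`, and `y` are assumptions chosen for illustration, and the vectorizer is left at its default settings.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# The four example rows from the question.
df = pd.DataFrame({
    "text": [
        "This restaurant was amazing",
        "The food was served cold",
        "The waiter was a bit rude",
        "I love the view from its balcony",
    ],
    "label": ["Positive", "Negative", "Negative", "Positive"],
})

# Fit TF-IDF on the text column; X is a sparse matrix with one row per document
# and one column per vocabulary term.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["text"])
y = df["label"].to_numpy()

print(X.shape)
```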

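What is the most efficient way to find the top n terms of the vocabulary, ranked by TF-IDF score, for each class?

Obviously, my actual dataframe contains far more rows than the 4 shown above.

The point of my post is to find code that works for any dataframe similar to the one above, whether it has 4 rows or 1M rows.

I think my post is closely related to the following posts: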
  • Scikit Learn TfidfVectorizer : How to get top n terms with highest tf-idf score
  • How to see top n entries of term-document matrix after tfidf in scikit-learn

Best answer

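Below you can find a piece of code that I wrote more than three years ago for a similar purpose. I am not sure whether it is the most efficient way to do what you want, but as far as I recall, it worked for me.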

    import numpy as np

    # X: TF-IDF matrix of the data points (sparse, as returned by vectorizer.fit_transform)
    # y: targets (the data points' labels), in the same order as the rows of X
    # vectorizer: fitted TF-IDF vectorizer created by sklearn
    # n: number of features that we want to list for each class
    # target_list: the list of all unique labels (for example, in my case I have two labels: 1 and -1 and target_list = [1, -1])
    # --------------------------------------------
    # mapping each column index back to its term (done once, outside the loop)
    index_to_term = {col: term for term, col in vectorizer.vocabulary_.items()}

    # splitting X vectors based on target classes
    for label in target_list:
        # finding the indices of the rows (data points) for the current class
        indices = [i for i in range(X.shape[0]) if y[i] == label]

        # getting the rows of the current class from the tf-idf matrix and calculating the mean of each feature
        vectors = np.asarray(X[indices, :].mean(axis=0)).ravel()

        # creating a dictionary of feature (column) indices with their corresponding mean values
        current_dict = {col: vectors[col] for col in range(X.shape[1])}

        # sorting the dictionary based on values, highest first
        sorted_dict = sorted(current_dict.items(), key=lambda item: item[1], reverse=True)

        # printing the textual and numeric values of the top n features
        print("class: " + str(label))
        for index, (col, value) in enumerate(sorted_dict[:n], start=1):
            print(str(index) + "\t" + index_to_term[col] + "\t" + str(value))
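The same idea (average the TF-IDF values of each term over the rows of a class, then keep the n largest averages) can also be sketched as a self-contained function applied to the example dataframe from the question. The function name `top_terms_per_class` and its parameters are hypothetical, and ranking by the class mean is only one possible definition of "top terms"; this is an illustration under those assumptions rather than a definitive implementation.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


def top_terms_per_class(texts, labels, n=3):
    """Return {label: [(term, mean_tfidf), ...]} with the n highest-scoring terms.

    Hypothetical helper: ranks terms by their mean TF-IDF value within each class.
    """
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(texts)  # sparse, shape (documents, terms)
    labels = np.asarray(labels)

    # Column index -> term, built from the fitted vocabulary.
    terms = np.empty(X.shape[1], dtype=object)
    for term, col in vectorizer.vocabulary_.items():
        terms[col] = term

    result = {}
    for label in np.unique(labels):
        rows = np.flatnonzero(labels == label)
        # Mean of every column over the rows of this class, as a flat 1-D array.
        mean_scores = np.asarray(X[rows].mean(axis=0)).ravel()
        # Indices of the n largest means, highest first.
        top = np.argsort(mean_scores)[::-1][:n]
        result[label] = [(terms[i], float(mean_scores[i])) for i in top]
    return result


df = pd.DataFrame({
    "text": [
        "This restaurant was amazing",
        "The food was served cold",
        "The waiter was a bit rude",
        "I love the view from its balcony",
    ],
    "label": ["Positive", "Negative", "Negative", "Positive"],
})

top = top_terms_per_class(df["text"], df["label"], n=3)
for label, pairs in top.items():
    print(label)
    for term, score in pairs:
        print(f"\t{term}\t{score:.3f}")
```

Only the rows of one class are averaged at a time and the matrix is never converted to a dense array, so memory use stays proportional to the vocabulary size per class; for very large vocabularies, `np.argpartition` could replace the full sort.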

Regarding "python - Find the top n terms with the highest TF-IDF score per class", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56703244/
