
Python LightGBM text classification with Tfidf

Reposted. Author: 行者123. Updated: 2023-12-01 01:55:30

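I am trying to use LightGBM for multiclass text classification. My pandas data frame has 2 columns, "category" and "contents", laid out as follows.

Data frame:

   contents             category
1  this is example1...  A
2  this is example2...  B
3  this is example3...  C

*The actual data frame has approximately 600 rows and 2 columns.

I am trying to classify the texts into these 3 categories.

Code: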

import pandas as pd
import numpy as np

from nltk.corpus import stopwords
stopwords1 = set(stopwords.words('english'))

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

import lightgbm as lgbm
from lightgbm import LGBMClassifier, LGBMRegressor


#--main code--#
X_train, X_test, Y_train, Y_test = train_test_split(df['contents'], df['category'], random_state = 0, test_size=0.3, shuffle=True)

count_vect = CountVectorizer(ngram_range=(1,2), stop_words=stopwords1)
X_train_counts = count_vect.fit_transform(X_train)

tfidf_transformer = TfidfTransformer(use_idf=True, smooth_idf=True, norm='l2', sublinear_tf=True)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

lgbm_train = lgbm.Dataset(X_train_tfidf, Y_train)
lgbm_eval = lgbm.Dataset(count_vect.transform(X_test), Y_test, reference=lgbm_train)

params = {
'boosting_type':'gbdt',
'objective':'multiclass',
'learning_rate': 0.02,
'num_class': 3,
'early_stopping': 100,
'num_iteration': 2000,
'num_leaves': 31,
'is_enable_sparse': 'true',
'tree_learner': 'data',
'max_depth': 4,
'n_estimators': 50
}

clf_gbm = lgbm.train(params, valid_sets=lgbm_eval)
predicted_LGBM = clf_gbm.predict(count_vect.transform(X_test))

print(accuracy_score(Y_test, predicted_LGBM))

Then I get an error:

ValueError: could not convert string to float: 'b'  

I also converted the "category" column from ['a', 'b', 'c'] to the ints [0, 1, 2], but then I got this error:

TypeError: Expected np.float32 or np.float64, met type(int64).

What is wrong with my code?
Any comments or suggestions would be appreciated.
Thanks in advance.

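Best answer

I managed to solve this. The fix is very simple, but I am noting it here for reference.

LightGBM expects float32/float64 input, so "category" should be numeric rather than a string, and the input data should be converted to float32/64 with .astype().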
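
A quick way to see where each error can come from is to inspect the dtypes involved. The sketch below uses a made-up three-document corpus (not the data from the question) and only scikit-learn and NumPy. It suggests that the int64 complaint plausibly stems from the raw CountVectorizer output, which the question's code passes to the evaluation set and to predict, rather than from the TF-IDF matrix; treat that reading as an assumption, not a confirmed diagnosis.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Placeholder corpus and labels, standing in for df['contents'] and df['category'].
docs = ["this is example one", "this is example two", "another short example"]
labels = np.array(["a", "b", "c"])

counts = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs)
tfidf = TfidfTransformer(sublinear_tf=True).fit_transform(counts)

print(labels.dtype.kind)  # 'U': unicode strings, which cannot be parsed as floats
print(counts.dtype)       # integer term counts
print(tfidf.dtype)        # floating-point TF-IDF weights

# Casting produces the float32 input described above.
tfidf32 = tfidf.astype(np.float32)
print(tfidf32.dtype)
```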

Change 1:
I added the following 4 lines after X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

X_train_tfidf = X_train_tfidf.astype('float32')
# X_test_counts is not defined in the question's code; presumably count_vect.transform(X_test)
X_test_counts = X_test_counts.astype('float32')
Y_train = Y_train.astype('float32')
Y_test = Y_test.astype('float32')
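
The X_test_counts name does not appear in the question's code, so how it was built is not known. One way to get a test matrix that lines up with the training matrix is to push the test texts through both fitted steps and cast the result. This is a minimal sketch that reuses the count_vect and tfidf_transformer names from the question; the documents are placeholders.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Placeholder texts standing in for X_train and X_test.
train_docs = ["this is example one", "this is example two", "another short example"]
test_docs = ["this is a new example", "short text"]

count_vect = CountVectorizer(ngram_range=(1, 2))
tfidf_transformer = TfidfTransformer(sublinear_tf=True)

# Fit both steps on the training texts only.
X_train_counts = count_vect.fit_transform(train_docs)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts).astype(np.float32)

# Reuse the fitted steps for the test texts, then apply the same cast.
X_test_counts = count_vect.transform(test_docs)
X_test_tfidf = tfidf_transformer.transform(X_test_counts).astype(np.float32)

print(X_train_tfidf.shape, X_test_tfidf.shape)
```

Both matrices then share the same columns and dtype, so the model sees test features on the same scale it was trained on.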

Change 2:
Simply convert the "category" column from [A, B, C, ...] to [0.0, 1.0, 2.0, ...]
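
One way to do this conversion, sketched here on a small made-up data frame, is scikit-learn's LabelEncoder followed by a cast. The answer does not say which method was actually used, so this is only an illustration. The encoder keeps the mapping, which makes it possible to turn predicted indices back into the original letters.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up frame with the same two columns as in the question.
df = pd.DataFrame({
    "contents": ["this is example1", "this is example2", "this is example3", "this is example4"],
    "category": ["A", "B", "C", "A"],
})

encoder = LabelEncoder()
# Classes are sorted alphabetically and numbered from 0, then cast to float32.
y = encoder.fit_transform(df["category"]).astype(np.float32)
print(y)
print(encoder.classes_)

# Map numeric labels (or predicted class indices) back to the original letters.
decoded = encoder.inverse_transform(y.astype(int))
print(decoded)
```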

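In this case, it may be enough to set the dtype when constructing the vectorizer, i.e. TfidfVectorizer(dtype=np.float32).
Feeding the vectorized data to LGBMClassifier would also be much simpler.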
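
Part of what makes LGBMClassifier simpler is its output: its predict returns class labels, whereas the Booster returned by lgbm.train returns one probability per class for a multiclass objective, so the result has to be reduced to a label before it can go into accuracy_score. The sketch below uses a hand-written probability matrix (hypothetical values, not real model output) to show that reduction with NumPy alone.

```python
import numpy as np

# Hypothetical multiclass output: 4 rows, 3 classes, each row sums to 1.
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
])
y_true = np.array([0, 1, 2, 1])

# Pick the most probable class per row, then compare with the true labels.
y_pred = proba.argmax(axis=1)
accuracy = (y_pred == y_true).mean()

print(y_pred)
print(accuracy)
```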

Update
Using TfidfVectorizer is simpler:

# dtype=np.float32 makes the vectorizer emit float32 features directly
tfidf_vec = TfidfVectorizer(dtype=np.float32, sublinear_tf=True, use_idf=True, smooth_idf=True)
# Fit on the training split only, so the test texts do not influence the vocabulary or IDF weights
X_train_tfidf = tfidf_vec.fit_transform(X_train)
X_test_tfidf = tfidf_vec.transform(X_test)

clf_LGBM = lgbm.LGBMClassifier(objective='multiclass', verbose=-1, learning_rate=0.5, max_depth=20, num_leaves=50, n_estimators=120, max_bin=2000)
# verbose is set on the constructor; recent LightGBM releases no longer accept it in fit()
clf_LGBM.fit(X_train_tfidf, Y_train)
predicted_LGBM = clf_LGBM.predict(X_test_tfidf)

Regarding Python LightGBM text classification with Tfidf, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50250432/
