python - If I LabelEncode categorical data, do I still need to use categorical_feature when creating a LightGBM Dataset?

Reposted. Author: 行者123. Updated: 2023-12-04 14:18:57

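I am trying to build a simple model in lightgbm using two features: one is categorical and the other is a distance. I am following a tutorial ( https://sefiks.com/2018/10/13/a-gentle-introduction-to-lightgbm-for-applied-machine-learning/ ), which states that even after LabelEncoding I still need to tell lightgbm that my encoded feature is categorical in nature. However, when I try to do that, I get this series of warning messages: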

UserWarning: Using categorical_feature in Dataset.
warnings.warn('Using categorical_feature in Dataset.')
UserWarning: categorical_feature in Dataset is overridden.
New categorical_feature is ['type']
'New categorical_feature is {}'.format(sorted(list(categorical_feature))))
categorical_feature in param dict is overridden.
warnings.warn('categorical_feature in param dict is overridden.')

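I would like to know whether lightgbm actually understands that the column is categorical in nature. It looks like it does, but I am not sure why the tutorial explicitly states that it does not. Here is my code: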

import pandas as pd
import lightgbm as lgb
from sklearn import preprocessing as prep
from sklearn import model_selection as mls

trainDataProc = pd.read_csv('trainDataPrepared.csv', header=0)

le = prep.LabelEncoder()

# Label-encode every string (object) column as integers.
for column_name in trainDataProc.columns:
    if trainDataProc[column_name].dtype == 'object':
        le.fit(trainDataProc[column_name])
        encoded_feature = le.transform(trainDataProc[column_name])
        trainDataProc[column_name] = encoded_feature

# Prepare train X and Y column names.
trainColumnsX = ['type', 'dist']
cat_feat = ['type']
trainColumnsY = ['scalar']

# Perform K-fold split and keep only the first fold.
kfold = mls.KFold(n_splits=5, shuffle=True, random_state=0)
result = next(kfold.split(trainDataProc), None)
train = trainDataProc.iloc[result[0]]
test = trainDataProc.iloc[result[1]]

# Train model via lightGBM.
lgbTrain = lgb.Dataset(train[trainColumnsX], label=train[trainColumnsY],
                       categorical_feature=cat_feat)
lgbEval = lgb.Dataset(test[trainColumnsX], label=test[trainColumnsY])

# Model parameters.
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'mae'},
    'num_leaves': 25,
    'learning_rate': 0.0001,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

# Set up training.
# Note: LightGBM 4.0+ no longer accepts early_stopping_rounds here;
# use callbacks=[lgb.early_stopping(50)] instead.
gbm = lgb.train(params,
                lgbTrain,
                num_boost_round=200,
                valid_sets=lgbEval,
                early_stopping_rounds=50)

Best answer

I ran into a similar warning message and looked at the scikit-learn documentation.

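If you use a pandas DataFrame whose categorical features are stored as categorical columns (with the labels encoded as integers), you do not need to declare them separately.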
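As a rough illustration of what "categorical column" means on the pandas side, the sketch below converts a string column to the category dtype and shows the integer codes pandas keeps behind it. The toy data is made up for this example; only the column names 'type' and 'dist' come from the question.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the question.
df = pd.DataFrame({
    "type": ["a", "b", "a", "c"],
    "dist": [1.0, 2.5, 0.7, 3.2],
})

# Store 'type' with the pandas 'category' dtype instead of plain strings.
df["type"] = df["type"].astype("category")

# pandas keeps one integer code per row; categories are sorted, so a=0, b=1, c=2.
codes = df["type"].cat.codes.tolist()

print(df.dtypes)
print(codes)  # [0, 1, 0, 2]
```

The column still displays its original string values, while the integer codes play the role that LabelEncoder output plays in the question.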

The default value of LightGBM's categorical_feature parameter is 'auto', which ensures that pandas categorical columns are used automatically.

Regarding "python - If I LabelEncode categorical data, do I still need to use categorical_feature when creating a LightGBM Dataset?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57121543/
