python-2.7 - How do I predict a single new sample after dict vectorization in Python scikit-learn?

Reposted. Author: 行者123. Updated: 2023-11-30 08:47:03

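I am using a logistic regression classifier to predict the ethnicity class labels 0 and 1. My data is split into training and test samples and dict-vectorized into a sparse matrix.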

Below is the working code, in which I predict on and validate against X_train and X_test, both of which are subsets of the vectorized features:

for i in mass[k]:
    df = df_temp # reset df before each loop
    #$$
    if 1==1:
        count+=1
        ethnicity_tar = str(i)
        ############################################
        ############################################

        def ethnicity_target(row):
            try:
                if row[ethnicity_var] == ethnicity_tar:
                    return 1
                else:
                    return 0
            except: return None
        df['ethnicity_scan'] = df.apply(ethnicity_target, axis=1)
        print '1=', ethnicity_tar
        print '0=', 'non-'+ethnicity_tar

        # Random sampling a smaller dataframe for debugging
        rows = df.sample(n=subsample_size, random_state=seed) # Seed gives fixed randomness
        df = DataFrame(rows)
        print 'Class count:'
        print df['ethnicity_scan'].value_counts()

        # Assign X and y variables
        X = df.raw_name.values
        X2 = df.name.values
        X3 = df.gender.values
        X4 = df.location.values
        y = df.ethnicity_scan.values

        # Feature extraction functions
        def feature_full_name(nameString):
            try:
                full_name = nameString
                if len(full_name) > 1: # not accept name with only 1 character
                    return full_name
                else: return '?'
            except: return '?'

        def feature_full_last_name(nameString):
            try:
                last_name = nameString.rsplit(None, 1)[-1]
                if len(last_name) > 1: # not accept name with only 1 character
                    return last_name
                else: return '?'
            except: return '?'

        def feature_full_first_name(nameString):
            try:
                first_name = nameString.rsplit(' ', 1)[0]
                if len(first_name) > 1: # not accept name with only 1 character
                    return first_name
                else: return '?'
            except: return '?'

        # Transform format of X variables, and spit out a numpy array for all features
        my_dict = [{'last-name': feature_full_last_name(i)} for i in X]
        my_dict5 = [{'first-name': feature_full_first_name(i)} for i in X]

        all_dict = []
        for i in range(0, len(my_dict)):
            temp_dict = dict(
                my_dict[i].items() + my_dict5[i].items()
            )
            all_dict.append(temp_dict)

        newX = dv.fit_transform(all_dict)

        # Separate the training and testing data sets
        X_train, X_test, y_train, y_test = cross_validation.train_test_split(newX, y, test_size=testTrainSplit)

        # Fitting X and y into model, using training data
        classifierUsed2.fit(X_train, y_train)

        # Making predictions using trained data
        y_train_predictions = classifierUsed2.predict(X_train)
        y_test_predictions = classifierUsed2.predict(X_test)

However, I just want to predict the ethnicity label for a single name, for example "John Carter". I replaced the line y_train_predictions = classifierUsed2.predict(X_train) with the following line, but it raised an error:

print classifierUsed2.predict(["John Carter"])

#error
Error: X has 1 features per sample; expecting 103916

Best answer

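You need to transform the new data in exactly the same way as your training data, so something like the following (if your input data is just a list of strings):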

classifierUsed2.predict(dv.transform(["John Carter"])) 
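Note that the `dv` in the question was fit on a list of dicts with the keys 'last-name' and 'first-name', so the raw string most likely has to be turned into the same kind of dict before it is passed to `dv.transform`. The sketch below illustrates that idea under some assumptions: it targets Python 3 and a recent scikit-learn rather than the Python 2.7 setup in the question, the helper `name_to_features` and the four-row training set are hypothetical and invented for illustration, and a default `LogisticRegression` stands in for `classifierUsed2`.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression


def feature_full_last_name(name_string):
    """Last whitespace-separated token, or '?' if missing or one character."""
    try:
        last_name = name_string.rsplit(None, 1)[-1]
        return last_name if len(last_name) > 1 else '?'
    except (AttributeError, IndexError):
        return '?'


def feature_full_first_name(name_string):
    """Everything before the last space, or '?' if missing or one character."""
    try:
        first_name = name_string.rsplit(' ', 1)[0]
        return first_name if len(first_name) > 1 else '?'
    except (AttributeError, IndexError):
        return '?'


def name_to_features(name_string):
    # Hypothetical helper: builds one dict with the same keys used in training.
    return {
        'last-name': feature_full_last_name(name_string),
        'first-name': feature_full_first_name(name_string),
    }


# Toy training data, made up purely for this illustration.
train_names = ["John Carter", "Mary Carter", "Wei Zhang", "Li Zhang"]
train_labels = [1, 1, 0, 0]

# Fit the vectorizer once, on the training dicts only.
dv = DictVectorizer()
X_train = dv.fit_transform([name_to_features(n) for n in train_names])

clf = LogisticRegression()
clf.fit(X_train, train_labels)

# For a new sample, call transform (not fit_transform) so that the
# columns match the ones the classifier was trained on.
new_X = dv.transform([name_to_features("John Carter")])
prediction = clf.predict(new_X)
print(new_X.shape, prediction)
```

The key point is that the single sample goes through the same feature functions and the same already-fitted `dv`, which gives it the same number of columns as `X_train`. Feature values that `dv` never saw during fitting are silently ignored by `transform`, so an unknown name yields a row of zeros rather than an error.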

Regarding "python-2.7 - How do I predict a single new sample after dict vectorization in Python scikit-learn?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35710055/
