
python - Classification accuracy based on a single feature set

Reposted. Author: 行者123. Updated: 2023-11-30 09:43:55

I am trying to classify data according to pre-specified labels.

There are two columns, as shown below:

room_class                        room_cluster
Standard single sea view          Standard
Deluxe twin Single                Deluxe
Suite Superior room ocean view    Suite
Superior Double twin              Superior
Deluxe Double room                Deluxe

The labels are the room_cluster values shown above.

The code snippet is as follows:

from sklearn import neighbors, preprocessing
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

le = preprocessing.LabelEncoder()

# df is the DataFrame holding the two columns shown above
datar = df

#### Separate data into features and labels
x = datar.room_class
y = datar.room_cluster

#### Use LabelEncoder to turn each string into an int
le.fit(x)
addv = le.transform(x)
asb = addv.reshape(-1, 1)

#### Split into training and testing sets, then use KNN
x_train, x_test, y_train, y_test = train_test_split(asb, y, test_size=0.40)
classifier = neighbors.KNeighborsClassifier(n_neighbors=3)
classifier.fit(x_train, y_train)
predictions = classifier.predict(x_test)

#### Check the accuracy
print(accuracy_score(y_test, predictions))

I only get 78% accuracy on the test data. Is there something wrong in the code that is hurting the accuracy?

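How can I use this model to predict custom features, for example:

Input: "Single sea view suite"
Output: "Suite"

Input: "Superior suite twin"
Output: "Superior"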

Best answer

import random
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import numpy as np

## Based on your data
initial_room = ["Standard single sea view", "Deluxe twin Single", "Suite Superior room ocean view", "Superior Double twin", "Deluxe Double room"]

## Create 100 data points by sampling the rooms above (with repetition)
room_class = [random.choice(initial_room) for _ in range(100)]

## Based on room_cluster
initial_cluster = ["Standard", "Deluxe", "Suite", "Superior"]

## The first word of each room name that is also a cluster name is the Y label.
## (Taking element 0 of a set intersection is order-dependent when a name
## contains two cluster words, e.g. "Suite Superior room ocean view".)
room_cluster = [next(w for w in each_room.split() if w in initial_cluster) for each_room in room_class]


## Word-to-integer lookup used to encode the room names
embedding = {}
index = 0

## Assign a unique number to each unique word across room_class
for each_room in room_class:
    for each_word in each_room.split():
        if each_word not in embedding:
            embedding[each_word] = index
            index += 1

## Find the maximum number of words in a room name
max_len = max(len(i.split()) for i in room_class)

## Holds the encoded, padded rows
embedded_rooms = []

## For each room in room_class
for each_room in room_class:
    embedded_room = []
    for each_word in each_room.split():
        ## Replace each word with its unique number
        embedded_room.append(embedding[each_word])

    ## Get the length of the row
    room_len = len(embedded_room)

    ## If the row is shorter than max_len, pad it with -1
    ## (0 is already used as a word index, so it cannot be the pad value)
    while room_len < max_len:
        embedded_room.append(-1)
        room_len += 1
    ## Append it to embedded_rooms
    embedded_rooms.append(embedded_room)

Y = []

## Encode Y with the same lookup
for each_cluster in room_cluster:
    Y.append(embedding[each_cluster])


X=np.array(embedded_rooms)


##Apply KNN
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X,Y)

## Data for testing goes in these lists
test = ["Single Standard"]
test_label = ["Standard"]

embed_tests = []
## Convert the test data using the same embedding
for each_test in test:
    embed_test = []
    for each_word in each_test.split():
        embed_test.append(embedding[each_word])
    ## Pad the row to max_len again
    n = len(embed_test)
    while n < max_len:
        embed_test.append(-1)
        n += 1
    embed_tests.append(embed_test)

## Predict on X_test
X_test = np.array(embed_tests)
predictions = classifier.predict(X_test)

## Convert the class labels to their encoding
embed_test_label = []
for each_class in test_label:
    embed_test_label.append(embedding[each_class])

## Print out the accuracy
print(accuracy_score(embed_test_label, predictions))

I have coded this roughly, so please bear with me.
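The answer derives each label from a cluster word that appears in the room name. That idea can also be sketched as a plain rule for custom inputs, without any trained model. This is an illustrative sketch and not part of the original answer: the function name `first_cluster_word`, the case-insensitive matching, and returning None when no cluster word is present are all assumptions.

```python
# Hypothetical helper (not from the original answer): label a room name by
# the first word that is also a known cluster name.
CLUSTERS = ["Standard", "Deluxe", "Suite", "Superior"]


def first_cluster_word(room_name, clusters=CLUSTERS):
    """Return the first cluster name found in room_name, or None."""
    # Case-insensitive lookup that maps back to the canonical spelling.
    lookup = {c.lower(): c for c in clusters}
    for word in room_name.split():
        match = lookup.get(word.lower())
        if match is not None:
            return match
    return None


if __name__ == "__main__":
    for name in ["Suite single sea view", "Superior Double twin", "Ocean view"]:
        print(name, "->", first_cluster_word(name))
```

A rule like this may be a useful baseline to compare against the KNN accuracy, since it reproduces the labels in the sample table whenever a cluster word is present in the room name.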

References:

  1. Padding
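
The padding step referenced here can be sketched as a small helper that encodes one room name and pads it to a fixed length. This is a minimal sketch rather than the answerer's code: the name `encode_and_pad`, mapping unseen words to the pad value, and truncating over-long inputs are assumptions made for illustration.

```python
# Hypothetical helper (assumption, not in the original answer): encode a room
# name with a word-to-index dict and pad or truncate it to a fixed length.
PAD = -1  # same pad value the answer uses, since 0 is a real word index


def encode_and_pad(room_name, embedding, max_len, pad=PAD):
    # Unseen words are mapped to the pad value instead of raising KeyError.
    row = [embedding.get(word, pad) for word in room_name.split()]
    # Truncate inputs that have more words than max_len.
    row = row[:max_len]
    # Right-pad inputs that have fewer words than max_len.
    return row + [pad] * (max_len - len(row))


if __name__ == "__main__":
    embedding = {"Standard": 0, "single": 1, "sea": 2, "view": 3}
    print(encode_and_pad("Standard sea", embedding, max_len=5))
    print(encode_and_pad("Standard garden view", embedding, max_len=5))
```

The returned rows can be collected in a list, wrapped with `np.array`, and passed to `classifier.predict` in place of the manual padding loops.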

Regarding python - Classification accuracy based on a single feature set, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/55076069/
