I'm making a program that predicts the corresponding business unit based on the data in a text. I've set up a vocabulary to find occurrences of words in the text that correspond to a certain unit, but I'm not sure how to use that data to make predictions with a machine learning model.
It can predict four units: MicrosoftTech, JavaTech, Pythoneers and JavascriptRoots. In the vocabulary I've put words that indicate certain units, for example JavaTech: Java, Spring, Android; MicrosoftTech: .Net, csharp; and so on. Right now I use a bag-of-words model with this custom vocabulary to count how often those words occur.
Here is my code for getting the word-count data:
def bagOfWords(description, vocabulary):
    # Count how often each vocabulary word appears in the tokenized description
    bag = np.zeros(len(vocabulary)).astype(int)
    for sw in description:
        for i, word in enumerate(vocabulary):
            if word == sw:
                bag[i] += 1
    print("Bag: ", bag)
    return bag
Let's say the vocabulary is: [java, spring, .net, csharp, python, numpy, nodejs, javascript],
and the description is: "Company X is looking for a Java Developer. Requirements: Has worked with Java. 3+ years experience with Java, Maven and Spring."
Running the code then outputs: Bag: [3,1,0,0,0,0,0,0]
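For illustration, here is a minimal, self-contained sketch of that example (the crude regex tokenization is only an assumption for this snippet; the full code below uses NLTK's TweetTokenizer):

import re
import numpy as np

vocabulary = ["java", "spring", ".net", "csharp", "python", "numpy", "nodejs", "javascript"]
description = ("Company X is looking for a Java Developer. Requirements: Has worked "
               "with Java. 3+ years experience with Java, Maven and Spring.")

# Very crude tokenization, good enough for this example (note it drops the dot in ".net")
tokens = re.findall(r"[a-z0-9]+", description.lower())
bag = np.array([tokens.count(word) for word in vocabulary])  # same counting as bagOfWords
print(bag)  # [3 1 0 0 0 0 0 0]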
How can I use this data to make a prediction with a machine learning algorithm?
My code so far:
import pandas as pd
import numpy as np
import warnings
import tkinter as tk
from tkinter import filedialog
from nltk.tokenize import TweetTokenizer

warnings.filterwarnings("ignore", category=FutureWarning)

root = tk.Tk()
canvas1 = tk.Canvas(root, width=300, height=300, bg='lightsteelblue')
canvas1.pack()

def getExcel():
    global df
    # Load the custom vocabulary and the unit labels from the vocabulary sheet
    vocabularysheet = pd.read_excel(r'Filepath\filename.xlsx')
    vocabularydf = pd.DataFrame(vocabularysheet, columns=['Word'])
    vocabulary = vocabularydf.values.tolist()
    unitlabelsdf = pd.DataFrame(vocabularysheet, columns=['Unit'])
    unitlabels = unitlabelsdf.values.tolist()
    # Flatten the single-column lists returned by .values.tolist()
    for voc in vocabulary:
        index = vocabulary.index(voc)
        voc = vocabulary[index][0]
        vocabulary[index] = voc
    for label in unitlabels:
        index = unitlabels.index(label)
        label = unitlabels[index][0]
        unitlabels[index] = label

    # Load the descriptions to classify and clean them up
    import_file_path = filedialog.askopenfilename()
    testdatasheet = pd.read_excel(import_file_path)
    descriptiondf = pd.DataFrame(testdatasheet, columns=['Description'])
    descriptiondf = descriptiondf.replace('\n', ' ', regex=True).replace('\xa0', ' ', regex=True).replace('•', ' ', regex=True).replace('u200b', ' ', regex=True)
    description = descriptiondf.values.tolist()
    tokenized_description = tokanize(description)
    for x in tokenized_description:
        index = tokenized_description.index(x)
        tokenized_description[index] = bagOfWords(x, vocabulary)

def tokanize(description):
    for d in description:
        index = description.index(d)
        tknzr = TweetTokenizer()
        tokenized_description = list(tknzr.tokenize((str(d).lower())))
        description[index] = tokenized_description
    return description

def wordFilter(tokenized_description):
    bad_chars = [';', ':', '!', "*", ']', '[', '.', ',', "'", '"']
    if tokenized_description in bad_chars:
        return False
    else:
        return True

def bagOfWords(description, vocabulary):
    bag = np.zeros(len(vocabulary)).astype(int)
    for sw in description:
        for i, word in enumerate(vocabulary):
            if word == sw:
                bag[i] += 1
    print("Bag: ", bag)
    return bag

browseButton_Excel = tk.Button(text='Import Excel File', command=getExcel, bg='green', fg='white', font=('helvetica', 12, 'bold'))
# predictionButton is created but not yet placed on the canvas
predictionButton = tk.Button(text='Button', command=getExcel, bg='green', fg='white', font=('helvetica', 12, 'bold'))
canvas1.create_window(150, 150, window=browseButton_Excel)
root.mainloop()
Best answer
You already know how to prepare a training dataset.
Here is an example I will use to explain:
voca = ["java", "spring", "net", "csharp", "python", "numpy", "nodejs", "javascript"]
units = ["MicrosoftTech", "JavaTech", "Pythoneers", "JavascriptRoots"]
desc1 = "Company X is looking for a Java Developer. Requirements: Has worked with Java. 3+ years experience with Java, Maven and Spring."
desc2 = "Company Y is looking for a csharp Developer. Requirements: Has wored with csharp. 5+ years experience with csharp, Net."
x_train = []
y_train = []
x_train.append(bagOfWords(desc1, voca))
y_train.append(units.index("JavaTech"))
x_train.append(bagOfWords(desc2, voca))
y_train.append(units.index("MicrosoftTech"))
And we get these 2 training samples:
[array([3, 1, 0, 0, 0, 0, 0, 0]), array([0, 0, 1, 3, 0, 0, 0, 0])] [1, 0]
array([3, 1, 0, 0, 0, 0, 0, 0]) => 1 (It means JavaTech)
array([0, 0, 1, 3, 0, 0, 0, 0]) => 0 (It means MicrosoftTech)
The model needs to predict one of the 4 units you defined, so we need a classification model. A classification model needs 'softmax' as the activation function of the output layer, together with a cross-entropy loss. Here is a very simple deep learning model written with TensorFlow's Keras APIs.
import tensorflow as tf
import numpy as np
units = ["MicrosoftTech", "JavaTech", "Pythoneers", "JavascriptRoots"]
x_train = np.array([[3, 1, 0, 0, 0, 0, 0, 0],
                    [1, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 1, 1, 0, 0, 0, 0],
                    [0, 0, 2, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 2, 1, 0, 0],
                    [0, 0, 0, 0, 1, 2, 0, 0],
                    [0, 0, 0, 0, 0, 0, 1, 1],
                    [0, 0, 0, 0, 0, 0, 1, 0]])
y_train = np.array([0, 0, 1, 1, 2, 2, 3, 3])
The model consists of one hidden layer with 256 units and an output layer with 4 units.
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(len(units), activation=tf.nn.softmax)])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
I set epochs to 50. You need to watch the loss and acc while it trains; in fact, 10 epochs were not enough. Now let's start training:
model.fit(x_train, y_train, epochs=50)
And this is the prediction part. newSample is just a sample I made up:
newSample = np.array([[2, 2, 0, 0, 0, 0, 0, 0]])
prediction = model.predict(newSample)
print (prediction)
print (units[np.argmax(prediction)])
Finally, I got the following result:
[[0.96280855 0.00981709 0.0102595 0.01711495]]
MicrosoftTech
This shows the probability of each unit, and the most likely one is MicrosoftTech:
MicrosoftTech : 0.96280855
JavaTech : 0.00981709
....
And this is the training log. You can see the loss keeps decreasing, which is why I increased the number of epochs:
Epoch 1/50
8/8 [==============================] - 0s 48ms/step - loss: 1.3978 - acc: 0.0000e+00
Epoch 2/50
8/8 [==============================] - 0s 356us/step - loss: 1.3618 - acc: 0.1250
Epoch 3/50
8/8 [==============================] - 0s 201us/step - loss: 1.3313 - acc: 0.3750
Epoch 4/50
8/8 [==============================] - 0s 167us/step - loss: 1.2965 - acc: 0.7500
Epoch 5/50
8/8 [==============================] - 0s 139us/step - loss: 1.2643 - acc: 0.8750
........
........
Epoch 45/50
8/8 [==============================] - 0s 122us/step - loss: 0.3500 - acc: 1.0000
Epoch 46/50
8/8 [==============================] - 0s 140us/step - loss: 0.3376 - acc: 1.0000
Epoch 47/50
8/8 [==============================] - 0s 134us/step - loss: 0.3257 - acc: 1.0000
Epoch 48/50
8/8 [==============================] - 0s 137us/step - loss: 0.3143 - acc: 1.0000
Epoch 49/50
8/8 [==============================] - 0s 141us/step - loss: 0.3032 - acc: 1.0000
Epoch 50/50
8/8 [==============================] - 0s 177us/step - loss: 0.2925 - acc: 1.0000
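To tie this back to the question: once the model is trained, the bag vectors produced by bagOfWords for each description can be stacked into one array and classified in a single call. A minimal sketch, reusing model, units and np from the code above (the bags list here is just a made-up placeholder):

# Hypothetical bag vectors, e.g. collected from bagOfWords() for each Description row
bags = [np.array([3, 1, 0, 0, 0, 0, 0, 0]),
        np.array([0, 0, 0, 0, 2, 1, 0, 0])]

x_new = np.array(bags)           # shape: (n_samples, len(vocabulary))
probs = model.predict(x_new)     # one softmax distribution per sample
predicted = [units[i] for i in np.argmax(probs, axis=1)]
print(predicted)                 # one predicted unit name per input row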
Source: python - How to use an ML algorithm with feature vector data from a bag of words? on Stack Overflow: https://stackoverflow.com/questions/55848136/