
python - Implementing K Neighbors Classifier and Linear SVM in scikit-learn for word sense disambiguation

Reposted. Author: 太空狗. Updated: 2023-10-30 01:14:37

I am trying to use a linear SVM and a K Neighbors classifier for word sense disambiguation (WSD). Here is a sample of the data I use for training:

<corpus lang="English">

<lexelt item="activate.v">


<instance id="activate.v.bnc.00024693" docsrc="BNC">
<answer instance="activate.v.bnc.00024693" senseid="38201"/>
<context>
Do you know what it is , and where I can get one ? We suspect you had seen the Terrex Autospade , which is made by Wolf Tools . It is quite a hefty spade , with bicycle - type handlebars and a sprung lever at the rear , which you step on to <head>activate</head> it . Used correctly , you should n't have to bend your back during general digging , although it wo n't lift out the soil and put in a barrow if you need to move it ! If gardening tends to give you backache , remember to take plenty of rest periods during the day , and never try to lift more than you can easily cope with .
</context>
</instance>


<instance id="activate.v.bnc.00044852" docsrc="BNC">
<answer instance="activate.v.bnc.00044852" senseid="38201"/>
<answer instance="activate.v.bnc.00044852" senseid="38202"/>
<context>
For neurophysiologists and neuropsychologists , the way forward in understanding perception has been to correlate these dimensions of experience with , firstly , the material properties of the experienced object or event ( usually regarded as the stimulus ) and , secondly , the patterns of discharges in the sensory system . Qualitative Aspects of Experience The quality or modality of the experience depends less upon the quality of energy reaching the nervous system than upon which parts of the sensory system are <head>activated</head> : stimulation of the retinal receptors causes an experience of light ; stimulation of the receptors in the inner ear gives rise to the experience of sound ; and so on . Muller 's nineteenth - century doctrine of specific energies formalized the ordinary observation that different sense organs are sensitive to different physical properties of the world and that when they are stimulated , sensations specific to those organs are experienced . It was proposed that there are endings ( or receptors ) within the nervous system which are attuned to specific types of energy , For example , retinal receptors in the eye respond to light energy , cochlear endings in the ear to vibrations in the air , and so on .
</context>
</instance>
.....

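The only difference between the training data and the test data is that the test data has no "answer" tags. I have built a dictionary that stores, for each instance, the words neighbouring the "head" word within a window of size 10. When an instance has more than one "answer" tag, I only consider the first one. I have also built a set of all the vocabulary in the training file so that I can compute a vector for each instance. For example, if the total vocabulary is [a, b, c, d, e] and an instance contains the words [a, a, d, d, e], the resulting vector for that instance is [2, 0, 0, 2, 1]. Here is part of the dictionary I built for each word: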

{
"activate.v": {
"activate.v.bnc.00024693": {
"instanceId": "activate.v.bnc.00024693",
"senseId": "38201",
"vocab": {
"although": 1,
"back": 1,
"bend": 1,
"bicycl": 1,
"correct": 1,
"dig": 1,
"general": 1,
"handlebar": 1,
"hefti": 1,
"lever": 1,
"nt": 2,
"quit": 1,
"rear": 1,
"spade": 1,
"sprung": 1,
"step": 1,
"type": 1,
"use": 1,
"wo": 1
}
},
"activate.v.bnc.00044852": {
"instanceId": "activate.v.bnc.00044852",
"senseId": "38201",
"vocab": {
"caus": 1,
"ear": 1,
"energi": 1,
"experi": 1,
"inner": 1,
"light": 1,
"nervous": 1,
"part": 1,
"qualiti": 1,
"reach": 1,
"receptor": 2,
"retin": 1,
"sensori": 1,
"stimul": 2,
"system": 2,
"upon": 2
}
},
......
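The counting scheme described earlier (vocabulary [a, b, c, d, e], instance words [a, a, d, d, e]) can be sketched as follows. This is only a minimal illustration that assumes a fixed vocabulary order; the helper name `count_vector` is hypothetical and not taken from the original code.

```python
from collections import Counter

def count_vector(vocabulary, instance_words):
    """Return how often each vocabulary word occurs in instance_words."""
    counts = Counter(instance_words)
    # A Counter returns 0 for words that never occur in the instance.
    return [counts[word] for word in vocabulary]

vocabulary = ["a", "b", "c", "d", "e"]
vector = count_vector(vocabulary, ["a", "a", "d", "d", "e"])
print(vector)  # [2, 0, 0, 2, 1]
```

Each instance then contributes one such vector, all of the same length as the vocabulary.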

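Now I just need to provide input to the K Neighbors Classifier and the Linear SVM from scikit-learn to train the classifiers. But I am not sure how I should build the feature vector and label for each. My understanding is that the label should be a tuple of the instance tag and the senseid tag in "answer". But I am not sure about the feature vector then. Should I group all the vectors from the same word that have the same instance tag and senseid tag in "answer"? There are about 100 words and hundreds of instances for each word, so how should I deal with that?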

Also, the vector is only one feature; I need to add more features later on, such as synsets, hypernyms, hyponyms, etc. How should I do that?

Thanks in advance!

Best answer

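A machine learning problem is a kind of optimization task: there is no predefined best algorithm, so you grope toward the best result by trying different approaches, parameters, and data preprocessing. So you are absolutely right to start with the simplest task: a single word and a few of its senses.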

But I am just not sure how should I build the feature vector and label for each.

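You can take just these values as the vector components. Enumerate the vocabulary words and, for each text, write down how many times each word occurs. If a word is absent, put 0. I have modified your example slightly to clarify the idea: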

vocab_38201= {
"although": 1,
"back": 1,
"bend": 1,
"bicycl": 1,
"correct": 1,
"dig": 1,
"general": 1,
"handlebar": 1,
"hefti": 1,
"lever": 1,
"nt": 2,
"quit": 1,
"rear": 1,
"spade": 1,
"sprung": 1,
"step": 1,
"type": 1,
"use": 1,
"wo": 1
}

vocab_38202 = {
"caus": 1,
"ear": 1,
"energi": 1,
"experi": 1,
"inner": 1,
"light": 1,
"nervous": 1,
"part": 1,
"qualiti": 1,
"reach": 1,
"receptor": 2,
"retin": 1,
"sensori": 1,
"stimul": 2,
"system": 2,
"upon": 2,
"wo": 1 ### added so they have at least one common word
}

Let's convert these into feature vectors. Enumerate all the words and record how many times each word occurs in each vocabulary.

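from collections import defaultdict

# Global word list; a word's position in it is its vector index.
words = []

def get_components(vect_dict):
    """Map each word's count to the word's index in the global list."""
    vect_components = defaultdict(int)
    for word, num in vect_dict.items():
        try:
            ind = words.index(word)
        except ValueError:
            # First time this word is seen: give it the next free index.
            ind = len(words)
            words.append(word)
        vect_components[ind] += num
    return vect_components


vect_comps_38201 = get_components(vocab_38201)
vect_comps_38202 = get_components(vocab_38202)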

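Let's take a look. The indices assigned to each word depend on dict iteration order, so on Python 3.7+ (where dicts keep insertion order) they will differ from the output shown here: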

>>> print(vect_comps_38201)
defaultdict(<class 'int'>, {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 2, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1})

>>> print(vect_comps_38202)
defaultdict(<class 'int'>, {32: 1, 33: 2, 34: 1, 7: 1, 19: 2, 20: 2, 21: 1, 22: 1, 23: 1, 24: 1, 25: 1, 26: 1, 27: 2, 28: 1, 29: 1, 30: 1, 31: 1})

>>> vect_38201=[vect_comps_38201.get(i,0) for i in range(len(words))]
>>> print(vect_38201)
[1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

>>> vect_38202=[vect_comps_38202.get(i,0) for i in range(len(words))]
>>> print(vect_38202)
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1]

These vect_38201 and vect_38202 are the vectors you can use to fit a model:

from sklearn.svm import SVC
X = [vect_38201, vect_38202]
y = [38201, 38202]
clf = SVC()
clf.fit(X, y)
clf.predict([[0, 0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 2, 1]])

Output:

array([38202])

Of course, this is a very, very simple example that only demonstrates the concept.
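The question asks specifically about the K Neighbors classifier and a linear SVM, whereas `SVC()` with default settings uses an RBF kernel. A minimal sketch of the same idea with `KNeighborsClassifier` and `LinearSVC` is given below. The count vectors are made-up toy data (one row per instance, the sense id as its label), and `n_neighbors=1` is chosen only because the toy set is tiny.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Toy count vectors (hypothetical data): one row per instance,
# two instances for each sense id.
X = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
]
y = [38201, 38201, 38202, 38202]

# n_neighbors must not exceed the number of training samples.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)

svm = LinearSVC()
svm.fit(X, y)

query = [[0, 0, 1, 1]]
knn_pred = knn.predict(query)
svm_pred = svm.predict(query)
print(knn_pred, svm_pred)
```

With real data there would be hundreds of rows per target word, and it is probably simplest to train one classifier per word (for example one for "activate.v"), since sense ids are only meaningful within a word.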

What can you do to improve it?

  1. Normalize the vector coordinates.

  2. Use an excellent tool, the Tf-Idf vectorizer, to extract features from the text.

  3. Add more data.
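The first two suggestions can be sketched with a scikit-learn pipeline. Because the question already stores per-instance word counts in dictionaries, this sketch swaps in `DictVectorizer` followed by `TfidfTransformer` instead of `TfidfVectorizer` (which expects raw text). The count dictionaries below are shortened, made-up examples, not the real data.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical, shortened count dictionaries (one per training instance).
train_counts = [
    {"spade": 1, "lever": 1, "step": 1},
    {"lever": 2, "dig": 1},
    {"receptor": 2, "stimul": 2, "system": 1},
    {"sensori": 1, "system": 2, "stimul": 1},
]
train_senses = [38201, 38201, 38202, 38202]

model = make_pipeline(
    DictVectorizer(),    # dicts of counts -> sparse count matrix
    TfidfTransformer(),  # tf-idf weighting; rows are L2-normalized by default
    LinearSVC(),
)
model.fit(train_counts, train_senses)

# Feature names not seen during fitting ("unseen") are silently ignored.
prediction = model.predict([{"stimul": 1, "receptor": 1, "unseen": 3}])
print(prediction)
```

One possible way to add further features such as synsets or hypernyms is to put them into the same dictionaries under distinct keys (for example a hypothetical key like "hypernym=device"), so that `DictVectorizer` gives each of them its own column.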

Good luck!

Regarding python - Implementing K Neighbors Classifier and Linear SVM in scikit-learn for word sense disambiguation, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/29189865/
