
python - ConceptNet Numberbatch (multilingual) OOV words

Reposted. Author: 行者123. Updated: 2023-12-05 03:47:50

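I am working on a text classification problem (on a French corpus) and am experimenting with different word embeddings. I was very interested in what ConceptNet has to offer, so I decided to give it a try.

I could not find a tutorial dedicated to my specific task, so I followed the advice from their blog: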

How do I use ConceptNet Numberbatch?

To make it as straightforward as possible:

Work through any tutorial on machine learning for NLP that uses semantic vectors. Get to the part where they tell you to use word2vec. (A particularly enlightened tutorial may tell you to use GloVe 1.2.)

Get the ConceptNet Numberbatch data, and use it instead. Get better results that also generalize to other languages.

You can find my approach below (note that 'numberbatch.txt' is the file containing the recommended multilingual version, ConceptNet Numberbatch 19.08):

from numpy import asarray

embeddings_index = dict()

# Each row is "<key> <v1> <v2> ..."; store the key together with its vector.
with open('numberbatch.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Loaded %s word vectors.' % len(embeddings_index))

I first tested whether a word exists:

word = 'fille'
missingWords = 0
if word not in embeddings_index:
    missingWords += 1
print(missingWords)

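To my surprise, a simple word like 'fille' (French for "girl") was not found. I then wrote a function to print every OOV word in my corpus. The results surprised me even more: over 22k words were not found, including 'nous' ("we") and 'être' ("to be").

I also tried the approach proposed on the GitHub page for OOV words (same result):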

Out-of-vocabulary strategy

ConceptNet Numberbatch is evaluated with an out-of-vocabulary strategy that helps its performance in the presence of unfamiliar words. The strategy is implemented in the ConceptNet code base. It can be summarized as follows:

Given an unknown word whose language is not English, try looking up the equivalently-spelled word in the English embeddings (because English words tend to end up in text of all languages).

Given an unknown word, remove a letter from the end, and see if that is a prefix of known words. If so, average the embeddings of those known words.

If the prefix is still unknown, continue removing letters from the end until a known prefix is found. Give up when a single character remains.
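
The three quoted steps can be sketched as follows. This is only an illustration, not the implementation in the ConceptNet code base: the function name `lookup_with_fallback`, the toy two-dimensional vectors, and the assumption that keys have the form `/c/<lang>/<term>` are all made up for this example.

```python
import numpy as np

def lookup_with_fallback(term, lang, vectors):
    """Return a vector for `term`, or None if every fallback fails.

    `vectors` is assumed to map keys such as '/c/fr/fille' to NumPy arrays.
    """
    key = f"/c/{lang}/{term}"
    if key in vectors:
        return vectors[key]
    # Step 1: for a non-English word, try the identically spelled English entry.
    if lang != "en":
        english_key = f"/c/en/{term}"
        if english_key in vectors:
            return vectors[english_key]
    # Steps 2 and 3: drop trailing letters one at a time and average every
    # known term that starts with the remaining prefix. Stop once only a
    # single character would be left.
    prefix = term[:-1]
    while len(prefix) > 1:
        stem = f"/c/{lang}/{prefix}"
        matches = [vec for k, vec in vectors.items() if k.startswith(stem)]
        if matches:
            return np.mean(matches, axis=0)
        prefix = prefix[:-1]
    return None

# Toy data (hypothetical values, two dimensions instead of 300).
toy_vectors = {
    "/c/fr/fille": np.array([1.0, 0.0]),
    "/c/fr/fillette": np.array([0.0, 1.0]),
    "/c/en/weekend": np.array([0.2, 0.8]),
}

via_english = lookup_with_fallback("weekend", "fr", toy_vectors)
via_prefix = lookup_with_fallback("filles", "fr", toy_vectors)
not_found = lookup_with_fallback("xyz", "fr", toy_vectors)
```

The prefix search here scans every key, which is fine for a toy dictionary but slow for the full multilingual file; a sorted key list or a trie would be a more practical choice at that scale.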

Is there something wrong with my approach?

Best answer

Did you take the format of ConceptNet Numberbatch into account? As shown on the project's GitHub, it looks like this:

/c/en/absolute_value -0.0847 -0.1316 -0.0800 -0.0708 -0.2514 -0.1687 -...

/c/en/absolute_zero 0.0056 -0.0051 0.0332 -0.1525 -0.0955 -0.0902 0.07...

This format means that fille will not be found, but /c/fr/fille will be.
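
One way to act on this is to keep only the rows for one language and strip the `/c/<lang>/` prefix while loading, so that bare words such as fille work as keys. The sketch below is a minimal illustration under stated assumptions: `load_language` is a hypothetical helper, and `SAMPLE` is made-up two-dimensional data standing in for the real file. A word2vec-style header line, if the file has one, is skipped because it does not start with the prefix.

```python
import io
import numpy as np

# Hypothetical sample in the multilingual text layout (values are invented).
SAMPLE = """3 2
/c/en/girl 0.1 0.2
/c/fr/fille 0.3 0.4
/c/fr/nous 0.5 0.6
"""

def load_language(lines, lang):
    """Keep only rows for `lang`, keyed by the bare term instead of the full key."""
    prefix = f"/c/{lang}/"
    vectors = {}
    for line in lines:
        parts = line.split()
        # Header lines, blank lines and other languages do not match the prefix.
        if not parts or not parts[0].startswith(prefix):
            continue
        term = parts[0][len(prefix):]
        vectors[term] = np.asarray(parts[1:], dtype="float32")
    return vectors

# With the real data, an open file handle would replace io.StringIO(SAMPLE).
french = load_language(io.StringIO(SAMPLE), "fr")
print("fille" in french)
print("girl" in french)
```

Terms in the file appear lowercased, with underscores joining multiword terms (as in /c/en/absolute_value above), so corpus tokens likely need the same normalization before lookup.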

Regarding python - ConceptNet Numberbatch (multilingual) OOV words, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/64717185/
