python - ConceptNet Numberbatch (multilingual) OOV words

Reposted by: 行者123. Last updated: 2023-12-05 03:47:50


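I am working on a text classification problem (on a French corpus) and experimenting with different word embeddings. I was very interested in what ConceptNet has to offer, so I decided to give it a try.

I could not find a dedicated tutorial for my particular task, so I followed the advice from their blog: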

How do I use ConceptNet Numberbatch?

To make it as straightforward as possible:

Work through any tutorial on machine learning for NLP that uses semantic vectors. Get to the part where they tell you to use word2vec. (A particularly enlightened tutorial may tell you to use GloVe 1.2.)

Get the ConceptNet Numberbatch data, and use it instead. Get better results that also generalize to other languages.

My approach is shown below (note that 'numberbatch.txt' is the file containing the recommended multilingual version, ConceptNet Numberbatch 19.08):

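from numpy import asarray

# Map the first token of each line to the vector that follows it.
embeddings_index = dict()

# Read as UTF-8 so that accented French words are decoded correctly.
with open('numberbatch.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Loaded %s word vectors.' % len(embeddings_index))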

I first test whether a single word exists:

word = 'fille'
missingWords = 0
if word not in embeddings_index:
    missingWords += 1
print(missingWords)

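To my surprise, a simple word like 'fille' (French for "girl") was not found. I then wrote a function to print every OOV word in my corpus. The results surprised me even more: over 22k words were not found (including 'nous' (we), 'être' (to be), etc.).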
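The OOV-listing function itself is not shown in the question. A minimal sketch of what such a check could look like follows; the name `find_oov_words` and the toy tokens and vectors are hypothetical, chosen only for illustration.

```python
from collections import Counter


def find_oov_words(tokens, embeddings_index):
    """Count the tokens that have no entry in embeddings_index."""
    return Counter(t for t in tokens if t not in embeddings_index)


# Toy data (made up): a tiny index and a tiny tokenized corpus.
toy_index = {'chat': [0.1, 0.2], 'chien': [0.3, 0.4]}
toy_tokens = ['nous', 'chat', 'être', 'nous', 'fille']

oov = find_oov_words(toy_tokens, toy_index)
print(len(oov), 'distinct OOV words')
for word, count in oov.most_common():
    print(word, count)
```

Counting occurrences rather than only collecting a set makes it easier to see whether the missing words are rare tokens or very common ones such as 'nous'.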

I also tried the approach for OOV words proposed on the GitHub page (same result):

Out-of-vocabulary strategy

ConceptNet Numberbatch is evaluated with an out-of-vocabulary strategy that helps its performance in the presence of unfamiliar words. The strategy is implemented in the ConceptNet code base. It can be summarized as follows:

Given an unknown word whose language is not English, try looking up the equivalently-spelled word in the English embeddings (because English words tend to end up in text of all languages).

Given an unknown word, remove a letter from the end, and see if that is a prefix of known words. If so, average the embeddings of those known words.

If the prefix is still unknown, continue removing letters from the end until a known prefix is found. Give up when a single character remains.
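The quoted steps can be sketched roughly as follows. This is not the ConceptNet code base's implementation, only an illustration of the summary above. The function name `lookup_with_oov_strategy`, the per-language dictionary layout, and the toy vectors are all assumptions made for the example.

```python
import numpy as np


def lookup_with_oov_strategy(word, lang, embeddings):
    """Sketch of the quoted strategy; embeddings maps lang -> {term: vector}."""
    vocab = embeddings.get(lang, {})
    if word in vocab:
        return vocab[word]
    # Step 1: for a non-English word, try the same spelling in English.
    if lang != 'en' and word in embeddings.get('en', {}):
        return embeddings['en'][word]
    # Steps 2 and 3: drop letters from the end until the remainder is a
    # prefix of known words, then average those words' vectors.
    # Stop once only a single character would remain.
    prefix = word[:-1]
    while len(prefix) > 1:
        matches = [vec for term, vec in vocab.items() if term.startswith(prefix)]
        if matches:
            return np.mean(matches, axis=0)
        prefix = prefix[:-1]
    return None


# Toy embeddings (made-up values) to exercise each branch.
toy = {
    'en': {'weekend': np.array([1.0, 0.0])},
    'fr': {'fille': np.array([0.0, 1.0]), 'filles': np.array([0.0, 3.0])},
}
print(lookup_with_oov_strategy('weekend', 'fr', toy))   # English fallback
print(lookup_with_oov_strategy('fillette', 'fr', toy))  # prefix average
print(lookup_with_oov_strategy('xyz', 'fr', toy))       # gives up
```

The linear scan over the vocabulary keeps the sketch short; on a vocabulary of realistic size, a sorted key list or a trie would be a more practical way to find words sharing a prefix.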

Is there something wrong with my approach?

Best answer

Did you take the format of ConceptNet Numberbatch into account? As shown on the project's GitHub, it looks like this:

/c/en/absolute_value -0.0847 -0.1316 -0.0800 -0.0708 -0.2514 -0.1687 -...

/c/en/absolute_zero 0.0056 -0.0051 0.0332 -0.1525 -0.0955 -0.0902 0.07...

This format means that fille will not be found, but /c/fr/fille will be.
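One way to act on this is to keep only the entries for the language of interest and strip the `/c/<lang>/` prefix while loading, so that plain tokens can be looked up directly. The sketch below assumes every key follows the `/c/<lang>/<term>` pattern shown above. The helper name `load_numberbatch` is hypothetical, and the French sample vectors are made up.

```python
import numpy as np


def load_numberbatch(lines, lang):
    """Keep only the entries for `lang`, keyed by the bare term."""
    prefix = '/c/%s/' % lang
    index = {}
    for line in lines:
        values = line.split()
        # Skip blank lines, any header line, and entries for other languages.
        if not values or not values[0].startswith(prefix):
            continue
        term = values[0][len(prefix):]
        index[term] = np.asarray(values[1:], dtype='float32')
    return index


# Tiny in-memory sample standing in for the real file (vectors shortened).
sample = [
    '/c/en/absolute_value -0.0847 -0.1316 -0.0800',
    '/c/fr/fille 0.0100 0.0200 0.0300',
    '/c/fr/être 0.0400 0.0500 0.0600',
]
fr_index = load_numberbatch(sample, 'fr')
print('fille' in fr_index)
print(sorted(fr_index))
```

For the real data, the same function can be given an open file handle, for example one obtained from `open('numberbatch.txt', encoding='utf-8')`. Multi-word terms use underscores in the keys (as in `absolute_value`), so corpus tokens may need matching normalization before lookup. The alternative is to leave the index untouched and query it with `'/c/fr/' + word`.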

Source: this question, python - ConceptNet Numberbatch (multilingual) OOV words, comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/64717185/
