python-3.x - Word not in vocabulary

Reposted. Author: 行者123. Updated: 2023-12-02 02:13:47

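This is my first time using word2vec, and the file I am working with is in XML format. I want to iterate through the patents to find each title, then apply word2vec to see whether there are similar words (which would indicate similar titles).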

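So far I have parsed the XML file with ElementTree to retrieve each title. I then applied sent_tokenize followed by the TweetTokenizer to get a list of sentences in which each word is tokenized (I am not sure this is the best approach). Next, I fed the tokenized sentences into my word2vec model and tested it with one word to see whether it returns a vector. This only seems to work for a word in the first sentence. I am not sure whether the model sees all of the sentences.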

    import numpy as np
    import pandas as pd
    import gensim
    import nltk
    import xml.etree.ElementTree as ET
    from gensim.models.word2vec import Word2Vec
    from nltk.tokenize import word_tokenize
    from nltk.tokenize import sent_tokenize
    from nltk.corpus import stopwords
    from nltk.tokenize import TweetTokenizer, sent_tokenize

    tree = ET.parse('6785.xml')
    root = tree.getroot()

    for child in root.iter("Title"):
        Patent_Title = child.text
        sentence = Patent_Title
        stopWords = set(stopwords.words('english'))
        tokens = nltk.sent_tokenize(sentence)
        print(tokens)

        tokenizer_words = TweetTokenizer()
        tokens_sentences = [tokenizer_words.tokenize(t) for t in tokens]
        #print(tokens_sentences)

        model = gensim.models.Word2Vec(tokens_sentences, min_count=1, size=32)
        words = list(model.wv.vocab)
        print(words)
        print(model['Solar'])

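I expected it to recognize the word 'Solar' in the sentences and print its vector, so that I could then look for similar words. Instead I get the error:

    "word 'Solar' not in vocabulary"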

Best answer

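Handle the error as an exception. It is raised on the first pass through the loop, because the model built in that iteration only knows the words of the title being processed.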

    # print(model['Solar'])
    try:
        print(model['Solar'])
    except KeyError:
        # 'Solar' is not in this model's vocabulary; skip it
        pass
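
As an alternative to catching the exception, the lookup can be guarded with a membership test. The following is a minimal sketch, not part of the original answer: `lookup_vector` is a hypothetical helper, and a plain dict stands in for the model's word vectors so the example runs without gensim. With a real model, the assumption is that `model.wv` would be passed in, since it supports both `in` and indexing.

```python
def lookup_vector(vectors, word):
    """Return vectors[word], or None when word is not in the vocabulary.

    `vectors` is anything that supports `in` and `[]`; with gensim the
    assumption is that you would pass `model.wv`.
    """
    if word in vectors:
        return vectors[word]
    return None


# Plain dict used as a stand-in for model.wv (hypothetical toy values).
toy_vectors = {"Wind": [0.1, 0.2], "turbine": [0.3, 0.4]}

missing = lookup_vector(toy_vectors, "Solar")  # word absent from the vocabulary
found = lookup_vector(toy_vectors, "Wind")     # word present in the vocabulary
print(missing, found)
```

Returning `None` keeps the loop running just like the `try`/`except` version, but makes the "word is absent" case explicit instead of silently swallowing an error.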

Working code:

    import numpy as np
    import pandas as pd
    import gensim
    import nltk
    import xml.etree.ElementTree as ET
    from gensim.models.word2vec import Word2Vec
    from nltk.tokenize import word_tokenize
    from nltk.tokenize import sent_tokenize
    from nltk.corpus import stopwords
    from nltk.tokenize import TweetTokenizer, sent_tokenize

    tree = ET.parse('6785.xml')
    root = tree.getroot()

    for child in root.iter("Title"):
        Patent_Title = child.text
        sentence = Patent_Title
        stopWords = set(stopwords.words('english'))
        tokens = nltk.sent_tokenize(sentence)
        print(tokens)

        tokenizer_words = TweetTokenizer()
        tokens_sentences = [tokenizer_words.tokenize(t) for t in tokens]
        #print(tokens_sentences)

        # A new model is trained on the current title only
        model = gensim.models.Word2Vec(tokens_sentences, min_count=1, size=32)
        words = list(model.wv.vocab)
        print(words)
        try:
            print(model['Solar'])
        except KeyError:
            # 'Solar' is not in this title's vocabulary; move on to the next title
            pass
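
The code above still trains a separate model for every title, so each model only knows the words of one title. If the goal is to compare words across titles, one option is to collect the tokens from all titles first and train a single model afterwards. The sketch below is an illustration under stated assumptions, not the answer's method: `SAMPLE_XML`, `collect_title_tokens` and `train_once` are hypothetical names, the sample document merely stands in for `6785.xml` (whose real layout is not shown), and a simple regex replaces the TweetTokenizer so the corpus-building part runs with the standard library alone. `train_once` uses the gensim 4.x parameter name `vector_size`; gensim 3.x, as used in the question, calls it `size`.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for '6785.xml'.
SAMPLE_XML = """<Patents>
  <Patent><Title>Wind turbine blade assembly</Title></Patent>
  <Patent><Title>Solar panel mounting system</Title></Patent>
</Patents>"""


def collect_title_tokens(root):
    """Return one token list per non-empty <Title> element under root."""
    corpus = []
    for child in root.iter("Title"):
        if child.text:  # skip titles with no text
            # Regex tokenizer used as a stand-in for TweetTokenizer.
            corpus.append(re.findall(r"\w+", child.text))
    return corpus


def train_once(corpus):
    """Train a single Word2Vec model on every title (requires gensim)."""
    from gensim.models import Word2Vec  # imported lazily: optional dependency
    # vector_size is the gensim 4.x name; gensim 3.x calls this parameter size.
    return Word2Vec(corpus, min_count=1, vector_size=32)


corpus = collect_title_tokens(ET.fromstring(SAMPLE_XML))
print(corpus)
```

With gensim installed, `model = train_once(corpus)` would then give one vocabulary covering every title, and `'Solar' in model.wv` could be checked before looking up the vector.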

Regarding "python-3.x - Word not in vocabulary", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56524948/
