python - 如何使用 Gensim 在葡萄牙语中生成词嵌入？-6ren

python - 如何使用 Gensim 在葡萄牙语中生成词嵌入？

转载作者：行者123 更新时间：2023-12-01 22:10:28

我有以下问题:

在英语语言中，我的代码使用 Gensim 生成了成功的词嵌入，考虑到余弦距离，相似的短语彼此接近:

“Response time and error measurement”和“Relation of user perceived response time to error measurement”之间的角度很小，因此它们是集合中最相似的短语。

但是，当我在葡萄牙语中使用相同的短语时，它不起作用:

我的代码如下:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import matplotlib.pyplot as plt
from gensim import corpora
documents = ["Interface máquina humana para aplicações computacionais de laboratório abc",
          "Um levantamento da opinião do usuário sobre o tempo de resposta do sistema informático",
           "O sistema de gerenciamento de interface do usuário EPS",
           "Sistema e testes de engenharia de sistemas humanos de EPS",
           "Relação do tempo de resposta percebido pelo usuário para a medição de erro",
           "A geração de árvores não ordenadas binárias aleatórias",
           "O gráfico de interseção dos caminhos nas árvores",
           "Gráfico de menores IV Largura de árvores e bem quase encomendado",
           "Gráficos menores Uma pesquisa"]

stoplist = set('for a of the and to in on'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in documents]
texts

from collections import defaultdict
frequency = defaultdict(int)

for text in texts:
    for token in text:
        frequency[token] += 1
frequency

from nltk import tokenize  
texts=[tokenize.word_tokenize(documents[i], language='portuguese') for i in range(0,len(documents))]

from pprint import pprint
pprint(texts)

dictionary = corpora.Dictionary(texts)
dictionary.save('/tmp/deerwester.dict')
print(dictionary)

print(dictionary.token2id)


# VECTOR
new_doc = "Tempo de resposta e medição de erro"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)

## VETOR OF PHRASES
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('/tmp/deerwester.mm', corpus)  
print(corpus)

from gensim import corpora, models, similarities
tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model

### PHRASE COORDINATES
frase=tfidf[new_vec]
print(frase)

corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi[corpus_tfidf]

lsi.print_topics(2)

## TEXT COORDINATES
todas=[]
for doc in corpus_lsi:
    todas.append(doc)
todas

from gensim import corpora, models, similarities
dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
corpus = corpora.MmCorpus('/tmp/deerwester.mm')
print(corpus)

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

doc = new_doc
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
print(vec_lsi)

p=[]
for i in range(0,len(documents)):
    doc1 = documents[i]
    vec_bow2 = dictionary.doc2bow(doc1.lower().split())
    vec_lsi2 = lsi[vec_bow2]
    p.append(vec_lsi2)

p

index = similarities.MatrixSimilarity(lsi[corpus])

index.save('/tmp/deerwester.index')
index = similarities.MatrixSimilarity.load('/tmp/deerwester.index')

sims = index[vec_lsi]
print(list(enumerate(sims)))

sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims) 

#################

import gensim
import numpy as np
import matplotlib.colors as colors
import matplotlib.cm as cmx
import matplotlib as mpl

matrix1 = gensim.matutils.corpus2dense(p, num_terms=2)
matrix3=matrix1.T
matrix3[0]
ss=[]
for i in range(0,9):
    ss.append(np.insert(matrix3[i],0,[0,0]))
matrix4=ss
matrix4

matrix2 = gensim.matutils.corpus2dense([vec_lsi], num_terms=2)
matrix2=np.insert(matrix2,0,[0,0])
matrix2

DATA=np.insert(matrix4,0,matrix2)
DATA=DATA.reshape(10,4)
DATA

names=np.array(documents)
names=np.insert(names,0,new_doc)
new_doc
cmap = plt.cm.jet

cNorm  = colors.Normalize(vmin=np.min(DATA[:,3])+.2, vmax=np.max(DATA[:,3]))

scalarMap = cmx.ScalarMappable(norm=cNorm,cmap=cmap)
len(DATA[:,1])

plt.subplots()
plt.figure(figsize=(12,9))
plt.scatter(matrix1[0],matrix1[1],s=60)
plt.scatter(matrix2[2],matrix2[3],color='r',s=95)
for idx in range(0,len(DATA[:,1])):
    colorVal = scalarMap.to_rgba(DATA[idx,3])
    plt.arrow(DATA[idx,0],
          DATA[idx,1], 
          DATA[idx,2], 
          DATA[idx,3], 
          color=colorVal,head_width=0.002, head_length=0.001)
for i,names in enumerate (names):
    plt.annotate(names, (DATA[i][2],DATA[i][3]),va='top')
plt.title("PHRASE SIMILARITY - WORD2VEC with GENSIM library")
plt.xlim(min(DATA[:,2]-.2),max(DATA[:,2]+1))
plt.ylim(min(DATA[:,3]-.2),max(DATA[:,3]+.3))
plt.show()

我的问题是:Gensim 是否有任何额外的设置来生成正确的葡萄牙语词嵌入，或者 Gensim 不支持这种语言？

最佳答案

一年零十个月后，我自己得到了回复:在 PyTorch 中使用 BERT embeddings:

短语:

我在 https://github.com/ethanjperez/pytorch-pretrained-BERT/blob/master/examples/extract_features.py 处改编了 PyTorch extract_features.py

class Main:
    def main(self,input_file,output_file):
        self.input_file=input_file
        self.output_file=output_file
        self.bert_model='bert-base-multilingual-uncased'
        self.do_lower_case=True
        self.layers="-1"
        self.max_seq_length=128
        self.batch_size=32
        self.local_rank=-1
        self.no_cuda=False

        if self.local_rank == -1 or self.no_cuda:
            device = torch.device("cuda" if torch.cuda.is_available() and not self.no_cuda else "cpu")
            n_gpu = torch.cuda.device_count()
        else:
            device = torch.device("cuda", self.local_rank)
            n_gpu = 1
            # Initializes the distributed backend which will take care of sychronizing nodes/GPUs
            torch.distributed.init_process_group(backend='nccl')
        logger.info("device: {} n_gpu: {} distributed training: {}".format(device, n_gpu, bool(self.local_rank != -1)))

        layer_indexes = [int(x) for x in self.layers.split(",")]

        tokenizer = BertTokenizer.from_pretrained(self.bert_model, do_lower_case=self.do_lower_case)

        examples = read_examples(self.input_file)

        features = convert_examples_to_features(
            examples=examples, seq_length=self.max_seq_length, tokenizer=tokenizer)

        unique_id_to_feature = {}
        for feature in features:
            unique_id_to_feature[feature.unique_id] = feature

        model = BertModel.from_pretrained(self.bert_model)
        model.to(device)

        if self.local_rank != -1:
            model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[self.local_rank],
                                                            output_device=self.local_rank)
        elif n_gpu > 1:
            model = torch.nn.DataParallel(model)

        all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
        all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
        all_example_index = torch.arange(all_input_ids.size(0), dtype=torch.long)

        eval_data = TensorDataset(all_input_ids, all_input_mask, all_example_index)
        if self.local_rank == -1:
            eval_sampler = SequentialSampler(eval_data)
        else:
            eval_sampler = DistributedSampler(eval_data)
        eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=self.batch_size)

        model.eval()
        with open(self.output_file, "w", encoding='utf-8') as writer:
            for input_ids, input_mask, example_indices in eval_dataloader:
                input_ids = input_ids.to(device)
                input_mask = input_mask.to(device)

                all_encoder_layers, _ = model(input_ids, token_type_ids=None, attention_mask=input_mask)
                all_encoder_layers = all_encoder_layers

                for b, example_index in enumerate(example_indices):
                    feature = features[example_index.item()]
                    unique_id = int(feature.unique_id)
                    # feature = unique_id_to_feature[unique_id]
                    output_json = collections.OrderedDict()
                    output_json["linex_index"] = unique_id
                    all_out_features = []
                    for (i, token) in enumerate(feature.tokens):
                        all_layers = []
                        for (j, layer_index) in enumerate(layer_indexes):
                            layer_output = all_encoder_layers[int(layer_index)].detach().cpu().numpy()
                            layer_output = layer_output[b]
                            layers = collections.OrderedDict()
                            layers["index"] = layer_index
                            print(layer_output.shape)
                            layers["values"] = [
                                round(x.item(), 6) for x in layer_output[i]
                            ]
                            all_layers.append(layers)
                        out_features = collections.OrderedDict()
                        out_features["token"] = token
                        out_features["layers"] = all_layers
                        all_out_features.append(out_features)
                    output_json["features"] = all_out_features
                    writer.write(json.dumps(output_json) + "\n")

然后运行:

embeddings=extrair.Main()
embeddings.main(input_file='gensim.csv',output_file='gensim.json')

解析 JSON 文件:

import json
from pprint import pprint
import numpy as np

data = [json.loads(line) for line in open('gensim.json', 'r')]

xx=[]
for parte in range(0,len(data)):
    xx.append(np.mean([data[parte]['features'][i]['layers'][0]['values'] for i in range(0,len(data[parte]['features']))],axis=0))

from scipy.spatial.distance import cosine as cos

for i in range(0,len(xx)):
    print(cos(xx[2],xx[i]))

获取输出:

关于python - 如何使用 Gensim 在葡萄牙语中生成词嵌入？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/48225845/

文章推荐： python - 在django模板中将str转换为int

文章推荐： php - 为 DELETE 操作生成预签名的 s3 Url

文章推荐： WebGL 计算着色器和 VBO/UBO

文章推荐： reactjs - 如何在 React 中禁用 props.children 的点击事件？

javascript - 使用 WebScriptEndpoint 使用 javascript 使用 WCF 服务
我在网上搜索但没有找到任何合适的文章解释如何使用 javascript 使用 WCF 服务，尤其是 WebScriptEndpoint。任何人都可以对此给出任何指导吗？谢谢最佳答案这是一篇关于
c - 没有结果!!使用 fork() 使用 dup2 使用 2 个管道运行 execlp()
我正在编写一个将运行 Linux 命令的 C 程序，例如: cat/etc/passwd | grep 列表 |剪切-c 1-5 我没有任何结果 *这里 parent 等待第一个 child (chi
python - 处理文件上传，使用 Pillow 调整大小，使用 SQLAlchemy 存储，使用 Flask 提供文件
所以我正在尝试处理文件上传，然后将该文件作为二进制文件存储到数据库中。在我存储它之后，我尝试在给定的 URL 上提供文件。我似乎找不到适合这里的方法。我需要使用数据库，因为我使用 Google 应用引
excel - 使用 IF 使用 VBA 在单元格中添加公式的问题
我正在尝试制作一个宏，将下面的公式添加到单元格中，然后将其拖到整个列中并在 H 列中复制相同的公式我想在 F 和 H 列中输入公式的数据 Range("F1").formula = "=IF(ISE
使用 OperatorPrecedenceParser 使用 FParsec 解析函数应用程序？
问题类似于this one ，但我想使用 OperatorPrecedenceParser 解析带有函数应用程序的表达式在 FParsec . 这是我的 AST: type Expression =
sql - 使用 sequelize 使用 where 查询编码计数
我想通过使用 sequelize 和 node.js 将这个查询更改为代码取决于在哪里 select COUNT(gender) as genderCount from customers where
bash - 使用 “let”分配Bash失败，使用 “/”
我正在使用GNU bash，版本5.0.3(1)-发行版(x86_64-pc-linux-gnu)，我想知道为什么简单的赋值语句会出现语法错误: #/bin/bash var1=/tmp
javascript - 使用 JavaScript 使用 FOR OF 数组循环时出现错误？
这里，为什么我的代码在 IE 中不起作用。我的代码适用于所有浏览器。没有问题。但是当我在 IE 上运行我的项目时，它发现错误。而且我的 jquery 类和 insertadjacentHTMl 也不
javascript - 使用 javascript 使用 for 属性更改表单标签内容
我正在尝试更改标签的innerHTML。我无权访问该表单，因此无法编辑 HTML。标签具有的唯一标识符是“for”属性。这是输入和标签的结构:
javascript - 使用 jquery 使用 .on() 将事件附加到页面上的动态插入按钮
我有一个页面，我可以在其中返回用户帖子，可以使用一些 jquery 代码对这些帖子进行即时评论，在发布新评论后，我在帖子下插入新评论以及删除按钮。问题是 Delete 按钮在新插入的元素上不起作用，
使用 awk 使用 sha1sum 进行散列
我有一个大约有 20 列的“管道分隔”文件。我只想使用 sha1sum 散列第一列，它是一个数字，如帐号，并按原样返回其余列。使用 awk 或 sed 执行此操作的最佳方法是什么？ Accounti
mysql - 使用 insert into 使用 mysql
我需要将以下内容插入到我的表中...我的用户表有五列 id、用户名、密码、名称、条目。 (我还没有提交任何东西到条目中，我稍后会使用 php 来做)但由于某种原因我不断收到这个错误:#1054 - U
jquery - 将输入字段值修剪为仅字母数字字符/使用 .使用 jQuery
所以我试图有一个输入字段，我可以在其中输入任何字符，但然后将输入的值小写，删除任何非字母数字字符，留下“。”而不是空格。例如，如果我输入: 地球的 70% 是水，-!*#$^^ & 30% 土地输
javascript - 使用 .innerHTML 使用 DOM
我正在尝试做一些我认为非常简单的事情，但出于某种原因我没有得到想要的结果？我是 javascript 的新手，但对 java 有经验，所以我相信我没有使用某种正确的规则。这是一个获取输入值、检查选择
php - 使用 angularjs 使用 where 子句从数据库获取数据
我想使用 angularjs 从 mysql 数据库加载数据。这就是应用程序的工作原理；用户登录，他们的用户名存储在 cookie 中。该用户名显示在主页上我想获取这个值并通过 angularjs
ios - 使用 UITableViewCell 使用 AutoLayout
我正在使用 autoLayout，我想在 UITableViewCell 上放置一个 UIlabel，它应该始终位于单元格的右侧和右侧的中心。这就是我想要实现的目标所以在这里你可以看到我正在谈论的
mysql - 使用 ElasticSearch 使用 or 和运算符搜索多个字段
我需要与 MySql 等效的 elasticsearch 查询。我的 sql 查询: SELECT DISTINCT t.product_id AS id FROM tbl_sup_price t
ios - 使用 Swift 使用 JSON
我正在实现代码以使用 JSON。 func setup() { if let flickrURL = NSURL(string: "https://api.flickr.com/
javascript - 使用 JavaScript 使用 for 循环声明变量
我尝试使用for循环声明变量，然后测试cols和rols是否相同。如果是，它将运行递归函数。但是，我在 javascript 中执行 do 时遇到问题。有人可以帮忙吗？现在，在比较 col.1 和
jquery - 使用 :after 使用 jquery 更改样式
我举了一个我正在处理的问题的简短示例。 HTML代码: 1 2 3 CSS 代码: .BB a:hover{ color: #000; } .BB > li:after {

行者123

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

python - 如何使用 Gensim 在葡萄牙语中生成词嵌入？