I use the sentence_vector() method of the BiobertEmbedding Python module (https://pypi.org/project/biobert-embedding/) to vectorize sentences. For some groups of sentences I have no problem, but for others I get the following error message:
  File "/home/nobunaga/.local/lib/python3.6/site-packages/biobert_embedding/embedding.py", line 133, in sentence_vector
    encoded_layers = self.eval_fwdprop_biobert(tokenized_text)
  File "/home/nobunaga/.local/lib/python3.6/site-packages/biobert_embedding/embedding.py", line 82, in eval_fwdprop_biobert
    encoded_layers, _ = self.model(tokens_tensor, segments_tensors)
  File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/nobunaga/.local/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py", line 730, in forward
    embedding_output = self.embeddings(input_ids, token_type_ids)
  File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/nobunaga/.local/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py", line 268, in forward
    position_embeddings = self.position_embeddings(position_ids)
  File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/functional.py", line 1467, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows. at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:237
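I found that for some groups of sentences, the problem was related to tags such as <tb>. For others, however, the error message persists even after the tags are removed.
(Unfortunately, for confidentiality reasons, I cannot share the code.)
Do you have any idea what the problem might be?
Thanks in advance.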
Edit: You are right, cronoik, an example would be better.
Example:
sentences = ["This is the first sentence.", "This is the second sentence.", "This is the third sentence."]
biobert = BiobertEmbedding(model_path='./biobert_v1.1_pubmed_pytorch_model')
vectors = [biobert.sentence_vector(doc) for doc in sentences]
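In my opinion, this last line of code is what triggers the error message.
Best answer
The problem is that the biobert-embedding module does not handle the maximum sequence length of 512 (tokens, not words!). Here is the relevant source code. The example below reproduces the error you received: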
from biobert_embedding.embedding import BiobertEmbedding
#sentence has 385 words
sentence = "The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control characters, or using value in the range from 128 to 255. Using values above 128 conflicts with using the 8th bit as a checksum, but the checksum usage gradually died out. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control"
longersentence = sentence + ' some'
biobert = BiobertEmbedding()
print('sentence has {} tokens'.format(len(biobert.process_text(sentence))))
#works
biobert.sentence_vector(sentence)
print('longersentence has {} tokens'.format(len(biobert.process_text(longersentence))))
#didn't work
biobert.sentence_vector(longersentence)
Output:
sentence has 512 tokens
longersentence has 513 tokens
#your error message....
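Before reaching for a full workaround, it can help to find out which inputs are too long. The following is a minimal sketch, not part of the biobert-embedding API: the helper names `exceeds_limit` and `find_too_long` are hypothetical, and it assumes each token list comes from `biobert.process_text(...)` as in the example above, where 512 tokens still works and 513 fails.

```python
MAX_TOKENS = 512  # limit observed in the example above: 512 works, 513 fails


def exceeds_limit(tokens, max_tokens=MAX_TOKENS):
    """Return True when a token list is too long for one forward pass."""
    return len(tokens) > max_tokens


def find_too_long(token_lists, max_tokens=MAX_TOKENS):
    """Return the indices of all token lists that exceed the limit."""
    return [i for i, toks in enumerate(token_lists) if exceeds_limit(toks, max_tokens)]


# Hypothetical usage with the module from the question:
#   token_lists = [biobert.process_text(doc) for doc in sentences]
#   print(find_too_long(token_lists))
```

Running this over the confidential data should show whether every failing sentence is simply longer than 512 tokens.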
What you should do is implement a sliding window approach to process these texts:
import torch
from biobert_embedding.embedding import BiobertEmbedding
maxtokens = 512
startOffset = 0
docStride = 200
sentence = "The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control characters, or using value in the range from 128 to 255. Using values above 128 conflicts with using the 8th bit as a checksum, but the checksum usage gradually died out. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control"
longersentence = sentence + ' some'
sentences = [sentence, longersentence, 'small test sentence']
vectors = []
biobert = BiobertEmbedding()
#https://github.com/Overfitter/biobert_embedding/blob/b114e3456de76085a6cf881ff2de48ce868e6f4b/biobert_embedding/embedding.py#L127
def sentence_vector(tokenized_text, biobert):
    encoded_layers = biobert.eval_fwdprop_biobert(tokenized_text)
    # `encoded_layers` has shape [12 x 1 x n_tokens x 768]
    # `token_vecs` is a tensor with shape [n_tokens x 768]
    token_vecs = encoded_layers[11][0]
    # Calculate the average of all token vectors.
    sentence_embedding = torch.mean(token_vecs, dim=0)
    return sentence_embedding

for doc in sentences:
    # tokenize your text
    docTokens = biobert.process_text(doc)
    while startOffset < len(docTokens):
        print(startOffset)
        length = min(len(docTokens) - startOffset, maxtokens)
        # now we calculate the sentence_vector for the document slice
        vectors.append(sentence_vector(
            docTokens[startOffset:startOffset+length],
            biobert)
        )
        # stop when the whole document is processed (the document has at most
        # 512 tokens, or the last document slice was processed)
        if startOffset + length == len(docTokens):
            break
        startOffset += min(length, docStride)
    # reset the offset for the next document
    startOffset = 0
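The windowing logic of the loop above can also be isolated into a small generator that runs without loading a model. This is only a sketch under stated assumptions: `sliding_windows` and `mean_pool` are hypothetical helper names, and averaging the window vectors into a single document vector is one possible aggregation that the answer above does not prescribe (it simply appends every window vector to `vectors`).

```python
def sliding_windows(tokens, max_tokens=512, stride=200):
    """Yield (start, window) pairs; each window holds at most `max_tokens` tokens."""
    if max_tokens <= 0 or stride <= 0:
        raise ValueError("max_tokens and stride must be positive")
    step = min(stride, max_tokens)  # never skip tokens between windows
    start = 0
    while start < len(tokens):
        window = tokens[start:start + max_tokens]
        yield start, window
        # Stop once this window reaches the end of the document.
        if start + len(window) >= len(tokens):
            break
        start += step


def mean_pool(vectors):
    """Element-wise mean of equal-length vectors given as plain lists of floats."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# 513 tokens need two windows: one starting at 0 and one starting at 200.
for start, window in sliding_windows(list(range(513))):
    print(start, len(window))
```

With the defaults, consecutive windows overlap by 312 tokens, so text near a window boundary is still seen with context on both sides. A smaller overlap (larger stride) trades that context for fewer forward passes.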
P.S.: The partial success you had by removing <tb> is plausible, because removing <tb> removes 4 tokens ('<', 't', '##b', '>').
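To see why a short tag such as <tb> costs four tokens, here is a toy illustration of BERT-style tokenization: punctuation is split off first, then each remaining word is broken greedily into the longest known sub-word pieces. The vocabulary `TOY_VOCAB` is a hypothetical stand-in for illustration only, not the real BioBERT vocabulary, and the functions are a simplified sketch rather than the library's tokenizer.

```python
import string

# Hypothetical toy vocabulary; the real BioBERT vocabulary is far larger.
TOY_VOCAB = {"<", ">", "t", "##b", "table", "[UNK]"}


def split_punctuation(text):
    """Split on whitespace and turn every punctuation character into its own piece."""
    pieces, current = [], ""
    for ch in text:
        if ch.isspace() or ch in string.punctuation:
            if current:
                pieces.append(current)
                current = ""
            if not ch.isspace():
                pieces.append(ch)
        else:
            current += ch
    if current:
        pieces.append(current)
    return pieces


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word tokens."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no known piece fits: the whole word is unknown
        tokens.append(match)
        start = end
    return tokens


def tokenize(text, vocab=TOY_VOCAB):
    """Tokenize text with the toy vocabulary."""
    return [tok for piece in split_punctuation(text) for tok in wordpiece(piece, vocab)]


print(tokenize("<tb>"))
```

Markup-heavy text can therefore go past the 512-token limit with far fewer than 512 words.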
Regarding python-3.x - torch error "RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/62598130/