
python-3.x - torch error "RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows"


I use the sentence_vector() method of the BiobertEmbedding Python module (https://pypi.org/project/biobert-embedding/) to vectorize sentences. For some groups of sentences I have no problem, but for some others I get the following error message:

File"/home/nobunaga/.local/lib/python3.6/site-packages/biobert_embedding/embedding.py",line 133, in sentence_vectorencoded_layers = self.eval_fwdprop_biobert(tokenized_text) File "/home/nobunaga/.local/lib/python3.6/site-packages/biobert_embedding/embedding.py",line 82, in eval_fwdprop_biobertencoded_layers, _ = self.model(tokens_tensor, segments_tensors) File"/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py",line 547, in __call__result = self.forward(*input, **kwargs) File "/home/nobunaga/.local/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py",line 730, in forwardembedding_output = self.embeddings(input_ids, token_type_ids) File"/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py",line 547, in __call__result = self.forward(*input, **kwargs) File "/home/nobunaga/.local/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py",line 268, in forwardposition_embeddings = self.position_embeddings(position_ids) File"/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py",line 547, in __call__result = self.forward(*input, **kwargs) File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/sparse.py",line 114, in forwardself.norm_type, self.scale_grad_by_freq, self.sparse) File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/functional.py",line 1467, in embeddingreturn torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse) RuntimeError: index out of range: Tried toaccess index 512 out of table with 511 rows. at/pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:237

I found that for some groups of sentences the problem is related to tags such as <tb>, for example. But for others, even with the tags removed, the error message is still there.
(Unfortunately I can't share the code for confidentiality reasons.)

Do you have any idea what the problem could be?

Thanks in advance.

EDIT: You are right, cronoik; it will be better with an example.

Example:

sentences = ["This is the first sentence.", "This is the second sentence.", "This is the third sentence."

biobert = BiobertEmbedding(model_path='./biobert_v1.1_pubmed_pytorch_model')

vectors = [biobert.sentence_vector(doc) for doc in sentences]

It seems to me that this last line of code is what causes the error message.

Best Answer

The problem is that the biobert-embedding module doesn't handle the maximum sequence length of 512 (tokens, not words!). This is the relevant source code. Have a look at the example below to force the error you received:

from biobert_embedding.embedding import BiobertEmbedding
#sentence has 385 words
sentence = "The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control characters, or using value in the range from 128 to 255. Using values above 128 conflicts with using the 8th bit as a checksum, but the checksum usage gradually died out. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control"
longersentence = sentence + ' some'

biobert = BiobertEmbedding()
print('sentence has {} tokens'.format(len(biobert.process_text(sentence))))
#works
biobert.sentence_vector(sentence)
print('longersentence has {} tokens'.format(len(biobert.process_text(longersentence))))
#didn't work
biobert.sentence_vector(longersentence)

Output:

sentence has 512 tokens
longersentence has 513 tokens
#your error message....

What you should do is implement a sliding window approach to process these texts:

import torch
from biobert_embedding.embedding import BiobertEmbedding

maxtokens = 512
startOffset = 0
docStride = 200

sentence = "The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control characters, or using value in the range from 128 to 255. Using values above 128 conflicts with using the 8th bit as a checksum, but the checksum usage gradually died out. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control"
longersentence = sentence + ' some'

sentences = [sentence, longersentence, 'small test sentence']
vectors = []
biobert = BiobertEmbedding()

#https://github.com/Overfitter/biobert_embedding/blob/b114e3456de76085a6cf881ff2de48ce868e6f4b/biobert_embedding/embedding.py#L127
def sentence_vector(tokenized_text, biobert):
    encoded_layers = biobert.eval_fwdprop_biobert(tokenized_text)

    # `encoded_layers` has shape [12 x 1 x 22 x 768]
    # `token_vecs` is a tensor with shape [22 x 768]
    token_vecs = encoded_layers[11][0]

    # Calculate the average of all 22 token vectors.
    sentence_embedding = torch.mean(token_vecs, dim=0)
    return sentence_embedding


for doc in sentences:
    #tokenize your text
    docTokens = biobert.process_text(doc)

    while startOffset < len(docTokens):
        print(startOffset)
        length = min(len(docTokens) - startOffset, maxtokens)

        #now we calculate the sentence_vector for the document slice
        vectors.append(sentence_vector(
            docTokens[startOffset:startOffset + length], biobert))

        #stop when the whole document is processed (the document has fewer
        #than 512 tokens or the last document slice was processed)
        if startOffset + length == len(docTokens):
            break
        startOffset += min(length, docStride)
    startOffset = 0
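
Note that vectors collects one embedding per slice, not per document. If you need a single vector per document, one option (my assumption, not part of the original answer) is to average a document's slice embeddings, mirroring the token averaging inside sentence_vector:

# Hypothetical helper (an assumption, not from the original answer):
# reduce a document's slice embeddings to a single fixed-size vector
# by stacking them and taking the element-wise mean.
def combine_slices(slice_vectors):
    return torch.mean(torch.stack(slice_vectors), dim=0)

For example, longersentence (513 tokens) is split into two slices above (offsets 0 and 200), so combine_slices(vectors[1:3]) would yield one 768-dimensional vector for that document.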

P.S.: Your partial success with removing <tb> is possible because removing <tb> removes 4 tokens ('<', 't', '##b', '>').
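
If you want to verify this, a quick check with the same tokenizer (a minimal sketch, assuming a BiobertEmbedding instance as above; the expected output is just the token split claimed here) makes the WordPiece behaviour visible:

from biobert_embedding.embedding import BiobertEmbedding

biobert = BiobertEmbedding()
# The WordPiece tokenizer splits the tag into four separate tokens,
# which is why stripping every <tb> shortens the sequence by four tokens.
print(biobert.process_text('<tb>'))  # expected: ['<', 't', '##b', '>']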

Regarding the python-3.x torch error "RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/62598130/
