I use the sentence_vector() method of the BiobertEmbedding Python module (https://pypi.org/project/biobert-embedding/) to vectorize sentences. For some groups of sentences I have no problem, but for some others I get the following error message:
File"/home/nobunaga/.local/lib/python3.6/site-packages/biobert_embedding/embedding.py",line 133, in sentence_vectorencoded_layers = self.eval_fwdprop_biobert(tokenized_text) File "/home/nobunaga/.local/lib/python3.6/site-packages/biobert_embedding/embedding.py",line 82, in eval_fwdprop_biobertencoded_layers, _ = self.model(tokens_tensor, segments_tensors) File"/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py",line 547, in __call__result = self.forward(*input, **kwargs) File "/home/nobunaga/.local/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py",line 730, in forwardembedding_output = self.embeddings(input_ids, token_type_ids) File"/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py",line 547, in __call__result = self.forward(*input, **kwargs) File "/home/nobunaga/.local/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py",line 268, in forwardposition_embeddings = self.position_embeddings(position_ids) File"/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py",line 547, in __call__result = self.forward(*input, **kwargs) File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/sparse.py",line 114, in forwardself.norm_type, self.scale_grad_by_freq, self.sparse) File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/functional.py",line 1467, in embeddingreturn torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse) RuntimeError: index out of range: Tried toaccess index 512 out of table with 511 rows. at/pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:237
I found that for some groups of sentences the problem is related to tags such as <tb>, for example. But for others, even with the tags removed, the error message is still there.
(Unfortunately, for confidentiality reasons, I cannot share the code.)
Do you have any idea what the problem could be?
Thanks in advance.
EDIT: You are right cronoik, it will be better with an example.
Example:
sentences = ["This is the first sentence.", "This is the second sentence.", "This is the third sentence."
biobert = BiobertEmbedding(model_path='./biobert_v1.1_pubmed_pytorch_model')
vectors = [biobert.sentence_vector(doc) for doc in sentences]
In my opinion, it is this last line of code that causes the error message.
Best Answer
The problem is that the biobert-embedding module isn't taking care of the maximum sequence length of 512 (tokens, not words!). This is the relevant source code. Have a look at the following example to force the error you received:
from biobert_embedding.embedding import BiobertEmbedding
#sentence has 385 words
sentence = "The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control characters, or using value in the range from 128 to 255. Using values above 128 conflicts with using the 8th bit as a checksum, but the checksum usage gradually died out. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control"
longersentence = sentence + ' some'
biobert = BiobertEmbedding()
print('sentence has {} tokens'.format(len(biobert.process_text(sentence))))
#works
biobert.sentence_vector(sentence)
print('longersentence has {} tokens'.format(len(biobert.process_text(longersentence))))
#didn't work
biobert.sentence_vector(longersentence)
Output:
sentence has 512 tokens
longersentence has 513 tokens
#your error message....
What you should do is implement a sliding window approach to process these texts:
import torch
from biobert_embedding.embedding import BiobertEmbedding
maxtokens = 512
startOffset = 0
docStride = 200
sentence = "The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control characters, or using value in the range from 128 to 255. Using values above 128 conflicts with using the 8th bit as a checksum, but the checksum usage gradually died out. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control"
longersentence = sentence + ' some'
sentences = [sentence, longersentence, 'small test sentence']
vectors = []
biobert = BiobertEmbedding()
#https://github.com/Overfitter/biobert_embedding/blob/b114e3456de76085a6cf881ff2de48ce868e6f4b/biobert_embedding/embedding.py#L127
def sentence_vector(tokenized_text, biobert):
    encoded_layers = biobert.eval_fwdprop_biobert(tokenized_text)

    # `encoded_layers` has shape [12 x 1 x sequence_length x 768];
    # `token_vecs` is a tensor with shape [sequence_length x 768]
    token_vecs = encoded_layers[11][0]

    # Calculate the average of all token vectors.
    sentence_embedding = torch.mean(token_vecs, dim=0)
    return sentence_embedding

for doc in sentences:
    #tokenize your text
    docTokens = biobert.process_text(doc)

    while startOffset < len(docTokens):
        print(startOffset)
        length = min(len(docTokens) - startOffset, maxtokens)

        #now we calculate the sentence_vector for the document slice
        vectors.append(sentence_vector(docTokens[startOffset:startOffset + length], biobert))

        #stop when the whole document is processed (the document has fewer than 512 tokens
        #or the last document slice was processed)
        if startOffset + length == len(docTokens):
            break
        startOffset += min(length, docStride)
    startOffset = 0
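If you need a single vector per document instead of one per slice, one straightforward choice is to mean-pool the slice vectors of each document. A minimal sketch, assuming you collect the slice vectors of each document in its own list (the loop above appends all slices of all documents into one flat vectors list):
import torch

def document_vector(slice_vectors):
    #mean-pool a list of per-slice embeddings (each of shape [768]) into a
    #single document embedding of shape [768]; weighting slices by their
    #token count would be another reasonable choice
    return torch.mean(torch.stack(slice_vectors), dim=0)

#example usage: doc_embedding = document_vector(vectors_for_one_document)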
P.S.: Your partial success with removing <tb> is possible because removing <tb> removes 4 tokens ('<', 't', '##b', '>').
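As a quick sanity check, you can inspect the tokenization directly with process_text() (the same step used in the loop above); a minimal sketch, assuming the default BiobertEmbedding() setup works in your environment:
from biobert_embedding.embedding import BiobertEmbedding

biobert = BiobertEmbedding()

#per the note above, removing <tb> drops 4 tokens; you can verify this by
#tokenizing a snippet with and without the tag and comparing the counts
with_tag = biobert.process_text("some text <tb> more text")
without_tag = biobert.process_text("some text more text")
print(len(with_tag) - len(without_tag))  #expected: 4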
Regarding the python-3.x torch error "RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/62598130/