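I've read posts that explain how the sliding window works, but I can't find any information on how it is actually implemented.
From what I understand, if the input is too long, a sliding window can be used to process the text.
Please correct me if I am wrong.
Say I have the text "In June 2017 Kaggle announced that it passed 1 million registered users".
Given some stride and max_len, the input can be split into chunks with overlapping words (not considering padding).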
In June 2017 Kaggle announced that # chunk 1
announced that it passed 1 million # chunk 2
1 million registered users # chunk 3
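A minimal sketch of that word-level split, assuming max_len=6 and taking stride to mean the step between window starts (4 here, so neighbouring chunks share 2 words). Some APIs use the word stride for the overlap instead, so the naming is an assumption, and sliding_window_chunks is a hypothetical helper, not a library function:

```python
def sliding_window_chunks(words, max_len, stride):
    """Split a list of words into overlapping chunks.

    max_len is the maximum number of words per chunk; stride is the
    step between the starts of consecutive chunks (assumed meaning).
    """
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    chunks = []
    start = 0
    while True:
        chunks.append(words[start:start + max_len])
        # Stop once the current window reaches the end of the text.
        if start + max_len >= len(words):
            break
        start += stride
    return chunks


text = "In June 2017 Kaggle announced that it passed 1 million registered users"
chunks = sliding_window_chunks(text.split(), max_len=6, stride=4)
for i, chunk in enumerate(chunks, 1):
    print(f"chunk {i}: {' '.join(chunk)}")
```

With these values the sketch reproduces the three chunks listed above. A real pipeline would slide over subword tokens rather than whitespace-separated words.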
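If my questions were "when did Kaggle make the announcement" and "how many registered users", I could use chunk 1 and chunk 3 and not use chunk 2 at all in the model. I'm not quite sure whether I should still use chunk 2 to train the model.
The inputs would then be:
[CLS]when did Kaggle make the announcement[SEP]In June 2017 Kaggle announced that[SEP]
and
[CLS]how many registered users[SEP]1 million registered users[SEP]
A question such as "can pigs fly", paired with each of the three chunks, would look like this:
[CLS]can pigs fly[SEP]In June 2017 Kaggle announced that[SEP]
[CLS]can pigs fly[SEP]announced that it passed 1 million[SEP]
[CLS]can pigs fly[SEP]1 million registered users[SEP]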
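To make the pairing concrete, here is a small sketch that builds one input per (question, chunk) pair and checks which chunks contain a given answer string. The helper names are hypothetical, and plain substring matching is a simplifying assumption; real SQuAD preprocessing works on token positions. How to label the chunks that lack the answer is left open here.

```python
def build_inputs(question, chunks):
    """Pair one question with every chunk in the [CLS]q[SEP]chunk[SEP] layout."""
    return [f"[CLS]{question}[SEP]{chunk}[SEP]" for chunk in chunks]


def chunks_with_answer(chunks, answer_text):
    """Return the indices of the chunks that contain the whole answer string."""
    return [i for i, chunk in enumerate(chunks) if answer_text in chunk]


chunks = [
    "In June 2017 Kaggle announced that",
    "announced that it passed 1 million",
    "1 million registered users",
]

inputs = build_inputs("can pigs fly", chunks)
for line in inputs:
    print(line)

# Indices are 0-based: chunk 1 is index 0, chunk 3 is index 2.
june_hits = chunks_with_answer(chunks, "June 2017")
million_hits = chunks_with_answer(chunks, "1 million")
print(june_hits, million_hits)
```

In this toy case the string "1 million" falls in the overlap, so it shows up in both chunk 2 and chunk 3.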
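I have tried to use squad_convert_example_to_features (source code) to investigate the problem above, but it doesn't seem to work, and there is no documentation. It seems that run_squad.py from Huggingface uses squad_convert_examples_to_features, with an s in examples.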
from transformers.data.processors.squad import SquadResult, SquadV1Processor, SquadV2Processor, squad_convert_example_to_features
from transformers import AutoTokenizer, AutoConfig, squad_convert_examples_to_features

FILE_DIR = "."
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
processor = SquadV2Processor()
examples = processor.get_train_examples(FILE_DIR)
features = squad_convert_example_to_features(
    example=examples[0],
    max_seq_length=384,
    doc_stride=128,
    max_query_length=64,
    is_training=True,
)
I got the error:
100%|██████████| 1/1 [00:00<00:00, 159.95it/s]
Traceback (most recent call last):
  File "<input>", line 25, in <module>
    sub_tokens = tokenizer.tokenize(token)
NameError: name 'tokenizer' is not defined
The error indicates that there is no tokenizer, but the function does not allow us to pass one. It does work if I add a tokenizer inside the function while in debug mode. So how exactly do I use the squad_convert_example_to_features function?
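Best answer
I think there is a problem with the examples you picked. Both squad_convert_examples_to_features and squad_convert_example_to_features implement the sliding window approach, because squad_convert_examples_to_features is just a parallelization wrapper for squad_convert_example_to_features. But let's look at the single-example function. First, you need to call squad_convert_example_to_features_init to make the tokenizer global (this is done automatically for you in squad_convert_examples_to_features):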
from transformers.data.processors.squad import SquadResult, SquadV1Processor, SquadV2Processor, squad_convert_example_to_features, squad_convert_example_to_features_init
from transformers import AutoTokenizer, AutoConfig, squad_convert_examples_to_features

FILE_DIR = "."
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Register the tokenizer as the global that the single-example function uses.
squad_convert_example_to_features_init(tokenizer)
processor = SquadV2Processor()
examples = processor.get_train_examples(FILE_DIR)
features = squad_convert_example_to_features(
    example=examples[0],
    max_seq_length=384,
    doc_stride=128,
    max_query_length=64,
    is_training=True,
)
print(len(features))
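Output:
1
You might say that this function does not use the sliding window approach, but that is wrong: your example simply does not need to be split: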
print(len(examples[0].question_text.split()) + len(examples[0].doc_tokens))
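Output:
115
This is smaller than the max_seq_length you set to 384. (It is only a rough whitespace-based word count; the number of subword tokens the model sees is somewhat higher.) Now let's try a different one:
print(len(examples[129603].question_text.split()) + len(examples[129603].doc_tokens))
features = squad_convert_example_to_features(
    example=examples[129603],
    max_seq_length=384,
    doc_stride=128,
    max_query_length=64,
    is_training=True,
)
print(len(features))
Output: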
454
3
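As a rough cross-check of why a long example yields several features, the number of windows can be estimated from the document length in subword tokens. This is a sketch under assumptions, not the library's code: the window is taken to be max_seq_length minus the query tokens minus 3 special tokens ([CLS] and two [SEP]), and each window is assumed to start doc_stride tokens after the previous one. The document length of 600 tokens is a made-up value for illustration, and count_windows is a hypothetical helper.

```python
import math


def count_windows(n_doc_tokens, window, step):
    """Number of sliding windows needed to cover n_doc_tokens tokens."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if n_doc_tokens <= window:
        return 1  # everything fits into a single window
    # One full window, plus one more for every step needed to reach the end.
    return math.ceil((n_doc_tokens - window) / step) + 1


max_seq_length = 384
doc_stride = 128
n_query_tokens = 10  # "how often is hunting occurring in delaware each year ?"
window = max_seq_length - n_query_tokens - 3  # room left for document tokens

print(window)
print(count_windows(600, window, doc_stride))  # 600 is a hypothetical length
```

Under these assumptions, any document of 500 to 627 subword tokens would be covered by three windows.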
You can now compare the splits with the original sample:
print('[CLS]' + examples[129603].question_text + '[SEP]' + ' '.join(examples[129603].doc_tokens) + '[SEP]')
for idx, f in enumerate(features):
    print('Split {}'.format(idx))
    print(' '.join(f.tokens))
Output:
[CLS]How often is hunting occurring in Delaware each year?[SEP]There is a very active tradition of hunting of small to medium-sized wild game in Trinidad and Tobago. Hunting is carried out with firearms, and aided by the use of hounds, with the illegal use of trap guns, trap cages and snare nets. With approximately 12,000 sport hunters applying for hunting licences in recent years (in a very small country of about the size of the state of Delaware at about 5128 square kilometers and 1.3 million inhabitants), there is some concern that the practice might not be sustainable. In addition there are at present no bag limits and the open season is comparatively very long (5 months - October to February inclusive). As such hunting pressure from legal hunters is very high. Added to that, there is a thriving and very lucrative black market for poached wild game (sold and enthusiastically purchased as expensive luxury delicacies) and the numbers of commercial poachers in operation is unknown but presumed to be fairly high. As a result, the populations of the five major mammalian game species (red-rumped agouti, lowland paca, nine-banded armadillo, collared peccary, and red brocket deer) are thought to be quite low (although scientifically conducted population studies are only just recently being conducted as of 2013). It appears that the red brocket deer population has been extirpated on Tobago as a result of over-hunting. Various herons, ducks, doves, the green iguana, the gold tegu, the spectacled caiman and the common opossum are also commonly hunted and poached. There is also some poaching of 'fully protected species', including red howler monkeys and capuchin monkeys, southern tamanduas, Brazilian porcupines, yellow-footed tortoises, Trinidad piping guans and even one of the national birds, the scarlet ibis. Legal hunters pay very small fees to obtain hunting licences and undergo no official basic conservation biology or hunting-ethics training. 
There is presumed to be relatively very little subsistence hunting in the country (with most hunting for either sport or commercial profit). The local wildlife management authority is under-staffed and under-funded, and as such very little in the way of enforcement is done to uphold existing wildlife management laws, with hunting occurring both in and out of season, and even in wildlife sanctuaries. There is some indication that the government is beginning to take the issue of wildlife management more seriously, with well drafted legislation being brought before Parliament in 2015. It remains to be seen if the drafted legislation will be fully adopted and financially supported by the current and future governments, and if the general populace will move towards a greater awareness of the importance of wildlife conservation and change the culture of wanton consumption to one of sustainable management.[SEP]
Split 0
[CLS] how often is hunting occurring in delaware each year ? [SEP] there is a very active tradition of hunting of small to medium - sized wild game in trinidad and tobago . hunting is carried out with firearms , and aided by the use of hounds , with the illegal use of trap guns , trap cages and s ##nare nets . with approximately 12 , 000 sport hunters applying for hunting licence ##s in recent years ( in a very small country of about the size of the state of delaware at about 512 ##8 square kilometers and 1 . 3 million inhabitants ) , there is some concern that the practice might not be sustainable . in addition there are at present no bag limits and the open season is comparatively very long ( 5 months - october to february inclusive ) . as such hunting pressure from legal hunters is very high . added to that , there is a thriving and very lucrative black market for po ##ache ##d wild game ( sold and enthusiastically purchased as expensive luxury del ##ica ##cies ) and the numbers of commercial po ##ache ##rs in operation is unknown but presumed to be fairly high . as a result , the populations of the five major mammalian game species ( red - rum ##ped ago ##uti , lowland pac ##a , nine - banded arm ##adi ##llo , collar ##ed pe ##cca ##ry , and red brock ##et deer ) are thought to be quite low ( although scientific ##ally conducted population studies are only just recently being conducted as of 2013 ) . it appears that the red brock ##et deer population has been ex ##ti ##rp ##ated on tobago as a result of over - hunting . various heron ##s , ducks , dove ##s , the green i ##gua ##na , the gold te ##gu , the spectacle ##d cai ##man and the common op ##oss ##um are also commonly hunted and po ##ache ##d . there is also some po ##achi ##ng of ' fully protected species ' , including red howl ##er monkeys and cap ##uchi ##n monkeys , southern tam ##and ##ua ##s , brazilian por ##cup ##ines , yellow - footed tor ##to ##ises , [SEP]
Split 1
[CLS] how often is hunting occurring in delaware each year ? [SEP] october to february inclusive ) . as such hunting pressure from legal hunters is very high . added to that , there is a thriving and very lucrative black market for po ##ache ##d wild game ( sold and enthusiastically purchased as expensive luxury del ##ica ##cies ) and the numbers of commercial po ##ache ##rs in operation is unknown but presumed to be fairly high . as a result , the populations of the five major mammalian game species ( red - rum ##ped ago ##uti , lowland pac ##a , nine - banded arm ##adi ##llo , collar ##ed pe ##cca ##ry , and red brock ##et deer ) are thought to be quite low ( although scientific ##ally conducted population studies are only just recently being conducted as of 2013 ) . it appears that the red brock ##et deer population has been ex ##ti ##rp ##ated on tobago as a result of over - hunting . various heron ##s , ducks , dove ##s , the green i ##gua ##na , the gold te ##gu , the spectacle ##d cai ##man and the common op ##oss ##um are also commonly hunted and po ##ache ##d . there is also some po ##achi ##ng of ' fully protected species ' , including red howl ##er monkeys and cap ##uchi ##n monkeys , southern tam ##and ##ua ##s , brazilian por ##cup ##ines , yellow - footed tor ##to ##ises , trinidad pip ##ing gu ##ans and even one of the national birds , the scarlet ib ##is . legal hunters pay very small fees to obtain hunting licence ##s and undergo no official basic conservation biology or hunting - ethics training . there is presumed to be relatively very little subsistence hunting in the country ( with most hunting for either sport or commercial profit ) . the local wildlife management authority is under - staffed and under - funded , and as such very little in the way of enforcement is done to uphold existing wildlife management laws , with hunting occurring both in and out of season , and even in wildlife san ##ct ##uaries . 
there is some indication that the government is beginning to [SEP]
Split 2
[CLS] how often is hunting occurring in delaware each year ? [SEP] being conducted as of 2013 ) . it appears that the red brock ##et deer population has been ex ##ti ##rp ##ated on tobago as a result of over - hunting . various heron ##s , ducks , dove ##s , the green i ##gua ##na , the gold te ##gu , the spectacle ##d cai ##man and the common op ##oss ##um are also commonly hunted and po ##ache ##d . there is also some po ##achi ##ng of ' fully protected species ' , including red howl ##er monkeys and cap ##uchi ##n monkeys , southern tam ##and ##ua ##s , brazilian por ##cup ##ines , yellow - footed tor ##to ##ises , trinidad pip ##ing gu ##ans and even one of the national birds , the scarlet ib ##is . legal hunters pay very small fees to obtain hunting licence ##s and undergo no official basic conservation biology or hunting - ethics training . there is presumed to be relatively very little subsistence hunting in the country ( with most hunting for either sport or commercial profit ) . the local wildlife management authority is under - staffed and under - funded , and as such very little in the way of enforcement is done to uphold existing wildlife management laws , with hunting occurring both in and out of season , and even in wildlife san ##ct ##uaries . there is some indication that the government is beginning to take the issue of wildlife management more seriously , with well drafted legislation being brought before parliament in 2015 . it remains to be seen if the drafted legislation will be fully adopted and financially supported by the current and future governments , and if the general populace will move towards a greater awareness of the importance of wildlife conservation and change the culture of want ##on consumption to one of sustainable management . [SEP]
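As for why the init call is needed at all: the single-example function looks the tokenizer up as a global name instead of taking it as an argument, presumably so that a parallel wrapper can set it once per worker process. The following is a toy sketch of that pattern with hypothetical names (convert_init, convert_one, _shared_tokenizer) and str.split standing in for a real tokenizer; it is not the library's code.

```python
def convert_init(tokenizer_for_convert):
    """Store the tokenizer in a module-level global so convert_one can reach it."""
    global _shared_tokenizer
    _shared_tokenizer = tokenizer_for_convert


def convert_one(text):
    """Tokenize one text with the shared tokenizer.

    Raises NameError if convert_init has not been called first.
    """
    return _shared_tokenizer(text)


# Without the init call, the global name does not exist yet.
try:
    convert_one("can pigs fly")
    failed_without_init = False
except NameError:
    failed_without_init = True

# After the init call, the same function works.
convert_init(str.split)
tokens = convert_one("can pigs fly")
print(failed_without_init, tokens)
```

This mirrors the NameError from the question: calling the worker function directly skips the step that defines the global.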
Regarding nlp - sliding window for long text in BERT for question answering, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/62978957/