
python - Customizing the tag and lemma for URLs with spaCy

Reposted. Author: 太空宇宙. Last updated: 2023-11-04 08:37:10

Consider the sentence

msg = 'I got this URL https://stackoverflow.com/questions/47637005/handmade-estimator-modifies-parameters-in-init/47637293?noredirect=1#comment82268544_47637293 freed'

Next, I process the sentence with the out-of-the-box English spaCy model:

import spacy
nlp = spacy.load('en')
doc = nlp(msg)

Let's look at the output of [(t, t.lemma_, t.pos_, t.tag_, t.dep_) for t in doc]:

[(I, '-PRON-', 'PRON', 'PRP', 'nsubj'),
(got, 'get', 'VERB', 'VBD', 'ROOT'),
(this, 'this', 'DET', 'DT', 'det'),
(URL, 'url', 'NOUN', 'NN', 'compound'),
(https://stackoverflow.com/questions/47637005/handmade-estimator-modifies-parameters-in-init/47637293?noredirect=1#comment82268544_47637293,
'https://stackoverflow.com/questions/47637005/handmade-estimator-modifies-parameters-in-init/47637293?noredirect=1#comment82268544_47637293',
'NOUN',
'NN',
'nsubj'),
(freed, 'free', 'VERB', 'VBN', 'ccomp')]

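I would like to improve how the URL token is handled. In particular, I want to:

  1. Set its lemma to stackoverflow.com
  2. Set its tag to URL

How can I do this with spaCy? I would like to use a regular expression (as suggested here) to decide whether a string is a URL and to extract its domain. So far I have not found a way to do it.

Edit: I think what I need is a custom component. However, there seems to be no way to supply a regex-based (or any other) callable as one of the patterns.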

Best answer

Custom regex for URLs

You can specify a URL regex with a custom tokenizer, for example the one from https://spacy.io/usage/linguistic-features#native-tokenizers

import regex as re  # third-party `regex` package; the standard-library `re` also works for these patterns
import spacy
from spacy.tokenizer import Tokenizer

prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[-~]''')
simple_url_re = re.compile(r'''^https?://''')

def custom_tokenizer(nlp):
    # token_match keeps any chunk that starts with http:// or https:// as one token
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=simple_url_re.match)

# 'en' is the spaCy 2.x shortcut; spaCy 3.x needs a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')
nlp.tokenizer = custom_tokenizer(nlp)

msg = 'I got this URL https://stackoverflow.com/questions/47637005/handmade-estimator-modifies-parameters-in-init/47637293?noredirect=1#comment82268544_47637293 freed'

for i, token in enumerate(nlp(msg)):
    print(i, ':\t', token)

[Output]:

0 :  I
1 : got
2 : this
3 : URL
4 : https://stackoverflow.com/questions/47637005/handmade-estimator-modifies-parameters-in-init/47637293?noredirect=1#comment82268544_47637293
5 : freed
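
To see why token_match matters here, the following toy sketch imitates the decision for a single whitespace-delimited chunk using only the standard library. It is not spaCy's actual tokenization algorithm, and split_chunk is a hypothetical helper name. It only illustrates the assumption that a chunk accepted by token_match is kept whole, while other chunks are split on the infix pattern [-~].

```python
import re

infix_re = re.compile(r"[-~]")
simple_url_re = re.compile(r"^https?://")

def split_chunk(chunk):
    """Toy stand-in for the tokenizer's handling of one chunk (hypothetical helper)."""
    # A chunk accepted by the token_match pattern is kept as a single token.
    if simple_url_re.match(chunk):
        return [chunk]
    # Otherwise split on infixes, keeping each infix character as its own token.
    pieces = []
    start = 0
    for m in infix_re.finditer(chunk):
        if m.start() > start:
            pieces.append(chunk[start:m.start()])
        pieces.append(m.group())
        start = m.end()
    if start < len(chunk):
        pieces.append(chunk[start:])
    return pieces

print(split_chunk("handmade-estimator"))
print(split_chunk("https://stackoverflow.com/questions/47637005/handmade-estimator"))
```

Without the URL check, the hyphens inside the URL path would be treated as infixes and the URL would be broken into several tokens.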

Checking whether a token is a URL

You can check whether a token looks like a URL, for example

for i, token in enumerate(nlp(msg)):
    print(token.like_url, ':\t', token.lemma_)

[Output]:

False :  -PRON-
False : get
False : this
False : url
True : https://stackoverflow.com/questions/47637005/handmade-estimator-modifies-parameters-in-init/47637293?noredirect=1#comment82268544_47637293
False : free

Changing the tag if LIKE_URL

doc = nlp(msg)

for i, token in enumerate(doc):
    if token.like_url:
        token.tag_ = 'URL'

print([token.tag_ for token in doc])

[Output]:

['PRP', 'VBD', 'DT', 'NN', 'URL', 'VBN']

Replacing the URL's lemma with a custom lemma

Using a regex (see https://regex101.com/r/KfjQ1G/1):

doc = nlp(msg)

for i, token in enumerate(doc):
    # 'https?' accepts both http and https; the dot in the domain is escaped
    if re.match(r'https?://stackoverflow\.com.*', token.lemma_):
        token.lemma_ = 'stackoverflow.com'

print([token.lemma_ for token in doc])

[Output]:

['-PRON-', 'get', 'this', 'url', 'stackoverflow.com', 'free']
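
The regex above is tied to stackoverflow.com. Since the question asks for the domain of any URL, one possible generalization is to parse the token text with the standard library's urllib.parse instead. This is a minimal sketch under that assumption; url_to_domain is a hypothetical helper name and is not part of spaCy.

```python
from urllib.parse import urlparse

def url_to_domain(text):
    """Return the host of an http(s) URL, or None if `text` is not one (hypothetical helper)."""
    try:
        parsed = urlparse(text)
    except ValueError:
        # urlparse can raise on malformed input such as unbalanced IPv6 brackets
        return None
    if parsed.scheme in ("http", "https") and parsed.hostname:
        return parsed.hostname
    return None

url = ("https://stackoverflow.com/questions/47637005/"
       "handmade-estimator-modifies-parameters-in-init/47637293"
       "?noredirect=1#comment82268544_47637293")
print(url_to_domain(url))      # stackoverflow.com
print(url_to_domain("freed"))  # None

# Possible use inside the loop over doc (requires spaCy, not run here):
#     domain = url_to_domain(token.text)
#     if domain:
#         token.lemma_ = domain
```

Note that hostname drops any port or credentials and lowercases the host, which is usually what a lemma for a URL should look like.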

Regarding python - Customizing the tag and lemma for URLs with spaCy, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48112057/
