
python - Printing abbreviations and hyphenated words

Reposted. Author: 太空宇宙. Updated: 2023-11-04 11:11:10

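I need to first identify all abbreviations and hyphenated words in a sentence, and print them as they are identified. My code does not seem to work properly for this identification.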

import re

sentence_stream2 = df1['Open End Text']
for sent in sentence_stream2:
    abbs_ = re.findall(r'(?:[A-Z]\.)+', sent)   # abbreviations
    hypns_ = re.findall(r'\w+(?:-\w+)*', sent)  # hyphenated words

    print("new sentence:")
    print(sent)
    print(abbs_)
    print(hypns_)

One sentence in my corpus is: DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI

The output for this is:

new sentence:
DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI
[]
['DevOps', 'with', 'APIs', 'event-driven', 'architecture', 'using', 'cloud', 'Data', 'Analytics', 'environment', 'Self-service', 'BI']

The expected output is:

new sentence:
DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI
['APIs','BI']
['event-driven','Self-service']

Best answer

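Your abbreviation rule does not match anything here: it looks for single capital letters each followed by a dot. What you want is any word containing two or more consecutive capital letters. A rule you could use is: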

abbs_ = re.findall(r'(?:[A-Z]{2,}s?\.?)', sent) #abbreviations

This will match APIs and BI.
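One caveat: the rule above has no word boundaries, so it can also pick up a run of capitals in the middle of a longer token. A minimal sketch, assuming only whole words should count. The `\b` anchors, the helper name `find_caps_abbreviations`, and the `myHTTPclient` sample are additions for illustration, and the optional trailing dot is dropped to keep the boundary check simple:

```python
import re

# Hypothetical helper: whole-word runs of 2+ capitals, with an optional plural "s".
CAPS_ABBREVIATION = re.compile(r'\b[A-Z]{2,}s?\b')

def find_caps_abbreviations(text):
    """Return whole-word, all-caps abbreviations found in text."""
    return CAPS_ABBREVIATION.findall(text)

sample = ("DevOps with APIs & event-driven architecture using cloud "
          "Data Analytics environment Self-service BI")

print(find_caps_abbreviations(sample))               # ['APIs', 'BI']

# The unanchored rule also matches capitals embedded in a longer token.
print(re.findall(r'[A-Z]{2,}s?\.?', "myHTTPclient"))  # ['HTTP']
print(find_caps_abbreviations("myHTTPclient"))        # []
```

Whether embedded runs such as `HTTP` should count as abbreviations depends on your data, so treat the anchors as optional.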

import re

t = "DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI"

abbs_ = re.findall(r'(?:[A-Z]\.)+', t)          # your rule: single capitals each followed by a dot
cap_ = re.findall(r'(?:[A-Z]{2,}s?\.?)', t)     # runs of two or more capitals, optional "s" and dot
hypns_ = re.findall(r'\w+-\w+', t)              # hyphenated words, fixed: the hyphen is now required

print("new sentence:")
print(t)
print(abbs_)
print(cap_)
print(hypns_)

Output:

new sentence:
DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI
[] # your abbreviation rule - does not find any capital letter followed by .
['APIs', 'BI'] # cap_ rule
['event-driven', 'Self-service'] # fixed hyphen rule
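The hyphen rule from the question, `\w+(?:-\w+)*`, matched every word because `*` lets the hyphen group repeat zero times. The fixed rule `\w+-\w+` requires exactly one hyphen, so a word with several hyphens comes back in pieces. A small sketch of a variant that requires one or more hyphen groups. The sample phrase `state-of-the-art` is an illustration, not taken from the question's data:

```python
import re

text = "a state-of-the-art, event-driven design"

# Exactly one hyphen per match: multi-hyphen words are split.
one_hyphen = re.findall(r'\w+-\w+', text)

# One or more "-word" groups: multi-hyphen words stay whole.
multi_hyphen = re.findall(r'\w+(?:-\w+)+', text)

print(one_hyphen)    # ['state-of', 'the-art', 'event-driven']
print(multi_hyphen)  # ['state-of-the-art', 'event-driven']
```

For the sentence in the question both rules give the same result, since no word there has more than one hyphen.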

This will most likely not find every abbreviation, for example in

t = "Prof. Dr. S. Quakernack"

so you may need to tune the rule against more of your data, for example with a tool such as http://www.regex101.com
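To see the gap concretely, here is what the two rules return for that string, followed by one possible rule for title-style abbreviations. The third pattern is a hypothetical starting point, not a complete solution: it assumes an abbreviation is a capital letter, at most three lowercase letters, and a dot, so it would also match a short capitalized word at the end of a sentence (such as "Yes.").

```python
import re

t = "Prof. Dr. S. Quakernack"

# The rule from the question: only single capitals followed by a dot.
print(re.findall(r'(?:[A-Z]\.)+', t))      # ['S.']

# The all-caps rule: nothing here has two consecutive capitals.
print(re.findall(r'[A-Z]{2,}s?\.?', t))    # []

# Hypothetical rule: capital + up to three lowercase letters + dot.
title_like = re.findall(r'\b[A-Z][a-z]{0,3}\.', t)
print(title_like)                          # ['Prof.', 'Dr.', 'S.']
```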

Regarding "python - Printing abbreviations and hyphenated words", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/58178639/
