
Python re.split(): keep part of the delimiter with the first string and the rest with the second string, and so on

Reposted. Author: 行者123. Last updated: 2023-11-28 20:35:48

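I have a large block of text that I want to split into sentences. I have working code that splits the string where I want, but it removes the delimiters (which I expected). Now I would like to keep those delimiters as part of the output strings, redistributed appropriately between the neighbouring sentences.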

My example looks like this:

import re

strings = ['UT Arlington 1st - Berthiaume reached on a fielding error by ss (0-0). O. Salinas fouled out to 1b (2-1 KBB). Q. Rohrbaugh flied out to cf (2-0 BB). B. Cox fouled out to lf (2-2 KBBKF)',
'Southeast Mo. State 1st - EZELL, T. lined out to 2b (2-2 FBBKFFF). HOLST, D. flied out to lf (0-2 FK). GAGAN, T. struck out swinging (1-2 BKKS).',
'UT Arlington 3rd - J. Minjarez hit by pitch (0-0); RJ Williams advanced to second. Berthiaume popped up to 1b (0-2 KF). O. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.']

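for s in strings:
    # Separate the inning header from the play-by-play text.
    header = re.split(r'[ ][-][ ]', s)
    print(header[0])
    # The capturing group makes re.split() return each delimiter as its own list item.
    text = re.split(r'([a-z][.][ ][A-Z]|[)][.][ ][A-Z])', header[-1])
    print(text)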

Current output:

UT Arlington 1st
['Berthiaume reached on a fielding error by ss (0-0', '). O', '. Salinas fouled out to 1b (2-1 KBB', '). Q', '. Rohrbaugh flied out to cf (2-0 BB', '). B', '. Cox fouled out to lf (2-2 KBBKF)']
Southeast Mo. State 1st
['EZELL, T. lined out to 2b (2-2 FBBKFFF', '). H', 'OLST, D. flied out to lf (0-2 FK', '). G', 'AGAN, T. struck out swinging (1-2 BKKS).']
UT Arlington 3rd
['J. Minjarez hit by pitch (0-0); RJ Williams advanced to secon', 'd. B', 'erthiaume popped up to 1b (0-2 KF', '). O', '. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.']

Desired output:

UT Arlington 1st
['Berthiaume reached on a fielding error by ss (0-0)', 'O. Salinas fouled out to 1b (2-1 KBB)', 'Q. Rohrbaugh flied out to cf (2-0 BB)', 'B. Cox fouled out to lf (2-2 KBBKF)']
Southeast Mo. State 1st
['EZELL, T. lined out to 2b (2-2 FBBKFFF)', 'HOLST, D. flied out to lf (0-2 FK)', 'GAGAN, T. struck out swinging (1-2 BKKS).']
UT Arlington 3rd
['J. Minjarez hit by pitch (0-0); RJ Williams advanced to second', 'Berthiaume popped up to 1b (0-2 KF)', 'O. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.']
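One way to get from the current output to the desired output while staying with re.split() is to glue each captured delimiter back onto its neighbours. The following is only a minimal sketch and not part of the original question: the helper name split_keep_parts is hypothetical, and it assumes the pattern has exactly one capturing group, so the result alternates between text and delimiter. The first character of each delimiter goes to the preceding sentence, the last character goes to the following one, and the ". " in between is dropped.

```python
import re

# Hypothetical helper: the name and the exact pattern are assumptions for illustration.
# One capturing group, so re.split() returns [text, delim, text, delim, ..., text].
DELIM = re.compile(r'([a-z)][.][ ][A-Z])')

def split_keep_parts(text):
    parts = DELIM.split(text)
    sentences = [parts[0]]
    # parts[1::2] are the captured delimiters, parts[2::2] the text that follows each one.
    for delim, chunk in zip(parts[1::2], parts[2::2]):
        sentences[-1] += delim[0]            # e.g. ')' or 'd' closes the previous sentence
        sentences.append(delim[-1] + chunk)  # the capital letter opens the next sentence
    return sentences

print(split_keep_parts('HOLST, D. flied out to lf (0-2 FK). GAGAN, T. struck out swinging (1-2 BKKS).'))
```

Initials such as "D." or "T." are not treated as sentence ends, because the character before the period must be a lowercase letter or a closing parenthesis.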

Best answer

Besides using regular expressions, you may also want to take a look at nltk:

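# sent_tokenize() needs the Punkt tokenizer data, downloaded once,
# e.g. nltk.download('punkt'); newer NLTK releases may ask for 'punkt_tab'.
from nltk import sent_tokenize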

strings = ['UT Arlington 1st - Berthiaume reached on a fielding error by ss (0-0). O. Salinas fouled out to 1b (2-1 KBB). Q. Rohrbaugh flied out to cf (2-0 BB). B. Cox fouled out to lf (2-2 KBBKF)',
'Southeast Mo. State 1st - EZELL, T. lined out to 2b (2-2 FBBKFFF). HOLST, D. flied out to lf (0-2 FK). GAGAN, T. struck out swinging (1-2 BKKS).',
'UT Arlington 3rd - J. Minjarez hit by pitch (0-0); RJ Williams advanced to second. Berthiaume popped up to 1b (0-2 KF). O. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.']

needle = " - "
for string in strings:
    # Locate the first " - " and cut the header off with plain string slicing.
    pos = string.find(needle)
    header = string[:pos]
    text = string[pos + len(needle):]
    print(header)
    print(sent_tokenize(text))

This produces:

UT Arlington 1st
['Berthiaume reached on a fielding error by ss (0-0).', 'O. Salinas fouled out to 1b (2-1 KBB).', 'Q. Rohrbaugh flied out to cf (2-0 BB).', 'B. Cox fouled out to lf (2-2 KBBKF)']
Southeast Mo. State 1st
['EZELL, T. lined out to 2b (2-2 FBBKFFF).', 'HOLST, D. flied out to lf (0-2 FK).', 'GAGAN, T. struck out swinging (1-2 BKKS).']
UT Arlington 3rd
['J. Minjarez hit by pitch (0-0); RJ Williams advanced to second.', 'Berthiaume popped up to 1b (0-2 KF).', 'O. Salinas flied out to cf to right center (2-1 KBB); RJ Williams advanced to third.']

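The header is extracted with a string method (.find()), and the remaining text is then split into sentences with sent_tokenize(). Note that, unlike the desired output in the question, each sentence keeps its trailing period.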
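If an extra dependency is not wanted, a regex with lookarounds can produce the output the question asks for. This is a sketch under assumptions, not the answer's method: the helper name split_play_by_play and the pattern are hypothetical and tuned to the sample data. The lookbehind and lookahead check the surrounding characters without consuming them, so only the period and the whitespace are removed.

```python
import re

# Assumed pattern: a period plus whitespace is a sentence break only when it follows
# a lowercase letter or ')' and is followed by an uppercase letter.
SENTENCE_BREAK = re.compile(r'(?<=[a-z)])\.\s+(?=[A-Z])')

def split_play_by_play(line):
    # str.partition() splits on the first ' - ' only, like the .find() approach above.
    header, _, text = line.partition(' - ')
    return header, SENTENCE_BREAK.split(text)

header, sentences = split_play_by_play(
    'UT Arlington 1st - Berthiaume reached on a fielding error by ss (0-0). '
    'O. Salinas fouled out to 1b (2-1 KBB). Q. Rohrbaugh flied out to cf (2-0 BB). '
    'B. Cox fouled out to lf (2-2 KBBKF)'
)
print(header)
print(sentences)
```

The pattern is deliberately narrow. Text in which a sentence ends in a digit or a quotation mark would need a wider lookbehind, and in that case a tokenizer such as sent_tokenize() is likely the more robust choice.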

This post is based on a similar question on Stack Overflow about keeping parts of the re.split() delimiter in the neighbouring strings: https://stackoverflow.com/questions/46226623/
