
python - String to tuple of words

Reposted. Author: 太空宇宙. Updated: 2023-11-04 10:43:42

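I define a word as a sequence of letters (a to z, upper or lower case) that may also contain apostrophes. I want to split a sentence into words and remove the apostrophes from those words.

I am currently doing the following to get the words from a piece of text.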

import re
text = "Don't ' thread \r\n on \nme ''\n "
words_iter = re.finditer(r'(\w|\')+', text)
words = (word.group(0).lower() for word in words_iter)
for i in words:
    print(i)

This gives me:

don't
'
thread
on
me
''

But what I want is:

dont
thread
on
me

How can I change my code to achieve this?

Note that there is no ' in my desired output.

I would also like words to be a generator.

Best answer

This looks like a job for Regex.

import re

text = "Don't ' thread \r\n on \nme ''\n "

# Define a function so as to make a generator
def get_words(text):

    # Find each block, separated by spaces
    # (raw string so that \s reaches the regex engine unchanged)
    for section in re.finditer(r"[^\s]+", text):

        # Get the text from the selection, lowercase it
        # (`.lower()` for Python 2 or if you hate people who use Unicode)
        section = section.group().casefold()

        # Filter so only letters are kept and yield
        section = "".join(char for char in section if char.isalpha())
        if section:
            yield section

list(get_words(text))
#>>> ['dont', 'thread', 'on', 'me']
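
For comparison, here is a minimal sketch that stays closer to the code in the question: it keeps `re.finditer` and chained generator expressions, strips the apostrophes, and drops matches that end up empty. The function name `iter_words` is hypothetical, and the pattern `[A-Za-z']+` assumes that "a to Z" means ASCII letters only. Unlike `get_words`, it splits a chunk at any other character instead of deleting that character, so the two can disagree on input such as `a1b`.

```python
import re

text = "Don't ' thread \r\n on \nme ''\n "

def iter_words(text):
    """Lazily produce lowercase words with apostrophes removed (hypothetical helper)."""
    # Candidate words: runs of ASCII letters and apostrophes (assumed definition).
    matches = re.finditer(r"[A-Za-z']+", text)
    # Strip apostrophes and lowercase each match; nothing is evaluated yet.
    stripped = (m.group(0).replace("'", "").lower() for m in matches)
    # Skip matches that consisted only of apostrophes.
    return (word for word in stripped if word)

for word in iter_words(text):
    print(word)
```

Because `iter_words` returns a generator expression, words are only produced as the loop asks for them, which matches the requirement that `words` be a generator.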

Regex explanation:

[^    # An "inverse set" of characters, matches anything that isn't in the set
\s    # Any whitespace character
]+    # One or more times

So this matches any block of non-whitespace characters.
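
To make that concrete, the short sketch below lists the blocks the pattern finds in the sample text before any filtering happens. It also checks that `\S+`, the shorthand for "not whitespace", gives the same result, and that for this sample `str.split()` with no arguments does too. The variable name `chunks` is only illustrative.

```python
import re

text = "Don't ' thread \r\n on \nme ''\n "

# Every maximal run of non-whitespace characters in the text.
chunks = re.findall(r"[^\s]+", text)
print(chunks)
# ["Don't", "'", 'thread', 'on', 'me', "''"]

# \S is the built-in negation of \s, so this pattern is equivalent.
assert chunks == re.findall(r"\S+", text)

# For this sample, splitting on whitespace yields the same blocks.
assert chunks == text.split()
```

The apostrophe-only blocks are still present at this stage, which is why the answer filters each block with `isalpha()` and skips the ones that come out empty.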

Regarding python - String to tuple of words, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/18961548/
