
python - Regex to remove non-letter words (A-Z, a-z) from a list, with exceptions

Reposted. Author: 行者123. Updated: 2023-11-28 20:41:11

I am trying to remove the words that contain non-letter characters from a list of strings, for example:

["The", "sailor", "is", "sick", "."] -> ["The", "sailor", "is", "sick"]

But I can't simply drop every word that contains a non-letter character, because cases like this can occur:

["The", "U.S.", "is", "big", "."] -> ["The", "U.S.", "is", "big"] (acronym kept but period is removed)

I need to come up with a regex, or some similar approach, that handles simple cases like this one (for all kinds of punctuation):

["And", ",", "there", "she", "is", "."] -> ["And", "there", "she", "is"]

I use a natural language wrapper class to turn a sentence into the list on the left, but sometimes the list is much messier:

string:   "round up the "blonde bombshells' a all (well almost all)"
list: ["round", "up", "the", "''", "blonde", "bombshell", "\\",
"a", "all", "-lrb-", "well", "almost", "all", "-rrb-"]

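As you can see, the wrapper converts or drops some characters, such as parentheses and apostrophes. I want to get rid of all of these extraneous substrings so the result looks cleaner: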

list: ["round", "up", "the", "blonde", "bombshell",
"a", "all", "well", "almost", "all"]

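I am new to Python. My impression is that a regex is my best option, but I don't know how to turn the first list into the cleaned-up second one. Any help would be appreciated!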

Best answer

This seems to match your description:

cases=[
["The", "sailor", "is", "sick", "."],
["The", "U.S.", "is", "big", "."],
["round", "up", "the", "''", "blonde", "bombshell", "\\",
"a", "all", "-lrb-", "well", "almost", "all", "-rrb-"],
]

import re

for li in cases:
    # Keep only the tokens whose first character is an ASCII letter.
    print('{}\n\t->{}'.format(li, [w for w in li if re.search(r'^[a-zA-Z]', w)]))

Prints:

['The', 'sailor', 'is', 'sick', '.']
	->['The', 'sailor', 'is', 'sick']
['The', 'U.S.', 'is', 'big', '.']
	->['The', 'U.S.', 'is', 'big']
['round', 'up', 'the', "''", 'blonde', 'bombshell', '\\', 'a', 'all', '-lrb-', 'well', 'almost', 'all', '-rrb-']
	->['round', 'up', 'the', 'blonde', 'bombshell', 'a', 'all', 'well', 'almost', 'all']
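
If the same filter is needed in several places, the comprehension can be wrapped in a small function. This is only a minimal sketch, assuming Python 3; the names `STARTS_WITH_LETTER` and `keep_alpha_initial` are made up for illustration and are not part of the original answer.

```python
import re

# Precompiled pattern: the token must begin with an ASCII letter.
# (Hypothetical name, not from the original answer.)
STARTS_WITH_LETTER = re.compile(r'[a-zA-Z]')


def keep_alpha_initial(tokens):
    """Return only the tokens whose first character is an ASCII letter."""
    # Pattern.match() only matches at the start of the string,
    # so the explicit '^' anchor is not needed here.
    return [w for w in tokens if STARTS_WITH_LETTER.match(w)]


tokens = ["round", "up", "the", "''", "blonde", "bombshell", "\\",
          "a", "all", "-lrb-", "well", "almost", "all", "-rrb-"]
print(keep_alpha_initial(tokens))
# ['round', 'up', 'the', 'blonde', 'bombshell', 'a', 'all', 'well', 'almost', 'all']
```

Note that this rule only looks at the first character, so a token such as "U.S." is kept, while "-lrb-" and "''" are dropped.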

If that is correct, you don't need a regex at all:

for li in cases:
    # str.isalpha() also accepts non-ASCII letters; w[0] assumes no empty strings.
    print('{}\n\t->{}'.format(li, [w for w in li if w[0].isalpha()]))
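
The two versions are not quite equivalent at the edges: `w[0]` raises `IndexError` on an empty string, and `str.isalpha()` is true for any Unicode letter, not only A-Z and a-z. The sketch below shows one way to handle both cases, assuming Python 3.7+ (for `str.isascii()`); the helper name `starts_with_letter` and its `ascii_only` parameter are hypothetical and do not come from the original answer.

```python
def starts_with_letter(word, ascii_only=True):
    """Return True if word is non-empty and its first character is a letter."""
    if not word:
        # Guard: an empty string has no word[0].
        return False
    first = word[0]
    if ascii_only:
        # Mirror the regex version: only A-Z and a-z count.
        return first.isascii() and first.isalpha()
    # Otherwise accept any Unicode letter.
    return first.isalpha()


words = ["And", ",", "there", "she", "is", ".", "", "été"]
print([w for w in words if starts_with_letter(w)])
# ['And', 'there', 'she', 'is']
print([w for w in words if starts_with_letter(w, ascii_only=False)])
# ['And', 'there', 'she', 'is', 'été']
```

Which behavior is right depends on whether the input can contain empty tokens or non-English words; the question only mentions A-Z and a-z.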

Regarding "python - Regex to remove non-letter words (A-Z, a-z) from a list, with exceptions", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/34007676/
