
python - Find all lowercase words in a sentence

Reposted. Author: 太空宇宙. Updated: 2023-11-04 07:14:48

I have to find all the lowercase words in a sentence using Python. I considered using a regex as follows:

import re
re.findall(r'\b[^A-Z()\s\d]+\b', 'A word, TWO words')

It works in every case except the one I have, for example Aword. How can I fix this?

In general, the regex should handle the following cases:

Aword --> output: word
A word --> output: word
A word word --> output [word, word]
A(word) AND A pers --> output [word, pers]
AwordWOrd --> output [word, rd]
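
To make the failing case concrete, here is a minimal sketch that runs the pattern from the question against Aword and then checks the five expected outputs listed above. It assumes ASCII-only input and uses the plain character class [a-z]+ (the same pattern that appears further down this page in find_lower_ajax); the names original, simpler and cases are illustrative only.

```python
import re

# Pattern from the question. The leading \b needs a word boundary, and there
# is none between "A" and "w" in "Aword", so nothing matches there.
original = r'\b[^A-Z()\s\d]+\b'

# Simpler alternative: any run of ASCII lowercase letters.
simpler = r'[a-z]+'

# Expected outputs taken from the list in the question.
cases = {
    'Aword': ['word'],
    'A word': ['word'],
    'A word word': ['word', 'word'],
    'A(word) AND A pers': ['word', 'pers'],
    'AwordWOrd': ['word', 'rd'],
}

print('original on Aword:', re.findall(original, 'Aword'))
for text, expected in cases.items():
    found = re.findall(simpler, text)
    print(repr(text), '->', found, 'OK' if found == expected else 'MISMATCH')
```

Dropping the \b anchors is what lets a lowercase run be picked out of the middle of a mixed-case token.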

Best answer

You don't actually need a regex for this task; you can use str methods. The regex-based approach is quite fast, but str.translate can do the job even faster.

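This is the fastest solution I have found. We create a translation table (a dict) that maps every ASCII character that is not a lowercase letter to a space. Then we use str.split to split the resulting string into a list; str.split() with no arguments splits on any run of whitespace and discards empty strings, leaving only the desired words.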

from string import ascii_lowercase

# Create a translation table that maps all ASCII chars
# except lowercase letters to space
bad = bytes(set(range(128)) - set(ascii_lowercase.encode()))
table = dict.fromkeys(bad, ' ')

def find_lower(s):
    """ Translate non-lowercase chars to space """
    return s.translate(table).split()
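
As a usage sketch, the snippet below rebuilds the same table and runs find_lower on the example inputs from the question. It also illustrates one consequence of building the table from range(128): characters outside ASCII have no entry in the table, so str.translate leaves them in place. The examples list and the non-ASCII sample string are illustrative additions, not part of the original answer.

```python
from string import ascii_lowercase

# Same table as above: each ASCII code point that is not a lowercase
# letter is mapped to a space.
bad = bytes(set(range(128)) - set(ascii_lowercase.encode()))
table = dict.fromkeys(bad, ' ')

def find_lower(s):
    """ Translate non-lowercase chars to space, then split on whitespace """
    return s.translate(table).split()

# Inputs taken from the question.
examples = ['Aword', 'A word', 'A word word', 'A(word) AND A pers', 'AwordWOrd']
for text in examples:
    print(repr(text), '->', find_lower(text))

# The table has one entry per non-lowercase ASCII code point.
print('table entries:', len(table))

# Non-ASCII characters are not in the table, so they pass through unchanged.
print(find_lower('naïve Café'))
```

If the input may contain non-ASCII text, the table would need extra entries, or one of the other approaches may be a better fit.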

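Below is some test code that compares various approaches, including Ajax1234's regex solution and some suggestions from regulars in the sopython chat room, including Kevin and user3483203.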

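The test data for this code consists of strings containing datalen words, with datalen ranging from 32 to 1024. Each word consists of 8 random characters; the random word generator mostly picks lowercase letters.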

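As the timeit.Timer.repeat docs mention, the important number in these results is the minimum (the first in each list); the other numbers only indicate how variations in system load affected the results.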
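
As a small illustration of that point, the sketch below times a single function with Timer.repeat and reports the minimum. The sample data string, the loop count of 200 and the name find_lower_regex are arbitrary choices for the illustration, not values from the benchmark.

```python
import re
from timeit import Timer

# Arbitrary sample input for the illustration.
data = 'A(word) AND A pers ' * 100

def find_lower_regex(s):
    """ Return all runs of ASCII lowercase letters """
    return re.findall('[a-z]+', s)

timer = Timer(lambda: find_lower_regex(data))

# Three independent measurements of 200 calls each.
runs = timer.repeat(repeat=3, number=200)

# The minimum is the run least disturbed by other activity on the machine.
best = min(runs)
print('all runs:', ', '.join('{:.6f}'.format(t) for t in runs))
print('best    : {:.6f}'.format(best))
```

Sorting the list returned by repeat, as the script below does, puts this minimum first in each printed row.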

#! /usr/bin/env python3

""" Find all "words" of lowercase chars in a string

Speed tests, using the timeit module, of various approaches

See https://stackoverflow.com/q/51710087

Written by Ajax1234, PM 2Ring, Kevin, and user3483203
2018.08.07
"""

import re
from string import ascii_lowercase, printable
from timeit import Timer
from random import seed, choice

seed(17)

# A collection of chars with lots of lowercase
# letters to use for making random words
test_chars = 5 * ascii_lowercase + printable

def randword(n):
    """ Make a random "word" of n chars."""
    return ''.join([choice(test_chars) for _ in range(n)])

# Create a translation table that maps all ASCII chars
# except lowercase letters to space
bad = bytes(set(range(128)) - set(ascii_lowercase.encode()))
table = dict.fromkeys(bad, ' ')

def find_lower_pm2r(s, table=table):
    """ Translate non-lowercase chars to space """
    return s.translate(table).split()

def find_lower_pm2r_byte(s):
    """ Convert to bytes & test the ASCII code to see if it's in range """
    return bytes(b if 97 <= b <= 122 else 32 for b in s.encode()).decode().split()

def find_lower_ajax(s):
    """ Use a regex """
    return re.findall('[a-z]+', s)

def find_lower_kevin(s):
    """ Use the str.islower method """
    return "".join([c if c.islower() else " " for c in s]).split()

lwr = set(ascii_lowercase)
def find_lower_3483203(s, lwr=lwr):
    """ Test using a set """
    return ''.join([i if i in lwr else ' ' for i in s]).split()

functions = (
    find_lower_ajax,
    find_lower_pm2r,
    find_lower_pm2r_byte,
    find_lower_kevin,
    find_lower_3483203,
)

def verify(data, verbose=False):
    """ Check that all functions give the same results """
    if verbose:
        print('Verifying:', repr(data))
    results = []
    for func in functions:
        result = func(data)
        results.append(result)
        if verbose:
            print('{:20} : {}'.format(func.__name__, result))
    head, *tail = results
    return all(u == head for u in tail)

def time_test(loops, data):
    """ Perform the timing tests """
    timings = []
    for func in functions:
        t = Timer(lambda: func(data))
        result = sorted(t.repeat(3, loops))
        timings.append((result, func.__name__))
    timings.sort()
    for result, name in timings:
        print('{:20} : {:.6f}, {:.6f}, {:.6f}'.format(name, *result))
    print()

# Check that all functions perform correctly
datalen = 8
data = ' '.join([randword(8) for _ in range(datalen)])
print(verify(data, True), '\n')

# Time it!
loops = 1024
datalen = 32
for _ in range(6):
    data = ' '.join([randword(8) for _ in range(datalen)])
    print('loops', loops, 'len', datalen, verify(data, False))
    time_test(loops, data)
    loops //= 2
    datalen *= 2

Output

Verifying: '3c/zpws% OO8Dtcgl u;Zdm{y. dx]JTyjb pj;+ ym\t O6d.Jbg8 f\tRxrbau z`rxnkI:'
find_lower_ajax : ['c', 'zpws', 'tcgl', 'u', 'dm', 'y', 'dx', 'yjb', 'pj', 'ym', 'd', 'bg', 'f', 'xrbau', 'z', 'rxnk']
find_lower_pm2r : ['c', 'zpws', 'tcgl', 'u', 'dm', 'y', 'dx', 'yjb', 'pj', 'ym', 'd', 'bg', 'f', 'xrbau', 'z', 'rxnk']
find_lower_pm2r_byte : ['c', 'zpws', 'tcgl', 'u', 'dm', 'y', 'dx', 'yjb', 'pj', 'ym', 'd', 'bg', 'f', 'xrbau', 'z', 'rxnk']
find_lower_kevin : ['c', 'zpws', 'tcgl', 'u', 'dm', 'y', 'dx', 'yjb', 'pj', 'ym', 'd', 'bg', 'f', 'xrbau', 'z', 'rxnk']
find_lower_3483203 : ['c', 'zpws', 'tcgl', 'u', 'dm', 'y', 'dx', 'yjb', 'pj', 'ym', 'd', 'bg', 'f', 'xrbau', 'z', 'rxnk']
True

loops 1024 len 32 True
find_lower_pm2r : 0.038420, 0.075005, 0.082880
find_lower_ajax : 0.065296, 0.083511, 0.117944
find_lower_3483203 : 0.136276, 0.139128, 0.139208
find_lower_kevin : 0.225619, 0.241822, 0.250794
find_lower_pm2r_byte : 0.249634, 0.257480, 0.268771

loops 512 len 64 True
find_lower_pm2r : 0.026582, 0.026888, 0.027445
find_lower_ajax : 0.059608, 0.061116, 0.074781
find_lower_3483203 : 0.129526, 0.130411, 0.163533
find_lower_kevin : 0.217885, 0.219185, 0.219834
find_lower_pm2r_byte : 0.237033, 0.237225, 0.237880

loops 256 len 128 True
find_lower_pm2r : 0.020133, 0.020144, 0.020194
find_lower_ajax : 0.059215, 0.060153, 0.076451
find_lower_3483203 : 0.125678, 0.125989, 0.127963
find_lower_kevin : 0.215228, 0.215832, 0.218419
find_lower_pm2r_byte : 0.234180, 0.237770, 0.240791

loops 128 len 256 True
find_lower_pm2r : 0.017107, 0.017151, 0.017376
find_lower_ajax : 0.061019, 0.062389, 0.074479
find_lower_3483203 : 0.123576, 0.123802, 0.126174
find_lower_kevin : 0.212917, 0.213197, 0.214432
find_lower_pm2r_byte : 0.231248, 0.232049, 0.233519

loops 64 len 512 True
find_lower_pm2r : 0.014723, 0.014752, 0.014787
find_lower_ajax : 0.054442, 0.055595, 0.068130
find_lower_3483203 : 0.121101, 0.121847, 0.122723
find_lower_kevin : 0.210416, 0.211491, 0.211810
find_lower_pm2r_byte : 0.232548, 0.232655, 0.234670

loops 32 len 1024 True
find_lower_pm2r : 0.013886, 0.014000, 0.014106
find_lower_ajax : 0.051643, 0.052614, 0.065182
find_lower_3483203 : 0.121135, 0.121708, 0.124333
find_lower_kevin : 0.210581, 0.212073, 0.212232
find_lower_pm2r_byte : 0.245451, 0.251015, 0.252851

These results are for Python 3.6.0, on my ancient single-core 32-bit 2GHz machine running a Debian derivative of Linux. YMMV.
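
To put the two fastest approaches side by side, the sketch below divides the best find_lower_ajax time by the best find_lower_pm2r time for each data length, using the minimum timings copied from the output above. Because both functions are timed with the same loop count and data within a row, the ratio is a like-for-like comparison. On these numbers the str.translate version goes from roughly 1.7 times faster at 32 words to roughly 3.7 times faster at 1024 words; other machines and Python versions may give different ratios.

```python
# Minimum timings copied from the output above:
# datalen -> (find_lower_pm2r, find_lower_ajax)
best_times = {
    32: (0.038420, 0.065296),
    64: (0.026582, 0.059608),
    128: (0.020133, 0.059215),
    256: (0.017107, 0.061019),
    512: (0.014723, 0.054442),
    1024: (0.013886, 0.051643),
}

# How many times faster the translate version is than the regex version.
ratios = [ajax / pm2r for pm2r, ajax in best_times.values()]

for datalen, ratio in zip(best_times, ratios):
    print('len {:4} : translate is {:.2f}x faster than regex'.format(datalen, ratio))
```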


user3483203 added some Pandas and matplotlib code to generate graphs from the timeit results.

[Figure: graph of timeit results]

Regarding "python - Find all lowercase words in a sentence", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/51710087/
