
python - How can I rewrite a simple tokenizer to use regular expressions?

Reposted. Author: 行者123. Updated: 2023-12-01 05:56:57

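This is an optimized version of a tokenizer I originally wrote, and it works fairly well. A secondary tokenizer can parse this function's output to create more specifically classified tokens.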

def tokenize(source):
    return (token for token in (token.strip() for line
            in source.replace('\r\n', '\n').replace('\r', '\n').split('\n')
            for token in line.split('#', 1)[0].split(';')) if token)
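The "secondary tokenizer" mentioned above is not shown in the post. As a rough sketch only, a second pass over each statement could classify its pieces with named groups in a single alternation. The token classes (`NUMBER`, `NAME`, `OP`, `SKIP`) and the names `TOKEN_SPEC`, `MASTER`, and `classify` are assumptions made for illustration, not part of the original code.

```python
import re

# Hypothetical token classes; the original post does not define them.
TOKEN_SPEC = [
    ('NUMBER', r'\d+'),
    ('NAME',   r'[A-Za-z_]\w*'),
    ('OP',     r'[-+*/=()]'),
    ('SKIP',   r'\s+'),
]
# One alternation of named groups; match.lastgroup reports which one matched.
MASTER = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC))

def classify(statement):
    """Yield (kind, text) pairs for one statement, e.g. one item from tokenize()."""
    for match in MASTER.finditer(statement):
        if match.lastgroup != 'SKIP':
            yield match.lastgroup, match.group()

print(list(classify('b = a - 3')))
```

Note that `finditer` silently skips characters that match none of the classes, so a real second pass would probably want an explicit error case.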

My question: how can this be written simply using the re module? Below is my unsuccessful attempt.

import re

def tokenize2(string):
    search = re.compile(r'^(.+?)(?:;(.+?))*?(?:#.+)?$', re.MULTILINE)
    for match in search.finditer(string):
        for item in match.groups():
            yield item
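One likely reason the attempt falls short, shown as a small check rather than a full diagnosis: in Python's `re`, a capturing group inside a repeated group keeps only the text from its last repetition, so statements in the middle of a line are dropped. The input string `'a; b; c'` is just an illustrative sample.

```python
import re

# The pattern from the attempt above, unchanged.
search = re.compile(r'^(.+?)(?:;(.+?))*?(?:#.+)?$', re.MULTILINE)

match = search.match('a; b; c')
groups = match.groups()
print(groups)  # ('a', ' c') : the middle statement ' b' is lost
```

The groups are also not stripped, and a line without any `;` yields `None` for the second group.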

Edit: this is the kind of output I am looking for from the tokenizer. Parsing the text should be easy.

>>> def tokenize(source):
        return (token for token in (token.strip() for line
                in source.replace('\r\n', '\n').replace('\r', '\n').split('\n')
                for token in line.split('#', 1)[0].split(';')) if token)

>>> for token in tokenize('''\
a = 1 + 2; b = a - 3 # create zero in b
c = b * 4; d = 5 / c # trigger div error

e = (6 + 7) * 8
# try a boolean operation
f = 0 and 1 or 2
a; b; c; e; f'''):
        print(repr(token))

'a = 1 + 2'
'b = a - 3'
'c = b * 4'
'd = 5 / c'
'e = (6 + 7) * 8'
'f = 0 and 1 or 2'
'a'
'b'
'c'
'e'
'f'
>>>

Best answer

I may be way off here:

>>> import re
>>> def tokenize(source):
...     search = re.compile(r'^(.+?)(?:;(.+?))*?(?:#.+)?$', re.MULTILINE)
...     return (token.strip() for line in source.split('\n') if search.match(line)
...             for token in line.split('#', 1)[0].split(';') if token)
...
>>> for token in tokenize('''\
... a = 1 + 2; b = a - 3 # create zero in b
... c = b * 4; d = 5 / c # trigger div error
...
... e = (6 + 7) * 8
... # try a boolean operation
... f = 0 and 1 or 2
... a; b; c; e; f'''):
...     print(repr(token))
...
'a = 1 + 2'
'b = a - 3'
'c = b * 4'
'd = 5 / c'
'e = (6 + 7) * 8'
'f = 0 and 1 or 2'
'a'
'b'
'c'
'e'
'f'
>>>

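If applicable, I would keep the re.compile call outside the scope of the def, so that the pattern is compiled once rather than on every call.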
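For comparison, here is one way the whole job might be done with `re` alone, with the patterns compiled at module level as suggested. This is a sketch under assumptions, not the answerer's method: the names `COMMENT`, `SEPARATOR`, and `tokenize_re` are made up for illustration, and the approach (delete comments, then split on separators) is a swapped-in technique. It is meant to behave like the question's original `tokenize`, including `\r\n` and bare `\r` line endings.

```python
import re

# Compiled once at module level (hypothetical names).
COMMENT = re.compile(r'#[^\r\n]*')     # '#' up to, but not including, the line ending
SEPARATOR = re.compile(r'[;\r\n]+')    # runs of ';' and any kind of line ending

def tokenize_re(source):
    """Yield stripped, non-empty statements from source."""
    no_comments = COMMENT.sub('', source)
    pieces = (piece.strip() for piece in SEPARATOR.split(no_comments))
    return (piece for piece in pieces if piece)

sample = 'a = 1 + 2; b = a - 3 # create zero in b\r\n# comment only\n\nd; ;e'
print(list(tokenize_re(sample)))
```

Stripping before filtering means a whitespace-only piece such as the one in `'d; ;e'` is discarded instead of being yielded as an empty string.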

Regarding "python - How can I rewrite a simple tokenizer to use regular expressions?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/11993606/
