
python - Fully tokenizing a sentence, including punctuation, contractions, and hyphenated words

Reposted. Author: 行者123. Updated: 2023-12-04 03:38:56

I want to fully tokenize this sentence: "The element with the longest half-life isn't Uranium-234" said the professor.

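I want this output:

['"', 'The', 'element', 'with', 'the', 'longest', 'half-life', "isn't", 'Uranium-234', '"', 'said', 'the', 'professor', '.']

Here every punctuation mark is split off as its own token, but contractions such as "isn't" and "doesn't" stay as a single token. Hyphenated words are also treated as a single token, which is what I want.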

This is what I am currently using to tokenize it (s holds the sentence above):

import re

p = re.compile(r"\w+(?:'\w+)?|[^\w\s]")
p.findall(s)

This gives me the output:

['"', 'The', 'element', 'with', 'the', 'longest', 'half', '-', 'life', "isn't", 'Uranium', '-', '234', '"', 'said', 'the', 'professor', '.']

With this pattern, hyphenated words are not kept together as a single token.

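Best answer

Use the ['-] character class so that a hyphen, like an apostrophe, can join two word parts. You also forgot the underscore: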

\w+(?:['-]\w+)?|[^\w\s]|_

See the proof.

Explanation
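One detail may be worth checking before relying on the trailing `|_` alternative: in Python's `re`, `\w` already matches the underscore, so `\w+` consumes underscores before `|_` is ever tried. The sketch below compares the answer's pattern with and without that alternative on a few made-up sample strings. The third pattern, which swaps `\w` for `[^\W_]` in the word part, is an assumption about what "handling the underscore" could mean (splitting it off as its own token); it is not part of the original answer.

```python
import re

# Pattern from the answer, with and without the trailing "|_" alternative.
with_underscore = re.compile(r"\w+(?:['-]\w+)?|[^\w\s]|_")
without_underscore = re.compile(r"\w+(?:['-]\w+)?|[^\w\s]")

# Assumed variant (not from the answer): exclude "_" from word characters
# so that the "|_" alternative can actually match it as a separate token.
split_underscore = re.compile(r"[^\W_]+(?:['-][^\W_]+)?|[^\w\s]|_")

samples = ["snake_case", "_", "a _ b", "__init__."]
for sample in samples:
    print(sample, with_underscore.findall(sample), split_underscore.findall(sample))

# True when the "|_" alternative makes no difference on these samples.
same = all(
    with_underscore.findall(x) == without_underscore.findall(x) for x in samples
)
print(same)
```

On these samples the first two patterns return identical tokens, while the assumed variant splits `snake_case` into `snake`, `_`, `case`. Which behavior is wanted depends on whether underscores count as punctuation in your text.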

--------------------------------------------------------------------------------
  \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    ['-]                     any character of: ''', '-'
--------------------------------------------------------------------------------
    \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )?                       end of grouping
--------------------------------------------------------------------------------
 |                        OR
--------------------------------------------------------------------------------
  [^\w\s]                  any character except: word characters
                           (a-z, A-Z, 0-9, _), whitespace (\n, \r,
                           \t, \f, and " ")
--------------------------------------------------------------------------------
 |                        OR
--------------------------------------------------------------------------------
  _                        '_'
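As the breakdown above shows, the non-capturing group is optional, so it can match at most once. A word with more than one internal hyphen or apostrophe is therefore split after the first joined part. A minimal sketch, assuming such words should stay whole: replace the `?` quantifier with `*`. The names `ONE_JOINER` and `ANY_JOINERS` and the sample sentence are illustrative only and do not come from the original answer.

```python
import re

# Pattern from the answer: the optional group allows at most one joiner.
ONE_JOINER = re.compile(r"\w+(?:['-]\w+)?|[^\w\s]|_")

# Assumed variant (not in the original answer): "*" allows any number of
# apostrophe- or hyphen-joined parts inside a single token.
ANY_JOINERS = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]|_")

text = "My mother-in-law isn't here."
one = ONE_JOINER.findall(text)
many = ANY_JOINERS.findall(text)

print(one)   # 'mother-in', '-', 'law' come out as three tokens
print(many)  # 'mother-in-law' stays a single token
```

For the sentence in the question both patterns give the same tokens, since none of its words has more than one joiner.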

Python code:

import re
regex = r"\w+(?:['-]\w+)?|[^\w\s]|_"
test_str = "\"The element with the longest half-life is Uranium-234\" said the professor."
print(re.findall(regex, test_str))

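Result: ['"', 'The', 'element', 'with', 'the', 'longest', 'half-life', 'is', 'Uranium-234', '"', 'said', 'the', 'professor', '.']

Regarding python - fully tokenizing a sentence, including punctuation, contractions, and hyphenated words, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/66414047/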
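The test string in the answer uses "is", whereas the output requested in the question contains the contraction "isn't". The sketch below wraps the same pattern in a small helper and runs it on a sentence with the contraction, to check that apostrophes and hyphens are both handled. The helper name `tokenize` and the constant `TOKEN_RE` are hypothetical, chosen for illustration.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
TOKEN_RE = re.compile(r"\w+(?:['-]\w+)?|[^\w\s]|_")


def tokenize(sentence):
    """Return word tokens (with one optional internal ' or -) and
    single punctuation characters, in order of appearance."""
    return TOKEN_RE.findall(sentence)


sentence = '"The element with the longest half-life isn\'t Uranium-234" said the professor.'
tokens = tokenize(sentence)
print(tokens)
```

With this input, `isn't`, `half-life`, and `Uranium-234` each come back as one token, and the quotation marks and the final period are separate tokens.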
