
python - Splitting words using the nltk module in Python

Reposted. Author: 太空宇宙. Updated: 2023-11-04 08:45:44

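I am trying to find a way to split words in Python using the nltk module. The raw data I have is a list of tokenized words, and I am not sure how to reach my goal from there. For example: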

['usingvariousmolecularbiology', 'techniques', 'toproduce', 'genotypes', 'following', 'standardoperatingprocedures', '.', 'Operateandmaintainautomatedequipment', '.', 'Updatesampletrackingsystemsandprocess', 'documentation', 'toallowaccurate', 'monitoring', 'andrapid', 'progression', 'ofcasework']

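As you can see, many words are stuck together (e.g. "to" and "produce" are stuck in one string, "toproduce"). This is an artifact of scraping the data from a PDF file. I would like to find a way, using the nltk module in Python, to split the stuck-together words (e.g. split "toproduce" into two words, "to" and "produce", and split "standardoperatingprocedures" into three words, "standard", "operating", "procedures").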

Thanks for any help!

Best answer

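I believe you will want to use word segmentation in this case, and I am not aware of any word segmentation feature in NLTK that handles English sentences without spaces. You could use pyenchant instead. I offer the following code only by way of example. (It works for a modest number of relatively short strings, such as the strings in your example list, but it would be highly inefficient for longer strings or a larger number of strings.) It would need modification, and it will not successfully segment every string in any case.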

import enchant  # pip install pyenchant
eng_dict = enchant.Dict("en_US")

def segment_str(chars, exclude=None):
    """
    Segment a string of chars using the pyenchant vocabulary.
    Keeps longest possible words that account for all characters,
    and returns list of segmented words.

    :param chars: (str) The character string to segment.
    :param exclude: (set) A set of strings to exclude from consideration.
                    (These have been found previously to lead to dead ends.)
                    If an excluded word occurs later in the string, this
                    function will fail.
    """
    words = []

    if not chars.isalpha():  # don't check punctuation etc.; needs more work
        return [chars]

    if not exclude:
        exclude = set()

    working_chars = chars
    while working_chars:
        # iterate through segments of the chars starting with the longest segment possible
        # (the range stops at 2, so single-character segments are never tried)
        for i in range(len(working_chars), 1, -1):
            segment = working_chars[:i]
            if eng_dict.check(segment) and segment not in exclude:
                words.append(segment)
                working_chars = working_chars[i:]
                break
        else:  # no matching segments were found
            if words:
                # the last accepted word led to a dead end: exclude it and start over
                exclude.add(words[-1])
                return segment_str(chars, exclude=exclude)
            # let the user know a word was missing from the dictionary,
            # but keep the word
            print('"{chars}" not in dictionary (so just keeping as one segment)!'
                  .format(chars=chars))
            return [chars]
    # return a list of words based on the segmentation
    return words
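The restart-with-exclusions strategy above repeats a lot of work, which is part of why it gets slow on longer strings. As a rough sketch of one alternative (not part of the original answer), the same longest-first search can be written with backtracking and memoization, so that a position already known to be a dead end is never explored twice. The names segment_memo, VOCAB, is_word and min_len are hypothetical, and the small VOCAB set is only a stand-in so the sketch runs without pyenchant. In practice is_word could be something like eng_dict.check.

```python
from functools import lru_cache

# Hypothetical stand-in vocabulary (an assumption for illustration only);
# a real run could pass eng_dict.check from pyenchant as is_word instead.
VOCAB = {"to", "top", "produce", "standard", "operating", "procedures",
         "and", "rapid"}


def segment_memo(chars, is_word=VOCAB.__contains__, min_len=2):
    """Return one segmentation of chars into known words, or None if none exists."""

    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(chars):
            return ()  # reached the end of the string: nothing left to split
        # try the longest candidate first, down to min_len characters
        for end in range(len(chars), start + min_len - 1, -1):
            piece = chars[start:end]
            if is_word(piece):
                rest = solve(end)
                if rest is not None:
                    return (piece,) + rest
        return None  # dead end from this position; cached, so never retried

    result = solve(0)
    return list(result) if result is not None else None


print(segment_memo("toproduce"))
print(segment_memo("standardoperatingprocedures"))
```

With this toy vocabulary, "top" is tried before "to" but leads to a dead end ("roduce" cannot be segmented), so the search backs up and settles on "to" followed by "produce". Unlike segment_str, a word rejected at one position can still be used later in the string.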

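As you can see, this approach (presumably) mis-segments only one of your strings ('Updatesampletrackingsystemsandprocess' comes out starting with 'Updates', 'ample' rather than 'Update', 'sample'):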

>>> t = ['usingvariousmolecularbiology', 'techniques', 'toproduce', 'genotypes', 'following', 'standardoperatingprocedures', '.', 'Operateandmaintainautomatedequipment', '.', 'Updatesampletrackingsystemsandprocess', 'documentation', 'toallowaccurate', 'monitoring', 'andrapid', 'progression', 'ofcasework']
>>> [segment_str(chars) for chars in t]
"genotypes" not in dictionary (so just keeping as one segment)!
[['using', 'various', 'molecular', 'biology'], ['techniques'], ['to', 'produce'], ['genotypes'], ['following'], ['standard', 'operating', 'procedures'], ['.'], ['Operate', 'and', 'maintain', 'automated', 'equipment'], ['.'], ['Updates', 'ample', 'tracking', 'systems', 'and', 'process'], ['documentation'], ['to', 'allow', 'accurate'], ['monitoring'], ['and', 'rapid'], ['progression'], ['of', 'casework']]
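The 'Updates', 'ample' result is a consequence of always taking the longest matching prefix: both "updates ample" and "update sample" are made of valid dictionary words, so a dictionary check alone cannot tell them apart. The sketch below, which is an illustration and not part of the original answer, lists every segmentation of a string under a small made-up vocabulary. TOY_VOCAB and all_segmentations are hypothetical names, and the vocabulary is not pyenchant's word list.

```python
# Toy vocabulary chosen only for illustration (an assumption).
TOY_VOCAB = {"update", "updates", "sample", "ample", "tracking"}


def all_segmentations(chars, vocab=TOY_VOCAB):
    """Yield every way to split chars into words taken from vocab."""
    if not chars:
        yield []  # one way to segment the empty string: no words
        return
    for i in range(len(chars), 0, -1):  # longest prefix first
        head = chars[:i]
        if head in vocab:
            for tail in all_segmentations(chars[i:], vocab):
                yield [head] + tail


candidates = list(all_segmentations("updatesampletracking"))
for words in candidates:
    print(words)
# prints ['updates', 'ample', 'tracking'] and then ['update', 'sample', 'tracking']
```

Choosing between such candidates would probably need extra information, such as word frequencies or the surrounding context, which a plain spell-check dictionary does not provide.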

You can then use chain to flatten this list of lists:

>>> from itertools import chain
>>> list(chain.from_iterable(segment_str(chars) for chars in t))
"genotypes" not in dictionary (so just keeping as one segment)!
['using', 'various', 'molecular', 'biology', 'techniques', 'to', 'produce', 'genotypes', 'following', 'standard', 'operating', 'procedures', '.', 'Operate', 'and', 'maintain', 'automated', 'equipment', '.', 'Updates', 'ample', 'tracking', 'systems', 'and', 'process', 'documentation', 'to', 'allow', 'accurate', 'monitoring', 'and', 'rapid', 'progression', 'of', 'casework']
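The isalpha() check in segment_str returns any token containing punctuation or digits unchanged, so a token like 'toproduce,' would not be segmented at all. One possible way to handle that, sketched here under assumptions and not taken from the original answer, is to split each token into runs of ASCII letters and runs of everything else, segment only the letter runs, and flatten the pieces with chain. The names split_runs and segment_tokens are hypothetical, and the segment parameter is a placeholder for a function such as segment_str. Its default returns the run unchanged so the sketch runs without pyenchant.

```python
import re
from itertools import chain


def split_runs(token):
    """Split a token into alternating runs of ASCII letters and other characters."""
    return re.findall(r"[A-Za-z]+|[^A-Za-z]+", token)


def segment_tokens(tokens, segment=lambda run: [run]):
    """Apply segment to each alphabetic run and flatten everything into one list.

    segment stands in for a function like segment_str (an assumption);
    the default leaves each run as a single word.
    """
    return list(chain.from_iterable(
        segment(run) if run.isalpha() else [run]
        for token in tokens
        for run in split_runs(token)
    ))


print(split_runs("toproduce,"))
print(segment_tokens(["toproduce,", "andrapid."]))
```

With a real segmenter this could be called as segment_tokens(t, segment=segment_str), so the punctuation is kept as separate tokens while the letter runs are still split into words.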

Regarding "python - Splitting words using the nltk module in Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40826165/
