python - Tokenize concatenated characters based on a given dictionary


I want to tokenize concatenated characters based on a given dictionary and output the tokenized words that are found. For example, I have the following:

dictionary = ['yak', 'kin', 'yakkin', 'khai', 'koo']
chars = 'yakkinpadthaikhaikoo'

The output should look like this:

[('yakkin', (0, 6), 6), ('padthai', (6, 13), 7), ('khai', (13, 17), 4), ('koo', (17, 20), 3)]

I want to get a list of tuples as output. The first element of each tuple is the word found in the dictionary, the second is the character offsets, and the third is the length of the word found. Characters that cannot be found in the dictionary are combined into a single word, like padthai in the example above. If more than one dictionary word matches, we pick the longest one (yakkin instead of yak).
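
As a quick illustration of the longest-match rule (my own sketch, not part of the question): among all dictionary words that are a prefix of the remaining characters, the longest one should win.

dictionary = ['yak', 'kin', 'yakkin', 'khai', 'koo']
chars = 'yakkinpadthaikhaikoo'

# every dictionary word that is a prefix of the remaining characters
prefixes = [w for w in dictionary if chars.startswith(w)]  # ['yak', 'yakkin']
# the longest such prefix is the token we want at this position
print(max(prefixes, key=len))  # prints 'yakkin'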

My current implementation is below. It starts at index 0 and then loops through the characters (it doesn't work yet).

import numpy as np

def tokenize(chars, dictionary):
    n_chars = len(chars)
    start = 0
    char_found = []
    words = []
    for _ in range(int(n_chars / 3)):
        # collect every dictionary word that starts at the current offset
        for r in range(1, n_chars + 1):
            if chars[start:(start + r)] in dictionary:
                char_found.append((chars[start:(start + r)],
                                   (start, start + r),
                                   len(chars[start:start + r])))
        # keep the match that ends furthest to the right and move past it
        id_offset = np.argmax([t[1][1] for t in char_found])
        start = char_found[id_offset][2]
        if char_found[id_offset] not in words:
            words.append(char_found[id_offset])
    return words

tokenize(chars, dictionary)  # gives only [('yakkin', (0, 6), 6)]

I've been racking my brain over this problem. Comments and suggestions are very welcome!

Best answer

It looks a bit ugly, but it works:

def tokenize(string, dictionary):
    # sorting dictionary words by length
    # because we need to find the longest word if it's possible
    # like "yakkin" instead of "yak"
    sorted_dictionary = sorted(dictionary,
                               key=lambda word: len(word),
                               reverse=True)
    start = 0
    tokens = []
    while start < len(string):
        substring = string[start:]
        try:
            word = next(word
                        for word in sorted_dictionary
                        if substring.startswith(word))
            offset = len(word)
        except StopIteration:
            # no words from dictionary were found
            # at the beginning of substring,
            # looking for the next appearance of dictionary words
            words_indexes = [substring.find(word)
                             for word in sorted_dictionary]
            # if word is not found, "str.find" method returns -1
            appeared_words_indexes = filter(lambda index: index > 0,
                                            words_indexes)
            try:
                offset = min(appeared_words_indexes)
            except ValueError:
                # an empty sequence was passed to "min" function
                # because there are no words from dictionary in substring
                offset = len(substring)
            word = substring[:offset]
        token = word, (start, start + offset), offset
        tokens.append(token)
        start += offset
    return tokens

This gives the output:

>>> tokenize('yakkinpadthaikhaikoo', dictionary)
[('yakkin', (0, 6), 6),
('padthai', (6, 13), 7),
('khai', (13, 17), 4),
('koo', (17, 20), 3)]
>>> tokenize('lolyakhaiyakkinpadthaikhaikoolol', dictionary)
[('lol', (0, 3), 3),
('yak', (3, 6), 3),
('hai', (6, 9), 3),
('yakkin', (9, 15), 6),
('padthai', (15, 22), 7),
('khai', (22, 26), 4),
('koo', (26, 29), 3),
('lol', (29, 32), 3)]
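
For comparison, here is a shorter sketch of the same greedy idea built on the re module (my own variant, not the answer above): join the dictionary into one alternation pattern, longest words first so that yakkin beats yak, let re.finditer locate the dictionary words, and turn the gaps between matches into unknown tokens such as padthai.

import re

def tokenize_re(string, dictionary):
    # longest words first, so the alternation prefers 'yakkin' over 'yak'
    pattern = '|'.join(re.escape(word)
                       for word in sorted(dictionary, key=len, reverse=True))
    tokens = []
    last_end = 0
    for match in re.finditer(pattern, string):
        if match.start() > last_end:
            # unknown characters between two dictionary words
            tokens.append((string[last_end:match.start()],
                           (last_end, match.start()),
                           match.start() - last_end))
        tokens.append((match.group(),
                       (match.start(), match.end()),
                       match.end() - match.start()))
        last_end = match.end()
    if last_end < len(string):
        # trailing unknown characters
        tokens.append((string[last_end:],
                       (last_end, len(string)),
                       len(string) - last_end))
    return tokens

On the two inputs above this returns the same token lists; compiling the dictionary into a single pattern also avoids re-scanning the substring word by word at every step.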

For python - tokenizing concatenated characters based on a given dictionary, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43670873/
