
python - n-grams from text in python


This is an update to my previous post, with some changes:
Say that I have 100 tweets.
In those tweets, I need to extract: 1) food names and 2) beverage names. For each extraction I also need to attach its type (drink or food) and an id number (each item has a unique id).
I already have a lexicon of the names, types and id numbers:

lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}

Example tweets:
After various processing of "tweet_1", I have the following sentences:
sentences = [
    'dr pepper is better than coca cola and suits banana split with ice cream',
    'coca cola and banana is not a good combo']

My requested output (it can be of another type than a list):
["tweet_id_1",
[[["dr pepper"], ["drink", "d_124"]],
[["coca cola"], ["drink", "d_234"]],
[["banana split"], ["food", "f_567"]],
[["ice cream"], ["food", "f_789"]]],

"tweet_id_1",,
[[["coca cola"], ["drink", "d_234"]],
[["banana"], ["food", "f_456"]]]]

Importantly, the output should NOT extract unigrams that are part of larger ngrams (n>1), i.e. not like this:
["tweet_id_1",
[[["dr pepper"], ["drink", "d_124"]],
[["coca cola"], ["drink", "d_234"]],
[["cola"], ["drink", "d_345"]],
[["banana split"], ["food", "f_567"]],
[["banana"], ["food", "f_456"]],
[["ice cream"], ["food", "f_789"]],
[["cream"], ["food", "f_678"]]],

"tweet_id_1",
[[["coca cola"], ["drink", "d_234"]],
[["cola"], ["drink", "d_345"]],
[["banana"], ["food", "f_456"]]]]

Ideally, I would like to be able to run my sentences through various nltk filters such as lemmatize() and pos_tag() BEFORE the extraction, to obtain an output like the one below. But with that regexp solution, if I do so, all the words are split into unigrams, or a string like "coca cola" generates 1 unigram plus 1 bigram, which produces output I don't want (as in the example above).
Ideal output (again, the type of the output is not important):
["tweet_id_1",
[[[("dr pepper", "NN")], ["drink", "d_124"]],
[[("coca cola", "NN")], ["drink", "d_234"]],
[[("banana split", "NN")], ["food", "f_567"]],
[[("ice cream", "NN")], ["food", "f_789"]]],

"tweet_id_1",
[[[("coca cola", "NN")], ["drink", "d_234"]],
[[("banana", "NN")], ["food", "f_456"]]]]

Best Answer

Probably not the most efficient solution, but this will definitely get you started -

sentences = [
    'dr pepper is better than coca cola and suits banana split with ice cream',
    'coca cola and banana is not a good combo']

lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}

# Sort phrases by word count, longest first, so that e.g. 'banana split'
# is matched (and removed) before 'banana' can match inside it.
lexicon_list = list(lexicon.keys())
lexicon_list.sort(key=lambda s: len(s.split()), reverse=True)

chunks = []

for sentence in sentences:
    for lex in lexicon_list:
        if lex in sentence:
            chunks.append({lex: list(lexicon[lex].values())})
            # Remove the matched phrase so its sub-ngrams cannot match again.
            sentence = sentence.replace(lex, '')

print(chunks)

Output
[{'dr pepper': ['drink', 'd_123']}, {'coca cola': ['drink', 'd_234']}, {'banana split': ['food', 'f_567']}, {'ice cream': ['food', 'f_789']}, {'coca cola': ['drink', 'd_234']}, {'banana': ['food', 'f_456']}]
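
One caveat, as a side note to this answer: the plain lex in sentence substring test will also match inside longer words, e.g. 'cola' inside 'colada'. A minimal word-boundary variant using the standard re module:

import re

def contains_phrase(phrase, sentence):
    # \b word boundaries stop 'cola' from matching inside 'colada'.
    return re.search(r'\b' + re.escape(phrase) + r'\b', sentence) is not None

print(contains_phrase('cola', 'a pina colada please'))  # False
print(contains_phrase('cola', 'i love cola'))           # True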

Explanation
lexicon_list = list(lexicon.keys()) gets the list of phrases that need to be searched for, and they are sorted by word count so that the larger chunks are found first.
The output is a list of dicts, where each dict has list values.
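
To get closer to the nested per-tweet format requested in the question, the same idea can be wrapped per sentence. A minimal sketch reusing sentences and lexicon_list from above; the tweet ids are illustrative placeholders, since the sentences themselves carry no ids:

results = []

# 'tweet_id_1' / 'tweet_id_2' are placeholders; pair ids with sentences
# however your real data provides them.
for tweet_id, sentence in zip(['tweet_id_1', 'tweet_id_2'], sentences):
    found = []
    for lex in lexicon_list:  # longest phrases first, as above
        if lex in sentence:
            found.append([[lex], [lexicon[lex]['type'], lexicon[lex]['id']]])
            sentence = sentence.replace(lex, '')  # block sub-ngrams of this match
    results.extend([tweet_id, found])

print(results)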

Regarding python - n-grams from text in python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49091931/
