
python - How to find the most common words without removing duplicates?

Reposted. Author: 行者123. Updated: 2023-12-04 03:30:27

I have a list that looks like this:

group = [
    #Group 1 ('aaaa bbbb' the most common words = two words)
    ['aaaa bbbb nnnn',   #<-- row 1
     'aaaa bbbb oooo',   #<-- row 2
     'aaaa bbbb pppp'],  #<-- row 3

    #Group 2 ('hello' the most common word = one word)
    ['hello Jack T.',    #<-- row 1
     'hello Ramona D.',  #<-- row 2
     'hello Robert G.'], #<-- row 3

    #Group 3 ('yes! go go' the most common words = the whole string)
    ['yes! go go',       #<-- row 1
     'yes! go go',       #<-- row 2
     'yes! go go',       #<-- row 3
     'yes! go go'],      #<-- row 4

    #Group 4 (only one word = it's an invalid group)
    ['python'],          #<-- row 1

    #Group 5 (only one word = it's an invalid group)
    ['java']             #<-- row 1
]

I need to find the most common words for each group and save them to a new list.

Like this:

OUT : ['aaaa','hello','yes! go go']

However, the third group contains a repeated word ('go go') and I need both occurrences; the result I actually get is:

OUT : ['aaaa','hello','yes! go']

Here is my working code:

import collections

#Try to count words for each group
for groups in group:
    #how many rows are in this group?
    nGroup = len(groups)
    #join the rows and split them into single words
    words = " ".join(groups).split()
    print("WORDS", words)

I get:

WORDS ['aaaa', 'bbbb', 'nnnn', 'aaaa', 'bbbb', 'oooo', 'aaaa', 'bbbb', 'pppp']
WORDS ['hello', 'Jack', 'T.', 'hello', 'Ramona', 'D.', 'hello', 'Robert', 'G.']
WORDS ['java']
WORDS ['python']
WORDS ['yes!', 'go', 'go', 'yes!', 'go', 'go', 'yes!', 'go', 'go', 'yes!', 'go', 'go']
    #count how often each word occurs in the group
    rows = collections.Counter(words)
    #all words, ordered from most to least common
    wCommon = rows.most_common()
    #how often does the most common word occur?
    mCommon = rows.most_common(1)[0][1]
    print(f"wCommon :{wCommon} rows :{rows} mCommon :{mCommon}")

I get:

#Group 1
wCommon :[('aaaa', 3), ('bbbb', 3),
('nnnn', 1), ('oooo', 1),
('pppp', 1)]
rows :Counter({'aaaa': 3, 'bbbb': 3,
'nnnn': 1, 'oooo': 1,
'pppp': 1})
mCommon :3


#Group 2
wCommon :[('hello', 3), ('Jack', 1), ('T.', 1),
('Ramona', 1), ('D.', 1),
('Robert', 1), ('G.', 1)]
rows:Counter({'hello': 3, 'Jack': 1, 'T.': 1,
'Ramona': 1, 'D.': 1,
'Robert': 1, 'G.': 1})
mCommon:3


#Group 3
wCommon :[('java', 1)] rows:Counter({'java': 1}) mCommon:1
#Group 4
wCommon :[('python', 1)] rows:Counter({'python': 1}) mCommon:1
#Group 5
wCommon :[('go', 8), ('yes!', 4)] rows:Counter({'go': 8, 'yes!': 4})
mCommon:8
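
The count of 8 for 'go' comes from joining the rows before counting: the word occurs twice in each of the four rows. As a minimal sketch (the names `group_rows`, `total` and `per_row` are only illustrative, and it assumes a common word occurs equally often in every row), floor-dividing each total by the number of rows gives an estimate of how often the word occurs per row:

```python
import collections

group_rows = ['yes! go go', 'yes! go go', 'yes! go go', 'yes! go go']

# Joining the rows first adds up every occurrence across the whole group.
total = collections.Counter(" ".join(group_rows).split())

# Floor-divide by the number of rows to estimate occurrences per row.
# Assumption: each common word appears equally often in every row.
per_row = {word: count // len(group_rows) for word, count in total.items()}

print(total)    # Counter({'go': 8, 'yes!': 4})
print(per_row)  # {'yes!': 1, 'go': 2}
```

A per-row value of 2 for 'go' indicates that the duplicate should be kept rather than collapsed into a single word.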

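Below is the original list, though it can change. I am trying to split it into groups and work out the common words for the rows of each group, for example: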

aaaa, hello , yes! go go

But sometimes there is more than one common word, such as 'aaaa bbbb'. How can I get them? And with a repeated word such as 'go', my code does not work.

list_1 = [
    "aaaa bbbb nnnn",
    "aaaa bbbb oooo",
    "aaaa bbbb pppp",
    "hello Ramona D.",
    "hello Jack T.",
    "hello Robert G.",
    "yes! go go",
    "yes! go go",
    "yes! go go",
    "yes! go go",
    "python",
    "java"
]
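
The question does not show how the flat list is split into groups. One possible sketch, assuming that adjacent rows belong to the same group when they start with the same word (the helper `first_word` is hypothetical, and rows are assumed to be non-empty), uses `itertools.groupby`:

```python
import itertools

list_1 = [
    "aaaa bbbb nnnn",
    "aaaa bbbb oooo",
    "aaaa bbbb pppp",
    "hello Ramona D.",
    "hello Jack T.",
    "hello Robert G.",
    "yes! go go",
    "yes! go go",
    "yes! go go",
    "yes! go go",
    "python",
    "java",
]


def first_word(row):
    # Hypothetical grouping key: the first whitespace-separated word.
    return row.split()[0]


# groupby only merges adjacent rows with equal keys, so related rows
# must already sit next to each other, as they do in list_1.
grouped = [list(rows) for _, rows in itertools.groupby(list_1, key=first_word)]

for rows in grouped:
    print(rows)
```

For this input the sketch yields five groups of 3, 3, 4, 1 and 1 rows, the same shape as the nested `group` list above. If related rows were not adjacent, the list would need to be sorted by the same key first.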

Edit: thank you, everyone.

Best answer

You can simply check whether several words share the same number of occurrences and whether they appear consecutively:

import collections

groups = [
    #Group 1 ('aaaa bbbb' the most common words = two words)
    [
        'aaaa bbbb nnnn',  #<-- row 1
        'aaaa bbbb oooo',  #<-- row 2
        'aaaa bbbb pppp'   #<-- row 3
    ],
    #Group 2 (one word 'aaaa' or 'bbbb', let's take the first)
    ['aaaa nnnn bbbb', 'aaaa oooo bbbb', 'aaaa pppp bbbb'],
    #Group 3 (two words 'oooo bbbb')
    ['aaa1 oooo bbbb', 'aaa2 oooo bbbb', 'aaa3 oooo bbbb'],

    #Group 4 ('hello' the most common word = one word)
    [
        'hello Jack T.',    #<-- row 1
        'hello Ramona D.',  #<-- row 2
        'hello Robert G.'   #<-- row 3
    ],

    #Group 5 ('yes! go go' the most common words = the whole string)
    [
        'yes! go go',  #<-- row 1
        'yes! go go',  #<-- row 2
        'yes! go go',  #<-- row 3
        'yes! go go'   #<-- row 4
    ],

    #Group 6 (only one word = it's an invalid group)
    ['python'],  #<-- row 1

    #Group 7 (only one word = it's an invalid group)
    ['java'],
    [
        "yu yu hakusho co dell'altro mondo", "yu yu hakusho re dell'inferno jr",
        'yu yu hakusho un amico per la pelle'
    ],
    [
        "yu yu yu hakusho co dell'altro mondo",
        "yu yu hakusho re dell'inferno jr yu yu",
        'yu yu yu hakusho un amico per la pelle'
    ]
]


def mostCommon(group):
    # skip invalid groups (fewer than two rows)
    if len(group) < 2:
        return None

    # all rows are identical: the whole string is the result
    if len(set(group)) == 1:
        return group[0]

    words = " ".join(group).split()
    counts = collections.Counter(words)
    _maxCounts = max(counts.values())

    # normalize the counts, in case a word occurs more than once per row
    _maxItems = []
    for k, v in counts.items():
        if v >= len(group) or v >= _maxCounts:
            _maxItems.extend([k] * (v // len(group)))

    # One word appears most often.
    if len(_maxItems) == 1:
        return _maxItems[0]

    # Multiple words have the same max. occurrences; do they appear consecutively?
    # Look them up in reverse, starting with the longest combination.
    _combinations = [_maxItems[:x] for x in range(1, len(_maxItems) + 1)]
    for combo in _combinations[::-1]:
        phrase = ' '.join(combo)
        if len(set(item for item in group if phrase in item)) == len(group):
            return phrase


for i, group in enumerate(groups):
    result = mostCommon(group)
    print(f"Group {i+1}: {result}")

Output:

Group 1: aaaa bbbb
Group 2: aaaa
Group 3: oooo bbbb
Group 4: hello
Group 5: yes! go go
Group 6: None
Group 7: None
Group 8: yu yu hakusho
Group 9: yu yu
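
As a point of comparison, and not the approach used in the answer above, the words shared by every row can also be obtained with the `&` operator of `collections.Counter`, which keeps the minimum count of each word. The sketch below is only an illustration under stated assumptions: the helper name `common_words` is hypothetical, the word order is taken from the first row, and the check for consecutive words is left out.

```python
import collections
import functools
import operator


def common_words(rows):
    """Return the words shared by every row, keeping repeats (hypothetical helper)."""
    if len(rows) < 2:
        return None  # a single row is treated as an invalid group

    counters = [collections.Counter(row.split()) for row in rows]
    # Counter & Counter keeps the minimum count of each word,
    # so a word repeated in every row stays repeated.
    shared = functools.reduce(operator.and_, counters)

    # Rebuild the phrase in the word order of the first row.
    kept = []
    for word in rows[0].split():
        if shared[word] > 0:
            kept.append(word)
            shared[word] -= 1
    return " ".join(kept) or None


print(common_words(['aaaa bbbb nnnn', 'aaaa bbbb oooo', 'aaaa bbbb pppp']))  # aaaa bbbb
print(common_words(['yes! go go'] * 4))                                       # yes! go go
print(common_words(['python']))                                               # None
```

Because this sketch ignores adjacency, it differs from the answer for rows such as 'aaaa nnnn bbbb', where it returns 'aaaa bbbb' and the answer returns 'aaaa'. Which behaviour is preferable depends on whether the common words must form a contiguous phrase.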

Regarding "python - How to find the most common words without removing duplicates?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/66971715/
