
python - Extract items from n-line blocks in a file and count item frequencies for each block, Python

Reposted. Author: 行者123. Updated: 2023-12-01 06:08:05

I have a text file that consists of 5-line blocks of tab-separated lines:

1 \t DESCRIPTION \t SENTENCE \t ITEMS
1 \t DESCRIPTION \t SENTENCE \t ITEMS
1 \t DESCRIPTION \t SENTENCE \t ITEMS
1 \t DESCRIPTION \t SENTENCE \t ITEMS
1 \t DESCRIPTION \t SENTENCE \t ITEMS
2 \t DESCRIPTION \t SENTENCE \t ITEMS
2 \t DESCRIPTION \t SENTENCE \t ITEMS
2 \t DESCRIPTION \t SENTENCE \t ITEMS
2 \t DESCRIPTION \t SENTENCE \t ITEMS
2 \t DESCRIPTION \t SENTENCE \t ITEMS

and so on.

Within each block, the DESCRIPTION and SENTENCE columns are identical. The data of interest is in the ITEMS column, which differs for every line in the block and has the following format:

word1, word2, word3

...and so on.

For each 5-line block, I need to count how often word1, word2, etc. occur in ITEMS. For example, if the first 5-line block is

1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3
1 \t DESCRIPTION \t SENTENCE \t word1, word2
1 \t DESCRIPTION \t SENTENCE \t word4
1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3
1 \t DESCRIPTION \t SENTENCE \t word1, word2

then the correct output for this 5-line block would be

1, SENTENCE, (word1: 4, word2: 4, word3: 2, word4: 1)

That is, the block number, followed by the sentence, followed by the frequency count of each word.

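I have some code that extracts the five-line blocks, and some code that counts the word frequencies in a block once it has been extracted, but I am stuck on the task of isolating each block, getting its word frequencies, moving on to the next block, and so on.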

from itertools import groupby

def GetFrequencies(file):
    file_contents = open(file).readlines()  # file as list
    """use zip to get the entire file as list of 5-line chunk tuples"""
    five_line_increments = zip(*[iter(file_contents)]*5)
    for chunk in five_line_increments:  # for each 5-line chunk...
        for sentence in chunk:  # ...and for each sentence in that chunk
            words = sentence.split('\t')[3].split()  # get the ITEMS column at index 3
            words_no_comma = [x.strip(',') for x in words]  # get rid of the commas
            words_no_ws = [x.strip(' ') for x in words_no_comma]  # get rid of the whitespace resulting from the removed commas

        """STUCK HERE The idea originally was to take the words lists for
        each chunk and combine them to create a big list, 'collection,' and
        feed this into the for-loop below."""

        # collection is a big list containing all of the words in the ITEMS
        # section of the chunk, e.g. ['word1', 'word2', 'word3', 'word1', 'word1', 'word2', etc.]
        for key, group in groupby(collection):
            print key, len(list(group)),
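
One caveat about the groupby loop above: itertools.groupby only merges runs of consecutive equal items, so an unsorted word list would yield the same key several times. The sketch below (written for Python 3) shows the difference. The `collection` list and the `count_runs` helper are made up for illustration and are not part of the original code.

```python
from itertools import groupby

# Hypothetical word list for one block (illustrative data only)
collection = ['word1', 'word2', 'word3', 'word1', 'word2', 'word4']

def count_runs(words):
    """Return (key, run_length) pairs exactly as groupby yields them."""
    return [(key, len(list(group))) for key, group in groupby(words)]

# Unsorted input: no two neighbours are equal, so every word is its own run
unsorted_counts = count_runs(collection)

# Sorted input: equal words are adjacent, so each key appears exactly once
sorted_counts = count_runs(sorted(collection))

print(unsorted_counts)
print(sorted_counts)
```

Sorting the list before grouping gives one count per word. collections.Counter gives the same counts without needing a sort.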

Best answer

Using Python 2.7

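#!/usr/bin/env python

import collections

chunks = {}

with open('input') as fd:
    for line in fd:
        # split() breaks on any whitespace, so this assumes DESCRIPTION
        # and SENTENCE are single tokens, as in the sample data
        line = line.split()
        if not line:
            continue  # skip blank lines
        if line[0] not in chunks:
            # first line of a block: store the sentence at index 0
            chunks[line[0]] = [line[2]]
        # collect the items from every line of the block, including the first
        for i in line[3:]:
            chunks[line[0]].append(i.replace(',', ''))

for k, v in chunks.iteritems():
    counter = collections.Counter(v[1:])
    print k, v[0], counter

Output:

1 SENTENCE Counter({'word1': 4, 'word2': 4, 'word3': 2, 'word4': 1})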
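
If DESCRIPTION or SENTENCE can contain spaces, splitting on tabs rather than on arbitrary whitespace is probably safer. The following is a sketch in Python 3 (not the 2.7 used above) that groups consecutive lines by block number and prints the format requested in the question. The names `block_frequencies` and `format_block` and the sample lines are assumptions for illustration. It also assumes every non-blank line has all four tab-separated columns.

```python
from collections import Counter
from itertools import groupby

def block_frequencies(lines):
    """Yield (block_id, sentence, Counter) for each run of lines sharing a block id.

    Assumes tab-separated columns: id, description, sentence, items.
    """
    rows = [
        [field.strip() for field in line.split('\t')]
        for line in lines
        if line.strip()  # skip blank lines
    ]
    # Lines of one block are consecutive, so groupby on the id column is enough
    for block_id, group in groupby(rows, key=lambda row: row[0]):
        group = list(group)
        counts = Counter()
        for row in group:
            counts.update(w.strip() for w in row[3].split(',') if w.strip())
        yield block_id, group[0][2], counts

def format_block(block_id, sentence, counts):
    """Render one block as 'id, sentence, (word: n, ...)'."""
    # Sort by descending count, then alphabetically, for a stable order
    pairs = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    body = ', '.join('{}: {}'.format(word, n) for word, n in pairs)
    return '{}, {}, ({})'.format(block_id, sentence, body)

# Illustrative input modelled on the example block in the question
sample = [
    '1\tDESCRIPTION\tSENTENCE\tword1, word2, word3\n',
    '1\tDESCRIPTION\tSENTENCE\tword1, word2\n',
    '1\tDESCRIPTION\tSENTENCE\tword4\n',
    '1\tDESCRIPTION\tSENTENCE\tword1, word2, word3\n',
    '1\tDESCRIPTION\tSENTENCE\tword1, word2\n',
    '2\tDESCRIPTION\tSENTENCE\tword5\n',
]

results = [format_block(*block) for block in block_frequencies(sample)]
for result in results:
    print(result)
```

Because the grouping is keyed on the block number rather than on a fixed count of five lines, this sketch also handles blocks of other sizes, as long as the lines of each block stay adjacent in the file.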

Regarding "python - Extract items from n-line blocks in a file and count item frequencies for each block, Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/7162215/
