
python - Grouping an iterable by a predicate in Python

Reposted. Author: 太空狗. Last updated: 2023-10-29 20:18:58

I am parsing a file like this:

--header--
data1
data2
--header--
data3
data4
data5
--header--
--header--
...

And I want groups like this:

[ [header, data1, data2], [header, data3, data4, data5], [header], [header], ... ]

so that I can iterate over them like this:

for grp in group(open('file.txt'), lambda line: 'header' in line):
    for item in grp:
        process(item)

and keep the detect-a-group logic separate from the process-a-group logic.

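But I need an iterable of iterables, because the groups can be arbitrarily large and I don't want to store them. That is, I want to split an iterable into subgroups every time I encounter a "sentinel" or "header" item, as indicated by a predicate. This seems like a common task, but I can't find an efficient, Pythonic implementation.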

Here is the dumb append-to-a-list implementation:

def group(iterable, isstart=lambda x: x):
    """Group `iterable` into groups starting with items where `isstart(item)` is true.

    Start items are included in the group. The first group may or may not have a
    start item. An empty `iterable` results in an empty result (zero groups)."""
    items = []
    for item in iterable:
        if isstart(item) and items:
            yield iter(items)
            items = []
        items.append(item)
    if items:
        yield iter(items)
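The behaviour described in the docstring can be checked on a small in-memory sample. This is only a sketch: the function is restated so the snippet runs on its own, and `is_header`, `sample`, and the result names are hypothetical names introduced here for illustration.

```python
def group(iterable, isstart=lambda x: x):
    """List-based grouping, restated from the question."""
    items = []
    for item in iterable:
        if isstart(item) and items:
            yield iter(items)
            items = []
        items.append(item)
    if items:
        yield iter(items)


def is_header(line):
    # Hypothetical predicate, mirroring the lambda used in the question.
    return "header" in line


sample = ["--header--", "data1", "data2", "--header--", "data3",
          "--header--", "--header--"]

# Adjacent headers end up in separate groups.
with_headers = [list(g) for g in group(sample, is_header)]

# The first group may lack a start item.
no_leading_header = [list(g) for g in group(["data0", "--header--", "data1"], is_header)]

# An empty input produces zero groups.
empty = list(group([], is_header))

print(with_headers)
print(no_leading_header)
print(empty)
```

Each group is still built as a full list before it is yielded, which is the memory cost the question wants to avoid.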

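It feels like there has to be a nice itertools version, but it eludes me. The "obvious" (?!) groupby solution doesn't seem to work, because there can be adjacent headers and they need to go in separate groups. The best I can come up with is (ab)using groupby with a key function that keeps a counter: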

import itertools

def igroup(iterable, isstart=lambda x: x):
    def keyfunc(item):
        if isstart(item):
            keyfunc.groupnum += 1  # Python 2's closures leave something to be desired
        return keyfunc.groupnum
    keyfunc.groupnum = 0
    return (group for _, group in itertools.groupby(iterable, keyfunc))
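To make the adjacent-header problem concrete, here is a minimal sketch comparing two keys for `groupby`: the predicate itself and a counter. It assumes Python 3, where `nonlocal` replaces the function-attribute trick; `naive_group` and `igroup_counter` are hypothetical names used only here.

```python
from itertools import groupby


def naive_group(iterable, isstart):
    """Hypothetical helper: use the predicate itself as the groupby key."""
    return [list(grp) for _, grp in groupby(iterable, isstart)]


def igroup_counter(iterable, isstart):
    """Counter-based key, as in the question, written with `nonlocal`."""
    groupnum = 0

    def keyfunc(item):
        nonlocal groupnum
        if isstart(item):
            groupnum += 1  # every start item opens a new key, even adjacent ones
        return groupnum

    return (grp for _, grp in groupby(iterable, keyfunc))


def is_header(line):
    return "header" in line


lines = ["--header--", "data1", "--header--", "--header--", "data2"]

# The predicate key separates headers from their data and merges adjacent headers.
naive = naive_group(lines, is_header)
print(naive)
# [['--header--'], ['data1'], ['--header--', '--header--'], ['data2']]

# The counter key keeps each header with its data and splits adjacent headers.
counted = [list(grp) for grp in igroup_counter(lines, is_header)]
print(counted)
# [['--header--', 'data1'], ['--header--'], ['--header--', 'data2']]
```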

But I feel like Python can do better, and sadly this is even slower than the dumb list version:

# ipython
%time deque(group(xrange(10 ** 7), lambda x: x % 1000 == 0), maxlen=0)
CPU times: user 4.20 s, sys: 0.03 s, total: 4.23 s
%time deque(igroup(xrange(10 ** 7), lambda x: x % 1000 == 0), maxlen=0)
CPU times: user 5.45 s, sys: 0.01 s, total: 5.46 s

To make it easy on you, here's some unit test code:

import itertools
import random
import unittest

class Test(unittest.TestCase):
    def test_group(self):
        MAXINT, MAXLEN, NUMTRIALS = 100, 100000, 21
        isstart = lambda x: x == 0
        self.assertEqual(next(igroup([], isstart), None), None)
        self.assertEqual([list(grp) for grp in igroup([0] * 3, isstart)], [[0]] * 3)
        self.assertEqual([list(grp) for grp in igroup([1] * 3, isstart)], [[1] * 3])
        self.assertEqual(len(list(igroup([0,1,2] * 3, isstart))), 3)  # Catch hangs when groups are not consumed
        for _ in xrange(NUMTRIALS):
            expected, items = itertools.tee(itertools.starmap(random.randint, itertools.repeat((0, MAXINT), random.randint(0, MAXLEN))))
            for grpnum, grp in enumerate(igroup(items, isstart)):
                start = next(grp)
                self.assertTrue(isstart(start) or grpnum == 0)
                self.assertEqual(start, next(expected))
                for item in grp:
                    self.assertFalse(isstart(item))
                    self.assertEqual(item, next(expected))

So: how can I subgroup an iterable by a predicate elegantly and efficiently in Python?

Best answer

how can I subgroup an iterable by a predicate elegantly and efficiently in Python?

Here is a concise, memory-efficient implementation that is very similar to the one from your question:

from itertools import groupby, imap
from operator import itemgetter

def igroup(iterable, isstart):
    def key(item, count=[False]):
        if isstart(item):
            count[0] = not count[0]  # start new group
        return count[0]
    return imap(itemgetter(1), groupby(iterable, key))

It supports infinite groups.
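A rough Python 3 port of the same idea is sketched below, assuming only that `imap` is replaced by the built-in lazy `map`. The toggle works because `groupby` only compares neighbouring keys, so flipping a boolean on every start item is enough to separate consecutive groups. The snippet also probes the "infinite" claim in both readings (one endless group, and endlessly many groups); `endless`, `head`, and `first_three` are hypothetical names.

```python
from itertools import count, groupby, islice
from operator import itemgetter


def igroup(iterable, isstart):
    flag = [False]  # flipped on every start item; groupby only compares neighbours

    def key(item):
        if isstart(item):
            flag[0] = not flag[0]  # start new group
        return flag[0]

    return map(itemgetter(1), groupby(iterable, key))


# One group that never ends: a single start item followed by endless data.
endless = next(igroup(count(), lambda x: x == 0))
head = list(islice(endless, 5))
print(head)  # [0, 1, 2, 3, 4]

# Infinitely many groups: take only the first three.
first_three = [list(g) for g in islice(igroup(count(), lambda x: x % 3 == 0), 3)]
print(first_three)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```

Nothing is buffered beyond the current item, so memory use does not grow with group size.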

A tee-based solution is slightly faster, but it consumes memory for the current group (similar to the list-based solution from the question):

from itertools import islice, tee

def group(iterable, isstart):
    it, it2 = tee(iterable)
    count = 0
    for item in it:
        if isstart(item) and count:
            gr = islice(it2, count)
            yield gr
            for _ in gr:  # skip to the next group
                pass
            count = 0
        count += 1
    if count:
        gr = islice(it2, count)
        yield gr
        for _ in gr:  # skip to the next group
            pass

The groupby-based solution can be implemented in pure Python:

def igroup_inline_key(iterable, isstart):
    it = iter(iterable)

    def grouper():
        """Yield items from a single group."""
        while not p[START]:
            yield p[VALUE]  # each group has at least one element (a header)
            p[VALUE] = next(it)
            p[START] = isstart(p[VALUE])

    p = [None]*2  # workaround the absence of `nonlocal` keyword in Python 2.x
    START, VALUE = 0, 1
    p[VALUE] = next(it)
    while True:
        p[START] = False  # to distinguish EOF and a start of new group
        yield grouper()
        while not p[START]:  # skip to the next group
            p[VALUE] = next(it)
            p[START] = isstart(p[VALUE])

To avoid repeating code, the while True loop could be written as:

    while True:
        p[START] = False  # to distinguish EOF and a start of new group
        g = grouper()
        yield g
        if not p[START]:  # skip to the next group
            for _ in g:
                pass
            if not p[START]:  # EOF
                break

though the previous variant might be more explicit and readable.
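The pure-Python variants above rely on Python 2 generator semantics, where a `StopIteration` escaping from `next(it)` silently ends the generator. Since Python 3.7 (PEP 479) that exception is turned into a `RuntimeError`, so a Python 3 version needs an explicit end-of-input check. The following is one possible sketch, not the answer's own code: `igroup_py3` and the `_END` sentinel are hypothetical names, and `nonlocal` replaces the `p` list workaround.

```python
def igroup_py3(iterable, isstart):
    """Lazily yield one generator per group; assumes Python 3."""
    it = iter(iterable)
    _END = object()  # hypothetical sentinel marking exhausted input
    value = next(it, _END)
    start = False

    def grouper():
        """Yield items from a single group."""
        nonlocal value, start
        while not start and value is not _END:
            yield value
            value = next(it, _END)
            start = value is not _END and isstart(value)

    while value is not _END:
        start = False  # the pending item belongs to the group about to be yielded
        g = grouper()
        yield g
        for _ in g:  # skip whatever the consumer left unread
            pass


def is_zero(x):
    return x == 0


groups = [list(g) for g in igroup_py3([0, 1, 2, 0, 0, 3], is_zero)]
print(groups)  # [[0, 1, 2], [0], [0, 3]]
```

As with `groupby`, a group is only valid until the outer iterator advances; reading an old group afterwards yields nothing.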

I don't think a general, memory-efficient solution in pure Python will be significantly faster than the groupby-based one.

If process(item) is fast compared to igroup(), and a header can be found efficiently in a string (e.g., for a fixed static header), then you could improve performance by reading your file in large chunks and splitting on the header value. It should make your task IO-bound.
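One way the chunked approach could look is sketched below, assuming the header is a fixed string and that holding one group's text in memory is acceptable (unlike the lazy versions above). `split_on_header`, `chunk_size`, and the sample text are hypothetical; the function yields the same pieces as `text.split(header)`, so the first piece is whatever precedes the first header (an empty string if the file starts with one).

```python
import io


def split_on_header(fileobj, header, chunk_size=1 << 16):
    """Yield the pieces of the file's text between occurrences of `header`."""
    buf = ""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        # Everything except the last piece is complete; the last piece may
        # continue (or end in a partial header), so keep it in the buffer.
        *complete, buf = buf.split(header)
        for piece in complete:
            yield piece
    yield buf  # whatever follows the final header


text = "--header--\ndata1\ndata2\n--header--\ndata3\n--header--\n--header--\n"
header = "--header--\n"

# A tiny chunk size forces headers to straddle chunk boundaries.
pieces = list(split_on_header(io.StringIO(text), header, chunk_size=7))
print(pieces)  # ['', 'data1\ndata2\n', 'data3\n', '', '']
```

Re-splitting the buffer on every chunk is simple but repeats work when a single group spans many chunks, so treat this as an illustration of the idea, not a tuned implementation.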

Regarding python - Grouping an iterable by a predicate in Python, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/12775449/
