
python - A better way to split a sequence into overlapping chunks?

Reposted. Author: 太空宇宙. Updated: 2023-11-04 07:56:12

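I need a function that splits an iterable into chunks, with an optional overlap between consecutive chunks.

I wrote the code below. It gives the correct output, but it is very inefficient (slow), and I can't figure out how to speed it up. Is there a better way?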

def split_overlap(seq, size, overlap):
    '''(seq,int,int) => [[...],[...],...]
    Split a sequence into chunks of a specific size and overlap.
    Works also on strings!

    Examples:
    >>> split_overlap(seq=list(range(10)),size=3,overlap=2)
    [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6], [5, 6, 7], [6, 7, 8], [7, 8, 9]]

    >>> split_overlap(seq=range(10),size=3,overlap=2)
    [range(0, 3), range(1, 4), range(2, 5), range(3, 6), range(4, 7), range(5, 8), range(6, 9), range(7, 10)]

    >>> split_overlap(seq=list(range(10)),size=7,overlap=2)
    [[0, 1, 2, 3, 4, 5, 6], [5, 6, 7, 8, 9]]
    '''
    if size < 1 or overlap < 0:
        raise ValueError('"size" must be an integer with >= 1 while "overlap" must be >= 0')
    result = []
    while True:
        if len(seq) <= size:
            result.append(seq)
            return result
        else:
            result.append(seq[:size])
            seq = seq[size-overlap:]

Test results so far:

l = list(range(10))
s = 4
o = 2
print(split_overlap(l,s,o))
print(list(split_overlap_jdehesa(l,s,o)))
print(list(nwise_overlap(l,s,o)))
print(list(split_overlap_Moinuddin(l,s,o)))
print(list(gen_split_overlap(l,s,o)))
print(list(itr_split_overlap(l,s,o)))

[[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
[(0, 1, 2, 3), (2, 3, 4, 5), (4, 5, 6, 7), (6, 7, 8, 9)]
[(0, 1, 2, 3), (2, 3, 4, 5), (4, 5, 6, 7), (6, 7, 8, 9), (8, 9, None, None)] #wrong
[[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9], [8, 9]] #wrong
[[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
[(0, 1, 2, 3), (2, 3, 4, 5), (4, 5, 6, 7), (6, 7, 8, 9)]

%%timeit
split_overlap(l,7,2)
718 ns ± 2.36 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%%timeit
list(split_overlap_jdehesa(l,7,2))
4.02 µs ± 64.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
list(nwise_overlap(l,7,2))
5.05 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
list(split_overlap_Moinuddin(l,7,2))
3.89 µs ± 78.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
list(gen_split_overlap(l,7,2))
1.22 µs ± 13.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%%timeit
list(itr_split_overlap(l,7,2))
3.41 µs ± 36.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

With a longer list as input:

l = list(range(100000))

%%timeit
split_overlap(l,7,2)
4.27 s ± 132 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
list(split_overlap_jdehesa(l,7,2))
31.1 ms ± 495 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
list(nwise_overlap(l,7,2))
5.74 ms ± 66 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
list(split_overlap_Moinuddin(l,7,2))
16.9 ms ± 89.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
list(gen_split_overlap(l,7,2))
4.54 ms ± 22.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
list(itr_split_overlap(l,7,2))
19.1 ms ± 240 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

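From other tests (not reported here), it turns out that for small lists (len(list) <= 100) my original implementation split_overlap() is the fastest. For anything larger than that, gen_split_overlap() is by far the most efficient solution.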
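A likely explanation for the gap on long lists is that `seq = seq[size-overlap:]` in split_overlap() copies the entire remaining tail on every iteration, so the total copying grows roughly quadratically with the input length, while slicing by index copies at most `size` items per chunk. The sketch below only counts copied elements rather than timing anything. The two helper names are hypothetical, and it assumes `overlap < size` so that the loops terminate.

```python
def copied_by_tail_slicing(n, size, overlap):
    """Count elements copied by the split_overlap() loop for a list of length n."""
    if not 0 <= overlap < size:
        raise ValueError('this sketch assumes 0 <= overlap < size')
    copied = 0
    remaining = n
    while remaining > size:
        copied += size                # seq[:size] copies one chunk
        remaining -= size - overlap
        copied += remaining           # seq[size-overlap:] copies the whole tail
    return copied


def copied_by_index_slicing(n, size, overlap):
    """Count elements copied when each chunk is taken as seq[i:i + size]."""
    if not 0 <= overlap < size:
        raise ValueError('this sketch assumes 0 <= overlap < size')
    copied = 0
    for i in range(0, n - overlap, size - overlap):
        copied += min(size, n - i)    # the last chunk may be shorter
    return copied


for n in (10, 100000):
    print(n, copied_by_tail_slicing(n, 7, 2), copied_by_index_slicing(n, 7, 2))
```

For a 10-element list both approaches copy the same handful of items, which is consistent with the plain loop winning on small inputs where generator overhead dominates.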

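Best answer

Sometimes readability counts more than speed. A simple generator that iterates over indices and yields slices gets the job done in reasonable time: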

def gen_split_overlap(seq, size, overlap):
    if size < 1 or overlap < 0:
        raise ValueError('size must be >= 1 and overlap >= 0')

    for i in range(0, len(seq) - overlap, size - overlap):
        yield seq[i:i + size]
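As a quick check of how the index-based generator behaves, the sketch below repeats it and runs it on a list and on a string. It also adds a stricter wrapper, `gen_split_overlap_strict`, which is a hypothetical name and not part of the answer: with `overlap == size` the `range()` step is zero and Python raises its own ValueError, and with `overlap > size` the step is negative and nothing is yielded, so rejecting `overlap >= size` up front may be preferable.

```python
def gen_split_overlap(seq, size, overlap):
    if size < 1 or overlap < 0:
        raise ValueError('size must be >= 1 and overlap >= 0')

    for i in range(0, len(seq) - overlap, size - overlap):
        yield seq[i:i + size]


def gen_split_overlap_strict(seq, size, overlap):
    # Hypothetical variant: additionally require overlap < size so the
    # step passed to range() is always positive.
    if size < 1 or not 0 <= overlap < size:
        raise ValueError('size must be >= 1 and 0 <= overlap < size')
    yield from gen_split_overlap(seq, size, overlap)


print(list(gen_split_overlap(list(range(10)), 4, 2)))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
print(list(gen_split_overlap('abcdefgh', 3, 1)))
# ['abc', 'cde', 'efg', 'gh']
```

Because both functions are generators, the validation only runs once the first item is requested, for example when the result is passed to list().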

If you want to handle potentially infinite iterables, you just have to keep the last `overlap` items from the previous yield and slice `size - overlap` new items:

from itertools import islice

def itr_split_overlap(iterable, size, overlap):
    itr = iter(iterable)

    # initial slice, in case size exhausts iterable on the spot
    next_ = tuple(islice(itr, size))
    yield next_
    # overlap for initial iteration
    prev = next_[-overlap:] if overlap else ()

    # For long lists the repeated calls to a lambda are slow, but using
    # the 2-argument form of `iter()` is in general a nice trick.
    #for chunk in iter(lambda: tuple(islice(itr, size - overlap)), ()):

    while True:
        chunk = tuple(islice(itr, size - overlap))

        if not chunk:
            break

        next_ = (*prev, *chunk)
        yield next_

        # overlap == 0 is a special case
        if overlap:
            prev = next_[-overlap:]
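To illustrate the infinite-iterable case, the sketch below feeds `itertools.count()` into the generator and takes only the first three chunks. For variety it is written with the two-argument `iter()` form that the answer leaves commented out (the answer notes that form is slower on long lists). The name `itr_split_overlap_sentinel` is made up here to keep it distinct from the answer's function.

```python
from itertools import count, islice


def itr_split_overlap_sentinel(iterable, size, overlap):
    itr = iter(iterable)

    # initial chunk of up to `size` items
    next_ = tuple(islice(itr, size))
    yield next_
    prev = next_[-overlap:] if overlap else ()

    # iter(callable, sentinel) keeps calling the lambda until it returns
    # the sentinel, here the empty tuple that marks an exhausted iterator.
    for chunk in iter(lambda: tuple(islice(itr, size - overlap)), ()):
        next_ = (*prev, *chunk)
        yield next_
        if overlap:
            prev = next_[-overlap:]


# count() never ends, so only the first three chunks are requested.
first_three = list(islice(itr_split_overlap_sentinel(count(), 4, 2), 3))
print(first_three)
# [(0, 1, 2, 3), (2, 3, 4, 5), (4, 5, 6, 7)]
```

Only `overlap` items are retained between yields, so memory use stays bounded no matter how long the input runs.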

Regarding "python - A better way to split a sequence into overlapping chunks?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/48381870/
