
python - Most efficient string buffering

Reposted. Author: 太空宇宙. Updated: 2023-11-04 04:25:59

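In my current project I ran into a requirement for buffering a sequence of unicode symbols with the lowest possible time overhead. The basic operations on such a buffer are: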

  • Read its value as a unicode string
  • Append a symbol to the tail of the buffer
  • Flush the buffer

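So I tested several approaches to find the one with the lowest time overhead, but I am still not sure I have found the fastest. I tried the following (listed from most efficient to least efficient):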

  1. A list of symbols
  2. An io.StringIO object
  3. Plain string storage
  4. A preallocated array.array

Can anyone give me a hint about a better way to tackle this? The project interpreter is CPython 2.7. The MCVE I tested is:

# -*- coding: utf-8 -*-

import timeit
import io
import array
import abc


class BaseBuffer:
    """A base abstract class for all buffers below"""
    __metaclass__ = abc.ABCMeta

    def __init__(self):
        pass

    def clear(self):
        # Return the buffered value, then reset storage by re-running __init__.
        old_val = self.value()
        self.__init__()
        return old_val

    @abc.abstractmethod
    def value(self):
        return self

    @abc.abstractmethod
    def write(self, symbol):
        pass


class ListBuffer(BaseBuffer):
    """Use lists as a storage"""
    def __init__(self):
        BaseBuffer.__init__(self)
        self.__io = []

    def value(self):
        return u"".join(self.__io)

    def write(self, symbol):
        self.__io.append(symbol)


class StringBuffer(BaseBuffer):
    """Simply append to the stored string. Obviously inefficient due to string immutability"""
    def __init__(self):
        BaseBuffer.__init__(self)
        self.__io = u""

    def value(self):
        return self.__io

    def write(self, symbol):
        self.__io += symbol


class StringIoBuffer(BaseBuffer):
    """Use the io.StringIO object"""
    def __init__(self):
        BaseBuffer.__init__(self)
        self.__io = io.StringIO()

    def value(self):
        return self.__io.getvalue()

    def write(self, symbol):
        self.__io.write(symbol)


class ArrayBuffer(BaseBuffer):
    """Preallocate an array"""
    def __init__(self):
        BaseBuffer.__init__(self)
        self.__io = array.array("u", (u"\u0000" for _ in xrange(1000000)))
        self.__caret = 0

    def clear(self):
        # Keep the preallocated array; only rewind the write position.
        val = self.value()
        self.__caret = 0
        return val

    def value(self):
        return u"".join(self.__io[n] for n in xrange(self.__caret))

    def write(self, symbol):
        self.__io[self.__caret] = symbol
        self.__caret += 1


def time_test():
    # Test distinct buffer data length
    for i in xrange(1000):
        for j in xrange(i):
            buffer_object.write(unicode(i % 10))
        buffer_object.clear()


if __name__ == '__main__':

    number_of_runs = 10
    for buffer_object in (ListBuffer(), StringIoBuffer(), StringBuffer(), ArrayBuffer()):
        print("Class {klass}: {elapsed:.2f}s per {number_of_runs} runs".format(
            klass=buffer_object.__class__.__name__,
            elapsed=timeit.timeit(stmt=time_test, number=number_of_runs),
            number_of_runs=number_of_runs,
        ))

...and the results from this run were:

Class ListBuffer: 1.88s per 10 runs
Class StringIoBuffer: 2.04s per 10 runs
Class StringBuffer: 2.40s per 10 runs
Class ArrayBuffer: 3.10s per 10 runs

Best answer

I tried a few alternatives (see below), but I could not beat the ListBuffer implementation. Here is what I tried:

Array without preallocation

class ArrayBufferNoPreallocate(BaseBuffer):
    """array buffer"""
    def __init__(self):
        BaseBuffer.__init__(self)
        self.__io = array.array("u")

    def value(self):
        return self.__io.tounicode()

    def write(self, symbol):
        self.__io.append(symbol)

NumPy

import numpy as np  # needed for the np.* calls below


class NumpyBuffer(BaseBuffer):
    """numpy array with pre-allocation"""
    def __init__(self):
        BaseBuffer.__init__(self)
        self.__io = np.zeros((1000000,), dtype=np.unicode_)
        self.__cursor = 0

    def clear(self):
        # Keep the preallocated array; only rewind the write position.
        val = self.value()
        self.__cursor = 0
        return val

    def value(self):
        return np.char.join(u"", (self.__io[i] for i in xrange(self.__cursor)))

    def write(self, symbol):
        self.__io[self.__cursor] = symbol
        self.__cursor += 1

Results

Class ListBuffer: 3.40s per 10 runs
Class StringIoBuffer: 4.44s per 10 runs
Class StringBuffer: 4.58s per 10 runs
Class ArrayBuffer: 4.65s per 10 runs
Class ArrayBufferNoPreallocate: 3.94s per 10 runs
Class NumpyBuffer: 5.73s per 10 runs
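
The absolute numbers here differ from those in the question (3.40s versus 1.88s for ListBuffer), so only the ordering within one table is comparable. To make such comparisons less sensitive to background load, one option is to repeat each measurement and keep the minimum, as the timeit documentation suggests. The following is a minimal sketch, not part of the original benchmark: best_time and sample_workload are hypothetical names, and sample_workload is only a stand-in for time_test(). It is written to run on Python 3 as well as 2.7.

```python
import timeit


def best_time(func, number=10, repeat=5):
    """Return the fastest total time in seconds for `number` calls of func."""
    # timeit.repeat returns one total per repetition; the minimum is the
    # run least disturbed by other activity on the machine.
    timings = timeit.repeat(stmt=func, number=number, repeat=repeat)
    return min(timings)


def sample_workload():
    # Hypothetical stand-in for time_test(): append characters, then join.
    parts = []
    for i in range(1000):
        parts.append(str(i % 10))
    return "".join(parts)


if __name__ == "__main__":
    print("best of 5: {0:.4f}s per 10 runs".format(best_time(sample_workload)))
```

In the original benchmark, the same idea would amount to passing time_test to such a helper in place of the single timeit.timeit call.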

If you really want a significant speedup, you will probably have to write a C extension or use something like Cython.

If you can restructure your problem so that it does not need a function call for every character, you can also gain some performance.
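
As a rough illustration of that last point, the sketch below compares writing one character per call with writing multi-character chunks into the same list-backed buffer. It assumes the caller can produce chunks at all, which the question does not confirm. ChunkListBuffer, per_char, per_chunk and the chunk size of 64 are hypothetical names and values chosen for this example, and the code is written to run on Python 3 as well as 2.7.

```python
import timeit


class ChunkListBuffer(object):
    """Hypothetical list-backed buffer whose write() accepts whole chunks."""

    def __init__(self):
        self._parts = []

    def write(self, chunk):
        # One append per call, however many characters the chunk holds.
        self._parts.append(chunk)

    def value(self):
        return u"".join(self._parts)

    def clear(self):
        # Return the accumulated text and reset the storage.
        old_val = self.value()
        self._parts = []
        return old_val


def per_char(text):
    # One write() call per character, as in the original benchmark.
    buf = ChunkListBuffer()
    for symbol in text:
        buf.write(symbol)
    return buf.clear()


def per_chunk(text, size=64):
    # One write() call per slice of up to `size` characters.
    buf = ChunkListBuffer()
    for start in range(0, len(text), size):
        buf.write(text[start:start + size])
    return buf.clear()


if __name__ == "__main__":
    sample = u"0123456789" * 1000
    assert per_char(sample) == per_chunk(sample) == sample
    for func in (per_char, per_chunk):
        elapsed = timeit.timeit(lambda: func(sample), number=100)
        print("{0}: {1:.4f}s per 100 runs".format(func.__name__, elapsed))
```

Both functions return the same string; the difference is only in how many Python-level calls are made, which is the overhead the paragraph above refers to.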

Regarding python - Most efficient string buffering, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53495898/
