gpt4 book ai didi

用于合并排序文件的 Python 类,如何改进?

转载 作者:太空狗 更新时间:2023-10-29 20:40:13 25 4
gpt4 key购买 nike

背景:

我正在清理以制表符分隔的大型(无法保存在内存中)文件。当我清理输入文件时,我在内存中建立了一个列表;当它达到 1,000,000 个条目(大约 1GB 内存)时,我对其进行排序(使用下面的默认键)并将列表写入文件。此类用于将排序的文件放回一起。它适用于我迄今为止遇到的文件。到目前为止,我最大的案例是合并 66 个排序文件。

问题:

  1. 我的逻辑是否存在漏洞(哪里脆弱)?
  2. 我实现了归并排序吗算法正确吗?
  3. 是否有任何明显的改进可以做吗?

示例数据:

这是对其中一个文件中一行的抽象:

'hash_of_SomeStringId\tSome String Id\t\t\twww.somelink.com\t\tOtherData\t\n'

要点是我使用 'SomeStringId'.lower().replace(' ', '') 作为我的排序键。

原代码:

class SortedFileMerger():
""" A one-time use object that merges any number of smaller sorted
files into one large sorted file.

ARGS:
paths - list of paths to sorted files
output_path - string path to desired output file
dedup - (boolean) remove lines with duplicate keys, default = True
key - use to override sort key, default = "line.split('\t')[1].lower().replace(' ', '')"
will be prepended by "lambda line: ". This should be the same
key that was used to sort the files being merged!
"""
def __init__(self, paths, output_path, dedup=True, key="line.split('\t')[1].lower().replace(' ', '')"):
self.key = eval("lambda line: %s" % key)
self.dedup = dedup
self.handles = [open(path, 'r') for path in paths]
# holds one line from each file
self.lines = [file_handle.readline() for file_handle in self.handles]
self.output_file = open(output_path, 'w')
self.lines_written = 0
self._mergeSortedFiles() #call the main method

def __del__(self):
""" Clean-up file handles.
"""
for handle in self.handles:
if not handle.closed:
handle.close()
if self.output_file and (not self.output_file.closed):
self.output_file.close()

def _mergeSortedFiles(self):
""" Merge the small sorted files to 'self.output_file'. This can
and should only be called once.
Called from __init__().
"""
previous_comparable = ''
min_line = self._getNextMin()
while min_line:
index = self.lines.index(min_line)
comparable = self.key(min_line)
if not self.dedup:
#not removing duplicates
self._writeLine(index)
elif comparable != previous_comparable:
#removing duplicates and this isn't one
self._writeLine(index)
else:
#removing duplicates and this is one
self._readNextLine(index)
previous_comparable = comparable
min_line = self._getNextMin()
#finished merging
self.output_file.close()

def _getNextMin(self):
""" Returns the next "smallest" line in sorted order.
Returns None when there are no more values to get.
"""
while '' in self.lines:
index = self.lines.index('')
if self._isLastLine(index):
# file.readline() is returning '' because
# it has reached the end of a file.
self._closeFile(index)
else:
# an empty line got mixed in
self._readNextLine(index)
if len(self.lines) == 0:
return None
return min(self.lines, key=self.key)

def _writeLine(self, index):
""" Write line to output file and update self.lines
"""
self.output_file.write(self.lines[index])
self.lines_written += 1
self._readNextLine(index)

def _readNextLine(self, index):
""" Read the next line from handles[index] into lines[index]
"""
self.lines[index] = self.handles[index].readline()

def _closeFile(self, index):
""" If there are no more lines to get in a file, it
needs to be closed and removed from 'self.handles'.
It's entry in 'self.lines' also need to be removed.
"""
handle = self.handles.pop(index)
if not handle.closed:
handle.close()
# remove entry from self.lines to preserve order
_ = self.lines.pop(index)

def _isLastLine(self, index):
""" Check that handles[index] is at the eof.
"""
handle = self.handles[index]
if handle.tell() == os.path.getsize(handle.name):
return True
return False

编辑:实现来自Brian的建议我提出了以下解决方案:

第二次编辑:根据 John Machin 更新代码的建议:

def decorated_file(f, key):
""" Yields an easily sortable tuple.
"""
for line in f:
yield (key(line), line)

def standard_keyfunc(line):
""" The standard key function in my application.
"""
return line.split('\t', 2)[1].replace(' ', '').lower()

def mergeSortedFiles(paths, output_path, dedup=True, keyfunc=standard_keyfunc):
""" Does the same thing SortedFileMerger class does.
"""
files = map(open, paths) #open defaults to mode='r'
output_file = open(output_path, 'w')
lines_written = 0
previous_comparable = ''
for line in heapq26.merge(*[decorated_file(f, keyfunc) for f in files]):
comparable = line[0]
if previous_comparable != comparable:
output_file.write(line[1])
lines_written += 1
previous_comparable = comparable
return lines_written

粗略测试

使用相同的输入文件(2.2 GB 数据):

  • SortedFileMerger 类用了 51分钟(3068.4 秒)
  • Brian的解决方案耗时 40 分钟(2408.5 秒)
  • 添加 John Machin 后的建议,解决方案代码用了 36 分钟(2214.0 秒)

最佳答案

注意在python2.6中,heapq有一个新的merge将为您执行此操作的函数。

要处理自定义键函数,您可以用装饰它的东西包装文件迭代器,以便它根据键进行比较,然后将其剥离:

def decorated_file(f, key):
for line in f:
yield (key(line), line)

filenames = ['file1.txt','file2.txt','file3.txt']
files = map(open, filenames)
outfile = open('merged.txt')

for line in heapq.merge(*[decorated_file(f, keyfunc) for f in files]):
outfile.write(line[1])

[编辑] 即使在早期版本的 python 中,简单地从后来的 heapq 模块中获取 merge 的实现可能是值得的。它是纯 python,并且在 python2.5 中未修改地运行,并且由于它使用堆来获取下一个最小值,因此在合并大量文件时应该非常有效。

您应该能够简单地从 python2.6 安装中复制 heapq.py,将其作为“heapq26.py”复制到您的源代码并使用“from heapq26 import merge”——有其中没有使用 2.6 特定功能。或者,您可以只复制合并函数(重写 heappop 等调用以引用 python2.5 heapq 模块)。

关于用于合并排序文件的 Python 类,如何改进?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/1001569/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com