
Python script uses all RAM

Reposted. Author: 行者123. Updated: 2023-12-01 09:02:48

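I have a Python script that parses email addresses out of large documents. The script uses all the RAM on my machine and locks it up to the point where I have to restart it. I'd like to know whether there is a way to limit this, or even to pause after reading one file and produce some output. Any help would be greatly appreciated.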

#!/usr/bin/env python

# Extracts email addresses from one or more plain text files.
#
# Notes:
# - Does not save to file (pipe the output to a file if you want it saved).
# - Does not check for duplicates (which can easily be done in the terminal).
# Twitter @Critical24 - DefensiveThinking.io


import os.path
import re

# Raw strings keep the regex backslashes (\/, \., \s) from being read as string escapes.
regex = re.compile((r"([a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`"
                    r"{|}~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|"
                    r"\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"))

def file_to_str(filename):
    """Returns the contents of filename as a string."""
    with open(filename, encoding='utf-8') as f:  # Added encoding='utf-8'
        return f.read().lower()  # Case is lowered to prevent regex mismatches.

def get_emails(s):
    """Returns an iterator of matched emails found in string s."""
    # Removing lines that start with '//' because the regular expression
    # mistakenly matches patterns like 'http://foo@bar.com' as '//foo@bar.com'.
    return (email[0] for email in re.findall(regex, s) if not email[0].startswith('//'))

import os

not_parseable_files = ['.txt', '.csv']
for root, dirs, files in os.walk('.'):  # Recursively searches all subdirectories for files.
    for file in files:
        _, file_ext = os.path.splitext(file)  # Get the extension of the file.
        file_path = os.path.join(root, file)
        if file_ext in not_parseable_files:  # Skip extensions in the banned list 'not_parseable_files'.
            print("File %s is not parseable" % file_path)
            continue  # Continue the loop with the next file.
        if os.path.isfile(file_path):
            for email in get_emails(file_to_str(file_path)):
                print(email)

Best answer

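You appear to be reading files of up to 8 GB into memory with f.read(). Instead, you could try applying the regex to each line of the file, without ever holding the whole file in memory.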

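def get_emails_from_file(filename):  # Wrapper name is illustrative.
    with open(filename, encoding='utf-8') as f:  # Added encoding='utf-8'
        # "yield from" (rather than "return") keeps the file open while the caller iterates.
        yield from (email[0] for line in f
                    for email in re.findall(regex, line.lower())
                    if not email[0].startswith('//'))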
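
If you want to check the memory difference on your own machine, the following is a rough sketch using the standard-library tracemalloc module. The helper names and the generated sample file are hypothetical, and counting '@' characters stands in for the real regex just to keep the comparison small.

```python
import os
import tempfile
import tracemalloc


def peak_bytes(func, path):
    """Runs func(path) and returns the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        func(path)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def count_whole_file(path):
    """Loads the entire file at once, as f.read() does in the question."""
    with open(path, encoding='utf-8') as f:
        return f.read().lower().count('@')


def count_line_by_line(path):
    """Holds only one line in memory at a time."""
    with open(path, encoding='utf-8') as f:
        return sum(line.lower().count('@') for line in f)


def make_sample(lines=50000):
    """Writes a hypothetical sample file and returns its path."""
    fd, path = tempfile.mkstemp(suffix='.log')
    with os.fdopen(fd, 'w', encoding='utf-8') as f:
        for i in range(lines):
            f.write('contact user%d@example.com for details\n' % i)
    return path


if __name__ == '__main__':
    sample = make_sample()
    try:
        print('whole file  :', peak_bytes(count_whole_file, sample), 'bytes')
        print('line by line:', peak_bytes(count_line_by_line, sample), 'bytes')
    finally:
        os.remove(sample)
```

The whole-file peak grows with the size of the file, while the line-by-line peak depends mostly on the longest line, so a file with no line breaks would not benefit from this approach.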

This will still take a long time, though. Also, I have not checked your regex for possible problems.
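
For completeness, here is a minimal sketch of how the line-by-line idea might be combined with the directory walk from the question, so that each email is produced as soon as it is found instead of after a whole file has been loaded. The names iter_emails, walk_emails and SKIPPED_EXTENSIONS are hypothetical, and errors='replace' is an added assumption (the question opens files without it) so that undecodable bytes do not abort the run.

```python
import os
import re

# Same pattern as in the question, written as raw strings.
regex = re.compile(
    r"([a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`"
    r"{|}~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|"
    r"\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)")

# The same extensions the question's script skips.
SKIPPED_EXTENSIONS = {'.txt', '.csv'}


def iter_emails(path):
    """Yields emails from one file, reading a single line at a time."""
    # Assumption: errors='replace' keeps invalid UTF-8 bytes from raising
    # UnicodeDecodeError partway through a file.
    with open(path, encoding='utf-8', errors='replace') as f:
        for line in f:
            for match in regex.findall(line.lower()):
                # match[0] is the full address; skip 'http://foo@bar.com' hits.
                if not match[0].startswith('//'):
                    yield match[0]


def walk_emails(top='.'):
    """Yields (file_path, email) pairs for every parsed file under top."""
    for root, _dirs, files in os.walk(top):
        for name in files:
            if os.path.splitext(name)[1] in SKIPPED_EXTENSIONS:
                continue
            path = os.path.join(root, name)
            if os.path.isfile(path):
                for email in iter_emails(path):
                    yield path, email
```

Looping with `for path, email in walk_emails('.'): print(email)` would print results incrementally, much like the original script but without reading any file in full. Because both functions are generators, the caller can also stop early or pause between files.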

Regarding "Python script uses all RAM", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52334602/
