Python script - email parser

Reposted. Author: 行者123. Updated: 2023-12-01 01:38:11

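Good morning, everyone,

I'm currently taking a Python class, and we haven't yet covered what I'm about to ask, so any help would be great. I have a Python script that parses email addresses out of a document, but it only lets me process one document at a time. I have about 500 documents, most of which contain email addresses. Is there a way to change this script so that it reads all subfolders and documents and skips over any errors? As far as I know, some file types may not be readable. Common file types include .txt, .csv, .sql, and .xlsx.

Here is the script I found; it works very well on one file at a time. As always, thanks for your help.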

#!/usr/bin/env python
#
# Extracts email addresses from one or more plain text files.
#
# Notes:
# - Does not save to file (pipe the output to a file if you want it saved).
# - Does not check for duplicates (which can easily be done in the terminal).
#


from optparse import OptionParser
import os.path
import re

# Raw strings keep the backslashes intact for the regex engine.
regex = re.compile((r"([a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`"
                    r"{|}~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|"
                    r"\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"))

def file_to_str(filename):
    """Returns the contents of filename as a string."""
    with open(filename) as f:
        return f.read().lower()  # Case is lowered to prevent regex mismatches.

def get_emails(s):
    """Returns an iterator of matched emails found in string s."""
    # Removing lines that start with '//' because the regular expression
    # mistakenly matches patterns like 'http://foo@bar.com' as '//foo@bar.com'.
    return (email[0] for email in re.findall(regex, s) if not email[0].startswith('//'))

if __name__ == '__main__':
    parser = OptionParser(usage="Usage: python %prog [FILE]...")
    # No options added yet. Add them here if you ever need them.
    options, args = parser.parse_args()

    if not args:
        parser.print_usage()
        exit(1)

    for arg in args:
        if os.path.isfile(arg):
            for email in get_emails(file_to_str(arg)):
                print(email)
        else:
            print('"{}" is not a file.'.format(arg))
            parser.print_usage()

Best answer

You can use os.walk like this:

import os

root_folder = '.'  # Assumption: start from the current directory; change as needed.
# Extensions to skip. These are the answer's example values; list the types you cannot read as text.
not_parseble_files = ['.txt', '.csv']
for root, dirs, files in os.walk(root_folder):  # Recursively visits every subdirectory
    for file in files:
        _, file_ext = os.path.splitext(file)  # Get the file's extension
        file_path = os.path.join(root, file)
        if file_ext in not_parseble_files:  # Skip any extension on the banned list
            print("File %s is not parsable" % file_path)
            continue  # Move on to the next file
        if os.path.isfile(file_path):
            # get_emails and file_to_str come from the script in the question
            for email in get_emails(file_to_str(file_path)):
                print(email)
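
The question also asks to skip any errors, which the loop above does not do by itself. A minimal sketch of one way to add that, assuming Python 3 and UTF-8 text files: wrap the read in try/except and record the files that could not be read. The function name find_emails_in_tree and the shortened pattern EMAIL_RE are hypothetical, introduced only for this illustration; the longer regex from the question could be swapped in.

```python
import os
import re

# Simplified pattern for illustration only (an assumption, not the question's regex).
EMAIL_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+")


def find_emails_in_tree(root_folder):
    """Walk root_folder recursively and return (emails, skipped_paths)."""
    emails = set()   # a set removes duplicates across files
    skipped = []     # paths that could not be read as UTF-8 text
    for root, _dirs, files in os.walk(root_folder):
        for name in files:
            path = os.path.join(root, name)
            try:
                with open(path, encoding="utf-8") as f:
                    text = f.read().lower()
            except (OSError, UnicodeDecodeError):
                # Unreadable or non-text file: remember it and move on.
                skipped.append(path)
                continue
            emails.update(EMAIL_RE.findall(text))
    return emails, skipped
```

Calling find_emails_in_tree('.') returns the set of addresses found plus the list of files that were skipped, so nothing fails silently. Binary formats such as .xlsx will most likely end up in the skipped list, because they are not plain text; reading them would need a dedicated library.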

Regarding Python script - email parser, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/52185229/
