email - Extracting email bodies from an mbox file and decoding them to plain text, regardless of charset and Content-Transfer-Encoding

Reposted. Author: 行者123. Updated: 2023-12-03 23:25:05

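I am trying to use Python 3 to extract the bodies of emails from a Thunderbird mbox file. It is an IMAP account.

I would like to have the text part of each email body available as a Unicode string. It should "look like" the email does in Thunderbird and should not contain escape sequences such as \r\n, =20, etc.

I think the problem is the Content-Transfer-Encoding, which I don't know how to decode or remove.
I receive emails with a variety of content types and content transfer encodings.
This is my current attempt: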

import mailbox
import quopri,base64

def myconvert(encoded,ContentTransferEncoding):
    if ContentTransferEncoding == 'quoted-printable':
        result = quopri.decodestring(encoded)
    elif ContentTransferEncoding == 'base64':
        result = base64.b64decode(encoded)

mboxfile = 'C:/Users/Username/Documents/Thunderbird/Data/profile/ImapMail/server.name/INBOX'

for msg in mailbox.mbox(mboxfile):
    if msg.is_multipart(): #Walk through the parts of the email to find the text body.
        for part in msg.walk():
            if part.is_multipart(): # If part is multipart, walk through the subparts.
                for subpart in part.walk():
                    if subpart.get_content_type() == 'text/plain':
                        body = subpart.get_payload() # Get the subpart payload (i.e the message body)
                    for k,v in subpart.items():
                        if k == 'Content-Transfer-Encoding':
                            cte = v # Keep the Content Transfer Encoding
            elif subpart.get_content_type() == 'text/plain':
                body = part.get_payload() # part isn't multipart Get the payload
                for k,v in part.items():
                    if k == 'Content-Transfer-Encoding':
                        cte = v # Keep the Content Transfer Encoding

    print(body)
    print('Body is of type:',type(body))
    body = myconvert(body,cte)
    print(body)

But this fails with:
Body is of type: <class 'str'>
Traceback (most recent call last):
  File "C:/Users/David/Documents/Python/test2.py", line 31, in <module>
    body = myconvert(body,cte)
  File "C:/Users/David/Documents/Python/test2.py", line 6, in myconvert
    result = quopri.decodestring(encoded)
  File "C:\Python32\lib\quopri.py", line 164, in decodestring
    return a2b_qp(s, header=header)
TypeError: 'str' does not support the buffer interface
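
The traceback shows `quopri.decodestring` being handed a `str`, which is what `get_payload()` returns when called without `decode=True`. On Python 3.2 (the version in the traceback) that call rejects `str`. As far as I know, later versions accept ASCII-only strings, but the result is still `bytes` and has to be decoded with the part's charset. A minimal sketch of the distinction, using made-up sample strings and assuming a UTF-8 charset (neither comes from the original mailbox):

```python
import base64
import quopri

# Hypothetical sample payloads, not taken from the original mailbox.
qp_text = "caf=C3=A9 =3D"   # quoted-printable form of "café ="
b64_text = "Y2Fmw6k="        # base64 form of "café"

# Transfer-encoded payloads are pure ASCII, so encoding to bytes is lossless.
raw_qp = quopri.decodestring(qp_text.encode("ascii"))
raw_b64 = base64.b64decode(b64_text.encode("ascii"))

# Both results are bytes; the charset (assumed UTF-8 here) turns them into str.
decoded_qp = raw_qp.decode("utf-8")
decoded_b64 = raw_b64.decode("utf-8")
```

Undoing the transfer encoding and decoding the charset are two separate steps, and the charset comes from each part's Content-Type header rather than from the Content-Transfer-Encoding header.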

Best answer

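Here is some code that does the job. For messages that it cannot decode, it prints an error instead of crashing. I hope it may be useful. Note that the `.get_payload(decode=True)` calls currently return a bytes object; if that turns out to be a bug in Python 3 and it gets fixed, they may return a str object instead. I ran this code today on Python 2.7.2 and on Python 3.2.1.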

import mailbox

def getcharsets(msg):
    # Collect every charset declared anywhere in the message, skipping parts with none.
    charsets = set()
    for c in msg.get_charsets():
        if c is not None:
            charsets.add(c)
    return charsets

def handleerror(errmsg, emailmsg, cs):
    print()
    print(errmsg)
    print("This error occurred while decoding with ", cs, " charset.")
    print("These charsets were found in the one email.", getcharsets(emailmsg))
    print("This is the subject:", emailmsg['subject'])
    print("This is the sender:", emailmsg['From'])

def getbodyfromemail(msg):
    body = None
    # Walk through the parts of the email to find the text body.
    if msg.is_multipart():
        for part in msg.walk():

            # If part is multipart, walk through the subparts.
            if part.is_multipart():

                for subpart in part.walk():
                    if subpart.get_content_type() == 'text/plain':
                        # Get the subpart payload (i.e. the message body)
                        body = subpart.get_payload(decode=True)
                        #charset = subpart.get_charset()

            # Part isn't multipart so get the email body
            elif part.get_content_type() == 'text/plain':
                body = part.get_payload(decode=True)
                #charset = part.get_charset()

    # If this isn't a multi-part message then get the payload (i.e. the message body)
    elif msg.get_content_type() == 'text/plain':
        body = msg.get_payload(decode=True)

    # No checking done to match the charset with the correct part.
    for charset in getcharsets(msg):
        try:
            body = body.decode(charset)
        except UnicodeDecodeError:
            handleerror("UnicodeDecodeError: encountered.", msg, charset)
        except AttributeError:
            handleerror("AttributeError: encountered", msg, charset)
    return body


mboxfile = 'C:/Users/Username/Documents/Thunderbird/Data/profile/ImapMail/server.name/INBOX'
print(mboxfile)
for thisemail in mailbox.mbox(mboxfile):
    body = getbodyfromemail(thisemail)
    if body is not None:  # messages without a text/plain part yield None
        print(body[0:1000])
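
The code above notes that it does not match each charset to the part it belongs to. One possible way to close that gap is to decode every `text/plain` part with its own declared charset, as in the sketch below. The helper name `text_parts`, the UTF-8 fallback, and the sample message are assumptions made for illustration; they are not part of the original answer. `Message.walk()` already descends into nested multipart containers, so a single loop is enough.

```python
import email
from email.mime.text import MIMEText

def text_parts(msg):
    """Yield each text/plain part of msg, decoded with that part's own charset."""
    for part in msg.walk():  # walk() visits nested subparts recursively
        if part.get_content_type() != "text/plain":
            continue
        raw = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if raw is None:
            continue
        # Falling back to UTF-8 when no charset is declared is an assumption.
        charset = part.get_content_charset() or "utf-8"
        try:
            yield raw.decode(charset, errors="replace")
        except LookupError:  # charset label unknown to Python
            yield raw.decode("utf-8", errors="replace")

# Hypothetical sample message: Latin-1 text, which MIMEText transfer-encodes.
sample = MIMEText("Grüße aus Köln", "plain", "iso-8859-1")
parsed = email.message_from_bytes(sample.as_bytes())
bodies = list(text_parts(parsed))
```

Messages yielded by `mailbox.mbox(mboxfile)` are `Message` subclasses, so they could be passed to `text_parts` in the same way. Using `errors="replace"` trades exactness for robustness: undecodable bytes become U+FFFD instead of raising.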

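Regarding "email - Extracting email bodies from an mbox file and decoding them to plain text, regardless of charset and Content-Transfer-Encoding", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/7166922/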
