
python-3.x - How do I convert an mbox to a JSON structure?

Reposted. Author: 行者123. Updated: 2023-12-01 14:17:39

I am trying to convert an mbox into a JSON structure suitable for import into MongoDB. I am using the code from the mailboxes chapter of Mining the Social Web, 2nd Edition, but it does not work.

import sys
import mailbox
import email
import quopri
import json
import time
from BeautifulSoup import BeautifulSoup
from dateutil.parser import parse

MBOX = 'resources/ch06-mailboxes/data/enron.mbox'
OUT_FILE = MBOX + '.json'

def cleanContent(msg):

    # Decode message from "quoted printable" format, but first
    # re-encode, since decodestring will try to do a decode of its own
    msg = quopri.decodestring(msg.encode('utf-8'))

    # Strip out HTML tags, if any are present.
    # Bail on unknown encodings if errors happen in BeautifulSoup.
    try:
        soup = BeautifulSoup(msg)
    except:
        return ''
    return ''.join(soup.findAll(text=True))

# There's a lot of data to process, and the Pythonic way to do it is with a
# generator. See http://wiki.python.org/moin/Generators.
# Using a generator requires a trivial encoder to be passed to json for object
# serialization.

class Encoder(json.JSONEncoder):
    def default(self, o): return list(o)

# The generator itself...
def gen_json_msgs(mb):
    while 1:
        msg = mb.next()
        if msg is None:
            break

        yield jsonifyMessage(msg)

def jsonifyMessage(msg):
    json_msg = {'parts': []}
    for (k, v) in msg.items():
        json_msg[k] = v.decode('utf-8', 'ignore')

    # The To, Cc, and Bcc fields, if present, could have multiple items.
    # Note that not all of these fields are necessarily defined.

    for k in ['To', 'Cc', 'Bcc']:
        if not json_msg.get(k):
            continue
        json_msg[k] = json_msg[k].replace('\n', '').replace('\t', '').replace('\r', '')\
                                 .replace(' ', '').decode('utf-8', 'ignore').split(',')

    for part in msg.walk():
        json_part = {}

        if part.get_content_maintype() != 'text':
            print >> sys.stderr, "Skipping MIME content in JSONification ({0})".format(part.get_content_maintype())
            continue

        json_part['contentType'] = part.get_content_type()
        content = part.get_payload(decode=False).decode('utf-8', 'ignore')
        json_part['content'] = cleanContent(content)
        json_msg['parts'].append(json_part)

    # Finally, convert date from asctime to milliseconds since epoch using the
    # $date descriptor so it imports "natively" as an ISODate object in MongoDB
    then = parse(json_msg['Date'])
    millis = int(time.mktime(then.timetuple())*1000 + then.microsecond/1000)
    json_msg['Date'] = {'$date' : millis}

    return json_msg

mbox = mailbox.UnixMailbox(open(MBOX, 'rb'), email.message_from_file)

# Write each message out as a JSON object on a separate line
# for easy import into MongoDB via mongoimport

f = open(OUT_FILE, 'w')
for msg in gen_json_msgs(mbox):
    if msg != None:
        f.write(json.dumps(msg, cls=Encoder) + '\n')
f.close()

print "All done"

I get the following error:

     80 # for easy import into MongoDB via mongoimport
     81
---> 82 f = open(OUT_FILE, 'w')
     83 for msg in gen_json_msgs(mbox):
     84     if msg != None:

IOError: [Errno 13] Permission denied: 'resources/ch06-mailboxes/data/enron.mbox.json'
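
The traceback points at open(OUT_FILE, 'w'), so the exception is raised before any message is parsed; the mbox parsing itself is not involved. A plausible cause (an assumption, since the thread does not confirm it) is that the data directory, or an existing enron.mbox.json file, is not writable by the user running the script. The sketch below checks this before opening the file and falls back to the system temporary directory. The helper name writable_output_path is hypothetical and not part of the book's code.

```python
import os
import tempfile


def writable_output_path(preferred_path):
    """Return preferred_path if it looks writable, else a path in the temp dir.

    Hypothetical helper for illustration; not part of the book's code.
    """
    directory = os.path.dirname(os.path.abspath(preferred_path))
    if os.path.exists(preferred_path):
        # An existing file must itself be writable.
        ok = os.access(preferred_path, os.W_OK)
    else:
        # A new file needs write and search permission on its directory.
        ok = os.path.isdir(directory) and os.access(directory, os.W_OK | os.X_OK)
    if ok:
        return preferred_path
    # Fall back to the same file name inside the system temporary directory.
    return os.path.join(tempfile.gettempdir(), os.path.basename(preferred_path))
```

With this in place, OUT_FILE = writable_output_path(MBOX + '.json') would either keep the original location or redirect the output to a location that can be written. Note that os.access only reports what the current user is allowed to do at the time of the check, so the later open call can still fail.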

Best answer

The code you mention is outdated as of the Third Edition of Mining the Social Web.

I tried to write a working script that not only converts MBOX to JSON but also extracts attachments into a usable format. Link to the repo: https://github.com/PS1607/mbox-to-json

Read the README file for usage instructions.
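
For readers who only need the core conversion, here is a minimal Python 3 sketch using just the standard library. It is not the code from the linked repository or from either edition of the book, and it does not extract attachments. The function names message_to_dict and mbox_to_json_lines are hypothetical. It keeps the question's output shape (one JSON object per line, a parts list, and a $date value in milliseconds), but stores only the bare addresses in To, Cc and Bcc, and it does not strip HTML tags from the body.

```python
import json
import mailbox
from email.utils import getaddresses, parsedate_to_datetime


def message_to_dict(msg):
    """Convert one email message into a JSON-serializable dict."""
    doc = {"parts": []}
    for key, value in msg.items():
        doc[key] = str(value)  # repeated headers overwrite earlier ones

    # To, Cc and Bcc may hold several addresses; keep only the address part.
    for key in ("To", "Cc", "Bcc"):
        if doc.get(key):
            doc[key] = [addr for _name, addr in getaddresses([doc[key]]) if addr]

    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and non-text attachments
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:  # charset name unknown to Python
            text = payload.decode("utf-8", errors="replace")
        doc["parts"].append({"contentType": part.get_content_type(),
                             "content": text})

    # Represent the date as {"$date": milliseconds since the Unix epoch}.
    if doc.get("Date"):
        try:
            when = parsedate_to_datetime(doc["Date"])
            doc["Date"] = {"$date": int(when.timestamp() * 1000)}
        except (TypeError, ValueError):
            pass  # leave an unparseable date as the raw header string
    return doc


def mbox_to_json_lines(mbox_path, out_path):
    """Write one JSON object per line; return the number of messages written."""
    count = 0
    box = mailbox.mbox(mbox_path, create=False)
    try:
        with open(out_path, "w", encoding="utf-8") as out:
            for msg in box:
                out.write(json.dumps(message_to_dict(msg)) + "\n")
                count += 1
    finally:
        box.close()
    return count
```

Calling mbox_to_json_lines('resources/ch06-mailboxes/data/enron.mbox', 'enron.mbox.json') would then produce a file in the one-object-per-line layout that the question's comments describe for mongoimport, provided the output location is writable.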

Regarding "python-3.x - How do I convert an mbox to a JSON structure?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/22006616/
