
python - Converting an Outlook PST to JSON with libpst

Reposted. Author: 行者123. Updated: 2023-12-01 03:51:45

I have an Outlook PST file and I want to get its emails as JSON, e.g. something like

{"emails": [
{"from": "alice@example.com",
"to": "bob@example.com",
"bcc": "eve@example.com",
"subject": "mitm",
"content": "be careful!"
}, ...]}

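I thought about using readpst to convert to MH format and then scanning the result in a ruby/python/bash script. Is there a better way?

Unfortunately, the ruby-msg gem does not work on my PST file (and it looks like it has not been updated since 2014).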

Best answer

I found a way to do it in two stages: first convert to mbox, then convert to JSON:

# requires installing libpst
pst2json my.pst
# or you can specify a custom output dir and an outlook mail folder,
# e.g. Inbox, Sent, etc.
pst2json -o email/ -f Inbox my.pst

Here pst2json is my own script, and mbox2json is lightly modified from Mining the Social Web.

pst2json:

#!/usr/bin/env bash

usage(){
    echo "usage: $(basename "$0") [-o <output-dir>] [-f <folder>] <pst-file>"
    echo "default output-dir: email/mbox-all/<pst-file>"
    echo "default folder: Inbox"
    exit 1
}

# readpst ships with libpst; stop early if it is not on PATH
command -v readpst >/dev/null || { echo "Error: libpst not installed"; exit 1; }
folder=Inbox

# Options must come before the PST file; anything after it is an error
while (( $# > 0 )); do
    [[ -n "$pst_file" ]] && usage
    case "$1" in
        -o)
            if [[ -n "$2" ]]; then
                out_dir="$2"
                shift 2
            else
                usage
            fi
            ;;
        -f)
            if [[ -n "$2" ]]; then
                folder="$2"
                shift 2
            else
                usage
            fi
            ;;
        *)
            pst_file="$1"
            shift
            ;;
    esac
done

# A PST file is required
[[ -n "$pst_file" ]] || usage

default_out_dir="email/mbox-all/$(basename "$pst_file")"
out_dir=${out_dir:-"$default_out_dir"}
mkdir -p "$out_dir"
# Stage 1: PST -> mbox files written into $out_dir
readpst -o "$out_dir" "$pst_file"
[[ -f "$out_dir/$folder" ]] || { echo "Error: folder $folder is missing or empty."; exit 1; }
res="$out_dir/$folder.json"
# Stage 2: mbox -> JSON
mbox2json "$out_dir/$folder" "$res" && echo "Success: result saved to $res"
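For readers who would rather drive stage one from Python, the path and command logic of the script above can be sketched roughly as follows. The function names `default_out_dir`, `plan_conversion` and `run_readpst` are hypothetical; the only readpst option used is the `-o` flag that already appears in the script.

```python
import os
import shutil
import subprocess


def default_out_dir(pst_file):
    """Mirror the script's default: email/mbox-all/<pst-file basename>."""
    return os.path.join("email", "mbox-all", os.path.basename(pst_file))


def plan_conversion(pst_file, out_dir=None, folder="Inbox"):
    """Return the readpst command plus the mbox and JSON paths the script would use."""
    out_dir = out_dir or default_out_dir(pst_file)
    mbox_path = os.path.join(out_dir, folder)
    return {
        "out_dir": out_dir,
        "readpst_cmd": ["readpst", "-o", out_dir, pst_file],
        "mbox_path": mbox_path,
        "json_path": mbox_path + ".json",
    }


def run_readpst(plan):
    """Run stage one; raise if readpst is missing, fails, or the folder is absent."""
    if shutil.which("readpst") is None:
        raise RuntimeError("libpst not installed (readpst not found on PATH)")
    os.makedirs(plan["out_dir"], exist_ok=True)
    subprocess.run(plan["readpst_cmd"], check=True)
    if not os.path.isfile(plan["mbox_path"]):
        raise RuntimeError("folder %s is missing or empty" % plan["mbox_path"])
    return plan["mbox_path"]
```

Keeping the planning step separate from the subprocess call makes it possible to inspect the paths without libpst installed, e.g. `plan_conversion("my.pst", folder="Sent")`.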

mbox2json (Python 2.7):

# -*- coding: utf-8 -*-

import sys
import mailbox
import email
import quopri
import json
from BeautifulSoup import BeautifulSoup

MBOX = sys.argv[1]
OUT_FILE = sys.argv[2]
SKIP_HTML = True


def cleanContent(msg):

    # Decode message from "quoted printable" format

    msg = quopri.decodestring(msg)

    # Strip out HTML tags, if any are present

    soup = BeautifulSoup(msg)
    return ''.join(soup.findAll(text=True))


def jsonifyMessage(msg):
    json_msg = {'parts': []}
    for (k, v) in msg.items():
        json_msg[k] = v.decode('utf-8', 'ignore')

    # The To, Cc, and Bcc fields, if present, could have multiple items
    # Note that not all of these fields are necessarily defined

    for k in ['To', 'Cc', 'Bcc']:
        if not json_msg.get(k):
            continue
        json_msg[k] = json_msg[k].replace('\n', '').replace('\t', '').replace('\r', '') \
            .replace(' ', '').decode('utf-8', 'ignore').split(',')

    try:
        for part in msg.walk():
            json_part = {}
            if part.get_content_maintype() == 'multipart':
                continue
            content_type = part.get_content_type()
            if SKIP_HTML and content_type == 'text/html':
                continue
            json_part['contentType'] = content_type
            content = part.get_payload(decode=False).decode('utf-8', 'ignore')
            json_part['content'] = cleanContent(content)

            json_msg['parts'].append(json_part)
    except Exception, e:
        sys.stderr.write('Skipping message - error encountered (%s)\n' % (str(e), ))
    finally:
        return json_msg


# There's a lot of data to process, so use a generator to do it. See http://wiki.python.org/moin/Generators
# Using a generator requires a trivial custom encoder be passed to json for serialization of objects
class Encoder(json.JSONEncoder):
    def default(self, o):
        return {'emails': list(o)}


# The generator itself...
def gen_json_msgs(mb):
    while 1:
        msg = mb.next()
        if msg is None:
            break
        yield jsonifyMessage(msg)


mbox = mailbox.UnixMailbox(open(MBOX, 'rb'), email.message_from_file)
json.dump(gen_json_msgs(mbox), open(OUT_FILE, 'wb'), indent=4, cls=Encoder)
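The script above targets Python 2.7, and `mailbox.UnixMailbox` is not available in Python 3. A rough Python 3 sketch of the same idea is shown below; it is not the author's code. It assumes `mailbox.mbox` as the reader, swaps BeautifulSoup for the standard library's `html.parser`, and lets `get_payload(decode=True)` undo the transfer encoding instead of calling `quopri` directly. The names `strip_html`, `split_addresses`, `jsonify_message` and `mbox_to_json` are made up for illustration.

```python
import json
import mailbox
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(text):
    """Return text with HTML tags removed (stdlib stand-in for BeautifulSoup)."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return "".join(parser.chunks)


def split_addresses(value):
    """Split a To/Cc/Bcc header on commas after dropping all whitespace."""
    return [a for a in "".join(value.split()).split(",") if a]


def jsonify_message(msg, skip_html=True):
    """Turn one message into a dict of headers plus a list of decoded parts."""
    json_msg = {"parts": []}
    for key, value in msg.items():
        json_msg[key] = str(value)
    for key in ("To", "Cc", "Bcc"):
        if json_msg.get(key):
            json_msg[key] = split_addresses(json_msg[key])
    for part in msg.walk():
        if part.get_content_maintype() == "multipart":
            continue
        content_type = part.get_content_type()
        if skip_html and content_type == "text/html":
            continue
        # decode=True undoes quoted-printable/base64 and returns bytes
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, "ignore")
        except LookupError:  # charset name unknown to Python
            text = payload.decode("utf-8", "ignore")
        json_msg["parts"].append(
            {"contentType": content_type, "content": strip_html(text)}
        )
    return json_msg


def mbox_to_json(mbox_path, out_path, skip_html=True):
    """Write {"emails": [...]} for every message in mbox_path; return the count."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        emails = [jsonify_message(m, skip_html) for m in box]
    finally:
        box.close()
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump({"emails": emails}, fh, indent=4, ensure_ascii=False)
    return len(emails)
```

A call such as `mbox_to_json("email/Inbox", "email/Inbox.json")` would then play the role of the `mbox2json` step. Unlike the generator-based original, this sketch builds the whole list in memory, which keeps it short but may not suit very large mailboxes.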

Now the file is easy to process. For example, to get only the contents of the emails:

jq '.emails[] | .parts[] | .content' < out/Inbox.json
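Where jq is not available, the same filter can be approximated in Python. This is a minimal sketch that assumes the JSON layout produced by mbox2json above (an `emails` list whose items carry header keys such as `From` and a `parts` list); `iter_contents` and `to_simple_records` are hypothetical helper names, and the second one reshapes the data toward the flat form asked for in the question.

```python
import json


def iter_contents(data):
    """Yield every part's content, like jq '.emails[] | .parts[] | .content'."""
    for message in data["emails"]:
        for part in message["parts"]:
            # .get mirrors jq, which yields null for a missing key
            yield part.get("content")


def to_simple_records(data):
    """Reshape into the flat from/to/bcc/subject/content form from the question."""
    records = []
    for message in data["emails"]:
        records.append({
            "from": message.get("From"),
            "to": message.get("To"),
            "bcc": message.get("Bcc"),
            "subject": message.get("Subject"),
            "content": "".join(p.get("content") or "" for p in message["parts"]),
        })
    return {"emails": records}


def load_json(json_path):
    """Read a file written by the mbox2json step."""
    with open(json_path, encoding="utf-8") as fh:
        return json.load(fh)
```

For example, `list(iter_contents(load_json("out/Inbox.json")))` would collect the same values the jq command prints.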

This post on converting an Outlook PST to JSON with Python and libpst is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/38122047/
