python - Cleaning an email chain for text analysis in Python

Reposted. Author: 太空狗. Last updated: 2023-10-30 02:24:46

I have some text:

text = """From: 'Mark Twain' <mark.twain@gmail.com>
To: 'Edgar Allen Poe' <eap@gmail.com>
Subject: RE:Hello!

Ed,

I just read the Tell Tale Heart. You\'ve got problems man.

Sincerely,
Marky Mark

From: 'Edgar Allen Poe' <eap@gmail.com>
To: 'Mark Twain' <mark.twain@gmail.com>
Subject: RE: Hello!

Mark,

The world is crushing my soul, and so are you.

Regards,
Edgar"""

As a single string, it looks like this:

"From: 'Mark Twain' <mark.twain@gmail.com>\nTo: 'Edgar Allen Poe' <eap@gmail.com>\nSubject: RE:Hello!\n\nEd,\n\nI just read the Tell Tale Heart. You've got problems man.\n\nSincerely,\nMarky Mark\n\nFrom: 'Edgar Allen Poe' <eap@gmail.com>\nTo: 'Mark Twain' <mark.twain@gmail.com>\nSubject: RE: Hello!\n\nMark,\n\nThe world is crushing my soul, and so are you.\n\nRegards,\nEdgar"

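I'm trying to parse the individual messages out of it. Ultimately, I want a list or dictionary containing the "From" and "To" fields plus the message body, so that I can run some analysis on it.

I tried to parse it by lowercasing everything and then splitting the string.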

text = text.lower()
text = text.translate(string.punctuation)
text_list = text.split('+')
text_list = [x for x in text_list if len(x) != 0]

Is there a better way?

Best answer

You can use re to split the messages (explanation of this regexp on external site). The result is a list of dictionaries with the keys 'from', 'to', 'subject' and 'message':

text = """From: 'Mark Twain' <mark.twain@gmail.com>
To: 'Edgar Allen Poe' <eap@gmail.com>
Subject: RE:Hello!

Ed,

I just read the Tell Tale Heart. You\'ve got problems man.

Sincerely,
Marky Mark

From: 'Edgar Allen Poe' <eap@gmail.com>
To: 'Mark Twain' <mark.twain@gmail.com>
Subject: RE: Hello!

Mark,

The world is crushing my soul, and so are you.

Regards,
Edgar"""

import re
from pprint import pprint

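# Capture the From, To and Subject headers, then the body up to the next
# line that starts with "From:" (or the end of the text).
groups = re.findall(r'^From:(.*?)To:(.*?)Subject:(.*?)$(.*?)(?=^From:|\Z)', text, flags=re.DOTALL|re.M)
emails = []
for g in groups:
    # One dictionary per message, with surrounding whitespace stripped.
    d = {}
    d['from'] = g[0].strip()
    d['to'] = g[1].strip()
    d['subject'] = g[2].strip()
    d['message'] = g[3].strip()
    emails.append(d)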

pprint(emails)

Prints:

[{'from': "'Mark Twain' <mark.twain@gmail.com>",
  'message': 'Ed,\n'
             '\n'
             "I just read the Tell Tale Heart. You've got problems man.\n"
             '\n'
             'Sincerely,\n'
             'Marky Mark',
  'subject': 'RE:Hello!',
  'to': "'Edgar Allen Poe' <eap@gmail.com>"},
 {'from': "'Edgar Allen Poe' <eap@gmail.com>",
  'message': 'Mark,\n'
             '\n'
             'The world is crushing my soul, and so are you.\n'
             '\n'
             'Regards,\n'
             'Edgar',
  'subject': 'RE: Hello!',
  'to': "'Mark Twain' <mark.twain@gmail.com>"}]
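
If the goal is text analysis, it may help to go one step further and split each header into a name and an address, and to normalize the body. The following is only a sketch, not part of the original answer: the names MESSAGE_RE, split_address, normalize and parse_chain are made up for illustration, the "tokens" key is an addition, and it assumes every message has From, To and Subject headers, each at the start of a line, as in the sample. It uses named groups in the regex, email.utils.parseaddr from the standard library, and str.maketrans to delete punctuation.

```python
import re
import string
from email.utils import parseaddr

# Hypothetical names throughout. Assumes the header layout of the sample:
# From, To and Subject, each starting its own line.
MESSAGE_RE = re.compile(
    r"^From:(?P<sender>.*?)"
    r"^To:(?P<recipient>.*?)"
    r"^Subject:(?P<subject>.*?)$"
    r"(?P<body>.*?)"
    r"(?=^From:|\Z)",  # stop at the next message or at the end of the text
    flags=re.DOTALL | re.MULTILINE,
)

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def split_address(field):
    """Split a header value into (display name, email address)."""
    name, address = parseaddr(field.strip())
    # parseaddr keeps the single quotes used in the sample, so drop them here.
    return name.strip("'\" "), address


def normalize(body):
    """Lowercase the body and drop punctuation, a crude pre-analysis step."""
    return body.lower().translate(_STRIP_PUNCT)


def parse_chain(text):
    """Return one dictionary per message found in the chain."""
    messages = []
    for match in MESSAGE_RE.finditer(text):
        body = match["body"].strip()
        messages.append({
            "from": split_address(match["sender"]),
            "to": split_address(match["recipient"]),
            "subject": match["subject"].strip(),
            "message": body,
            "tokens": normalize(body).split(),
        })
    return messages


chain = """From: 'Mark Twain' <mark.twain@gmail.com>
To: 'Edgar Allen Poe' <eap@gmail.com>
Subject: RE:Hello!

Ed,

You've got problems man.

From: 'Edgar Allen Poe' <eap@gmail.com>
To: 'Mark Twain' <mark.twain@gmail.com>
Subject: RE: Hello!

Mark,

The world is crushing my soul.
"""

for message in parse_chain(chain):
    print(message["from"], "->", message["to"], message["tokens"])
```

Like the regex in the answer, this would split a message in the wrong place if a body line happened to start with "From:", so it is probably only suitable for chains with a predictable layout.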

Regarding "python - Cleaning an email chain for text analysis in Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51676027/
