
python - Removing unwanted characters that break readline()

Reposted. Author: 太空宇宙. Updated: 2023-11-03 21:07:47

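I'm writing a small script that runs through large folders of copyright notice emails and extracts the relevant information (IP and timestamp). I've already found ways around a few small formatting hurdles (sometimes the IP and timestamp are on different lines, sometimes on the same line, sometimes in different places, timestamps come in 4 different formats, etc.).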

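I've run into one odd problem: a few of the files I'm parsing spew out strange characters in the middle of a line, which ruins my parsing of what readline() returns. In a text editor the line in question looks normal, but readline() reads an "=" and two "\n" characters right in the middle of an IP.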

For example:

Normal return from readline():
"IP Address: xxx.xxx.xxx.xxx"

Broken readline() return:
"IP Address: xxx.xxx.xxx="

The next two lines after that being:
""
".xxx"

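Any idea how I can get around this? I have no real control over whatever is causing the problem; I just need to deal with it without things getting too crazy.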

The relevant function, for reference (I know it's a mess):

import codecs
import re

def getIP(em):
    ce = codecs.open(em, encoding='latin1')
    iplabel = ""
    while not ("Torrent Hash Value: " in iplabel):
        iplabel = ce.readline()

    ipraw = ce.readline()
    if ("File Size" in ipraw):
        ipraw = ce.readline()

    ip = re.findall( r'[0-9]+(?:\.[0-9]+){3}', ipraw)
    if ip:
        return ip[0]
        ce.close()
    else:
        ipraw = ce.readline()
        ip = re.findall( r'[0-9]+(?:\.[0-9]+){3}', ipraw)
        if ip:
            return ip[0]
            ce.close()
        else:
            return ("No IP found in: " + ipraw)
            ce.close()

Best answer

It's likely that at least some of the emails you are processing have been encoded as quoted-printable.

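This encoding makes 8-bit character data transportable over 7-bit (ASCII-only) systems, but it also enforces a maximum line length of 76 characters. It does so by inserting soft line breaks, each consisting of an "=" followed by the end-of-line marker.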

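Python provides the quopri module to handle encoding to and decoding from quoted-printable. Decoding your data from quoted-printable will remove these soft line breaks.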
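To make the two effects concrete, here is a minimal sketch using only quopri.encodestring and quopri.decodestring. The sample inputs are made up for illustration and are not taken from the emails in the question.

```python
import quopri

# A byte outside 7-bit ASCII (0xE9, "é" in latin-1) becomes an "=XX" escape.
escaped = quopri.encodestring(b"caf\xe9")

# A literal "=" must be escaped too, since "=" introduces escapes.
equals = quopri.encodestring(b"a=b")

# A long line is wrapped with soft line breaks ("=" followed by "\n").
long_line = ("copyright notice " * 20).strip().encode("latin-1")
encoded = quopri.encodestring(long_line)
line_lengths = [len(line) for line in encoded.split(b"\n")]

# Decoding removes the soft line breaks and restores the original bytes.
round_trip = quopri.decodestring(encoded)

print(escaped, equals, max(line_lengths), round_trip == long_line)
```

Each encoded line, including its trailing "=", stays within the 76-character limit.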

As an example, let's use the first paragraph of your question.

>>> import quopri
>>> s = """I'm writing a small script to run through large folders of copyright notice emails and finding relevant information (IP and timestamp). I've already found ways around a few little formatting hurdles (sometimes IP and TS are on different lines, sometimes on same, sometimes in different places, timestamps come in 4 different formats, etc.)."""

>>> # Encode to latin-1 as quopri deals with bytes, not strings.
>>> bs = s.encode('latin-1')

>>> # Encode
>>> encoded = quopri.encodestring(bs)
>>> # Observe the "=\n" inserted into the text.
>>> encoded
b"I'm writing a small script to run through large folders of copyright notice=\n emails and finding relevant information (IP and timestamp). I've already f=\nound ways around a few little formatting hurdles (sometimes IP and TS are o=\nn different lines, sometimes on same, sometimes in different places, timest=\namps come in 4 different formats, etc.)."

>>> # Printing without decoding from quoted-printable shows the "=".
>>> print(encoded.decode('latin-1'))
I'm writing a small script to run through large folders of copyright notice=
 emails and finding relevant information (IP and timestamp). I've already f=
ound ways around a few little formatting hurdles (sometimes IP and TS are o=
n different lines, sometimes on same, sometimes in different places, timest=
amps come in 4 different formats, etc.).

>>> # Decode from quoted-printable to remove soft line breaks.
>>> print(quopri.decodestring(encoded).decode('latin-1'))
I'm writing a small script to run through large folders of copyright notice emails and finding relevant information (IP and timestamp). I've already found ways around a few little formatting hurdles (sometimes IP and TS are on different lines, sometimes on same, sometimes in different places, timestamps come in 4 different formats, etc.).
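The same decoding applies to the broken line from the question. The sketch below assumes the soft break is a plain "=" followed by "\n", and uses a made-up address from the documentation range (192.0.2.0/24) in place of the redacted one.

```python
import quopri
import re

# Hypothetical raw bytes: an IP address split in two by a soft line break.
raw = b"IP Address: 192.0.2=\n.17\n"

# Decoding joins the two halves back into a single line.
decoded = quopri.decodestring(raw).decode("latin-1")

# The regular expression from the question now matches the whole address.
ips = re.findall(r'[0-9]+(?:\.[0-9]+){3}', decoded)
print(repr(decoded), ips)
```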

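To decode correctly, the entire message body needs to be processed, which conflicts with your approach of using readline. One way around this is to load the decoded string into a buffer: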

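import io
import quopri

def getIP(em):
    # Read the raw bytes and decode quoted-printable first,
    # which removes the soft line breaks.
    with open(em, 'rb') as f:
        bs = f.read()
    decoded = quopri.decodestring(bs).decode('latin-1')

    # Wrap the decoded text in a buffer so readline() still works.
    ce = io.StringIO(decoded)
    iplabel = ""
    while not ("Torrent Hash Value: " in iplabel):
        iplabel = ce.readline()
    ...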
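For reference, one way the rest of the function might be filled in is sketched below. This is not the asker's or the answerer's code: the name get_ip, the marker and max_lines parameters, and the choice to return None when nothing is found are all assumptions made for illustration. It keeps the original idea of looking at the few lines after the "Torrent Hash Value: " line, and it also stops cleanly if that line never appears, where the original loop would spin forever because readline() returns "" at end of file.

```python
import io
import quopri
import re

IP_PATTERN = re.compile(r'[0-9]+(?:\.[0-9]+){3}')

def get_ip(path, marker="Torrent Hash Value: ", max_lines=3):
    """Return the first IP found within max_lines after the marker line, else None.

    Hypothetical helper: the parameter names and the None return are assumptions.
    """
    with open(path, 'rb') as f:
        decoded = quopri.decodestring(f.read()).decode('latin-1')

    lines = io.StringIO(decoded)
    for line in lines:
        if marker in line:
            break
    else:
        return None  # reached end of file without seeing the marker

    # Check the next few lines; a "File Size" line simply fails to match.
    for _ in range(max_lines):
        match = IP_PATTERN.search(lines.readline())
        if match:
            return match.group(0)
    return None
```

Since the soft line breaks are removed before any line is read, an address that was split across lines in the raw file is matched as a single token.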

If your files contain complete emails, including the headers, then the tools in the email module will handle this decoding automatically.

import email
from email import policy

with open('message.eml') as f:
    s = f.read()
msg = email.message_from_string(s, policy=policy.default)
body = msg.get_content()
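A self-contained variant of the same idea is sketched below, with a made-up message built in memory so that it runs without a file. It parses bytes with email.message_from_bytes, which sidesteps guessing a text encoding when opening the file, and uses get_body so that a multipart message would also yield its plain-text part. The headers, addresses, and IP are hypothetical.

```python
from email import message_from_bytes, policy

# Hypothetical message: the body is quoted-printable and contains a soft line break.
raw = (
    b"From: sender@example.com\n"
    b"To: receiver@example.com\n"
    b"Subject: Notice\n"
    b"MIME-Version: 1.0\n"
    b'Content-Type: text/plain; charset="iso-8859-1"\n'
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"IP Address: 192.0.2=\n"
    b".17\n"
)

msg = message_from_bytes(raw, policy=policy.default)

# get_body returns the best matching part, or None if there is no plain-text part.
part = msg.get_body(preferencelist=('plain',))
body = part.get_content() if part is not None else ""
print(body)
```

Here get_content applies the Content-Transfer-Encoding and charset declared in the headers, so the body comes back with the address on one line.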

Regarding python - Removing unwanted characters that break readline(), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55288102/
