python - Line breaks prevent BeautifulSoup from extracting data

Reposted. Author: 行者123. Updated: 2023-11-27 23:26:50

My script fetches HTML from an email inbox via imaplib, passes it to BeautifulSoup, and tries to extract every href in it.

# M is assumed to be an already authenticated imaplib connection
rv, data = M.search(None, '(FROM "foo@bar.com")')
if rv == 'OK':
    for num in data[0].split():
        typ, data = M.fetch(num, '(RFC822)')
        html = data[0][1]  # raw message bytes, still transfer-encoded

        soup = BeautifulSoup(html, 'lxml')
        for a in soup.find_all('a', href=True):
            print(a['href'])

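However, the html variable holds HTML that is wrapped with a line break every N characters, which prevents BeautifulSoup from returning the hrefs correctly, especially long ones that are split across lines.

Odd sequences such as =0D and =3D show up everywhere.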

messages, <a=0D
href=3D"http://links.google.com/wf/click?upn=3DOGGGYNMPA980E3DmngbHusD=
Uo-2BK17XLM3ogFJfQXXXfMWZLdsQSSVv33HbPoHPXGcH8tSf9ZFFU5i-2FrV4O6ISlpDCIVaN5=
83xr1CGoa5yxZimagE5JiSUAhbZH8P7WiNvf35BsXrCxmrmRLMGB-2BJAQ-3D-3D_IcMuwcQVVt=
a699aeVjRRVxwBCNHkXaWO-2FyIlAqZ7CPsryDB24UVYZbMIvGLJb13chayC-2FLeucv-2FTrko=
7LaiaWHkzy85DWXrK1olI1SEJZs-2BMCAWfoVfloGJivlLSH0GQk0XeVT0j383tZrsymuWLF0S2=
q5j3LR91e76dRXQe7p8t5CgrBe-2FqGk6bmURG9XCNw3dwpHnymaR-2FggHQx6GnbbueF7PVp2H=
-2BGoHUEkMOSXJ8FfSgQIiGICvxz1zcBJPw-2FRoE3YDl-2By8XETkXjVaNchNA1ZN8FDCD5VUf=
V9oUOnavAirXX-2FEw1THfSpV4VYDX">unsubscribe</a></td>=0D
</tr>=0D
<tr>=0D
<td height=3D"12"></td>=0D
</tr>=0D

What can be done to fix this?

Best answer

You can use quopri to decode the quoted-printable data:

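Quoted-Printable, or QP encoding, is an encoding that uses printable ASCII characters (alphanumerics and the equals sign "=") to transmit 8-bit data over a 7-bit data path or, more generally, over a medium that is not 8-bit clean. It is defined as a MIME content transfer encoding for use in email.

QP works by using the equals sign "=" as an escape character. It also limits line length to 76 characters, because some software has limits on line length.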
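To make the two rules above concrete, here is a minimal sketch using only the standard quopri module. The byte strings and the example.com URL are made-up samples, not data from the question.

```python
import quopri

# "=XX" escapes the byte whose hex value is XX.
equals_sign = quopri.decodestring(b"href=3D")      # "=3D" stands for "="
carriage_return = quopri.decodestring(b"<tr>=0D")  # "=0D" stands for "\r"

# A "=" at the very end of a line is a soft line break:
# the decoder drops both the "=" and the newline that follows it.
joined = quopri.decodestring(b"click?upn=3DOGGG=\nYNMP")

# Encoding goes the other way (hypothetical sample with one long line).
raw = b'<a href="http://example.com/?q=' + b"x" * 100 + b'">link</a>\n'
encoded = quopri.encodestring(raw)
round_trip = quopri.decodestring(encoded)

print(equals_sign, carriage_return, joined)
print(encoded.decode("ascii"))
```

The printed encoded text shows the long line folded with trailing "=" characters, which is the same pattern that splits the href in the question.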

html = """<a=0D
href=3D"http://links.google.com/wf/click?upn=3DOGGGYNMPA980E3DmngbHusD=
Uo-2BK17XLM3ogFJfQXXXfMWZLdsQSSVv33HbPoHPXGcH8tSf9ZFFU5i-2FrV4O6ISlpDCIVaN5=
83xr1CGoa5yxZimagE5JiSUAhbZH8P7WiNvf35BsXrCxmrmRLMGB-2BJAQ-3D-3D_IcMuwcQVVt=
a699aeVjRRVxwBCNHkXaWO-2FyIlAqZ7CPsryDB24UVYZbMIvGLJb13chayC-2FLeucv-2FTrko=
7LaiaWHkzy85DWXrK1olI1SEJZs-2BMCAWfoVfloGJivlLSH0GQk0XeVT0j383tZrsymuWLF0S2=
q5j3LR91e76dRXQe7p8t5CgrBe-2FqGk6bmURG9XCNw3dwpHnymaR-2FggHQx6GnbbueF7PVp2H=
-2BGoHUEkMOSXJ8FfSgQIiGICvxz1zcBJPw-2FRoE3YDl-2By8XETkXjVaNchNA1ZN8FDCD5VUf=
V9oUOnavAirXX-2FEw1THfSpV4VYDX">unsubscribe</a></td>=0D
</tr>=0D
<tr>=0D
<td height=3D"12"></td>=0D
</tr>=0D"""


from bs4 import BeautifulSoup
import quopri

soup = BeautifulSoup(quopri.decodestring(html), "lxml")
print(soup)
print(soup.select_one("a")["href"])

This outputs:

<html><body><a href="http://links.google.com/wf/click?upn=OGGGYNMPA980E3DmngbHusDUo-2BK17XLM3ogFJfQXXXfMWZLdsQSSVv33HbPoHPXGcH8tSf9ZFFU5i-2FrV4O6ISlpDCIVaN583xr1CGoa5yxZimagE5JiSUAhbZH8P7WiNvf35BsXrCxmrmRLMGB-2BJAQ-3D-3D_IcMuwcQVVta699aeVjRRVxwBCNHkXaWO-2FyIlAqZ7CPsryDB24UVYZbMIvGLJb13chayC-2FLeucv-2FTrko7LaiaWHkzy85DWXrK1olI1SEJZs-2BMCAWfoVfloGJivlLSH0GQk0XeVT0j383tZrsymuWLF0S2q5j3LR91e76dRXQe7p8t5CgrBe-2FqGk6bmURG9XCNw3dwpHnymaR-2FggHQx6GnbbueF7PVp2H-2BGoHUEkMOSXJ8FfSgQIiGICvxz1zcBJPw-2FRoE3YDl-2By8XETkXjVaNchNA1ZN8FDCD5VUfV9oUOnavAirXX-2FEw1THfSpV4VYDX">unsubscribe</a>
<tr>
<td height="12"></td>
</tr> </body></html>
http://links.google.com/wf/click?upn=OGGGYNMPA980E3DmngbHusDUo-2BK17XLM3ogFJfQXXXfMWZLdsQSSVv33HbPoHPXGcH8tSf9ZFFU5i-2FrV4O6ISlpDCIVaN583xr1CGoa5yxZimagE5JiSUAhbZH8P7WiNvf35BsXrCxmrmRLMGB-2BJAQ-3D-3D_IcMuwcQVVta699aeVjRRVxwBCNHkXaWO-2FyIlAqZ7CPsryDB24UVYZbMIvGLJb13chayC-2FLeucv-2FTrko7LaiaWHkzy85DWXrK1olI1SEJZs-2BMCAWfoVfloGJivlLSH0GQk0XeVT0j383tZrsymuWLF0S2q5j3LR91e76dRXQe7p8t5CgrBe-2FqGk6bmURG9XCNw3dwpHnymaR-2FggHQx6GnbbueF7PVp2H-2BGoHUEkMOSXJ8FfSgQIiGICvxz1zcBJPw-2FRoE3YDl-2By8XETkXjVaNchNA1ZN8FDCD5VUfV9oUOnavAirXX-2FEw1THfSpV4VYDX

If you print the characters with hex codes 3D and 0D (the equals sign and the carriage return), it all makes sense:

In [4]: print("\x3D")
=

In [5]: print("\x0D")


In [6]:
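The answer runs quopri on the text directly. When the input is a whole RFC822 message, as returned by M.fetch in the question, one possible alternative is to let the standard email package parse the message and undo the transfer encoding per part. This is only a sketch under assumptions: the function name extract_html_parts and the sample message are hypothetical, and a real mailbox may contain multipart messages or other charsets.

```python
import email


def extract_html_parts(raw_message):
    """Return the decoded text/html bodies found in a raw RFC822 message (bytes)."""
    msg = email.message_from_bytes(raw_message)
    html_parts = []
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # decode=True reverses the Content-Transfer-Encoding
            # (quoted-printable or base64) and returns bytes.
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            html_parts.append(payload.decode(charset, errors="replace"))
    return html_parts


# Hypothetical sample message, shaped like the data in the question.
sample = (
    b"From: foo@bar.com\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b'<a=0D\r\nhref=3D"http://example.com/?a=3D1&b=\r\n'
    b'=3D2">unsubscribe</a>=0D\r\n'
)

decoded_html = extract_html_parts(sample)
print(decoded_html[0])
```

Each returned string can then be passed to BeautifulSoup(html, 'lxml') in the same way as in the script from the question.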

This post on python - Line breaks prevent BeautifulSoup from extracting data is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/38166697/
