python - How do I extract URLs from hyperlinks in an email I received?

Reposted. Author: 行者123. Updated: 2023-12-04 10:07:07

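I'm trying to use BeautifulSoup to extract URLs from my email. When I fetch the raw HTML with a GET request through the Google API, this is what I get (I've removed sensitive information and replaced it with a's and 1's). In the middle of it, href=3D" followed by a URL is the URL I need. It spans 2 lines, but when I copy and paste it (removing the =) it is the correct URL.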

<html><head></head><body><div class=3D"ydp20dc8582yahoo-style-wrap" style=
=3D"font-family:Helvetica Neue, Helvetica, Arial, sans-serif;font-size:13px=
;"><div></div>
<div><br></div><div><br></div>
=20
</div><div id=3D"ydp475be88byahoo_quoted_8442876516" class=3D"ydp47=
5be88byahoo_quoted">
<div style=3D"font-family:'Helvetica Neue', Helvetica, Arial, s=
ans-serif;font-size:13px;color:#26882a;">
<div>----- Forwarded Message -----</div>
<div><b>From:</b> auto-confirm@aaaaaaaaaaaaaaaaaaaaaaa.com =
&lt;auto-confirm@aaaaaaaaaaaaaaaaaaaaaaa.com&gt;</div><div><b>To:</b> "aaaa=
aaaa@yahoo.com" &lt;aaaaaaaa@yahoo.com&gt;</div><div><b>Sent:</b> Thursday,=
April 23, 2020, 1:39:28 PM CDT</div><div><b>Subject:</b> You chose a Virtu=
aaaaaaaaaaaa!</div><div><br></div>
<div><div id=3D"ydp475be88byiv6890824975"><div><p> Hello aa=
aaaaaaaaaa, </p><p> Thanks for visiting <a href=3D"https://www.aaaaaaaaaaaa=
aaaaaaaaaaa.com/token/111111111aaaaa11111aaaa111111111" rel=3D"nofollow" ta=
rget=3D"_blank">https://www.aaaaaaaaaaaaaaaaaaaaaaa.com</a>. You recently s=
elected a aaaaaaaaaaaaaaaaaaaaaaaaaaaa. </p><p><a href=3D"https://www.aaaaa=
aaaaaaaaaaaaaaaaaa.com/token/111111111aaaaa11111aaaa111111111" rel=3D"nofol=
low" target=3D"_blank">Click here</a> to aaaaaaaaaaaaaaaaaaaaaaaa details, =
spend history and more. <br>Enjoy aaaaaaaaa!</p><p> https://www.aaaaaaaaaaa=
aaaaaaaaaaaa.com </p><p>Digital token: 1111-111111-1111</p><hr><p>Please do=
n=E2=80=99t reply to this email. If you have questions, please <a href=3D"h=
ttps://www.aaaaaaaaaaaaaaaaaaaaaaaaa.com/ContactUs" rel=3D"nofollow" target=
=3D"_blank"> click here. </a></p></div></div></div>
</div>
</div></body></html>

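I need to extract the URL from the href attribute that spans 2 lines. When I turn the HTML into a BeautifulSoup object, it seems to cut off all the tags at the = sign. This is what is displayed when I assign the above to a BeautifulSoup object and then print it.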
<html><head></head><body><div arial="" class='3D"ydp20dc1111yahoo-style-wrap"' helvetica="" 
neue="" sans-serif="" style='=3D"font-family:Helvetica'><div></div>
<div><br/></div><div><br/></div>
=20
</div><div class='3D"ydp47=' id='3D"ydp47511111yahoo_quoted_8445876516"'>
<div arial="" helvetica="" neue="" s='ans-serif;font-size:13px;color:#26282a;"'
style="3D&quot;font-family:'Helvetica">
<div>----- Forwarded Message -----</div>
<div><b>From:</b> auto-confirm@aaaaaaaaaaaaaaaaaaaa.com =
&lt;auto-confirm@aaaaaaaaaaaaaaaaaaaaaaa.com&gt;</div><div><b>To:</b> "aaaa=
aaaa@yahoo.com" &lt;aaaaaaaa@yahoo.com&gt;</div><div><b>Sent:</b> Thursday,=
April 23, 2020, 1:39:28 PM CDT</div><div><b>Subject:</b> You chose a Virtu=
aaaaaaaaaaa!</div><div><br/></div>
<div><div id='3D"ydp475be88byiv6890824975"'><div><p> Hello aa=
aaaaaaaaa, </p><p> Thanks for visiting <a alsolutions.com=""
href='3D"https://www.aaaaaaaaaaaa=' rel='3D"nofollow"'
ta='rget=3D"_blank"'>https://www.aaaaaaaaaaaaaaaaaaaaaaaaa.com</a>. You recently s=
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa. </p><p><a href='3D"https://www.aaaaa='
aaaaaaaaaaaaaaaaaaa.com="" low="" rel='3D"nofol=' target='3D"_blank"'>Click here</a> to
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa =
aaaaaaaaaaaaaaaaaaaa. <br/>Enjoy aaaaaaaaaaaaa</p><p> https://www.aaaaaaaaaaaaaaa=
aaaaaaaaaaaaaa.com </p><p>Digital token: aaaa-aaaaaa-aaaa</p><hr/><p>Please do=
n=E2=80=99t reply to this email. If you have questions, please <a href='3D"h='
rel='3D"nofollow"' target='=3D"_blank"' ttps:=""> click here. </a></p></div></div></div>
</div>
</div></body></html>

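As you can see, BeautifulSoup seems to lose the URL at the point where the Google API splits it. I don't know why Google's API breaks it up like this. Here is the code I use to pull the HTML out of my email.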
for item in msg_id:
    message = service.users().messages().get(userId=user_id, id=item, format='raw').execute()
    msg_raw = base64.urlsafe_b64decode(message['raw'].encode('ASCII'))
    msg_str = email.message_from_bytes(msg_raw)
    content_types = msg_str.get_content_maintype()
    if content_types == 'multipart':
        part1, part2 = msg_str.get_payload()
        # print(part2.get_payload())
        return part2.get_payload()
    else:
        return msg_str.get_payload()

Any help on how to change my Google API request or my BeautifulSoup call would be much appreciated. Thanks in advance.

Edit: I did what @fedeCalendino suggested, and this is the output. It still splits the URL across 2 lines, with an = in between.
  soup = BeautifulSoup(content)
[<a href="https://www.aaaaaaaaaaaaaaa=
aaaaaaaaaaaaa.com/token/aaaaaaa111111111aaaaaaaaaa11111111" rel="nofollow"
ta='rget="_blank"'>https://www.aaaaaaaaaaaaaaaaaaaaaaaaaa.com</a>, <a
href="https://www.aaaaa=
iddigitalsolutions.com/token/aaaaaaa111111111aaaaaaa1111111" rel="nofol=
low" target="_blank">Click here</a>, <a href="h=
ttps://www.aaaaaaaaaaaaaaaaaaaaaaaaa.com/ContactUs" rel="nofollow"
target='="_blank"'> click here. </a>]

Best answer

You can clean the content first, before passing it to BeautifulSoup.

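from bs4 import BeautifulSoup

content = google_api.get_email()  # placeholder for however you fetch the HTML body
content = content.replace("=3D", "=")  # undo the quoted-printable escape for "="

soup = BeautifulSoup(content, "html.parser")
all_as = soup.find_all("a")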
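
The =3D sequences, the =E2=80=99 bytes, and the lines ending in a bare = in the question's HTML all look like quoted-printable transfer encoding, where a trailing = marks a soft line break. Replacing only =3D leaves those soft breaks in place, which would explain why the edited output still shows URLs split in two. Below is a minimal sketch of decoding the body fully with Python's standard library before parsing. The function names decode_quoted_printable and html_from_raw_message are made up for illustration, and the sketch assumes the HTML part really is quoted-printable (or declares its transfer encoding in its headers).

```python
import email
import quopri


def decode_quoted_printable(text: str, charset: str = "utf-8") -> str:
    """Decode a quoted-printable string.

    '=3D' becomes '=', '=XX' byte escapes are restored, and a '=' at the
    end of a line (a soft line break) is removed so wrapped lines rejoin.
    """
    raw = quopri.decodestring(text.encode("utf-8"))
    return raw.decode(charset, errors="replace")


def html_from_raw_message(raw_bytes: bytes) -> str:
    """Return the first text/html part of a raw RFC 822 message, decoded.

    get_payload(decode=True) undoes the Content-Transfer-Encoding declared
    on the part, so no manual '=3D' replacement is needed afterwards.
    """
    msg = email.message_from_bytes(raw_bytes)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return ""  # no HTML part found


if __name__ == "__main__":
    # Hypothetical sample shaped like the question's output.
    sample = (
        'Thanks for visiting <a href=3D"https://www.example.com/to=\n'
        'ken/abc123" rel=3D"nofol=\n'
        'low">Click here</a>'
    )
    print(decode_quoted_printable(sample))
```

Either decoded string can then be handed to BeautifulSoup exactly as in the answer above. In the question's own code, calling get_payload(decode=True) on the HTML part instead of get_payload() is likely the smallest change, since the raw message already carries the transfer-encoding header.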

Regarding "python - How do I extract URLs from hyperlinks in an email I received?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/61529755/
