
python - How to extract text outside the <em> tag in BeautifulSoup

Reposted. Author: 太空宇宙. Updated: 2023-11-04 05:25:53

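Can anyone help me extract the text that follows "From"? I want to extract the sender's name. It sits outside the em tag. I am using the Python BeautifulSoup package.

Here is the page link: http://seclists.org/fulldisclosure/2016/Jan/0

I was able to extract the email title successfully because it is inside a tag. The HTML page has no other divs or classes.

Here is the page's HTML:

Here is what I tried: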

import requests
from bs4 import BeautifulSoup


def title_spider(max_pages):
    page = 0
    while page <= max_pages:
        url = 'http://seclists.org/fulldisclosure/2016/Jan/' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for email_title in soup.find('b'):
            title = email_title.string
            print(title)

        for date_stamp in soup.em:
            date = date_stamp
            print(date)
        page += 1

title_spider(2)

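Best answer

You want the next sibling of the em tag: the sender text is a bare text node that immediately follows it. If you want the sender and the date from specific em tags, you can combine this with a regular expression: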

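import re

import requests
from bs4 import BeautifulSoup


def title_spider(max_pages):
    for page in range(max_pages + 1):
        url = 'http://seclists.org/fulldisclosure/2016/Jan/{}'.format(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        # The title is the content of the first <b> tag on the page.
        for email_title in soup.find('b'):
            title = email_title.string
            print(title)

        # Match only the <em> labels "From" and "Date"; the value is the
        # text node right after each tag. string= is the current name of
        # the older text= keyword (Beautiful Soup 4.4.0+).
        for em in soup.find_all("em", string=re.compile("From|Date")):
            print(em.text, em.next_sibling)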

This gives you:

In [5]: title_spider(2)
Alcatel Lucent Home Device Manager - Management Console Multiple XSS
From : Uğur Cihan KOÇ <u.cihan.koc () gmail com>
Date : Sun, 3 Jan 2016 13:20:53 +0200
Executable installers/self-extractors are vulnerable^WEVIL (case 17): Kaspersky Labs utilities
From : "Stefan Kanthak" <stefan.kanthak () nexgo de>
Date : Sun, 3 Jan 2016 16:12:50 +0100
Possible vulnerability in F5 BIG-IP LTM - Improper input validation of the HTTP version number of the HTTP reqest allows any payload size and conent to pass through
From : Eitan Caspi <eitanc () yahoo com>
Date : Sun, 3 Jan 2016 21:10:27 +0000 (UTC)
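
To try the next-sibling idea without fetching the page, here is a minimal sketch that runs on an inline HTML fragment. The fragment, the sender "Jane Doe", and the helper name extract_headers are assumptions made for illustration; the markup is only modeled on the output above and the real page may differ. The sketch also strips the leading ": " so that only the sender remains.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the output above; the real page may differ.
SAMPLE_HTML = (
    '<b>Example advisory title</b><br>'
    '<em>From</em>: Jane Doe &lt;jane () example com&gt;<br>'
    '<em>Date</em>: Sun, 3 Jan 2016 13:20:53 +0200<br>'
)


def extract_headers(html):
    """Map each <em> label to the bare text node that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    headers = {}
    for em in soup.find_all("em"):
        sibling = em.next_sibling
        # next_sibling may be None or another tag; keep only plain text nodes
        # (NavigableString is a subclass of str).
        if isinstance(sibling, str):
            # Drop the ": " separator and any surrounding whitespace.
            headers[em.get_text(strip=True)] = sibling.lstrip(": ").strip()
    return headers


headers = extract_headers(SAMPLE_HTML)
print(headers["From"])
print(headers["Date"])
```

On this fragment the two print calls show Jane Doe <jane () example com> and Sun, 3 Jan 2016 13:20:53 +0200. Checking that the sibling is a plain string guards against an em tag that is followed directly by another tag or by nothing at all.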

Regarding "python - How to extract text outside the <em> tag in BeautifulSoup", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/38725870/
