python - How to extract links from a page using Beautiful Soup


I have an HTML page with multiple divs, like:

<div class="post-info-wrap">
<h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-1st-post/" title="Example of 1st post &#8211; Example 1 Post" rel="bookmark">sample post &#8211; example 1 post</a></h2>
<div class="post-meta clearfix">

<div class="post-info-wrap">
<h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-2nd-post/" title="Example of 2nd post &#8211; Example 2 Post" rel="bookmark">sample post &#8211; example 2 post</a></h2>
<div class="post-meta clearfix">

I need to get the values of all the divs with the post-info-wrap class. I am new to BeautifulSoup.

So I need those URLs.

I have tried:

import re
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.example.com/blog/author/abc")
data = r.content # Content of response

soup = BeautifulSoup(data, "html.parser")
for link in soup.select('.post-info-wrap'):
    print(link.find('a').attrs['href'])

This code doesn't seem to work. I am new to Beautiful Soup. How can I extract the links?

Best Answer

link = i.find('a', href=True) will not always return an anchor tag (a); it may return None. So you need to check whether link is None and, if it is, continue the for loop; otherwise, get the link's href value.

Scrape the links by URL:

import re
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.example.com/blog/author/abc")
data = r.content # Content of response
soup = BeautifulSoup(data, "html.parser")

for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])
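
As a side note (an addition, not part of the original answer): BeautifulSoup's select method supports CSS attribute selectors, so a selector that matches only anchors carrying an href makes the None check unnecessary. A minimal sketch, reusing the soup object built above:

for a in soup.select('div.post-info-wrap a[href]'):
    # 'a[href]' matches only <a> tags that actually have an href attribute,
    # so each match can be printed directly, without a None check.
    print(a['href'])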

Scrape the links from an HTML string:

from bs4 import BeautifulSoup
html = '''<div class="post-info-wrap"><h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-1st-post/" title="Example of 1st post &#8211; Example 1 Post" rel="bookmark">sample post &#8211; example 1 post</a></h2><div class="post-meta clearfix">
<div class="post-info-wrap"><h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-2nd-post/" title="Example of 2nd post &#8211; Example 2 Post" rel="bookmark">sample post &#8211; example 2 post</a></h2><div class="post-meta clearfix">'''

soup = BeautifulSoup(html, "html.parser")

for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])

Update (presumably for pages whose content is rendered by JavaScript and is therefore invisible to requests: load the page with Selenium and parse the rendered source instead):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.get("https://www.example.com/blog/author/abc")

soup = BeautifulSoup(driver.page_source, "html.parser")

for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])

Output:

https://www.example.com/blog/911/article-1/
https://www.example.com/blog/911/article-2/
https://www.example.com/blog/911/article-3/
https://www.example.com/blog/911/article-4/
https://www.example.com/blog/random-blog/article-5/

For the Chrome browser:

http://chromedriver.chromium.org/downloads

Installing the web driver for the Chrome browser:

https://christopher.su/2015/selenium-chromedriver-ubuntu/

Selenium tutorial:

https://selenium-python.readthedocs.io/

where '/usr/bin/chromedriver' is the path to the Chrome webdriver.
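
A version note (an addition, not part of the original answer): Selenium 4 deprecated, and later removed, the positional driver-path argument used above. Assuming the same '/usr/bin/chromedriver' path, the Selenium 4 equivalent is to pass a Service object:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 style: wrap the chromedriver path in a Service object
# instead of passing it positionally to webdriver.Chrome.
driver = webdriver.Chrome(service=Service('/usr/bin/chromedriver'))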

Regarding python - how to extract links from a page using Beautiful Soup, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56421148/
