
python - Resolving relative and absolute links with Python

Reposted. Author: 搜寻专家. Updated: 2023-10-31 22:28:18

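This is a project for downloading images, audio, video, and so on. On some websites, though, I found that the links are not full URLs, only relative paths, and I don't know how to resolve these relative links.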

My full project is at:

https://github.com/MuneebKalathil/MaD

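Here is my sample link; I want to download all the images from it. The page shows thumbnails, but I don't want those. Clicking a thumbnail leads to the page for the original image, and that is the image I want to download.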

http://www.ragalahari.com/actress/14035/kajal-aggarwal-at-memu-saitham-dinner-with-stars.aspx

Part of the page source:

<tr>
<td id='pagingCell'>
</td>
</tr>
<tr>
<td align='center'><div id='galdiv' style='float:center;margin-right:3px;;margin-bottom:3px'>
<a href='/actress/14035/kajal-aggarwal-at-memu-saitham-dinner-with-stars/image1.aspx' ><img src="http://imgcdn.raagalahari.com/nov2014/starzone/kajal-agarwal-memu-saitham/kajal-agarwal-memu-saitham1t.jpg" alt="Kajal Aggarwal" title="Kajal Aggarwal at Dine with Stars Memu Saitham"></a>

I want to first get the relative link address:

/actress/14035/kajal-aggarwal-at-memu-saitham-dinner-with-stars/image1.aspx

and then find its absolute URL.

Best answer

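Define a base URL, find all img tags, and if a src attribute value does not start with http, use urljoin() (urllib.parse.urljoin in Python 3, urlparse.urljoin in Python 2) to join the base URL and the src.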
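To see what urljoin() does with the kinds of links found on this page, here is a small offline sketch. The helper name to_absolute and the document-relative and protocol-relative links are made up for illustration; the page URL, the root-relative href, and the thumbnail URL come from the question.

```python
from urllib.parse import urljoin

# Page URL from the question; links are resolved relative to it.
page_url = ('http://www.ragalahari.com/actress/14035/'
            'kajal-aggarwal-at-memu-saitham-dinner-with-stars.aspx')


def to_absolute(link, base=page_url):
    """Hypothetical helper: resolve link against base with urljoin()."""
    return urljoin(base, link)


# Root-relative href from the question: the path of the base is replaced.
gallery = to_absolute(
    '/actress/14035/kajal-aggarwal-at-memu-saitham-dinner-with-stars/image1.aspx')

# Already-absolute URL from the question: returned unchanged.
thumb = to_absolute(
    'http://imgcdn.raagalahari.com/nov2014/starzone/'
    'kajal-agarwal-memu-saitham/kajal-agarwal-memu-saitham1t.jpg')

# Made-up document-relative link: only the last path segment of the base is replaced.
sibling = to_absolute('image1.aspx')

# Made-up protocol-relative link: the scheme is taken from the base.
cdn = to_absolute('//imgcdn.raagalahari.com/example.jpg')

for result in (gallery, thumb, sibling, cdn):
    print(result)
```

Because urljoin() leaves absolute URLs untouched, joining every link against the page URL covers all four cases. A site-root base such as 'http://www.ragalahari.com' gives the same result for root-relative links, but it would place a document-relative link like 'image1.aspx' directly under the root.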

Example, using requests and BeautifulSoup:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base_url = 'http://www.ragalahari.com'
url = 'http://www.ragalahari.com/actress/14035/kajal-aggarwal-at-memu-saitham-dinner-with-stars.aspx'

# Name the parser explicitly; omitting it makes BeautifulSoup emit a warning.
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Visit every <img> tag that has a src attribute.
for img in soup.find_all('img', src=True):
    src = img.get('src')
    # Resolve relative paths against the base URL.
    if not src.startswith('http'):
        src = urljoin(base_url, src)

    print(src)

Output:

http://icdn.raagalahari.com/images/ragalaharilogo.png
http://www.ragalahari.com/images/helpicon.png
http://www.ragalahari.com/images/rssicon.png
http://www.ragalahari.com/images/twittericon.png
http://www.ragalahari.com/images/facebookicon.png
http://www.ragalahari.com/images/searchicon.png
http://imgcdn.raagalahari.com/nov2014/starzone/kajal-agarwal-memu-saitham/kajal-agarwal-memu-saitham1t.jpg
http://imgcdn.raagalahari.com/nov2014/starzone/kajal-agarwal-memu-saitham/kajal-agarwal-memu-saitham2t.jpg
http://imgcdn.raagalahari.com/nov2014/starzone/kajal-agarwal-memu-saitham/kajal-agarwal-memu-saitham3t.jpg
http://imgcdn.raagalahari.com/nov2014/starzone/kajal-agarwal-memu-saitham/kajal-agarwal-memu-saitham4t.jpg
...

Update (code for getting the a links inside the gallery div):

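# Select the <a> tags nested inside <div id="galdiv">.
for a in soup.select('div#galdiv a'):
    link = a.get('href')
    # Resolve relative hrefs against the base URL.
    if not link.startswith('http'):
        link = urljoin(base_url, link)

    print(link)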
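The same extraction can be tried offline against the markup quoted in the question. The sketch below is not the answer's method: it swaps BeautifulSoup for the standard library's html.parser so that it runs without third-party packages. The class name GalleryLinkParser, the closing tags added to the snippet, and the extra link outside the div are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

page_url = ('http://www.ragalahari.com/actress/14035/'
            'kajal-aggarwal-at-memu-saitham-dinner-with-stars.aspx')

# Snippet based on the question's source. The closing tags and the
# '/other.aspx' link outside the div are added here for illustration.
sample_html = """
<a href='/other.aspx'>outside the gallery</a>
<div id='galdiv'>
<a href='/actress/14035/kajal-aggarwal-at-memu-saitham-dinner-with-stars/image1.aspx' ><img src="http://imgcdn.raagalahari.com/nov2014/starzone/kajal-agarwal-memu-saitham/kajal-agarwal-memu-saitham1t.jpg" alt="Kajal Aggarwal"></a>
</div>
"""


class GalleryLinkParser(HTMLParser):
    """Hypothetical parser: collects absolute hrefs of <a> tags inside div#galdiv."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.div_depth = 0  # greater than 0 while inside the galdiv element
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            # Enter galdiv, or track a nested div once already inside it.
            if self.div_depth or attrs.get('id') == 'galdiv':
                self.div_depth += 1
        elif tag == 'a' and self.div_depth and attrs.get('href'):
            self.links.append(urljoin(self.base_url, attrs['href']))

    def handle_endtag(self, tag):
        if tag == 'div' and self.div_depth:
            self.div_depth -= 1


parser = GalleryLinkParser(page_url)
parser.feed(sample_html)
for link in parser.links:
    print(link)
```

Only the link inside the gallery div is collected, and it is resolved to an absolute URL against the page URL.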

Regarding python - Resolving relative and absolute links with Python, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/27631243/
