
python - How to extract and download all images from a website using beautifulSoup?

Reposted · Author: 太空狗 · Updated: 2023-10-29 21:22:28

I am trying to extract and download all the images from a URL. I wrote a script:

import urllib2
import re
from os.path import basename
from urlparse import urlsplit

url = "http://filmygyan.in/katrina-kaifs-top-10-cutest-pics-gallery/"
urlContent = urllib2.urlopen(url).read()
# HTML image tag: <img src="url" alt="some_text"/>
imgUrls = re.findall('img .*?src="(.*?)"', urlContent)

# download all images
for imgUrl in imgUrls:
    try:
        imgData = urllib2.urlopen(imgUrl).read()
        fileName = basename(urlsplit(imgUrl)[2])
        output = open(fileName, 'wb')
        output.write(imgData)
        output.close()
    except:
        pass

I don't want only the images that are visible on this page. See this screenshot: http://i.share.pho.to/1c9884b1_l.jpeg. I want to get all of the images without clicking the "Next" button, but I don't know how to get the pictures inside the "next" class. What changes should I make to findall?

Best Answer

The following should extract all images from the given page and write them to the directory where the script is run.

import re
import requests
from bs4 import BeautifulSoup

site = 'http://pixabay.com'

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]

for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    if not filename:
        print("Regex didn't match with the url: {}".format(url))
        continue
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative;
            # if it is, prepend the base url, which also happens
            # to be the site variable at the moment.
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)
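A note on the relative-URL handling above: manually prepending the base URL breaks for protocol-relative sources (`//cdn...`) and for paths that don't start with `/`. A more robust sketch uses `urllib.parse.urljoin` from the standard library (the base URL below is the same `site` value as above, used purely for illustration):

```python
from urllib.parse import urljoin

def resolve_img_url(base, src):
    # urljoin handles absolute, relative, and protocol-relative
    # sources uniformly, so no manual 'http' check is needed.
    return urljoin(base, src)

# relative source gets the base prepended
print(resolve_img_url('http://pixabay.com', '/static/cat.jpg'))
# -> http://pixabay.com/static/cat.jpg

# an already-absolute source passes through unchanged
print(resolve_img_url('http://pixabay.com', 'http://cdn.example.com/dog.png'))
# -> http://cdn.example.com/dog.png
```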

Regarding "python - How to extract and download all images from a website using beautifulSoup?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/18408307/
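The question also asks how to reach images behind a "Next" button. If the next page is an ordinary link in the HTML (rather than loaded by JavaScript), one possible sketch is to follow it page by page. The `rel="next"` and link-text selectors below are assumptions, not taken from the actual site; inspect the real page's HTML to find the correct anchor:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def collect_img_urls(start_url, max_pages=5):
    """Collect image sources across pages by following 'next' links.

    The way the 'next' link is located here is an assumption;
    adapt the find() calls to the target site's markup.
    """
    urls, page = [], start_url
    for _ in range(max_pages):
        soup = BeautifulSoup(requests.get(page).text, 'html.parser')
        urls += [urljoin(page, img['src'])
                 for img in soup.find_all('img') if img.get('src')]
        # try a rel="next" anchor first, then a literal "Next" link
        nxt = soup.find('a', rel='next') or soup.find('a', string='Next')
        if not nxt or not nxt.get('href'):
            break
        page = urljoin(page, nxt['href'])
    return urls
```

If the gallery loads its next page with JavaScript, this approach won't see it; in that case the usual options are driving a real browser (e.g. Selenium) or finding the underlying request in the browser's network tab.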
