python - Automate the Boring Stuff - Image Site Downloader

Reposted. Author: 行者123. Updated: 2023-12-03 23:45:31
I'm working on a project from the book Automate the Boring Stuff. The task is as follows:

Image Site Downloader
Write a program that goes to a photo-sharing site like Flickr or Imgur, searches for a category of photos, and then downloads all the resulting images. You could write a program that works with any photo site that has a search feature.

Here is my code:

import requests, bs4, os, re

# The outerHTML file, which I got by right-clicking the <html> tag in
# the 'page source' and copying it
flickrFile = open('flickrHtml.html', encoding="utf8")

# Parsing the HTML document
flickrSoup = bs4.BeautifulSoup(flickrFile, 'html.parser')

# categoryElem holds the elements which have the image source inside
categoryElem = flickrSoup.select("a[class='overlay']")
# len(categoryElem) == 849

os.makedirs('FlickrImages', exist_ok=True)
for i in range(len(categoryElem)):

    # Regex searching for the href
    html = str(categoryElem[i])
    htmlRegex = re.compile(r'href.*/"')
    mo = htmlRegex.search(html)
    imageUrl = mo.group()

    imageUrl = imageUrl.replace('"', '')
    imageUrl = imageUrl.replace('href=', '')

    imageUrlFlickr = "https://www.flickr.com" + str(imageUrl)

    # Downloading the response object of the image URL
    res = requests.get(imageUrlFlickr)
    imageSoup = bs4.BeautifulSoup(res.text, 'html.parser')
    picElem = imageSoup.select('div[class="view photo-well-media-scrappy-view requiredToShowOnServer"] img')

    # Regex searching for the jpg file in the picElem HTML element
    html = str(picElem)
    htmlRegex = re.compile(r'//live.*\.jpg')
    mo = htmlRegex.search(html)
    try:
        imageUrlRegex = mo.group()
    except Exception as exc:
        print('There was a problem: %s' % (exc))
        continue

    res1 = requests.get('https:' + imageUrlRegex)
    try:
        res1.raise_for_status()
    except Exception as exc:
        print('There was a problem: %s' % (exc))

    # Downloading the jpg to my folder
    with open(os.path.join('FlickrImages', os.path.basename(imageUrlRegex)), 'wb') as imageFile:
        for chunk in res1.iter_content(100000):
            imageFile.write(chunk)
After looking at this question, I figured that to download all 4 million results for the image search "sea", I should copy (as described in that question's answer) the entire outerHTML. If I hadn't looked at that question and hadn't copied the full rendered HTML source (which my code reads in flickrFile = open('flickrHtml.html', encoding="utf8")), categoryElem would have ended up equal to 24, so I would have downloaded only 24 pictures instead of 849.
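As an aside, the two replace() calls in the loop above can be collapsed into a single regex capture group. A minimal offline sketch (the element markup here is hypothetical, modelled on the overlay anchors the code selects):

```python
import re

# hypothetical overlay element, as produced by str(categoryElem[i]) above
html = '<a class="overlay" href="/photos/someuser/1234567890/"></a>'

# capture just the path in one step, instead of matching href="…"
# and then stripping the quotes and the href= prefix afterwards
m = re.search(r'href="([^"]+)"', html)
imageUrl = "https://www.flickr.com" + m.group(1)
print(imageUrl)
```

Since categoryElem[i] is already a BeautifulSoup tag, categoryElem[i].get("href") would avoid the regex entirely.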

There are 4 million pictures, how do I download all of them, without having to download the HTML source to a separate file?


I was thinking of having my program do the following:
  • Get the URL of the first image in the search --> download the image --> get the URL of the next image --> download it... and so on, until there is nothing left to download.

  • I didn't take that approach because I couldn't work out how to reach the link to the first image. I tried to get its URL, but when I inspect the first image (or any other image) in the "photostream", it gives me a link to a specific user's photostream, not to the general photostream of the "sea" search.
    Here is the link for the photo stream Search
    It would be great if someone could also help me with that.
    Here is some code from someone doing the same task, but he only downloads the first 24 images, which appear in the raw, unrendered HTML.

Best Answer

If you want to use requests + BeautifulSoup, try the code below (it pages through the results by passing the page parameter):

import re, requests, threading, os
from bs4 import BeautifulSoup

def download_image(url):
    with open(os.path.basename(url), "wb") as f:
        f.write(requests.get(url).content)
    print(url, "download successfully")

original_url = "https://www.flickr.com/search/?text=sea&view_all=1&page={}"

pages = range(1, 5000)  # not sure how many pages here

for page in pages:
    concat_url = original_url.format(page)
    print("Now it is page", page)
    soup = BeautifulSoup(requests.get(concat_url).content, "lxml")
    soup_list = soup.select(".photo-list-photo-view")
    for element in soup_list:
        img_url = 'https:' + re.search(r'url\((.*)\)', element.get("style")).group(1)
        # the url looks like: https://live.staticflickr.com/xxx/xxxxx_m.jpg
        # if you want a clearer (and larger) picture, remove the "_m" at the end of the url.
        # To avoid blocking on I/O, download in a thread, passing the image url as the argument.
        threading.Thread(target=download_image, args=(img_url,)).start()
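The comment about the "_m" suffix can be expressed as a small helper. This is a sketch based only on the URL pattern shown in the comments above (the suffix position before the extension is assumed from those example URLs):

```python
import re

def full_size(url):
    # Flickr appends a size suffix (here "_m") just before the extension;
    # stripping it requests the larger default-size image
    return re.sub(r'_m(?=\.jpg$)', '', url)

print(full_size("https://live.staticflickr.com/65535/12345_m.jpg"))
```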

If you use Selenium, it may be even easier; sample code below:
from selenium import webdriver
import re, requests, threading, os

# download_image
def download_image(url):
    with open(os.path.basename(url), "wb") as f:
        f.write(requests.get(url).content)


driver = webdriver.Chrome()
original_url = "https://www.flickr.com/search/?text=sea&view_all=1&page={}"

pages = range(1, 5000)  # not sure how many pages here

for page in pages:
    concat_url = original_url.format(page)
    print("Now it is page", page)
    driver.get(concat_url)
    for element in driver.find_elements_by_css_selector(".photo-list-photo-view"):
        img_url = 'https:' + re.search(r'url\(\"(.*)\"\)', element.get_attribute("style")).group(1)
        # the url looks like: https://live.staticflickr.com/xxx/xxxxx_m.jpg
        # if you want a clearer (and larger) picture, remove the "_m" at the end of the url.
        # To avoid blocking on I/O, download in a thread, passing the image url as the argument.
        threading.Thread(target=download_image, args=(img_url, )).start()
It downloaded successfully on my machine.
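One caveat with starting a bare Thread per image: each page can spawn dozens of simultaneous downloads with nothing to cap them. A bounded pool is a safer variant. The sketch below runs offline by substituting a pure stand-in for download_image; in the real script you would pass download_image and the collected image URLs instead:

```python
from concurrent.futures import ThreadPoolExecutor

def fake_download(url):
    # stand-in for download_image(url): returns a value instead of writing a file
    return len(url)

urls = ["https://live.staticflickr.com/1/a_m.jpg",
        "https://live.staticflickr.com/2/bb_m.jpg"]

# at most 8 downloads in flight at once; map preserves input order
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fake_download, urls))
print(results)
```

The `with` block also waits for all workers to finish, which the bare-Thread version never does before moving to the next page.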

Regarding python - Automate the Boring Stuff - Image Site Downloader, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63035100/
