python - Scraping all results from a page with BeautifulSoup


**Update**
===================================================

OK folks, so far so good. I have code that lets me scrape the images, but it stores them in a strange way. It downloads the first 40 or so images, then creates another "kittens" folder inside the previously created "kittens" folder and starts over (downloading the same images as in the first folder). How can I change that? Here is the code:

from selenium import webdriver
from bs4 import BeautifulSoup as soup
import requests
import time
import os

image_tags = []

driver = webdriver.Chrome()
driver.get(url='https://www.pexels.com/search/kittens/')
last_height = driver.execute_script('return document.body.scrollHeight')

# Scroll to the bottom repeatedly until the page height stops growing,
# i.e. no more images are being lazy-loaded.
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(1)
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    else:
        last_height = new_height

sp = soup(driver.page_source, 'html.parser')

for img_tag in sp.find_all('img'):
    image_tags.append(img_tag)

if not os.path.exists('kittens'):
    os.makedirs('kittens')

os.chdir('kittens')

x = 0

for image in image_tags:
    try:
        url = image['src']
        source = requests.get(url)
        with open('kitten-{}.jpg'.format(x), 'wb') as f:
            f.write(source.content)  # reuse the response instead of fetching the URL twice
        x += 1
    except (KeyError, requests.RequestException):
        pass
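
One likely cause of the nested folders: os.chdir('kittens') is relative to whatever the current working directory happens to be, so if this code runs a second time in the same session (for example in a notebook or IDE console where the working directory persists between runs), the second pass is already inside kittens and creates kittens/kittens. A minimal sketch that avoids chdir by resolving one absolute directory up front; save_dir is an illustrative name, and image_tags is assumed to be the list collected above:

import os
import requests

# Build the target directory once, next to this script, and never chdir.
save_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'kittens')
os.makedirs(save_dir, exist_ok=True)

for x, image in enumerate(image_tags):
    src = image.get('src')
    if not src:
        continue
    response = requests.get(src)
    if response.status_code == 200:
        # os.path.join keeps every file inside the one absolute folder.
        with open(os.path.join(save_dir, 'kitten-{}.jpg'.format(x)), 'wb') as f:
            f.write(response.content)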

===============================================================================

I am trying to write a spider that scrapes images of kittens from a page. I have a small problem: my spider only gets the first 15 images. I know this is probably because the page loads more images as it is scrolled down. How can I fix this? Here is the code:

import requests
from bs4 import BeautifulSoup as bs
import os


url = 'https://www.pexels.com/search/cute%20kittens/'

page = requests.get(url)
soup = bs(page.text, 'html.parser')

image_tags = soup.findAll('img')

if not os.path.exists('kittens'):
    os.makedirs('kittens')

os.chdir('kittens')

x = 0

for image in image_tags:
    try:
        url = image['src']
        source = requests.get(url)
        if source.status_code == 200:
            # The with block closes the file; no explicit f.close() is needed.
            with open('kitten-' + str(x) + '.jpg', 'wb') as f:
                f.write(source.content)  # reuse the response already fetched above
            x += 1
    except (KeyError, requests.RequestException):
        pass
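
Before reaching for a full browser, it can also be worth checking whether the lazily loaded entries already carry their real URL somewhere in the static markup; many sites put it in an attribute such as data-src or data-srcset. This is only an assumption to verify against the actual page source, not something confirmed for pexels.com:

# Hedged sketch: prefer src, then fall back to common lazy-loading attributes.
image_urls = []
for img in soup.findAll('img'):
    candidate = img.get('src') or img.get('data-src') or img.get('data-srcset')
    if candidate:
        # srcset-style values can be 'url 500w, url2 1000w'; keep the first URL.
        image_urls.append(candidate.split()[0])

If nothing useful shows up in the static HTML, browser automation as in the answer below is the reliable route.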

Best answer

Since the site is dynamic, you need a browser automation tool such as selenium:

from selenium import webdriver
from bs4 import BeautifulSoup as soup
import time
import os

driver = webdriver.Chrome()
driver.get('https://www.pexels.com/search/cute%20kittens/')
last_height = driver.execute_script("return document.body.scrollHeight")

# Scroll until the page height stops changing, so every image is loaded.
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(0.5)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

image_urls = [i['src'] for i in soup(driver.page_source, 'html.parser').find_all('img') if i.get('src')]

if not os.path.exists('kittens'):
    os.makedirs('kittens')
os.chdir('kittens')

# Open in write mode; the default ('r') would raise an error on f.write.
with open('kittens.txt', 'w') as f:
    for url in image_urls:
        f.write('{}\n'.format(url))
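
The snippet above only records the URLs in kittens.txt. A short follow-up sketch, assuming the file was written as above and that the URLs are directly fetchable, that downloads each one with requests:

import requests

with open('kittens.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

for x, url in enumerate(urls):
    try:
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            with open('kitten-{}.jpg'.format(x), 'wb') as out:
                out.write(response.content)
    except requests.RequestException:
        pass  # skip URLs that fail to download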

Regarding python - Scraping all results from a page with BeautifulSoup, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49088880/
