
python - How do I scrape images from an aspx page?

Reposted | Author: 可可西里 | Updated: 2023-11-01 13:02:26

I am trying to scrape images from an aspx page. I have this code that scrapes images from a normal web page, but it fails on the aspx page, because for that I need to send an HTTP POST request to the page, and even after reading several threads I cannot figure out how to do it. Here is the original code:

from bs4 import BeautifulSoup as bs
import urlparse
import urllib2
from urllib import urlretrieve
import os
import sys
import subprocess
import re


def thefunc(url, out_folder):

    c = False

I have defined the headers for the aspx page and an if statement that distinguishes a normal page from an aspx page:

    select = raw_input('Is this a .net aspx page ? y/n : ')
    if select.lower().startswith('y'):
        usin = raw_input('Specify origin of .net page : ')
        usaspx = raw_input('Specify aspx page url : ')

Headers for the aspx page:

        headdic = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Origin': usin,
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',
            'Content-Type': 'application/x-www-form-urlencoded',
            'Referer': usaspx,
            'Accept-Encoding': 'gzip,deflate,sdch',
            'Accept-Language': 'en-US,en;q=0.8',
            'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
        }
        c = True

    if c:
        req = urllib2.Request(url, headers=headdic)
    else:
        req = urllib2.Request(url, headers={'User-Agent': "Magic Browser"})
    resp = urllib2.urlopen(req)

    soup = bs(resp, 'lxml')

    parsed = list(urlparse.urlparse(url))

    print '\n', len(soup.findAll('img')), 'images are about to be downloaded'

    for image in soup.findAll("img"):
        print "Image: %(src)s" % image
        filename = image["src"].split("/")[-1]
        parsed[2] = image["src"]
        outpath = os.path.join(out_folder, filename)
        try:
            if image["src"].lower().startswith("http"):
                urlretrieve(image["src"], outpath)
            else:
                urlretrieve(urlparse.urlunparse(parsed), outpath)
        except:
            print 'OOPS missed one for some reason !!'


try:
    put = raw_input('Please enter the page url : ')
    reg1 = re.compile('^https?://', re.IGNORECASE)
    if not reg1.match(put):
        raise ValueError('not a url')
except:
    print('Type the url carefully !!')
    sys.exit()
fol = raw_input('Enter the foldername to save the images : ')
if not os.path.isdir(fol):
    os.mkdir(fol)
thefunc(put, fol)

I made some changes for the aspx detection and for building the headers for the aspx page, but I am stuck on what to modify next.

***Here is the aspx page link:*** http://www.foxrun.com.au/Products/Cylinders_with_Gadgets.aspx

Sorry if I was not clear; as you can see I am new to programming. What I am asking is how to get the images the aspx page serves when I click the next-page button in the browser. Right now I can only scrape one page, because the URL does not change between pages: I would have to somehow send an HTTP POST telling the page to show the next page with the new pictures, since the URL stays the same. I hope that is clear.
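To see why a plain GET always returns the same page, it helps to look at the hidden form fields an ASP.NET page embeds: the next-page button triggers a POST that echoes those fields back. Below is a minimal sketch (Python 3 with BeautifulSoup; the helper name and the sample HTML are mine, not from the question) that collects them from a page's markup:

```python
from bs4 import BeautifulSoup


def hidden_fields(html):
    """Return {name: value} for every hidden <input>, e.g. __VIEWSTATE."""
    soup = BeautifulSoup(html, "html.parser")
    return {inp["name"]: inp.get("value", "")
            for inp in soup.select("input[type=hidden]")
            if inp.get("name")}


sample = '''
<form>
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="def456" />
  <input type="text" name="q" value="ignored" />
</form>
'''
print(hidden_fields(sample))  # the dict a paging POST has to send back
```

These collected fields, plus the id of the button being "clicked" (`__EVENTTARGET`), are exactly what the answer below posts to the page.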

Best Answer

You can do it with requests by POSTing the right data to the url, data you can parse from the initial page:

import requests
from bs4 import BeautifulSoup
from urlparse import urljoin
from itertools import chain

url = "http://www.foxrun.com.au/Products/Cylinders_with_Gadgets.aspx"


def validate(soup):
    # hidden ASP.NET fields that must be echoed back with every postback
    return {"__VIEWSTATE": soup.select_one("#__VIEWSTATE")["value"],
            "__VIEWSTATEGENERATOR": soup.select_one("#__VIEWSTATEGENERATOR")["value"],
            "__EVENTVALIDATION": soup.select_one("#__EVENTVALIDATION")["value"]}


def parse(base, url):
    data = {"__ASYNCPOST": "true"}
    h = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'}
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    # gets links for the pager buttons < 1, 2, 3, 4, 5, 6 >
    pages = [a["id"] for a in soup.select("a[id^=ctl01_ctl00_pbsc1_pbPagerBottom_btnP]")][2:]
    # get images from the initial page
    yield [urljoin(base, img["src"]) for img in soup.select("img")]
    # add tokens for the post
    data.update(validate(soup))
    for p in pages:
        # we need $ in place of _ for the form data
        data["__EVENTTARGET"] = p.replace("_", "$")
        data["RadScriptManager1"] = "ctl01$ctl00$pbsc1$ctl01$ctl00$pbsc1$ajaxPanel1Panel|{}".format(p.replace("_", "$"))
        r = requests.post(url, data=data, headers=h).text
        soup = BeautifulSoup(r, "lxml")
        yield [urljoin(base, img["src"]) for img in soup.select("img")]


for url in chain.from_iterable(parse("http://www.foxrun.com.au/", url)):
    print(url)

This will give you the links; you just need to download the content and write it to file. Normally we could create a Session and go from one page to the next, but in this case what gets posted is ctl01$ctl00$pbsc1$pbPagerBottom$btnNext, which works fine going from the initial page to the second, but there is no notion of going from the second page to the third and so on, because we have no page number in the form data.
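The download step the answer leaves to the reader could be sketched like this (Python 3; the helper names and the injectable fetch callable are my own, not part of the answer):

```python
import os
from urllib.parse import urlparse


def filename_for(url, out_folder):
    """Map an image URL to a local path inside out_folder."""
    name = os.path.basename(urlparse(url).path) or "image"
    return os.path.join(out_folder, name)


def save(url, out_folder, fetch):
    """fetch(url) -> bytes; kept injectable so the logic is testable offline."""
    path = filename_for(url, out_folder)
    with open(path, "wb") as f:
        f.write(fetch(url))
    return path
```

With requests installed, `fetch` could simply be `lambda u: requests.get(u).content`, applied to each link the generator above yields.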

Regarding python - How do I scrape images from an aspx page?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/37393171/
