
javascript - Scraping a website that requires you to scroll down

Reposted. Author: 数据小太阳. Updated: 2023-10-29 05:28:35

I want to scrape this website:

However, I have to scroll down before it shows more data. I don't know how to scroll down using Beautiful Soup or Python. Does anyone here know how to do this?

The code is a bit messy, but here it is.

import scrapy
from scrapy.selector import Selector
from testtest.items import TesttestItem
import datetime
from selenium import webdriver
from bs4 import BeautifulSoup
from HTMLParser import HTMLParser
import re
import time

class MLStripper(HTMLParser):
    pass  # class body is not shown in the post


class MySpider(scrapy.Spider):
    name = "A1Locker"

    def strip_tags(html):
        s = MLStripper()
        s.feed(html)
        return s.get_data()

    # Scrapy expects bare domain names here, not full URLs
    allowed_domains = ['a1lockerrental.com']
    start_urls = ['http://www.a1lockerrental.com/self-storage/mo/st-louis/4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?category=all']

    def parse(self, response):
        # Render the "Small" category in a real browser, then parse the result
        url = 'http://www.a1lockerrental.com/self-storage/mo/st-louis/4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?category=Small'
        driver = webdriver.Firefox()
        driver.get(url)
        html = driver.page_source
        soup = BeautifulSoup(html, 'html.parser')

        # Same for the "Medium" category, in a second browser window
        url2 = 'http://www.a1lockerrental.com/self-storage/mo/st-louis/4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?category=Medium'
        driver2 = webdriver.Firefox()
        driver2.get(url2)
        html2 = driver2.page_source  # was driver.page_source, which re-read the first page
        soup2 = BeautifulSoup(html2, 'html.parser')
        #soup.append(soup2)
        #print soup
        items = []
        inside = "Indoor"
        outside = "Outdoor"
        inside_units = ["5 x 5", "5 x 10"]
        outside_units = ["10 x 15", "5 x 15", "8 x 10", "10 x 10", "10 x 20", "10 x 25", "10 x 30"]
        sizeTagz = soup.findAll('span', {"class": "sss-unit-size"})
        sizeTagz2 = soup2.findAll('span', {"class": "sss-unit-size"})
        #print soup.findAll('span',{"class":"sss-unit-size"})

        rateTagz = soup.findAll('p', {"class": "unit-special-offer"})
        specialTagz = soup.findAll('span', {"class": "unit-special-offer"})
        typesTagz = soup.findAll('div', {"class": "unit-info"})

        rateTagz2 = soup2.findAll('p', {"class": "unit-special-offer"})
        specialTagz2 = soup2.findAll('span', {"class": "unit-special-offer"})
        typesTagz2 = soup2.findAll('div', {"class": "unit-info"})

        yield {'date': datetime.datetime.now().strftime("%m-%d-%y"),
               'name': "A1Locker"
               }
        size = []
        for n in range(len(sizeTagz)):
            print len(rateTagz)
            print len(typesTagz)

            if "Outside" in (typesTagz[n]).get_text():
                # Keep only the numbers from a label such as "10 x 20"
                size.append(re.findall(r'\d+', (sizeTagz[n]).get_text()))
                size.append(re.findall(r'\d+', (sizeTagz2[n]).get_text()))
                print "logic hit"
                for i in range(len(size)):
                    yield {
                        #soup.findAll('p',{"class":"icon-bg"})
                        #'name': soup.find('strong', {'class':'high'}).text

                        'size': size[i]
                        #"special": (specialTagz[n]).get_text(),
                        #"rate": re.findall(r'\d+',(rateTagz[n]).get_text()),
                        #"size": i.css(".sss-unit-size::text").extract(),
                        #"types": "Outside"
                    }
        driver.close()
        driver2.close()

The expected output is the data collected from this page: http://www.a1lockerrental.com/self-storage/mo/st-louis/4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?category=all

Doing that requires being able to scroll down to see the rest of the data. At least, that's my thinking.

Thanks, DM123

Best answer

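The website you are trying to scrape loads its content dynamically with JavaScript. Unfortunately, many scraping tools, BeautifulSoup among them, cannot execute JavaScript on their own. There are a number of options, though, many of which come in the form of headless browsers. A classic one is PhantomJS, but it may be worth taking a look at this great list of options on GitHub, some of which may pair nicely with Beautiful Soup, such as Selenium.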
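To make the point about JavaScript concrete, here is a minimal sketch using only Python's standard-library `html.parser`. The markup in `STATIC_HTML` is hypothetical (only the `sss-unit-size` class name comes from the question), and `UnitSizeCollector` and `static_unit_sizes` are names made up for this illustration. The parser reads the one unit that is already in the markup and treats the script as inert text, so the unit the script would add in a browser never shows up.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for what the server sends before any script runs.
STATIC_HTML = """
<div id="units">
  <span class="sss-unit-size">5 x 5</span>
</div>
<script>
  var s = document.createElement("span");
  s.className = "sss-unit-size";
  s.textContent = "10 x 20";
  document.getElementById("units").appendChild(s);
</script>
"""


class UnitSizeCollector(HTMLParser):
    """Collects the text of every <span class="sss-unit-size"> in static HTML."""

    def __init__(self):
        super().__init__()
        self.in_size_span = False
        self.sizes = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "sss-unit-size" in classes:
            self.in_size_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_size_span = False

    def handle_data(self, data):
        # Script text also arrives here, but never while inside a size span.
        if self.in_size_span and data.strip():
            self.sizes.append(data.strip())


def static_unit_sizes(html):
    """Return the unit sizes present in the markup itself (no script is run)."""
    parser = UnitSizeCollector()
    parser.feed(html)
    parser.close()
    return parser.sizes


if __name__ == "__main__":
    print(static_unit_sizes(STATIC_HTML))  # ['5 x 5']; the scripted "10 x 20" is absent
```

A real browser, headless or not, would run the script first, which is why its rendered page source can contain data the raw response does not.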

With Selenium in mind, the answers to this Stack Overflow question may also help.
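As a rough sketch of how scrolling with Selenium could look: keep scrolling to the bottom until the page height stops growing, then hand the rendered HTML to the parser. The helper names `scroll_to_bottom` and `fetch_full_page`, the one-second pause, and the cap of 30 rounds are assumptions for illustration. It also assumes that the page grows the main document as you scroll. If the listing sits in its own scrollable container, the JavaScript snippets would have to target that element instead.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=30):
    """Scroll until the page height stops growing; return the number of scrolls."""
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # crude wait for lazily loaded content to arrive
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new was appended, so assume the end was reached
        last_height = new_height
    return rounds


def fetch_full_page(url):
    """Load url in Firefox, scroll to the end, and return the rendered HTML."""
    # Imported here so scroll_to_bottom can be used without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        scroll_to_bottom(driver)
        return driver.page_source
    finally:
        driver.quit()
```

The string returned by `fetch_full_page` can be passed to `BeautifulSoup(html, 'html.parser')` exactly as in the question's code. A fixed `time.sleep` is the simplest way to wait. An explicit wait on a specific element would be sturdier if the page has a reliable marker to wait for.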

Regarding "javascript - Scraping a website that requires you to scroll down", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45620396/
