
python - Scrapy does not click through all pages


I am using Scrapy to crawl an online shop. The products are loaded dynamically, which is why I use Selenium to crawl the pages. The spider starts by scraping all the categories, which are then followed by the main parsing function.

The problem occurs while crawling each category: the spider is instructed to scrape all data from the first page and then click the button to go to the next page, repeating until there is no button left. The code works fine if I feed a single category URL as the start_url, but oddly enough, when I run it inside the main code it does not click through all pages. It randomly switches to a new category before all NEXT buttons have been clicked.

I cannot figure out why this happens.

import scrapy
from scrapy import signals
from scrapy.http import TextResponse
from scrapy.xlib.pydispatch import dispatcher
from horni.items import HorniItem

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
from selenium.webdriver.common.keys import Keys


class horniSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ["example.com"]
    start_urls = ['https://www.example.com']

    def __init__(self):
        # One shared Selenium driver; closed when the spider finishes.
        self.driver = webdriver.Chrome()
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        self.driver.close()

    def parse(self, response):
        # Collect every main-category link and follow it.
        for href in response.xpath('//li[@class="sub"]/a/@href'):
            item = HorniItem()
            item['maincategory'] = response.urljoin(href.extract())
            yield scrapy.Request(item['maincategory'], callback=self.parse_subcategories)

    def parse_subcategories(self, response):
        # Collect every subcategory link and follow it.
        for href in response.xpath('//li[@class="sub"]/a/@href'):
            item = HorniItem()
            item['subcategory'] = response.urljoin(href.extract())
            yield scrapy.Request(item['subcategory'], callback=self.parse_articles)

    def parse_articles(self, response):
        # The product list is rendered by JavaScript, so load the page in
        # Selenium and rebuild a TextResponse from the rendered source.
        self.driver.get(response.url)
        response = TextResponse(url=self.driver.current_url, body=self.driver.page_source, encoding='utf-8')
        item = HorniItem()
        item['title'] = response.xpath('//div[@id="article-list-headline"]/div/h1/text()').extract()
        yield item
        # Scrape the first page of products.
        ids = response.xpath('//a[@class="title-link"]/@href').extract()
        prices = response.xpath('//span[@class="price ng-binding"]/text()').extract()
        articles = response.xpath('//a[@class="title-link"]/span[normalize-space()]/text()').extract()
        ids = [i.split('/')[-2] for i in ids]
        prices = [x for x in prices if x != u'\xa0']
        articles = [w.replace(u'\n', '') for w in articles]
        for id, price, article in zip(ids, prices, articles):
            item = HorniItem()
            item['id'] = id
            item['price'] = price
            item['name'] = article
            yield item
        # Keep clicking the NEXT button and scraping until it no longer exists.
        while True:
            try:
                next = self.driver.find_element_by_xpath('//div[@class="paging-wrapper"]/a[@class="paging-btn right"]')
                next.click()
                response = TextResponse(url=self.driver.current_url, body=self.driver.page_source, encoding='utf-8')
                ids = response.xpath('//a[@class="title-link"]/@href').extract()
                prices = response.xpath('//span[@class="price ng-binding"]/text()').extract()
                articles = response.xpath('//a[@class="title-link"]/span[normalize-space()]/text()').extract()
                ids = [i.split('/')[-2] for i in ids]
                prices = [x for x in prices if x != u'\xa0']
                articles = [w.replace(u'\n', '') for w in articles]
                for id, price, article in zip(ids, prices, articles):
                    item = HorniItem()
                    item['id'] = id
                    item['price'] = price
                    item['name'] = article
                    yield item
            except:
                # No NEXT button left (or the click failed): stop paginating.
                break

UPDATE

It looks like the problem lies in the DOWNLOAD_DELAY setting. Since the NEXT button on the site does not actually generate a new URL but only executes JavaScript, the page URL never changes.

Best Answer

I found the answer:

The problem is that, because the page content is generated dynamically, clicking the NEXT button does not actually change the URL. Combined with the project's DOWNLOAD_DELAY setting, this means the spider stays on a page for the given amount of time, regardless of whether it has managed to click every available NEXT button.

Setting DOWNLOAD_DELAY high enough lets the spider stay on each URL long enough to click through and scrape every page.
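
For illustration only, a minimal sketch of that setting in the project's settings.py; the value of 10 seconds is an assumed placeholder, and the right number depends on how many pages the slowest category has:

# settings.py (sketch): keep the spider on each URL long enough for Selenium
# to click through every NEXT button before the next request is scheduled.
# The value 10 is an assumed placeholder, not a measured one.
DOWNLOAD_DELAY = 10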

The downside is that this forces the spider to wait the set amount of time on every URL, even when there is no NEXT button left to click. But oh well...
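
A possible way around that fixed wait (not what the answer above uses, just a sketch that assumes the question's XPath for the NEXT button is correct and that a 10-second timeout is acceptable) is to drive the pagination with an explicit WebDriverWait inside parse_articles, so the spider stops as soon as the button is gone:

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

NEXT_BTN_XPATH = '//div[@class="paging-wrapper"]/a[@class="paging-btn right"]'

def iter_result_pages(driver, timeout=10):
    # Hypothetical helper: yield the rendered source of every result page.
    # It waits explicitly for a clickable NEXT button instead of relying on
    # a fixed DOWNLOAD_DELAY, and stops as soon as the button is gone.
    while True:
        yield driver.page_source
        try:
            next_btn = WebDriverWait(driver, timeout).until(
                EC.element_to_be_clickable((By.XPATH, NEXT_BTN_XPATH)))
        except TimeoutException:
            break  # no NEXT button left, last page reached
        next_btn.click()
        # In practice you may also need to wait here for the new product
        # rows to render before reading page_source again.

Each yielded page source could then be wrapped in a TextResponse and parsed exactly as in parse_articles above; the full timeout is only paid once per category, on the last page, rather than on every page.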

Regarding python - Scrapy does not click through all pages, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/38180300/
