python - Scrapy CrawlSpider with AJAX pagination


I am trying to scrape a page whose pagination is loaded through an AJAX call. The link I am trying to crawl is http://www.demo.com. In the spider's .py file I restrict the XPath used by the link extractor; the code is:

# -*- coding: utf-8 -*-
import scrapy

from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from sum.items import sumItem

class Sumspider1(CrawlSpider):
    name = 'sumDetailsUrls'
    allowed_domains = ['sum.com']
    start_urls = ['http://www.demo.com']
    rules = (
        Rule(LinkExtractor(restrict_xpaths='.//ul[@id="pager"]/li[8]/a'), callback='parse_start_url', follow=True),
    )

    # override parse_start_url so the spider also scrapes the first page
    def parse_start_url(self, response):
        print '********************************************1**********************************************'
        #//div[@class="showMoreCars hide"]/a
        #.//ul[@id="pager"]/li[8]/a/@href
        self.log('Inside - parse_item %s' % response.url)
        hxs = HtmlXPathSelector(response)
        item = sumItem()
        item['page'] = response.url
        title = hxs.xpath('.//h1[@class="page-heading"]/text()').extract()
        print '********************************************title**********************************************', title
        urls = hxs.xpath('.//a[@id="linkToDetails"]/@href').extract()
        print '**********************************************2***url*****************************************', urls

        finalurls = []

        for url in urls:
            print '---------url-------', url
            finalurls.append(url)

        item['urls'] = finalurls
        return item

My items.py file contains:

from scrapy.item import Item, Field


class sumItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    page = Field()
    urls = Field()

I am still not getting the expected output, and I cannot fetch all the pages when I crawl the site.

Best Answer

Hopefully the code below helps you.

somespider.py
# -*- coding: utf-8 -*-
import scrapy
import re
from scrapy.selector import Selector
from demo.items import DemoItem
from selenium import webdriver

def removeUnicodes(strData):
    # strip whitespace and collapse newlines/tabs before storing the value
    if strData:
        strData = strData.encode('utf-8').strip()
        strData = re.sub(r'[\n\r\t]', r' ', strData)
    return strData

class demoSpider(scrapy.Spider):
    name = "domainurls"
    allowed_domains = ["domain.com"]
    start_urls = ['http://www.domain.com/used/cars-in-trichy/']

    def __init__(self):
        # HtmlUnit (with JavaScript) driven through a remote Selenium server
        self.driver = webdriver.Remote("http://127.0.0.1:4444/wd/hub",
                                       webdriver.DesiredCapabilities.HTMLUNITWITHJS)

    def parse(self, response):
        self.driver.get(response.url)
        self.driver.implicitly_wait(5)
        hxs = Selector(response)
        item = DemoItem()
        finalurls = []
        while True:
            try:
                # keep clicking the "show more" link until it is no longer found
                next = self.driver.find_element_by_xpath('//div[@class="showMoreCars hide"]/a')
                next.click()
                # get the data and write it to scrapy items
                item['pageurl'] = response.url
                item['title'] = removeUnicodes(hxs.xpath('.//h1[@class="page-heading"]/text()').extract()[0])
                urls = self.driver.find_elements_by_xpath('.//a[@id="linkToDetails"]')

                for url in urls:
                    url = url.get_attribute("href")
                    finalurls.append(removeUnicodes(url))

                item['urls'] = finalurls

            except:
                # the "show more" link is gone: pagination is exhausted
                break

        self.driver.close()
        return item

items.py

from scrapy.item import Item, Field

class DemoItem(Item):
    page = Field()
    urls = Field()
    pageurl = Field()
    title = Field()

Note: you need to run a Selenium RC server, because HTMLUNITWITHJS works with Python only through Selenium RC.

Start your Selenium RC server with the command:

java -jar selenium-server-standalone-2.44.0.jar
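
If running the standalone Selenium server is inconvenient, the driver created in __init__ can be swapped for a local browser driver instead. This is only a minimal sketch, assuming Firefox is installed locally; make_driver is a hypothetical helper, not part of the answer above.

from selenium import webdriver

def make_driver(use_remote=False):
    # make_driver is a hypothetical helper shown for illustration only
    if use_remote:
        # the approach used in the answer above: HtmlUnit with JS via the standalone server
        return webdriver.Remote("http://127.0.0.1:4444/wd/hub",
                                webdriver.DesiredCapabilities.HTMLUNITWITHJS)
    # assumption: a local Firefox installation; no separate server process is needed
    return webdriver.Firefox()

The spider would then set self.driver = make_driver() in __init__, leaving the rest of the parse logic unchanged.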

Run your spider with the command:

scrapy crawl domainurls -o someoutput.json

Regarding python - Scrapy CrawlSpider with AJAX pagination, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/27501751/
