
ajax - Scrapy following paginated AJAX requests - POST


I'm quite new to Scrapy and have built a few spiders.

I want to scrape the reviews from this page. So far my spider crawls the first page and scrapes those items, but when it comes to pagination it does not follow the links.

I know this happens because it is an AJAX request, and it is a POST rather than a GET. I'm a newbie, but I have read this and this post here, and followed the "mini tutorial" to get what seems to be the request URL from the response:

http://www.pcguia.pt/category/reviews/sorter=recent&location=&loop=main+loop&action=sort&view=grid&columns=3&paginated=2&currentquery%5Bcategory_name%5D=reviews

But when I try to open it in the browser it says:

"Página nao encontrada" = "PAGE NOT FOUND"

Am I on the right track so far? What am I missing?
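Pasting that URL into the browser issues a GET request, whereas the site's pagination endpoint only seems to answer a POST carrying form data, which would explain the "Página nao encontrada" page. Below is a minimal sketch of replaying the call as a POST outside the browser, just to test the idea: it uses the requests library, the form fields mirror the ones used in the spider further down, and whether ajax.php needs extra headers or cookies is an assumption.

# Minimal sketch: replay the pagination call as a POST instead of a GET.
# The form fields mirror those used in the spider below; any extra headers
# or cookies the endpoint might require are not handled here.
import requests

AJAX_URL = 'http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php'
formdata = {
    'sorter': 'recent',
    'location': 'main loop',
    'loop': 'main loop',
    'action': 'sort',
    'view': 'grid',
    'columns': '3',
    'paginated': '2',
    'currentquery[category_name]': 'reviews',
}

resp = requests.post(AJAX_URL, data=formdata)
print(resp.status_code)   # expect 200 if the endpoint accepts the form data
print(resp.text[:300])    # the body should be JSON with an HTML 'content' field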

Edit: my spider:
import scrapy
import json
from scrapy.http import FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from pcguia.items import ReviewItem


class PcguiaSpider(scrapy.Spider):
    name = "pcguia"  # spider name to call in terminal
    allowed_domains = ['pcguia.pt']  # the domain where the spider is allowed to crawl
    start_urls = ['http://www.pcguia.pt/category/reviews/#paginated=1']  # url from which the spider will start crawling
    page_incr = 1
    pagination_url = 'http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php'

    def parse(self, response):
        sel = Selector(response)

        if self.page_incr > 1:
            json_data = json.loads(response.body)
            sel = Selector(text=json_data.get('content', ''))

        hxs = Selector(response)

        item_pub = ReviewItem()
        item_pub['date'] = hxs.xpath('//span[@class="date"]/text()').extract()  # format: year-month-dayThours:minutes:seconds-timezone, e.g. 2015-03-31T09:40:00-0700
        item_pub['title'] = hxs.xpath('//title/text()').extract()

        # pagination code starts here
        # if page has content
        if sel.xpath('//div[@class="panel-wrapper"]'):
            self.page_incr += 1
            formdata = {
                'sorter': 'recent',
                'location': 'main loop',
                'loop': 'main loop',
                'action': 'sort',
                'view': 'grid',
                'columns': '3',
                'paginated': str(self.page_incr),
                'currentquery[category_name]': 'reviews'
            }
            yield FormRequest(url=self.pagination_url, formdata=formdata, callback=self.parse)
        else:
            return

        yield item_pub

Output:
2015-05-12 14:53:45+0100 [scrapy] INFO: Scrapy 0.24.5 started (bot: pcguia)
2015-05-12 14:53:45+0100 [scrapy] INFO: Optional features available: ssl, http11
2015-05-12 14:53:45+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'pcguia.spiders', 'SPIDER_MODULES': ['pcguia.spiders'], 'BOT_NAME': 'pcguia'}
2015-05-12 14:53:45+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-05-12 14:53:45+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-05-12 14:53:45+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-05-12 14:53:45+0100 [scrapy] INFO: Enabled item pipelines:
2015-05-12 14:53:45+0100 [pcguia] INFO: Spider opened
2015-05-12 14:53:45+0100 [pcguia] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-12 14:53:45+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6033
2015-05-12 14:53:45+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6090
2015-05-12 14:53:45+0100 [pcguia] DEBUG: Crawled (200) <GET http://www.pcguia.pt/category/reviews/#paginated=1> (referer: None)
2015-05-12 14:53:45+0100 [pcguia] DEBUG: Scraped from <200 http://www.pcguia.pt/category/reviews/>
{'date': '',
'title': [u'Reviews | PCGuia'],
}
2015-05-12 14:53:47+0100 [pcguia] DEBUG: Crawled (200) <POST http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php> (referer: http://www.pcguia.pt/category/reviews/)
2015-05-12 14:53:47+0100 [pcguia] DEBUG: Scraped from <200 http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php>
{'date': '',
'title': ''
}

Best Answer

You can try this:

import json

from scrapy.http import FormRequest
from scrapy.selector import Selector
# other imports

class SpiderClass(Spider):
    # spider name and all
    page_incr = 1
    pagination_url = 'http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php'

    def parse(self, response):

        sel = Selector(response)

        if self.page_incr > 1:
            # AJAX responses are JSON; the HTML fragment is under 'content'
            json_data = json.loads(response.body)
            sel = Selector(text=json_data.get('content', ''))

        # your code here

        # pagination code starts here
        # if page has content
        if sel.xpath('//div[@class="panel-wrapper"]'):
            self.page_incr += 1
            formdata = {
                'sorter': 'recent',
                'location': 'main loop',
                'loop': 'main loop',
                'action': 'sort',
                'view': 'grid',
                'columns': '3',
                'paginated': str(self.page_incr),
                'currentquery[category_name]': 'reviews'
            }
            yield FormRequest(url=self.pagination_url, formdata=formdata, callback=self.parse)
        else:
            return

I have tested this with the Scrapy shell and it works.

In the Scrapy shell:
In [0]: response.url
Out[0]: 'http://www.pcguia.pt/category/reviews/#paginated=1'

In [1]: from scrapy.http import FormRequest

In [2]: from scrapy.selector import Selector

In [3]: import json

In [4]: response.xpath('//h2/a/text()').extract()
Out[4]:
[u'HP Slate 8 Plus',
u'Astro A40 +MixAmp Pro',
u'Asus ROG G751J',
u'BQ Aquaris E5 HD 4G',
u'Asus GeForce GTX980 Strix',
u'AlienTech BattleBox Edition',
u'Toshiba Encore Mini WT7-C',
u'Samsung Galaxy Note 4',
u'Asus N551JK',
u'Western Digital My Passport Wireless',
u'Nokia Lumia 735',
u'Photoshop Elements 13',
u'AMD Radeon R9 285',
u'Asus GeForce GTX970 Stryx',
u'TP-Link AC750 Wifi Repeater']

In [5]: url = "http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php"

In [6]: formdata = {
'sorter':'recent',
'location':'main loop',
'loop':'main loop',
'action':'sort',
'view':'grid',
'columns':'3',
'paginated':'2',
'currentquery[category_name]':'reviews'
}

In [7]: r = FormRequest(url=url, formdata=formdata)

In [8]: fetch(r)
2015-05-12 18:29:16+0530 [default] DEBUG: Crawled (200) <POST http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7fcc247c4590>
[s] item {}
[s] r <POST http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php>
[s] request <POST http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php>
[s] response <200 http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php>
[s] settings <scrapy.settings.Settings object at 0x7fcc2a74f450>
[s] spider <Spider 'default' at 0x7fcc239ba990>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser

In [9]: json_data = json.loads(response.body)

In [10]: sell = Selector(text=json_data.get('content', ''))

In [11]: sell.xpath('//h2/a/text()').extract()
Out[11]:
[u'Asus ROG GR8',
u'Devolo dLAN 1200+',
u'Yezz Billy 4,7',
u'Sony Alpha QX1',
u'Toshiba Encore2 WT10',
u'BQ Aquaris E5 FullHD',
u'Toshiba Canvio AeroMobile',
u'Samsung Galaxy Tab S 10.5',
u'Modecom FreeTab 7001 HD',
u'Steganos Online Shield VPN',
u'AOC G2460PG G-Sync',
u'AMD Radeon R7 SSD',
u'Nvidia Shield',
u'Asus ROG PG278Q GSync',
u'NOX Krom Kombat']

EDIT:
import scrapy
import json
from scrapy.http import FormRequest, Request
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from pcguia.items import ReviewItem
from dateutil import parser
import re


class PcguiaSpider(scrapy.Spider):
    name = "pcguia"  # spider name to call in terminal
    allowed_domains = ['pcguia.pt']  # the domain where the spider is allowed to crawl
    start_urls = ['http://www.pcguia.pt/category/reviews/#paginated=1']  # url from which the spider will start crawling
    page_incr = 1
    pagination_url = 'http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php'

    def parse(self, response):
        sel = Selector(response)
        if self.page_incr > 1:
            json_data = json.loads(response.body)
            sel = Selector(text=json_data.get('content', ''))

        review_links = sel.xpath('//h2/a/@href').extract()
        for link in review_links:
            yield Request(url=link, callback=self.parse_review)

        # pagination code starts here
        # if page has content
        if sel.xpath('//div[@class="panel-wrapper"]'):
            self.page_incr += 1
            formdata = {
                'sorter': 'recent',
                'location': 'main loop',
                'loop': 'main loop',
                'action': 'sort',
                'view': 'grid',
                'columns': '3',
                'paginated': str(self.page_incr),
                'currentquery[category_name]': 'reviews'
            }
            yield FormRequest(url=self.pagination_url, formdata=formdata, callback=self.parse)
        else:
            return

    def parse_review(self, response):
        month_matcher = 'novembro|janeiro|agosto|mar\xe7o|fevereiro|junho|dezembro|julho|abril|maio|outubro|setembro'
        month_dict = {u'abril': u'April',
                      u'agosto': u'August',
                      u'dezembro': u'December',
                      u'fevereiro': u'February',
                      u'janeiro': u'January',
                      u'julho': u'July',
                      u'junho': u'June',
                      u'maio': u'May',
                      u'mar\xe7o': u'March',
                      u'novembro': u'November',
                      u'outubro': u'October',
                      u'setembro': u'September'}
        review_date = response.xpath('//span[@class="date"]/text()').extract()
        review_date = review_date[0].strip().strip('Publicado a').lower() if review_date else ''
        month = re.findall('%s' % month_matcher, review_date)[0]
        _date = parser.parse(review_date.replace(month, month_dict.get(month))).strftime('%Y-%m-%dT%H:%M:%S')
        title = response.xpath('//h1[@itemprop="itemReviewed"]/text()').extract()
        title = title[0].strip() if title else ''
        item_pub = ReviewItem(
            date=_date,
            title=title)
        yield item_pub
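To make the date handling in parse_review concrete, here is a small worked example of the normalisation chain. The exact raw text on the review pages is an assumption, inferred from the "Publicado a" prefix stripped in the code and from the sample output below:

# Worked example (illustrative): normalising a Portuguese date string.
# The raw input below is assumed; the real pages may format it slightly differently.
import re
from dateutil import parser

month_dict = {u'novembro': u'November'}   # subset of the mapping used in the spider
raw = u'Publicado a 5 novembro 2014'

cleaned = raw.strip().strip('Publicado a').lower()         # -> u'5 novembro 2014'
month = re.findall('novembro|janeiro|agosto', cleaned)[0]  # shortened matcher for this example
normalised = cleaned.replace(month, month_dict[month])     # -> u'5 November 2014'
print(parser.parse(normalised).strftime('%Y-%m-%dT%H:%M:%S'))
# 2014-11-05T00:00:00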

Output:
{'date': '2014-11-05T00:00:00', 'title': u'Samsung Galaxy Tab S 10.5'}

Regarding "ajax - Scrapy following paginated AJAX requests - POST", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/30189862/
