
python - Scrapy pagination not working and optimizing the spider


Please help me optimize my Scrapy spider. In particular, the next-page pagination does not work. There are many pages, with 50 items per page. I capture the first page's 50 items (links) in parse_items, and the items from the following pages are also scraped in parse_items.

import scrapy
from scrapy import Field
from fake_useragent import UserAgent

class DiscoItem(scrapy.Item):
    release = Field()
    images = Field()


class discoSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['discogs.com']
    query = input('ENTER SEARCH MUSIC TYPE : ')
    start_urls = ['http://www.discogs.com/search?q=%s&type=release' % query]
    custome_settings = {
        'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36",
        'handle_httpstatus_list': [301, 302,],
        'download_delay': 10}

    def start_requests(self):
        yield scrapy.Request(url=self.start_urls[0], callback=self.parse)

    def parse(self, response):
        print('START parse \n')
        print("*****", response.url)

        # next page pagination
        next_page = response.css('a.pagination_next::attr(href)').extract_first()
        next_page = response.urljoin(next_page)
        yield scrapy.Request(url=next_page, callback=self.parse_items2)

        headers = {}
        for link in response.css('a.search_result_title ::attr(href)').extract():
            ua = UserAgent()  # random user agent
            headers['User-Agent'] = ua.random
            yield scrapy.Request(response.urljoin(link), headers=headers, callback=self.parse_items)


    def parse_items2(self, response):
        print('parse_items2 *******', response.url)
        yield scrapy.Request(url=response.url, callback=self.parse)

    def parse_items(self, response):
        print("parse_items**********", response.url)
        items = DiscoItem()
        for imge in response.css('div#page_content'):
            img = imge.css("span.thumbnail_center img::attr(src)").extract()[0]
            items['images'] = img
            release = imge.css('div.content a ::text').extract()
            items['release'] = release[4]
            yield items

Best Answer

When I tried to run your code (after fixing a number of indentation, spelling, and capitalization errors), this line showed up in Scrapy's log:

2018-03-05 00:47:28 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.discogs.com/search/?q=rock&type=release&page=2> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)

Scrapy filters out duplicate requests by default, and your parse_items2() method does nothing but create duplicate requests. I see no reason for that method to exist.

What you should do instead is specify the parse() method as the callback for the request, avoiding the extra method that does nothing:

yield scrapy.Request(url=next_page, callback=self.parse)
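
Put together, a minimal sketch of what parse() might look like once parse_items2() is removed (reusing the CSS selectors from the question; the explicit None check on next_page is an added safeguard, not part of the original code):

def parse(self, response):
    # follow each search result link on the current page
    for link in response.css('a.search_result_title ::attr(href)').extract():
        yield scrapy.Request(response.urljoin(link), callback=self.parse_items)

    # follow the next page, if there is one, and parse it with this same method
    next_page = response.css('a.pagination_next::attr(href)').extract_first()
    if next_page is not None:
        yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)

Because each page now yields a request for a new URL rather than re-requesting its own URL, Scrapy's duplicate filter no longer drops the pagination requests.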

Regarding python - Scrapy pagination not working and optimizing the spider, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49100849/
