
python - How do I send cookies with Scrapy CrawlSpider requests?


I'm trying to build a Reddit scraper using Python's Scrapy framework.

I'm using CrawlSpider to crawl Reddit and its subreddits. However, when I hit a page that contains adult content, the site asks for the cookie over18=1.

So I've been trying to send that cookie with every request the spider makes, but it isn't working.

Here is my spider code. As you can see, I tried to add a cookie to every spider request using the start_requests() method.

Can anyone here tell me how to do this? Or tell me what I'm doing wrong?

from scrapy import Spider
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from reddit.items import RedditItem
from scrapy.http import Request, FormRequest


class MySpider(CrawlSpider):
    name = 'redditscraper'
    allowed_domains = ['reddit.com', 'imgur.com']
    start_urls = ['https://www.reddit.com/r/nsfw']

    rules = (
        Rule(LinkExtractor(
            allow=['/r/nsfw/\?count=\d*&after=\w*']),
            callback='parse_item',
            follow=True),
    )

    def start_requests(self):
        for i, url in enumerate(self.start_urls):
            print(url)
            yield Request(url, cookies={'over18': '1'}, callback=self.parse_item)

    def parse_item(self, response):
        titleList = response.css('a.title')

        for title in titleList:
            item = RedditItem()
            item['url'] = title.xpath('@href').extract()
            item['title'] = title.xpath('text()').extract()
            yield item

Best answer

OK. Try something like this.

def start_requests(self):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36'}
    for i, url in enumerate(self.start_urls):
        yield Request(url, cookies={'over18': '1'}, callback=self.parse_item, headers=headers)

What was blocking you is the User-Agent.
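If you would rather not repeat the header on every request, the user agent can also be configured once for the whole project. A minimal sketch, assuming a standard Scrapy project with a settings.py:

# settings.py -- set the User-Agent project-wide (sketch, assuming a standard Scrapy settings module)
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36')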

Edit:

I'm not sure what's wrong with the CrawlSpider version, but a plain Spider works fine.

#!/usr/bin/env python
# encoding: utf-8
import scrapy


class MySpider(scrapy.Spider):
    name = 'redditscraper'
    allowed_domains = ['reddit.com', 'imgur.com']
    start_urls = ['https://www.reddit.com/r/nsfw']

    def request(self, url, callback):
        """
        wrapper for scrapy.Request
        """
        request = scrapy.Request(url=url, callback=callback)
        request.cookies['over18'] = 1
        request.headers['User-Agent'] = (
            'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, '
            'like Gecko) Chrome/45.0.2454.85 Safari/537.36')
        return request

    def start_requests(self):
        for i, url in enumerate(self.start_urls):
            yield self.request(url, self.parse_item)

    def parse_item(self, response):
        titleList = response.css('a.title')

        for title in titleList:
            item = {}
            item['url'] = title.xpath('@href').extract()
            item['title'] = title.xpath('text()').extract()
            yield item
        url = response.xpath('//a[@rel="nofollow next"]/@href').extract_first()
        if url:
            yield self.request(url, self.parse_item)
        # you may consider scrapy.pipelines.images.ImagesPipeline :D
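If you still want CrawlSpider's rule-based crawling, the likely culprit in the question's code is that start_requests sends its requests straight to parse_item, which bypasses CrawlSpider's built-in parse method that applies the rules. One possible alternative, shown below as a sketch that is not from the original answer (it assumes Scrapy 2.0+, where a Rule's process_request hook receives both the request and the response), is to attach the cookie and User-Agent to every request the rule extracts:

from scrapy import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class RedditCrawlSpider(CrawlSpider):
    name = 'redditcrawlspider'  # hypothetical name for this sketch
    allowed_domains = ['reddit.com', 'imgur.com']
    start_urls = ['https://www.reddit.com/r/nsfw']

    rules = (
        Rule(LinkExtractor(allow=[r'/r/nsfw/\?count=\d*&after=\w*']),
             callback='parse_item',
             follow=True,
             # called for every request extracted by this rule
             process_request='add_over18_cookie'),
    )

    def start_requests(self):
        # The start page needs the cookie too; leave the callback as the
        # default so CrawlSpider's rules still run on the response.
        for url in self.start_urls:
            yield Request(url, cookies={'over18': '1'})

    def add_over18_cookie(self, request, response):
        # Re-issue the extracted request with the cookie and User-Agent attached.
        return request.replace(
            cookies={'over18': '1'},
            headers={'User-Agent': (
                'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                '(KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36')},
        )

    def parse_item(self, response):
        for title in response.css('a.title'):
            yield {
                'url': title.xpath('@href').get(),
                'title': title.xpath('text()').get(),
            }

Either spider can then be run as usual, e.g. scrapy crawl redditscraper -o items.json.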

Regarding "python - How do I send cookies with Scrapy CrawlSpider requests?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/32623285/
