
python - How to override/use cookies in Scrapy

Reposted. Author: 太空狗. Updated: 2023-10-29 17:15:48

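I want to scrape http://www.3andena.com/. The site first loads in Arabic and stores the language setting in a cookie. If you try to access the language version directly through the URL (http://www.3andena.com/home.php?sl=en), it fails and returns a server error.

So I want to set the cookie value "store_language" to "en" and then start scraping the site with that cookie value.

I'm using a CrawlSpider with a few rules.

Here is the code: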

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import log
from bkam.items import Product
from scrapy.http import Request
import re

class AndenaSpider(CrawlSpider):
    name = "andena"
    domain_name = "3andena.com"
    start_urls = ["http://www.3andena.com/Kettles/?objects_per_page=10"]

    product_urls = []

    rules = (
        # The following rule is for pagination
        Rule(SgmlLinkExtractor(allow=(r'\?page=\d+$'),), follow=True),
        # The following rule is for product details
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[contains(@class, "products-dialog")]//table//tr[contains(@class, "product-name-row")]/td'), unique=True), callback='parse_product', follow=True),
    )

    def start_requests(self):
        yield Request('http://3andena.com/home.php?sl=en', cookies={'store_language':'en'})

        for url in self.start_urls:
            yield Request(url, callback=self.parse_category)


    def parse_category(self, response):
        hxs = HtmlXPathSelector(response)

        self.product_urls.extend(hxs.select('//td[contains(@class, "product-cell")]/a/@href').extract())

        for product in self.product_urls:
            yield Request(product, callback=self.parse_product)


    def parse_product(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = Product()

        '''
        some parsing
        '''

        items.append(item)
        return items

SPIDER = AndenaSpider()

Here is the log:

2012-05-30 19:27:13+0000 [andena] DEBUG: Redirecting (301) to <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> from <GET http://3andena.com/home.php?sl=en>
2012-05-30 19:27:14+0000 [andena] DEBUG: Redirecting (302) to <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> from <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098>
2012-05-30 19:27:14+0000 [andena] DEBUG: Crawled (200) <GET http://www.3andena.com/Kettles/?objects_per_page=10> (referer: None)
2012-05-30 19:27:15+0000 [andena] DEBUG: Crawled (200) <GET http://www.3andena.com/B-and-D-Concealed-coil-pan-kettle-JC-62.html> (referer: http://www.3andena.com/Kettles/?objects_per_page=10)

Best answer

Modify your code as follows:

def start_requests(self):
    for url in self.start_urls:
        yield Request(url, cookies={'store_language':'en'}, callback=self.parse_category)

The scrapy.Request object accepts an optional cookies keyword argument; see the Scrapy documentation for Request.

Regarding python - how to override/use cookies in Scrapy, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/10667202/
