python - Scrapy not sending Cookie with POST request

Reposted. Author: 行者123. Updated: 2023-11-28 18:16:17

I am trying to submit a POST request with Scrapy, but it is not sending the Cookie in the request headers.

Setup

Running on OSX. I created a virtualenv and ran pip install Scrapy. Then I created a default spider:

(hotlanesbot)tollspider $ scrapy startproject vai66tolls
(hotlanesbot)tollspider $ cd vai66tolls/
(hotlanesbot)vai66tolls $ scrapy genspider vai66tolls-spider vai66tolls.com

Then I enabled cookie debugging in settings.py:

COOKIES_DEBUG = True

Code

The spider's code is very basic: it parses the site, then POSTs the form and handles the response in parse_eb. Contents of vai66tolls_spider.py:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http.cookies import CookieJar


class Vai66tollsSpiderSpider(scrapy.Spider):
    name = 'vai66tolls-spider'
    allowed_domains = ['vai66tolls.com']
    start_urls = ['http://vai66tolls.com/']

    def parse(self, response):
        filename = "/tmp/body.html"
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

        self.log('Initial Response headers: (%s)' % response.headers)

        # look for "cookie" things in response headers
        poss_cookies = response.headers.getlist('Set-Cookie')
        self.log('Set-Cookie?: (%s)' % poss_cookies)

        poss_cookies = response.headers.getlist('Cookie')
        self.log('Cookie?: (%s)' % poss_cookies)

        poss_cookies = response.headers.getlist('cookie')
        self.log('cookie?: (%s)' % poss_cookies)

        # Parse Eastbound
        r = scrapy.FormRequest.from_response(
            response,
            callback=self.parse_eb,
        )

        yield r

    def parse_eb(self, response):
        filename = "/tmp/eb.txt"
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
        self.log('Request headers: %s' % response.request.headers)
        self.log('Request cookies: %s' % response.request.cookies)

You can view it on GitHub here.

Output

I run the scraper with:

(hotlanesbot)vai66tolls $ scrapy crawl vai66tolls-spider

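In the log output I see the "Received cookies from" DEBUG statement, but not the "Sending cookies to" message that I expected based on the documentation and the CookiesMiddleware.

Here is a larger excerpt of the output: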

2018-01-10 08:50:35 [scrapy.core.engine] INFO: Spider opened
2018-01-10 08:50:35 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-10 08:50:35 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-01-10 08:50:35 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://vai66tolls.com/robots.txt> from <GET http://vai66tolls.com/robots.txt>
2018-01-10 08:50:35 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://vai66tolls.com/robots.txt> (referer: None)
2018-01-10 08:50:35 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://vai66tolls.com/> from <GET http://vai66tolls.com/>
2018-01-10 08:50:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://vai66tolls.com/> (referer: None)
2018-01-10 08:50:35 [vai66tolls-spider] DEBUG: Saved file /tmp/body.html
2018-01-10 08:50:35 [vai66tolls-spider] DEBUG: Initial Response headers: ({'X-Powered-By': ['ASP.NET'], 'X-Aspnet-Version': ['4.0.30319'], 'Server': ['Microsoft-IIS/10.0'], 'Cache-Control': ['private'], 'Date': ['Wed, 10 Jan 2018 13:50:35 GMT'], 'Content-Type': ['text/html; charset=utf-8']})
2018-01-10 08:50:35 [vai66tolls-spider] DEBUG: Set-Cookie?: ([])
2018-01-10 08:50:35 [vai66tolls-spider] DEBUG: Cookie?: ([])
2018-01-10 08:50:35 [vai66tolls-spider] DEBUG: cookie?: ([])
2018-01-10 08:50:35 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <200 https://vai66tolls.com/>
Set-Cookie: ASP.NET_SessionId=im3zxr01stwmr02z0cisggbl; path=/; HttpOnly

2018-01-10 08:50:35 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://vai66tolls.com/> (referer: https://vai66tolls.com/)
2018-01-10 08:50:35 [vai66tolls-spider] DEBUG: Saved file /tmp/eb.txt
2018-01-10 08:50:35 [vai66tolls-spider] DEBUG: Request headers: {'Accept-Language': ['en'], 'Accept-Encoding': ['gzip,deflate'], 'Accept': ['text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], 'User-Agent': ['Scrapy/1.5.0 (+https://scrapy.org)'], 'Referer': ['https://vai66tolls.com/'], 'Content-Type': ['application/x-www-form-urlencoded']}
2018-01-10 08:50:35 [vai66tolls-spider] DEBUG: Request cookies: {}
2018-01-10 08:50:35 [scrapy.core.engine] INFO: Closing spider (finished)

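(Lines not shown indicate that scrapy.downloadermiddlewares.cookies.CookiesMiddleware is included in the downloader middlewares.)

For comparison, if I monitor the initial request with Chrome's developer tools, I see the following response headers: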

cache-control:private
content-length:7289
content-type:text/plain; charset=utf-8
date:Tue, 09 Jan 2018 04:38:57 GMT
server:Microsoft-IIS/10.0
status:200
x-aspnet-version:4.0.30319
x-powered-by:ASP.NET

For the subsequent form POST, the developer tools report these request headers:

:authority:vai66tolls.com
:method:POST
:path:/
:scheme:https
accept:*/*
accept-encoding:gzip, deflate, br
accept-language:en-US,en;q=0.9
cache-control:no-cache
content-length:4480
content-type:application/x-www-form-urlencoded; charset=UTF-8
cookie:ASP.NET_SessionId=up5ygvcjzjalnw2z1r1e0qeg
origin:https://vai66tolls.com
pragma:no-cache
referer:https://vai66tolls.com/
user-agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36
x-microsoftajax:Delta=true
x-requested-with:XMLHttpRequest

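Also, from Chrome I can generate a curl request that works correctly. Using that curl request, I confirmed that removing the Cookie from the headers is enough to prevent the correct response from being returned. In other words, I know there may be other form data that needs to be sent, but without the Cookie it definitely fails.

Questions

  1. Why is Scrapy not including the Cookie in the request headers?
  2. Is there a way to manually get the cookie that Scrapy extracted, so that I can add it to FormRequest.from_response()?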

Best answer

Check that you also have COOKIES_ENABLED set to True in your settings.
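
As a minimal sketch, assuming the settings.py generated by scrapy startproject in the question, the two cookie-related settings would look roughly like this. COOKIES_ENABLED already defaults to True in Scrapy, so setting it explicitly only matters if it was switched off somewhere else in the project.

```python
# settings.py (fragment): cookie handling for the downloader.

# Let CookiesMiddleware store cookies from responses and attach them
# to later requests. This is Scrapy's default value.
COOKIES_ENABLED = True

# Log every Cookie header sent and every Set-Cookie header received.
COOKIES_DEBUG = True
```
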

Regarding the second question: you should be able to extract the cookies from the headers of the Response object:

cookies = response.headers.getlist('Set-Cookie')

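You can now insert them manually into the FormRequest by passing them as arguments to the from_response method. I think it should be possible to use the cookies parameter of the Request object, or to use the headers parameter directly (headers={'Cookie': xxx}).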
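
The suggestion above could be sketched as follows. The helper names cookies_from_set_cookie and cookie_header are hypothetical and not part of Scrapy; they only turn raw Set-Cookie values into the dict expected by the cookies parameter, or into a string for a Cookie header. The sample value is copied from the log excerpt in the question. Note that in that log getlist('Set-Cookie') returned an empty list for the first response, in which case the helper simply returns an empty dict.

```python
def cookies_from_set_cookie(header_values):
    """Build a {name: value} dict from raw Set-Cookie header values.

    Only the leading name=value pair of each header is kept; attributes
    such as path or HttpOnly are dropped.
    """
    cookies = {}
    for raw in header_values:
        # Scrapy returns header values as bytes, so decode them first.
        text = raw.decode("latin-1") if isinstance(raw, bytes) else raw
        pair = text.split(";", 1)[0]
        name, sep, value = pair.partition("=")
        if sep:  # skip anything that is not a name=value pair
            cookies[name.strip()] = value.strip()
    return cookies


def cookie_header(cookies):
    """Serialize a cookie dict into a Cookie request header value."""
    return "; ".join("%s=%s" % (name, value) for name, value in cookies.items())


# Sample value copied from the log excerpt in the question.
sample = [b"ASP.NET_SessionId=im3zxr01stwmr02z0cisggbl; path=/; HttpOnly"]
cookies = cookies_from_set_cookie(sample)
print(cookies)
print(cookie_header(cookies))

# Inside the spider's parse() this could then be used as (untested sketch):
#   cookies = cookies_from_set_cookie(response.headers.getlist('Set-Cookie'))
#   yield scrapy.FormRequest.from_response(
#       response, cookies=cookies, callback=self.parse_eb)
```

Passing the dict through cookies keeps CookiesMiddleware in charge of the header, while headers={'Cookie': cookie_header(cookies)} sets the header by hand; which one the site accepts would need to be checked against the actual responses.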

For python - Scrapy not sending Cookie with POST request, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48161575/
