
An Example of Faking Random Request Headers for a Crawler in Pyspider

Reposted. Author: qq735679552. Updated: 2022-09-28 22:32:09


Pyspider uses the tornado library to make its HTTP requests, and each request can carry various parameters, such as the connection timeout (connect_timeout), the data-transfer timeout (timeout), and the request headers. In pyspider's stock framework, however, these parameters can only be supplied to the crawler through the crawl_config Python dict (shown below); the framework code converts this dict into task data when it issues the HTTP request. The drawback is that this makes it inconvenient to give every single request a random header.

crawl_config = {
"user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
"timeout": 120,
"connect_timeout": 60,
"retries": 5,
"fetch_type": 'js',
"auto_recrawl": True,
}

Here is how to give the crawler random request headers:

1. Write the following script, place it in pyspider's libs folder, and name it headers_switch.py (matching the import in step 2).

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Created on 2017-10-18 11:52:26
import random
class HeadersSelector(object):
   """
   Header 中缺少几个字段 Host 和 Cookie
   """
   headers_1 = {
     "Proxy-Connection": "keep-alive",
     "Pragma": "no-cache",
     "Cache-Control": "no-cache",
     "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
     "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
     "DNT": "1",
     "Accept-Encoding": "gzip, deflate, sdch",
     "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4",
     "Referer": "https://www.baidu.com/s?wd=%BC%96%E7%A0%81&rsv_spt=1&rsv_iqid=0x9fcbc99a0000b5d7&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&rqlang=cn&tn=baiduhome_pg&rsv_enter=0&oq=If-None-Match&inputT=7282&rsv_t",
     "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7",
   } # browser headers found online
   headers_2 = {
     "Proxy-Connection": "keep-alive",
     "Pragma": "no-cache",
     "Cache-Control": "no-cache",
     "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0",
     "Accept": "image/gif,image/x-xbitmap,image/jpeg,application/x-shockwave-flash,application/vnd.ms-excel,application/vnd.ms-powerpoint,application/msword,*/*",
     "DNT": "1",
     "Referer": "https://www.baidu.com/link?url=c-FMHf06-ZPhoRM4tWduhraKXhnSm_RzjXZ-ZTFnPAvZN",
     "Accept-Encoding": "gzip, deflate, sdch",
     "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4",
   } # Windows 7 browser
   headers_3 = {
     "Proxy-Connection": "keep-alive",
     "Pragma": "no-cache",
     "Cache-Control": "no-cache",
     "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
     "Accept": "image/x-xbitmap,image/jpeg,application/x-shockwave-flash,application/vnd.ms-excel,application/vnd.ms-powerpoint,application/msword,*/*",
     "DNT": "1",
     "Referer": "https://www.baidu.com/s?wd=http%B4%20Pragma&rsf=1&rsp=4&f=1&oq=Pragma&tn=baiduhome_pg&ie=utf-8&usm=3&rsv_idx=2&rsv_pq=e9bd5e5000010",
     "Accept-Encoding": "gzip, deflate, sdch",
     "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.7,en;q=0.6",
   } # Firefox on Linux
   headers_4 = {
     "Proxy-Connection": "keep-alive",
     "Pragma": "no-cache",
     "Cache-Control": "no-cache",
     "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0",
     "Accept": "*/*",
     "DNT": "1",
     "Referer": "https://www.baidu.com/link?url=c-FMHf06-ZPhoRM4tWduhraKXhnSm_RzjXZ-ZTFnP",
     "Accept-Encoding": "gzip, deflate, sdch",
     "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.7,en;q=0.6",
   } # Firefox on Windows 10
   headers_5 = {
     "Connection": "keep-alive",
     "Pragma": "no-cache",
     "Cache-Control": "no-cache",
     "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64;) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063",
     "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
     "Referer": "https://www.baidu.com/link?url=c-FMHf06-ZPhoRM4tWduhraKXhnSm_RzjXZ-",
     "Accept-Encoding": "gzip, deflate, sdch",
     "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.7,en;q=0.6",
     "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7",
   } # Edge on Windows 10 (the UA string ends with Edge/15.15063)
   headers_6 = {
     "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
     "Accept-Encoding": "gzip, deflate, sdch",
     "Accept-Language": "zh-CN,zh;q=0.8",
     "Pragma": "no-cache",
     "Cache-Control": "no-cache",
     "Connection": "keep-alive",
     "DNT": "1",
     "Referer": "https://www.baidu.com/s?wd=If-None-Match&rsv_spt=1&rsv_iqid=0x9fcbc99a0000b5d7&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&rq",
     "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7",
     "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0",
   } # Windows 10 browser
   def __init__(self):
      pass

   def select_header(self):
      # Pick one of the six header dicts at random
      n = random.randint(1, 6)
      switch = {
        1: self.headers_1,
        2: self.headers_2,
        3: self.headers_3,
        4: self.headers_4,
        5: self.headers_5,
        6: self.headers_6,
      }
      return switch[n]

I wrote only six header sets here; if the crawl volume is very large, you can write many more, even hundreds, and then widen random's range to choose among them.
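With a large pool, maintaining a numbered switch dict gets tedious. A minimal alternative sketch (same idea, no range bookkeeping): gather the header dicts in a list and let random.choice pick one. The abbreviated header dicts below are placeholders for the full ones above.

import random

class HeadersSelector(object):
    # Abbreviated pool; in practice paste the full header dicts from above.
    headers_1 = {"User-Agent": "Mozilla/5.0 ... Chrome/52.0.2743.116 Safari/537.36"}
    headers_2 = {"User-Agent": "Mozilla/5.0 ... Firefox/55.0"}

    def select_header(self):
        # Collect every class attribute named headers_* and pick one at
        # random, so adding a new dict automatically widens the pool.
        pool = [v for name, v in vars(HeadersSelector).items()
                if name.startswith("headers_")]
        return random.choice(pool)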

2. In your pyspider script, write the following code:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2017-08-18 11:52:26
from pyspider.libs.base_handler import *
from pyspider.libs.headers_switch import HeadersSelector
import sys
defaultencoding = 'utf-8'
if sys.getdefaultencoding() != defaultencoding:
   reload(sys)
   sys.setdefaultencoding(defaultencoding)
class Handler(BaseHandler):
   crawl_config = {
     "user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
     "timeout": 120,
     "connect_timeout": 60,
     "retries": 5,
     "fetch_type": 'js',
     "auto_recrawl": True,
   }
   @every(minutes=24 * 60)
   def on_start(self):
     header_slt = HeadersSelector()
      header = header_slt.select_header() # get a fresh header
     # header["X-Requested-With"] = "XMLHttpRequest"
     orig_href = 'http://sww.bjxch.gov.cn/gggs.html'
     self.crawl(orig_href,
           callback=self.index_page,
            headers=header) # headers must be passed inside crawl(); cookies can be read from response.cookies
   @config(age=24 * 60 * 60)
   def index_page(self, response):
     header_slt = HeadersSelector()
      header = header_slt.select_header() # get a fresh header
     # header["X-Requested-With"] = "XMLHttpRequest"
      if response.cookies:
        # Serialize the response cookies into a standard Cookie header
        header["Cookie"] = "; ".join("%s=%s" % (k, v) for k, v in response.cookies.items())

The crucial part is that in every callback (on_start, index_page, and so on), a header selector is instantiated on each call, so every request gets a different header. Take note of the following code that gets added:

header_slt = HeadersSelector()
header = header_slt.select_header() # get a fresh header
# header["X-Requested-With"] = "XMLHttpRequest"
header["Host"] = "www.baidu.com"
if response.cookies:
   header["Cookie"] = "; ".join("%s=%s" % (k, v) for k, v in response.cookies.items())

When an Ajax request is sent via XHR, it carries this header, and servers often use it to decide whether a request is Ajax; you need to add {'X-Requested-With': 'XMLHttpRequest'} to the headers before such content can be fetched.
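For instance, a minimal sketch inside a callback (ajax_url and parse_ajax are hypothetical names for illustration):

header = HeadersSelector().select_header()
header["X-Requested-With"] = "XMLHttpRequest"  # mark the request as Ajax
self.crawl(ajax_url, callback=self.parse_ajax, headers=header)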

Once the url is fixed, the Host field of the request header is fixed too; add it as needed. The urlparse package provides functions for parsing the host out of a url; just read the netloc attribute.
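A minimal sketch (using Python 2's urlparse, in keeping with the script above; the header dict here is just a stand-in):

from urlparse import urlparse  # on Python 3: from urllib.parse import urlparse

url = 'http://sww.bjxch.gov.cn/gggs.html'
header = {}  # stand-in for a dict returned by HeadersSelector().select_header()
header["Host"] = urlparse(url).netloc  # 'sww.bjxch.gov.cn'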

If the response carries cookies, add them to the request header as well.

If you have other disguising needs, add them yourself.

That is all it takes to get random request headers. Done.

That's everything in this post on faking random request headers in Pyspider. I hope it gives everyone a useful reference, and I hope you'll keep supporting me.

Original article: https://blog.csdn.net/dongrixinyu/article/details/78410282

