
python - How to change the USER_AGENT in a Scrapy spider?

Repost · Author: 太空狗 · Updated: 2023-10-30 00:44:09

I wrote a spider to fetch my IP from http://ip.42.pl/raw through a proxy. It is my first spider. I want to change the user agent. I got my information from this tutorial: http://blog.privatenode.in/torifying-scrapy-project-on-ubuntu

I completed all the steps from the tutorial; here is my code.

settings.py

BOT_NAME = 'CheckIP'

SPIDER_MODULES = ['CheckIP.spiders']
NEWSPIDER_MODULE = 'CheckIP.spiders'

USER_AGENT_LIST = [
    'Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3',
    'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
    'Mozilla/5.0 (Linux; U; Android 4.0.3; de-ch; HTC Sensation Build/IML74K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
    'Mozilla/5.0 (Linux; U; Android 2.3; en-us) AppleWebKit/999+ (KHTML, like Gecko) Safari/999.9',
    'Mozilla/5.0 (Linux; U; Android 2.3.5; zh-cn; HTC_IncredibleS_S710e Build/GRJ90) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
]

HTTP_PROXY = 'http://127.0.0.1:8123'

DOWNLOADER_MIDDLEWARES = {
    'CheckIP.middlewares.RandomUserAgentMiddleware': 400,
    'CheckIP.middlewares.ProxyMiddleware': 410,
    'CheckIP.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}

middlewares.py

import random
from scrapy.conf import settings
from scrapy import log


class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        ua = random.choice(settings.get('USER_AGENT_LIST'))
        if ua:
            request.headers.setdefault('User-Agent', ua)
        # this is just to check which user agent is being used for the request
        spider.log(
            u'User-Agent: {} {}'.format(request.headers.get('User-Agent'), request),
            level=log.DEBUG
        )


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = settings.get('HTTP_PROXY')

checkip.py

import time
from scrapy.spider import Spider
from scrapy.http import Request

class CheckIpSpider(Spider):
    name = 'checkip'
    allowed_domains = ["ip.42.pl"]
    url = "http://ip.42.pl/raw"

    def start_requests(self):
        yield Request(self.url, callback=self.parse)

    def parse(self, response):
        now = time.strftime("%c")
        ip = now + "-" + response.body + "\n"
        with open('ips.txt', 'a') as f:
            f.write(ip)

This is what gets logged for the User-Agent:

2015-10-30 22:24:20+0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-10-30 22:24:20+0200 [checkip] DEBUG: User-Agent: Scrapy/0.24.4 (+http://scrapy.org) <GET http://ip.42.pl/raw>

User-Agent: Scrapy/0.24.4 (+http://scrapy.org)

When I add the header manually to the request, everything works.

    def start_requests(self):
        yield Request(self.url, callback=self.parse, headers={"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3"})

This is the console output:

2015-10-30 22:50:32+0200 [checkip] DEBUG: User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3 <GET http://ip.42.pl/raw>

How can I use USER_AGENT_LIST in my spider?

Best Answer

If you don't need a random user agent, you can simply set USER_AGENT in your settings file, for example:

settings.py:

...
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0'
...

and no middleware is needed. But if you really want a randomly chosen user agent, first make sure RandomUserAgentMiddleware is actually being used; check your log for something like this:

Enabled downloader middlewares:
[
...
'CheckIP.middlewares.RandomUserAgentMiddleware',
...
]

Verify that CheckIP.middlewares is the correct path to that middleware.

It's also possible that the settings are being loaded incorrectly in the middleware; I recommend using the from_crawler method to load them:

class RandomUserAgentMiddleware(object):
    def __init__(self, settings):
        self.settings = settings

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        o = cls(settings)
        return o

Then, inside the process_request method, use self.settings.get('USER_AGENT_LIST') to get the list you want.
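Putting those pieces together, here is a minimal runnable sketch of the middleware using from_crawler and self.settings. The StubRequest class is an assumption added here so the random-choice logic can be exercised without Scrapy installed; in a real project, Scrapy passes its own Request objects and calls from_crawler for you:

```python
import random


class StubRequest:
    """Hypothetical stand-in for scrapy.http.Request: only carries headers."""
    def __init__(self):
        self.headers = {}


class RandomUserAgentMiddleware(object):
    def __init__(self, settings):
        # settings is anything with a dict-like .get(); a plain dict works here
        self.user_agent_list = settings.get('USER_AGENT_LIST') or []

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy instantiates the middleware through this hook and
        # passes the crawler, which exposes the project settings
        return cls(crawler.settings)

    def process_request(self, request, spider=None):
        # setdefault keeps a User-Agent that was set explicitly on the request
        if self.user_agent_list:
            request.headers.setdefault(
                'User-Agent', random.choice(self.user_agent_list))


# Simulate what Scrapy would do for one request:
mw = RandomUserAgentMiddleware({'USER_AGENT_LIST': ['UA-1', 'UA-2']})
req = StubRequest()
mw.process_request(req)
print(req.headers['User-Agent'])  # one of 'UA-1' or 'UA-2'
```

Because process_request uses setdefault, a header passed directly to Request (as in the question's workaround) still takes precedence over the random pick.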

Also, please update your Scrapy version; it looks like you are on 0.24, while the 1.0 release is already out.

Regarding python - How to change the USER_AGENT in a Scrapy spider?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33444793/
