
Python Scrapy FormRequest callback not firing

Reposted. Author: 行者123. Updated: 2023-11-30 23:01:30

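I am writing a Python script with Scrapy to scrape a website that has a login page. I am trying to fill in the form with Scrapy's FormRequest.from_response, but it does not work. I don't know why, but it looks like the callback declared in from_response is never called.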

My spider code is as follows:

from scrapy import FormRequest, Request
from scrapy.spiders import CrawlSpider


class user_scrape(CrawlSpider):
    name = "spyder"
    allowed_domains = ["domain.tld"]
    start_urls = [
        "http://domain.tld/page1",
        "http://domain.tld/page2"
    ]

    login_user = "username"
    login_pass = "secret"
    login_page = "http://domain.tld/login.php"

    def start_requests(self):
        # Fetch the login page first instead of the start URLs.
        yield Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True,
        )

    def login(self, response):
        print("----- LOGIN -----")
        # Fill in and submit the login form found in the response.
        return FormRequest.from_response(
            response,
            formname='form_login',
            formdata={
                'username': self.login_user,
                'password': self.login_pass,
                'cookietime': 'on',
            },
            callback=self.check_login_response,
        )

    def check_login_response(self, response):
        print(response.url)
        print(response.body)

        # After logging in, request the pages that should be scraped.
        return [Request(url=url) for url in self.start_urls]

    def parse(self, response):
        print(response.url)

When I run the spider, it prints "LOGIN" and then seems to stop. It never enters check_login_response, which is where it should continue.

The spider log is as follows:

2016-01-21 16:34:23 [scrapy] INFO: Scrapy 1.0.4 started (bot: UsersScrape)
2016-01-21 16:34:23 [scrapy] INFO: Optional features available: ssl, http11
2016-01-21 16:34:23 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'UsersScrape.spiders', 'SPIDER_MODULES': ['UsersScrape.spiders'], 'RETRY_TIMES': 5, 'BOT_NAME': 'UsersScrape', 'RETRY_HTTP_CODES': [400, 408, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530], 'DOWNLOAD_DELAY': 1, 'USER_AGENT': 'Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0'}
2016-01-21 16:34:24 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-01-21 16:34:24 [scrapy] INFO: Enabled downloader middlewares: RetryMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-01-21 16:34:24 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-01-21 16:34:24 [scrapy] INFO: Enabled item pipelines:
2016-01-21 16:34:24 [scrapy] INFO: Spider opened
2016-01-21 16:34:24 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-21 16:34:24 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-21 16:34:24 [scrapy] DEBUG: Crawled (200) <GET http://domain.tld/login.php?> (referer: None)
----- LOGIN -----
2016-01-21 16:34:25 [scrapy] DEBUG: Redirecting (302) to <GET http://domain.tld.com/> from <POST http://domain.tld/takelogin.php>
2016-01-21 16:34:27 [scrapy] DEBUG: Redirecting (302) to <GET http://domain.tld/> from <GET http://domain.tld/>
2016-01-21 16:34:27 [scrapy] DEBUG: Filtered duplicate request: <GET http://domain.tld/>
2016-01-21 16:34:27 [scrapy] INFO: Closing spider (finished)
2016-01-21 16:34:27 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1261,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 2,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 3877,
'downloader/response_count': 3,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/302': 2,
'dupefilter/filtered': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 1, 21, 15, 34, 27, 101000),
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2016, 1, 21, 15, 34, 24, 238000)}
2016-01-21 16:34:27 [scrapy] INFO: Spider closed (finished)

The HTML code of the form is:

<form method="post" name="login_form" action="takelogin.php" onsubmit="return startLoginVerify();">
<table id="login_form" border="0" cellpadding=5>
<tr>
<td colspan="2" align="right">
<img style="cursor:pointer;" onClick="close_login_box();" src="pic/close.gif" align="right">
</td>
</tr>
<tr>
<td class=rowhead style="padding-left:25px;">User:</td>
<td align=left style="padding-right:25px;">
<input type="text" size=30 name="username" id="navbar_login_menu_input_to_focus_on" />
</td>
</tr>
<tr>
<td class=rowhead>Password:</td>
<td align=left><input type="password" size=30 name="password" /></td>
</tr>
....
</table>
</form>

I have checked the FormRequest guide, but I could not find any difference that would explain why my form does not work.

Thank you for your time and help!

Best answer

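The log shows that the request is being dropped by the duplicate filter because the same URL is requested twice (the second request is exactly identical to the first).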
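To see this in the log without reading it line by line, the relevant entries can be pulled out with a small script. This is only a sketch: the helper name summarize and the two regular expressions are illustrative assumptions based on the three DEBUG lines quoted in the question, not part of Scrapy.

```python
import re

# The three relevant DEBUG lines, copied from the log in the question.
LOG = """\
2016-01-21 16:34:25 [scrapy] DEBUG: Redirecting (302) to <GET http://domain.tld.com/> from <POST http://domain.tld/takelogin.php>
2016-01-21 16:34:27 [scrapy] DEBUG: Redirecting (302) to <GET http://domain.tld/> from <GET http://domain.tld/>
2016-01-21 16:34:27 [scrapy] DEBUG: Filtered duplicate request: <GET http://domain.tld/>
"""

# Hypothetical patterns matching the message shapes seen above.
REDIRECT = re.compile(r"Redirecting \((\d+)\) to <(\S+) (\S+)> from <(\S+) (\S+)>")
FILTERED = re.compile(r"Filtered duplicate request: <(\S+) (\S+)>")


def summarize(log_text):
    """Collect redirect hops and filtered requests from a log excerpt."""
    redirects, filtered = [], []
    for line in log_text.splitlines():
        match = REDIRECT.search(line)
        if match:
            _status, to_method, to_url, from_method, from_url = match.groups()
            redirects.append((from_method, from_url, to_method, to_url))
            continue
        match = FILTERED.search(line)
        if match:
            filtered.append(match.groups())
    return redirects, filtered


redirects, filtered = summarize(LOG)
# A hop whose source and target are identical is the one the filter drops.
self_redirects = [r for r in redirects if (r[0], r[1]) == (r[2], r[3])]
print(self_redirects)
print(filtered)
```

In this excerpt the only self-redirect is GET http://domain.tld/ pointing back to itself, and that is the same request reported as a filtered duplicate. Once it is dropped nothing is left in the queue, so the spider closes without ever calling check_login_response.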

Try setting dont_filter=True on the login request:

FormRequest.from_response(
    response,
    formname='form_login',
    formdata={
        'username': self.login_user,
        'password': self.login_pass,
        'cookietime': 'on',
    },
    callback=self.check_login_response,
    dont_filter=True,
)
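Why a flag on the POST helps with a filtered GET can be illustrated with a toy model. The sketch below does not use Scrapy at all: Req, follow_redirect, schedule and run_login are hypothetical names, and the model assumes that a redirected request is a copy of the original that keeps its dont_filter value, and that requests with dont_filter=True skip the duplicate check.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Req:
    """Toy stand-in for a request; not Scrapy's Request class."""
    method: str
    url: str
    dont_filter: bool = False


def follow_redirect(req, location):
    # Assumption: the redirected request is a copy of the original with a
    # new URL and method GET, so dont_filter carries over unchanged.
    return replace(req, method="GET", url=location)


def schedule(req, seen):
    """Return True if the request would be scheduled, False if dropped."""
    if req.dont_filter:
        return True  # skips the duplicate check entirely
    key = (req.method, req.url)
    if key in seen:
        return False
    seen.add(key)
    return True


def run_login(dont_filter):
    """Replay the login POST and its two redirects from the question's log."""
    seen = set()
    post = Req("POST", "http://domain.tld/takelogin.php", dont_filter)
    outcome = [schedule(post, seen)]
    first = follow_redirect(post, "http://domain.tld/")
    outcome.append(schedule(first, seen))
    # The site answers GET / with a 302 back to GET /, as in the log.
    second = follow_redirect(first, "http://domain.tld/")
    outcome.append(schedule(second, seen))
    return outcome


print(run_login(dont_filter=False))  # [True, True, False]
print(run_login(dont_filter=True))   # [True, True, True]
```

Under these assumptions the second GET is dropped without the flag and goes through with it, which matches the behaviour described in the answer. The trade-off is that such a request is no longer protected against genuine repeats, so it is probably best to set the flag only on the login request rather than on every request.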

Regarding "Python Scrapy FormRequest callback not firing", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/34928190/
