python - Scrapy Python spider unable to find links using LinkExtractor or a manual Request()

Reposted by: 行者123  Updated: 2023-12-01 05:03:03

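I am trying to write a Scrapy spider that crawls all the results pages on the domain https://www.ghcjobs.apply2jobs.com... . The code should do three things:

(1) Crawl all pages 1-1000. The pages are identical except for the last part of the URL: &CurrentPage=#.

(2) Follow every link in the results table that contains job postings, where the link's class = SearchResult. These are the only links in the table, so I am not having any trouble here.

(3) Store the information shown on the job description page in key:value JSON format. (This part works, in a rudimentary way.)

I have worked with Scrapy and CrawlSpiders before, using the 'rule = [Rule(LinkExtractor(allow=' method to recursively parse a page and find all links that match a given regex pattern. I am currently stuck on step 1, crawling through the thousand-odd results pages.

Below is my spider code: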

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http.request import Request
from scrapy.contrib.linkextractors import LinkExtractor
from genesisSpider.items import GenesisJob

class genesis_crawl_spider(CrawlSpider):
    name = "genesis"
    #allowed_domains = ['http://www.ghcjobs.apply2jobs.com']
    start_urls = ['https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1']

    #allow &CurrentPage= up to 1000, currently ~ 512
    rules = [Rule(LinkExtractor(allow=("^https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm\?fuseaction=mExternal.returnToResults&CurrentPage=[1-1000]$")), 'parse_inner_page')]

    def parse_inner_page(self, response):
        self.log('===========Entrered Inner Page============')
        self.log(response.url)
        item = GenesisJob()
        item['url'] = response.url

        yield item

Here is the spider's output, with some of the startup log lines above it truncated:

2014-09-02 16:02:48-0400 [genesis] DEBUG: Crawled (200) <GET https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1> (referer: None) ['partial']
2014-09-02 16:02:48-0400 [genesis] DEBUG: Crawled (200) <GET https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?CurrentPage=1&fuseaction=mExternal.returnToResults> (referer: https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1) ['partial']
2014-09-02 16:02:48-0400 [genesis] DEBUG: ===========Entrered Inner Page============
2014-09-02 16:02:48-0400 [genesis] DEBUG: https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?CurrentPage=1&fuseaction=mExternal.returnToResults
2014-09-02 16:02:48-0400 [genesis] DEBUG: Scraped from <200 https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?CurrentPage=1&fuseaction=mExternal.returnToResults>
        {'url': 'https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?CurrentPage=1&fuseaction=mExternal.returnToResults'}
2014-09-02 16:02:48-0400 [genesis] INFO: Closing spider (finished)
2014-09-02 16:02:48-0400 [genesis] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 930,
         'downloader/request_count': 2,
         'downloader/request_method_count/GET': 2,
         'downloader/response_bytes': 92680,
         'downloader/response_count': 2,
         'downloader/response_status_count/200': 2,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2014, 9, 2, 20, 2, 48, 611000),
         'item_scraped_count': 1,
         'log_count/DEBUG': 7,
         'log_count/INFO': 7,
         'request_depth_max': 1,
         'response_received_count': 2,
         'scheduler/dequeued': 2,
         'scheduler/dequeued/memory': 2,
         'scheduler/enqueued': 2,
         'scheduler/enqueued/memory': 2,
         'start_time': datetime.datetime(2014, 9, 2, 20, 2, 48, 67000)}
2014-09-02 16:02:48-0400 [genesis] INFO: Spider closed (finished)

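Currently, I am stuck on objective (1) of this project. As you can see, my spider only crawls the start_url page. My regex should be targeting the page navigation buttons correctly, since I have tested it. My callback function, parse_inner_page, is working, as shown by the debugging comments I inserted, but only on the first page. Am I using 'Rule' incorrectly? I was thinking that maybe the page being HTTPS was somehow to blame...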
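A side note on the allow pattern in the rule above: in a regular expression, [1-1000] is a character class that matches a single character, '0' or '1', not the numeric range 1 to 1000. The sketch below checks this with Python's standard re module. The page_url helper and the \d{1,4} alternative are illustrative assumptions, not part of the original spider, and this is separate from the start-URL problem described in the update further down.

```python
import re

# Prefix copied from the spider's allow pattern (dots left unescaped, as in the original).
PREFIX = (r"^https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm"
          r"\?fuseaction=mExternal.returnToResults&CurrentPage=")

# Pattern as written in the spider: [1-1000] is a character class, not a numeric range.
original = re.compile(PREFIX + r"[1-1000]$")

# Hypothetical alternative: accept any page number of one to four digits.
digits = re.compile(PREFIX + r"\d{1,4}$")


def page_url(n):
    """Build a results-page URL for page n (illustrative helper)."""
    return ("https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm"
            "?fuseaction=mExternal.returnToResults&CurrentPage=%d" % n)


# Show, for a few page numbers, which pattern accepts the URL.
for n in (1, 2, 10, 512):
    url = page_url(n)
    print(n, bool(original.search(url)), bool(digits.search(url)))
```

With the pattern as written, only CurrentPage=0 and CurrentPage=1 can match, so even on a fully populated results page the rule would not follow links to page 2 and beyond.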

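As a way to tinker toward a solution, I tried using a manual Request to fetch the second page of results; this didn't work. Here is the code for that too.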

Request("https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=2",  callback = 'parse_inner_page')

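Can anyone provide any guidance? Is there perhaps a better way to do this? I have been researching this on SO and in the Scrapy documentation since Friday. Thank you very much.

Update: I have resolved the issue. The problem was with the start URL I was using.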

start_urls = ['https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1']

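leads to a post-form-submission page, which is the result of clicking the "Search" button on the search page. That click runs JavaScript on the client side to submit a form to the server, which then returns the full job board (pages 1-512). However, there is another hard-coded URL that apparently calls the server without needing any client-side JavaScript. So now my start URL is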

start_urls = ['https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.searchJobs']

Everything is back on track! In future, check whether there is a JavaScript-independent URL for calling server resources.

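Best answer

Are you sure Scrapy sees the web page the same way you do? Nowadays more and more websites are built with JavaScript and Ajax, and that dynamic content may need a fully functional browser to be populated completely. Neither Nutch nor Scrapy can handle this out of the box.

First of all, you need to make sure the page content you are interested in can be retrieved by Scrapy. There are several ways to do that. I usually use urllib2 and beautifulsoup4 for a quick try. Your start page failed my test.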

$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> url = "https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1"

>>> html = urllib2.urlopen(url).read()
>>> soup = BeautifulSoup(html)
>>> table = soup.find('div', {'id':'VESearchResults'})
>>> table.text
u'\n\n\n\r\n\t\t\tJob Title\xa0\r\n\t\t\t\r\n\t\t\n\r\n\t\t\tArea of Interest\xa0\r\n\t\t\t\r\n\t\t\n\r\n\t\t\tLocation\xa0\r\n\t\t\t\r\n\t\t\n\r\n\t\t\tState\xa0\r\n\t\t\t\r\n\t\t\n\r\n\t\t\tCity\xa0\r\n\t\t\t\r\n\t\t\n\n\n\r\n\t\t\t\t\tNo results matching your criteria.\r\n\t\t\t\t\n\n\n'
>>>

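As you can see, "No results matching your criteria." I think you need to figure out why the content is not being populated. Cookies? POST instead of GET? User agent? And so on.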
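The same check can be scripted so that each hypothesis (for example, a different User-Agent) is tried in turn. This is a minimal Python 3 sketch using only the standard library. The helper names fetch and looks_empty are made up for illustration, and the marker string is the message seen in the session above. It only detects the empty-results page and does not by itself explain why the page is empty.

```python
import urllib.request

# Message the server returned in the interactive session above.
NO_RESULTS_MARKER = "No results matching your criteria"


def looks_empty(html):
    """Return True if the fetched HTML contains the 'no results' message."""
    return NO_RESULTS_MARKER in html


def fetch(url, user_agent=None, timeout=30):
    """Fetch url with a plain GET, optionally sending a custom User-Agent."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

For example, calling looks_empty(fetch(url)) and then looks_empty(fetch(url, user_agent="Mozilla/5.0")) on the start URL would show whether the User-Agent header alone changes what the server returns.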

In addition, you can use the scrapy parse command to help you debug. For example, I use this command a lot.

scrapy parse http://example.com --rules

Some of the other scrapy commands, and perhaps Selenium, may help as well.

Here I am running scrapy shell in IPython to inspect your start URL. The first record I can see in my browser contains Englewood, but it does not exist in the HTML that Scrapy grabbed.

Update:

What you are doing is a fairly trivial scraping job, and you don't really need Scrapy; it is a bit of overkill. Here are my suggestions:

Look at Selenium (I assume you write Python), and eventually make it headless when you try to run it on a server.
You can also do this with PhantomJS, a much lighter-weight JavaScript executor that gets the job done. Another Stack Overflow question may also help.
There are a few other resources you can draw on.

This post on "python - Scrapy Python spider unable to find links using LinkExtractor or a manual Request()" is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/25631815/
