python - Using the "allow" keyword in Scrapy's LinkExtractor

Reposted by: 太空宇宙. Last updated: 2023-11-03 16:22:06


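I am trying to scrape the website http://www.funda.nl/koop/amsterdam/, which lists houses for sale in Amsterdam. The main page contains many links, some of which point to individual houses for sale. Ultimately I want to follow those links and extract data from them.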

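As a first step, I am simply trying to list the links that correspond to individual houses. I noticed that their URLs contain "huis-" followed by an 8-digit code, for example http://www.funda.nl/koop/amsterdam/huis-49801910-claus-van-amsbergstraat-86/. I want to use the regular expression r'huis-\d{8}' to match that subset of URLs.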
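To make the intended match concrete, here is a minimal sketch using only Python's standard `re` module. The helper name `is_house_url` is hypothetical and used for illustration only; the pattern and the example URL are the ones quoted above.

```python
import re

# The pattern from the question: "huis-" followed by exactly 8 digits.
HOUSE_PATTERN = re.compile(r'huis-\d{8}')


def is_house_url(url):
    """Return True if the pattern occurs anywhere in the URL (hypothetical helper)."""
    return HOUSE_PATTERN.search(url) is not None


example = "http://www.funda.nl/koop/amsterdam/huis-49801910-claus-van-amsbergstraat-86/"
print(is_house_url(example))                                # True
print(is_house_url("http://www.funda.nl/koop/amsterdam/"))  # False
```

Note that `search` looks for the pattern anywhere in the string, so the listing page itself is rejected while any URL containing the house code is accepted.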

I am trying to do this with Scrapy's LinkExtractor, but it does not seem to work. The spider I wrote is as follows:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from Funda.items import FundaItem
from scrapy.shell import inspect_response

class FundaSpider(CrawlSpider):
    name = "Funda"
    allowed_domains = ["funda.nl"]
    start_urls = ["http://www.funda.nl/koop/amsterdam/"]

    le1 = LinkExtractor()
    rules = (
        Rule(LinkExtractor(allow=r'huis-\d{8}'), callback='parse_item'),
    )

    def parse_item(self, response):
        links = self.le1.extract_links(response)
        for link in links:
            item = FundaItem()
            item['url'] = link.url
            print("The item is "+str(item))
            yield item

From the main project directory, if I run scrapy crawl Funda -o funda.json, the resulting funda.json file starts with the following lines:

[
{"url": "http://www.funda.nl/cookiebeleid/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49728947-emmy-andriessestraat-374/ufsavqdqfvxyerrvff.html"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49728947-emmy-andriessestraat-374/"},
{"url": "http://www.funda.nl/koop/"},
{"url": "https://www.funda.nl/mijn/login/?ReturnUrl=%2Fkoop%2Famsterdam%2Fhuis-49728947-emmy-andriessestraat-374%2F"},
{"url": "https://www.funda.nl/mijn/aanmelden/?ReturnUrl=%2Fkoop%2Famsterdam%2Fhuis-49728947-emmy-andriessestraat-374%2F"},
{"url": "http://www.funda.nl/language/switchlanguage/?language=en&returnUrl=%2Fkoop%2Famsterdam%2Fhuis-49728947-emmy-andriessestraat-374%2F"},
{"url": "https://help.funda.nl/hc/nl/categories/200207038"},
{"url": "http://www.funda.nl/koop/amsterdam/"},

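As you can see, it contains several links that do not contain "huis-" or an 8-digit code. How can I filter this down to only the "real" links that point to houses?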
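As a quick check of how the pattern behaves on the sample output above, the following sketch applies it to those nine URLs with the standard `re` module. It also tries a stricter, anchored pattern; that second pattern (`STRICT_PATTERN`) is an assumption added for illustration and is not part of the original question.

```python
import re

# The nine URLs listed in the sample funda.json output above.
SAMPLE_URLS = [
    "http://www.funda.nl/cookiebeleid/",
    "http://www.funda.nl/koop/amsterdam/huis-49728947-emmy-andriessestraat-374/ufsavqdqfvxyerrvff.html",
    "http://www.funda.nl/koop/amsterdam/huis-49728947-emmy-andriessestraat-374/",
    "http://www.funda.nl/koop/",
    "https://www.funda.nl/mijn/login/?ReturnUrl=%2Fkoop%2Famsterdam%2Fhuis-49728947-emmy-andriessestraat-374%2F",
    "https://www.funda.nl/mijn/aanmelden/?ReturnUrl=%2Fkoop%2Famsterdam%2Fhuis-49728947-emmy-andriessestraat-374%2F",
    "http://www.funda.nl/language/switchlanguage/?language=en&returnUrl=%2Fkoop%2Famsterdam%2Fhuis-49728947-emmy-andriessestraat-374%2F",
    "https://help.funda.nl/hc/nl/categories/200207038",
    "http://www.funda.nl/koop/amsterdam/",
]

# Pattern from the question: matches anywhere in the URL, including query strings.
LOOSE_PATTERN = re.compile(r'huis-\d{8}')

# Hypothetical stricter pattern (an assumption): the house path must end the URL.
STRICT_PATTERN = re.compile(r'/koop/amsterdam/huis-\d{8}-[^/]+/$')

loose_matches = [u for u in SAMPLE_URLS if LOOSE_PATTERN.search(u)]
strict_matches = [u for u in SAMPLE_URLS if STRICT_PATTERN.search(u)]

print(len(loose_matches))   # URLs containing the house code anywhere
print(strict_matches)       # URLs that are exactly a house page
```

The loose pattern also matches the login, sign-up, and language-switch links, because the house code appears inside their ReturnUrl query string. Whether that matters depends on what the crawl should collect.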

Best answer

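The problem is that the regular expression appears in the definition of rules but not in the definition of le1. The Rule only decides which pages are followed and passed to parse_item; inside parse_item, le1 is an unfiltered LinkExtractor(), so it returns every link found on each house page. Adding allow=r'huis-\d{8}' to the definition of le1 gives the expected output.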

This article about python - Using the "allow" keyword in Scrapy's LinkExtractor is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/38351744/
