
python - Scrapy spider bypasses my deny rules


Hi, I'm trying to use CrawlSpider, and I created my own deny rules:

class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["careers-cooperhealth.icims.com"]
    start_urls = ["careers-cooperhealth.icims.com"]
    d = [0-9]
    path_deny_base = ['.(login)', '.(intro)', '(candidate)', '(referral)', '(reminder)', '(/search)',]
    rules = (Rule(SgmlLinkExtractor(deny=path_deny_base,
                                    allow=('careers-cooperhealth.icims.com/jobs/…;*')),
                  callback="parse_items",
                  follow=True),)

My spider still crawls pages like https://careers-cooperhealth.icims.com/jobs/22660/registered-nurse-prn/login, even though pages containing login should not be crawled. What is going wrong?

Best Answer

Just change it like this (no dots and no parentheses):

deny = ['login', 'intro', 'candidate', 'referral', 'reminder', 'search']
allow = ['jobs']

rules = (Rule(SgmlLinkExtractor(deny=deny,
                                allow=allow,
                                restrict_xpaths=('*')),
              callback="parse_items",
              follow=True),)

This means the extracted links must not contain login, intro, etc., and only links that contain jobs are extracted.
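Both deny and allow are lists of regular expressions that the link extractor matches anywhere inside each extracted URL: roughly, a link is kept only if it matches some allow pattern and matches no deny pattern. Here is a minimal sketch of that filtering logic; the check_url helper is just for illustration and is not part of Scrapy's API:

import re

deny = ['login', 'intro', 'candidate', 'referral', 'reminder', 'search']
allow = ['jobs']

def check_url(url):
    # Drop the link if any deny pattern matches anywhere in the URL.
    if any(re.search(p, url) for p in deny):
        return False
    # Otherwise keep it only if some allow pattern matches.
    return any(re.search(p, url) for p in allow)

# Denied: the URL contains 'login', even though it also contains 'jobs'.
print(check_url("https://careers-cooperhealth.icims.com/jobs/22660/registered-nurse-prn/login"))  # False
# Allowed: it contains 'jobs' and none of the deny words.
print(check_url("https://careers-cooperhealth.icims.com/jobs/22660/registered-nurse-prn"))        # True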

Here is the complete spider code, which crawls the link https://careers-cooperhealth.icims.com/jobs/intro?hashed=0 and prints "YAHOO!":

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule


class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["careers-cooperhealth.icims.com"]
    start_urls = ["https://careers-cooperhealth.icims.com"]

    deny = ['login', 'intro', 'candidate', 'referral', 'reminder', 'search']
    allow = ['jobs']

    rules = (Rule(SgmlLinkExtractor(deny=deny,
                                    allow=allow,
                                    restrict_xpaths=('*')),
                  callback="parse_items",
                  follow=True),)

    def parse_items(self, response):
        print "YAHOO!"

Hope this helps.

Regarding python - Scrapy spider bypasses my deny rules, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/18482813/
