
python - Scrapy returns a different number of urls on each run


I have built a crawler for a fixed domain that extracts URLs matching a fixed regular expression. When it sees a specific kind of URL, the crawler follows that link. The crawler extracts the URLs perfectly well, but every time I run it, it returns a different number of links, i.e. the link count differs from run to run. I am scraping with Scrapy. Is this a Scrapy issue? The code is:

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule


class MySpider(CrawlSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    start_urls = ["http://www.xyz.nl/Vacancies"]
    rules = (
        # Extract vacancy detail pages and hand them to parse_item().
        Rule(SgmlLinkExtractor(allow=[r'\/V-\d{7}\/[\w\S]+']), callback='parse_item'),
        # Follow pagination links without a callback.
        Rule(SgmlLinkExtractor(allow=[r'\?page\=\d+\&sortCriteria\=1']), follow=True),
    )

    def parse_item(self, response):
        # Append each matched url to a text file (the handle is never closed explicitly).
        outputfile = open('urllist.txt', 'a')
        print response.url
        outputfile.write(response.url + '\n')

Best Answer

Instead of writing the links out by hand and opening the file in append ('a') mode inside the parse_item() method, use Scrapy's built-in item exporters. Define an item with a url field:

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field


class MyItem(Item):
    url = Field()


class MySpider(CrawlSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    start_urls = ["http://www.xyz.nl/Vacancies"]
    rules = (
        Rule(SgmlLinkExtractor(allow=[r'\/V-\d{7}\/[\w\S]+']), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=[r'\?page\=\d+\&sortCriteria\=1']), follow=True),
    )

    def parse_item(self, response):
        item = MyItem()
        item['url'] = response.url
        yield item
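
With items yielded rather than written by hand, Scrapy's feed exports can serialize them to a file for you. A minimal usage sketch, assuming a Scrapy release from the same era as SgmlLinkExtractor (the output filename is illustrative):

scrapy crawl xyz -o urllist.csv -t csv

Here -o names the output file and -t selects the export format; newer Scrapy releases infer the format from the file extension instead. The same effect can be configured in settings.py via the FEED_URI and FEED_FORMAT settings.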

Regarding python - Scrapy returning a different number of urls on each run, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/22912259/
