
python - Scrapy scraper won't crawl past the first page

Reposted · Author: 太空宇宙 · Updated: 2023-11-03 15:12:53

I am following the scrapy tutorial here. I believe my code is the same as the tutorial's, but my scraper only crawls the first page, then logs the following message about my first Request to another page, and finishes. Do I perhaps have the second yield statement in the wrong place?

DEBUG: Filtered offsite request to 'newyork.craigslist.org': <GET https://newyork.craigslist.org/search/egr?s=120>

2017-05-20 18:21:31 [scrapy.core.engine] INFO: Closing spider (finished)

Here is my code:

import scrapy
from scrapy import Request


class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["https://newyork.craigslist.org/search/egr"]
    start_urls = ['https://newyork.craigslist.org/search/egr/']

    def parse(self, response):
        jobs = response.xpath('//p[@class="result-info"]')

        for job in jobs:
            title = job.xpath('a/text()').extract_first()
            address = job.xpath('span[@class="result-meta"]/span[@class="result-hood"]/text()').extract_first("")[2:-1]
            relative_url = job.xpath('a/@href').extract_first("")
            absolute_url = response.urljoin(relative_url)

            yield {'URL': absolute_url, 'Title': title, 'Address': address}

        # scrape all pages
        next_page_relative_url = response.xpath('//a[@class="button next"]/@href').extract_first()
        next_page_absolute_url = response.urljoin(next_page_relative_url)

        yield Request(next_page_absolute_url, callback=self.parse)

Best Answer

OK, I figured it out. I had to change this line:

allowed_domains = ["https://newyork.craigslist.org/search/egr"]

to this:

allowed_domains = ["newyork.craigslist.org"]

Now it works.
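The reason the fix works: Scrapy's offsite middleware compares each request's hostname against the entries in allowed_domains, so a full URL there never matches anything and every follow-up request gets filtered. A minimal sketch of that matching logic (simplified; is_offsite is a hypothetical helper, not Scrapy's actual API):

```python
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    """Return True if the URL's hostname is neither an allowed
    domain nor a subdomain of one (simplified offsite check)."""
    host = urlparse(url).hostname or ""
    return not any(host == d or host.endswith("." + d) for d in allowed_domains)

# A full URL in allowed_domains can never equal a hostname,
# so the pagination request is filtered as offsite:
print(is_offsite("https://newyork.craigslist.org/search/egr?s=120",
                 ["https://newyork.craigslist.org/search/egr"]))   # True (filtered)

# With a bare domain name, the same request is allowed through:
print(is_offsite("https://newyork.craigslist.org/search/egr?s=120",
                 ["newyork.craigslist.org"]))                       # False (allowed)
```

This is why allowed_domains should list domain names only, never URLs with a scheme or path.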

Regarding "python - Scrapy scraper won't crawl past the first page", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44088922/
