python - Scrapy spider does not yield feed output after setting the 'start_urls' variable

The following spider with hard-coded start_urls works as expected:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from funda.items import FundaItem

class PropertyLinksSimpleSpider(CrawlSpider):

    name = "property_links_simple"
    allowed_domains = ["funda.nl"]

    # def __init__(self, place='amsterdam', page='1'):
    #     self.start_urls = ["http://www.funda.nl/koop/%s/p%s/" % (place, page)]
    #     self.le1 = LinkExtractor(allow=r'%s+huis-\d{8}' % self.start_urls[0])

    start_urls = ["http://www.funda.nl/koop/amsterdam/"]
    le1 = LinkExtractor(allow=r'%s+huis-\d{8}' % start_urls[0])
    # rules = (Rule(le1, callback='parse_item'), )

    def parse(self, response):
        links = self.le1.extract_links(response)
        for link in links:
            if link.url.count('/') == 6 and link.url.endswith('/'):
                item = FundaItem()
                item['url'] = link.url
                yield item

When I run it with feed output using the command scrapy crawl property_links_simple -o property_links.json, the resulting file contains the expected links:

[
{"url": "http://www.funda.nl/koop/amsterdam/huis-49708477-paul-schuitemahof-27/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49826458-buiksloterdijk-270/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49818887-markiespad-19/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49801910-claus-van-amsbergstraat-86/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49801593-jf-berghoefplantsoen-2/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49800159-breezandpad-8/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49805292-nieuwendammerdijk-21/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49890140-talbotstraat-9/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49879212-henri-berssenbruggehof-15/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49728947-emmy-andriessestraat-374/"},
{"url": "http://www.funda.nl/koop/amsterdam/huis-49713458-jan-vrijmanstraat-29/"}
]
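The condition link.url.count('/') == 6 and link.url.endswith('/') in parse keeps only top-level property pages and drops everything else. A quick sanity check of what it accepts, using one URL from the output above plus a hypothetical sub-page of the same listing:

```python
urls = [
    # A property detail page: 6 slashes and a trailing '/', so it is kept.
    "http://www.funda.nl/koop/amsterdam/huis-49708477-paul-schuitemahof-27/",
    # A hypothetical sub-page of that listing: 7 slashes, so it is dropped.
    "http://www.funda.nl/koop/amsterdam/huis-49708477-paul-schuitemahof-27/kenmerken/",
    # Same page without the trailing slash: dropped by endswith('/').
    "http://www.funda.nl/koop/amsterdam/huis-49708477-paul-schuitemahof-27",
]

kept = [u for u in urls if u.count('/') == 6 and u.endswith('/')]
print(kept)  # only the first URL survives the filter
```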

However, I want to be able to pass different start_urls to the spider, e.g. http://www.funda.nl/koop/rotterdam/p2/. To that end I tried adapting it as follows:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from funda.items import FundaItem

class PropertyLinksSimpleSpider(CrawlSpider):

    name = "property_links_simple"
    allowed_domains = ["funda.nl"]

    def __init__(self, place='amsterdam', page='1'):
        self.start_urls = ["http://www.funda.nl/koop/%s/p%s/" % (place, page)]
        self.le1 = LinkExtractor(allow=r'%s+huis-\d{8}' % self.start_urls[0])

    # start_urls = ["http://www.funda.nl/koop/amsterdam/"]
    # le1 = LinkExtractor(allow=r'%s+huis-\d{8}' % start_urls[0])
    # rules = (Rule(le1, callback='parse_item'), )

    def parse(self, response):
        links = self.le1.extract_links(response)
        for link in links:
            if link.url.count('/') == 6 and link.url.endswith('/'):
                item = FundaItem()
                item['url'] = link.url
                yield item

However, if I run this with the command scrapy crawl property_links_simple -a place=amsterdam -a page=1 -o property_links2.json, I get an empty .json file:

[
[

Why does the spider no longer yield any output?

Best Answer

It turned out to be a simple human error: in the second example, start_urls[0] is no longer the same string, so the LinkExtractor pattern built from it no longer matches the property URLs. I added a self.base_url to make the pattern prefix the same again:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from funda.items import FundaItem

class PropertyLinksSimpleSpider(CrawlSpider):

    name = "property_links_simple"
    allowed_domains = ["funda.nl"]

    def __init__(self, place='amsterdam', page='1'):
        self.start_urls = ["http://www.funda.nl/koop/%s/p%s/" % (place, page)]
        self.base_url = "http://www.funda.nl/koop/%s/" % place
        self.le1 = LinkExtractor(allow=r'%s+huis-\d{8}' % self.base_url)

    def parse(self, response):
        links = self.le1.extract_links(response)
        for link in links:
            if link.url.count('/') == 6 and link.url.endswith('/'):
                item = FundaItem()
                item['url'] = link.url
                yield item

This makes the spider generate the desired .json file.
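The mismatch can be reproduced with plain re, without running Scrapy at all: once the page suffix (e.g. /p1/) is baked into the allow pattern, the pattern can never match a property URL, because property URLs continue with huis-… directly after the place segment. A minimal sketch, using a URL from the output above:

```python
import re

# With place='amsterdam' and page='1', the parametrised spider builds:
start_url = "http://www.funda.nl/koop/amsterdam/p1/"
property_url = "http://www.funda.nl/koop/amsterdam/huis-49708477-paul-schuitemahof-27/"

# Broken: the pattern embeds the '/p1/' page suffix, which property URLs lack.
broken = re.compile(r'%s+huis-\d{8}' % start_url)
print(broken.search(property_url))  # None

# Fixed: building the prefix without the page suffix restores the match.
base_url = "http://www.funda.nl/koop/amsterdam/"
fixed = re.compile(r'%s+huis-\d{8}' % base_url)
print(fixed.search(property_url))   # a match object
```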

Regarding "python - Scrapy spider does not yield feed output after setting the 'start_urls' variable", see the similar question on Stack Overflow: https://stackoverflow.com/questions/38423844/
