
python - Automating crawl depth

Reposted · Author: 行者123 · Updated: 2023-12-01 05:16:56

My website has 3 levels:

  • Country
    • City
      • Street

I want to scrape data from all the street pages, and I built a spider for that. Now, how do I get from the countries down to the streets without adding a million URLs to the start_urls field?

Do I have to build one spider for countries, one for cities, and one for streets? Isn't the whole idea of crawling that the crawler follows all links down to a certain depth?

Adding DEPTH_LIMIT = 3 to the settings.py file doesn't change anything.
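For reference, this is the settings.py fragment in question. Note that DEPTH_LIMIT only caps how deep links that are *already being followed* may go; it does not by itself make a spider follow any links, which is why setting it alone changes nothing:

```python
# settings.py -- sketch; DEPTH_LIMIT limits the depth of followed links,
# it does not add link-following behaviour to a spider on its own.
DEPTH_LIMIT = 3
```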

I start the crawl with: scrapy crawl spidername


Edit

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from winkel.items import WinkelItem

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["mydomain.nl"]
    start_urls = [
        "http://www.mydomain.nl/Zuid-Holland"
    ]

    # allow takes regular expressions; a leading '*' is not valid regex syntax
    rules = (Rule(SgmlLinkExtractor(allow=('Zuid-Holland', )), callback='parse_winkel', follow=True),)

    def parse_winkel(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@id="itemsList"]/li')
        items = []

        for site in sites:
            item = WinkelItem()
            item['adres'] = site.xpath('.//a/text()').extract(), site.xpath('text()').extract(), sel.xpath('//h1/text()').re(r'winkel\s*(.*)')
            items.append(item)
        return items

Best Answer

You need to use CrawlSpider and define Rules with Link Extractors for the countries, cities, and streets.

For example:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # follow=True so that links found on country and city pages
        # are followed further down, all the way to the street level
        Rule(SgmlLinkExtractor(allow=('country', )), callback='parse_country', follow=True),
        Rule(SgmlLinkExtractor(allow=('city', )), callback='parse_city', follow=True),
        Rule(SgmlLinkExtractor(allow=('street', )), callback='parse_street'),
    )

    def parse_country(self, response):
        self.log('Hi, this is a country page! %s' % response.url)

    def parse_city(self, response):
        self.log('Hi, this is a city page! %s' % response.url)

    def parse_street(self, response):
        self.log('Hi, this is a street page! %s' % response.url)
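The allow values above are regular expressions matched against each extracted URL, and the first Rule whose pattern matches wins. A minimal stdlib sketch of that matching logic, with hypothetical URLs:

```python
import re

# Patterns mirroring the three Rule() allow settings above.
# Order matters: the first matching pattern decides the callback.
RULES = {
    "parse_country": re.compile(r"country"),
    "parse_city": re.compile(r"city"),
    "parse_street": re.compile(r"street"),
}

def dispatch(url):
    """Return the name of the first callback whose pattern matches the URL."""
    for callback, pattern in RULES.items():
        if pattern.search(url):
            return callback
    return None

print(dispatch("http://www.example.com/city/amsterdam"))  # parse_city
print(dispatch("http://www.example.com/street/dam"))      # parse_street
print(dispatch("http://www.example.com/about"))           # None
```

Because the first match wins, make the patterns specific enough that a street URL is not accidentally claimed by the country rule (e.g. if street URLs also contain the country segment).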

Regarding "python - Automating crawl depth", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/22993423/
