
python - Stacking multiple Rules in Scrapy for deep crawling

Reposted. Author: 太空宇宙. Updated: 2023-11-04 05:34:28

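I would appreciate help understanding how stacked rules work for deep crawling. Does stacking multiple rules cause them to be processed one at a time? The aim is to grab links from the main page, return the items and responses, and pass them to the next rule, which passes its links on to another function, and so on.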

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import Rule

    rules = (
        Rule(LinkExtractor(restrict_xpaths=('--some xpath--')), callback='function_a', follow=True),
        Rule(LinkExtractor(restrict_xpaths=('--some xpath--')), callback='function_b', process_links='function_c', follow=True),
    )


    def function_a(self, response):  # grab sports, games, link3 from the main page
        i = response.xpath('---some xpath---')
        for xpth in i:
            item = ItemA()
            item['name'] = xpth.xpath('---some xpath--').get()
            yield item
            # yield each item and URL from function_a back to the second rule
            yield scrapy.Request(url)  # url is not defined in the original snippet

    def function_b(self, response):  # receives responses from the second rule
        # grab links, same as function_a
        ...

    def function_c(self, response):  # does process_links in the rule send the links it received to function_c?
        ...

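Can this be done recursively to crawl a single site in depth? I am not sure my understanding of rules is correct. Do I have to add X rules to handle pages X levels deep, or is there a better way to handle recursive deep crawling?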

Thanks

Best answer

The following paragraph from the docs implies that every rule is applied to every page (italics mine).

rules

Which is a list of one (or more) Rule objects. Each Rule defines a certain behaviour for crawling the site. Rules objects are described below. If multiple rules match the same link, the first one will be used, according to the order they’re defined in this attribute.
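
The ordering behaviour described in that paragraph can be illustrated with a small pure-Python model. This is only a simplified sketch, not Scrapy's actual implementation: the rule names, the regular expressions, and the `assign_links` helper are all hypothetical and exist only to show "every rule sees every link, the first matching rule claims it".

```python
import re

# Hypothetical "rules": (name, pattern) pairs. Order matters, as in `rules`.
RULES = [
    ("category", re.compile(r"/category/")),
    ("item", re.compile(r"/item/")),
    ("catch_all", re.compile(r".*")),
]


def assign_links(links, rules=RULES):
    """Map each link to the first rule whose pattern matches it."""
    seen = set()
    assigned = {}
    for name, pattern in rules:  # every rule looks at every link on the page
        for link in links:
            if link not in seen and pattern.search(link):
                seen.add(link)  # later rules skip links already claimed
                assigned[link] = name
    return assigned


if __name__ == "__main__":
    print(assign_links(["/category/sports", "/item/42", "/category/item/7"]))
```

In this model, `/category/item/7` matches both the `category` and the `item` pattern, but it is assigned to `category` because that rule is defined first.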

In your case, target each rule at the appropriate pages, then order the rules by depth.

Regarding python - Stacking multiple Rules in Scrapy for deep crawling, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/36052191/
