python-2.7 - SgmlLinkExtractor stops at page 3

Reposted. Author: 行者123. Updated: 2023-12-02 21:27:48

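This continues my question about a problem with SgmlLinkExtractor.

I am trying to follow the pages from here. It seems to work and extracts all the required items, but the crawler stops after parsing the third page, without any error message.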

# Import paths as used by the Scrapy releases that shipped SgmlLinkExtractor.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
# AltaItem is the project's item class; its import is not shown in the question.


class AltaSpider(CrawlSpider):
    name = "altaCra"
    allowed_domains = ["alta.ge"]
    start_urls = [
        "http://alta.ge/index.php?dispatch=categories.view&category_id=297"
    ]

    # Follow only links whose URL matches this pattern.
    rules = (
        Rule(SgmlLinkExtractor(allow=(r"index.php\?dispatch=categories.view&category_id=297&page=\d*",)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        sel = Selector(response)
        # One table row per product.
        titles = sel.xpath('//table[@class="table products cl"]//tr[@valign="middle"]')
        items = []
        for t in titles:
            item = AltaItem()
            item["brand"] = t.xpath('td[@class="compact"]/div[@class="cl-prod-name"]/a/text()').re(r'^([\w\-]+)')
            item["model"] = t.xpath('td[@class="compact"]/div[@class="cl-prod-name"]/a/text()').re(r'\s+(.*)$')
            item["price"] = t.xpath('td[@class="cl-price-cont"]//span[4]/text()').extract()
            items.append(item)

        return items

Best answer

On the first page, the links to the next pages look like this:

http://alta.ge/index.php?dispatch=categories.view&category_id=297&page=2

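On the following pages, however, the query parameters appear in a different order, so the allow pattern no longer matches the links: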

http://alta.ge/index.php?category_id=297&dispatch=categories.view&page=8
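
The effect of the parameter order can be checked outside Scrapy. The sketch below applies the allow pattern from the question to the two links above using Python's `re` module. It assumes the link extractor tests the pattern with a regex search against the absolute URL; the helper name `is_allowed` is made up for this illustration.

```python
import re

# Pattern copied from the `allow` argument of the rule in the question.
ALLOW = re.compile(r"index.php\?dispatch=categories.view&category_id=297&page=\d*")

# The two link shapes quoted in the answer.
first_page_link = "http://alta.ge/index.php?dispatch=categories.view&category_id=297&page=2"
later_page_link = "http://alta.ge/index.php?category_id=297&dispatch=categories.view&page=8"


def is_allowed(url):
    """Hypothetical helper: True if the allow pattern is found anywhere in the URL."""
    return ALLOW.search(url) is not None


print(is_allowed(first_page_link))  # True: dispatch comes before category_id
print(is_allowed(later_page_link))  # False: category_id comes first
```

Because the pattern hard-codes `dispatch=...` directly after `index.php?`, any link that lists `category_id` first is skipped, which would explain why the crawl ends with no error.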

So I suggest using a different rule, one that targets links carrying the name="pagination" attribute, which all the next-page links share:

rules = (
    Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@name="pagination"]',)),
         callback="parse_items", follow=True),
)
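
To see why selecting on the attribute avoids the ordering problem, here is a minimal sketch that uses the standard library's `xml.etree.ElementTree` on a made-up fragment. The markup is hypothetical (the real alta.ge page is not shown in the source); only the name="pagination" attribute and the two URL shapes come from the answer. This is not how Scrapy extracts links internally; it only mimics the attribute-based selection.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified pager markup for illustration only.
SAMPLE = """
<div>
  <a href="index.php?dispatch=products.view&amp;product_id=1">A product</a>
  <a name="pagination" href="index.php?dispatch=categories.view&amp;category_id=297&amp;page=2">2</a>
  <a name="pagination" href="index.php?category_id=297&amp;dispatch=categories.view&amp;page=8">8</a>
</div>
"""

root = ET.fromstring(SAMPLE.strip())

# Select anchors by attribute, independent of how their href is written.
hrefs = [a.get("href") for a in root.findall(".//a[@name='pagination']")]

for href in hrefs:
    print(href)
```

Both pagination links are selected regardless of parameter order, while the product link is left out because it lacks the attribute.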

Regarding "python-2.7 - SgmlLinkExtractor stops at page 3", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/23095466/
