python - Why doesn't Scrapy return all the results, and why doesn't the rules part work?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 17:40:44

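This script gives me only the first result, or the next item if I change .extract()[0] to .extract()[1]. Why doesn't it iterate over the whole XPath again?

The rules part doesn't work either. I know the problem is in response.xpath. How do I deal with it?

My other scripts work, but this one doesn't.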

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin


class CompItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    data = scrapy.Field()
    name = scrapy.Field()
    date = scrapy.Field()


class criticspider(CrawlSpider):
    name = "hand"
    allowed_domains = ["consumercomplaints.in"]
    start_urls = ["http://www.consumercomplaints.in/bysubcategory/mobile-handsets/page/1"]
    rules = (
        Rule(
            SgmlLinkExtractor(allow=('"/bysubcategory/mobile-handsets/page/1/+"',)),
            callback="parse_start_url",
            follow=True),
    )

    def parse(self, response):
        sites = response.xpath('//table[@width="100%"]')
        items = []

        for site in sites:
            item = CompItem()
            item['date'] = site.xpath('.//td[@class="small"]/text()').extract()[1]
            item['name'] = site.xpath('.//td[@class="small"]//a/text()').extract()[0]
            item['title'] = site.xpath('.//td[@class="complaint"]/h4/a/text()').extract()[0]

            item['link'] = site.xpath('.//td[@class="complaint"]/h4/a/@href').extract()[0]
            if item['link']:
                if 'http://' not in item['link']:
                    item['link'] = urljoin(response.url, item['link'])
                yield scrapy.Request(item['link'],
                                     meta={'item': item},
                                     callback=self.anchor_page)

            items.append(item)

    def anchor_page(self, response):
        old_item = response.request.meta['item']

        old_item['data'] = response.xpath('.//td[@class="compl-text"]/div/text()').extract()
        yield old_item

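Best answer

The problem is in how you define sites.

Currently it is just //table[@width="100%"], which matches the complete table, so the loop runs once and .extract()[0] returns only the first entry inside it. Instead, find all div elements that have an id attribute and sit directly inside a td tag: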

sites = response.xpath("//td/div[@id]")
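
The difference between the two selectors can be reproduced offline. The sketch below is only an illustration: the markup is a simplified, hypothetical stand-in for the listing page, and it uses the standard library's xml.etree.ElementTree instead of Scrapy's selectors, so the path syntax is ElementTree's limited XPath subset. The helper first_title is likewise made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the listing page (not the real markup).
PAGE = """
<html><body>
<table width="100%">
  <tr><td>
    <div id="c1"><h4><a href="/first.html">First complaint</a></h4></div>
    <div id="c2"><h4><a href="/second.html">Second complaint</a></h4></div>
    <div id="c3"><h4><a href="/third.html">Third complaint</a></h4></div>
  </td></tr>
</table>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# Selector from the question: matches the single wrapping table.
tables = root.findall(".//table[@width='100%']")
# Selector from the answer: matches one div per complaint.
rows = root.findall(".//td/div[@id]")


def first_title(node):
    """Mimic `.extract()[0]`: take only the first title found under node."""
    return node.findall(".//h4/a")[0].text


titles_per_table = [first_title(t) for t in tables]
titles_per_row = [first_title(r) for r in rows]

print(len(tables), titles_per_table)
print(len(rows), titles_per_row)
```

With one match, the loop body runs once and the [0] index picks the first title in the whole table. With one match per complaint, the same [0] index picks that complaint's own title.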

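As for the rules part, the approach I would take is to collect the search results in a callback other than parse. CrawlSpider uses parse internally to apply its rules, so overriding it keeps the rules from being followed. The complete code, with a few more improvements: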

from urlparse import urljoin  # Python 2 module; in Python 3 use urllib.parse

import scrapy
# In Scrapy 1.0 and later these live in scrapy.spiders and scrapy.linkextractors.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class CompItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    data = scrapy.Field()
    name = scrapy.Field()
    date = scrapy.Field()


class criticspider(CrawlSpider):
    name = "hand"
    allowed_domains = ["consumercomplaints.in"]
    start_urls = ["http://www.consumercomplaints.in/bysubcategory/mobile-handsets"]
    rules = (
        # Follow only the pagination links and parse each listing page.
        Rule(LinkExtractor(restrict_xpaths="//div[@class='pagelinks']"), follow=True, callback="parse_results"),
    )

    def parse_results(self, response):
        # One div per complaint, so the loop runs once per entry.
        sites = response.xpath("//td/div[@id]")
        for site in sites:
            item = CompItem()
            item['date'] = site.xpath('.//td[@class="small"]/text()').extract()[1]
            item['name'] = site.xpath('.//td[@class="small"]//a/text()').extract()[0]
            item['title'] = site.xpath('.//td[@class="complaint"]/h4/a/text()').extract()[0]

            item['link'] = site.xpath('.//td[@class="complaint"]/h4/a/@href').extract()[0]
            if item['link']:
                if 'http://' not in item['link']:
                    item['link'] = urljoin(response.url, item['link'])
                # Fetch the complaint page and finish the item there.
                yield scrapy.Request(item['link'],
                                     meta={'item': item},
                                     callback=self.anchor_page)

    def anchor_page(self, response):
        # The partially filled item travels with the request in meta.
        old_item = response.request.meta['item']

        old_item['data'] = response.xpath('.//td[@class="compl-text"]/div/text()').extract()
        yield old_item
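
Two URL-handling details in the spiders above can be checked with the standard library alone. The sketch below is a side note rather than part of the answer. It assumes Python 3 (urljoin comes from urllib.parse) and assumes the link extractor's allow patterns are regular expressions searched against absolute URLs. PAGE_PATTERN and the example paths are hypothetical and appear in neither post.

```python
import re
from urllib.parse import urljoin  # Python 3 location of urljoin

# Pattern copied from the question's Rule, including its literal double quotes.
QUESTION_PATTERN = '"/bysubcategory/mobile-handsets/page/1/+"'
# Hypothetical alternative pattern for numbered listing pages (an assumption).
PAGE_PATTERN = r"/bysubcategory/mobile-handsets/page/\d+"

page_2 = "http://www.consumercomplaints.in/bysubcategory/mobile-handsets/page/2"

# The quoted pattern can only match a URL that itself contains double quotes.
question_match = re.search(QUESTION_PATTERN, page_2)
page_match = re.search(PAGE_PATTERN, page_2)
print(question_match)
print(page_match is not None)

# urljoin resolves relative links against the page URL and leaves
# absolute links as they are. Both paths here are made up.
relative = urljoin(page_2, "/example-complaint.html")
absolute = urljoin(page_2, "http://www.consumercomplaints.in/other.html")
print(relative)
print(absolute)
```

Because the quoted pattern never matches, a rule built on it would extract no pagination links, which is one more reason to select the links by XPath as the answer does. The second half suggests the 'http://' check before urljoin is a precaution more than a requirement.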

Source: this question, "python - Why doesn't Scrapy return all the results, and why doesn't the rules part work?", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/30588989/
