
python - Scrapy: list and pagination iteration fails


My goal is to extract all 25 rows (6 items per row) from each page, and then iterate over each of the 40 pages.

Currently, my spider extracts only the first row from pages 1-3 (see the CSV output below).

I assumed the list_iterator() function would iterate over each row; however, there seems to be an error in my rules and/or list_iterator() function that prevents every row on each page from being scraped.

Any help or advice would be greatly appreciated!

propub_spider.py:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from propub.items import PropubItem
from scrapy.http import Request

class propubSpider(CrawlSpider):
    name = 'prop$'
    allowed_domains = ['https://projects.propublica.org']
    max_pages = 40
    start_urls = [
        'https://projects.propublica.org/docdollars/search?state%5Bid%5D=33',
        'https://projects.propublica.org/docdollars/search?page=2&state%5Bid%5D=33',
        'https://projects.propublica.org/docdollars/search?page=3&state%5Bid%5D=33']

    rules = (Rule(SgmlLinkExtractor(allow=('\\search?page=\\d')), 'parse_start_url', follow=True),)

    def list_iterator(self):
        for i in range(self.max_pages):
            yield Request('https://projects.propublica.org/docdollars/search?page=d' % i, callback=self.parse)

    def parse(self, response):
        for sel in response.xpath('//*[@id="payments_list"]/tbody'):
            item = PropubItem()
            item['payee'] = sel.xpath('tr[1]/td[1]/a[2]/text()').extract()
            item['link'] = sel.xpath('tr[1]/td[1]/a[1]/@href').extract()
            item['city'] = sel.xpath('tr[1]/td[2]/text()').extract()
            item['state'] = sel.xpath('tr[1]/td[3]/text()').extract()
            item['company'] = sel.xpath('tr[1]/td[4]').extract()
            item['amount'] = sel.xpath('tr[1]/td[7]/span/text()').extract()
            yield item

pipelines.py:

import csv

class PropubPipeline(object):

    def __init__(self):
        self.myCSV = csv.writer(open('C:\Users\Desktop\propub.csv', 'wb'))
        self.myCSV.writerow(['payee', 'link', 'city', 'state', 'company', 'amount'])

    def process_item(self, item, spider):
        self.myCSV.writerow([item['payee'][0].encode('utf-8'),
                             item['link'][0].encode('utf-8'),
                             item['city'][0].encode('utf-8'),
                             item['state'][0].encode('utf-8'),
                             item['company'][0].encode('utf-8'),
                             item['amount'][0].encode('utf-8')])
        return item
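
As a side note: the pipeline above is Python 2 code. Under Python 3 the path literal would raise a SyntaxError (the \U starts a Unicode escape), csv.writer expects a text-mode file opened with newline='', and the .encode('utf-8') calls become unnecessary. A minimal Python 3 sketch of the same pipeline, assuming the same six item fields:

import csv

class PropubPipeline(object):

    FIELDS = ['payee', 'link', 'city', 'state', 'company', 'amount']

    def __init__(self):
        # raw string avoids the invalid \U escape; newline='' prevents
        # blank lines between rows on Windows
        self.file = open(r'C:\Users\Desktop\propub.csv', 'w', newline='')
        self.myCSV = csv.writer(self.file)
        self.myCSV.writerow(self.FIELDS)

    def process_item(self, item, spider):
        # each field holds a list from .extract(); fall back to '' when empty
        self.myCSV.writerow([(item.get(f) or [''])[0] for f in self.FIELDS])
        return item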

items.py:

import scrapy
from scrapy.item import Item, Field

class PropubItem(scrapy.Item):
    payee = scrapy.Field()
    link = scrapy.Field()
    city = scrapy.Field()
    state = scrapy.Field()
    company = scrapy.Field()
    amount = scrapy.Field()

CSV output:

(screenshot omitted: the CSV contains only one scraped row per page)

Best answer

There are multiple things to fix:

  • use the start_requests() method instead of list_iterator()
  • a % is missing here (see the quick check after this list):

    yield Request('https://projects.propublica.org/docdollars/search?page=%d' % i, callback=self.parse)
    # HERE^
  • you don't need CrawlSpider, since you are supplying the pagination links via start_requests() - use a regular scrapy.Spider
  • the XPath expressions would be more reliable if they matched the cells by class attribute
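
To illustrate the missing %, here is a quick interactive check (not part of the spider) of what Python's string interpolation does with and without the conversion specifier:

>>> 'https://projects.propublica.org/docdollars/search?page=%d' % 2
'https://projects.propublica.org/docdollars/search?page=2'
>>> 'https://projects.propublica.org/docdollars/search?page=d' % 2  # no %d
Traceback (most recent call last):
  ...
TypeError: not all arguments converted during string formatting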

Fixed version:

import scrapy

from propub.items import PropubItem


class propubSpider(scrapy.Spider):
    name = 'prop$'
    allowed_domains = ['projects.propublica.org']
    max_pages = 40

    def start_requests(self):
        for i in range(self.max_pages):
            yield scrapy.Request('https://projects.propublica.org/docdollars/search?page=%d' % i, callback=self.parse)

    def parse(self, response):
        for sel in response.xpath('//*[@id="payments_list"]//tr[@data-payment-id]'):
            item = PropubItem()
            item['payee'] = sel.xpath('td[@class="name_and_payee"]/a[last()]/text()').extract()
            item['link'] = sel.xpath('td[@class="name_and_payee"]/a[1]/@href').extract()
            item['city'] = sel.xpath('td[@class="city"]/text()').extract()
            item['state'] = sel.xpath('td[@class="state"]/text()').extract()
            item['company'] = sel.xpath('td[@class="company"]/text()').extract()
            item['amount'] = sel.xpath('td[@class="amount"]/text()').extract()
            yield item
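
For completeness, one way to run this spider and write the CSV without the custom pipeline is Scrapy's built-in feed export. A minimal runner sketch; the propub.spiders.propub_spider module path is an assumption about the project layout, and the FEED_URI/FEED_FORMAT setting names match Scrapy releases of that era (newer versions use a FEEDS dict instead):

# run_propub.py - minimal runner sketch (module path below is assumed)
from scrapy.crawler import CrawlerProcess

from propub.spiders.propub_spider import propubSpider

process = CrawlerProcess({
    'FEED_URI': 'propub.csv',   # items go to this file
    'FEED_FORMAT': 'csv',       # built-in CSV exporter
})
process.crawl(propubSpider)
process.start()  # blocks until the crawl finishes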

Regarding python - Scrapy: list and pagination iteration fails, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/29267395/
