
python - Scrapy: does not collect data from all pages

Reposted. Author: 行者123  Updated: 2023-12-01 02:50:27

Please help me understand what the error is. The spider visits the pages .../?start=0, /?start=25, /?start=50, but only collects data from the last page (start=50). My code:

from scrapy import FormRequest
from scrapy import Request
import scrapy
from scrapy.spiders import CrawlSpider

from ..items import GetDomainsItem


def pages_range(start, step):
    stop = 50
    r = start
    while r <= stop:
        yield r
        r += step

class GetUrlDelDomSpider(CrawlSpider):
    name = 'get_domains'

    allowed_domains = ["member.expireddomains.net"]

    paginate = pages_range(0, 25)

    start_urls = list(map(lambda i: 'https://member.expireddomains.net/domains/expiredcom201612/?start=%s' % i, paginate))

    def start_requests(self):
        for start_url in self.start_urls:
            yield Request(start_url, dont_filter=True)

    def parse(self, response):
        yield FormRequest.from_response(response,
                                        formnumber=1,
                                        formdata={'login': 'xxx', 'password': '*****', 'rememberme': '1'},
                                        callback=self.parse_login,
                                        dont_filter=True)

    def parse_login(self, response):
        if b'The supplied login information are unknown.' not in response.body:
            item = GetDomainsItem()
            for each in response.selector.css('table.base1 tbody '):
                item['domain'] = each.xpath('tr/td[@class="field_domain"]/a/text()').extract()
                return item

Thanks for your help.

Best answer

The return item inside the loop in the parse_login method exits the function on the first iteration, so only one item is ever produced:

    for each in response.selector.css('table.base1 tbody '):
        item['domain'] = each.xpath('tr/td[@class="field_domain"]/a/text()').extract()
        return item
        ^

So you should create an item on each iteration of the loop and yield it:

    for each in response.selector.css('table.base1 tbody '):
        item = GetDomainsItem()
        item['domain'] = each.xpath('tr/td[@class="field_domain"]/a/text()').extract()
        yield item
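The difference between return and yield in a loop can be seen with a minimal sketch outside of Scrapy; the rows list here is a hypothetical stand-in for the selector results:

```python
def with_return(rows):
    # return exits the function on the first iteration,
    # so only the first row is ever handed back
    for row in rows:
        return row

def with_yield(rows):
    # yield turns the function into a generator that
    # produces one result per iteration
    for row in rows:
        yield row

rows = ["a.com", "b.com", "c.com"]
print(with_return(rows))       # -> a.com
print(list(with_yield(rows)))  # -> ['a.com', 'b.com', 'c.com']
```

Scrapy iterates over whatever a callback returns, which is why yielding one item per table row lets every row reach the pipeline.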

On the topic "python - Scrapy: does not collect data from all pages", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44865162/
