
python - Dynamic start URL list when crawling with Scrapy


import scrapy

# somewebsiteItem is the Item class defined in the project's items.py;
# import it from there (the exact import path depends on the project layout)


class SomewebsiteProductSpider(scrapy.Spider):
    name = "somewebsite"
    allowed_domains = ["somewebsite.com"]

    start_urls = [

    ]

    def parse(self, response):
        items = somewebsiteItem()

        title = response.xpath('//h1[@id="title"]/span/text()').extract()
        sale_price = response.xpath('//span[contains(@id,"ourprice") or contains(@id,"saleprice")]/text()').extract()
        category = response.xpath('//a[@class="a-link-normal a-color-tertiary"]/text()').extract()
        availability = response.xpath('//div[@id="availability"]//text()').extract()
        items['product_name'] = ''.join(title).strip()
        items['product_sale_price'] = ''.join(sale_price).strip()
        items['product_category'] = ','.join(map(lambda x: x.strip(), category)).strip()
        items['product_availability'] = ''.join(availability).strip()
        fo = open("C:\\Users\\user1\\PycharmProjects\\test.txt", "w")
        fo.write("%s \n%s \n%s" % (items['product_name'], items['product_sale_price'], self.start_urls))
        fo.close()
        print(items)
        yield items

test.py

from scrapy.crawler import CrawlerProcess
# plus an import of SomewebsiteProductSpider from the module where the spider is defined

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(SomewebsiteProductSpider)
process.start()

How can I pass a dynamic start_urls list to the SomewebsiteProductSpider object in test.py before starting the crawl? Any help would be appreciated. Thanks.

Best Answer

process.crawl accepts optional arguments that are passed on to the spider's constructor, so you can populate start_urls in the spider's __init__, or use a custom start_requests method. For example:

test.py

...
process.crawl(SomewebsiteProductSpider, url_list=[...])
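
For illustration, an expanded test.py could look like the sketch below; the URLs are placeholders, and the import assumes the spider class lives in somespider.py as shown further down.

from scrapy.crawler import CrawlerProcess

from somespider import SomewebsiteProductSpider  # assumes the spider is defined in somespider.py

# build the start URL list dynamically, e.g. from a file, a database or command-line input
url_list = [
    "https://somewebsite.com/product/1",  # placeholder URL
    "https://somewebsite.com/product/2",  # placeholder URL
]

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

# keyword arguments passed to process.crawl are forwarded to the spider's constructor
process.crawl(SomewebsiteProductSpider, url_list=url_list)
process.start()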

somespider.py

class SomewebsiteProductSpider(scrapy.Spider):
    ...
    def __init__(self, *args, **kwargs):
        self.start_urls = kwargs.pop('url_list', [])
        super(SomewebsiteProductSpider, self).__init__(*args, **kwargs)
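
If you prefer the custom start_requests route mentioned above, a minimal sketch (still assuming the url_list keyword argument passed via process.crawl) could look like this:

import scrapy


class SomewebsiteProductSpider(scrapy.Spider):
    name = "somewebsite"
    allowed_domains = ["somewebsite.com"]

    def __init__(self, *args, **kwargs):
        # url_list is the keyword argument supplied via process.crawl(...)
        self.url_list = kwargs.pop('url_list', [])
        super(SomewebsiteProductSpider, self).__init__(*args, **kwargs)

    def start_requests(self):
        # yield one request per dynamically supplied URL instead of relying
        # on the start_urls class attribute
        for url in self.url_list:
            yield scrapy.Request(url, callback=self.parse)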

Regarding python - dynamic start URL list when crawling with Scrapy, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42137689/
