
python - Passing an input file containing a list of domains to crawl to Scrapy

Reposted. Author: 太空宇宙. Updated: 2023-11-03 15:16:14

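I came across the question "Pass Scrapy Spider a list of URLs to crawl via .txt file", but that approach only changes the list of start URLs. I want to crawl the pages of each domain (read from a file) and write the results to a separate file named after the domain. I have already scraped data from one website, but there I hard-coded the start URL and allowed_domains in the spider itself. How can I set them from an input file instead?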

Update 1:

This is the code I tried:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.item import Item, Field

    class AppleItem(Item):
        reference_link = Field()
        rss_link = Field()

    class AppleSpider(CrawlSpider):

        name = 'apple'
        allowed_domains = []
        start_urls = []

        def __init__(self):
            for line in open('./domains.txt', 'r').readlines():
                self.allowed_domains.append(line)
                self.start_urls.append('http://%s' % line)

        rules = [Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')]

        def parse_item(self, response):
            sel = HtmlXPathSelector(response)
            rsslinks = sel.select('//a[contains(@href, "pdf")]/@href').extract()
            items = []
            for rss in rsslinks:
                item = AppleItem()
                item['reference_link'] = response.url
                item['rss_link'] = rsslinks
                items.append(item)
            filename = response.url.split("/")[-2]
            open(filename+'.csv', 'wb').write(items)

Running it fails with: AttributeError: 'AppleSpider' object has no attribute '_rules'

Best answer

You can use the spider class's __init__ method to read the file and fill in start_urls and allowed_domains.
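
One caveat when overriding __init__: the base class's initializer still has to run. The AttributeError in the question is what you would expect if CrawlSpider's own initializer is skipped, since that is where the rules get compiled into the private _rules attribute. The sketch below reproduces the pattern in plain Python. BaseCrawler, BrokenCrawler, FixedCrawler and has_rules are hypothetical names made up for illustration; this is not Scrapy's actual code.

```python
class BaseCrawler:
    """Hypothetical stand-in for a framework base class (not Scrapy's real code)."""

    rules = ()

    def __init__(self):
        # The base initializer derives private state from the class attribute.
        self._rules = list(self.rules)


class BrokenCrawler(BaseCrawler):
    rules = ("follow-all",)

    def __init__(self):
        # Overrides __init__ without calling the parent, so _rules is never set.
        self.start_urls = ["http://example1.com"]


class FixedCrawler(BaseCrawler):
    rules = ("follow-all",)

    def __init__(self):
        # Calling the parent initializer first creates the missing state.
        super().__init__()
        self.start_urls = ["http://example1.com"]


def has_rules(crawler):
    """Return True if the instance ended up with the _rules attribute."""
    return hasattr(crawler, "_rules")
```

In a real spider the same idea would mean calling the parent __init__ (passing through any arguments) before or after filling in the lists.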

Suppose we have a file domains.txt with the following contents:

example1.com
example2.com
...
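
Lines read from a file like this keep their trailing newline, so appending them unmodified would put values such as "example1.com\n" into allowed_domains and the start URLs. A minimal sketch of a loader that strips whitespace and skips blank lines is shown here; load_domains is a hypothetical helper name, and the http:// scheme is an assumption carried over from the question.

```python
def load_domains(path):
    """Return (allowed_domains, start_urls) from a one-domain-per-line file."""
    allowed_domains = []
    start_urls = []
    with open(path, "r") as handle:
        for line in handle:
            domain = line.strip()  # remove the trailing newline and stray spaces
            if not domain:
                continue  # ignore blank lines
            allowed_domains.append(domain)
            start_urls.append("http://%s" % domain)
    return allowed_domains, start_urls
```

A spider could call this from its __init__ and assign the two lists to self.allowed_domains and self.start_urls.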

Example:

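    # BaseSpider matches the older Scrapy used in the question;
    # newer releases expose this class as scrapy.Spider.
    from scrapy.spider import BaseSpider

    class MySpider(BaseSpider):
        name = "myspider"

        def __init__(self, *args, **kwargs):
            # Run the base initializer so the spider is set up properly.
            super(MySpider, self).__init__(*args, **kwargs)
            self.allowed_domains = []
            self.start_urls = []
            with open('./domains.txt', 'r') as f:
                for line in f:
                    domain = line.strip()  # drop the trailing newline
                    if domain:
                        self.allowed_domains.append(domain)
                        self.start_urls.append('http://%s' % domain)

        def parse(self, response):
            # Parse the page here to extract your data,
            # then write that data to a single file.
            # Filename logic from the Scrapy tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html
            filename = response.url.split("/")[-2]
            with open(filename, 'wb') as f:
                f.write(your_data)  # your_data: placeholder for the bytes you extracted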
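
Note that response.url.split("/")[-2] only yields the domain for URLs like http://example1.com/; for a deeper page such as http://example1.com/a/b/c.html it returns "b". Opening the file in 'wb' mode also overwrites it on every response. If the goal is one output file per domain, a rough sketch using only the standard library could look like this. domain_filename and append_rows are hypothetical helpers, and the .csv suffix and row layout are assumptions.

```python
import csv
import os
from urllib.parse import urlparse


def domain_filename(url, suffix=".csv"):
    """Build an output filename from the host part of a URL."""
    host = urlparse(url).hostname  # lower-cased host without port or credentials
    if host is None:
        raise ValueError("URL has no host: %r" % url)
    return host + suffix


def append_rows(url, rows, out_dir="."):
    """Append rows (sequences of strings) to the CSV file for the URL's domain."""
    path = os.path.join(out_dir, domain_filename(url))
    # Mode "a" keeps rows written for earlier pages of the same domain.
    with open(path, "a", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return path
```

Inside a parse callback this might be used as append_rows(response.url, [[response.url, link] for link in links]), where links is whatever list of strings you extracted.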

Regarding "python - Passing an input file containing a list of domains to crawl to Scrapy", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/20702732/
