
python - Recursive scraping with Scrapy (Python)


I have built a scraper that collects data from multiple pages. My problem is that I have a set of URLs (around 10), and I need to pass them in each time.

Here is my code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from datablogger_scraper.items import DatabloggerScraperItem


class DatabloggerSpider(CrawlSpider):
    # The name of the spider
    name = "datablogger"

    # The domains that are allowed (links to other domains are skipped)
    allowed_domains = ["cityofalabaster.com"]
    print(type(allowed_domains))

    # The URLs to start with
    start_urls = ["http://www.cityofalabaster.com/"]
    print(type(start_urls))

    # This spider has one rule: extract all (unique and canonicalized) links,
    # follow them and parse them using the parse_items method
    rules = [
        Rule(
            LinkExtractor(
                canonicalize=True,
                unique=True
            ),
            follow=True,
            callback="parse_items"
        )
    ]

    # Method which starts the requests by visiting all URLs specified in start_urls
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    # Method for parsing items
    def parse_items(self, response):
        # The list of items that are found on the particular page
        items = []
        # Only extract canonicalized and unique links (with respect to the current page)
        links = LinkExtractor(canonicalize=True, unique=True).extract_links(response)
        # Now go through all the found links
        for link in links:
            # Check whether the domain of the URL of the link is allowed,
            # i.e. whether it is in one of the allowed domains
            is_allowed = False
            for allowed_domain in self.allowed_domains:
                if allowed_domain in link.url:
                    is_allowed = True
            # If it is allowed, create a new item and add it to the list of found items
            if is_allowed:
                item = DatabloggerScraperItem()
                item['url_from'] = response.url
                item['url_to'] = link.url
                items.append(item)
        # Return all the found items
        return items
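
(The imported DatabloggerScraperItem is not shown in the question; given the two fields used in parse_items, datablogger_scraper/items.py presumably looks something like this minimal sketch:)

import scrapy

class DatabloggerScraperItem(scrapy.Item):
    # the page the link was found on, and the page it points to
    url_from = scrapy.Field()
    url_to = scrapy.Field()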

If you look at my code, you can see that the allowed domains and the start_urls links are passed in manually. Instead, I have a CSV file that contains the URLs to pass.

Input:

http://www.daphneal.com/
http://www.digitaldecatur.com/
http://www.demopolisal.com/
http://www.dothan.org/
http://www.cofairhope.com/
http://www.florenceal.org/
http://www.fortpayne.org/
http://www.cityofgadsden.com/
http://www.cityofgardendale.com/
http://cityofgeorgiana.com/Home/
http://www.goodwater.org/
http://www.guinal.org/
http://www.gulfshoresal.gov/
http://www.guntersvilleal.org/index.php
http://www.hartselle.org/
http://www.headlandalabama.org/
http://www.cityofheflin.org/
http://www.hooveral.org/

Here is the code that passes the URLs and domains to start_urls and allowed_domains:

import csv
import re

with open("urls.csv") as csvfile:
    csvreader = csv.reader(csvfile, delimiter=",")
    for line in csvreader:
        start_urls = line[0]
        start_urls1 = start_urls.split()
        print(start_urls1)
        print(type(start_urls1))
        if start_urls[7:10] == 'www':
            p = re.compile(r'(?<=http://www.).*(?=\/|.*)')
        elif start_urls[7:10] != 'www' and start_urls[-1] == '/':
            p = re.compile(r'(?<=http://).*(?=\/|\s)')
        elif start_urls[7:10] != 'www' and start_urls[-1] != '/':
            p = re.compile(r'(?<=http://).*(?=\/|.*)')
        else:
            p = re.compile(r'(?<=https://).*(?=\/|.*)')

        allowed_domains = re.search(p, start_urls).group()
        allowed_domains1 = allowed_domains.split()
        print(allowed_domains1)
        print(type(allowed_domains1))

The code above reads each URL, converts it into a list (the format start_urls expects), extracts the domain by applying a regex, and passes the domain to allowed_domains (also in list format).
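
(For reference, the same domain extraction can be done without regexes by using the standard library's urllib.parse; a minimal sketch, assuming the same one-URL-per-row input as above:)

from urllib.parse import urlparse

def domain_of(url):
    # netloc is the host part of the URL, e.g. "www.daphneal.com"
    netloc = urlparse(url).netloc
    # strip a leading "www." so links with or without it both match
    return netloc[4:] if netloc.startswith('www.') else netloc

print(domain_of("http://www.daphneal.com/"))        # daphneal.com
print(domain_of("http://cityofgeorgiana.com/Home/"))  # cityofgeorgiana.com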

How should I integrate the code above into my main code, so that allowed_domains and start_urls no longer have to be passed in manually?

Thanks in advance!!!

Best Answer

You can run the spider from a Python script; see more here:

if __name__ == '__main__':
    from scrapy.crawler import CrawlerProcess
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    # parse from csv file
    allowed_domains = ...
    start_urls = ...

    DatabloggerSpider.allowed_domains = allowed_domains
    DatabloggerSpider.start_urls = start_urls
    process.crawl(DatabloggerSpider)
    process.start()
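
(One way to fill in the elided CSV-parsing step is the following sketch. It assumes a one-URL-per-row urls.csv as in the question, uses urllib.parse instead of the regex approach to get each domain, and assumes DatabloggerSpider is defined in, or importable into, the same script:)

import csv
from urllib.parse import urlparse

from scrapy.crawler import CrawlerProcess

if __name__ == '__main__':
    start_urls = []
    allowed_domains = []
    with open("urls.csv") as csvfile:
        for row in csv.reader(csvfile):
            url = row[0].strip()
            start_urls.append(url)
            netloc = urlparse(url).netloc
            # drop a leading "www." so in-domain links match either form
            allowed_domains.append(netloc[4:] if netloc.startswith('www.') else netloc)

    # assumes DatabloggerSpider is defined above or imported from the project
    DatabloggerSpider.allowed_domains = allowed_domains
    DatabloggerSpider.start_urls = start_urls

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(DatabloggerSpider)
    process.start()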

Regarding "python - Recursive scraping with Scrapy (Python)", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48246935/
