I have multiple spiders with different items, and I want to export each item to a separate csv file. I used the code sample from How can scrapy export items to separate csv files per item, but there is a problem.
Right now my spider only writes the "page" items. All items get populated in the shell, but the files stay empty. I debugged the pipeline, but so far I haven't found an error.
Here is my spider:
import csv

import scrapy
from BeautifulSoup import BeautifulSoup
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders import Rule

from DataSpiders import CSV_PATH
from ScrapingItems import TrierDeItem
from SuperSpider import SuperSpider

HTML_PATH = 'pages/trier.de/'


class TrierDeSpider(scrapy.Spider, SuperSpider):
    name = 'trierDeSpider'
    allowed_domains = ['trier.de']
    denied_domains = []
    start_urls = [
        'https://www.trier.de/rathaus-buerger-in/trier-in-zahlen/',
        'https://trier.de/startseite/',
        'https://www.trier.de/leben-in-trier/',
        'https://www.trier.de/kultur-freizeit/',
        'https://www.trier.de/wirtschaft-arbeit/',
        'https://www.trier.de/bildung-wissenschaft/',
        'https://www.trier.de/bauen-wohnen/',
        'https://www.trier.de/umwelt-verkehr/',
    ]
    # Set the starting point for the spider and start crawling from start_urls
    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse', follow=True),)

    def parse(self, response):
        """
        Parse the page body for links. Follow allowed domains by adding them to the request.
        Parse the current page with the callback method parse_page.
        :param response:
        :return:
        """
        for link in LxmlLinkExtractor(allow=self.allowed_domains, deny=self.denied_domains).extract_links(response):
            yield scrapy.Request(response.urljoin(link.url), callback=self.parse_page)

    def parse_page(self, response):
        """
        Parse the current page for information.
        :param response:
        :return:
        """
        trier_de_item = TrierDeItem()
        yield self.parse_general_page_info(response, HTML_PATH)
        # extract the page url
        trier_de_item["url"] = response.url
        # extract the crawling datetime
        trier_de_item["crawling_date_time"] = response.headers['Date']
        # extract the page title
        trier_de_item["title"] = response.css('title::text').extract()
        # extract description tags
        trier_de_item["description"] = response.xpath('//meta[@name="description"]/@content').extract()
        trier_de_item["og_description"] = response.xpath('//meta[@name="og:description"]/@content').extract()
        # extract all page headers
        trier_de_item["news_title"] = response.xpath('//div[@class="dachzeile"]/text()').extract()
        # extract the topic
        trier_de_item["topic"] = response.xpath('//div[@class="topic"]/text()').extract()
        # extract headlines
        trier_de_item['headlines'] = response.xpath('//h1/text()').extract()
        # check if the page contains a table
        table = response.xpath('//table[@class="datentabelle"]').extract()
        if len(table) > 0:
            self.parse_table(response.body, trier_de_item['headlines'][0])
        yield trier_de_item

    @staticmethod
    def parse_table(body_html, title):
        '''
        Parse an HTML page with a table and save it to a csv file
        :param body_html:
        :param title:
        :return:
        '''
        title = title.replace('/', '')
        try:
            # Create the filename from the title
            filename = title + '.csv'
            soup = BeautifulSoup(body_html)
            soup.prettify('utf-8')
            content = []
            # find all tables in the html
            tables = soup.findAll('table')
            for table in tables:
                # find each table row
                for row in table.findAll('tr'):
                    # extract each table header and cell and append its text to the line
                    line = []
                    for header in row.findAll('th'):
                        if ' ' in header.text:
                            line.append('')
                        else:
                            line.append(header.text)
                    for row in row.findAll('td'):
                        if ' ' in row.text:
                            line.append('')
                        else:
                            line.append(row.text)
                    content.append(line)
            # Open a new csv file and write each line to the file
            with open(CSV_PATH + filename, 'wb') as csv_file:
                wr = csv.writer(csv_file)
                for line in content:
                    wr.writerow(line)
        except Exception as e:
            print(e)
            pass
The SuperSpider:
import urlparse

from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

from DataSpiders import write_html
from DataSpiders.ScrapingItems import PageItem, BaseItem

ALLOWED_FILE_TYPES = ('.pdf', '.csv', '.xls', '.xlsx')


class SuperSpider:
    def __init__(self):
        pass

    def url_join(self, urls, response):
        '''
        Join each URL with the response URL
        :param urls:
        :param response:
        :return:
        '''
        joined_urls = []
        for url in urls:
            joined_urls.append(response.urljoin(url))
        return joined_urls

    def parse_general_page_info(self, response, HTML_PATH):
        page_item = PageItem()
        page_item["url"] = response.url
        # extract the response body
        if 'jsp' in response.url:
            url = response.url.split('.jsp')
            write_html(url[0], response.body, HTML_PATH)
        elif '?' in response.url:
            url = response.url.split('?')
            write_html(url[0], response.body, HTML_PATH)
        else:
            write_html(response.url, response.body, HTML_PATH)
        # Search for files that match any allowed file type
        found_files = []
        domain = response.url.split('/')[2]
        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()
            if link.endswith(ALLOWED_FILE_TYPES):
                link = urlparse.urljoin(domain, link)
                found_files.append(link)
        # extract all referring links
        extractor = LxmlLinkExtractor()
        linklist = []
        for link in extractor.extract_links(response):
            # extract links which contain a file in the url and add those to 'found_files' for downloading
            if '?imgUid' in link.url:
                fullpath = link.url
                path = fullpath.split('.de')[1]
                found_files.append(urlparse.urljoin(domain, path))
            else:
                linklist.append(link.url)
        page_item["links"] = linklist
        # add all found files to the page item
        page_item["file_urls"] = self.url_join(found_files, response)
        # extract the page title
        page_item["title"] = response.css('title::text').extract()
        # extract all image urls
        relative_img_urls = response.css("img::attr(src)").extract()
        page_item["image_urls"] = self.url_join(relative_img_urls, response)
        return page_item

    def parse_base_page_information(self, response):
        baseItem = BaseItem()
        baseItem["url"] = response.url
        # extract the page title
        baseItem["title"] = response.css('title::text').extract()
        baseItem["crawling_date_time"] = response.headers['Date']
        # extract description tags
        baseItem["description"] = response.xpath('//meta[@name="description"]/@content').extract()
        baseItem["og_description"] = response.xpath('//meta[@name="og:description"]/@content').extract()
        baseItem['headlines'] = response.xpath('//h1/text()').extract()
        return baseItem
The scraping items:
from scrapy import Item, Field


class PageItem(Item):
    url = Field()
    title = Field()
    image_urls = Field()
    file_urls = Field()
    links = Field()


class BaseItem(Item):
    url = Field()
    title = Field()
    crawling_date_time = Field()
    description = Field()
    og_description = Field()
    headlines = Field()


class TrierDeItem(BaseItem):
    news_title = Field()
    tag = Field()
    topic = Field()
And the multi-CSV pipeline:
# imports not shown in the original snippet; these are the usual ones for this pattern
from scrapy import signals
from scrapy.exporters import CsvItemExporter
from scrapy.xlib.pydispatch import dispatcher


class MultiCSVItemPipeline(object):
    CSVPath = "csv_data/"
    SaveTypes = ['page', 'base', 'trierde', 'triermitgestalten', 'teleport', 'lokalocomment', 'lokalo', 'lokalonews']

    def __init__(self):
        dispatcher.connect(self.spider_opened, signal=signals.spider_opened)
        dispatcher.connect(self.spider_closed, signal=signals.spider_closed)

    def spider_opened(self, spider):
        self.files = dict([(name, open(self.CSVPath + name + '.csv', 'ab')) for name in self.SaveTypes])
        self.exporters = dict([(name, CsvItemExporter(self.files[name])) for name in self.SaveTypes])
        [e.start_exporting() for e in self.exporters.values()]

    def spider_closed(self, spider):
        [e.finish_exporting() for e in self.exporters.values()]
        [f.close() for f in self.files.values()]

    def process_item(self, item, spider):
        what = item_type(item)
        if what in set(self.SaveTypes):
            self.exporters[what].export_item(item)
        return item


def item_type(item):
    '''
    Returns the scraping item name
    :param item:
    :return:
    '''
    return type(item).__name__.replace('Item', '').lower()
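For reference, the item_type helper derives the lowercase names that SaveTypes must contain; any item whose derived name is not in SaveTypes is silently skipped by process_item. A quick sketch of the mapping for the items defined earlier (the import path is taken from SuperSpider and may differ in your layout):

from DataSpiders.ScrapingItems import PageItem, TrierDeItem

print(item_type(PageItem()))     # 'page'
print(item_type(TrierDeItem()))  # 'trierde', which is listed in SaveTypes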
I haven't found a solution yet, but I have tried a few approaches that failed.
Given the options I've tried, I believe there is some error in the pipeline itself... I'd appreciate any help anyone can offer.
Additional information: before I changed the pipeline to MultiCSV, I was able to save every item to csv.
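One way to narrow a symptom like this down is to log which item types actually reach the pipeline. A minimal sketch, assuming the pipeline and item_type helper shown above (this is a diagnostic aid, not the fix itself):

import logging

# drop-in replacement for MultiCSVItemPipeline.process_item, for debugging only
def process_item(self, item, spider):
    what = item_type(item)
    if what not in set(self.SaveTypes):
        # items whose type name is not in SaveTypes are dropped without any message
        logging.warning("No exporter for item type '%s'", what)
        return item
    self.exporters[what].export_item(item)
    return item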
Best answer
After I could not resolve the problem with the Scrapy exporters, I decided to create my own exporter.
The following code is for anyone who wants to export multiple different items to separate csv files from one or more spiders. It has worked for me so far, but I'm still checking the code for bugs. Feel free to reply if you have any ideas for improvement.
# imports not shown in the original post; these are the modules the code below uses
import csv
import logging
import os
import types

from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher


class MultiCSVItemPipeline(object):
    # Subfolder path where the csv files are stored
    CSVPath = "csv_data/"
    # All allowed items
    SaveTypes = ['page', 'base', 'trierde', 'triermitgestalten', 'teleport', 'lokalocomment', 'lokalo', 'lokalonews']
    # List of already checked csv headers
    CheckedHeaders = []

    def __init__(self):
        import sys
        reload(sys)
        sys.setdefaultencoding('utf8')
        dispatcher.connect(self.spider_opened, signal=signals.spider_opened)
        dispatcher.connect(self.spider_closed, signal=signals.spider_closed)

    def spider_opened(self, spider):
        # Check if the csv files exist and create new ones if not
        for file in set(self.SaveTypes):
            f = open(self.CSVPath + file + '.csv', 'a+')
            f.close()

    def spider_closed(self, spider):
        # not needed anymore
        # [e.finish_exporting() for e in self.exporters.values()]
        # [f.close() for f in self.files.values()]
        pass

    def process_item(self, item, spider):
        what = item_type(item)
        if what in set(self.SaveTypes):
            try:
                # Check if the csv file contains a header, but only for files that aren't checked yet
                if what not in self.CheckedHeaders:
                    self.check_header(what, item)
                self.write_item_to_row(item, what)
            except Exception as e:
                logging.error("########################################################")
                logging.error("Error writing to " + what + ".csv file ")
                logging.error("Error Message: " + e.message)
                logging.error("Error Reason: " + e.reason)
                logging.error("Error Object: " + e.object)
                logging.error("########################################################")
        return item

    def write_item_to_row(self, item, what):
        """
        Write a single item to a row in the csv file
        :param item:
        :param what:
        :return:
        """
        ofile = open(self.CSVPath + what + '.csv', "ab")
        writer = csv.writer(ofile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
        item_dict = item.__dict__['_values']
        row = []
        for k in item_dict:
            d = item_dict[k]
            # If the field is not a list, join the element to a string; strip delimiters and encode as utf-8
            if not isinstance(d, types.ListType):
                value = ''.join(item_dict[k]).replace('\t', '').replace('\n', '').encode('utf8')
            else:
                value = ','.join(item_dict[k]).replace('\t', '').replace('\n', '').encode('utf8')
            row.append(value)
        writer.writerow(row)
        ofile.close()

    def check_header(self, what, item):
        """
        Check if the file contains header elements and create them if missing
        :param what:
        :param item:
        :return:
        """
        try:
            with open(self.CSVPath + what + '.csv', 'ab+') as csvfile:
                writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
                item_dict = item.__dict__['_values']
                # If the file is empty, create a new csv header
                if os.stat(self.CSVPath + what + '.csv').st_size == 0:
                    self.write_csv_header(item_dict, writer)
                else:
                    # Read the first row and check the header elements
                    read_csv = csv.reader(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
                    first_row = read_csv.next()
                    # if not all headers are set in the csv file, print a warning
                    if not self.check_key_in_csv_header(item_dict, first_row):
                        # TODO: Add missing headers to the csv file
                        logging.warning("Wrong headers for file " + what + ".csv")
                self.CheckedHeaders.append(what)
                csvfile.close()
            return True
        except Exception as e:
            logging.error(e.message)
            return False

    @staticmethod
    def write_csv_header(item_dict, writer):
        """
        Write the header of a csv file.
        The header is written from the keys of the scrapy item
        :param item_dict:
        :param writer:
        :return:
        """
        first_row = []
        for k in item_dict:
            # Join each key to a string, strip delimiters and encode as utf-8
            value = ''.join(k).replace('\t', '').replace('\n', '').encode('utf8')
            first_row.append(value)
        writer.writerow(first_row)

    @staticmethod
    def check_key_in_csv_header(item_dict, row):
        """
        Check, for each item key, whether it's contained in the first line of the csv.
        k (key) stands for each dictionary key of the scrapy item.
        :param item_dict:
        :param row:
        :return:
        """
        for k in item_dict:
            if k not in row:
                return False
        return True
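One more thing worth checking with either version: the pipeline only runs if it is registered in the project's settings.py. A minimal sketch, where the module path DataSpiders.pipelines is an assumption about the project layout:

# settings.py -- the module path below is an assumption; adjust it to where MultiCSVItemPipeline lives
ITEM_PIPELINES = {
    'DataSpiders.pipelines.MultiCSVItemPipeline': 300,
}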
Regarding python - Scrapy MultiCSVItemPipeline exports some empty items, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49782869/