
python - Scrapy pipeline.py not inserting items from the spider into MySQL


I'm using Scrapy to scrape news headlines, and I'm new to Scrapy and to scraping in general. For a few days now I've had a big problem piping my scraped data into my SQL database. There are two classes in my pipelines.py file: one inserts the items into the database, and the other backs the scraped data up to a json file for front-end web development reasons.

This is my spider's code. It extracts the news headlines from the start_urls, pulls that data out as strings with extract(), then loops over all of it and uses strip() to remove whitespace for better formatting:

from scrapy.spider import Spider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from Aljazeera.items import AljazeeraItem
from datetime import date, datetime


class AljazeeraSpider(Spider):
    name = "aljazeera"
    allowed_domains = ["aljazeera.com"]
    start_urls = [
        "http://www.aljazeera.com/news/europe/",
        "http://www.aljazeera.com/news/middleeast/",
        "http://www.aljazeera.com/news/asia/",
        "http://www.aljazeera.com/news/asia-pacific/",
        "http://www.aljazeera.com/news/americas/",
        "http://www.aljazeera.com/news/africa/",
        "http://blogs.aljazeera.com/"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//td[@valign="bottom"]')
        contents = sel.xpath('//div[@class="indexSummaryText"]')
        items = []

        for site, content in zip(sites, contents):
            item = AljazeeraItem()
            item['headline'] = site.xpath('div[3]/text()').extract()
            item['content'] = site.xpath('div/a/text()').extract()
            item['date'] = str(date.today())
            for headline, content in zip(item['content'], item['headline']):
                item['headline'] = headline.strip()
                item['content'] = content.strip()
            items.append(item)
        return items
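
For reference, the extract-then-strip step inside that loop can be collapsed into one expression per field. This is only a sketch of the same pattern (it assumes each matched node has at least one text node, and it ignores the headline/content swap done by the inner zip):

# Sketch only: take the first extracted text node, defaulting to '', and strip it.
item['headline'] = (site.xpath('div[3]/text()').extract() or [''])[0].strip()
item['content'] = (site.xpath('div/a/text()').extract() or [''])[0].strip()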

My pipelines.py code is as follows:

import sys
import MySQLdb
import hashlib
from scrapy.exceptions import DropItem
from scrapy.http import Request
import json
import os.path

class SQLStore(object):
    def __init__(self):
        self.conn = MySQLdb.connect(user='root', passwd='', db='aj_db', host='localhost', charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()
        #log data to json file

        def process_item(self, item, spider):

            try:
                self.cursor.execute("""INSERT INTO scraped_data(headlines, contents, dates) VALUES (%s, %s, %s)""", (item['headline'].encode('utf-8'), item['content'].encode('utf-8'), item['date'].encode('utf-8')))
                self.conn.commit()

            except MySQLdb.Error, e:
                print "Error %d: %s" % (e.args[0], e.args[1])

            return item



#log runs into back file
class JsonWriterPipeline(object):

    def __init__(self):
        self.file = open('backDataOfScrapes.json', "w")

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write("item === " + line)
        return item
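
As an aside, the backup file above is opened in __init__ but never closed. A common Scrapy pattern (this is just a sketch of that pattern, not the asker's code) is to open and close it in the spider lifecycle hooks:

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        # open the backup file when the spider starts
        self.file = open('backDataOfScrapes.json', "w")

    def close_spider(self, spider):
        # make sure buffered lines reach disk when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write("item === " + line)
        return item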

My settings.py is as follows:

BOT_NAME = 'Aljazeera'

SPIDER_MODULES = ['Aljazeera.spiders']
NEWSPIDER_MODULE = 'Aljazeera.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Aljazeera (+http://www.yourdomain.com)'

ITEM_PIPELINES = {
    'Aljazeera.pipelines.JsonWriterPipeline': 300,
    'Aljazeera.pipelines.SQLStore': 300,
}
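
Note in passing that both pipelines are registered with the same order value, 300. Scrapy runs pipelines in ascending order of that number, so distinct values make the execution order explicit; for example (the values here are arbitrary):

ITEM_PIPELINES = {
    'Aljazeera.pipelines.JsonWriterPipeline': 300,
    'Aljazeera.pipelines.SQLStore': 400,  # runs after the json backup pipeline
}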

My SQL setup is all fine. After running scrapy crawl aljazeera it works and even outputs the items in json format, like this:

item === {"headline": "Turkey court says Twitter ban violates rights", "content": "Although ruling by Turkey's highest court is binding, it is unclear whether the government will overturn the ban.", "date": "2014-04-02"}

I really can't tell what I'm missing here. I'd be very grateful if you could help.

Thanks for your time,

Best Answer

Your indentation in the SQLStore pipeline is wrong. I've tested it with the correct indentation and it works fine. Copy the version below and it should be perfect.

class SQLStore(object):
    def __init__(self):
        self.conn = MySQLdb.connect(user='root', passwd='', db='aj_db', host='localhost', charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()
        #log data to json file

    def process_item(self, item, spider):

        try:
            self.cursor.execute("""INSERT INTO scraped_data(headlines, contents, dates) VALUES (%s, %s, %s)""", (item['headline'].encode('utf-8'), item['content'].encode('utf-8'), item['date'].encode('utf-8')))
            self.conn.commit()

        except MySQLdb.Error, e:
            print "Error %d: %s" % (e.args[0], e.args[1])

        return item
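
Once the pipeline is fixed, a quick way to confirm rows are landing in the table is to query it directly. A minimal sketch, reusing the same aj_db connection settings as above:

import MySQLdb

# connect with the same credentials the pipeline uses
conn = MySQLdb.connect(user='root', passwd='', db='aj_db', host='localhost', charset="utf8", use_unicode=True)
cursor = conn.cursor()
cursor.execute("SELECT headlines, contents, dates FROM scraped_data LIMIT 5")
for row in cursor.fetchall():
    print row  # a few of the inserted rows
conn.close()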

Regarding python - Scrapy pipeline.py not inserting items from the spider into MySQL, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/22822095/
