
python - Scrapy pipeline.py not inserting items from the spider into MySQL


I'm using Scrapy to scrape news headlines, and I'm new to Scrapy and to scraping in general. For a few days now I've had a big problem piping my scraped data into my SQL database. There are two classes in my pipelines.py file: one inserts the items into the database, and the other backs the scraped data up to a json file for front-end web development reasons.

This is my spider's code. It extracts the news headlines from the start_urls, pulls that data out as strings with extract(), then loops over all of it and uses strip() to remove whitespace for better formatting:

from scrapy.spider import Spider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from Aljazeera.items import AljazeeraItem
from datetime import date, datetime


class AljazeeraSpider(Spider):
    name = "aljazeera"
    allowed_domains = ["aljazeera.com"]
    start_urls = [
        "http://www.aljazeera.com/news/europe/",
        "http://www.aljazeera.com/news/middleeast/",
        "http://www.aljazeera.com/news/asia/",
        "http://www.aljazeera.com/news/asia-pacific/",
        "http://www.aljazeera.com/news/americas/",
        "http://www.aljazeera.com/news/africa/",
        "http://blogs.aljazeera.com/"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//td[@valign="bottom"]')
        contents = sel.xpath('//div[@class="indexSummaryText"]')
        items = []

        for site, content in zip(sites, contents):
            item = AljazeeraItem()
            item['headline'] = site.xpath('div[3]/text()').extract()
            item['content'] = site.xpath('div/a/text()').extract()
            item['date'] = str(date.today())
            for headline, content in zip(item['content'], item['headline']):
                item['headline'] = headline.strip()
                item['content'] = content.strip()
            items.append(item)
        return items
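
For reference, the extract-then-strip step inside that loop can be collapsed into one expression per field. This is only a sketch of the same pattern (it assumes each matched node has at least one text node, and it ignores the headline/content swap done by the inner zip):

# Sketch only: take the first extracted text node, defaulting to '', and strip it.
item['headline'] = (site.xpath('div[3]/text()').extract() or [''])[0].strip()
item['content'] = (site.xpath('div/a/text()').extract() or [''])[0].strip()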

My pipelines.py code is as follows:

import sys
import MySQLdb
import hashlib
from scrapy.exceptions import DropItem
from scrapy.http import Request
import json
import os.path

class SQLStore(object):
    def __init__(self):
        self.conn = MySQLdb.connect(user='root', passwd='', db='aj_db', host='localhost', charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()
        #log data to json file

        def process_item(self, item, spider):

            try:
                self.cursor.execute("""INSERT INTO scraped_data(headlines, contents, dates) VALUES (%s, %s, %s)""", (item['headline'].encode('utf-8'), item['content'].encode('utf-8'), item['date'].encode('utf-8')))
                self.conn.commit()

            except MySQLdb.Error, e:
                print "Error %d: %s" % (e.args[0], e.args[1])

            return item



#log runs into back file
class JsonWriterPipeline(object):

    def __init__(self):
        self.file = open('backDataOfScrapes.json', "w")

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write("item === " + line)
        return item
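
As an aside, the backup file above is opened in __init__ but never closed. A common Scrapy pattern (this is just a sketch of that pattern, not the asker's code) is to open and close it in the spider lifecycle hooks:

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        # open the backup file when the spider starts
        self.file = open('backDataOfScrapes.json', "w")

    def close_spider(self, spider):
        # make sure buffered lines reach disk when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write("item === " + line)
        return item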

My settings.py is as follows:

BOT_NAME = 'Aljazeera'

SPIDER_MODULES = ['Aljazeera.spiders']
NEWSPIDER_MODULE = 'Aljazeera.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Aljazeera (+http://www.yourdomain.com)'

ITEM_PIPELINES = {
    'Aljazeera.pipelines.JsonWriterPipeline': 300,
    'Aljazeera.pipelines.SQLStore': 300,
}
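
Note in passing that both pipelines are registered with the same order value, 300. Scrapy runs pipelines in ascending order of that number, so distinct values make the execution order explicit; for example (the values here are arbitrary):

ITEM_PIPELINES = {
    'Aljazeera.pipelines.JsonWriterPipeline': 300,
    'Aljazeera.pipelines.SQLStore': 400,  # runs after the json backup pipeline
}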

My SQL setup is all fine. After running scrapy crawl aljazeera it works and even outputs the items in json format, like this:

item === {"headline": "Turkey court says Twitter ban violates rights", "content": "Although ruling by Turkey's highest court is binding, it is unclear whether the government will overturn the ban.", "date": "2014-04-02"}

I really can't tell what I'm missing here. I'd be very grateful if you could help.

Thanks for your time,

Best Answer

Your indentation in the SQLStore pipeline is wrong. I've tested it with the correct indentation and it works fine. Copy the version below and it should be perfect.

class SQLStore(object):
    def __init__(self):
        self.conn = MySQLdb.connect(user='root', passwd='', db='aj_db', host='localhost', charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()
        #log data to json file

    def process_item(self, item, spider):

        try:
            self.cursor.execute("""INSERT INTO scraped_data(headlines, contents, dates) VALUES (%s, %s, %s)""", (item['headline'].encode('utf-8'), item['content'].encode('utf-8'), item['date'].encode('utf-8')))
            self.conn.commit()

        except MySQLdb.Error, e:
            print "Error %d: %s" % (e.args[0], e.args[1])

        return item
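
Once the pipeline is fixed, a quick way to confirm rows are landing in the table is to query it directly. A minimal sketch, reusing the same aj_db connection settings as above:

import MySQLdb

# connect with the same credentials the pipeline uses
conn = MySQLdb.connect(user='root', passwd='', db='aj_db', host='localhost', charset="utf8", use_unicode=True)
cursor = conn.cursor()
cursor.execute("SELECT headlines, contents, dates FROM scraped_data LIMIT 5")
for row in cursor.fetchall():
    print row  # a few of the inserted rows
conn.close()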

Regarding python - Scrapy pipeline.py not inserting items from the spider into MySQL, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/22822095/
