
python - Scrapy connects to MySQL only for certain websites

Reposted. Author: 行者123. Updated: 2023-11-29 18:03:42

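When I crawl 'http://www.example.com', I can connect to MySQL and insert values into the database.

However, when I try to crawl a different website, namely 'https://www.nytimes.com', the connection fails. I don't understand why.

Error when trying to crawl https://www.nytimes.com: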

2018-01-04 14:38:01 [scrapy.middleware] INFO: Enabled item pipelines:
['properties.pipelines.MysqlWriter']

2018-01-04 14:38:02 [basic] ERROR: Can't connect to MySQL:mysql://root:password@localhost:3306/cat

My pipeline:

    import traceback

    import dj_database_url
    import MySQLdb

    from twisted.internet import defer
    from twisted.enterprise import adbapi
    from scrapy.exceptions import NotConfigured


    class MysqlWriter(object):
        """
        A pipeline that writes to MySQL databases
        """

        @classmethod
        def from_crawler(cls, crawler):
            """Retrieves scrapy crawler and accesses pipeline's settings"""

            # Get MySQL URL from settings
            mysql_url = crawler.settings.get('MYSQL_PIPELINE_URL', None)

            # If it doesn't exist, disable the pipeline
            if not mysql_url:
                raise NotConfigured

            # Create the class
            return cls(mysql_url)

        def __init__(self, mysql_url):
            """Opens a MySQL connection pool"""

            # Store the url for future reference
            self.mysql_url = mysql_url
            # Report connection error only once
            self.report_connection_error = True

            # Parse MySQL URL and try to initialize a connection
            conn_kwargs = MysqlWriter.parse_mysql_url(mysql_url)
            self.dbpool = adbapi.ConnectionPool('MySQLdb',
                                                charset='utf8',
                                                use_unicode=True,
                                                connect_timeout=5,
                                                **conn_kwargs)

        def close_spider(self, spider):
            """Discard the database pool on spider close"""
            self.dbpool.close()

        @defer.inlineCallbacks
        def process_item(self, item, spider):
            """Processes the item. Does insert into MySQL"""

            logger = spider.logger

            try:
                yield self.dbpool.runInteraction(self.do_replace, item)
            except MySQLdb.OperationalError:
                if self.report_connection_error:
                    logger.error("Can't connect to MySQL: %s" % self.mysql_url)
                    self.report_connection_error = False
            except:
                print traceback.format_exc()

            # Return the item for the next stage
            defer.returnValue(item)

        @staticmethod
        def do_replace(tx, item):
            """Does the actual REPLACE INTO"""

            sql = """REPLACE INTO text2 (url, text)
                     VALUES (%s,%s)"""

            args = (
                item["url"],
                item["words"],
            )

            tx.execute(sql, args)

        @staticmethod
        def parse_mysql_url(mysql_url):
            """
            Parses mysql url and prepares arguments for
            adbapi.ConnectionPool()
            """

            params = dj_database_url.parse(mysql_url)

            conn_kwargs = {}
            conn_kwargs['host'] = params['HOST']
            conn_kwargs['user'] = params['USER']
            conn_kwargs['passwd'] = params['PASSWORD']
            conn_kwargs['db'] = params['NAME']
            conn_kwargs['port'] = params['PORT']

            # Remove items with empty values
            conn_kwargs = dict((k, v) for k, v in conn_kwargs.iteritems() if v)

            return conn_kwargs
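
A note on the handler in `process_item`: it logs every `MySQLdb.OperationalError` as "Can't connect to MySQL", so a failure that has nothing to do with the connection looks like a connection problem. A minimal sketch of reporting the driver's own error code and message instead is shown below. The `OperationalError` class here is a hypothetical stand-in so the snippet runs without MySQL, `report_db_error` is an illustrative helper name, and the simulated error values are examples rather than output from the original crawl.

```python
import logging

logger = logging.getLogger("basic")


class OperationalError(Exception):
    """Hypothetical stand-in for MySQLdb.OperationalError (no database needed)."""


def report_db_error(exc, log=logger):
    """Log the driver's error code and message, and return the logged text."""
    # DB-API drivers such as MySQLdb usually raise with args == (errno, message).
    if len(exc.args) >= 2:
        text = "MySQL error %s: %s" % (exc.args[0], exc.args[1])
    else:
        text = "MySQL error: %s" % (exc,)
    log.error(text)
    return text


try:
    # Simulated failure; the code and message are illustrative only.
    raise OperationalError(1241, "Operand should contain 1 column(s)")
except OperationalError as exc:
    message = report_db_error(exc)
```

In the pipeline, the same idea would mean binding the exception (`except MySQLdb.OperationalError as e:`) and logging `e` rather than a fixed "Can't connect" string.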

My spider:

    from scrapy.loader.processors import MapCompose, Join
    from scrapy.loader import ItemLoader
    from properties.items import PropertiesItem
    import datetime
    import urlparse
    import socket
    import scrapy


    class BasicSpider(scrapy.Spider):
        name = "basic"
        allowed_domains = ["web"]

        # Start on a property page
        #start_urls = [i.strip() for i in open('urls.txt').readlines()]
        start_urls = ('http://www.nytimes.com',)

        def parse(self, response):
            """ This function parses a property page.
            @url http://web:9312/properties/property_000000.html
            @returns items 1
            @scrapes title price description address image_urls
            @scrapes url project spider server date
            """
            # Create the loader using the response
            l = ItemLoader(item=PropertiesItem(), response=response)

            # Load fields using XPath expressions
            l.add_xpath('words', '//p/text()',
                        MapCompose(unicode.strip, unicode.title))

            # Housekeeping fields
            l.add_value('url', response.url)
            # l.add_value('project', self.settings.get('BOT_NAME'))
            #l.add_value('spider', self.name)
            #l.add_value('server', socket.gethostname())
            #l.add_value('date', datetime.datetime.now())

            return l.load_item()

Best answer

I figured out what was wrong: I was trying to insert a list into a single column in MySQL. My old code:

    sql = """REPLACE INTO text2 (url, text)
             VALUES (%s,%s)"""

    args = (
        item["url"],
        item["words"],
    )

My new code:

    sql = """REPLACE INTO text2 (url, text)
             VALUES (%s,%s)"""

    args = (
        item["url"],
        str(item["words"]),
    )
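
Note that `str(item["words"])` stores the Python representation of the list, brackets and quotes included. If plain text is wanted in the column, joining the fragments is one alternative. The sketch below illustrates both the failure and the fix using the standard-library `sqlite3` module as a stand-in for MySQL, so it runs without a server; the table schema, the sample `words` values, and the helper names `words_to_text` and `try_insert` are assumptions for illustration. `sqlite3` uses `?` placeholders where MySQLdb uses `%s`, and the exact exception raised by MySQLdb will differ.

```python
import sqlite3


def words_to_text(words, sep=" "):
    """Collapse a list of text fragments into one string; pass strings through."""
    if isinstance(words, (list, tuple)):
        return sep.join(words)
    return words


def try_insert(conn, url, text):
    """Attempt the REPLACE and report whether the driver accepted the arguments."""
    try:
        conn.execute("REPLACE INTO text2 (url, text) VALUES (?, ?)", (url, text))
        return True
    except sqlite3.Error:
        # A list cannot be bound to a single column placeholder.
        return False


# In-memory database with an assumed two-column schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE text2 (url TEXT PRIMARY KEY, text TEXT)")

# Sample values standing in for what the '//p/text()' XPath might return.
words = ["First Paragraph", "Second Paragraph"]

list_ok = try_insert(conn, "https://www.nytimes.com", words)
joined_ok = try_insert(conn, "https://www.nytimes.com", words_to_text(words))
stored = conn.execute("SELECT text FROM text2").fetchone()[0]
```

Here the insert with the raw list is rejected, while the joined string is stored as ordinary text.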

Regarding "python - Scrapy connects to MySQL only for certain websites", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48103402/
