
python - How to prevent Scrapy from inserting duplicate data into a database

Reposted. Author: 可可西里. Updated: 2023-11-01 08:30:02

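Can anyone help me with this? I am fairly new to Scrapy and Python. I can't seem to stop duplicate data from being inserted into the database. For example, suppose my database already contains a Mazda priced at $4000. If the 'car' already exists, or the 'price' and 'car' combination already exists, I don't want the spider to insert the crawled data again.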

price | car
-------------
$4000 | Mazda    <----
$3000 | Mazda 3  <----
$4000 | BMW
$4000 | Mazda 3  <---- I also don't want two results like this
$4000 | Mazda    <---- I don't want two results like this

Any help will be greatly appreciated. Thanks.


pipeline.py
-------------------
from scrapy import log
#from scrapy.core.exceptions import DropItem
from twisted.enterprise import adbapi
from scrapy.http import Request
from scrapy.exceptions import DropItem
from scrapy.contrib.pipeline.images import ImagesPipeline
import time
import MySQLdb
import MySQLdb.cursors
import socket
import select
import sys
import os
import errno

----------------------------------
When I add this piece of code, the crawled data is not saved. When I remove it, the data is saved to the database.



class DuplicatesPipeline(object):

    def __init__(self):
        self.car_seen = set()

    def process_item(self, item, spider):
        if item['car'] in self.car_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.car_seen.add(item['car'])
            return item
--------------------------------------

class MySQLStorePipeline(object):

    def __init__(self):
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
            db='test',
            user='root',
            passwd='test',
            cursorclass=MySQLdb.cursors.DictCursor,
            charset='utf8',
            use_unicode=False
        )

    def _conditional_insert(self, tx, item):
        if item.get('price'):
            tx.execute(
                "insert into data (price, car) values (%s, %s)",
                (item['price'], item['car'])
            )

    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        return item
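
Editor's note: the set in DuplicatesPipeline only remembers cars seen during the current crawl, so it cannot know about rows that are already in the table. One possible way to cover that case is to let the database reject repeats. The sketch below is not from the original post: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere, and it assumes a UNIQUE constraint on the car column, which the original schema may not have. The helper name insert_if_new is hypothetical. On MySQL the comparable statement would be INSERT IGNORE against a unique index, with %s placeholders.

```python
import sqlite3

# Stand-in for the MySQL `data` table. The UNIQUE constraint on `car` is an
# assumption for this sketch, not part of the original schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (price TEXT, car TEXT UNIQUE)")


def insert_if_new(conn, item):
    """Insert the item unless a row with the same car already exists.

    Returns True when a row was written and False when it was skipped.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO data (price, car) VALUES (?, ?)",
        (item["price"], item["car"]),
    )
    conn.commit()
    return cur.rowcount == 1


items = [
    {"price": "$4000", "car": "Mazda"},
    {"price": "$3000", "car": "Mazda 3"},
    {"price": "$4000", "car": "BMW"},
    {"price": "$4000", "car": "Mazda 3"},  # same car as an earlier row
    {"price": "$4000", "car": "Mazda"},    # same car as an earlier row
]

results = [insert_if_new(conn, item) for item in items]
rows = conn.execute("SELECT price, car FROM data ORDER BY rowid").fetchall()
```

Because the check lives in the database, it also holds across separate crawls, which the in-memory set cannot do on its own.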



settings.py
------------
SPIDER_MODULES = ['car.spiders']
NEWSPIDER_MODULE = 'car.spiders'
ITEM_PIPELINES = ['car.pipelines.MySQLStorePipeline']

Best answer

Found the problem. Make sure DuplicatesPipeline comes first.

settings.py
ITEM_PIPELINES = {
    'car.pipelines.DuplicatesPipeline': 100,
    'car.pipelines.MySQLStorePipeline': 200,
}
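
Scrapy passes each item through the enabled pipelines in ascending order of the integer assigned in ITEM_PIPELINES, so 100 runs before 200. The following is a minimal plain-Python simulation of that ordering, not Scrapy's actual implementation: DropItem is a local stand-in for scrapy.exceptions.DropItem, and ListStorePipeline and run_pipelines are hypothetical names introduced only for illustration.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""


class DuplicatesPipeline:
    def __init__(self):
        self.car_seen = set()

    def process_item(self, item, spider):
        if item["car"] in self.car_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.car_seen.add(item["car"])
        return item


class ListStorePipeline:
    """Hypothetical stand-in for MySQLStorePipeline; keeps items in a list."""

    def __init__(self):
        self.stored = []

    def process_item(self, item, spider):
        self.stored.append(item)
        return item


def run_pipelines(pipelines, items):
    """Pass each item through the pipelines, lowest order number first."""
    ordered = [p for p, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]
    for item in items:
        try:
            for pipeline in ordered:
                item = pipeline.process_item(item, spider=None)
        except DropItem:
            continue  # a dropped item never reaches the later pipelines


items = [
    {"price": "$4000", "car": "Mazda"},
    {"price": "$3000", "car": "Mazda 3"},
    {"price": "$4000", "car": "BMW"},
    {"price": "$4000", "car": "Mazda 3"},
    {"price": "$4000", "car": "Mazda"},
]

dedupe, store = DuplicatesPipeline(), ListStorePipeline()
run_pipelines({dedupe: 100, store: 200}, items)
stored_cars = [item["car"] for item in store.stored]
```

If the storage step were given the lower number instead, every item would be stored before the duplicate check had a chance to drop it.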

Regarding "python - How to prevent Scrapy from inserting duplicate data into a database", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/29440137/
