
python - Removing duplicates from a Scrapy pipeline


My Scrapy crawler collects data from the PTT website and uses gspread to write the scraped data into a Google spreadsheet. My PTT spider parses the latest 40 posts on PTT every day, and now I want to remove duplicates from those 40 posts: for example, if a post's title or link is the same as yesterday's, it should not be written to the Google spreadsheet.

I know I should use DropItem in Scrapy, but honestly I don't know how to fix my code (I'm new to Python), so I'd like to ask for help. Thanks.

This is my ptt spider code:

# -*- coding: utf-8 -*-
import scrapy
# from scrapy.exceptions import CloseSpider
from myFirstScrapyProject.items import MyfirstscrapyprojectItem

class PttSpider(scrapy.Spider):
    count_page = 1
    name = 'ptt'
    allowed_domains = ['www.ptt.cc/']
    start_urls = ['https://www.ptt.cc/bbs/e-shopping/search?q=%E8%9D%A6%E7%9A%AE']+['https://www.ptt.cc/bbs/e-seller/search?q=%E8%9D%A6%E7%9A%AE']
    # start_urls = ['https://www.ptt.cc/bbs/e-shopping/index.html']

    def parse(self, response):
        items = MyfirstscrapyprojectItem()
        for q in response.css('div.r-ent'):
            items['push'] = q.css('div.nrec > span.h1::text').extract_first()
            items['title'] = q.css('div.title > a::text').extract_first()
            items['href'] = q.css('div.title > a::attr(href)').extract_first()
            items['date'] = q.css('div.meta > div.date ::text').extract_first()
            items['author'] = q.css('div.meta > div.author ::text').extract_first()
            yield items
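A side note on the spider: it creates a single MyfirstscrapyprojectItem and mutates it on every pass through the loop. That happens to work while the export is synchronous, but the usual Scrapy pattern is a fresh item per row, which avoids a later yield overwriting an earlier one if a pipeline ever defers its work. A minimal sketch of the same parse method, assuming nothing else changes:

def parse(self, response):
    for q in response.css('div.r-ent'):
        items = MyfirstscrapyprojectItem()  # fresh item for every post
        items['push'] = q.css('div.nrec > span.h1::text').extract_first()
        items['title'] = q.css('div.title > a::text').extract_first()
        items['href'] = q.css('div.title > a::attr(href)').extract_first()
        items['date'] = q.css('div.meta > div.date ::text').extract_first()
        items['author'] = q.css('div.meta > div.author ::text').extract_first()
        yield items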

And this is my pipeline:

from myFirstScrapyProject.exporters import GoogleSheetItemExporter
from scrapy.exceptions import DropItem

class MyfirstscrapyprojectPipeline(object):
    def open_spider(self, spider):
        self.exporter = GoogleSheetItemExporter()
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
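For completeness: a pipeline class only runs if it is enabled in the project settings, so if nothing reaches the spreadsheet at all, that is worth checking first. A minimal sketch, assuming the default project layout (the module path follows the project name and is an assumption):

# settings.py -- register the pipeline so Scrapy calls it for every scraped item
ITEM_PIPELINES = {
    'myFirstScrapyProject.pipelines.MyfirstscrapyprojectPipeline': 300,
}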

Thanks to Sharmiko, I rewrote it, but it doesn't seem to work. What should I fix?

from myFirstScrapyProject.exporters import GoogleSheetItemExporter
from scrapy.exceptions import DropItem

class MyfirstscrapyprojectPipeline(object):

    def open_spider(self, spider):
        self.exporter = GoogleSheetItemExporter()
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()

    # def process_item(self, item, spider):
    #     self.exporter.export_item(item)
    #     return item

    # class DuplicatesTitlePipeline(object):
    def __init__(self):
        self.article = set()

    def process_item(self, item, spider):
        href = item['href']
        if href in self.article:
            raise DropItem('duplicates href found %s', item)
        self.exporter.export_item(item)
        return item

This is the code for exporting to Google Sheets:

import gspread
from oauth2client.service_account import ServiceAccountCredentials
from scrapy.exporters import BaseItemExporter

class GoogleSheetItemExporter(BaseItemExporter):
    def __init__(self):
        scope = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']
        credentials = ServiceAccountCredentials.from_json_keyfile_name('pythonupload.json', scope)
        gc = gspread.authorize(credentials)
        self.spreadsheet = gc.open('Community')
        self.worksheet = self.spreadsheet.get_worksheet(1)

    def export_item(self, item):
        self.worksheet.append_row([item['push'], item['title'],
                                   item['href'], item['date'], item['author']])

Best Answer

You should modify your process_item function to check for duplicate elements; if one is found, you can drop it:

from scrapy.exceptions import DropItem
...
def process_item(self, item, spider):
    if [your duplicate check logic goes here]:
        raise DropItem('Duplicate element found')
    else:
        self.exporter.export_item(item)
        return item
Dropped items are no longer passed to the remaining pipeline components. You can read more about item pipelines here.
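Applied to the rewritten pipeline in the question, the immediate bug is that hrefs are never added to self.article, so nothing is ever recognized as a duplicate. A minimal sketch of a corrected version, keeping the names from the question:

from myFirstScrapyProject.exporters import GoogleSheetItemExporter
from scrapy.exceptions import DropItem

class MyfirstscrapyprojectPipeline(object):
    def open_spider(self, spider):
        self.article = set()  # hrefs seen so far in this run
        self.exporter = GoogleSheetItemExporter()
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()

    def process_item(self, item, spider):
        href = item['href']
        if href in self.article:
            raise DropItem('duplicate href found: %s' % href)
        self.article.add(href)  # remember it, so the next occurrence is dropped
        self.exporter.export_item(item)
        return item

Note that this in-memory set only deduplicates within a single run. To also skip posts exported on a previous day, open_spider would have to preload the set, for example with self.article = set(self.exporter.worksheet.col_values(3)), assuming href is stored in the third column of the sheet (an assumption; adjust to the actual layout).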

Regarding "python - Removing duplicates from a Scrapy pipeline", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/64999990/
