
python - Avoiding scraping data from pages that have already been scraped


Good evening everyone,

I'm still using my spider to scrape data from a news website, but I've run into another problem. My original question was posted here: Scrapy outputs [ into my .json file, but that one has since been solved.

I've managed to get quite a bit further, having had to account for empty items and add a search function. I'm now trying to scrape only the articles I haven't scraped yet (bearing in mind that I may still want to extract the links from them). I can't work out where to put the code that:

a.) works out when the last crawl was completed, and
b.) compares each article's date with the date of that last crawl.
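Roughly, I picture something like the sketch below; the date format and the last_crawl.txt file name are just guesses on my part for wherever the timestamp would actually live:

# a rough sketch, not working code: persist the time of the last crawl
# and compare it against each article's date before scraping
from datetime import datetime

LAST_CRAWL_FILE = "last_crawl.txt"  # placeholder location for the timestamp
DATE_FORMAT = "%Y-%m-%d %H:%M"      # guessed format, would need to match spnDate

def load_last_crawled():
    # (a) work out when the last crawl was completed; fall back to the
    # beginning of time if this is the first run
    try:
        with open(LAST_CRAWL_FILE) as f:
            return datetime.strptime(f.read().strip(), DATE_FORMAT)
    except (IOError, ValueError):
        return datetime.min

def save_last_crawled():
    # record the current time once the crawl finishes
    with open(LAST_CRAWL_FILE, "w") as f:
        f.write(datetime.now().strftime(DATE_FORMAT))

def article_is_new(date_string, last_crawled):
    # (b) compare the article's date with the date of the last crawl,
    # as real datetime objects rather than strings or lists
    return datetime.strptime(date_string, DATE_FORMAT) > last_crawled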

I may just be struggling with the logic, so I'm turning to you.

My spider:

# tabbing in python is apparently VERY important, so be aware and make sure
# things that should line up do so

# import the CrawlSpider class along with its Rule (this lets us recursively
# crawl pages)
from scrapy.contrib.spiders import CrawlSpider, Rule

# import the link extractor, which extracts links from pages
from scrapy.contrib.linkextractors import LinkExtractor

# import our items as defined in items.py
from basic.items import BasicItem

# import time so that we can get the current date and time
import time

# import re, which lets us match strings against regular expressions
import re

# create a new spider with the CrawlSpider class
class BasicSpiderSpider(CrawlSpider):

    # name of the spider; this is used to run it (i.e. scrapy crawl basic_spider)
    name = "basic_spider"

    # domains that the spider is allowed to crawl over
    allowed_domains = ["news24.com"]

    # where to start crawling from
    start_urls = [
        'http://www.news24.com',
    ]

    # rules for the link extractor (i.e. where it is allowed to look for links,
    # what to do once it has found them, and whether it is allowed to follow them)
    rules = (
        Rule(LinkExtractor(), callback="parse_items", follow=True),
    )

    # defining the callback function
    def parse_items(self, response):

        # defines the top-level XPath where all of our information can be found;
        # it needs to be as specific as possible to avoid duplicates
        for title in response.xpath('//*[@id="aspnetForm"]'):

            # list of keywords to search for
            key = re.compile("joburg|durban", re.IGNORECASE)

            # extracting the data to compare with the keywords; this is for the
            # headlines, and the join converts it from a list to a string
            headlist = title.xpath('//*[@id="article_special"]//h1/text()').extract()
            head = ''.join(headlist)

            # and this is for the article
            artlist = title.xpath('//*[@id="article-body"]//text()').extract()
            art = ''.join(artlist)

            # if any keywords are found in the headline:
            if key.search(head):
                if last_crawled > response.xpath('//*[@id="spnDate"]/text()').extract():

                    # define the top-level XPath again, as Python won't look
                    # outside its current function
                    for thing in response.xpath('//*[@id="aspnetForm"]'):

                        # fill the items defined in items.py with the relevant data
                        item = BasicItem()
                        item['Headline'] = thing.xpath('//*[@id="article_special"]//h1/text()').extract()
                        item["Article"] = thing.xpath('//*[@id="article-body"]/p[1]/text()').extract()
                        item["Date"] = thing.xpath('//*[@id="spnDate"]/text()').extract()
                        item["Link"] = response.url

                        # I found that even when being careful about my XPaths I
                        # still got empty fields and lines; the below fixes that
                        if item['Headline']:
                            if item["Article"]:
                                if item["Date"]:
                                    last_crawled = time.strftime("%Y-%m-%d %H:%M")
                                    yield item

            # if the headline doesn't match, check the article body
            elif key.search(art):
                if last_crawled > response.xpath('//*[@id="spnDate"]/text()').extract():
                    for thing in response.xpath('//*[@id="aspnetForm"]'):
                        item = BasicItem()
                        item['Headline'] = thing.xpath('//*[@id="article_special"]//h1/text()').extract()
                        item["Article"] = thing.xpath('//*[@id="article-body"]/p[1]/text()').extract()
                        item["Date"] = thing.xpath('//*[@id="spnDate"]/text()').extract()
                        item["Link"] = response.url

                        if item['Headline']:
                            if item["Article"]:
                                if item["Date"]:
                                    last_crawled = time.strftime("%Y-%m-%d %H:%M")
                                    yield item

It isn't working, but as I mentioned, I'm doubtful about the logic anyway. Could someone let me know whether I'm on the right track?

Thanks again for your help.

Best Answer

You seem to be using last_crawled completely out of context. But don't dwell on it too much: you'd be better off using the deltafetch middleware, which was created for exactly what you're trying to do:

This is a spider middleware to ignore requests to pages containing items seen in previous crawls of the same spider, thus producing a "delta crawl" containing only new items.
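The core idea, very roughly, is sketched below: remember which pages already produced items, and skip them on later runs. This is only an illustration of the concept, not deltafetch's actual implementation; the class and file name here are made up.

import os
import pickle

class SeenRequests(object):
    # illustrative only: deltafetch itself fingerprints requests and
    # stores them in a local database, but the effect is similar
    def __init__(self, path="seen_requests.pkl"):  # made-up file name
        self.path = path
        if os.path.exists(path):
            with open(path, "rb") as f:
                self.seen = pickle.load(f)
        else:
            self.seen = set()

    def should_skip(self, url):
        # True if this page already yielded items in a previous crawl
        return url in self.seen

    def mark(self, url):
        self.seen.add(url)
        with open(self.path, "wb") as f:
            pickle.dump(self.seen, f)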

To use deltafetch, first install scrapylib:

pip install scrapylib

Then enable it in settings.py:

SPIDER_MIDDLEWARES = {
    'scrapylib.deltafetch.DeltaFetch': 100,
}

DELTAFETCH_ENABLED = True
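
With that in place, you run the spider exactly as before, for example:

scrapy crawl basic_spider -o articles.json

The first run scrapes everything; on subsequent runs the middleware skips requests for pages that already produced items, so only new articles end up in the output. (Check the scrapylib source if you ever need to reset the stored state.)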

This question on avoiding scraping data from pages that have already been scraped originally appeared on Stack Overflow: https://stackoverflow.com/questions/29396942/
