python - Scrapy: download installers from softpedia.com


At the moment I can crawl an endless stream of links from softpedia.com, including the installer links I want, e.g. http://hotdownloads.com/trialware/download/Download_a1keylogger.zip?item=33649-3&affiliate=22260.

spider.py looks like this:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    """Crawl through the web sites you specify."""
    name = "softpedia"

    # Stay within these domains when crawling
    allowed_domains = ["www.softpedia.com"]

    start_urls = [
        "http://win.softpedia.com/",
    ]

    download_delay = 2

    # Follow every found link; note that no callback is attached here
    rules = [
        Rule(SgmlLinkExtractor(), follow=True)
    ]

items.py, pipelines.py and settings.py are the defaults, except that one line has been added to settings.py:

FILES_STORE = '/home/test/softpedia/downloads'
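For FILES_STORE to have any effect, the files pipeline also has to be enabled in settings.py. A minimal sketch, assuming the scrapy.contrib-era pipeline path that matches the imports used in this question (the exact path depends on the Scrapy version):

ITEM_PIPELINES = {'scrapy.contrib.pipeline.files.FilesPipeline': 1}
FILES_STORE = '/home/test/softpedia/downloads'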

With urllib2 I can tell whether a link points to an installer; in this case I get "application" in content_type:

>>> import urllib2
>>> url = 'http://hotdownloads.com/trialware/download/Download_a1keylogger.zip?item=33649-3&affiliate=22260'
>>> response = urllib2.urlopen(url)
>>> content_type = response.info().get('Content-Type')
>>> print content_type
application/zip
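Downloading the whole body just to read one header is wasteful; an HTTP HEAD request returns the headers alone. A small helper along those lines (a sketch of my own: HeadRequest and is_installer are made-up names, and note that some servers mishandle HEAD requests):

import urllib2

class HeadRequest(urllib2.Request):
    """A urllib2 Request that issues HTTP HEAD instead of GET."""
    def get_method(self):
        return "HEAD"

def is_installer(url):
    # Treat any application/* Content-Type as a downloadable file.
    try:
        response = urllib2.urlopen(HeadRequest(url), timeout=10)
    except urllib2.URLError:
        return False
    content_type = response.info().get('Content-Type', '')
    return content_type.startswith('application')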

My question is: how do I collect the installer links I want and download them to my target folder? Thanks in advance!

PS:

I have now found two approaches, but I can't get either of them to work:

1. https://stackoverflow.com/a/7169241/2092480 — following this answer, I added the code below to the spider:

def parse_installer(self, response):
    # extract all links from the page (requires: from scrapy.http import Request)
    lx = SgmlLinkExtractor()
    urls = lx.extract_links(response)
    for url in urls:
        # extract_links() returns Link objects, so pass url.url, not url
        yield Request(url.url, callback=self.save_installer)

def save_installer(self, response):
    # get_path comes from the linked answer and must be defined on the spider
    path = self.get_path(response.url)
    with open(path, "wb") as f:  # or using wget
        f.write(response.body)

The spider just carried on as if this code didn't exist at all, and no files were downloaded. Can anyone see what's going wrong?
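One likely reason the spider acts as if these methods don't exist: the Rule in the spider above never names a callback, so CrawlSpider follows links but never calls parse_installer. A sketch of the missing wiring, reusing the method names from the snippet above (hedged, untested):

rules = [
    # hand every crawled page to parse_installer as well as following its links
    Rule(SgmlLinkExtractor(), callback='parse_installer', follow=True),
]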

2. https://groups.google.com/forum/print/msg/scrapy-users/kzGHFjXywuY/O6PIhoT3thsJ — this approach works on its own when I feed it predefined links in ["file_urls"], but how do I set Scrapy up to collect all the installer links into ["file_urls"]? Besides, I'd have thought the first approach should be enough for such a simple task.

Best Answer

I combined the two approaches mentioned above to collect the actual/mirror installer download links, and then used the files download pipeline for the actual download. However, it doesn't seem to work when the download URL is dynamic/complex, e.g. http://www.softpedia.com/dyn-postdownload.php?p=00000&t=0&i=1, but it does work for simpler links such as http://www.ietf.org/rfc/rfc2616.txt

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from myscraper.items import SoftpediaItem

class SoftpediaSpider(CrawlSpider):
    name = "sosoftpedia"
    allowed_domains = ["www.softpedia.com"]
    start_urls = ['http://www.softpedia.com/get/Antivirus/']
    rules = (
        Rule(SgmlLinkExtractor(allow=('/get/',),
                               allow_domains=("www.softpedia.com",),
                               restrict_xpaths=("//td[@class='padding_tlr15px']",)),
             callback='parse_links', follow=True),
    )

    def parse_start_url(self, response):
        return self.parse_links(response)

    def parse_links(self, response):
        print "PRODUCT DOWNLOAD PAGE: " + response.url
        hxs = HtmlXPathSelector(response)
        urls = hxs.select("//a[contains(@itemprop, 'downloadURL')]/@href").extract()
        for url in urls:
            item = SoftpediaItem()
            request = Request(url=url, callback=self.parse_downloaddetail)
            request.meta['item'] = item
            yield request

    def parse_downloaddetail(self, response):
        item = response.meta['item']
        hxs = HtmlXPathSelector(response)
        item["file_urls"] = hxs.select('//p[@class="fontsize16"]/b/a/@href').extract()  # e.g. ["http://www.ietf.org/rfc/rfc2616.txt"]
        print "ACTUAL DOWNLOAD LINKS " + hxs.select('//p[@class="fontsize16"]/b/a/@href').extract()[0]
        yield item
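For completeness, the SoftpediaItem imported above needs the two fields the files pipeline looks for; a minimal sketch of myscraper/items.py (the field names are the pipeline's convention, the class body itself is my assumption):

from scrapy.item import Item, Field

class SoftpediaItem(Item):
    file_urls = Field()  # filled by the spider with download links
    files = Field()      # filled in by the FilesPipeline after downloading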

A similar question about "python - Scrapy: download installers from softpedia.com" can be found on Stack Overflow: https://stackoverflow.com/questions/19774912/
