- android - 多次调用 OnPrimaryClipChangedListener
- android - 无法更新 RecyclerView 中的 TextView 字段
- android.database.CursorIndexOutOfBoundsException : Index 0 requested, 光标大小为 0
- android - 使用 AppCompat 时,我们是否需要明确指定其 UI 组件(Spinner、EditText)颜色
我知道我已经问过类似的问题,但它是一个新的蜘蛛,我也有同样的问题( Crawling data successfully but cannot scraped or write it into csv )...我将我的另一个蜘蛛放在这里,并提供了我应该拥有的输出示例和所有信息我通常需要获取输出文件...有人可以帮助我吗?我必须在周五完成这只蜘蛛...所以,我很着急!!
奇怪的是,我的 Fnac.csv 已创建但始终为空...所以我尝试直接在我想要抓取的页面示例上运行我的蜘蛛,并且我拥有我需要的所有信息...所以,我不明白...也许问题只是来 self 的规则
或其他什么?
我的蜘蛛:
# -*- coding: utf-8 -*-
# Every import is done for a specific use
import scrapy # Once you downloaded scrapy, you have to import it in your code to use it.
import re # To use the .re() function, which extracts just a part of the text you crawl. It's using regex (regular expressions)
import numbers # To use mathematics things, in this case : numbers.
from fnac.items import FnacItem # To return the items you want. Each item has a space allocated in the momery, created in the items.py file, which is in the second cdiscount_test directory.
from urllib.request import urlopen # To use urlopen, which allow the spider to find the links in a page that is in the actual page.
from scrapy.spiders import CrawlSpider, Rule # To use rules and LinkExtractor, which allowed the spider to follow every url on the page you crawl.
from scrapy.linkextractors import LinkExtractor # Look above.
from bs4 import BeautifulSoup # To crawl an iframe, which is a page in a page in web prgrammation.
# Your spider
class Fnac(CrawlSpider):
name = 'FnacCom' # Name of your spider. You call it in the anaconda prompt.
allowed_domains = ['fnac.com'] # Web domains allowed by you, your spider cannot enter on a page which is not in that domain.
start_urls = ['https://www.fnac.com/Index-Vendeurs-MarketPlace/A/'] # The first link you crawl.
# To allow your spider to follow the urls that are on the actual page.
rules = (
Rule(LinkExtractor(), callback='parse_start_url'),
)
# Your function that crawl the actual page you're on.
def parse_start_url(self, response):
item = FnacItem() # The spider now knowws that the items you want have to be stored in the item variable.
# First data you want which are on the actual page.
nb_sales = response.xpath('//body//table[@summary="données détaillée du vendeur"]/tbody/tr/td/span/text()').re(r'([\d]*) ventes')
country = response.xpath('//body//table[@summary="données détaillée du vendeur"]/tbody/tr/td/text()').re(r'([A-Z].*)')
# To store the data in their right places.
item['nb_sales'] = ''.join(nb_sales).strip()
item['country'] = ''.join(country).strip()
# Find a specific link on the actual page and launch this function on it. It's the place where you will find your two first data.
test_list = response.xpath('//a/@href')
for test_list in response.xpath('.//div[@class="ProductPriceBox-item detail"]'):
temporary = response.xpath('//div[@class="ProductPriceBox-item detail"]/div/a/@href').extract()
for i in range(len(temporary)):
scrapy.Request(temporary[i], callback=self.parse_start_url, meta={'dont_redirect': True, 'item': item})
# To find the iframe on a page, launch the next function.
yield scrapy.Request(response.url, callback=self.parse_iframe, meta={'dont_redirect': True, 'item': item})
# Your function that crawl the iframe on a page
def parse_iframe(self, response):
f_item1 = response.meta['item'] # Just to use the same item location you used above.
# Find all the iframe on a page.
soup = BeautifulSoup(urlopen(response.url), "lxml")
iframexx = soup.find_all('iframe')
# If there's at least one iframe, launch the next function on it
if (len(iframexx) != 0):
for iframe in iframexx:
yield scrapy.Request(iframe.attrs['src'], callback=self.extract_or_loop, meta={'dont_redirect': True, 'item': f_item1})
# If there's no iframe, launch the next function on the link of the page where you looked after the potential iframe.
else:
yield scrapy.Request(response.url, callback=self.extract_or_loop, meta={'dont_redirect': True, 'item': f_item1})
# Function to find the other data.
def extract_or_loop(self, response):
f_item2 = response.meta['item'] # Just to use the same item location you used above.
# The rest of the data you want.
address = response.xpath('//body//div/p/text()').re(r'.*Adresse \: (.*)\n?.*')
email = response.xpath('//body//div/ul/li[contains(text(),"@")]/text()').extract()
name = response.xpath('//body//div/p[@class="customer-policy-label"]/text()').re(r'Infos sur la boutique \: ([a-zA-Z0-9]*\s*)')
phone = response.xpath('//body//div/p/text()').re(r'.*Tél \: ([\d]*)\n?.*')
siret = response.xpath('//body//div/p/text()').re(r'.*Siret \: ([\d]*)\n?.*')
vat = response.xpath('//body//div/text()').re(r'.*TVA \: (.*)')
# If the name of the seller exist, then return the data.
if (len(name) != 0):
f_item2['name'] = ''.join(name).strip()
f_item2['address'] = ''.join(address).strip()
f_item2['phone'] = ''.join(phone).strip()
f_item2['email'] = ''.join(email).strip()
f_item2['vat'] = ''.join(vat).strip()
f_item2['siret'] = ''.join(siret).strip()
yield f_item2
# If not, there was no data on the page and you have to find all the links on your page and launch the first function on them.
else:
for sel in response.xpath('//html/body'):
list_urls = sel.xpath('//a/@href').extract()
list_iframe = response.xpath('//div[@class="ProductPriceBox-item detail"]/div/a/@href').extract()
if (len(list_iframe) != 0):
for list_iframe in list_urls:
yield scrapy.Request(list_iframe, callback=self.parse_start_url, meta={'dont_redirect': True})
for url in list_urls:
yield scrapy.Request(response.urljoin(url), callback=self.parse_start_url, meta={'dont_redirect': True})
我的设置:
BOT_NAME = 'fnac'
SPIDER_MODULES = ['fnac.spiders']
NEWSPIDER_MODULE = 'fnac.spiders'
DOWNLOAD_DELAY = 2
COOKIES_ENABLED = False
ITEM_PIPELINES = {
'fnac.pipelines.FnacPipeline': 300,
}
我的管道:
# -*- coding: utf-8 -*-
from scrapy import signals
from scrapy.exporters import CsvItemExporter
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
# Define your output file.
class FnacPipeline(CsvItemExporter):
def __init__(self):
self.files = {}
@classmethod
def from_crawler(cls, crawler):
pipeline = cls()
crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
return pipeline
def spider_opened(self, spider):
f = open('..\\..\\..\\..\\Fnac.csv', 'w').close()
file = open('..\\..\\..\\..\\Fnac.csv', 'wb')
self.files[spider] = file
self.exporter = CsvItemExporter(file)
self.exporter.start_exporting()
def spider_closed(self, spider):
self.exporter.finish_exporting()
file = self.files.pop(spider)
file.close()
def process_item(self, item, spider):
self.exporter.export_item(item)
return item
我的元素:
# -*- coding: utf-8 -*-
import scrapy
# Define here the models for your scraped items
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
class FnacItem(scrapy.Item):
# define the fields for your items :
# name = scrapy.Field()
name = scrapy.Field()
nb_sales = scrapy.Field()
country = scrapy.Field()
address = scrapy.Field()
siret = scrapy.Field()
vat = scrapy.Field()
phone = scrapy.Field()
email = scrapy.Field()
我在提示符中编写的运行蜘蛛的命令是:
scrapy爬取FnacCom
输出示例如下:
2017-08-08 10:21:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/TV-Panasonic/TV-par-marque/nsh474980/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:21:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/TV-Philips/TV-par-marque/nsh474981/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:21:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/TV-Sony/TV-par-marque/nsh475001/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/TV-LG/TV-par-marque/nsh474979/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/TV-Samsung/TV-par-marque/nsh474984/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/TV-Television/TV-par-marque/shi474972/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/TV-Television/TV-par-prix/shi474946/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/TV-Television/TV-par-taille-d-ecran/shi474945/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/TV-Television/TV-par-Technologie/shi474944/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/Smart-TV-TV-connectee/TV-par-Technologie/nsh474953/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/TV-QLED/TV-par-Technologie/nsh474948/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/TV-4K-UHD/TV-par-Technologie/nsh474947/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/Toutes-les-TV/TV-Television/nsh474940/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:26 [scrapy.extensions.logstats] INFO: Crawled 459 pages (at 24 pages/min), scraped 0 items (at 0 items/min)
2017-08-08 10:22:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/TV-Television/shi474914/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/partner/canalplus#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/Meilleures-ventes-TV/TV-Television/nsh474942/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/Toutes-nos-Offres/Offres-de-remboursement/shi159784/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/Offres-Adherents/Toutes-nos-Offres/nsh81745/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/labofnac#bl=MMtvh#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/Lecteur-et-Enregistreur-DVD-Blu-Ray/Lecteur-DVD-Blu-Ray/shi475063/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/TV-OLED/TV-par-Technologie/nsh474949/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/Lecteur-DVD-Portable/Lecteur-et-Enregistreur-DVD-Blu-Ray/nsh475064/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/Home-Cinema/Home-Cinema-par-marque/shi475116/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/Univers-TV/Univers-Ecran-plat/cl179/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/Casque-TV-HiFi/Casque-par-usage/nsh450507/w-4#bl=MMtvh> (referer: https://www.fnac.com)
非常感谢您的帮助!!!
最佳答案
我编写了一个小代码重构来展示如何在不使用crawlspider和使用常见的scrapy习惯用法的情况下显式地编写spider:
class Fnac(Spider):
name = 'fnac.com'
allowed_domains = ['fnac.com']
start_urls = ['https://www.fnac.com/Index-Vendeurs-MarketPlace/0/'] # The first link you crawl.
def parse(self, response):
# parse sellers
sellers = response.xpath("//h1[contains(selftext(),'MarketPlace')]/following-sibling::ul/li/a/@href").extract()
for url in sellers:
yield Request(url, callback=self.parse_seller)
# parse other pages A-Z
pages = response.css('.pagerletter a::attr(href)').extract()
for url in pages:
yield Request(url, callback=self.parse)
def parse_seller(self, response):
nb_sales = response.xpath('//body//table[@summary="données détaillée du vendeur"]/tbody/tr/td/span/text()').re(r'([\d]*) ventes')
country = response.xpath('//body//table[@summary="données détaillée du vendeur"]/tbody/tr/td/text()').re(r'([A-Z].*)')
item = FnacItem()
# To store the data in their right places.
item['nb_sales'] = ''.join(nb_sales).strip()
item['country'] = ''.join(country).strip()
# go to details page now
details_url = response.xpath("//iframe/@src[contains(.,'retour')]").extract_first()
yield Request(details_url, self.parse_seller_details,
meta={'item': item}) # carry over our item to next response
def parse_seller_details(self, response):
item = response.meta['item'] # get item that's got filled in `parse_seller`
address = response.xpath('//body//div/p/text()').re(r'.*Adresse \: (.*)\n?.*')
email = response.xpath('//body//div/ul/li[contains(text(),"@")]/text()').extract()
# parse here
yield item
关于python - 爬行时清空输出文件,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/45558429/
SQLite、Content provider 和 Shared Preference 之间的所有已知区别。 但我想知道什么时候需要根据情况使用 SQLite 或 Content Provider 或
警告:我正在使用一个我无法完全控制的后端,所以我正在努力解决 Backbone 中的一些注意事项,这些注意事项可能在其他地方更好地解决......不幸的是,我别无选择,只能在这里处理它们! 所以,我的
我一整天都在挣扎。我的预输入搜索表达式与远程 json 数据完美配合。但是当我尝试使用相同的 json 数据作为预取数据时,建议为空。点击第一个标志后,我收到预定义消息“无法找到任何内容...”,结果
我正在制作一个模拟 NHL 选秀彩票的程序,其中屏幕右侧应该有一个 JTextField,并且在左侧绘制弹跳的选秀球。我创建了一个名为 Ball 的类,它实现了 Runnable,并在我的主 Draf
这个问题已经有答案了: How can I calculate a time span in Java and format the output? (18 个回答) 已关闭 9 年前。 这是我的代码
我有一个 ASP.NET Web API 应用程序在我的本地 IIS 实例上运行。 Web 应用程序配置有 CORS。我调用的 Web API 方法类似于: [POST("/API/{foo}/{ba
我将用户输入的时间和日期作为: DatePicker dp = (DatePicker) findViewById(R.id.datePicker); TimePicker tp = (TimePic
放宽“邻居”的标准是否足够,或者是否有其他标准行动可以采取? 最佳答案 如果所有相邻解决方案都是 Tabu,则听起来您的 Tabu 列表的大小太长或您的释放策略太严格。一个好的 Tabu 列表长度是
我正在阅读来自 cppreference 的代码示例: #include #include #include #include template void print_queue(T& q)
我快疯了,我试图理解工具提示的行为,但没有成功。 1. 第一个问题是当我尝试通过插件(按钮 1)在点击事件中使用它时 -> 如果您转到 Fiddle,您会在“内容”内看到该函数' 每次点击都会调用该属
我在功能组件中有以下代码: const [ folder, setFolder ] = useState([]); const folderData = useContext(FolderContex
我在使用预签名网址和 AFNetworking 3.0 从 S3 获取图像时遇到问题。我可以使用 NSMutableURLRequest 和 NSURLSession 获取图像,但是当我使用 AFHT
我正在使用 Oracle ojdbc 12 和 Java 8 处理 Oracle UCP 管理器的问题。当 UCP 池启动失败时,我希望关闭它创建的连接。 当池初始化期间遇到 ORA-02391:超过
关闭。此题需要details or clarity 。目前不接受答案。 想要改进这个问题吗?通过 editing this post 添加详细信息并澄清问题. 已关闭 9 年前。 Improve
引用这个plunker: https://plnkr.co/edit/GWsbdDWVvBYNMqyxzlLY?p=preview 我在 styles.css 文件和 src/app.ts 文件中指定
为什么我的条形这么细?我尝试将宽度设置为 1,它们变得非常厚。我不知道还能尝试什么。默认厚度为 0.8,这是应该的样子吗? import matplotlib.pyplot as plt import
当我编写时,查询按预期执行: SELECT id, day2.count - day1.count AS diff FROM day1 NATURAL JOIN day2; 但我真正想要的是右连接。当
我有以下时间数据: 0 08/01/16 13:07:46,335437 1 18/02/16 08:40:40,565575 2 14/01/16 22:2
一些背景知识 -我的 NodeJS 服务器在端口 3001 上运行,我的 React 应用程序在端口 3000 上运行。我在 React 应用程序 package.json 中设置了一个代理来代理对端
我面临着一个愚蠢的问题。我试图在我的 Angular 应用程序中延迟加载我的图像,我已经尝试过这个2: 但是他们都设置了 src attr 而不是 data-src,我在这里遗漏了什么吗?保留 d
我是一名优秀的程序员,十分优秀!