
python - Crawling external websites from within the main project using the Scrapy framework

Reposted · Author: 太空宇宙 · Updated: 2023-11-04 03:30:06

I have been looking for a better way to crawl external websites starting from another, primary source website. To explain what I am trying to do, let me use yelp.com as an example (although Yelp is not my actual target):

  1. I scrape the title and address of each listing.
  2. I follow the link behind the title to get the company's own website.
  3. I want to extract email addresses from the source of that main website. (I know this is hard, but I am not crawling every page; I assume most sites expose contact details at a predictable URL, e.g. site.com/contact.php.)
  4. The point is that while scraping the data from Yelp and storing it in item fields, I also want to fetch this external data from the company's own website.
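Step 3 above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: the regex is a rough heuristic (not an RFC-complete validator), and the candidate contact paths are guesses of the kind the question describes with site.com/contact.php:

```python
import re
from urllib.parse import urljoin

# Rough email pattern -- a heuristic, not an RFC 5322 validator.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

# Common contact-page paths to try on a company site (guesses, in the
# spirit of the question's site.com/contact.php assumption).
CONTACT_PATHS = ["contact.php", "contact.html", "contact", "about"]


def extract_emails(html):
    """Return the unique email-like strings found in a page's source."""
    return sorted(set(EMAIL_RE.findall(html)))


def contact_candidates(base_url):
    """Build the candidate contact-page URLs for a company site."""
    base = base_url.rstrip("/") + "/"
    return [urljoin(base, path) for path in CONTACT_PATHS]
```

Fetching each candidate URL and running `extract_emails` over the body covers the "assume a contact page exists" strategy without crawling the whole site.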

Below is my code; I don't know how to accomplish this with Scrapy.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule  # scrapy.contrib.spiders is deprecated
from scrapy.linkextractors import LinkExtractor
from comb.items import CombItem, SiteItem


class ComberSpider(CrawlSpider):
    name = "comber"
    allowed_domains = ["example.com"]
    query = 'shoe'
    page = 'http://www.example.com/corp/' + query + '/1.html'
    start_urls = (page,)
    rules = (
        Rule(LinkExtractor(allow=(r'corp/.+/\d+\.html',),
                           restrict_xpaths="//a[@class='next']"),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        for sel in response.xpath("//div[@class='item-main']"):
            item = CombItem()
            item['company_name'] = sel.xpath("h2[@class='title']/a/text()").extract()
            item['contact_url'] = sel.xpath("div[@class='company']/a/@href").extract()[0]
            item['gold_supplier'] = sel.xpath("div[@class='item-title']/a/@title").extract()[0]
            company_details = sel.xpath("div[@class='attrs']/div[@class='attr']"
                                        "/span['name']/text()").extract()

            item = self.parse_meta(sel, item, company_details)
            request = scrapy.Request(item['contact_url'], callback=self.parse_site)
            request.meta['item'] = item
            yield request

    def parse_meta(self, sel, item, company_details):
        if company_details:
            if "Products:" in company_details:
                item['products'] = sel.xpath("./div[@class='value']//text()").extract()
            if "Country/Region:" in company_details:
                item['country'] = sel.xpath("./div[@class='right']"
                                            "/span[@data-coun]/text()").extract()
            if "Revenue:" in company_details:
                item['revenue'] = sel.xpath("./div[@class='right']"
                                            "/span[@data-reve]/text()").extract()
            if "Markets:" in company_details:
                item['markets'] = sel.xpath("./div[@class='value']"
                                            "/span[@data-mark]/text()").extract()
        return item

    def parse_site(self, response):
        item = response.meta['item']
        # item['websites'] would be e.g. http://target-company.com, http://any-other-website.com
        # My aim is to jump to http://company.com, scrape data from its contact page, and
        # store it on the item, e.g. item['emails'] = ['info@company.com', 'sales@company.com'].
        # How can this be done within this same project? The only thing I can think of is to
        # store item['websites'] and the other item values and build another project -- and even
        # then it would not work, because of allowed_domains and start_urls.

        item['websites'] = response.xpath("//div[@class='company-contact-information']"
                                          "/table/tr/td/a/@href").extract()
        print(item)
        print('*' * 50)
        yield item


"""
# items.py

from scrapy.item import Item, Field


class CombItem(Item):
    company_name = Field()
    main_products = Field()
    contact_url = Field()
    revenue = Field()
    gold_supplier = Field()
    country = Field()
    markets = Field()
    Product_Home = Field()
    websites = Field()
"""
# emails = Field()  -- not implemented, because emails must be extracted from external
# websites, which are different from the start_urls

Best Answer

When you issue a Request, pass dont_filter=True: this bypasses the OffsiteMiddleware, so the URL is not filtered against allowed_domains. From the Scrapy documentation:

If the request has the dont_filter attribute set, the offsite middleware will allow the request even if its domain is not listed in allowed domains.

On "python - Crawling external websites from within the main project using the Scrapy framework", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/31393322/
