
python - Scrapy with multiple search terms


I am new to Python and have been learning how to scrape web pages for about a day. The task I am trying to accomplish is to loop over a list of 2000 companies and extract revenue data and the number of employees for each. I started with Scrapy and have managed to get the workflow working for a single company (not elegant, but at least I'm trying), but I cannot figure out how to load the list of companies and loop over it to perform multiple searches. I have a feeling this is a fairly simple thing to do.

So my main question is: where in the spider class should I define the array of company queries to loop over? I don't know the exact URLs, since each company has a unique ID and belongs to a specific market, so I cannot simply list them as start_urls.
Is Scrapy the right tool for this kind of task, or should I be using mechanize instead?

Here is my current code.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest
from scrapy.http import Request
from tutorial.items import DmozItem
import json

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["proff.se"]
    start_urls = ["http://www.proff.se"]

    # Search on the website. Currently a single search term is hard-coded here,
    # but I would like to loop over a list of companies.
    def parse(self, response):
        return FormRequest.from_response(response,
                                         formdata={'q': 'rebtel'},
                                         callback=self.search_result)

    # Fetch the URL from the search result and convert it to the financial page
    # URL where the information is located.
    def search_result(self, response):
        sel = HtmlXPathSelector(response)
        link = sel.xpath('//ul[@class="company-list two-columns"]/li/a/@href').extract()
        finance_url = str(link[0]).replace("/foretag", "http://www.proff.se/nyckeltal")
        return Request(finance_url, callback=self.parse_finance)

    # Scrape the information for this particular company. The indices are
    # hard-coded and will not work for other responses. I had some issues with
    # character encoding initially since the values are Swedish. I also tried to
    # target the JSON element directly with
    #   revenue = sel.xpath('//*[@id="accountTable1"]/tbody/tr[3]/@data-chart').extract()
    # but was not able to parse it (error: expected string or buffer). Converting
    # it with str() did not help; something was off with the formatting, which
    # messes up the data types.
    def parse_finance(self, response):
        sel = HtmlXPathSelector(response)
        datachart = sel.xpath('//tr/@data-chart').extract()
        employees = json.loads(datachart[36])
        revenue = json.loads(datachart[0])
        item = DmozItem()
        item['company'] = response.url.split("/")[-5]
        item['market'] = response.url.split("/")[-3]
        item['employees'] = employees
        item['revenue'] = revenue
        return item
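
For reference (this is not part of the original question): one way the loop itself could live inside the spider is to yield one FormRequest per company name from parse. The sketch below is only an illustration; the spider name, the placeholder company list and the dont_filter flag are my own assumptions, while the 'q' form field and the callbacks mirror the code above.

from scrapy.spider import BaseSpider
from scrapy.http import FormRequest

class CompanyListSpider(BaseSpider):
    # Hypothetical spider: in practice the 2000 company names could be loaded
    # from a file in __init__ instead of being hard-coded.
    name = "company_list"
    allowed_domains = ["proff.se"]
    start_urls = ["http://www.proff.se"]
    companies = ["rebtel", "spotify", "klarna"]  # placeholder names

    def parse(self, response):
        # Submit the search form once per company instead of just once.
        for company in self.companies:
            yield FormRequest.from_response(
                response,
                formdata={'q': company},      # same search field as above
                callback=self.search_result,
                dont_filter=True,             # the form URL repeats, so skip dedup
            )

    def search_result(self, response):
        # Same link extraction and follow-up requests as in the question.
        pass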

Best answer

The usual way to do this is with command-line arguments. Give the spider's __init__ method a query argument:

class ProffSpider(BaseSpider):
    name = "proff"
    ...

    def __init__(self, query):
        self.query = query

    def parse(self, response):
        return FormRequest.from_response(response,
            formdata={'q': self.query},
            callback=self.search_result
        )

    ...

Then start your spider (perhaps with Scrapyd):

$ scrapy crawl proff -a query="something"
$ scrapy crawl proff -a query="something else"

If you want to run a bunch of spiders at once by passing arguments from a file, you can create a new command that runs multiple instances of a spider. This just mixes the built-in crawl command with the example code for running multiple spiders with a single crawler process:

your_project/settings.py

COMMANDS_MODULE = 'your_project_module.commands'

your_project/commands/__init__.py

# empty file

your_project/commands/crawl_many.py

import os
import csv

from scrapy.commands import ScrapyCommand
from scrapy.utils.python import without_none_values
from scrapy.exceptions import UsageError


class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Run many instances of a spider'

    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)

        parser.add_option('-f', '--input-file', metavar='FILE', help='CSV file to load arguments from')
        parser.add_option('-o', '--output', metavar='FILE', help='dump scraped items into FILE (use - for stdout)')
        parser.add_option('-t', '--output-format', metavar='FORMAT', help='format to use for dumping items with -o')

    def process_options(self, args, opts):
        ScrapyCommand.process_options(self, args, opts)

        if not opts.output:
            return

        if opts.output == '-':
            self.settings.set('FEED_URI', 'stdout:', priority='cmdline')
        else:
            self.settings.set('FEED_URI', opts.output, priority='cmdline')

        feed_exporters = without_none_values(self.settings.getwithbase('FEED_EXPORTERS'))
        valid_output_formats = feed_exporters.keys()

        if not opts.output_format:
            opts.output_format = os.path.splitext(opts.output)[1].replace('.', '')

        if opts.output_format not in valid_output_formats:
            raise UsageError('Unrecognized output format "%s". Valid formats are: %s'
                             % (opts.output_format, tuple(valid_output_formats)))

        self.settings.set('FEED_FORMAT', opts.output_format, priority='cmdline')

    def run(self, args, opts):
        if args:
            raise UsageError()

        with open(opts.input_file, 'rb') as handle:
            for spider_options in csv.DictReader(handle):
                spider = spider_options.pop('spider')
                self.crawler_process.crawl(spider, **spider_options)

        self.crawler_process.start()

You can run it like this:

$ scrapy crawl_many -f crawl_options.csv -o output_file.jsonl

The format of the crawl options CSV is simple:

spider,query,arg2,arg3
proff,query1,value2,value3
proff,query2,foo,bar
proff,query3,baz,asd
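
One detail worth flagging (my reading of the crawl() call above, not something stated in the answer): every CSV column other than spider is passed to the spider as a keyword argument, so the spider has to accept all of those columns. A minimal sketch of an __init__ that tolerates extra columns such as arg2 and arg3, assuming the ProffSpider from earlier:

from scrapy.spider import BaseSpider

class ProffSpider(BaseSpider):
    name = "proff"

    def __init__(self, query, **kwargs):
        # Extra CSV columns arrive here as keyword arguments; forwarding them
        # to the base class stores them as spider attributes instead of
        # raising a TypeError.
        super(ProffSpider, self).__init__(**kwargs)
        self.query = query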

The original question, "python - Scrapy with multiple search terms", can be found on Stack Overflow: https://stackoverflow.com/questions/20938659/
