
ruby - Web scraping with the Kimurai gem

Reposted. Author: 行者123. Updated: 2023-12-04 01:34:48

I'm using the Kimurai Ruby gem for web scraping. I have this script, which works well:

require 'kimurai'

class SimpleSpider < Kimurai::Base
  @name = "simple_spider"
  @engine = :selenium_chrome
  @start_urls = ["https://apply.workable.com/taxjar/"]

  def parse(response, url:, data: {})
    # Update response to current response after interaction with a browser
    count = 0
    # browser.click_button "Show more"
    doc = browser.current_response
    returned_jobs = doc.css('.careers-jobs-list-styles__jobsList--3_v12')
    returned_jobs.css('li').each do |char_element|
      # puts char_element
      title = char_element.css('a')[0]['aria-label']
      link = "https://apply.workable.com" + char_element.css('a')[0]['href']

      # Click on the job link and get the description
      browser.visit(link)
      job_page = browser.current_response
      description = job_page.xpath('/html/body/div[1]/div/div[1]/div[2]/div[2]/div[2]').text

      puts '*******'
      puts title
      puts link
      puts description
      puts count += 1
    end
    puts "There are #{count} jobs total"
  end
end

SimpleSpider.crawl!

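However, I want all of this to return an array of objects, in this case jobs. I'd like to create a jobs array in the parse method, do something like jobs << [title, link, description, company] inside the returned_jobs loop, and have it returned when I call SimpleSpider.crawl!, but that doesn't work.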

Any help is appreciated.

Best answer

You can modify your code slightly, like this:

class SimpleSpider < Kimurai::Base
  @name = "simple_spider"
  @engine = :selenium_chrome
  @start_urls = ["https://apply.workable.com/taxjar/"]

  def parse(response, url:, data: {})
    # Update response to current response after interaction with a browser
    # browser.click_button "Show more"
    doc = browser.current_response
    returned_jobs = doc.css('.careers-jobs-list-styles__jobsList--3_v12')

    # Collect one [title, link, description] entry per job
    jobs = []
    returned_jobs.css('li').each do |char_element|
      # puts char_element
      title = char_element.css('a')[0]['aria-label']
      link = "https://apply.workable.com" + char_element.css('a')[0]['href']

      # Click on the job link and get the description
      browser.visit(link)
      job_page = browser.current_response
      description = job_page.xpath('/html/body/div[1]/div/div[1]/div[2]/div[2]/div[2]').text

      jobs << [title, link, description]
    end

    puts "There are #{jobs.count} jobs total"
    puts jobs
  end
end

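I'm not sure about company, since I don't see that variable in your code. Still, the code above shows the idea of building the array inside the loop and then working with it.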
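The same idea can be sketched without a browser, which makes it easier to see how the array is built and handed back as the method's return value. This is only an illustration under stated assumptions: the Job struct, the collect_jobs helper, and the sample rows (including their href values) are hypothetical stand-ins for the scraped `li` elements and are not part of Kimurai or of the original code.

```ruby
# Hypothetical value object for one scraped job; the field names mirror the
# local variables used in the spider above.
Job = Struct.new(:title, :link, :description, keyword_init: true)

BASE_URL = "https://apply.workable.com"

# Made-up sample data standing in for the scraped <li> elements, so the
# sketch runs without Kimurai, Nokogiri, or a browser.
SAMPLE_ROWS = [
  { label: "Backend Engineer", href: "/taxjar/j/ABC123/" },
  { label: "Support Specialist", href: "/taxjar/j/DEF456/" }
].freeze

# Builds one Job per row and returns the resulting array as the method's value.
def collect_jobs(rows)
  rows.map do |row|
    Job.new(
      title: row[:label],
      link: BASE_URL + row[:href],
      description: "(description would be read from the job page)"
    )
  end
end

jobs = collect_jobs(SAMPLE_ROWS)
puts "There are #{jobs.count} jobs total"
jobs.each { |job| puts "#{job.title} -> #{job.link}" }
```

Using a Struct instead of a nested array keeps each job's fields together; `puts` on an array of arrays prints every inner element on its own line, which makes the records hard to tell apart. To get the array out of the spider itself, parse would need to end with `jobs` as its last expression. As far as I recall from the Kimurai README, the class method `SimpleSpider.parse!(:parse, url: ...)` returns the handler's return value, whereas `crawl!` returns run statistics; treat that as an assumption and check it against the version you have installed.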

Here is part of the output from running it in the terminal:

[Screenshot: terminal output]

I also have a blog post on how to use the Kimurai framework in a Ruby on Rails application.

Regarding "ruby - Web scraping with the Kimurai gem", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59956507/
