ruby - Scraping track data from HTML?

Reposted. Author: 数据小太阳. Updated: 2023-10-29 08:39:12

I'd like to be able to scrape data from a tracklist page on 1001tracklists. An example URL is:

http://www.1001tracklists.com/tracklist/25122_lange-intercity-podcast-115-2013-03-06.html

Here is an example of how the data is displayed on the page:

Above & Beyond - Black Room Boy (Above & Beyond Club Mix) [ANJUNABEATS]

I'd like to extract all of the songs from this page in the following format:

$byArtist - $name [$publisher]

Looking at the page's HTML, what I'm after appears to be stored as HTML5 meta microdata:

<td class="" id="tlptr_433662">
<a name="tlp_433662"></a>
<div itemprop="tracks" itemscope itemtype="http://schema.org/MusicRecording" id="tlp5_content">
<meta itemprop="byArtist" content="Above &amp; Beyond">
<meta itemprop="name" content="Black Room Boy (Above &amp; Beyond Club Mix)">
<meta itemprop="publisher" content="ANJUNABEATS">
<meta itemprop="url" content="/track/103905_above-beyond-black-room-boy-above-beyond-club-mix/index.html">
<span class="tracklistTrack floatL"id="tr_103905" ><a href="/track/103905_above-beyond-black-room-boy-above-beyond-club-mix/index.html" class="">Above &amp; Beyond - Black Room Boy (Above &amp; Beyond Club Mix)</a>&thinsp;</span><span class="floatL">[<a href="/label/1037_anjunabeats/index.html" title="Anjunabeats">ANJUNABEATS</a>]</span>
<div id="tlp5_actions" class="floatL" style="margin-top:1px;">

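Each track is marked with an identifier such as "tlp_433662" (the name of the <a> anchor). Every song on the page has its own unique ID: one has "tlp_433662", the next has "tlp_433628" or something similar.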

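Is there a way to extract all the songs listed on the tracklist page using Nokogiri and XPath? I'd probably want to do an "each" over the "data" below, so that the scraper loops through and extracts each set of relevant data. Here is the start of my Ruby program: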

require 'rubygems'
require 'nokogiri'
require 'open-uri'

url = "http://www.1001tracklists.com/tracklist/25122_lange-intercity-podcast-115-2013-03-06.html"
data = Nokogiri::HTML(open(url))
# What to do next? Print out the XPath loop code that extracts my data.
# Code block I need help with:
data.xpath.........each do |block|
  block.xpath("...........").each do |span|
    puts stuff printing out what I want.
  end
end

My end goal, which I know how to implement, is to take this Ruby script to Sinatra to "webify" the data and add some nice Twitter Bootstrap CSS, as shown in the following YouTube video: http://www.youtube.com/watch?v=PWI1PIvy4A8

Can you help me with the XPath code block so that I can scrape the data and print out an array?

Best answer

Here's some code that gathers the information into an array of hashes.

I prefer CSS accessors over XPath because they are more readable if you have any HTML/CSS or jQuery experience.

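require 'nokogiri'
require 'open-uri'
require 'pp'

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.
doc = Nokogiri::HTML(URI.open('http://www.1001tracklists.com/tracklist/25122_lange-intercity-podcast-115-2013-03-06.html'))

# Build one hash per MusicRecording block.
data = doc.search('tr.tlpItem div[itemtype="http://schema.org/MusicRecording"]').each_with_object([]) do |div, array|
  # Copy every <meta itemprop="..." content="..."> pair into the hash.
  hash = div.search('meta').each_with_object({}) do |m, h|
    h[m['itemprop']] = m['content']
  end

  link = div.at('span a')
  hash['tracklistTrack'] = [link['href'], link.text]

  # 'span.floatL a' matches the track link first too, so this duplicates tracklistTrack.
  title = div.at('span.floatL a')
  hash['title'] = [title['href'], title.text]

  array << hash
end

pp data[0, 2]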

This outputs a subset of the page's data. After a little massaging, the structure looks like this:

[
  {
    "byArtist"=>"Markus Schulz",
    "name"=>"The Spiritual Gateway (Transmission 2013 Theme)",
    "publisher"=>"COLDHARBOUR RECORDINGS",
    "url"=>"/track/108928_markus-schulz-the-spiritual-gateway-transmission-2013-theme/index.html",
    "tracklistTrack"=>[
      "/track/108928_markus-schulz-the-spiritual-gateway-transmission-2013-theme/index.html",
      "Markus Schulz - The Spiritual Gateway (Transmission 2013 Theme)"
    ],
    "title"=>[
      "/track/108928_markus-schulz-the-spiritual-gateway-transmission-2013-theme/index.html",
      "Markus Schulz - The Spiritual Gateway (Transmission 2013 Theme)"
    ]
  },
  {
    "byArtist"=>"Lange & Audrey Gallagher",
    "name"=>"Our Way Home (Noah Neiman Remix)",
    "publisher"=>"LANGE RECORDINGS",
    "url"=>"/track/119667_lange-audrey-gallagher-our-way-home-noah-neiman-remix/index.html",
    "tracklistTrack"=>[
      "/track/119667_lange-audrey-gallagher-our-way-home-noah-neiman-remix/index.html",
      "Lange & Audrey Gallagher - Our Way Home (Noah Neiman Remix)"
    ],
    "title"=>[
      "/track/119667_lange-audrey-gallagher-our-way-home-noah-neiman-remix/index.html",
      "Lange & Audrey Gallagher - Our Way Home (Noah Neiman Remix)"
    ]
  }
]
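To get from these hashes to the "$byArtist - $name [$publisher]" lines the question asked for, something like the following sketch should work. The sample hashes are copied from the output above (only the three keys needed), and format_track is a hypothetical helper name, not part of the original code. Treating the publisher as optional is an assumption, since the excerpt does not show whether every track has a label.

```ruby
# Sample hashes copied from the output above, trimmed to the keys used here.
tracks = [
  { "byArtist"  => "Markus Schulz",
    "name"      => "The Spiritual Gateway (Transmission 2013 Theme)",
    "publisher" => "COLDHARBOUR RECORDINGS" },
  { "byArtist"  => "Lange & Audrey Gallagher",
    "name"      => "Our Way Home (Noah Neiman Remix)",
    "publisher" => "LANGE RECORDINGS" }
]

# Hypothetical helper: builds "artist - name [publisher]" from one hash.
def format_track(track)
  line = "#{track['byArtist']} - #{track['name']}"
  # Assumption: a track may lack a publisher, so the bracket is optional.
  publisher = track['publisher']
  publisher.nil? || publisher.empty? ? line : "#{line} [#{publisher}]"
end

lines = tracks.map { |t| format_track(t) }
puts lines
```

With the scraped data, the same call would be data.map { |t| format_track(t) }.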

Regarding "ruby - Scraping track data from HTML?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/15262997/
