html - How to match consecutive nodes with Nokogiri?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 23:31:03

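I need to use Nokogiri with CSS or XPath selectors to match text from the HTML below. Each match should start at the <div> tag with class="propsBar" and end at the closing end of the <div> tag with class="oddsInfoBottom". It should identify every occurrence of this pattern: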

<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
BUY POINTS
-->
<input id="events[X2036-907-Yes-No-081414]" type="hidden" value="X2036-907-Yes-No-081414^No^Yes^Nationals (S Strasburg) @ Met…l there be a score in the 1st Inning?^8/14/2014^7:10 PM^2036" name="events[X2036-907-Yes-No-081414]"></input>
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
BUY POINTS
-->
<input id="events[X2036-915-Yes-No-081414]" type="hidden" value="X2036-915-Yes-No-081414^No^Yes^Astros (S Feldman) @ Red Sox …l there be a score in the 1st Inning?^8/14/2014^7:10 PM^2036" name="events[X2036-915-Yes-No-081414]"></input>
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
BUY POINTS
-->
<input id="events[X2036-917-Yes-No-081414]" type="hidden" value="X2036-917-Yes-No-081414^No^Yes^Rays (J Odorizzi) @ Rangers (…l there be a score in the 1st Inning?^8/14/2014^8:05 PM^2036" name="events[X2036-917-Yes-No-081414]"></input>
<div class="timeBar"></div>

The HTML above should return three matches.

So far, the only way I have been able to do this is:

one = html.xpath("//div[@class='propsBar']")
two = html.xpath("//div[@class='oddsInfoTop']")
three = html.xpath("//div[@class='oddsInfoBottom']")

one.zip(two, three).flatten.each_slice(3).map(&:join)

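The downside is that this returns only text, not Nokogiri elements. I also think parsing this way is dangerous: if the page contains different numbers of elements matching one, two, and three, it will break.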

Best answer

I would write it like this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
BUY POINTS
-->
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
BUY POINTS
-->
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
BUY POINTS
-->
<div class="timeBar"></div>
EOT

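found_nodes = doc.search('div.propsBar').map{ |node|
  nodes = [node]
  # Walk the sibling chain, collecting every node up to and
  # including the closing oddsInfoBottom div.
  loop do
    node = node.next_sibling
    nodes << node
    break if node['class'] == 'oddsInfoBottom'
  end
  nodes
}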

(Note that I removed the <input> tags because they only clutter the input HTML. When you supply input data, strip out all the noise.)

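Running it returns the found nodes as an array of arrays. Each sub-array contains the individual nodes found while walking the sibling chain in order: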

require 'pp'
pp found_nodes
# >> [[#(Element:0x3ff00a4936a0 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a037c28 { name = "class", value = "propsBar" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a49363c {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a03629c { name = "class", value = "oddsInfoTop" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a4935b0 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a4668f8 { name = "class", value = "oddsInfoBottom" })]
# >> })],
# >> [#(Element:0x3ff00a49354c {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a45c808 { name = "class", value = "propsBar" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a4934e8 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a45b084 { name = "class", value = "oddsInfoTop" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a49345c {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a8710ec { name = "class", value = "oddsInfoBottom" })]
# >> })],
# >> [#(Element:0x3ff00a4933f8 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a4979d0 { name = "class", value = "propsBar" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a493394 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a47e188 { name = "class", value = "oddsInfoTop" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a493308 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a458f00 { name = "class", value = "oddsInfoBottom" })]
# >> })]]

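Remember that, after parsing, a document is a linked list of nodes. If the original XML or HTML has line-ends between the tags, there will be a Text node containing at least a line-end ("\n"). Because it is a list, we can move forward and backward using next_sibling and previous_sibling, respectively. That makes it really easy to grab little chunks, even when the content you want is not wrapped in a single block tag.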

If you want the return value to resemble the output of the search, css, or xpath methods, the inner variable nodes needs to change from an Array to a NodeSet:

found_nodes = doc.search('div.propsBar').map{ |node|
  # Collect into a NodeSet so the result supports to_html, css, xpath, etc.
  nodes = Nokogiri::XML::NodeSet.new(doc, [node])
  loop do
    node = node.next_sibling
    nodes << node
    break if node['class'] == 'oddsInfoBottom'
  end
  nodes
}

require 'pp'
pp found_nodes.map(&:to_html)

Running it results in:

# >> ["<div class=\"propsBar\"></div>\n<div class=\"oddsInfoTop\"></div>\n<div class=\"oddsInfoBottom\"></div>",
# >> "<div class=\"propsBar\"></div>\n<div class=\"oddsInfoTop\"></div>\n<div class=\"oddsInfoBottom\"></div>",
# >> "<div class=\"propsBar\"></div>\n<div class=\"oddsInfoTop\"></div>\n<div class=\"oddsInfoBottom\"></div>"]

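Finally, note that I am using CSS selectors instead of XPath. I prefer them because they are usually more readable and concise. XPath is more powerful and, because it was designed for dissecting XML, it can often do all the heavy lifting that we would otherwise have to do in Ruby after a CSS selector only gets us close to what we want. Use whichever gets the job done for you, while weighing which is easier to read and maintain.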

Regarding "html - How to match consecutive nodes with Nokogiri?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/25317136/
