
xpath - XPath: How do I get the text from this tag and the next tag?

Reposted. Author: 行者123. Updated: 2023-12-03 16:07:05

I have HTML like this:

<h1>Hello1</h1>
<p>World1</p>
<h1>Hello2</h1>
<p>World2</p>
<h1>Hello2</h1>
<p>World2</p>


So I need to get Hello1 together with World1, Hello2 together with World2, and so on.

Update: I am using the Ruby Mechanize library.

Best answer

The Ruby library Mechanize uses the Nokogiri parsing library, so you can call Nokogiri directly. One possible solution might look like this:

require 'mechanize'
require 'pp'

html = "<h1>Hello1</h1>
<p>World1</p>
<h1>Hello2</h1>
<p>World2</p>
<h1>Hello2</h1>
<p>World2</p>"

results = []

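# For each <h1>, take the text of the first <p> sibling that follows it.
Nokogiri::HTML(html).xpath("//h1").each do |header|
  # Named `paragraph` rather than `p` to avoid shadowing Kernel#p.
  paragraph = header.xpath("following-sibling::p[1]").text
  results << [header.text, paragraph]
end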

pp results


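Edit:
This example was tested with Mechanize v2.0.1, which uses Nokogiri ~> v1.4. I also tested it directly against Nokogiri v1.5.0.

Edit 2:
This example answers a follow-up question to the original solution: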

require 'nokogiri'
require 'pp'

html = <<HTML
<h1>
<p>
<font size="4">
<b>abide by (something)</b>
</font>
</p>
</h1>
<p>
<font size="3">- to follow the rules of something</font>
</p>
The cleaning staff must abide by the rules of the school.
<br>
<h1>
<p>
<font size="4">
<b>able to breathe easily again</b>
</font>
</p>
</h1>
<p>
My friend was able to breathe easily again when his company did not go bankrupt.
<br>
HTML

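doc = Nokogiri::HTML(html)

results = []

# Reuse the parsed document instead of parsing the HTML a second time.
doc.xpath("//h1").each do |header|
  # [1] limits the match to the nearest following <p>; without it,
  # every later <p> sibling would also match.
  title = header.xpath("following-sibling::p[1]/font/b").text
  results << title
end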

pp results


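An <h1> tag that contains nested <p> elements is invalid HTML, so Nokogiri corrects the markup during parsing. Getting at the previously nested elements is then very similar to the original solution.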

Regarding "xpath - XPath: How do I get the text from this tag and the next tag?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/6902998/
