
ruby - Extracting text between HTML tags with Nokogiri

Reposted. Author: 数据小太阳. Updated: 2023-10-29 07:23:19

I have HTML like this:

<h1> Header is here</h1>
<h2>Header 2 is here</h2>
<p> Extract me!</p>
<p> Extract me too!</p>
<h2> Next Header 2</h2>
<p>not interested</p>
<p>not interested</p>
<h2>Header 2 is here</h2>
<p> Extract me!</p>
<p> Extract me too!</p>

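I have a basic Nokogiri CSS node search that returns the <p> content, but I can't find an example of how to target all the text between the Nth closing H2 and the next opening H2. I'm building a CSV from the output, so I'd also like to read in a list of files and put the URL in the first column of each result.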

Best answer

require 'rubygems'
require 'nokogiri'

h = '<h1> Header is here</h1>
<h2>Header 2 is here</h2>
<p> Extract me!</p>
<p> Extract me too!</p>
<h2> Next Header 2</h2>
<p>not interested</p>
<p>not interested</p>
<h2>Header 2 is here</h2>
<p> Extract me!</p>
<p> Extract me too!</p>
'

doc = Nokogiri::HTML(h)

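# Specify which sections to extract, as ranges over the delimiter counter i below.
# i is incremented at every delimiter tag, so 2...3 selects the content that
# follows the 2nd delimiter and precedes the 3rd.
# Triple dot excludes the end point: 1...2 means 1 and not 2.
EXTRACT_RANGES = [
  2...3,
  4...5
]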

# Tags which count as delimiters, not to be extracted
DELIMITER_TAGS = [
  "h1",
  "h2"
]

extracted_text = []

# Number of delimiter tags seen so far
i = 0
# Change /"html"/"body" to the correct path of the tag which contains this list
(doc/"html"/"body").children.each do |el|
  if DELIMITER_TAGS.include?(el.name)
    i += 1
  elsif EXTRACT_RANGES.any? { |cur_range| cur_range.include?(i) }
    # Skip whitespace-only text nodes between the tags
    s = el.inner_text.strip
    extracted_text << s unless s.empty?
  end
end

# Print out extracted text (each element's inner text is separated by newlines)
puts extracted_text.join("\n")

Regarding "ruby - Extracting text between HTML tags with Nokogiri", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/7812500/
