
java - How do I get orphaned text with Jsoup?

Reposted. Author: 行者123. Updated: 2023-11-30 06:55:30

I have this HTML:

<span>This is the first text</span>
More text here
Another line of text
<span>Text in the span</span>
<span>Another text in span</span>
This is another line

I want to get all of the text in document order, for example as this array:

[
"Span:This is the first text",
"More text here",
"Another line of text",
"Span:Text in the span",
"Span:Another text in span",
"This is another line",
]

Best answer

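I would use a recursive method that takes a starting element and iterates over its child nodes. For each TextNode, print its content. For each Element, recurse into its child nodes.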

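import java.io.File;
import java.io.IOException;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class PrintText
{
    public static void main(String[] args) throws IOException
    {
        // I put your HTML in the body tag in a local file
        Document doc = Jsoup.parse(new File("input/20160505.html"), "UTF-8");
        Elements elements = doc.getElementsByTag("body");
        Element rootTag = elements.get(0);
        printTextOfTag(rootTag);
    }

    // Prints the text of every TextNode under currentTag, in document order.
    public static void printTextOfTag(Element currentTag)
    {
        List<Node> nodes = currentTag.childNodes();
        for (Node n : nodes)
        {
            if (n instanceof TextNode)
            {
                // text() normalizes whitespace, so line breaks become spaces
                System.out.println(((TextNode) n).text());
            }
            else if (n instanceof Element)
            {
                // Descend into the element to reach its own text nodes
                printTextOfTag((Element) n);
            }
        }
    }
}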

Output

This is the first text

More text here Another line of text

Text in the span



Another text in span

This is another line
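
The output above differs from the array in the question in three ways: whitespace-only text nodes print as blank lines, two orphaned lines are merged into one because text() collapses line breaks, and span text has no "Span:" prefix. The following is a minimal sketch of one way to close those gaps, not part of the original answer. The class name OrphanText and the method collect are made up for illustration, and it assumes that each orphaned line sits on its own source line and that spans are not nested.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class OrphanText {

    // Hypothetical helper: walks the children of parent in document order
    // and appends one entry per span and one per non-blank orphaned line.
    static void collect(Element parent, List<String> out) {
        for (Node n : parent.childNodes()) {
            if (n instanceof TextNode) {
                // getWholeText() keeps the original line breaks, unlike text()
                String whole = ((TextNode) n).getWholeText();
                for (String line : whole.split("\\R")) {
                    String trimmed = line.trim();
                    if (!trimmed.isEmpty()) {
                        out.add(trimmed);
                    }
                }
            } else if (n instanceof Element) {
                Element e = (Element) n;
                if (e.tagName().equals("span")) {
                    // Assumption: spans are leaves, so text() is the whole entry
                    out.add("Span:" + e.text());
                } else {
                    // Any other element: keep descending
                    collect(e, out);
                }
            }
        }
    }

    public static void main(String[] args) {
        String html = "<span>This is the first text</span>\n"
                + "More text here\n"
                + "Another line of text\n"
                + "<span>Text in the span</span>\n"
                + "<span>Another text in span</span>\n"
                + "This is another line";
        List<String> result = new ArrayList<>();
        collect(Jsoup.parse(html).body(), result);
        result.forEach(System.out::println);
    }
}
```

Run against the HTML from the question, this should print the six entries in the order the question lists them. If the orphaned lines are separated by `<br>` tags rather than source line breaks, the split on line breaks would need to be replaced by handling the br elements instead.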

Regarding "java - How do I get orphaned text with Jsoup?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/41915562/
