java - Extracting content from HTML represented as a String

Reposted. Author: 行者123. Updated: 2023-12-02 08:11:34

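I have a large block of HTML in a String variable and I want to get the content of one div. I can't rely on a regular expression because the div can contain nested divs. So, suppose I have the following string: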

String test = "<div><div id=\"mainContent\">foo bar<div>good best better</div>  <div>test test</div></div><div>foo bar</div></div>";

How can I get the following from it with a simple Java program?

<div id="mainContent">foo bar<div>good best better</div>  <div>test test</div></div>

Well, my approach so far looks like this (probably horrible; I am still trying to fix it):

public static void main(String[] args) {
    int count = 1;
    int fl = 0;
    String s = "<div><div id=\"mainContent\">foo bar<div>good best better</div> <div>test test</div></div><div>foo bar</div></div>";
    String tmp = s;
    int len = s.length();
    for (int i = 0; i < len; i++) {
        int st = s.indexOf("div>");
        if (st > -1) {
            char c = s.charAt(st - 1);
            if (c == '/') {
                count--;
            } else {
                count++;
            }
            s = s.substring(st + 4);
            System.out.println(s);
            i = i + st;
            System.out.println(c + " -- " + st + " -- " + count + " -- " + i);
            if (count == 0) {
                fl = i;
                break;
            }
        }
    }
    System.out.println("final ind - " + fl);
    s = tmp.substring(0, fl + 4);
    System.out.println("final String - " + s);
}

Best answer

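I suggest using JSoup to parse the HTML and find what you are looking for.

It certainly covers a simple requirement like this one. A few lines of code will do what you want!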

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.

jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.

scrape and parse HTML from a URL, file, or string

find and extract data, using DOM traversal or CSS selectors

jsoup is designed to deal with all varieties of HTML found in the wild; from pristine and validating, to invalid tag-soup; jsoup will create a sensible parse tree.

The selector syntax makes finding and extracting data extremely simple.

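import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class ExtractDiv
{
    public static void main(final String[] args)
    {
        final String s = "<div><div id=\"mainContent\">foo bar<div>good best better</div> <div>test test</div></div><div>foo bar</div></div>";
        // Parse the string into a DOM, then select the element by its id.
        final Document d = Jsoup.parse(s);
        final Elements e = d.select("#mainContent");
        System.out.println(e.get(0));
    }
}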

Output

<div id="mainContent">
foo bar
<div>
good best better
</div>
<div>
test test
</div>
</div>

It doesn't get much simpler than that!
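
If adding a library is not an option, the depth-counting idea from the question can be made to work for well-formed input such as the sample string. The following is only a minimal sketch under stated assumptions, not a replacement for a real parser: the class name `DivExtractor` and the method `extractById` are hypothetical, the target element is assumed to be a `<div>` with a double-quoted `id` attribute, and the input is assumed to contain no comments, scripts, or other tag names that start with `div`. The main difference from the attempt in the question is that it searches for `<div` and `</div>` separately, so an opening tag with attributes is counted too.

```java
final class DivExtractor {

    /**
     * Returns the outer HTML of the div whose id attribute equals {@code id},
     * or null if it is not found or the divs are not balanced.
     * Hypothetical helper: assumes well-formed input and double-quoted attributes.
     */
    static String extractById(String html, String id) {
        // Find the id attribute, then walk back to the "<div" that owns it.
        int attr = html.indexOf("id=\"" + id + "\"");
        if (attr < 0) {
            return null;
        }
        int start = html.lastIndexOf("<div", attr);
        // If a '>' sits between "<div" and the attribute, the id belongs to another tag.
        if (start < 0 || html.indexOf('>', start) < attr) {
            return null;
        }

        int depth = 0;
        int pos = start;
        while (pos < html.length()) {
            int open = html.indexOf("<div", pos);
            int close = html.indexOf("</div>", pos);
            if (close < 0) {
                return null; // no closing tag left: unbalanced input
            }
            if (open >= 0 && open < close) {
                depth++;               // an opening tag comes first
                pos = open + "<div".length();
            } else {
                depth--;               // a closing tag comes first
                pos = close + "</div>".length();
                if (depth == 0) {
                    return html.substring(start, pos);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String s = "<div><div id=\"mainContent\">foo bar<div>good best better</div> <div>test test</div></div><div>foo bar</div></div>";
        // Prints the mainContent div, including its nested divs, on one line.
        System.out.println(extractById(s, "mainContent"));
    }
}
```

Unlike the JSoup output above, this returns the matched substring unchanged, so whitespace and line layout stay exactly as they were in the input. It will break on malformed markup, which is the case JSoup is designed to handle.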

Regarding "java - Extracting content from HTML represented as a String", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/7324633/
