java - Cleaning HTML attributes with Java

Reposted. Author: 行者123. Updated: 2023-12-01 09:50:59

I was given a school assignment to remove everything inside HTML tags except for certain attributes such as class, id, alt, src, name, and href.

For example, given this HTML file:

<div class="wrapper">
<h1 value="something" class=header>Header</h1>
<div id="article1" class="article" name="something" >
<img clsas="mistake" src="picture.jpg" id="pict1" class="image_article" alt="picture" />
<p class="article_text" >Lorem ipsum dolor sit amet, consectetur adipiscing. </p>
<a href="article.html" title="More">Více</a>
</div>

The result should look like this:

<div class="wrapper">
<h1 class=header>Header</h1>
<div id="article1" class="article" >
<img src="picture.jpg" id="pict1" class="image_article" alt="picture" />
<p class="article_text" >Lorem ipsum dolor sit amet, consectetur adipiscing. </p>
<a href="article.html">Více</a>
</div>

I tried something like this:

String opr = html.replaceAll("<([a-zA-Z]+)[^<>]*(class|id)(=\".+?\")[^<]*(class|id)(=\".+?\")[^<]*>", "<$1 $2$3 $4$5 >");

But it only works for HTML tags that have both a class and an id attribute. Can anyone help?
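
To make the limitation concrete, here is a small sketch that applies the same pattern to two tags from the example. The class name `RegexLimitDemo` and the helper `clean` are hypothetical names introduced only for this illustration. The pattern requires two `class`/`id` attributes with quoted values, so a tag with none of them is left untouched.

```java
public class RegexLimitDemo {
    // The pattern from the question, unchanged.
    static final String PATTERN =
        "<([a-zA-Z]+)[^<>]*(class|id)(=\".+?\")[^<]*(class|id)(=\".+?\")[^<]*>";

    // Hypothetical helper: applies the question's replacement to a piece of HTML.
    static String clean(String html) {
        return html.replaceAll(PATTERN, "<$1 $2$3 $4$5 >");
    }

    public static void main(String[] args) {
        String both = "<div id=\"article1\" class=\"article\" name=\"something\" >";
        String neither = "<a href=\"article.html\" title=\"More\">";

        // Two quoted class/id attributes: the tag is rewritten.
        System.out.println(clean(both));    // <div id="article1" class="article" >
        // No class or id: the pattern does not match, so title survives.
        System.out.println(clean(neither)); // <a href="article.html" title="More">
    }
}
```

The same happens for `<h1 value="something" class=header>`: the unquoted `class=header` does not satisfy `=\".+?\"`, so the tag is not matched at all.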

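Best answer

Avoid regular expressions for this kind of task: getting them right is very complicated, and the result is hard to maintain. Use an HTML parser such as Jsoup instead, and clean each element by removing every attribute you do not want, like this: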

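import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CleanAttributes {
    public static void main(String[] args) {
        Document doc = Jsoup.parse("<html>\n" +
            " <head></head>\n" +
            " <body>\n" +
            "<table><div class=\"wrapper\">\n" +
            "<h1 value=\"something\" class=header>Header</h1>\n" +
            "<div id=\"article1\" class=\"article\" name=\"something\" >\n" +
            "<img clsas=\"mistake\" src=\"picture.jpg\" id=\"pict1\" class=\"image_article\" alt=\"picture\" />\n" +
            "<p class=\"article_text\" >Lorem ipsum dolor sit amet, consectetur adipiscing. </p>\n" +
            "<a href=\"article.html\" title=\"More\">Více</a>\n" +
            "</div></body></html>"
        );
        // Visit every element of the parsed document.
        for (Element element : doc.getAllElements()) {
            // Iterate over a snapshot (asList() returns a copy), so removing
            // attributes inside the loop cannot disturb the iteration.
            for (Attribute attribute : element.attributes().asList()) {
                switch (attribute.getKey()) {
                    // Attributes to keep.
                    case "class":
                    case "id":
                    case "alt":
                    case "src":
                    case "name":
                    case "href":
                        break;
                    // Everything else is removed.
                    default:
                        element.removeAttr(attribute.getKey());
                }
            }
        }
        System.out.println(doc);
    }
}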

Output:

<html>
<head></head>
<body>
<div class="wrapper">
<h1 class="header">Header</h1>
<div id="article1" class="article" name="something">
<img src="picture.jpg" id="pict1" class="image_article" alt="picture">
<p class="article_text">Lorem ipsum dolor sit amet, consectetur adipiscing. </p>
<a href="article.html">Více</a>
</div>
</div>
<table></table>
</body>
</html>
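
The switch in the loop above is a whitelist check. As a parser-independent sketch of that same decision, the following models one element's attributes as an ordered map and keeps only the allowed keys. The class `AttributeWhitelist` and the method `keepAllowed` are hypothetical names, not part of Jsoup, and the map is only a stand-in for a real element's attributes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class AttributeWhitelist {
    // The attribute names kept by the answer's switch statement.
    static final Set<String> ALLOWED = new HashSet<>(
        Arrays.asList("class", "id", "alt", "src", "name", "href"));

    // Hypothetical helper: returns a copy of attrs containing only whitelisted
    // keys, in their original order. The input map is not modified.
    static Map<String, String> keepAllowed(Map<String, String> attrs) {
        Map<String, String> kept = new LinkedHashMap<>(attrs);
        kept.keySet().retainAll(ALLOWED);
        return kept;
    }

    public static void main(String[] args) {
        // Attributes of the <img> tag from the example, including the misspelled one.
        Map<String, String> img = new LinkedHashMap<>();
        img.put("clsas", "mistake");
        img.put("src", "picture.jpg");
        img.put("id", "pict1");
        img.put("class", "image_article");
        img.put("alt", "picture");

        System.out.println(keepAllowed(img));
        // {src=picture.jpg, id=pict1, class=image_article, alt=picture}
    }
}
```

Keeping the allowed names in a set makes the list easy to change in one place, for example dropping name if the stricter result shown in the question is wanted.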

Regarding "java - Cleaning HTML attributes with Java", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/37595095/
