java - Remove all HTML with Jsoup but keep the lines

Reposted. Author: 行者123. Updated: 2023-11-28 02:00:54

I have a String that contains part of the content of an email, and I want to remove all HTML markup from this String.

This is my code so far:

public static String html2text(String html) {
    // Parse, then keep only the tags allowed by the basic whitelist
    Document document = Jsoup.parse(html);
    document = new Cleaner(Whitelist.basic()).clean(document);
    document.outputSettings().escapeMode(EscapeMode.xhtml);
    document.outputSettings().charset("UTF-8");
    html = document.body().html();

    html = html.replaceAll("<br />", "");

    // Keep only the text from the salutation onwards
    String[] splittedStr = html.split("Geachte heer/mevrouw,");
    html = splittedStr[1];
    html = "Geachte heer/mevrouw," + html;

    return html;
}

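This method removes all the HTML and keeps the lines and most of the layout. However, it also returns some &nbsp; entities that are not removed completely. In the output below you can see that the String still contains some of them, and even fragments of them. How can I get rid of these?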

  Loonheffingen       &amp;n= bsp; Naam
nr         in administratie         &amp;nbs= p;           meldingen
 nummer

1          &amp;n= bsp;            = ;     0            &amp;= nbsp;           &amp;nbs= p;           1
     123456789L01

Edit:

<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">De afgekeurde meldingen zijn opgenomen in de bijlage: Afgekeurde meldingen.</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">

<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">Wilt u zo spoedig mogelijk zorgdragen dat deze</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">meldingen gecorrigeerd worden aangeleverd?</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">mer</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">Volg &nbsp; &nbsp; Aantal verwerkt &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Aantal afgekeurde</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">&nbsp;Loonheffingen &nbsp; &nbsp; &nbsp; &nbsp; Naam</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">nr &nbsp; &nbsp; &nbsp; &nbsp; in administratie &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; meldingen</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">&nbsp;nummer</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif"><span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">1 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;0 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;1</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">

This is part of the HTML I want to parse. I want to remove all the HTML but keep the layout of the original email.

Any help is appreciated,

Thanks!

Solved

Document xmlDoc = Jsoup.parse(file, "", Parser.xmlParser());
Elements spans = xmlDoc.select("span");

for (Element link : spans) {
    String html = textPlus(link);
    System.out.println(html);
}


public static String textPlus(Element elem) {
    List<TextNode> textNodes = elem.textNodes();
    if (textNodes.isEmpty()) {
        return "";
    }

    StringBuilder result = new StringBuilder();
    // start at the first text node
    Node currentNode = textNodes.get(0);
    while (currentNode != null) {
        // append deep text of all subsequent nodes
        if (currentNode instanceof TextNode) {
            TextNode currentText = (TextNode) currentNode;
            result.append(currentText.text());
        } else if (currentNode instanceof Element) {
            Element currentElement = (Element) currentNode;
            result.append(currentElement.text());
        }
        currentNode = currentNode.nextSibling();
    }
    return result.toString();
}

The code was provided as an answer to this question.

Best answer

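Rather than doing this, you need to walk the HTML structure that JSoup returns and collect the text nodes. That way you let JSoup decide what is actual text, and entity decoding is handled for you (e.g. &amp; -> & etc.).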

See this SO question for more information.
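
The traversal described in the answer could look roughly like the sketch below. It is not the answerer's code: the class name HtmlToText and the method toPlainText are made up for illustration, and the choices to turn each <br> into a line break, to replace non-breaking spaces with ordinary spaces, and to drop line breaks that appear in the HTML source itself are assumptions about what "keeping the layout" should mean here. It needs the jsoup library on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

// Hypothetical helper class, not part of jsoup.
class HtmlToText {

    // Walks the parsed tree and collects text nodes, emitting '\n' for each <br>.
    static String toPlainText(String html) {
        Document doc = Jsoup.parse(html);
        final StringBuilder out = new StringBuilder();

        doc.body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // getWholeText() returns the text with entities already decoded,
                    // so &nbsp; arrives as U+00A0 and &amp; as '&'.
                    String text = ((TextNode) node).getWholeText();
                    // Assumption: source line breaks are not layout, only <br> is.
                    text = text.replace("\r", "").replace("\n", "");
                    // Assumption: non-breaking spaces become plain spaces.
                    out.append(text.replace('\u00A0', ' '));
                } else if ("br".equals(node.nodeName())) {
                    out.append('\n');
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<span>Volg &nbsp; &nbsp; Aantal verwerkt</span><br>"
                    + "<span>&nbsp;Loonheffingen</span><br>";
        System.out.println(toPlainText(html));
    }
}
```

Because the text is read from the parsed nodes instead of from document.body().html(), nothing is re-escaped on the way out, so no string replacement of leftover entities is needed afterwards.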

Regarding java - removing all HTML with Jsoup but keeping the lines, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/13988452/
