
java - Jsoup - Parsing an HTML file with charset iso-8859-1

Reposted. Author: 行者123. Updated: 2023-11-30 11:22:04

I am having trouble with special characters when charset = iso-8859-1. The same code I use here works with UTF-8, so I don't understand what I am doing wrong.

Here is the code:

File input = new File("/users/marcioapf/example.html");
Document doc = Jsoup.parse(input, "iso-8859-1", "");
Elements elements = doc.select("span.DEPUTADO");
System.out.println(elements.toString());

This is the output:

<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Jo&atilde;ozinho Pereira</span> 
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Isnaldo Bulh&otilde;es</span>
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Antonio Albuquerque</span>
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Jeferson Morais</span>
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">In&aacute;cio Loiola</span>

It should look like this:

<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Joãozinho Pereira</span> 
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Isnaldo Bulhões</span>
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Antonio Albuquerque</span>
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Jeferson Morais</span>
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Inácio Loiola</span>

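How can I fix this?

Best answer

Setting the escape mode to EscapeMode.xhtml gives you output without named entities such as &atilde;, because that mode escapes only the minimal XML set (lt, gt, amp, quot). Try this code: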

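import org.jsoup.nodes.Entities.EscapeMode;

File input = new File("/users/marcioapf/example.html");
Document doc = Jsoup.parse(input, "iso-8859-1", "");
// Escape only the minimal XML entities so accented characters are printed as-is.
doc.outputSettings().escapeMode(EscapeMode.xhtml);
Elements elements = doc.select("span.DEPUTADO");
System.out.println(elements.toString());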
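
To try the escape-mode setting without the original file, here is a minimal self-contained sketch. The class name EscapeModeDemo and the one-line sample markup are hypothetical stand-ins for example.html, and the output charset is set explicitly because the input is parsed from a String rather than from an iso-8859-1 file. It also shows text(), which returns the decoded text content and may be all you need if you only want the names rather than the markup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities.EscapeMode;

public class EscapeModeDemo {
    // Hypothetical sample markup standing in for example.html.
    static final String HTML =
        "<span class=\"DEPUTADO\">Jo&atilde;ozinho Pereira</span>";

    // Serializes the first matching span using the XHTML escape mode.
    static String xhtmlOuterHtml() {
        Document doc = Jsoup.parse(HTML);
        // Assumption: mirror the question's setup by using iso-8859-1 for output.
        doc.outputSettings().charset("ISO-8859-1").escapeMode(EscapeMode.xhtml);
        return doc.select("span.DEPUTADO").first().outerHtml();
    }

    // Returns the decoded text content; text() does not entity-escape its result.
    static String plainText() {
        return Jsoup.parse(HTML).select("span.DEPUTADO").first().text();
    }

    public static void main(String[] args) {
        System.out.println(xhtmlOuterHtml());
        System.out.println(plainText());
    }
}
```

Whether the default escape mode emits &atilde; for a character the output charset can encode appears to depend on the jsoup version, so the sketch only exercises the xhtml mode and text().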

Regarding "java - Jsoup - Parsing an HTML file with charset iso-8859-1", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/21974758/
