
java - Strange encoding behaviour of umlauts in Jsoup

Reposted. Author: 行者123. Updated: 2023-11-30 08:01:00

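I have some questions about the encoding behaviour of the Jsoup library.

I want to parse the content of a web page, so I have to insert some people's names, which can contain German umlauts such as ä, ö, and so on.

This is the code I use: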

doc = Jsoup.parse(new URL(searchURL).openStream(), "UTF-8", searchURL);

to parse the HTML of the respective web page.

But when I look at the document, the ä is displayed like this:

K�se

What am I doing wrong with the encoding?

The web page has the following header:

<!doctype html>
<html>
<head lang="en">
<title>Käse - Semantic Scholar</title>
<meta charset="utf-8">
</html>

Can anyone help? Thanks :)
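
Garbled umlauts of this kind usually mean that the bytes were decoded with a different charset than the one they were written in. The following is a minimal sketch of that effect using only the JDK; the class and method names (`MojibakeDemo`, `reinterpret`) are made up for illustration and do not come from the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encodes the text with one charset and decodes the resulting bytes with
    // another, which mimics reading a page with the wrong encoding.
    static String reinterpret(String text, Charset actual, Charset assumed) {
        return new String(text.getBytes(actual), assumed);
    }

    public static void main(String[] args) {
        String word = "K\u00e4se"; // "Käse", escaped so the source file encoding does not matter

        // ISO-8859-1 bytes read as UTF-8: the single byte for ä is malformed UTF-8
        System.out.println(reinterpret(word, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));

        // UTF-8 bytes read as ISO-8859-1: the two bytes for ä become two characters
        System.out.println(reinterpret(word, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```

In this sketch, decoding ISO-8859-1 bytes as UTF-8 turns the ä into the replacement character U+FFFD, while the reverse mix-up yields `KÃ¤se`. Which of the two a page shows can hint at the direction of the mismatch.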

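Edit: I tried Stephan's answer and it works for the web page www.semanticscholar.org, but I am also parsing another web page, http://www.authormapper.com/

The same code does not work for this web page when the author's name contains German umlauts. Does anyone know why this does not work? It is rather embarrassing not to know this...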
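
The thread does not establish why the second site behaves differently. One thing that may be worth ruling out, assuming the name is placed into the query string of `searchURL`, is whether the umlauts are percent-encoded before the request is sent. The sketch below is only an illustration: `SearchUrlBuilder`, `build`, and the `https://example.org/search?q=` prefix are hypothetical and say nothing about the real query format of either site.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SearchUrlBuilder {
    // Appends a percent-encoded name to a URL prefix (hypothetical helper).
    static String build(String prefix, String name) {
        try {
            // URLEncoder applies form encoding: non-ASCII characters become
            // percent-escaped UTF-8 bytes and a space becomes '+'.
            return prefix + URLEncoder.encode(name, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed example prefix, not a real endpoint of the sites in question
        System.out.println(build("https://example.org/search?q=", "K\u00e4se"));
    }
}
```

Which charset a server expects behind the percent escapes is site-specific, so UTF-8 here is an assumption rather than a known requirement of the page mentioned above.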

Best answer

This is a known issue with Jsoup. Here are two options for loading content for Jsoup:

Option 1: JDK only

// Imports: java.io.*, java.net.HttpURLConnection, java.net.URL,
//          org.jsoup.Jsoup, org.jsoup.nodes.Document

InputStream is = null;

try {
    // Connect to website
    URL tmp = new URL(searchURL);
    HttpURLConnection connection = (HttpURLConnection) tmp.openConnection();
    connection.setReadTimeout(10000);
    connection.setConnectTimeout(10000);
    connection.setRequestMethod("GET");
    connection.connect();

    // Load content for Jsoup
    is = connection.getInputStream(); // We suppose connection.getResponseCode() == 200

    // Decode the bytes explicitly as UTF-8 instead of letting Jsoup guess
    int n;
    char[] buffer = new char[4096];
    Reader r = new InputStreamReader(is, "UTF-8");
    Writer w = new StringWriter(); // java.io.StringWriter keeps this option JDK-only
    while (-1 != (n = r.read(buffer))) {
        w.write(buffer, 0, n);
    }

    // Parse html
    String html = w.toString();
    Document doc = Jsoup.parse(html, searchURL);
} catch (IOException e) {
    // Handle exception ...
} finally {
    try {
        if (is != null) {
            is.close();
        }
    } catch (final IOException ioe) {
        // ignore
    }
}

Option 2: with Commons IO

// Imports: java.io.IOException, java.io.InputStream, java.net.HttpURLConnection,
//          java.net.URL, org.apache.commons.io.IOUtils,
//          org.jsoup.Jsoup, org.jsoup.nodes.Document

InputStream is = null;

try {
    // Connect to website
    URL tmp = new URL(searchURL);
    HttpURLConnection connection = (HttpURLConnection) tmp.openConnection();
    connection.setReadTimeout(10000);
    connection.setConnectTimeout(10000);
    connection.setRequestMethod("GET");
    connection.connect();

    // Load content for Jsoup, decoding the bytes explicitly as UTF-8
    is = connection.getInputStream(); // We suppose connection.getResponseCode() == 200
    String html = IOUtils.toString(is, "UTF-8");

    // Parse html
    Document doc = Jsoup.parse(html, searchURL);
} catch (IOException e) {
    // Handle exception ...
} finally {
    IOUtils.closeQuietly(is);
}

Final thoughts:

- Never rely on a website's declared encoding unless you have manually checked (when possible) the encoding actually in use.
- Never rely on Jsoup to somehow find the right encoding.
- You can [automate encoding guessing][2]. See the previous link for details.
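
As a rough illustration of the last point, one simple heuristic is to read the charset from the HTTP `Content-Type` header and fall back to a default when none is given. This is a sketch under stated assumptions, not necessarily the approach the linked article describes: `CharsetFromHeader` and `fromContentType` are hypothetical names, and the header itself can be wrong, so the first point above still applies.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetFromHeader {
    // Matches charset=..., optionally quoted, anywhere in a Content-Type value
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the header value, or the fallback when the
    // header is missing, has no charset parameter, or names an unknown charset.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(fromContentType("text/html", StandardCharsets.UTF_8));
    }
}
```

With either option above, the result could be passed to the reader in place of the hard-coded name, for example `new InputStreamReader(is, CharsetFromHeader.fromContentType(connection.getContentType(), StandardCharsets.UTF_8))`.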

About java - Strange encoding behaviour of umlauts in Jsoup: a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38041415/
