
java - Foreign characters in Android & Java

Reposted. Author: 行者123. Updated: 2023-11-29 06:23:22

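I am trying to download and parse a web page that contains foreign (Chinese) characters. I am not sure whether I should be using "utf-8" or something else, but nothing I have tried works. I used the sample Wikipedia code for getUrlContent().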

public void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.main);
    mText = (TextView) findViewById(R.id.textview1);
    huaren.prepareUserAgent(this);
    String test = new String("fail");

    try {
        test = getUrlContent("http://huaren.us/");
    } catch (ApiException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    byte[] b = new byte[100000];

    try {
        b = test.getBytes("utf-8");
    } catch (UnsupportedEncodingException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

    char[] charArr = (new String(b)).toCharArray();
    CharSequence seq = java.nio.CharBuffer.wrap(charArr);

    mText.setText(charArr, 0, 1000);//.setText(seq);
}

protected static synchronized String getUrlContent(String url) throws ApiException {
    if (sUserAgent == null) {
        throw new ApiException("User-Agent string must be prepared");
    }

    // Create client and set our specific user-agent string
    HttpClient client = new DefaultHttpClient();
    HttpGet request = new HttpGet(url);
    request.setHeader("User-Agent", sUserAgent);

    try {
        HttpResponse response = client.execute(request);

        // Check if server response is valid
        StatusLine status = response.getStatusLine();
        if (status.getStatusCode() != HTTP_STATUS_OK) {
            throw new ApiException("Invalid response from server: " +
                    status.toString());
        }

        // Pull content stream from response
        HttpEntity entity = response.getEntity();
        InputStream inputStream = entity.getContent();

        ByteArrayOutputStream content = new ByteArrayOutputStream();

        // Read response into a buffered stream
        int readBytes = 0;
        while ((readBytes = inputStream.read(sBuffer)) != -1) {
            content.write(sBuffer, 0, readBytes);
        }

        // Return result from buffered stream
        return new String(content.toByteArray(), "utf-8");
    } catch (IOException e) {
        throw new ApiException("Problem communicating with API", e);
    }
}

Best answer

The character set is defined in the page itself:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" /> 
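To illustrate why the hard-coded "utf-8" in getUrlContent() garbles this page, here is a minimal sketch that decodes the same GB2312 bytes with two different charsets. The class name `Gb2312Demo` and the sample bytes (the two characters U+4E2D U+6587 encoded as GB2312) are assumptions made for illustration, and the sketch assumes the runtime ships a GB2312 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Gb2312Demo {
    // Hypothetical sample: the characters U+4E2D U+6587 encoded as GB2312.
    static final byte[] GB2312_BYTES = {
        (byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4
    };

    // Decodes raw bytes with an explicit charset instead of the platform default.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Wrong charset: the bytes are not valid UTF-8, so the decoder
        // substitutes replacement characters (U+FFFD).
        String wrong = decode(GB2312_BYTES, StandardCharsets.UTF_8);
        // Charset declared by the page: the original text is recovered.
        String right = decode(GB2312_BYTES, Charset.forName("GB2312"));
        System.out.println(wrong);
        System.out.println(right);
    }
}
```

Once the bytes have been decoded with the wrong charset, calling getBytes("utf-8") on the resulting String cannot bring the original text back, which is why the decoding has to be fixed at the point where the raw bytes are first turned into a String.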

In general, there are three ways to specify the encoding of an HTML page served over HTTP:

The HTTP Content-Type header

Content-Type: text/html; charset=utf-8

The encoding pseudo-attribute in the XML declaration

<?xml version="1.0" encoding="utf-8" ?>

A meta tag inside head

<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />

See Character Encodings for details.
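Both the HTTP header and the meta tag carry the encoding as a `charset=` parameter inside a Content-Type value, so one small helper can extract it from either. The following is a rough sketch only: the class `ContentTypeCharset`, the method `charsetOf`, and the regular expression are illustrative assumptions, not a full parser for the header grammar.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentTypeCharset {
    // Captures the value of a charset parameter, with or without quotes.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s;\"'/>]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name as written, or null if none is declared.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=utf-8"));
        System.out.println(charsetOf("text/html;charset=gb2312"));
        System.out.println(charsetOf("text/html"));
    }
}
```

The returned name can be passed to `Charset.forName`, which matches charset names case-insensitively.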

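So you should evaluate each possible declaration to find the proper encoding. One approach is to parse the page as utf-8 first and, if you encounter a meta tag that declares a Content-Type with a different charset, restart the parse using that charset.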
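A minimal sketch of this "decode, check, restart" idea follows. It assumes the caller already holds the raw response bytes (for example from the ByteArrayOutputStream in getUrlContent()) and, optionally, a charset name taken from the HTTP header. The class `PageDecoder`, its methods, and the regular expression are hypothetical names introduced for illustration. The XML declaration case is left out for brevity.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PageDecoder {
    // Finds charset=... inside a meta tag. The tag itself is plain ASCII,
    // so it stays readable even when the first pass uses the wrong charset.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_\\-]+)",
            Pattern.CASE_INSENSITIVE);

    // headerCharset may be null when the HTTP header declares no charset.
    static String decode(byte[] raw, String headerCharset) {
        Charset fromHeader = lookup(headerCharset);
        if (fromHeader != null) {
            return new String(raw, fromHeader);
        }
        // First pass: tentatively assume UTF-8.
        String firstPass = new String(raw, StandardCharsets.UTF_8);
        Matcher m = META_CHARSET.matcher(firstPass);
        Charset declared = m.find() ? lookup(m.group(1)) : null;
        if (declared == null || declared.equals(StandardCharsets.UTF_8)) {
            return firstPass;
        }
        // Restart: decode the same raw bytes with the declared charset.
        return new String(raw, declared);
    }

    // Returns null for a missing, malformed, or unsupported charset name.
    private static Charset lookup(String name) {
        if (name == null) {
            return null;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalArgumentException e) {
            return null;
        }
    }
}
```

In the question's code this would mean returning the decoded result of something like `PageDecoder.decode(content.toByteArray(), null)` instead of `new String(content.toByteArray(), "utf-8")`, and dropping the getBytes/new String round trip in onCreate().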

This post about "java - Foreign characters in Android & Java" is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/2091762/
