
java - Problem getting the pageContent of an unavailable URL in Java

Reposted. Author: 行者123. Updated: 2023-12-01 16:04:57

I have code that fetches the page content from a URL:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class GetPageFromURLAction extends Thread {

    public String stringPageContent;
    public String targerURL;

    // Downloads the page at targetURL and returns it as one string,
    // with "\n" appended after every line read.
    public String getPageContent(String targetURL) throws IOException {
        String returnString = "";
        URL urlString = new URL(targetURL);
        URLConnection openConnection = urlString.openConnection();
        String temp;
        BufferedReader in = new BufferedReader(new InputStreamReader(openConnection.getInputStream()));
        while ((temp = in.readLine()) != null) {
            returnString += temp + "\n";
        }
        in.close();
        // String nohtml = sb.toString().replaceAll("\\<.*?>","");
        return returnString;
    }

    public String getStringPageContent() {
        return stringPageContent;
    }

    public void setStringPageContent(String stringPageContent) {
        this.stringPageContent = stringPageContent;
    }

    public String getTargerURL() {
        return targerURL;
    }

    public void setTargerURL(String targerURL) {
        this.targerURL = targerURL;
    }

    // Fetches targerURL on this thread. If the fetch throws,
    // stringPageContent is left unassigned.
    @Override
    public void run() {
        try {
            this.stringPageContent = this.getPageContent(targerURL);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

Sometimes I get an HTTP 405 or 403 error and the result string is null. I tried checking the permission to connect to the URL with:

URLConnection openConnection = urlString.openConnection();
openConnection.getPermission();

But it usually returns null. Does that mean I don't have permission to access the link?

I tried stripping the query part of the URL with:

String nohtml = sb.toString().replaceAll("\\<.*?>","");

where sb is a StringBuilder, but it doesn't seem to strip off the whole query substring.

On an unrelated note, I'd like to use threads here because I have to retrieve many URLs; how can I create a multithreaded client to improve the speed?

Best answer

The relevant error definitions are:

403 Forbidden

The server understood the request, but is refusing to fulfill it. Authorization will not help and the request SHOULD NOT be repeated. If the request method was not HEAD and the server wishes to make public why the request has not been fulfilled, it SHOULD describe the reason for the refusal in the entity. If the server does not wish to make this information available to the client, the status code 404 (Not Found) can be used instead.

405 Method Not Allowed

The method specified in the Request-Line is not allowed for the resource identified by the Request-URI. The response MUST include an Allow header containing a list of valid methods for the requested resource.

So yes, 403 means you do not have permission, and stripping the query probably won't help at all.
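
As a side note, the replaceAll pattern in the question matches anything between `<` and `>` (that is, HTML tags in the page text), so it never touches the query string. If you do want to drop the query from a URL, a minimal sketch using java.net.URI could look like this. The names QueryStripper and stripQuery are made up for illustration and do not come from the question.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of the question's code.
class QueryStripper {

    // Returns the URL without its "?query" part; scheme, authority,
    // path and fragment are kept. getPath() returns the decoded path,
    // and the URI constructor re-quotes any illegal characters.
    static String stripQuery(String url) {
        URI uri = URI.create(url);
        try {
            return new URI(uri.getScheme(), uri.getAuthority(),
                           uri.getPath(), null, uri.getFragment()).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // prints "http://example.com/page"
        System.out.println(stripQuery("http://example.com/page?id=42&sort=asc"));
    }
}
```

Parsing the URL this way avoids hand-written regular expressions, but as said above, a server that answers 403 will most likely answer 403 for the stripped URL as well.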

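405 means the server does not allow the method you used (here, GET) for that resource, but I wouldn't be surprised if some servers return 405 when they really mean 403.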
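
To find out which of the two codes you are actually getting, you can cast the connection to HttpURLConnection and read the status before touching the body. In the question's code, getInputStream() throws an IOException for these responses, so stringPageContent is never assigned. The following is only a sketch: HttpStatusCheck, isRefused and describe are hypothetical names, and it assumes the URL is http or https so that the cast succeeds.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not part of the question's code.
class HttpStatusCheck {

    // True for the two status codes discussed above.
    static boolean isRefused(int status) {
        return status == HttpURLConnection.HTTP_FORBIDDEN    // 403
            || status == HttpURLConnection.HTTP_BAD_METHOD;  // 405
    }

    // Sends the request and summarises the status line without reading the body.
    // Assumes targetURL uses http or https.
    static String describe(String targetURL) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(targetURL).openConnection();
        try {
            int status = conn.getResponseCode();
            if (isRefused(status)) {
                // A 405 response should carry an Allow header listing valid methods.
                String allow = conn.getHeaderField("Allow");
                return status + " refused" + (allow != null ? ", Allow: " + allow : "");
            }
            return status + " " + conn.getResponseMessage();
        } finally {
            conn.disconnect();
        }
    }
}
```

Checking the code explicitly lets the caller tell a refused request apart from a network failure, instead of treating every IOException the same way.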

In both cases, you should probably consider the URL permanently inaccessible.
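
On the threading part of the question: rather than starting one Thread per URL, one common option is a fixed-size pool from java.util.concurrent. The sketch below is an assumption about how that could be arranged, not something taken from the original answer. ParallelFetcher, PageLoader and fetchAll are hypothetical names. A URL whose fetch fails (for example with a 403 or 405) is recorded as null and not retried.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of the question's code.
class ParallelFetcher {

    // Anything that turns a URL into page content and may throw.
    interface PageLoader {
        String load(String url) throws Exception;
    }

    // Runs loader.load(url) for every URL on a pool of `threads` workers.
    // The result keeps the input order; a URL whose load threw maps to null.
    static Map<String, String> fetchAll(List<String> urls, PageLoader loader, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<String>> pending = new LinkedHashMap<>();
            for (String url : urls) {
                Callable<String> task = () -> loader.load(url);
                pending.put(url, pool.submit(task));
            }
            Map<String, String> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<String>> entry : pending.entrySet()) {
                try {
                    results.put(entry.getKey(), entry.getValue().get());
                } catch (ExecutionException failed) {
                    results.put(entry.getKey(), null); // treat as unavailable
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt(); // keep the interrupt flag set
                    results.put(entry.getKey(), null);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

With the class from the question, a call might look like `ParallelFetcher.fetchAll(urls, url -> new GetPageFromURLAction().getPageContent(url), 8)`, where the pool size of 8 is an arbitrary choice to be tuned for your workload.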

Regarding "java - Problem getting the pageContent of an unavailable URL in Java", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/2807564/
