
java - Using GAE's URLFetchService returns null when trying to fetch a New York Times page


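I am using the following code to fetch the HTML of a New York Times page; unfortunately, it returns null. I have tried other sites (CNN, The Guardian, etc.) and they work fine. I am using the URLFetchService in Google App Engine.

Here is the code snippet. Please tell me what I am doing wrong.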

//url = https://www.nytimes.com/2017/05/02/us/politics/health-care-paul-ryan-fred-upton-congress.html

private String extractFromUrl(String url, boolean forced) throws java.io.IOException, org.xml.sax.SAXException,
        de.l3s.boilerpipe.BoilerpipeProcessingException {

    Future<HTTPResponse> urlFuture = getMultiResponse(url);

    HTTPResponse urlResponse = null;
    try {
        urlResponse = urlFuture.get(); // Returns null here
    } catch ( InterruptedException ie ) {
        ie.printStackTrace();
    } catch ( ExecutionException ee ) {
        ee.printStackTrace();
    }

    String urlResponseString = new String(urlResponse.getContent());
    return urlResponseString;
}

public Future<HTTPResponse> getMultiResponse(String website) {
    URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();
    URL url = null;
    try {
        url = new URL(website);
    } catch (MalformedURLException e) {
        e.printStackTrace();
    }

    FetchOptions fetchOptions = FetchOptions.Builder.followRedirects();
    HTTPRequest request = new HTTPRequest(url, HTTPMethod.GET, fetchOptions);
    Future<HTTPResponse> futureResponse = fetcher.fetchAsync(request);
    return futureResponse;
}

This is the exception I get:

java.util.concurrent.ExecutionException: java.io.IOException: Could not fetch URL: https://www.nytimes.com/2017/05/02/us/politics/health-care-paul-ryan-fred-upton-congress.html, error: Received exception executing http method GET against URL https://www.nytimes.com/2017/05/02/us/politics/health-care-paul-ryan-fred-upton-congress.html: null
[INFO] at com.google.appengine.api.utils.FutureWrapper.setExceptionResult(FutureWrapper.java:66)
[INFO] at com.google.appengine.api.utils.FutureWrapper.get(FutureWrapper.java:97)
[INFO] at main.java.com.myapp.app.MyServlet.extractFromUrl(MyServlet.java:10)

Best answer

Looking at curl's verbose output, you can see that the site tries to set cookies and redirects you if the cookies are not accepted.

It looks like nytimes.com redirects you 7 times before giving up:

$ curl --verbose -L "https://www.nytimes.com/2017/05/02/us/politics/health-care-paul-ryan-fred-upton-congress.html" 2>&1 | grep 303 | wc -l
7

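The maximum number of redirects for UrlFetch appears to be 5 [0].

To fetch www.nytimes.com successfully, you will have to disable following redirects and handle the cookie logic yourself. Some inspiration can be found here [1] and here [2].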

[0] https://groups.google.com/forum/#!topic/google-appengine/F2dX3LqOrhY

[1] https://groups.google.com/d/msg/google-appengine-java/pE0xak7LRxg/M__U-SM3YMMJ

[2] https://stackoverflow.com/a/13588616/7947020
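
The approach described in the answer (turn off automatic redirects, then follow each redirect by hand while sending back any cookies the site has set) can be sketched roughly as follows. This is a minimal illustration, not the answerer's code: the class `CookieRedirects`, the `Fetcher` interface and the `Response` type are hypothetical names invented for this sketch, and the cookie handling is deliberately naive (it ignores domain, path and expiry attributes). On App Engine, a `Fetcher` would presumably wrap `URLFetchService`, building each request with `FetchOptions.Builder.doNotFollowRedirects()` and reading the status and headers from `HTTPResponse.getResponseCode()` and `HTTPResponse.getHeaders()`.

```java
import java.io.IOException;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: follow redirects manually while replaying cookies. All names are hypothetical. */
final class CookieRedirects {

    /** Minimal view of one HTTP response (hypothetical type for this sketch). */
    public static final class Response {
        public final int status;
        public final String location;         // Location header value, or null
        public final List<String> setCookies; // raw Set-Cookie header values
        public final byte[] body;

        public Response(int status, String location, List<String> setCookies, byte[] body) {
            this.status = status;
            this.location = location;
            this.setCookies = setCookies;
            this.body = body;
        }
    }

    /** Performs a single GET without following redirects (hypothetical interface). */
    public interface Fetcher {
        Response get(String url, String cookieHeader) throws IOException;
    }

    /** Follows up to maxRedirects redirects, sending back every cookie seen so far. */
    public static Response fetch(Fetcher fetcher, String startUrl, int maxRedirects)
            throws IOException {
        Map<String, String> cookies = new LinkedHashMap<>();
        String url = startUrl;
        for (int hop = 0; hop <= maxRedirects; hop++) {
            Response response = fetcher.get(url, toCookieHeader(cookies));
            for (String setCookie : response.setCookies) {
                remember(cookies, setCookie);
            }
            boolean isRedirect = response.status >= 300 && response.status < 400
                    && response.location != null;
            if (!isRedirect) {
                return response;
            }
            // Location may be relative, so resolve it against the current URL.
            url = URI.create(url).resolve(response.location).toString();
        }
        throw new IOException("More than " + maxRedirects + " redirects for " + startUrl);
    }

    /** Stores the name=value part of a Set-Cookie value, dropping its attributes. */
    static void remember(Map<String, String> cookies, String setCookie) {
        String pair = setCookie.split(";", 2)[0];
        int eq = pair.indexOf('=');
        if (eq > 0) {
            cookies.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
        }
    }

    /** Builds a Cookie request header value, or null when no cookies are stored. */
    static String toCookieHeader(Map<String, String> cookies) {
        if (cookies.isEmpty()) {
            return null;
        }
        StringBuilder header = new StringBuilder();
        for (Map.Entry<String, String> e : cookies.entrySet()) {
            if (header.length() > 0) {
                header.append("; ");
            }
            header.append(e.getKey()).append('=').append(e.getValue());
        }
        return header.toString();
    }
}
```

Because the loop counts redirects itself, the limit is whatever the caller passes as `maxRedirects` rather than the built-in limit of the fetch service, so a chain of 7 redirects like the one curl reports would fit under a limit of, say, 10.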

Regarding "java - Using GAE's URLFetchService returns null when trying to fetch a New York Times page", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/43803016/
