java - Crawl only HTML pages by checking the response headers

Reposted. Author: 行者123. Last updated: 2023-12-02 08:14:52

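I am trying to collect every URL whose response has the header Content-Type: text/html. To do that I check the response headers of each URL, and if they contain Content-Type: text/html, I want to print that URL. But in my code, when I check whether the header name is Content-Type, nothing is printed. If I remove the if statement, it prints every link related to the site I am crawling, along with all of its response headers.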

public class MyCrawler extends WebCrawler {

    Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4" + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    /*
    Pattern filters = Pattern.compile("(\\.(html))");
    */
    public MyCrawler() {
    }

    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        //System.out.println("Href: " +href);
        /*
        if (filters.matcher(href).matches()) {
            return false;
        }*/
        if (href.startsWith("http://www.somehost.com/")) {
            return true;
        }
        return false;
    }

    public void visit(Page page) {

        int docid = page.getWebURL().getDocid();

        String url = page.getWebURL().getURL();
        String text = page.getText();
        List<WebURL> links = page.getURLs();
        int parentDocid = page.getWebURL().getParentDocid();

        //HttpGet httpget = new HttpGet(url);

        try {
            URL url1 = new URL(url);
            URLConnection connection = url1.openConnection();

            Map responseMap = connection.getHeaderFields();
            for (Iterator iterator = responseMap.keySet().iterator(); iterator.hasNext();)
            {
                String key = (String) iterator.next();
                if(key==("Content-Type")) //(Anything wrong with this if loop)
                {
                    System.out.print(key + " = ");

                    List values = (List) responseMap.get(key);
                    for (int i = 0; i < values.size(); i++) {
                        Object o = values.get(i);
                        System.out.print(o + ", ");
                    }
                    System.out.println("");
                    System.out.println(url1);
                }

            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }

        //System.out.println("Docid: " + docid);
        //System.out.println("URL: " + url);
        //System.out.println("Text length: " + text.length());
        //System.out.println("Number of links: " + links.size());
        //System.out.println("Docid of parent page: " + parentDocid);
        System.out.println("=============");
    }
}

Best answer

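When you iterate over entrySet() and call toString() on each entry, the key variable contains the whole header entry:

Content-Type=[text/html; charset=ISO-8859-1]

so it cannot be matched with == or .equals("Content-Type"). Note also that == on strings compares object references rather than contents, so key == "Content-Type" in the question's code will normally be false even when the header name is exactly Content-Type.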
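The difference between == and equals() on strings can be seen in a minimal sketch. The class name StringCompareDemo and the helper runtimeHeaderName() are hypothetical, introduced only for illustration; the helper builds the string at runtime as a stand-in for a header name parsed from a network response.

```java
// Sketch: why key == "Content-Type" fails even when the text matches.
final class StringCompareDemo {

    // Builds the header name at runtime, standing in for a name that was
    // parsed from a response (an assumption made for illustration).
    static String runtimeHeaderName() {
        return new StringBuilder("Content-").append("Type").toString();
    }

    public static void main(String[] args) {
        String key = runtimeHeaderName();

        // == compares object references: key is a different object
        // from the "Content-Type" literal, so this prints false.
        System.out.println(key == "Content-Type");

        // equals() compares the characters, so this prints true.
        System.out.println(key.equals("Content-Type"));

        // Putting the literal first also avoids a NullPointerException
        // when key is null; this prints true.
        System.out.println("Content-Type".equals(key));
    }
}
```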

Try running the following code and look at what it prints:

URLConnection connection = url1.openConnection();

Map<String, List<String>> responseMap = connection.getHeaderFields();
Iterator<Map.Entry<String, List<String>>> iterator = responseMap.entrySet().iterator();
while (iterator.hasNext())
{
    // Each entry prints as "name=[values]"
    String key = iterator.next().toString();
    if (key.contains("Content-Type"))
    {
        System.out.println(key);
        // Content-Type=[text/html; charset=ISO-8859-1]
        // filters.matcher(key) never returns null, so test the value instead
        if (key.contains("text/html")) {
            System.out.println(url1);
            // http://google.com
        }
    }
}

Here is the output:

Content-Type=[text/html; charset=ISO-8859-1]
http://google.com

It looks like you could also use just one if statement, like this:

while (iterator.hasNext())
{
    String key = iterator.next().toString();
    if (key.contains("text/html"))
    {
        System.out.println(url1);
        // http://google.com
    }
}
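A substring test such as contains("text/html") also matches any other header whose value happens to include that text. As a hedged alternative, the sketch below checks only the media type part of a Content-Type value. The class HtmlContentType and the method isHtml are hypothetical names, not part of the crawler library or of the original answer.

```java
import java.util.Locale;

final class HtmlContentType {

    // Returns true when a Content-Type header value names the text/html
    // media type, ignoring case and parameters such as "; charset=...".
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // no Content-Type header was available
        }
        int semicolon = contentType.indexOf(';');
        String mediaType = (semicolon >= 0)
                ? contentType.substring(0, semicolon)
                : contentType;
        return mediaType.trim().toLowerCase(Locale.ROOT).equals("text/html");
    }

    public static void main(String[] args) {
        System.out.println(isHtml("text/html; charset=ISO-8859-1")); // true
        System.out.println(isHtml("TEXT/HTML"));                     // true
        System.out.println(isHtml("application/pdf"));               // false
    }
}
```

Inside visit(), one option would be to pass the result of connection.getContentType() to isHtml and print url1 only when it returns true.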

By the way, string comparison in Java is not very intuitive (== compares references, equals() compares contents), and it always trips me up!
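Two further details matter when comparing header names: HTTP header names are case-insensitive, and in practice the map returned by getHeaderFields() may contain a null key for the status line. A minimal sketch of a lookup that tolerates both follows. HeaderLookup and findHeader are hypothetical names, and the map in main is a hand-built stand-in for a real response.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HeaderLookup {

    // Finds a header's values by name, ignoring case. The known name goes
    // on the left because equalsIgnoreCase(null) returns false instead of
    // throwing, which keeps a null map key from breaking the loop.
    static List<String> findHeader(Map<String, List<String>> headers, String name) {
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            if (name.equalsIgnoreCase(entry.getKey())) {
                return entry.getValue();
            }
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        // Hand-built stand-in for a response's header map (illustration only).
        Map<String, List<String>> headers = new HashMap<>();
        headers.put(null, Arrays.asList("HTTP/1.1 200 OK"));
        headers.put("content-type", Arrays.asList("text/html; charset=ISO-8859-1"));

        // Prints [text/html; charset=ISO-8859-1]
        System.out.println(findHeader(headers, "Content-Type"));
    }
}
```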

Regarding "java - Crawl only HTML pages by checking the response headers", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/6630906/
