
java - Crawling URLs whose content-type is not text/html

Reposted. Author: 行者123. Updated: 2023-12-02 08:14:50

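I can get all the URLs whose content type is text/html, but what if I want the URLs whose content type is not text/html? How can I check for that? For strings we can use the contains method, but there is nothing like notcontains. Any suggestions would be appreciated.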

The key variable contains:

Content-Type=[text/html; charset=ISO-8859-1]

Below is the code that checks for text/html. I also tried a check for content types that are not text/html, but it also printed entries whose content type is text/html.

            try {
                URL url1 = new URL(url);
                System.out.println("URL:- " + url1);
                URLConnection connection = url1.openConnection();

                Map responseMap = connection.getHeaderFields();
                Iterator iterator = responseMap.entrySet().iterator();
                while (iterator.hasNext()) {
                    String key = iterator.next().toString();

                    if (key.contains("text/html") || key.contains("text/xhtml")) {
                        System.out.println(key);
                        // Content-Type=[text/html; charset=ISO-8859-1]
                        if (filters.matcher(key) != null) {
                            System.out.println(url1);
                            try {
                                final File parentDir = new File("crawl_html");
                                parentDir.mkdir();
                                final String hash = MD5Util.md5Hex(url1.toString());
                                final String fileName = hash + ".txt";
                                final File file = new File(parentDir, fileName);
                                boolean success = file.createNewFile(); // Creates file crawl_html/abc.txt

                                System.out.println("hash:-" + hash);
                                System.out.println(file);
                                // Create file if it does not exist

                                // File did not exist and was created
                                FileOutputStream fos = new FileOutputStream(file, true);
                                PrintWriter out = new PrintWriter(fos);

                                // Also could be written as follows on one line
                                // Printwriter out = new PrintWriter(new FileWriter(args[0]));

                                // Write text to file
                                Tika t = new Tika();
                                String content = t.parseToString(new URL(url1.toString()));

                                out.println("===============================================================");
                                out.println(url1);
                                out.println(key);
                                out.println(success);
                                out.println(content);
                                out.println("===============================================================");
                                out.close();
                                fos.flush();
                                fos.close();
                            } catch (FileNotFoundException e) {
                                // TODO Auto-generated catch block
                                e.printStackTrace();
                            } catch (IOException e) {
                                // TODO Auto-generated catch block
                                e.printStackTrace();
                            } catch (TikaException e) {
                                // TODO Auto-generated catch block
                                e.printStackTrace();
                            }
                            // http://google.com
                        }
                    }
                    else if (!connection.getContentType().startsWith("text/html")) // print duplicate records of each url
                    //else if (!key.contains("text/html"))
                    {
                        if (filters.matcher(key) != null) {
                            try {
                                final File parentDir = new File("crawl_media");
                                parentDir.mkdir();
                                final String hash = MD5Util.md5Hex(url1.toString());
                                final String fileName = hash + ".txt";
                                final File file = new File(parentDir, fileName);
                                // Create file if it does not exist
                                boolean success = file.createNewFile(); // Creates file crawl_html/abc.txt

                                System.out.println("hash:-" + hash);

                                Tika t = new Tika();
                                String content_media = t.parseToString(new URL(url1.toString()));

                                // File did not exist and was created
                                FileOutputStream fos = new FileOutputStream(file, true);
                                PrintWriter out = new PrintWriter(fos);

                                // Also could be written as follows on one line
                                // Printwriter out = new PrintWriter(new FileWriter(args[0]));

                                // Write text to file
                                out.println("===============================================================");
                                out.println(url1);
                                out.println(key);
                                out.println(success);
                                out.println(content_media);
                                //out.println("===============================================================");
                                out.close();
                                fos.flush();
                                fos.close();
                            } catch (FileNotFoundException e) {
                                // TODO Auto-generated catch block
                                e.printStackTrace();
                            } catch (IOException e) {
                                // TODO Auto-generated catch block
                                e.printStackTrace();
                            } catch (TikaException e) {
                                // TODO Auto-generated catch block
                                e.printStackTrace();
                            }
                        }
                    }
                }
            } catch (MalformedURLException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }

            System.out.println("=============");
        }
    }

One way is to check each content type individually, for example PDF, which is application/pdf:

    if (key.contains("application/pdf"))

and the same way for XML... but is there any other approach besides this?
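The "notcontains" the question asks about can be written by negating the existing check with `!`, for example `!key.contains("text/html")`, so there is no need to list every other type. The sketch below is one possible way to do it, not the asker's code: `ContentTypeCheck` and `isHtml` are hypothetical names, and matching on the bare media type (with `application/xhtml+xml` counted as HTML) is an assumption about what should be treated as HTML.

```java
import java.util.Locale;

public class ContentTypeCheck {

    // Hypothetical helper: true when the header value names an HTML media type.
    public static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // no Content-Type value available
        }
        // Drop parameters such as "; charset=ISO-8859-1" and normalise case.
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        // Assumption: XHTML served as application/xhtml+xml also counts as HTML.
        return mediaType.equals("text/html") || mediaType.equals("application/xhtml+xml");
    }

    public static void main(String[] args) {
        String[] samples = {"text/html; charset=ISO-8859-1", "application/pdf", "image/png"};
        for (String sample : samples) {
            // Negating the check gives "everything that is not HTML".
            System.out.println(sample + " -> non-HTML: " + !isHtml(sample));
        }
    }
}
```

With this shape, PDF, XML, images and any other type all fall into the `!isHtml(...)` case without being named one by one.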

Best answer

Does this help?

    if (!connection.getContentType().startsWith("text/html"))
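One caveat when applying this check: in the question's code it sits inside the loop over header entries, so for a non-HTML response it is evaluated once per header, which appears to be why each URL was written several times. `getContentType()` can also return null when the type is not known. The following is a minimal sketch of deciding once per response, assuming a header map shaped like the one returned by `getHeaderFields()`; `ResponseClassifier`, `findContentType` and `targetDirectory` are hypothetical names, and sending responses without a content type to `crawl_media` is an assumption.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class ResponseClassifier {

    // Looks up the Content-Type value, ignoring header-name case.
    // Returns null when the header is absent.
    public static String findContentType(Map<String, List<String>> headers) {
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            String name = entry.getKey(); // null for the HTTP status line
            List<String> values = entry.getValue();
            if (name != null && name.equalsIgnoreCase("Content-Type")
                    && values != null && !values.isEmpty()) {
                return values.get(0);
            }
        }
        return null;
    }

    // Decides the output directory once per response, not once per header.
    public static String targetDirectory(Map<String, List<String>> headers) {
        String contentType = findContentType(headers);
        boolean html = contentType != null
                && contentType.trim().toLowerCase(Locale.ROOT).startsWith("text/html");
        // Assumption: a missing Content-Type is treated as non-HTML.
        return html ? "crawl_html" : "crawl_media";
    }

    public static void main(String[] args) {
        // Stand-in for connection.getHeaderFields(); no network access needed.
        Map<String, List<String>> headers = new HashMap<>();
        headers.put(null, Arrays.asList("HTTP/1.1 200 OK"));
        headers.put("Content-Type", Arrays.asList("application/pdf"));
        headers.put("Content-Length", Arrays.asList("1024"));
        System.out.println(targetDirectory(headers)); // prints "crawl_media"
    }
}
```

In the crawler itself, the map would come from `connection.getHeaderFields()`, and the file-writing code would run once after this decision instead of inside the header loop.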

Regarding java - Crawling URLs whose content-type is not text/html, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/6654650/
