java - Using jsoup to get image width and height from amazon.com links

Reposted. Author: 行者123. Updated: 2023-12-01 11:36:56

Here is a sample Amazon link whose image width and height I am trying to scrape:

http://images.amazon.com/images/P/0099441365.01.SCLZZZZZZZ.jpg

I am using jsoup, and this is my code:

import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Crawler_main {

    /**
     * @param args
     */
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        String filepath = "C:/imagelinks.txt";
        try (BufferedReader br = new BufferedReader(new FileReader(filepath))) {
            String line;
            String width;
            //String height;
            while ((line = br.readLine()) != null) {
                // process the line.
                System.out.println(line);
                Document doc = Jsoup.connect(line).ignoreContentType(true).get();
                //System.out.println(doc.toString());
                Elements jpg = doc.getElementsByTag("img");
                width = jpg.attr("width");
                System.out.println(width);
                //String title = doc.title();
            }
        }
        catch (FileNotFoundException ex) {
            System.out.println("File not found");
        }
        catch (IOException ex) {
            System.out.println("Unable to read line");
        }
        catch (Exception ex) {
            System.out.println("Exception occurred");
        }
    }

}

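The HTML is fetched, but when I extract the width attribute it returns null. When I print the fetched HTML, it contains garbage characters (I am guessing that what I call garbage characters is the actual image data).

I can't even paste an example of the document.toString() output into this editor. Help!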

Best answer

The problem is that you are fetching a jpg file, not HTML. The call to ignoreContentType(true) is a clue, because its documentation states:

Ignore the document's Content-Type when parsing the response. By default this is false, an unrecognised content-type will cause an IOException to be thrown. (This is to prevent producing garbage by attempting to parse a JPEG binary image, for example.)
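One way to tell the two cases apart before parsing is to look at the first bytes of the response body. The following is a minimal sketch that uses only the JDK (it does not use jsoup); the class name `ContentSniffer` and the method `looksLikeImage` are made up for this illustration, and the byte arrays in `main` are hand-written stand-ins for a real response body.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of jsoup: sniffs the leading bytes of a body.
public class ContentSniffer {

    /** Returns true when the leading bytes look like a known image format. */
    public static boolean looksLikeImage(byte[] body) throws IOException {
        // guessContentTypeFromStream needs a stream that supports mark/reset;
        // ByteArrayInputStream does.
        String type = URLConnection.guessContentTypeFromStream(new ByteArrayInputStream(body));
        return type != null && type.startsWith("image/");
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for an HTML response body.
        byte[] html = "<html><body><img width=\"10\"></body></html>"
                .getBytes(StandardCharsets.US_ASCII);
        // Stand-in for a binary image body: the 8-byte PNG signature plus padding.
        byte[] png = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A,
                      0, 0, 0, 0x0D, 'I', 'H', 'D', 'R'};

        System.out.println("html body is image: " + looksLikeImage(html));
        System.out.println("png body is image:  " + looksLikeImage(png));
    }
}
```

If the body looks like an image, handing it to an HTML parser will only produce the garbage described in the question, so it should go to an image decoder instead.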

If you want the width of the actual jpg file, this SO answer may be useful:

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;
BufferedImage bimg = ImageIO.read(new File(filename));
int width = bimg.getWidth();
int height = bimg.getHeight();
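The snippet above reads from a local file, while the question starts from links. As a rough sketch of how the same idea might be applied to a URL, `ImageIO.read` also has an overload that takes a `java.net.URL`. The class name `ImageSize` and the method `dimensions` are invented for this example, and the links are assumed to be passed as command-line arguments rather than read from `C:/imagelinks.txt`.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.net.URI;
import java.net.URL;
import javax.imageio.ImageIO;

// Hypothetical helper: decodes an image straight from a URL, without jsoup.
public class ImageSize {

    /** Downloads and decodes the image at the given URL; returns {width, height}. */
    public static int[] dimensions(URL url) throws IOException {
        BufferedImage img = ImageIO.read(url);
        if (img == null) {
            // ImageIO.read returns null when no registered reader can decode the data.
            throw new IOException("No ImageIO reader could decode " + url);
        }
        return new int[] {img.getWidth(), img.getHeight()};
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ImageSize <image-url> [<image-url> ...]");
            return;
        }
        for (String link : args) {
            int[] size = dimensions(URI.create(link).toURL());
            System.out.println(link + " -> " + size[0] + " x " + size[1]);
        }
    }
}
```

For instance, it could be run as `java ImageSize http://images.amazon.com/images/P/0099441365.01.SCLZZZZZZZ.jpg`. Note that this downloads and decodes the whole image just to learn its size.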

This post is based on a similar question found on Stack Overflow about using jsoup to get image width and height from amazon.com links: https://stackoverflow.com/questions/29875311/
