
java - Jsoup reddit scraper 429 error

Reposted. Author: 行者123. Updated: 2023-12-01 07:22:22

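So I'm trying to use jsoup to scrape images from Reddit, but when I scrape certain subreddits (such as /r/wallpaper) I get a 429 error, and I'd like to know how to fix it. I fully understand that this code is terrible and that this is a pretty noobish question, but I'm completely new to this. Anyway: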

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class javascraper {

    public static void main(String[] args) throws MalformedURLException {
        Scanner scan = new Scanner(System.in);
        System.out.println("Where do you want to store the files?");
        String folderpath = scan.next();
        System.out.println("What subreddit do you want to scrape?");
        String subreddit = scan.next();
        subreddit = ("http://reddit.com/r/" + subreddit);
        new File(folderpath + "/" + subreddit).mkdir();

        //test

        try {
            // fetch the subreddit page over HTTP
            Document doc = Jsoup.connect(subreddit).timeout(0).get();

            // get page title
            String title = doc.title();
            System.out.println("title : " + title);

            // get all links
            Elements links = doc.select("a[href]");

            for (Element link : links) {
                // get value from href attribute
                String checkLink = link.attr("href");
                Elements images = doc.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
                if (imgCheck(checkLink)) { // checks whether the link points to an image
                    System.out.println("link : " + link.attr("href"));
                    downloadImages(checkLink, folderpath);
                }
            }
        }
        catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static boolean imgCheck(String http) {
        String png = ".png";
        String jpg = ".jpg";
        String jpeg = "jpeg"; // no leading period, so every suffix is four characters long
        String gif = ".gif";
        int length = http.length();

        if (http.contains(png) || http.contains("gfycat") || http.contains(jpg) || http.contains(jpeg) || http.contains(gif)) {
            return true;
        }
        else {
            return false;
        }
    }

    private static void downloadImages(String src, String folderpath) throws IOException {
        String folder = null;

        // Extract the name of the image from the src attribute
        int indexname = src.lastIndexOf("/");

        if (indexname == src.length()) {
            src = src.substring(1, indexname);
        }
        indexname = src.lastIndexOf("/");

        String name = src.substring(indexname, src.length());

        System.out.println(name);

        // Open a URL stream
        URL url = new URL(src);
        InputStream in = url.openStream();

        OutputStream out = new BufferedOutputStream(new FileOutputStream(folderpath + name));

        // Copy the image one byte at a time
        for (int b; (b = in.read()) != -1;) {
            out.write(b);
        }

        out.close();
        in.close();
    }

}

Best answer

Your problem is caused by your scraper violating reddit's API rules. Error 429 means "Too Many Requests": you are requesting too many pages too quickly.

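You may make one request every 2 seconds, and you also need to set a proper user agent (the format they recommend is <platform>:<app ID>:<version string> (by /u/<reddit username>) ). As it stands, your code runs too fast and does not specify a user agent, so it will be heavily rate limited.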
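
Both rules can be captured in a small helper. What follows is a minimal sketch, not part of the original answer: the class name RequestThrottle, its methods, and the example values passed to userAgent are all hypothetical. It spaces calls at least a fixed interval apart and formats a user agent in the recommended shape.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not from the original answer). */
final class RequestThrottle {
    private final long minIntervalNanos;
    private long nextAllowedNanos; // earliest time the next request may start

    RequestThrottle(long minIntervalMillis) {
        this.minIntervalNanos = TimeUnit.MILLISECONDS.toNanos(minIntervalMillis);
        this.nextAllowedNanos = System.nanoTime();
    }

    /** Blocks until at least the minimum interval has passed since the previous call. */
    synchronized void acquire() throws InterruptedException {
        long now = System.nanoTime();
        long waitNanos = nextAllowedNanos - now; // subtraction is safe against nanoTime wrap-around
        if (waitNanos > 0) {
            TimeUnit.NANOSECONDS.sleep(waitNanos);
            now = System.nanoTime();
        }
        nextAllowedNanos = now + minIntervalNanos;
    }

    /** Formats a user agent as <platform>:<app ID>:<version string> (by /u/<reddit username>). */
    static String userAgent(String platform, String appId, String version, String redditUser) {
        return String.format("%s:%s:%s (by /u/%s)", platform, appId, version, redditUser);
    }
}
```

With this sketch, one would create a single new RequestThrottle(2000) and call acquire() before every request to reddit, so the delay applies across page fetches and image downloads alike.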


To fix this, first add this to the beginning of the class, before the main method:

public static final String USER_AGENT = "<PUT YOUR USER AGENT HERE>";

(Make sure to specify an actual user agent.)

Then change this (in downloadImages):

URL url = new URL(src);
InputStream in = url.openStream();

to this:

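URLConnection connection = (new URL(src)).openConnection();
connection.setRequestProperty("User-Agent", USER_AGENT);

try {
    Thread.sleep(2000); // Delay to comply with rate limiting
} catch (InterruptedException e) {
    // Thread.sleep throws a checked exception; restore the flag and report it as an IOException
    Thread.currentThread().interrupt();
    throw new IOException("Interrupted while waiting to download " + src, e);
}

InputStream in = connection.getInputStream();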

You will also need to change this (in main):

Document doc = Jsoup.connect(subreddit).timeout(0).get();

to this:

Document doc = Jsoup.connect(subreddit).userAgent(USER_AGENT).timeout(0).get();

After that, your code should no longer run into that error.
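
The downloadImages changes can also be folded into one self-contained method. The following is only a sketch under stated assumptions: ImageDownloader, download, and the delayMillis parameter are hypothetical names, and the byte-by-byte copy loop is swapped for Files.copy inside try-with-resources so the stream is closed even when the download fails.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical variant of downloadImages (not the asker's exact method). */
final class ImageDownloader {
    // Placeholder taken from the answer: replace with a real user agent.
    static final String USER_AGENT = "<PUT YOUR USER AGENT HERE>";

    /** Waits delayMillis, then saves the resource at src into folder and returns the saved path. */
    static Path download(String src, Path folder, long delayMillis)
            throws IOException, InterruptedException {
        // The file name is whatever follows the last "/" in the URL.
        String name = src.substring(src.lastIndexOf('/') + 1);
        if (name.isEmpty()) {
            throw new IOException("No file name in URL: " + src);
        }
        Path target = folder.resolve(name);

        Thread.sleep(delayMillis); // delay to comply with rate limiting (2000 ms in the answer)

        URLConnection connection = new URL(src).openConnection();
        connection.setRequestProperty("User-Agent", USER_AGENT);

        // try-with-resources closes the stream whether or not the copy succeeds
        try (InputStream in = connection.getInputStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }
}
```

A call such as ImageDownloader.download(checkLink, Paths.get(folderpath), 2000) would take the place of downloadImages(checkLink, folderpath). URLs that carry a query string would need extra handling before being used as file names.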


Note that using reddit's API (i.e., /r/subreddit.json instead of /r/subreddit) would probably make this project easier, but it is not required, and your current code will work as well.
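
To give a rough idea of the JSON route, here is a crude sketch with no external dependencies. It rests on several assumptions: the listing is served at https://www.reddit.com/r/<subreddit>.json, each post exposes its link in a "url" field, and a regular expression is good enough to pull those fields out (a real JSON parser would be more robust). RedditListing and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for the JSON listing (a sketch, not a full JSON parser). */
final class RedditListing {
    // Matches "url": "<value>" but not keys such as "thumbnail_url".
    private static final Pattern URL_FIELD = Pattern.compile("\"url\"\\s*:\\s*\"([^\"]+)\"");
    // Same extensions the question's selector looks for, anchored at the end of the URL.
    private static final Pattern IMAGE_EXT = Pattern.compile("(?i)\\.(png|jpe?g|gif)$");

    /** Builds the JSON listing address for a subreddit name (assumed URL layout). */
    static String listingUrl(String subreddit) {
        return "https://www.reddit.com/r/" + subreddit + ".json";
    }

    /** Returns every "url" value in the JSON text that ends in an image extension. */
    static List<String> imageUrls(String json) {
        List<String> result = new ArrayList<>();
        Matcher m = URL_FIELD.matcher(json);
        while (m.find()) {
            String url = m.group(1);
            if (IMAGE_EXT.matcher(url).find()) {
                result.add(url);
            }
        }
        return result;
    }
}
```

The JSON text itself would still have to be fetched with the user agent and delay described above; this sketch only covers building the address and filtering the links.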

Regarding java - Jsoup reddit scraper 429 error, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/32769754/
