java - Google search using Java

Reposted. Author: 行者123. Updated: 2023-11-30 06:50:23
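This program reads a text file of search queries, queries Google with each one, and writes all the resulting links to another file. The program works for a few hundred queries, then suddenly stops working and reports errors.

(I will edit this post shortly with the errors and the lines of my program they come from.)

Any idea what is happening?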

import java.io.*;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.Scanner;

public class GoogleSearcher {
    public static void main(String [] args) throws Exception {
        Scanner in = new Scanner (System.in);
        System.out.println("Input list of queries to search:");
        String loc = in.nextLine();
        loc = loc.replace("\\", "");
        System.out.println("Where to write file?");
        String writeLoc = in.nextLine();
        writeLoc = writeLoc.replace("\\", " ");
        FileInputStream fstream = new FileInputStream(loc);
        BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
        String line;
        PrintWriter pw = new PrintWriter(new FileWriter(writeLoc + "Google Search Results.txt"));
        while ((line = br.readLine()) != null) {
            System.out.println("Searching: \"" + line + "\"");
            ArrayList<String> t = googleSearch(line);
            if (t != null){
                for (int a = 0; a < t.size(); a++){
                    pw.write(t.get(a) + System.lineSeparator());
                }
            }
        }
        br.close();
        pw.close();
    }
    public static ArrayList<String> googleSearch(String search) throws Exception {
        try {
            String query = "https://www.google.com/search?q=" + search.replace(" ", "%20");
            String page = getSearchContent(query);
            ArrayList<String> links = parseLinks(page);
            return formatLinks(links);
        } catch (Exception e) {
            e.printStackTrace();
            System.out.println("Error... Trying next search");
            return null;
        }
    }
    public static ArrayList<String> formatLinks(ArrayList a){
        ArrayList<String> formatted = new ArrayList<String>();
        for (int i = 0; i < a.size(); i++){
            String t = (String)a.get(i);
            t = t.replace("%3F", "?");
            t = t.replace("%3D", "=");
            formatted.add(t);
        }
        return formatted;
    }
    public static String getString(InputStream is) {
        StringBuilder sb = new StringBuilder();
        BufferedReader br = new BufferedReader(new InputStreamReader(is));
        String line;
        try {
            while ((line = br.readLine()) != null) {
                sb.append(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (br != null) {
                try {
                    br.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return sb.toString();
    }
    public static String getSearchContent(String path) throws Exception {
        final String agent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
        URL url = new URL(path);
        final URLConnection connection = url.openConnection();
        connection.setRequestProperty("User-Agent", agent);
        final InputStream stream = connection.getInputStream();
        return getString(stream);
    }
    public static ArrayList<String> parseLinks(final String html) throws Exception {
        ArrayList<String> result = new ArrayList<String>();
        String pattern1 = "<h3 class=\"r\"><a href=\"/url?q=";
        String pattern2 = "\">";
        Pattern p = Pattern.compile(Pattern.quote(pattern1) + "(.*?)" + Pattern.quote(pattern2));
        Matcher m = p.matcher(html);
        while (m.find()) {
            String domainName = m.group(0).trim();
            // remove unwanted text
            domainName = domainName.substring(domainName.indexOf("/url?q=") + 7);
            domainName = domainName.substring(0, domainName.indexOf("&amp;"));
            result.add(domainName);
        }
        return result;
    }
}

Best answer

OK, after running your program for a few rounds, I got the following error.

Error... Trying next search
Searching: "autoradiograph"
java.io.IOException: Server returned HTTP response code: 503 for URL: https://ipv4.google.com/sorry/index?continue=https://www.google.com/search%3Fq%3Daustria&q=EgTLe7ahGOKSrcMFIhkA8aeDSylzciRE9l0cz9fUg6u2MeGh-muxMgNyY24
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1876)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254)
at application.GoogleSearcher.getSearchContent(GoogleSearcher.java:90)
at application.GoogleSearcher.googleSearch(GoogleSearcher.java:45)
at application.GoogleSearcher.main(GoogleSearcher.java:32)
java.io.IOException: Server returned HTTP response code: 503 for URL: https://ipv4.google.com/sorry/index?continue=https://www.google.com/search%3Fq%3Dautoradiograph&q=EgTLe7ahGOKSrcMFIhkA8aeDS_cQehdQreptc4cInLKEPYpprweeMgNyY24

This happens because Google is blocking automated searches to prevent Denial of Service attacks on its servers.
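
To make the block visible in code rather than as an exception from getInputStream(), the status code can be read first. The sketch below is only an illustration: the class name BlockAwareFetcher, its method names, and the backoff parameters are hypothetical, and the "/sorry/" check is based solely on the URL seen in the stack trace above. Waiting between requests only avoids hammering the server; it does not make automated querying permitted.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not part of the original program.
final class BlockAwareFetcher {

    // True when the response looks like the block page from the stack trace:
    // an HTTP 503, or a final URL that contains "/sorry/".
    static boolean looksBlocked(int status, String finalUrl) {
        return status == 503 || finalUrl.contains("/sorry/");
    }

    // Exponential backoff: baseMillis * 2^attempt, capped at maxMillis.
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        int shift = Math.max(0, Math.min(attempt, 20)); // bounded to avoid overflow
        return Math.min(baseMillis << shift, maxMillis);
    }

    // Returns the HTTP status code instead of throwing on a 5xx response.
    static int fetchStatus(String path, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(path).openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

A caller could check looksBlocked(...) after each request and, when it returns true, stop the run or sleep for backoffMillis(attempt, 1000, 60000) milliseconds before the next query.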

(Image: Google CAPTCHA page)

Google may not allow you to perform automated searches. Their support page covers this; here is an excerpt from that page.

Automated queries

Google's Terms of Service do not allow the sending of automated queries of any sort to our system without express permission in advance from Google. Sending automated queries consumes resources and includes using any software (such as WebPosition Gold) to send automated queries to Google to determine how a website or webpage ranks in Google search results for various queries. In addition to rank checking, other types of automated access to Google without permission are also a violation of our Webmaster Guidelines and Terms of Service.
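
If programmatic access is what you need, one option that is not mentioned in the original answer is Google's Custom Search JSON API, which is intended for queries sent by software. The sketch below only builds the request URL. The class name CustomSearchUrl is hypothetical, the API key and search engine ID are placeholders you would have to obtain yourself, and whether the API's quotas and result set fit your use case is an assumption you should check. It uses the URLEncoder.encode(String, Charset) overload, which needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original program.
final class CustomSearchUrl {

    private static final String ENDPOINT = "https://www.googleapis.com/customsearch/v1";

    // Builds a request URL with the key, cx (search engine ID) and q parameters.
    // Each value is form-encoded, so spaces become '+' and '&' becomes "%26".
    static String build(String apiKey, String engineId, String query) {
        return ENDPOINT
                + "?key=" + encode(apiKey)
                + "&cx=" + encode(engineId)
                + "&q=" + encode(query);
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

The same encoding step could also replace search.replace(" ", "%20") in the question's googleSearch method, since that call leaves characters such as & and # unescaped.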

Regarding java - Google search using Java, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/41437689/
