
java - How should I modify my code to parse the article title, preview, and URL from a Google News search?

Reposted. Author: 行者123. Updated: 2023-11-30 07:08:28

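I want to parse three things from a Google News search: 1) the article title, 2) the preview, and 3) the URL.

To do this, I need to adapt my selector to the structure of the page.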

Elements links = Jsoup.connect(google + URLEncoder.encode(search , charset) + news).userAgent(userAgent).get().select( ".g>.r>.a");

Specifically, this part:

( ".g>.r>.a")

How should I modify it?


Full code:

public static void main(String[] args) throws UnsupportedEncodingException, IOException {
    String google = "http://www.google.com/search?q=";
    String search = "stackoverflow";
    String charset = "UTF-8";
    String news = "&tbm=nws";
    String userAgent = "ExampleBot 1.0 (+http://example.com/bot)"; // Change this to your company's name and bot homepage!

    Elements links = Jsoup.connect(google + URLEncoder.encode(search, charset) + news).userAgent(userAgent).get().select(".g>.r>.a");

    for (Element link : links) {
        String title = link.text();
        String url = link.absUrl("href"); // Google returns URLs in format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
        url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");

        if (!url.startsWith("http")) {
            continue; // Ads/news/etc.
        }
        System.out.println("Title: " + title);
        System.out.println("URL: " + url);
    }
}
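The substring call in the loop above assumes that `q` is the first query parameter and that an `&` always follows it. If either assumption fails, `indexOf` returns -1 or the wrong slice is decoded. A more defensive way to unwrap the redirect format described in the comment might look like the sketch below. The class and method names (`RedirectUrls`, `extractTarget`) are made up for illustration, and the `URLDecoder.decode(String, Charset)` overload assumes Java 10 or later.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class RedirectUrls {

    // Returns the decoded value of the "q" query parameter,
    // or the input unchanged when there is no query string or no "q" parameter.
    static String extractTarget(String href) {
        // URI.create throws IllegalArgumentException on malformed input;
        // a real scraper would probably want to catch that.
        String query = URI.create(href).getRawQuery();
        if (query == null) {
            return href;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals("q")) {
                return URLDecoder.decode(pair.substring(eq + 1), StandardCharsets.UTF_8);
            }
        }
        return href;
    }

    public static void main(String[] args) {
        String href = "http://www.google.com/url?q=https%3A%2F%2Fexample.com%2Fa%3Fb%3D1&sa=U&ei=abc";
        System.out.println(extractTarget(href)); // prints https://example.com/a?b=1
    }
}
```

Links that are not wrapped in a redirect pass through unchanged, so the `startsWith("http")` check in the loop still applies afterwards.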


Best answer

How to select the right elements (using Chrome)

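Step one: disable JavaScript in your browser (an add-on such as uMatrix makes this convenient) so that you see the same markup that jsoup receives.

Now right-click an element and choose Inspect, or press Ctrl+Shift+I to open the developer tools. When you hover over the source in the Elements tab, the corresponding element is highlighted in the rendered page. Right-clicking an element in the source offers Copy -> Copy selector. That is a good starting point, but it is sometimes too strict. Here it yields the selector #rso > div:nth-child(3), that is, the third direct child div of the element with ID rso. That is too specific, so we generalize: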

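We select all direct child divs of the element with ID rso: #rso > div.

Then we grab the title anchor, h3 > a. Its text node and its href attribute give us the title and the URL.

Next, we get the inner div with class st (div.st), whose text node contains the preview. If that div is missing, we skip the element.

By passing parameters with .data("key", "value") on the request, we do not need to URL-encode them manually.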
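For comparison, here is roughly what building the query string by hand involves when .data(...) is not used. This is a minimal sketch using only the JDK. The names `QueryStrings` and `buildUrl` are made up for illustration, the search term is an arbitrary example, and the `URLEncoder.encode(String, Charset)` overload assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class QueryStrings {

    // Joins the parameters into "key=value&key=value", form-encoding both keys and values.
    static String buildUrl(String base, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            query.add(URLEncoder.encode(entry.getKey(), StandardCharsets.UTF_8)
                    + "="
                    + URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8));
        }
        return base + "?" + query;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "stack overflow & jsoup");
        params.put("tbm", "nws");
        System.out.println(buildUrl("https://www.google.com/search", params));
        // prints https://www.google.com/search?q=stack+overflow+%26+jsoup&tbm=nws
    }
}
```

Forgetting to encode the search term would let the `&` in it split the query into an extra parameter, which is the kind of mistake .data(...) avoids.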

Sample code

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String userAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36";
String searchTerm = "stackoverflow";
int numberOfResultPages = 2; // grabs the first two pages of search results
String searchUrl = "https://www.google.com/search?";

for (int i = 0; i < numberOfResultPages; i++) {
    try {
        Document doc = Jsoup.connect(searchUrl)
                .userAgent(userAgent)
                .data("q", searchTerm)
                .data("tbm", "nws")
                // "start" is a result offset, not a page index (assuming the default of 10 results per page)
                .data("start", String.valueOf(i * 10))
                .referrer("https://www.google.com/")
                .get();

        for (Element result : doc.select("#rso > div")) {
            Element preview = result.select("div.st").first();
            Element h3a = result.select("h3 > a").first();

            // skip blocks that lack a preview or a title link
            if (preview == null || h3a == null) continue;

            String title = h3a.text();
            String url = h3a.attr("href");

            // just printing title, link and preview to demonstrate the approach
            System.out.println(title + " -> " + url + "\n\t" + preview.text());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Output

Stack Overflow: Movie Magic -> https://geekdad.com/2016/09/stack-overflow-movie-magic-2/
I got to visit the set of Kubo and the Two Strings and see some of the amazing work that went into creating the film. But well before the ...
Will StackOverflow Documentation Realize Its Lofty Goal? -> https://dzone.com/articles/will-stackoverflow-documentation-realize-its-lofty
With the StackOverflow Documentation project now in beta, how close is it to realizing the lofty goals it has set forth for itself? Can it ever ...
Stack Overflow: Progress Report -> https://geekdad.com/2016/09/stack-overflow-progress-report/
Of the books on my list, the only one I totally finished so far is Kidding Ourselves, which I included in this Stack Overflow. And that perhaps is an ...
....

A similar question on this topic can be found on Stack Overflow: https://stackoverflow.com/questions/39629545/
