
java - Scraping information from a web page using Java?

Reposted. Author: 可可西里. Updated: 2023-11-01 17:23:10

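I am trying to extract data from a web page. For example, say I want to fetch information from chess.org.il.

I know the player's ID is 25022, which means I can request http://www.chess.org.il/Players/Player.aspx?Id=25022

On that page I can see that this player's FIDE ID = 2821109.
From there, I can request this page:
http://ratings.fide.com/card.phtml?event=2821109

And from that I can see that stdRating=1602.

How can I get the "stdRating" output from a given "localID" input in Java?

(localID, fideID and stdRating are helper names I use to clarify the question)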

Best answer

You could try univocity-html-parser, which is very easy to use and avoids a lot of spaghetti code.

To get just the standard rating, for example, you can use this code:

public static void main(String... args) {
    // The {EVENT} placeholder is filled in with the player's FIDE ID.
    UrlReaderProvider url = new UrlReaderProvider("http://ratings.fide.com/card.phtml?event={EVENT}");
    url.getRequest().setUrlParameter("EVENT", 2821109);

    HtmlElement doc = HtmlParser.parseTree(url);

    // Find the <small> element containing "std.", then read the text after the <br> matched from there.
    String rating = doc.query()
        .match("small").withText("std.")
        .match("br").getFollowingText()
        .getValue();

    System.out.println(rating);
}

This produces the value 1602.
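For comparison, the same extraction can be sketched with nothing but the JDK. This is not the library's API. It is a plain regular-expression approach, and it assumes the rating appears as text right after a `<small>std.</small><br>` pair, which is inferred from the query above and not verified against the live page. The class name `StdRatingSketch` and the sample HTML are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StdRatingSketch {
    // Assumed markup: <small>std.</small><br>1602 (inferred, not verified against the live page).
    private static final Pattern STD_RATING = Pattern.compile(
            "<small[^>]*>\\s*std\\.?\\s*</small>\\s*<br\\s*/?>\\s*(\\d+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the digits that follow the "std." label, if the assumed markup is present. */
    public static Optional<String> extractStdRating(String html) {
        Matcher m = STD_RATING.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical fragment shaped like the assumed markup.
        String sample = "<td><small>std.</small><br>1602</td><td><small>rapid</small><br>1422</td>";
        System.out.println(extractStdRating(sample).orElse("not found"));
    }
}
```

A regular expression like this breaks as soon as the page layout changes, which is one reason to prefer a real HTML parser for anything beyond a quick check.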

But getting data by querying individual nodes and stitching all the pieces together is not exactly easy.
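To make the stitching concrete, here is a rough JDK-only sketch of the two-step chain from the question: build the player URL from the local ID, pull the FIDE ID out of the link to the rating card, then build the card URL. The class and method names are hypothetical, the link pattern is taken from the URLs in the question, and the sample HTML is an assumption about the page. The `fetch` helper needs Java 11 or later and is not called in `main`, so the example runs without network access.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RatingChainSketch {
    // Link format taken from the question; whether the page still uses it is an assumption.
    private static final Pattern FIDE_LINK =
            Pattern.compile("ratings\\.fide\\.com/card\\.phtml\\?event=(\\d+)");

    public static String playerUrl(int localId) {
        return "http://www.chess.org.il/Players/Player.aspx?Id=" + localId;
    }

    public static String fideCardUrl(String fideId) {
        return "http://ratings.fide.com/card.phtml?event=" + fideId;
    }

    /** Finds the first FIDE card link in the player page and returns its event ID. */
    public static Optional<String> extractFideId(String playerPageHtml) {
        Matcher m = FIDE_LINK.matcher(playerPageHtml);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Downloads a page as text using the JDK HTTP client; no retries or error handling. */
    public static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) {
        // Hypothetical fragment of the player page; real use would pass fetch(playerUrl(25022)).
        String samplePlayerPage =
                "<a href=\"http://ratings.fide.com/card.phtml?event=2821109\">2821109</a>";
        System.out.println(playerUrl(25022));
        System.out.println(extractFideId(samplePlayerPage)
                .map(RatingChainSketch::fideCardUrl)
                .orElse("no FIDE link found"));
    }
}
```

Each step here is a separate request plus a separate hand-written pattern, and the rating itself still has to be extracted from the second page.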

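I expanded the code to illustrate how you can use the parser to get more information into records. Here I created records for the player and her rank details, which are available in the table on the second page. It took me less than 1 hour to get this done: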

public static void main(String... args) {
    UrlReaderProvider url = new UrlReaderProvider("http://www.chess.org.il/Players/Player.aspx?Id={PLAYER_ID}");
    url.getRequest().setUrlParameter("PLAYER_ID", 25022);

    // Fields read from the player page. The Hebrew labels mean "player number" and "date of birth:".
    HtmlEntityList entities = new HtmlEntityList();
    HtmlEntitySettings player = entities.configureEntity("player");
    player.addField("id").match("b").withExactText("מספר שחקן").getFollowingText().transform(s -> s.replaceAll(": ", ""));
    player.addField("name").match("h1").followedImmediatelyBy("b").withExactText("מספר שחקן").getText();
    player.addField("date_of_birth").match("b").withExactText("תאריך לידה:").getFollowingText();
    player.addField("fide_id").matchFirst("a").attribute("href", "http://ratings.fide.com/card.phtml?event=*").getText();

    // Follow the link to the FIDE card and join its rating fields into the player record.
    HtmlLinkFollower playerCard = player.addField("fide_card_url").matchFirst("a").attribute("href", "http://ratings.fide.com/card.phtml?event=*").getAttribute("href").followLink();
    playerCard.addField("rating_std").match("small").withText("std.").match("br").getFollowingText();
    playerCard.addField("rating_rapid").match("small").withExactText("rapid").match("br").getFollowingText();
    playerCard.addField("rating_blitz").match("small").withExactText("blitz").match("br").getFollowingText();
    playerCard.setNesting(Nesting.REPLACE_JOIN);

    // One group of rank rows per section of the rank table on the FIDE card.
    HtmlEntitySettings ratings = playerCard.addEntity("ratings");
    configureRatingsBetween(ratings, "World Rank", "National Rank ISR", "world");
    configureRatingsBetween(ratings, "National Rank ISR", "Continent Rank Europe", "country");
    configureRatingsBetween(ratings, "Continent Rank Europe", "Rating Chart", "continent");

    Results<HtmlParserResult> results = new HtmlParser(entities).parse(url);
    HtmlParserResult playerData = results.get("player");
    String[] playerFields = playerData.getHeaders();

    for (HtmlRecord playerRecord : playerData.iterateRecords()) {
        // Print every player field on one line.
        for (int i = 0; i < playerFields.length; i++) {
            System.out.print(playerFields[i] + ": " + playerRecord.getString(playerFields[i]) + "; ");
        }
        System.out.println();

        // Print the rank records collected from the linked FIDE card.
        HtmlParserResult ratingData = playerRecord.getLinkedEntityData().get("ratings");
        for (HtmlRecord ratingRecord : ratingData.iterateRecords()) {
            System.out.print(" * " + ratingRecord.getString("rank_type") + ": ");
            System.out.println(ratingRecord.fillFieldMap(new LinkedHashMap<>(), "all_players", "active_players", "female", "u16", "female_u16"));
        }
    }
}

// Collects the rank rows found between two bold section headers of the rank table.
private static void configureRatingsBetween(HtmlEntitySettings ratings, String startingHeader, String endingHeader, String rankType) {
    Group group = ratings.newGroup()
        .startAt("table").match("b").withExactText(startingHeader)
        .endAt("b").withExactText(endingHeader);

    // Constant field that labels each record as "world", "country" or "continent".
    group.addField("rank_type", rankType);

    group.addField("all_players").match("tr").withText("World (all", "National (all", "Rank (all").match("td", 2).getText();
    group.addField("active_players").match("tr").followedImmediatelyBy("tr").withText("Female (active players):").match("td", 2).getText();
    group.addField("female").match("tr").withText("Female (active players):").match("td", 2).getText();
    group.addField("u16").match("tr").withText("U-16 Rank (active players):").match("td", 2).getText();
    group.addField("female_u16").match("tr").withText("Female U-16 Rank (active players):").match("td", 2).getText();
}

The output will be:

id: 25022; name: יעל כהן; date_of_birth: 02/02/2003; fide_id: 2821109; rating_std: 1602; rating_rapid: 1422; rating_blitz: 1526; 
* world: {all_players=195907, active_players=94013, female=5490, u16=3824, female_u16=586}
* country: {all_players=1595, active_players=1024, female=44, u16=51, female_u16=3}
* continent: {all_players=139963, active_players=71160, female=3757, u16=2582, female_u16=372}
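The rank values above are printed as text. If you need them as numbers, a small JDK-only helper along these lines could convert one of those maps. It assumes the values are available as strings. The class name `RankMapSketch` and the choice to skip missing or non-numeric values are illustrative assumptions, not part of the library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RankMapSketch {
    /** Copies the map, parsing each value as an int and skipping null or non-numeric values. */
    public static Map<String, Integer> toInts(Map<String, String> fields) {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            String value = e.getValue();
            if (value == null) {
                continue;
            }
            try {
                out.put(e.getKey(), Integer.parseInt(value.trim()));
            } catch (NumberFormatException ignored) {
                // Not a plain integer, so the field is left out of the result.
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Values copied from the "country" line of the output above.
        Map<String, String> country = new LinkedHashMap<>();
        country.put("all_players", "1595");
        country.put("active_players", "1024");
        country.put("female", "44");
        System.out.println(toInts(country));
    }
}
```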

Hope this helps.

Disclosure: I'm the author of this library. It's commercial and closed source, but it can save you a lot of development time.

Regarding "java - Scraping information from a web page using Java?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50914516/
