
Parsing table data in Java

Reposted. Author: 行者123. Last updated: 2023-12-01 09:38:28

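I want to extract some HTML data from a page source. The page in question (view source) is http://www.4icu.org/reviews/index2.htm . How can I extract only the university name and the country name with Java? I know how to extract the university name, which sits between the <a> tags, but how can I make the program quicker by scanning the table only for cells with class="i", and also extract the country (e.g. United States) from <... alt="United States" />?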

<tr>
<td><a name="UNIVERSITIES-BY-NAME"></a><h2>A-Z list of world Universities and Colleges</h2>
</tr>

<tr>
<td class="i"><a href="/reviews/9107.htm"> A.T. Still University</a></td>
<td width="50" align="right" nowrap>us <img src="/i/bg.gif" class="fl flag-us" alt="United States" /></td>
</tr>

Thanks in advance.

EDIT: Following @11thdimension's suggestion, here is my .java file

import java.net.URL;
import java.net.URLConnection;
import java.util.Iterator;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class University {
    public static void main(String[] args) throws Exception {
        System.out.println("Started");

        URL url = new URL("http://www.4icu.org/reviews/index2.htm");

        URLConnection spoof = url.openConnection();
        // Spoof the connection so we look like a web browser
        spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");

        String connect = url.toString();
        Document doc = Jsoup.connect(connect).get();

        Elements cells = doc.select("td.i");

        Iterator<Element> iterator = cells.iterator();

        while (iterator.hasNext()) {
            Element cell = iterator.next();
            String university = cell.select("a").text();
            String country = cell.nextElementSibling().select("img").attr("alt");

            System.out.printf("country : %s, university : %s %n", country, university);
        }
    }
}

However, when I run it, it gives the following error.

Started
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=http://www.4icu.org/reviews/index2.htm

EDIT 2: I created the following program to get the HTTP response headers of the site.

import java.net.URL;
import java.net.URLConnection;
import java.util.List;
import java.util.Map;

public class Get_Header {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.4icu.org/reviews/index2.htm");
        URLConnection connection = url.openConnection();

        // Header name -> list of values; the status line is stored under the null key
        Map<String, List<String>> responseMap = connection.getHeaderFields();
        for (Map.Entry<String, List<String>> entry : responseMap.entrySet()) {
            System.out.println(entry.getKey() + " = ");

            for (String value : entry.getValue()) {
                System.out.println(value + ", ");
            }
        }
    }
}

It returns the following result.

X-Frame-Options = 
SAMEORIGIN,
Transfer-Encoding =
chunked,
null =
HTTP/1.1 403 Forbidden,
CF-RAY =
2ca61c7a769b1980-HKG,
Server =
cloudflare-nginx,
Cache-Control =
max-age=10,
Connection =
keep-alive,
Set-Cookie =
__cfduid=d4f8d740e0ae0dd551be15e031359844d1469853403; expires=Sun, 30-Jul-17 04:36:43 GMT; path=/; domain=.4icu.org; HttpOnly,
Expires =
Sat, 30 Jul 2016 04:36:53 GMT,
Date =
Sat, 30 Jul 2016 04:36:43 GMT,
Content-Type =
text/html; charset=UTF-8,

Although I can get the headers, how should I combine the code from EDIT and EDIT 2 into one complete program? Thanks.

Best answer

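If this is a one-off task, you should probably just use JavaScript.

The following code logs the required names to the console. You have to run it in the browser console.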

(function () {
    var a = [];
    document.querySelectorAll("td.i a").forEach(function (anchor) {
        a.push(anchor.textContent.trim());
    });

    console.log(a.join("\n"));
})();

Here is a Java example using Jsoup selectors

Maven dependency

<dependencies>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.8.3</version>
    </dependency>
</dependencies>

Java code

import java.io.File;
import java.util.Iterator;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TestJsoup {
    public static void main(String[] args) throws Exception {
        System.out.println("Started");

        // Parse a locally saved copy of the page
        File file = new File("A-Z list of 11930 World Colleges & Universities.html");
        Document doc = Jsoup.parse(file, "UTF-8");

        // Each university name lives in a <td class="i"> cell
        Elements cells = doc.select("td.i");

        Iterator<Element> iterator = cells.iterator();

        while (iterator.hasNext()) {
            Element cell = iterator.next();
            String university = cell.select("a").text();
            // The country is the alt text of the flag image in the next cell
            String country = cell.nextElementSibling().select("img").attr("alt");

            System.out.printf("country : %s, university : %s %n", country, university);
        }
    }
}
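
On combining EDIT and EDIT 2: in the question's code the User-Agent is set on the `spoof` URLConnection, which is never used. `Jsoup.connect(...)` opens its own connection, so the request is sent without that header. Below is a minimal sketch of one way to put the pieces together, assuming the header is set on the Jsoup connection instead. The class name `UniversityScraper` and the helper `extract` are made up for illustration, and the User-Agent string is simply the one from the question. A browser-like User-Agent may or may not be enough to avoid the 403, since the site sits behind Cloudflare and may apply other checks.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class UniversityScraper {
    // Assumption: any browser-like User-Agent; this one is copied from the question.
    static final String USER_AGENT =
            "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)";

    /** Hypothetical helper: one "country : university" line per td.i cell. */
    static List<String> extract(Document doc) {
        List<String> rows = new ArrayList<>();
        for (Element cell : doc.select("td.i")) {
            String university = cell.select("a").text();
            Element next = cell.nextElementSibling();
            // Guard against rows that have no second cell.
            String country = (next == null) ? "" : next.select("img").attr("alt");
            rows.add(country + " : " + university);
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        Connection.Response response = Jsoup
                .connect("http://www.4icu.org/reviews/index2.htm")
                .userAgent(USER_AGENT)   // set on the request Jsoup actually sends
                .ignoreHttpErrors(true)  // inspect a 403 instead of throwing
                .execute();

        System.out.println("Status = " + response.statusCode());
        if (response.statusCode() != 200) {
            // Roughly the same information as the header dump in EDIT 2.
            response.headers().forEach((k, v) -> System.out.println(k + " = " + v));
            return;
        }
        extract(response.parse()).forEach(System.out::println);
    }
}
```

Keeping the parsing in `extract` means it can be tried on a saved file or an inline HTML string (via `Jsoup.parse`) without touching the network, which is handy while the 403 is still being sorted out.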

Regarding parsing table data in Java, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38641366/
