java - How to parse multiple HTML elements in JSOUP?

Reposted. Author: 行者123. Updated: 2023-11-30 02:17:41

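I am trying to parse a simple HTML table of crime statistics from the Garda (the Irish police) out of an HTML document saved in my Java project. At the moment I am trying to parse the contents of the HTML document and print them to the console. The problem is that I can only print the numbers in the table (excluding the years), but what I want is to get each crime name from the table, followed by the 6 numbers that belong to it.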

Screenshot of the html table (Cannot embed the image as my reputation is too low)

HTML table

<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Recorded Crime Offences (Number) by Garda Station, Type of Offence and&lt;BR&gt;
Year</title>
</head>
<body>
<table border="">
<tbody><tr align="LEFT">
<th colspan="8">Recorded Crime Offences (Number) by Garda Station, Type of Offence and<br>
Year</th>
</tr>
<tr align="LEFT">
<th colspan="2"> </th>
<th valign="TOP" colspan="1">2011</th>
<th valign="TOP" colspan="1">2012</th>
<th valign="TOP" colspan="1">2013</th>
<th valign="TOP" colspan="1">2014</th>
<th valign="TOP" colspan="1">2015</th>
<th valign="TOP" colspan="1">2016</th>
</tr>
<tr align="RIGHT">
<th align="LEFT" valign="TOP" rowspan="12">Balbriggan, D.M.R. Northern Division</th>
<th align="LEFT">03 ,Attempts/threats to murder, assaults, harassments and related offences</th>
<td>96</td>
<td>89</td>
<td>70</td>
<td>97</td>
<td>103</td>
<td>103</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">04 ,Dangerous or negligent acts</th>
<td>72</td>
<td>67</td>
<td>50</td>
<td>53</td>
<td>45</td>
<td>43</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">05 ,Kidnapping and related offences</th>
<td>0</td>
<td>0</td>
<td>1</td>
<td>3</td>
<td>0</td>
<td>7</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">06 ,Robbery, extortion and hijacking offences</th>
<td>16</td>
<td>19</td>
<td>16</td>
<td>7</td>
<td>11</td>
<td>13</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">07 ,Burglary and related offences</th>
<td>177</td>
<td>190</td>
<td>157</td>
<td>140</td>
<td>151</td>
<td>139</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">08 ,Theft and related offences</th>
<td>510</td>
<td>466</td>
<td>495</td>
<td>542</td>
<td>445</td>
<td>302</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">09 ,Fraud, deception and related offences</th>
<td>66</td>
<td>76</td>
<td>126</td>
<td>114</td>
<td>98</td>
<td>66</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">10 ,Controlled drug offences</th>
<td>113</td>
<td>100</td>
<td>64</td>
<td>55</td>
<td>44</td>
<td>80</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">11 ,Weapons and Explosives Offences</th>
<td>22</td>
<td>18</td>
<td>13</td>
<td>10</td>
<td>19</td>
<td>17</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">12 ,Damage to property and to the environment</th>
<td>257</td>
<td>266</td>
<td>269</td>
<td>203</td>
<td>213</td>
<td>177</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">13 ,Public order and other social code offences</th>
<td>168</td>
<td>115</td>
<td>93</td>
<td>78</td>
<td>79</td>
<td>92</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">15 ,Offences against government, justice procedures and organisation of crime</th>
<td>45</td>
<td>48</td>
<td>39</td>
<td>39</td>
<td>66</td>
<td>50</td>
</tr>
<tr align="LEFT">
<td colspan="8"><a href="http://www.cso.ie/en/methods/crime/recordedcrime/">See Background Notes</a>
</td>
</tr>
</tbody></table>

</body></html>

The code I have come up with so far prints the numbers like this

Figure 0 : 96
Figure 1 : 89
Figure 2 : 70
Figure 3 : 97
Figure 4 : 103
Figure 5 : 103
Figure 6 : 72
Figure 7 : 67
Figure 8 : 50
Figure 9 : 53
Figure 10 : 45
... (Figures 11-66 omitted for conciseness)
Figure 67 : 48
Figure 68 : 39
Figure 69 : 39
Figure 70 : 66
Figure 71 : 50

But I would like the output to look more like this

Crime: 03 ,Attempts/threats to murder, assaults, harassments and related offences
Figure 0 : 96
Figure 1 : 89
Figure 2 : 70
Figure 3 : 97
Figure 4 : 103
Figure 5 : 103

Crime: 04 ,Dangerous or negligent acts
Figure 6 : 72
Figure 7 : 67
Figure 8 : 50
Figure 9 : 53
Figure 10 : 45
etc, etc

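I have tried several different approaches, such as adding one for loop to access the th elements that contain the crimes and then another to access the td elements that contain the numbers, but this usually results in an error like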

Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0  
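
The message `Index: 0, Size: 0` means `get(0)` was called on an empty list. The failing code is not shown, so the following is only a guess at one way this can happen: a `th` element has no `td` descendants, so selecting `td` inside it matches nothing. In this minimal sketch the class name `EmptySelectionSketch` and the helper `firstCellText` are hypothetical; `Elements.first()` is used because it returns null on an empty selection instead of throwing.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class EmptySelectionSketch {

    // Returns the text of the first <td> under the given element,
    // or null when the selection is empty (instead of throwing).
    static String firstCellText(Element scope) {
        Elements cells = scope.select("td");
        Element first = cells.first(); // null when nothing matched
        return first == null ? null : first.text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(
            "<table><tr><th>08 ,Theft and related offences</th><td>510</td></tr></table>");
        Element header = doc.select("th").first();
        Element row = doc.select("tr").first();

        // A <th> has no <td> descendants, so get(0) on that selection would throw.
        System.out.println(firstCellText(header)); // prints "null"
        // The enclosing <tr> does contain the <td>.
        System.out.println(firstCellText(row));    // prints "510"
    }
}
```

If this is the cause, the figures have to be selected from the enclosing `tr`, not from the `th` itself.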

Working parser class

import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParseCrimeStatistics {

    public static void main(String[] args) {
        try {

            int count = 0;
            File input = new File("Balbriggan.html");
            Document doc = Jsoup.parse(input, "UTF-8", "http://www.cso.ie");

            Elements title = doc.select("td");

            for (Element sectd1 : title) {
                Elements ths = sectd1.select("td");

                String result = ths.get(0).text();

                System.out.println("Figure " + count + " : " + result);

                count++;

            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Does anyone have any suggestions on how I could solve this? Thanks.

Best answer

Try this:

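int count = 0;
File input = new File("Balbriggan.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.cso.ie");

Elements numbers = doc.select("td");
Elements titles = doc.select("th");

// th indices 0-8 are the caption, the blank corner cell, the six years
// and the station name, so the crime names start at index 9.
for (int i = 9; i < titles.size(); i++) {
    System.out.println("Crime: " + titles.get(i).text());
    // Each crime row has 6 figures, so row (i - 9) starts at td index (i - 9) * 6.
    for (int j = 0; j < 6; j++) {
        System.out.println("Figure " + count + " : " + numbers.get((i - 9) * 6 + j).text());
        count++;
    }
}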
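
The snippet above depends on fixed offsets (9 leading `th` elements, 6 figures per row), which would break if the table gained a year or a second station. As an alternative, here is a minimal sketch that walks the table row by row, assuming every data row holds at least one `th` and at least one `td`, with the crime name in the last `th`. The class name `RowWiseCrimeParser`, the method `format` and the shortened inline HTML are hypothetical and only stand in for `Balbriggan.html`; this is not the method from the original answer.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class RowWiseCrimeParser {

    // Builds the output lines row by row: one "Crime:" line per data row,
    // followed by one "Figure" line per <td> in that row.
    static List<String> format(Document doc) {
        List<String> lines = new ArrayList<>();
        int count = 0;
        for (Element row : doc.select("tr")) {
            Elements headers = row.select("th");
            Elements figures = row.select("td");
            // Skip the caption row, the year row and the footer row:
            // each of them lacks either a <th> or a <td>.
            if (headers.isEmpty() || figures.isEmpty()) {
                continue;
            }
            // The first data row also carries the station name in a rowspan <th>,
            // so the crime name is taken from the last <th> of the row.
            lines.add("Crime: " + headers.last().text());
            for (Element figure : figures) {
                lines.add("Figure " + count + " : " + figure.text());
                count++;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the real table (assumption, not the full file).
        String html = "<table>"
            + "<tr><th colspan=\"2\"> </th><th>2011</th><th>2012</th></tr>"
            + "<tr><th rowspan=\"2\">Balbriggan</th>"
            + "<th>04 ,Dangerous or negligent acts</th><td>72</td><td>67</td></tr>"
            + "<tr><th>05 ,Kidnapping and related offences</th><td>0</td><td>0</td></tr>"
            + "<tr><td colspan=\"4\"><a href=\"#\">See Background Notes</a></td></tr>"
            + "</table>";
        format(Jsoup.parse(html)).forEach(System.out::println);
    }
}
```

To run it against the saved file, pass `Jsoup.parse(new File("Balbriggan.html"), "UTF-8", "http://www.cso.ie")` to `format` as in the question. That call throws `IOException`, so it needs the same try/catch.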

Regarding "java - How to parse multiple HTML elements in JSOUP?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/47801470/
