
ruby - XPath with Nokogiri returns an empty array

Reposted. Author: 太空宇宙. Updated: 2023-11-03 17:36:30

I'm trying to parse a page using Nokogiri, Mechanize, and XPath, but no matter what I try, I get back an empty array.

Page I'm trying to Parse.

I inspected the page in Chrome and copied the XPath from there, then tried several ways of parsing it, but I always get an empty array.

I tried:

puts page.search('/html/body/div/table/tbody/tr[2]/td/table/tbody/tr[2]/td[2]').inspect

puts post_page.parser.xpath('/html/body/div/table/tbody/tr[2]/td/table/tbody/tr[2]/td[2]').inspect

puts post_page.parser.at_xpath('/html/body/div/table/tbody/tr[2]/td/table/tbody/tr[2]/td[2]').inspect

All of them with and without a trailing "/text".

Here is the source of the page I'm trying to scrape:

<SCRIPT language="JavaScript">
<!--
document.cookie = "IV_JCT=%2FMPIS; path=/";
//-->
</SCRIPT>

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>

<head>
<title>My Schedule</title>

<meta http-equiv="pragma" content="no-cache">
<meta http-equiv="cache-control" content="no-cache">
<meta http-equiv="expires" content="-1">
<meta http-equiv="keywords" content="keyword1,keyword2,keyword3">
<meta http-equiv="description" content="This is my schedule">
<!--
<link rel="stylesheet" type="text/css" href="styles.css">
-->

</head>

<body>
<div align="center">
<strong>My Schedule</strong><br>as of Sun Feb 24 2013 06:43:09 PM CST<br><br>
<div align="left"><pre><br>Employee Name: Johnson Appleseed
Unit = 12345</pre>
<br>
</div>

<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tr>
<td colspan="8" align="center"><b><font size="+1">Schedules may be subject to change based on business needs or demand</font></b></td>
</tr>

<tr><td>
<table border="4" bordercolor="#2D73B9" cellpadding="2" cellspacing="2" width="100%">
<tr bgcolor="#7C9BCF">
<td width="12%" align="center"><b>Sunday</b></td>
<td width="12%" align="center"><b>Monday</b></td>
<td width="12%" align="center"><b>Tuesday</b></td>
<td width="12%" align="center"><b>Wednesday</b></td>
<td width="12%" align="center"><b>Thursday</b></td>
<td width="12%" align="center"><b>Friday</b></td>
<td width="12%" align="center"><b>Saturday</b></td>
<td rowspan="2" width="12%" align="center"><b>Total weekly Hours</b></td>
</tr>

<tr bgcolor="#7C9BCF">

<td width="14%" align="center">2013-02-24</td>

<td width="14%" align="center">2013-02-25</td>

<td width="14%" align="center">2013-02-26</td>

<td width="14%" align="center">2013-02-27</td>

<td width="14%" align="center">2013-02-28</td>

<td width="14%" align="center">2013-03-01</td>

<td width="14%" align="center">2013-03-02</td>

</tr>

<tr bgcolor="#FFFFFF">

<td width="14%" align="left"><pre>&nbsp;</pre></td>

<td width="14%" align="left"><pre><b>Shift: </b>
5:30 PM - 9:00 PM
<b>Meal:</b>
- </pre></td>


<td width="14%" align="left"><pre>&nbsp;</pre></td>

<td width="14%" align="left"><pre>&nbsp;</pre></td>

<td width="14%" align="left"><pre>&nbsp;</pre></td>

<td width="14%" align="left"><pre><b>Shift: </b>
2:00 PM - 9:15 PM
<b>Meal:</b>
5:45 PM - 6:30 PM</pre></td>

<td width="14%" align="left"><pre><b>Shift: </b>
4:45 PM - 9:15 PM
<b>Meal:</b>
- </pre></td>

<td width="12%" align="center">14.5</td>
</tr>

<tr bgcolor="#FFFFFF">

<td width="14%" align="center">0.0</td>

<td width="14%" align="center">3.5</td>

<td width="14%" align="center">0.0</td>

<td width="14%" align="center">0.0</td>

<td width="14%" align="center">0.0</td>

<td width="14%" align="center">6.5</td>

<td width="14%" align="center">4.5</td>

<td width="14%" align="center">Daily Hours</td>
</tr>

</table>
</td></tr>

<tr><td>
<table border="4" bordercolor="#2D73B9" cellpadding="2" cellspacing="2" width="100%">
<tr bgcolor="#7C9BCF">
<td width="12%" align="center"><b>Sunday</b></td>
<td width="12%" align="center"><b>Monday</b></td>
<td width="12%" align="center"><b>Tuesday</b></td>
<td width="12%" align="center"><b>Wednesday</b></td>
<td width="12%" align="center"><b>Thursday</b></td>
<td width="12%" align="center"><b>Friday</b></td>
<td width="12%" align="center"><b>Saturday</b></td>
<td rowspan="2" width="12%" align="center"><b>Total weekly Hours</b></td>
</tr>

<tr bgcolor="#7C9BCF">

<td width="14%" align="center">2013-03-03</td>

<td width="14%" align="center">2013-03-04</td>

<td width="14%" align="center">2013-03-05</td>

<td width="14%" align="center">2013-03-06</td>

<td width="14%" align="center">2013-03-07</td>

<td width="14%" align="center">2013-03-08</td>

<td width="14%" align="center">2013-03-09</td>

</tr>

<tr bgcolor="#FFFFFF">

<td width="14%" align="left"><pre>Sched Not Posted</pre></td>

<td width="14%" align="left"><pre>Sched Not Posted</pre></td>

<td width="14%" align="left"><pre>Sched Not Posted</pre></td>

<td width="14%" align="left"><pre>Sched Not Posted</pre></td>

<td width="14%" align="left"><pre>Sched Not Posted</pre></td>

<td width="14%" align="left"><pre>Sched Not Posted</pre></td>

<td width="14%" align="left"><pre>Sched Not Posted</pre></td>

<td width="12%" align="center">0.0</td>
</tr>

<tr bgcolor="#FFFFFF">

<td width="14%" align="center">0.0</td>

<td width="14%" align="center">0.0</td>

<td width="14%" align="center">0.0</td>

<td width="14%" align="center">0.0</td>

<td width="14%" align="center">0.0</td>

<td width="14%" align="center">0.0</td>

<td width="14%" align="center">0.0</td>

<td width="14%" align="center">Daily Hours</td>
</tr>

</table>
</td></tr>

<tr><td>
<table border="4" bordercolor="#2D73B9" cellpadding="2" cellspacing="2" width="100%">
<tr bgcolor="#7C9BCF">
<td width="12%" align="center"><b>Sunday</b></td>
<td width="12%" align="center"><b>Monday</b></td>
<td width="12%" align="center"><b>Tuesday</b></td>
<td width="12%" align="center"><b>Wednesday</b></td>
<td width="12%" align="center"><b>Thursday</b></td>
<td width="12%" align="center"><b>Friday</b></td>
<td width="12%" align="center"><b>Saturday</b></td>
<td rowspan="2" width="12%" align="center"><b>Total weekly Hours</b></td>
</tr>

<tr bgcolor="#7C9BCF">

<td width="14%" align="center">2013-03-10</td>

<td width="14%" align="center">2013-03-11</td>

<td width="14%" align="center">2013-03-12</td>

<td width="14%" align="center">2013-03-13</td>

<td width="14%" align="center">2013-03-14</td>

<td width="14%" align="center">2013-03-15</td>

<td width="14%" align="center">2013-03-16</td>

</tr>

<tr bgcolor="#FFFFFF">

<td width="14%" align="left"><pre>Sched Not Posted</pre></td>

<td width="14%" align="left"><pre>Sched Not Posted</pre></td>

<td width="14%" align="left"><pre>Sched Not Posted</pre></td>

<td width="14%" align="left"><pre>Sched Not Posted</pre></td>

<td width="14%" align="left"><pre>Sched Not Posted</pre></td>

<td width="14%" align="left"><pre>Sched Not Posted</pre></td>

<td width="14%" align="left"><pre>Sched Not Posted</pre></td>

<td width="12%" align="center">0.0</td>
</tr>

<tr bgcolor="#FFFFFF">

<td width="14%" align="center">0.0</td>

<td width="14%" align="center">0.0</td>

<td width="14%" align="center">0.0</td>

<td width="14%" align="center">0.0</td>

<td width="14%" align="center">0.0</td>

<td width="14%" align="center">0.0</td>

<td width="14%" align="center">0.0</td>

<td width="14%" align="center">Daily Hours</td>
</tr>

</table>
</td></tr>

<tr>
<td colspan="8" align="center"><b><font size="+1">Schedules may be subject to change based on business needs or demand</font></b></td>
</tr>
</table >

<p><br>
</p>
<p class="align_center" >
<input type=button value="Print this page" onClick="javascript:window.print();">
<input type=button value="Close This Window" onClick="javascript:window.close();">
</p>

</div>
</body>

</html>

Best answer

Notice that your XPath accessors require tbody to be part of the path:

puts page.search('/html/body/div/table/tbody/tr[2]/td/table/tbody/tr[2]/td[2]').inspect
puts post_page.parser.xpath('/html/body/div/table/tbody/tr[2]/td/table/tbody/tr[2]/td[2]').inspect
puts post_page.parser.at_xpath('/html/body/div/table/tbody/tr[2]/td/table/tbody/tr[2]/td[2]').inspect

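The HTML has no tbody tags, so the lookups fail. Browsers insert tbody into the DOM they build, which is why an XPath copied from Chrome's inspector contains it even though the raw HTML that Nokogiri parses does not.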

Try simplifying your accessors. I usually start with CSS, which Nokogiri supports, and switch to XPath only if I can't get there that way. Your mileage may vary.

For example:

(rdb:1) puts doc.at('table table tr').to_html

Output:

<tr bgcolor="#7C9BCF">
<td width="12%" align="center"><b>Sunday</b></td>
<td width="12%" align="center"><b>Monday</b></td>
<td width="12%" align="center"><b>Tuesday</b></td>
<td width="12%" align="center"><b>Wednesday</b></td>
<td width="12%" align="center"><b>Thursday</b></td>
<td width="12%" align="center"><b>Friday</b></td>
<td width="12%" align="center"><b>Saturday</b></td>
<td rowspan="2" width="12%" align="center"><b>Total weekly Hours</b></td>
</tr>

That's a simpler way to get the column headers.

To get to the second row, you can use:

(rdb:1) puts doc.at('table table tr[2]').to_html

Which gets you:

<tr bgcolor="#7C9BCF">
<td width="14%" align="center">2013-02-24</td>
<td width="14%" align="center">2013-02-25</td>
<td width="14%" align="center">2013-02-26</td>
<td width="14%" align="center">2013-02-27</td>
<td width="14%" align="center">2013-02-28</td>
<td width="14%" align="center">2013-03-01</td>
<td width="14%" align="center">2013-03-02</td>
</tr>

To get the cell contents, you can use:

(rdb:1) puts doc.search('table table tr[2] td').map(&:text)

Which returns:

2013-02-24
2013-02-25
2013-02-26
2013-02-27
2013-02-28
2013-03-01
2013-03-02
2013-03-03
2013-03-04
2013-03-05
2013-03-06
2013-03-07
2013-03-08
2013-03-09
2013-03-10
2013-03-11
2013-03-12
2013-03-13
2013-03-14
2013-03-15
2013-03-16

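Notice how it returned the dates for all three inner tables, not just the first. To restrict it to the first table, we can use at instead of search. at returns the first matching Node, whereas search returns a NodeSet, which is similar to an Array. Also, search looks through the entire document to find all matches, unlike at.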

This code finds the second row of the first table, then iterates over the cells inside it:

(rdb:1) puts doc.at('table table tr[2]').search('td').map(&:text)
2013-02-24
2013-02-25
2013-02-26
2013-02-27
2013-02-28
2013-03-01
2013-03-02

It's simpler, and easier to understand and maintain.

Regarding "ruby - XPath with Nokogiri returns an empty array", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/15059320/
