html - Extract the HTML table immediately following specified text

Reposted. Author: 行者123. Updated: 2023-12-03 17:32:56

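I am trying to scrape an HTML table from a web page. However, the page contains many HTML tables that I do not want. To identify the table to scrape, I want to take the first table that follows a specific phrase (the phrase is not inside the table; it is part of the surrounding text). Here is an example:

This is the table I am interested in: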

library(XML)
url <- "http://www.sec.gov/Archives/edgar/data/1301063/000119312514133663/0001193125-14-133663.txt"
# Read every table on the page and keep the 29th one
readHTMLTable(url, trim = TRUE, header = FALSE, stringsAsFactors = FALSE)[29]


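The criterion I want to use to detect this table is that it is the first table following this phrase:

"safety, health, environmental and sustainability challenges"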

library(RCurl)  # provides getURL()
html <- getURL(url, followlocation = TRUE)
doc <- htmlParse(html, asText = TRUE)
# Collect visible text nodes, skipping script, style, noscript and form content
text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
grep("safety, health, environmental and sustainability challenges", text, value = TRUE)
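To check which text nodes actually carry the phrase, the same filtering idea can be tried on a small inline document. This is only a sketch: the HTML string is a made-up stand-in for the filing, and `fixed = TRUE` is an assumption added here so that the phrase is matched literally rather than as a regular expression.

```r
library(XML)

# Hypothetical miniature document standing in for the real filing.
html <- '<html><body>
  <p>Intro text.</p>
  <script>var x = "safety, health, environmental and sustainability challenges";</script>
  <p>Oversight of safety, health, environmental and sustainability challenges.</p>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# Text nodes outside <script> and <style>, as in the question's XPath.
text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)]", xmlValue)

phrase <- "safety, health, environmental and sustainability challenges"

# Positions of the text nodes that contain the phrase (literal, case-sensitive match).
hits <- which(grepl(phrase, text, fixed = TRUE))
print(text[hits])
```

This confirms that the phrase is present, but the positions index the flattened text vector, not the document tree, so by themselves they do not say which table comes next.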

Best answer

I think this is what you are looking for:

xpathSApply(doc, '//text()[contains(., "safety, health, environmental and sustainability challenges")]/following::table[1]')
## <table cellspacing="0" cellpadding="0" width="100%" border="0" style="BORDER-COLLAPSE:COLLAPSE" align="center">
## <tr><td width="48%"/>
## <td valign="bottom" width="12%"/>
## <td/>
## <td/>
## <td/>
## <td valign="bottom" width="12%"/>
## <td/>
## <td/>
## <td/>
## <td valign="bottom" width="12%"/>
## <td/>
## <td/>
## <td/>
## <td valign="bottom" width="12%"/>
## <td/>
## <td/>
## <td/></tr>
## <tr><td valign="bottom" nowrap="nowrap" align="center" style="border-bottom:1px solid #000000"> <p style="margin-top:0px;margin-bottom:1px" align="center"><font style="font-family:Times New Roman" size="1"><b>Name</b></font></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom" colspan="2" nowrap="nowrap" align="center" style="border-bottom:1px solid #000000"><font style="font-family:Times New Roman" size="1"><b>Audit<br/>Committee</b></font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom" colspan="2" nowrap="nowrap" align="center" style="border-bottom:1px solid #000000"><font style="font-family:Times New Roman" size="1"><b>Compensation<br/>Committee</b></font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom" colspan="2" nowrap="nowrap" align="center" style="border-bottom:1px solid #000000"><font style="font-family:Times New Roman" size="1"><b>Nominating and<br/>Corporate<br/>Governance<br/>Committee</b></font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom" colspan="2" nowrap="nowrap" align="center" style="border-bottom:1px solid #000000"><font style="font-family:Times New Roman" size="1"><b>Safety, Health,<br/>Environmental and<br/>Sustainability<br/>Committee</b></font></td>
## <td valign="bottom"><font size="1"> </font></td></tr>
## <tr bgcolor="#cceeff"><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">Kevin S. Crutchfield</font><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex">(1)</sup></font><font style="font-family:Times New Roman" size="2"/></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td></tr>
## <tr><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">Angelo C. Brisimitzakis</font></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td></tr>
## <tr bgcolor="#cceeff"><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">William J. Crowley, Jr.</font></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td></tr>
## <tr><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">E. Linn Draper, Jr.</font></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/></tr>
## <tr bgcolor="#cceeff"><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">Glenn A. Eisenberg</font><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex">(2)</sup></font><font style="font-family:Times New Roman" size="2"/></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"/></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex">(3)</sup></font><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td></tr>
## <tr><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">Deborah M. Fretz</font></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"/></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex">(3)</sup></font><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/></tr>
## <tr bgcolor="#cceeff"><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">P. Michael Giftos</font></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"/></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex">(3)</sup></font><font style="font-family:Times New Roman" size="2"> </font></td></tr>
## <tr><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">L. Patrick Hassey</font></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/></tr>
## <tr bgcolor="#cceeff"><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">Joel Richards, III</font></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"/></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex">(3)</sup></font><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td></tr>
## </table>
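The XPath returns the raw table node. As a possible next step, the node can be passed to `readHTMLTable()` to obtain a data frame. The sketch below assumes a made-up three-table page instead of the real filing, and the `(...)[1]` wrapper around the text match is an addition, not part of the answer: it limits the result to the table after the first occurrence of the phrase in case the phrase appears more than once.

```r
library(XML)

# Hypothetical miniature page: the marker phrase sits between the first and second table.
html <- '<html><body>
  <table><tr><td>skip</td><td>me</td></tr></table>
  <p>Oversight of safety, health, environmental and sustainability challenges.</p>
  <table>
    <tr><td>Name</td><td>Audit</td></tr>
    <tr><td>A. Director</td><td>X</td></tr>
  </table>
  <table><tr><td>later</td><td>table</td></tr></table>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

phrase <- "safety, health, environmental and sustainability challenges"

# First text node containing the phrase, then the first <table> after it in document order.
xp <- sprintf('(//text()[contains(., "%s")])[1]/following::table[1]', phrase)
nodes <- getNodeSet(doc, xp)

# Convert the matched node to a data frame; NULL if the phrase or a following table is missing.
target <- if (length(nodes) > 0) {
  readHTMLTable(nodes[[1]], header = FALSE, stringsAsFactors = FALSE)
} else {
  NULL
}
print(target)
```

Note that XPath 1.0 `contains()` is case-sensitive, so a title-cased variant of the phrase, such as the one in the table header, would not match.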

This answer comes from a similar question on Stack Overflow about html - extracting the HTML table immediately following specified text: https://stackoverflow.com/questions/31997898/
