
html - Extracting the content of HTML tags using R

Reposted. Author: 行者123. Updated: 2023-12-05 08:57:26

I am trying to extract the content between specific HTML tags, for example:

<dl class="search-advanced-list">
<dt>
<h2><a id="/advanced-search?intercept=adv&amp;as-advanced=+documenttype%3Asource title:%22ADB%22&amp;as-type=advanced" name="ADB">ADB</a></h2>
</dt>
<dd>Allgemeine deutsche Biographie. Under the auspices of the Historical Commission of the Royal Academy of Sciences. 56 vols. Leipzig: Duncker &amp; Humblot. 1875&#8211;1912.</dd>
<dt>
<h2><a id="/advanced-search?intercept=adv&amp;as-advanced=+documenttype%3Asource title:%22AMS%22&amp;as-type=advanced" name="AMS">AMS</a></h2>
</dt>
<dd>American men of science. J. McKeen Cattell, ed. Editions 1&#8211;4, New York: 1906&#8211;27.</dd>
<dt>
<h2><a id="/advanced-search?intercept=adv&amp;as-advanced=+documenttype%3Asource title:%22Abbott%2C+C.+C.+1861%22&amp;as-type=advanced" name="Abbott__C__C__1861">Abbott, C. C. 1861</a></h2>
</dt>
<dd>Abbott, Charles Compton. 1861. Notes on the birds of the Falkland Islands. Ibis 3: 149&#8211;67.</dd>
...
</dl>

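I want to extract the content of each <h2>...</h2> and each <dd>...</dd>. I searched Stack Overflow for similar questions but still could not figure it out. Does anyone have a simple way to do this in R?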

Best answer

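This creates a two-column matrix m whose first column holds the h2 values and whose second column holds the associated dd values. Since the question gives no information about the form of the input, we assume it is a character string, Lines; if it is not, the htmlTreeParse line can be changed accordingly. Try ?htmlTreeParse for more information.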

library(XML)

# Parse the HTML string into an internal document so XPath queries can be run on it.
doc <- htmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)

# For each h2 node, pair its text with the first dd that follows it.
# A relative path is needed here: "//dd" would match every dd in the document.
f <- function(x) cbind(h2 = xmlValue(x), dd = xpathSApply(x, "following::dd[1]", xmlValue))
L <- xpathApply(doc, "//h2", f)
m <- do.call(rbind, L)

Here we show the h2 column and the first 10 characters of the dd column:

> cbind(h2 = m[,1], dd = substr(m[,2], 1, 10))
     h2                   dd
[1,] "ADB"                "Allgemeine"
[2,] "AMS"                "American m"
[3,] "Abbott, C. C. 1861" "Abbott, Ch"
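
If the HTML is stored in a file rather than in a string, the parsing step might look like the sketch below. The temporary file and the shortened HTML are stand-ins made up for illustration; only htmlTreeParse, xpathSApply and xmlValue come from the XML package.

```r
library(XML)

# Hypothetical stand-in for a saved page: write a shortened copy of the HTML to a temp file.
path <- tempfile(fileext = ".html")
writeLines(c(
  '<dl class="search-advanced-list">',
  '<dt><h2><a name="ADB">ADB</a></h2></dt>',
  '<dd>Allgemeine deutsche Biographie.</dd>',
  '</dl>'
), path)

# For a file path, leave asText at its default (FALSE) so the argument is read as a file name.
doc <- htmlTreeParse(path, useInternalNodes = TRUE)

# With one entry in the file, each query returns a single value.
h2 <- xpathSApply(doc, "//h2", xmlValue)
dd <- xpathSApply(doc, "//dd", xmlValue)
```

Only the call to htmlTreeParse changes; the XPath queries run on the parsed document in the same way as for a string input.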

Here is the input used above:

Lines <- '<dl class="search-advanced-list">
<dt>
<h2><a id="/advanced-search?intercept=adv&amp;as-advanced=+documenttype%3Asource title:%22ADB%22&amp;as-type=advanced" name="ADB">ADB</a></h2>
</dt>
<dd>Allgemeine deutsche Biographie. Under the auspices of the Historical Commission of the Royal Academy of Sciences. 56 vols. Leipzig: Duncker &amp; Humblot. 1875&#8211;1912.</dd>
<dt>
<h2><a id="/advanced-search?intercept=adv&amp;as-advanced=+documenttype%3Asource title:%22AMS%22&amp;as-type=advanced" name="AMS">AMS</a></h2>
</dt>
<dd>American men of science. J. McKeen Cattell, ed. Editions 1&#8211;4, New York: 1906&#8211;27.</dd>
<dt>
<h2><a id="/advanced-search?intercept=adv&amp;as-advanced=+documenttype%3Asource title:%22Abbott%2C+C.+C.+1861%22&amp;as-type=advanced" name="Abbott__C__C__1861">Abbott, C. C. 1861</a></h2>
</dt>
<dd>Abbott, Charles Compton. 1861. Notes on the birds of the Falkland Islands. Ibis 3: 149&#8211;67.</dd>
</dl>'
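
One detail is worth checking when adapting this: when xpathSApply is called on a single node, a path that starts with // still searches the whole document, whereas a path that starts with an axis such as following:: is evaluated relative to that node. A minimal sketch, using a made-up two-entry list rather than the real page:

```r
library(XML)

# Made-up HTML with two dt/dd entries, for illustration only.
html <- '<dl>
<dt><h2>A</h2></dt><dd>first</dd>
<dt><h2>B</h2></dt><dd>second</dd>
</dl>'
doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)
first_h2 <- getNodeSet(doc, "//h2")[[1]]

# Absolute path: ignores the context node and matches every dd in the document.
all_dd <- xpathSApply(first_h2, "//dd", xmlValue)

# Relative path: only the first dd that follows this particular h2.
own_dd <- xpathSApply(first_h2, "following::dd[1]", xmlValue)
```

Here all_dd holds both dd values while own_dd holds only the one that belongs to the first h2, which is the pairing needed for a one-row-per-entry result.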

Regarding "html - Extracting the content of HTML tags using R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/32921284/
