
xml - Scraping hierarchical data

Reposted. Author: 数据小太阳. Updated: 2023-10-29 02:12:30

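I am trying to scrape the list of department stores by continent and country from global Dept stores. I am running the code below to get the continents first, because, as the XML hierarchy shows, the countries of each continent are not child nodes of that continent.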

> url<-"http://en.wikipedia.org/wiki/List_of_department_stores_by_country"
> doc = htmlTreeParse(url, useInternalNodes = T)
> nodeNames = getNodeSet(doc, "//h2/span[@class='mw-headline']")
> # For Africa
> xmlChildren(nodeNames[[1]])
$a
<a href="/wiki/Africa" title="Africa">Africa</a>

attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"
> xmlSize(nodeNames[[1]])
[1] 1

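I know I can fetch the countries with a separate getNodeSet call, but I want to make sure I am not missing anything. Is there a smarter way to get all the data within each continent at once, and then all the data within each country at once?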

Best answer

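In XPath, several paths can be combined with the | separator. I use that to get the countries and the shops in one list. Then I get a second list that holds only the countries, and I use that second list to split the first.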

url <- "http://en.wikipedia.org/wiki/List_of_department_stores_by_country"
library(XML)
xmltext <- htmlTreeParse(url, useInternalNodes = TRUE)

## Here I use the combined xpath: list items (shops) and h3 headings (countries)
cont.shops <- xpathApply(xmltext, '//*[@id="mw-content-text"]/ul/li|
//*[@id="mw-content-text"]/h3', xmlValue)
cont.shops <- do.call(rbind, cont.shops) ## from list to one-column character matrix


head(cont.shops) ## first element is country followed by shops
[,1]
[1,] "[edit] Â Tunisia"
[2,] "Magasin Général"
[3,] "Mercure Market"
[4,] "Promogro"
[5,] "Geant"
[6,] "Carrefour"
## Get all the countries (the h3 headings) in one list
contries <- xpathApply(xmltext, '//*[@id="mw-content-text"]/h3', xmlValue)
contries <- do.call(rbind, contries) ## from list to one-column character matrix

head(contries)
[,1]
[1,] "[edit] Â Tunisia"
[2,] "[edit] Â Morocco"
[3,] "[edit] Â Ghana"
[4,] "[edit] Â Kenya"
[5,] "[edit] Â Nigeria"
[6,] "[edit] Â South Africa"
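
The combined XPath can be tried offline before pointing it at the live page. The sketch below is only an illustration: the HTML string is a made-up miniature of the page layout (headings and lists as siblings under one container), and the names html, doc and mixed are hypothetical. As the output above suggests, the union appears to come back in document order, so each heading is followed by its own shops.

```r
library(XML)

# Hypothetical miniature of the page: h3 headings and ul lists are siblings,
# so the shops are not children of their country heading.
html <- '<html><body><div id="mw-content-text">
<h3>Tunisia</h3>
<ul><li>Monoprix</li><li>Carrefour</li></ul>
<h3>Morocco</h3>
<ul><li>Alpha 55</li></ul>
</div></body></html>'

doc <- htmlParse(html, asText = TRUE)

# Union of two paths with "|": list items and headings in a single query.
mixed <- xpathSApply(
  doc,
  '//*[@id="mw-content-text"]/ul/li | //*[@id="mw-content-text"]/h3',
  xmlValue
)
print(mixed)
```

xpathSApply is used here instead of xpathApply plus do.call(rbind, ...) only to get a plain character vector directly; either form should work for the splitting step.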

Now I do some processing to split cont.shops by country.

dd <- which(cont.shops %in% contries)                   ## positions of the country headings in cont.shops
freq <- c(diff(dd),length(cont.shops)-tail(dd,1)+1) ## block length per country (heading plus its shops)
contries.f <- rep(contries,freq) ## repeat each country over its block to build the splitting factor


ll <- split(cont.shops,contries.f)
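
The same grouping can also be written with cumsum, which some readers may find easier to follow than diff. This is an alternative sketch, not the method used in the answer. It runs on toy stand-ins for cont.shops and contries (hypothetical values) and assumes that the first element is a country heading, that country names are unique, and that no shop name is identical to a country heading.

```r
# Toy stand-ins for the scraped vectors (hypothetical values).
cont.shops <- c("Tunisia", "Monoprix", "Carrefour", "Morocco", "Alpha 55")
contries   <- c("Tunisia", "Morocco")

# TRUE where an element is a country heading.
is.header <- cont.shops %in% contries

# Running count of headings seen so far: 1 1 1 2 2.
# Every element gets the index of the most recent heading before it.
group <- cumsum(is.header)

# Drop the headings themselves and split the shops by country,
# keeping the countries in page order instead of alphabetical order.
shops <- cont.shops[!is.header]
ll <- split(shops, factor(contries[group[!is.header]], levels = contries))
print(ll)
```

Unlike the diff version, this variant leaves the heading out of each group, so ll[["Tunisia"]] holds only shop names.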

I can check the result:

> ll[[contries[1]]]
[1] "[edit]  Tunisia" "Magasin Général" "Mercure Market" "Promogro" "Geant"
[6] "Carrefour" "Monoprix"
> ll[[contries[2]]]
[1] "[edit] Â Morocco"
[2] "Alpha 55, one 6-story store in Casablanca"
[3] "Galeries Lafayette, to open in 2011[1] within Morocco Mall, in Casablanca"

Regarding xml - Scraping hierarchical data, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/14652362/
