xml - Data extraction of user reviews

Reposted. Author: 数据小太阳. Updated: 2023-10-29 02:34:28

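I am trying to learn R out of personal interest, as self-study. I am neither a coder nor an analyst. I want to extract user reviews from Trip Advisor. A single page holds 10 reviews, but with the code below I also get unwanted reviews/rows. I am not sure whether I am using the correct html nodes. I also want the full text of each review, but the result only gives me partial reviews that are cut off at the end. Can you help me extract the 10 full user reviews? Your help is much appreciated.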

library(XML)
dat <- readLines("http://www.tripadvisor.in/Hotel_Review-g60763-d93450-Reviews-Grand_Hyatt_New_York-New_York_City_New_York.html", warn=FALSE)
raw2 <- htmlTreeParse(dat, useInternalNodes = TRUE)
## User reviews
plain.text <- xpathSApply(raw2, "//div[@class='col2of2']//p[@class='partial_entry']", xmlValue)
UR <- gsub("\\\n","",plain.text)
Result <- unlist(UR)
Result

Best answer

This is more of a web-scraping exercise than an R programming one.

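In R, I prefer the httr package for getting the http response and extracting the content as parsed html. Using readLines(...) is the worst way to do this. The code below extracts the review summaries.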

library(httr)
library(XML)
url <- "http://www.tripadvisor.in/Hotel_Review-g60763-d93450-Reviews-Grand_Hyatt_New_York-New_York_City_New_York.html"
response <- GET(url)
# Parse the body with the XML package so that xpathSApply() can query it
# (recent httr versions return an xml2 document from content(type="text/html"))
doc <- htmlParse(content(response, as="text", encoding="UTF-8"), asText=TRUE)
# One <p class="partial_entry"> per review summary
smry <- xpathSApply(doc,'//div[@class="entry"]/p[@class="partial_entry"]',xmlValue)
length(smry)
# [1] 10
smry[1]
# [1] "\nThats all that matters really...I wonder if anyone would chose this hotel for any other factor at all...located right next to Grand central station in midtown and within walking distance of many tourist attractions, top restaurants and corp offices. Stayed 3 nights here on a business trip, I chose this hotel over others purely based on its location. Price is...\n\n\nMore \n\n"
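The summary strings carry leading and trailing newlines plus the trailing "More" link label. A minimal base-R sketch for tidying them, assuming the vector is shaped like the smry[1] output above; clean_summary is a hypothetical helper name and the sample strings are made up for illustration.

```r
# Hypothetical helper: tidy the raw text returned by xmlValue.
clean_summary <- function(x) {
  x <- gsub("\\s+", " ", x)   # collapse newlines and runs of whitespace
  x <- trimws(x)              # drop leading/trailing whitespace
  sub("\\s*More$", "", x)     # drop the trailing "More" link label, if present
}

# Made-up sample input shaped like smry[1] above.
sample_smry <- c("\nStayed 3 nights here on a business trip. Price is...\n\n\nMore \n\n",
                 "\nGreat location, short review.\n")
cleaned <- clean_summary(sample_smry)
print(cleaned)
```

Note that a review whose own last word is "More" would also lose it, so this is only a rough cleanup.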

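Getting the full reviews is more complicated because it requires clicking the "More" link. So you need to work out which http request is triggered when you click the "More" link on a review. You can do this with the Network Monitor tab in the Firefox developer tools (or with many other tools, I'm sure). It turns out to be a link of the form: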

http://www.tripadvisor.com/ExpandedUserReviews-g{xxx}-d{yyy}?querystring

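where {xxx} and {yyy} are unique to the hotel and are the same as in the original url, and querystring can be read in full from the network monitor tool. So we form a new http request using that url and the appropriate query string, and parse the result as follows.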
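To compare against what the network monitor shows, it can help to print the full request URL as a plain string. The sketch below does that in base R. The function name build_expanded_url and the review ids "111" and "222" are hypothetical; the query parameter names (target, context, reviews, servlet, expand) are the ones this answer passes to GET.

```r
# Hypothetical helper: assemble the expanded-reviews URL as a plain string.
build_expanded_url <- function(hotel_url, review_ids) {
  review_ids <- unname(as.character(review_ids))
  # Capture the "-g<digits>-d<digits>" part that identifies the hotel.
  code <- sub(".*Hotel_Review(-g\\d+-d\\d+)-Reviews.*", "\\1", hotel_url)
  if (identical(code, hotel_url)) stop("URL does not look like a Hotel_Review page")
  query <- c(target  = review_ids[1],
             context = "1",
             reviews = paste(review_ids, collapse = ","),
             servlet = "Hotel_Review",
             expand  = "1")
  paste0("http://www.tripadvisor.com/ExpandedUserReviews", code, "?",
         paste0(names(query), "=", query, collapse = "&"))
}

hotel_url <- "http://www.tripadvisor.in/Hotel_Review-g60763-d93450-Reviews-Grand_Hyatt_New_York-New_York_City_New_York.html"
full_url <- build_expanded_url(hotel_url, c("111", "222"))  # ids are placeholders
print(full_url)
```

This string is only for eyeballing. When the request is actually sent, passing the parameters through the query argument of GET lets httr handle the URL encoding.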

# class attribute of each "More" link; it embeds the review reference number
cls <- doc['//div[@class="entry"]//span[contains(@class,"moreLink")]/@class']
# pull the digits out of each class string
xr.refno <- sapply(cls,function(x)sub(".*\\str(\\d+)\\s.*","\\1",x))
# hotel identifier ("-g...-d...") taken from the original url
code <- sub(".*Hotel_Review(\\-g\\d+\\-d\\d+)\\-Reviews.*","\\1",url)
xr.url <- paste0("http://www.tripadvisor.com/ExpandedUserReviews",code)
xr.response <- GET(xr.url,query=list(target=xr.refno[1],
                                     context=1,
                                     reviews=paste(xr.refno,collapse=","),
                                     servlet="Hotel_Review",
                                     expand=1))
# Parse with the XML package so that xpathSApply() can query the result
xr.doc <- htmlParse(content(xr.response, as="text", encoding="UTF-8"), asText=TRUE)
xr.full <- xpathSApply(xr.doc,'//div[@class="entry"]/p',xmlValue)
length(xr.full)
# [1] 6
xr.full[1]
# [1] "\nThats all that matters really...I wonder if anyone would chose this hotel for any other factor at all...located right next to Grand central station in midtown and within walking distance of many tourist attractions, top restaurants and corp offices. Stayed 3 nights here on a business trip, I chose this hotel over others purely based on its location. Price is about average in NYC I think. Asked for a room with a good view and was given a 2 BR on the 30th floor. After checking in I realized there may not be the kind of view that I expected at all from any room in this hotel - due to it being surrounded by high rises in all directions. However, no other complaints as such - except may that the bathroom was a bit too cramped. That I guess is the norm in NYC. I would stay here again if it was a business visit based on the location. Faster than avg wifi (free) was a good plus.\n"

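There is one more nuance/problem. Note that there are only 6 "expanded reviews". This is because short reviews that fit within the "partial review" format have no "More" button. So you need to figure out which of the partial reviews are actually complete. Since you say you are learning R, I'll leave that to you...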
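One possible way to approach that exercise, offered only as a sketch: treat a summary as truncated when its text ends with the "More" label (as smry[1] does above), and slot the expanded texts into those positions. This rests on two unverified assumptions: that complete short reviews never end with "More", and that the expanded reviews come back in the same order as the truncated summaries. merge_reviews is a hypothetical helper and the sample strings are made up.

```r
# Hypothetical helper: replace truncated summaries with their full text.
# Assumes truncated entries end with the "More" label and that `full`
# lists the expanded reviews in the same order as the truncated entries.
merge_reviews <- function(partial, full) {
  truncated <- grepl("More\\s*$", partial)
  if (sum(truncated) != length(full)) {
    stop("number of truncated summaries does not match number of expanded reviews")
  }
  out <- trimws(partial)
  out[truncated] <- trimws(full)
  out
}

# Made-up sample data: two truncated summaries and one complete short review.
partial <- c("\nLong review, cut off...\n\n\nMore \n\n",
             "\nShort and complete.\n",
             "\nAnother long one...\n\n\nMore \n\n")
full <- c("\nLong review, cut off no longer.\n",
          "\nAnother long one, now in full.\n")
merged <- merge_reviews(partial, full)
print(merged)
```

With the objects from this answer, the call would be merge_reviews(smry, xr.full). Matching on the review reference numbers instead of on position would be more robust if the response exposes them.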

Regarding "xml - Data extraction of user reviews", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/32296167/
