
rvest: language selection does not work on TripAdvisor

Reposted. Author: 行者123. Last updated: 2023-12-04 03:35:54

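I am facing a web scraping problem. I want to collect some reviews from TripAdvisor, using rvest, and I want the reviews in all languages. From this question I learned that one possible approach is to append ?filterLang=ALL to the end of the URL. In a web browser this does work. Example: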

https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html?filterLang=ALL

This does show the reviews with "All languages" selected (you can see many French reviews). Here is my problem. I try to get the review titles:

library(rvest)
url <- "https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html?filterLang=ALL"

reviews_html <- read_html(url)

reviews_html %>%
  html_nodes(xpath = "//span[@class='noQuotes']") %>%
  html_text()

[1] "I've never visited this restaurant," "Perfect"
[3] "Memorable experience" "Tasty"
[5] "Absolutely spectacular" "Excellent"
[7] "Wonderfullll" "A Perfect Evening"
[9] "Dinner " "Perfect dinner and evening"

I only get the English ones. The strange part is that if I try to get the page numbers:

reviews_html %>%
  html_nodes(xpath = "//div[@data-tab='TABS_REVIEWS']//a[@data-page-number]") %>%
  html_text()

[1] "Next" "1" "2" "3" "4" "5" "6" "176"

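I get the number of review pages that corresponds to the "All languages" option! Compare this with the case where no language is selected: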

url <- "https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html"

reviews_html <- read_html(url)

reviews_html %>%
  html_nodes(xpath = "//span[@class='noQuotes']") %>%
  html_text()

[1] "I've never visited this restaurant," "Perfect"
[3] "Memorable experience" "Tasty"
[5] "Absolutely spectacular" "Excellent"
[7] "Wonderfullll" "A Perfect Evening"
[9] "Dinner " "Perfect dinner and evening"

I get the same reviews, but:

reviews_html %>%
  html_nodes(xpath = "//div[@data-tab='TABS_REVIEWS']//a[@data-page-number]") %>%
  html_text()

[1] "Next" "1" "2" "3" "4" "5" "6" "61"

I get the number of pages that corresponds to the English language selection. I also tried setting cookies:

library(httr)

url <- "https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html?filterLang=ALL"
httr::GET(url,
          set_cookies(`TALanguage` = "ALL",
                      `Domain` = ".tripadvisor.com")) %>%
  read_html() %>%
  html_nodes(xpath = "//span[@class='noQuotes']") %>%
  html_text()

But that did not work either. Does anyone know what is going on, and what I can do to get the reviews in all languages using rvest?

Best answer

When you select the filter manually, the browser makes a POST call to the same URL. Setting filterLang=ALL in the form body returns the data correctly:

library(rvest)
library(httr)

reviews_html <- POST(
  "https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html",
  add_headers('x-requested-with' = 'XMLHttpRequest'),
  body = list(
    preferFriendReviews = "FALSE",
    t = "",
    q = "",              # filter by mention, try "france"
    filterSeasons = "",  # "1" is mar-may / "2" is jun-aug / "3" is sep-nov / "4" is dec-feb
    filterLang = "ALL",  # try "zhCN" or "fr"
    filterSafety = "FALSE",
    filterSegment = "",  # "3" is families / "2" is couples / "5" is solo / "1" is business / "4" is friends
    trating = "",        # stars: "5" / "4" / "3" / "2" / "1" / "0"
    isLastPoll = "false",
    changeSet = "REVIEW_LIST"
  ),
  encode = "form") %>%
  read_html()

reviews <- reviews_html %>%
  html_nodes(xpath = "//span[@class='noQuotes']") %>%
  html_text()

print(reviews)

pages <- reviews_html %>%
  html_nodes(xpath = "//div[@data-tab='TABS_REVIEWS']//a[@data-page-number]") %>%
  html_text()

print(pages)

In the code above I added short descriptions of the fields, in case you need those filters.
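If you want to switch filters without editing the request each time, the form body can be wrapped in a small helper. The following is only a sketch: the function names build_review_form and fetch_review_titles and their argument names are made up for illustration, while the field names and values are the ones from the request above. It assumes the httr and rvest packages are installed and that the site still accepts this form.

```r
# Hypothetical helper: assemble the form body used in the POST request above.
# Field names come from the answer; the argument names are illustrative only.
build_review_form <- function(lang = "ALL", mention = "", season = "",
                              segment = "", rating = "") {
  list(
    preferFriendReviews = "FALSE",
    t = "",
    q = mention,              # filter by mention
    filterSeasons = season,   # "1" to "4", see the comments in the answer
    filterLang = lang,        # "ALL", "fr", "zhCN", ...
    filterSafety = "FALSE",
    filterSegment = segment,  # "1" to "5", see the comments in the answer
    trating = rating,         # stars, "0" to "5"
    isLastPoll = "false",
    changeSet = "REVIEW_LIST"
  )
}

# Hypothetical wrapper: POST the form and return the review titles.
# Extra arguments are passed on to build_review_form().
fetch_review_titles <- function(url, ...) {
  response <- httr::POST(
    url,
    httr::add_headers(`x-requested-with` = "XMLHttpRequest"),
    body = build_review_form(...),
    encode = "form"
  )
  httr::stop_for_status(response)  # stop early on an HTTP error status
  page <- rvest::read_html(
    httr::content(response, as = "text", encoding = "UTF-8")
  )
  titles <- rvest::html_nodes(page, xpath = "//span[@class='noQuotes']")
  rvest::html_text(titles)
}

# Example call (needs network access):
# fetch_review_titles(url, lang = "fr", rating = "5")
```

Keeping the form in its own function makes it easy to inspect what will be sent before any request is made.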

kaggle link

Output:

[1] "I've never visited this restaurant," "Excellente expérience"
[3] "Du grand art" "Promesse tenue"
[5] "Une soirée de rêve en famille" "Délicieux !!! "
[7] "Une expérience inoubliable" "UN CERTAIN REGARD"
[9] "Excellent soiree en couple" "Une soirée magnifique"
[1] "Next" "1" "2" "3" "4" "5" "6" "176"
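The pager labels mix text ("Next") with page numbers, so the total number of pages has to be picked out of them. A minimal sketch in base R, using the labels printed above as sample input (the variable names are illustrative):

```r
# Pager labels as printed in the output above.
pages <- c("Next", "1", "2", "3", "4", "5", "6", "176")

# Keep only the labels made up entirely of digits, then take the largest one.
page_numbers <- as.integer(pages[grepl("^[0-9]+$", pages)])
last_page <- max(page_numbers)

print(last_page)
# [1] 176
```

With the unfiltered request from the question, the same steps give 61, which is a quick way to check whether the language filter was actually applied.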

For more on "rvest: language selection does not work on TripAdvisor", see the similar question we found on Stack Overflow: https://stackoverflow.com/questions/66916363/
