
javascript - Reading PhantomJS-rendered HTML into R

Reposted. Author: 行者123. Updated: 2023-12-03 03:43:52

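Question: Using rvest, I can't seem to find the block of information I need in an HTML page rendered by PhantomJS. I have tried nearly every selector I can think of, but I can't get html_node to pick up the right block.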

HTML rendered by PhantomJS:

<div class="page">

<div class="main-header">
</script>

<div id="listing-703036966" class="shop-srp-listings__listing">
<div class="card listing-row--search hide-fade">

<div class="listing-row__main">
<div class="listing-row__image">

<div class="media-count shadowed">
<a href="/vehicledetail/detail/703036966/overview/" target="_self" class="media-count--photo" data-goto-vdp="703036966" data-standard-link="md-thumb">
25 Photos
</a>

<a href="/vehicledetail/detail/703036966/overview/" target="_self" class="media-count--video" data-goto-vdp="703036966" data-standard-link="md-thumb">
1 Video
</a>
</div>

<a href="/vehicledetail/detail/703036966/overview/" target="_self" class="gray-bg listing-row__photo" data-goto-vdp="703036966" data-standard-link="md-thumb">
<img alt="New 2018 BMW 750 i" src="https://www.cstatic-images.com/phototab/e/1/4/e2/f87fb57ec51cab4f57cbaeb9f9f.jpg" onload="window.performance.mark('serverSideFirstPhotoLoaded')">
</a>
<div class="compare-srp">
<div class="listing-row__save">
<a id="703036966" class="switch-favorite unsaved saveVehicleHeart compare-switch-favorite" savedfeatureinstance="" vehicle="{&quot;listingId&quot;:703036966,&quot;mkId&quot;:20005,&quot;mkNm&quot;:&quot;BMW&quot;,&quot;mdId&quot;:20536,&quot;mdNm&quot;:&quot;750&quot;,&quot;trimId&quot;:25905,&quot;trimName&quot;:&quot;i&quot;,&quot;modelYearId&quot;:35797618,&quot;modelYear&quot;:2018,&quot;stkTyp&quot;:&quot;New&quot;,&quot;state&quot;:&quot;NC&quot;,&quot;zipcode&quot;:&quot;27107&quot;}" cars-common-omniture-custom="" omniture-events="">
<div class="save-icon-wrapper">
<div class="cui-icon icon-heart-line">
<svg width="16" height="16" class="icon-image">
<use xlink:href="#cui-icon-heart-outline"></use>
</svg>
</div>

<div class="cui-icon icon-heart">
<svg width="16" height="16" class="icon-image">
<use xlink:href="#cui-icon-heart-fill"></use>
</svg>
</div>
</div>

<p class="saved-label">Save</p>
</a>
</div>
<div class="compare-button" data-compare-listing="703036966">
<div class="compare-icon-wrapper">
<div class="cui-icon icon-plus-sign">
<svg width="16" height="16" class="icon-plus-sign">
<use xlink:href="#cui-icon-plus-sign"></use>
</svg>
</div>
<div class="cui-icon icon-checkmark">
<svg width="16" height="16" class="icon-checkmark">
<use xlink:href="#cui-icon-checkmark"></use>
</svg>
</div>
</div>
<p class="compare-button__label compare">Compare</p>
<p class="compare-button__label added">Added</p>
</div>
</div>
</div>

and so on.

What I did in R:

library(rvest)
library(stringr)
library(plyr)
library(dplyr)
library(ggvis)
library(knitr)
library(tidyverse)

cars <- read_html("my file.html") %>%
html_nodes("div") %>%
html_text()

However, when I inspect the cars vector, the block I need is missing entirely, namely:

<a id="703036966" class="switch-favorite unsaved saveVehicleHeart         compare-switch-favorite" savedfeatureinstance="" vehicle=".   {&quot;listingId&quot;:703036966,&quot;mkId&quot;:20005,&quot;mkNm&quot;:&quot;BMW&quot;,&quot;mdId&quot;:20536,&quot;mdNm&quot;:&quot;750&quot;,&quot;trimId&quot;:25905,&quot;trimName&quot;:&quot;i&quot;,&quot;modelYearId&quot;:35797618,&quot;modelYear&quot;:2018,&quot;stkTyp&quot;:&quot;New&quot;,&quot;state&quot;:&quot;NC&quot;,&quot;zipcode&quot;:&quot;27107&quot;}" cars-common-omniture-custom="" omniture-events="">

It never ends up in a usable form, and every node type I have tried (div, p, span) misses it.

Any ideas?

Best answer

It looks like you want to parse the content inside the braces of a single node, that is, the string "vehicle='{"listingId":703036966,..." from the node with the CSS path "a id.703036966 saveVehicleHeart".

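Because the data you want sits in an attribute of the node rather than in text that a browser would render, html_text() will not help. Instead, you can store the node's markup as a string and then parse out the part of interest.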
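To illustrate the difference between rendered text and attribute data, here is a minimal sketch. The `snippet` string is a shortened, hand-written stand-in for the listing markup, not the real file. It also uses `html_attr()`, which the original answer does not use, to read the attribute directly.

```r
library(rvest)

# Hypothetical, cut-down copy of the <a> element from the question.
snippet <- '<a id="703036966" class="switch-favorite saveVehicleHeart" vehicle="{&quot;listingId&quot;:703036966,&quot;mkNm&quot;:&quot;BMW&quot;}">Save</a>'

page <- read_html(snippet)
node <- html_node(page, ".saveVehicleHeart")

# html_text() returns only the text between the tags.
node_text <- html_text(node)

# html_attr() returns the value of one attribute, with entities such as
# &quot; already decoded by the HTML parser.
vehicle_attr <- html_attr(node, "vehicle")

print(node_text)
print(vehicle_attr)
```

If reading the attribute directly works on the real page, the string-parsing steps that follow become optional, though they remain a reasonable fallback.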

1. Retrieve the node as a string. One of several possible CSS paths to the node is '.saveVehicleHeart'.

library(rvest)
library(stringr)
library(dplyr)
library(tidyr) ## Provides separate(), which is used in step 3
car_html <- read_html("my file.html")
cars <- as.character(html_node(car_html, css = '.saveVehicleHeart'))

2. Extract the content inside the braces "{ }".

cars <- cars %>%
str_match(., "\\{.*?\\}") %>% ## Extract everything between the first "{" and the subsequent "}"
gsub("\\{|\\}", "", .) ## Remove the characters "{" and "}"
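To see what the pattern captures, here is a small sketch on a hand-written string. `node_string` is a hypothetical, shortened version of what `as.character()` returns for the node. The second call uses lookarounds, a variation not in the original answer, to drop the braces in the same step.

```r
library(stringr)

# Hypothetical, shortened node markup as a plain string.
node_string <- "<a id=\"703036966\" vehicle='{\"listingId\":703036966,\"mkNm\":\"BMW\"}'></a>"

# Non-greedy match from the first "{" to the next "}", braces included.
with_braces <- str_extract(node_string, "\\{.*?\\}")

# Same span, but the lookbehind and lookahead leave the braces out.
inner <- str_extract(node_string, "(?<=\\{).*?(?=\\})")

print(with_braces)
print(inner)
```

Note that the non-greedy match stops at the first "}", so this approach assumes the braces are not nested.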

3. Bonus: put it into a tidy data frame. You didn't ask for this, but it may be helpful.

df_cars <- cars %>% 
cbind(read.table(text = ., sep = (','))) %>%
t() %>%
as_data_frame() %>%
.[-1,] %>% ## The first row contains the original unparsed string. We drop it.
separate(., V1, into = c("Variable", "Value"), sep = "\\:")
df_cars

# A tibble: 12 × 2
Variable Value
* <chr> <chr>
1 listingId 703036966
2 mkId 20005
3 mkNm BMW
4 mdId 20536
5 mdNm 750
6 trimId 25905
7 trimName i
8 modelYearId 35797618
9 modelYear 2018
10 stkTyp New
11 state NC
12 zipcode 27107
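Since the text inside the braces looks like valid JSON, a JSON parser is a possible alternative to splitting on commas and colons. The sketch below assumes the jsonlite package is installed, which the original answer does not use. `vehicle_json` is a shortened, hand-copied stand-in for the attribute value.

```r
library(jsonlite)

# Hypothetical, shortened copy of the decoded "vehicle" attribute.
vehicle_json <- '{"listingId":703036966,"mkId":20005,"mkNm":"BMW","modelYear":2018,"stkTyp":"New","zipcode":"27107"}'

# fromJSON() turns a flat JSON object into a named list.
vehicle <- fromJSON(vehicle_json)

# Reshape the list into the same Variable/Value layout as above,
# converting every value to character so the column has one type.
df_vehicle <- data.frame(
  Variable = names(vehicle),
  Value = vapply(vehicle, as.character, character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

print(df_vehicle)
```

A parser also copes with values that contain commas or colons, which would break the read.table() and separate() approach.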

For more on javascript - Reading PhantomJS-rendered HTML into R, see the similar question we found on Stack Overflow: https://stackoverflow.com/questions/45495239/
