r - Difference between read_html(url) and read_html(content(GET(url), "text"))

Reposted. Author: 行者123. Updated: 2023-12-04 12:21:51

I am looking at this great answer: https://stackoverflow.com/a/58211397/3502164.

The solution starts with:

library(httr)
library(xml2)

gr <- GET("https://nzffdms.niwa.co.nz/search")
doc <- read_html(content(gr, "text"))

xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value")

The output is constant across multiple requests:
"59243d3a2....61f8f73136118f9"

Until now, my default approach has been:
doc <- read_html("https://nzffdms.niwa.co.nz/search")
xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value")

This result differs from the output above, and it changes from request to request.

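Questions:

What is the difference between:
  • read_html(url)
  • read_html(content(GET(url), "text"))

Why does it lead to different values, and why does only the "GET" solution return the csv in the linked question?

(I hope it is OK to structure this as three sub-questions.)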

What I tried:

Going down the rabbit hole of function calls:
    read_html
    (ms <- methods("read_html"))
    getAnywhere(ms[1])
    xml2:::read_html
    xml2:::read_html.default
    #xml2:::read_html.response

    read_xml
    (ms <- methods("read_xml"))
    getAnywhere(ms[1])

But this led to another question: Find the used method for R wrapper functions

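Thoughts:
  • I don't see the GET request sending any headers or cookies that
    could explain the different responses.
  • As far as I understand, both read_html(url) and
    read_html(content(GET(.), "text")) return XML/HTML.
  • I am not sure whether this check makes sense, but since I had run out of ideas, I checked whether some kind of caching is going on.

    Code: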
    with_verbose(GET("https://nzffdms.niwa.co.nz/search"))
    ....
    <- Expires: Thu, 19 Nov 1981 08:52:00 GMT
    <- Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0

    --> To me it looks like caching is probably not the explanation.
  • help("GET") has an interesting section on "conditional GET":

    The semantics of the GET method change to a "conditional GET" if the request message includes an If-Modified-Since, If-Unmodified-Since, If-Match, If-None-Match, or If-Range header field. A conditional GET method requests that the entity be transferred only under the circumstances described by the conditional header field(s). The conditional GET method is intended to reduce unnecessary network usage by allowing cached entities to be refreshed without requiring multiple requests or transferring data already held by the client.

    But as far as I can see from the with_verbose() output, none of If-Modified-Since, If-Unmodified-Since, If-Match, If-None-Match, or If-Range is set.

Best answer

The difference is that with repeated calls to httr::GET(), the handle persists between calls. With xml2::read_html(), a new connection is made each time.

From the httr documentation:

    The handle pool is used to automatically reuse Curl handles for the same scheme/host/port combination. This ensures that the http session is automatically reused, and cookies are maintained across requests to a site without user intervention.
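
The pooling behaviour described in this quote can be sketched without sending any request. This is a minimal illustration that assumes httr's exported handle_find() returns the pooled handle for a URL; the example.org URL is only an illustrative second host, not part of the original question.

```r
library(httr)

# Two URLs on the same scheme/host/port, differing only in path.
h_search <- handle_find("https://nzffdms.niwa.co.nz/search")
h_root   <- handle_find("https://nzffdms.niwa.co.nz/")

# A different host (illustrative only) should get its own handle.
h_other  <- handle_find("https://example.org/")

# Compare the underlying curl handles held by each httr handle object.
same_site  <- identical(h_search$handle, h_root$handle)
other_site <- identical(h_search$handle, h_other$handle)

print(same_site)
print(other_site)
```

Because both nzffdms.niwa.co.nz URLs resolve to one pooled handle, any cookie the server sets on the first GET() is sent again on later GET() calls to that site.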



The xml2 documentation says the following about the string argument passed to read_html():

    A string can be either a path, a url or literal xml. Urls will be converted into connections either using base::url or, if installed, curl::curl
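
As a small offline illustration of the "literal xml" case from this quote, the sketch below parses an HTML string directly. The markup and the token value "abc123" are made up for the example and only mimic the hidden input from the question.

```r
library(xml2)

# Hypothetical stand-in for the search page; the token value is invented.
html <- '<html><body><form>
  <input type="hidden" name="search[_csrf_token]" value="abc123">
</form></body></html>'

# The string contains markup, so read_html() parses it as literal HTML
# instead of treating it as a path or URL.
doc <- read_html(html)

token <- xml_attr(
  xml_find_first(doc, ".//input[@name='search[_csrf_token]']"),
  "value"
)
print(token)
```

This is the same path that read_html(content(gr, "text")) takes: httr performs the request, and xml2 only sees a string of markup.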



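So your answer is that read_html(GET(url)) is like refreshing your browser, whereas read_html(url) is like closing your browser and opening a new one. The server puts a unique session ID on the pages it delivers. New session, new ID. You can prove this by calling httr::handle_reset(url):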
    library(httr)
    library(xml2)

    # GET the page (note xml2 handles httr responses directly, don't need content("text"))
    gr <- GET("https://nzffdms.niwa.co.nz/search")
    doc <- read_html(gr)
    print(xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value"))

    # A new GET using the same handle gets exactly the same response
    gr <- GET("https://nzffdms.niwa.co.nz/search")
    doc <- read_html(gr)
    print(xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value"))

    # Now call GET again after resetting the handle
    httr::handle_reset("https://nzffdms.niwa.co.nz/search")
    gr <- GET("https://nzffdms.niwa.co.nz/search")
    doc <- read_html(gr)
    print(xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value"))

    In my case, sourcing the above code gave me:
    [1] "ecd9be7c75559364a2a5568049c0313f"
    [1] "ecd9be7c75559364a2a5568049c0313f"
    [1] "d953ce7acc985adbf25eceb89841c713"
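
The three steps above could be wrapped in a small helper. This is only a sketch: get_csrf_token() and its fresh_session argument are hypothetical names introduced here, not part of httr or xml2, and the XPath is the one used in the question.

```r
library(httr)
library(xml2)

# Hypothetical helper: fetch a page and return its CSRF token,
# optionally starting a fresh session first.
get_csrf_token <- function(url, fresh_session = FALSE) {
  if (fresh_session) {
    # Drop the pooled handle for this site, and with it the session cookies.
    handle_reset(url)
  }
  resp <- GET(url)
  stop_for_status(resp)
  doc <- read_html(resp)
  xml_attr(
    xml_find_first(doc, ".//input[@name='search[_csrf_token]']"),
    "value"
  )
}

# Usage (needs network access; the tokens differ from session to session):
# get_csrf_token("https://nzffdms.niwa.co.nz/search")
# get_csrf_token("https://nzffdms.niwa.co.nz/search")
# get_csrf_token("https://nzffdms.niwa.co.nz/search", fresh_session = TRUE)
```

If the server behaves as described in the answer, the first two calls should return the same token and the third a new one.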

Regarding r - the difference between read_html(url) and read_html(content(GET(url), "text")), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58219503/
