
go - go-colly: How to get the HTML title in c.OnResponse in order to populate a struct?

Reposted. Author: 行者123. Last updated: 2023-12-01 21:13:43

How can I get the HTML title in c.OnResponse, or is there a better alternative for populating a struct with url / title / content?

  • In the end, I need to populate the following struct and post it to Elasticsearch.

    type WebPage struct {
        Url     string `json:"url"`
        Title   string `json:"title"`
        Content string `json:"content"`
    }
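For the "post it to Elasticsearch" step, one possible shape is a plain HTTP POST of the JSON-encoded struct to the index's `_doc` endpoint. This is only a sketch under assumptions: `indexPage`, `demo`, and the index name `webpages` are hypothetical, no official Elasticsearch client is used, and the demo talks to a local stand-in server rather than a real cluster.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
)

// WebPage mirrors the struct from the question.
type WebPage struct {
	Url     string `json:"url"`
	Title   string `json:"title"`
	Content string `json:"content"`
}

// indexPage is a hypothetical helper: it POSTs p as JSON to
// <baseURL>/<index>/_doc, which lets Elasticsearch assign the document ID.
func indexPage(client *http.Client, baseURL, index string, p WebPage) error {
	body, err := json.Marshal(p)
	if err != nil {
		return err
	}
	resp, err := client.Post(baseURL+"/"+index+"/_doc", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("indexing failed: %s", resp.Status)
	}
	return nil
}

// received records what the stand-in server saw.
type received struct {
	path string
	page WebPage
}

// demo runs indexPage against a local test server (not a real
// Elasticsearch node) and returns the request path and decoded body.
func demo() (string, WebPage, error) {
	ch := make(chan received, 1)
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var p WebPage
		if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		ch <- received{path: r.URL.Path, page: p}
		w.WriteHeader(http.StatusCreated)
	}))
	defer srv.Close()

	page := WebPage{Url: "https://example.com/", Title: "Demo page", Content: "hello"}
	if err := indexPage(srv.Client(), srv.URL, "webpages", page); err != nil {
		return "", WebPage{}, err
	}
	rec := <-ch
	return rec.path, rec.page, nil
}

func main() {
	path, page, err := demo()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(path, page.Title)
}
```

Against a real cluster, `baseURL` would be the node address, and authentication or TLS settings would go on the `http.Client`; those details depend on the deployment and are not covered here.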

    // Print the response
    c.OnResponse(func(r *colly.Response) {
        pageCount++
        log.Println(r.Headers)

        webpage := WebPage{
            Url:     r.Ctx.Get("url"), // can be put in Ctx in c.OnRequest, then read with r.Ctx.Get("url")
            Title:   "my title",       // string(r.title) // Where to get this?
            Content: string(r.Body),   // string(r.Body) can be done in c.OnResponse
        }

        enc := json.NewEncoder(os.Stdout)
        enc.SetIndent("", " ")
        enc.Encode(webpage) // SEND it to elasticsearch

        log.Printf("%d DONE Visiting : %s", pageCount, urlVisited)
    })



I can get the title with the approach below, but Ctx is not accessible there, so I cannot put the "title" value into Ctx. Are there any other options?

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println(e.Text)
        e.Ctx.Put("title", e.Text) // NOT ACCESSIBLE!
    })
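Two details about colly matter here. OnHTML callbacks run after OnResponse, so a title stored from OnHTML is not yet set when OnResponse fires. And `colly.HTMLElement` has no `Ctx` field of its own: the context is reached through `e.Request.Ctx` or `e.Response.Ctx`. If the title really is needed inside c.OnResponse, one rough option is to pull it out of the raw body. The sketch below uses only the standard library and is not part of colly. `extractTitle` and `titleRe` are hypothetical names, and a regular expression is a crude stand-in for a real HTML parser.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// titleRe matches the first <title>...</title> pair, case-insensitively,
// allowing attributes on the opening tag and newlines inside the text.
var titleRe = regexp.MustCompile(`(?is)<title(?:\s[^>]*)?>(.*?)</title>`)

// extractTitle is a hypothetical helper (not a colly function). It returns
// the text of the first <title> element in body, or "" if none is found.
func extractTitle(body []byte) string {
	m := titleRe.FindSubmatch(body)
	if m == nil {
		return ""
	}
	// Decode entities such as &amp; and trim surrounding whitespace.
	return strings.TrimSpace(html.UnescapeString(string(m[1])))
}

func main() {
	body := []byte("<html><head><title> Build a Portfolio &amp; More </title></head><body>...</body></html>")
	fmt.Println(extractTitle(body)) // prints "Build a Portfolio & More"
}
```

Inside the callback from the question this would read `Title: extractTitle(r.Body)`, since `r.Body` is a `[]byte`.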

Log:
    2020/05/07 17:42:37 7  DONE Visiting : https://www.coursera.org/learn/build-portfolio-website-html-css
    {
    "url": "https://www.coursera.org/learn/build-portfolio-website-html-css",
    "title": "my page title",
    "content": "page html body bla "
    }
    2020/05/07 17:42:37 8 DONE Visiting : https://www.coursera.org/learn/build-portfolio-website-html-css
    {
    "url": "https://www.coursera.org/browse/social-sciences",
    "title": "my page title",
    "content": "page html body bla "
    }

Best answer

I created a global variable of that struct and populate it from the different callbacks.
I'm not sure whether this is the best approach.


    func main() {
        // ....

        webpage := WebPage{} // Is this a right way to declare a mutable struct?

        c.OnRequest(func(r *colly.Request) { // URL
            webpage.Url = r.URL.String() // Is this the right way to mutate?
        })

        c.OnResponse(func(r *colly.Response) { // response received
            pageCount++
            log.Printf("%d DONE Visiting : %s", pageCount, webpage.Url)
        })

        c.OnHTML("head title", func(e *colly.HTMLElement) { // title
            webpage.Title = e.Text
        })

        c.OnHTML("html body", func(e *colly.HTMLElement) { // body / content
            webpage.Content = e.Text // Can url, title, and body be mismatched in a multithreaded scenario?
        })

        c.OnHTML("a[href]", func(e *colly.HTMLElement) { // follow links
            link := e.Attr("href")
            e.Request.Visit(link)
        })

        c.OnError(func(r *colly.Response, err error) { // error handler
            log.Println("Request URL:", r.Request.URL, "failed with response:", r, "\nError:", err)
        })

        c.OnScraped(func(r *colly.Response) { // done: runs after the OnHTML callbacks
            enc := json.NewEncoder(os.Stdout)
            enc.SetIndent("", " ")
            enc.Encode(webpage)
        })
    }

Regarding "go - go-colly: How to get the HTML title in c.OnResponse in order to populate a struct?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/61668117/
