
scheme - How do I extract elements from HTML in Racket?

Reposted. Author: 行者123. Updated: 2023-12-01 09:55:43

I want to extract the URLs from a reddit page. My code is:

#lang racket

(require net/url)
(require html)

(define reddit (string->url "http://www.reddit.com/r/programming/search?q=racket&sort=relevance&restrict_sr=on&t=all"))
(define in (get-pure-port reddit #:redirections 5))

(define response-html (read-html-as-xml in))
(define content-0 (list-ref response-html 0))

(close-input-port in)

The content-0 above is:

(element
(location 0 0 15)
(location 0 0 82)
...

I would like to know how to extract specific content from it.

Best answer

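  1. It is usually more convenient to work with HTML as x-expressions rather than as the structs of the html module.

  2. Also, you should use call/input-url, which closes the port for you automatically.

You can combine both ideas by defining a read-html-as-xexpr function and using it like this: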

#lang racket/base

(require html
         net/url
         xml)

(define (read-html-as-xexpr in) ;; input-port? -> xexpr?
  (caddr
   (xml->xexpr
    (element #f #f 'root '()
             (read-html-as-xml in)))))

(define reddit (string->url "http://www.reddit.com/r/programming/search?q=racket&sort=relevance&restrict_sr=on&t=all"))

(call/input-url reddit
                get-pure-port
                read-html-as-xexpr)

This returns one large x-expression, for example:

'(html
  ((lang "en") (xml:lang "en") (xmlns "http://www.w3.org/1999/xhtml"))
  (head
   ()
   (title () "programming: search results")
   (meta
    ((content " reddit, reddit.com, vote, comment, submit ")
     (name "keywords")))
   (meta
    ((content "reddit: the front page of the internet") (name "description")))
   (meta ((content "origin") (name "referrer")))
   (meta ((content "text/html; charset=UTF-8") (http-equiv "Content-Type")))
   ... snip ...

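How do you extract specific parts of it?

  • For simple HTML whose overall structure I don't expect to change, I usually just use match.

  • However, the more correct and robust approach is to use the xml/path module.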
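
As a minimal sketch of the match approach, the following pulls the page title out of an x-expression. The page value is a small hand-written stand-in for the parsed reddit page, and page-title is a hypothetical helper name; neither comes from the original answer. The `_ ...` patterns skip over any siblings (including whitespace strings that a real parse may leave between elements).

```racket
#lang racket/base

(require racket/match)

;; Hypothetical sample data: a tiny x-expression shaped like the parsed page.
(define page
  '(html ()
     (head ()
       (meta ((content "origin") (name "referrer")))
       (title () "programming: search results"))
     (body ()
       (p () "hello"))))

;; page-title : xexpr? -> (or/c string? #f)
;; Returns the text of the first <title> inside <head>, or #f if the
;; x-expression does not have that shape.
(define (page-title xe)
  (match xe
    [(list 'html _ _ ...
           (list 'head _ _ ... (list 'title _ title) _ ...)
           _ ...)
     title]
    [_ #f]))

(page-title page) ; → "programming: search results"
```

This works well while the page layout stays fixed, but the pattern has to be rewritten whenever the nesting changes, which is why xml/path is the sturdier choice.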



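Update: I noticed that your question started out asking about extracting URLs. Here is the example updated to use se-path*/list to get the href attribute of every <a> element: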

#lang racket/base

(require html
         net/url
         xml
         xml/path)

(define (read-html-as-xexprs in) ;; (-> input-port? xexpr?)
  (caddr
   (xml->xexpr
    (element #f #f 'root '()
             (read-html-as-xml in)))))

(define reddit (string->url "http://www.reddit.com/r/programming/search?q=racket&sort=relevance&restrict_sr=on&t=all"))

(define xe (call/input-url reddit
                           get-pure-port
                           read-html-as-xexprs))

(se-path*/list '(a #:href) xe)

Result:

'("#content"
"http://www.reddit.com/r/announcements/"
"http://www.reddit.com/r/Art/"
"http://www.reddit.com/r/AskReddit/"
"http://www.reddit.com/r/askscience/"
"http://www.reddit.com/r/aww/"
"http://www.reddit.com/r/blog/"
"http://www.reddit.com/r/books/"
"http://www.reddit.com/r/creepy/"
"http://www.reddit.com/r/dataisbeautiful/"
"http://www.reddit.com/r/DIY/"
"http://www.reddit.com/r/Documentaries/"
"http://www.reddit.com/r/EarthPorn/"
"http://www.reddit.com/r/explainlikeimfive/"
"http://www.reddit.com/r/Fitness/"
"http://www.reddit.com/r/food/"
... snip ...
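
The result mixes in-page anchors such as "#content" with full URLs. As a rough sketch, assuming you only want the absolute links, you could filter the list afterwards. The sample-xe value below is a hand-written stand-in so the example runs without a network request, and the names hrefs and absolute-hrefs are illustrative only.

```racket
#lang racket/base

(require racket/string
         xml/path)

;; Hypothetical sample data standing in for the parsed reddit page.
(define sample-xe
  '(html ()
     (body ()
       (a ((href "#content")) "jump to content")
       (div ()
         (a ((href "http://www.reddit.com/r/announcements/"))
            "announcements")))))

;; Every href attribute of every <a> element, at any depth.
(define hrefs (se-path*/list '(a #:href) sample-xe))

;; Keep only links that start with "http" (assumption: that is what
;; "absolute" means for this page).
(define absolute-hrefs
  (filter (lambda (h) (string-prefix? h "http")) hrefs))

absolute-hrefs
```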

For more on "scheme - How do I extract elements from HTML in Racket?", see the similar question we found on Stack Overflow: https://stackoverflow.com/questions/28195841/
