
HTML parsing problem in Racket

Reposted. Author: 行者123. Updated: 2023-11-28 01:50:51

I want to parse some HTML documents, but Racket's html and xml libraries don't seem to handle them well. For example, here is an HTML document:

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<title>Test</title>
<script>
var k = "<scr";
</script>
</head>
<body>
</body>
</html>

Neither read-html nor read-xml can parse this. Both treat the <scr in var k = "<scr" as the start of an opening tag.

So, is there a better way to do this?

Best answer

Try the html-parsing package.

The html-parsing parsing behavior is permissive in that it accepts erroneous HTML, handling several classes of HTML syntax errors gracefully, without yielding a parse error. This is crucial for parsing arbitrary real-world Web pages, since many pages actually contain syntax errors that would defeat a strict or validating parser. html-parsing’s handling of errors is intended to generally emulate popular Web browsers’ interpretation of the structure of erroneous HTML. We euphemistically term this kind of parse “pragmatic.”

I'm not sure whether it handles <script> tags like this one, but it might. The author, Neil Van Dyke, is active on the Racket mailing list.
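A minimal sketch of trying the package on the document from the question, assuming html-parsing has been installed (for example with raco pkg install html-parsing). html->xexp is the package's parsing function; the names doc and page are only illustrative. Whether the script body comes through intact is not confirmed here, so the sketch just prints the result for inspection.

```racket
#lang racket
;; Assumption: the third-party html-parsing package is installed.
(require html-parsing)

;; The HTML document from the question, built as a single string.
(define doc
  (string-append
   "<!DOCTYPE html>\n"
   "<html>\n"
   "<head>\n"
   "<meta charset=\"utf-8\" />\n"
   "<title>Test</title>\n"
   "<script>\n"
   "var k = \"<scr\";\n"
   "</script>\n"
   "</head>\n"
   "<body>\n"
   "</body>\n"
   "</html>\n"))

;; html->xexp accepts a string or an input port and returns an
;; SXML-style list whose first element is the symbol *TOP*.
(define page (html->xexp doc))

;; Print the parsed tree to see how the <script> element was handled.
(pretty-write page)
```

If the printed tree shows the script text as a single string child of the script element, the package copes with this case; if not, the mailing list mentioned above is the place to ask.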

A similar question about this HTML parsing problem in Racket can be found on Stack Overflow: https://stackoverflow.com/questions/20159134/
