
java - Sanitizing HTML data

Reposted. Author: 行者123. Updated: 2023-11-30 11:57:39

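I pull data from various RSS/ATOM feeds, and sometimes the HTML I receive contains tags that are never closed, or has other problems, which breaks the page layout and styling.

Some of it also has class name/id conflicts. Is there any way to sanitize it?

I would appreciate it if anyone could point me to some reliable JavaScript/Java implementations.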

Best answer

You can give JTidy a try.

JTidy can be used as a tool for cleaning up malformed and faulty HTML.
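
As a rough illustration of the JTidy suggestion, here is a minimal sketch that runs a fragment with unclosed tags through `org.w3c.tidy.Tidy`. It assumes JTidy r938 is on the classpath; the class name `JTidySketch`, the method name `tidy`, and the sample input are made up for this example and are not part of the original answer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

import org.w3c.tidy.Tidy;

public class JTidySketch {

    // Hypothetical helper: repairs a feed fragment and returns only the body markup.
    static String tidy(String dirtyHtml) {
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);            // emit well-formed XHTML
        tidy.setQuiet(true);            // suppress the summary banner
        tidy.setShowWarnings(false);    // do not print per-tag warnings
        tidy.setPrintBodyOnly(true);    // skip the generated <html>/<head> wrapper
        tidy.setInputEncoding("UTF-8");
        tidy.setOutputEncoding("UTF-8");

        ByteArrayInputStream in =
                new ByteArrayInputStream(dirtyHtml.getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        tidy.parse(in, out);            // parses the input and writes the repaired markup
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The <b> and <p> tags are never closed in the input.
        System.out.println(tidy("<p>Hello <b>world"));
    }
}
```

This only repairs the markup structure, so the unclosed tags no longer leak into the surrounding page. It does not by itself rename conflicting class names or ids.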

Another option is HtmlCleaner:

HTML found on Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring the order to tags, attributes and ordinary text. For the given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows similar rules that the most of web browsers use in order to create Document Object Model. However, user may provide custom tag and rule set for tag filtering and balancing.
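
A comparable sketch for HtmlCleaner, assuming a 2.x release of the library is on the classpath. The class name `HtmlCleanerSketch`, the method name `clean`, and the sample input are hypothetical; the default cleaner properties are used apart from omitting the XML declaration.

```java
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.SimpleXmlSerializer;
import org.htmlcleaner.TagNode;

public class HtmlCleanerSketch {

    // Hypothetical helper: balances the tags and serializes the result as XML.
    static String clean(String dirtyHtml) {
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        props.setOmitXmlDeclaration(true);   // no <?xml ...?> header in the output

        try {
            TagNode root = cleaner.clean(dirtyHtml);   // builds a balanced tag tree
            return new SimpleXmlSerializer(props).getAsString(root);
        } catch (Exception e) {
            // Some releases declare checked exceptions on these calls; wrap them
            // so callers do not have to handle them.
            throw new IllegalStateException("Could not clean HTML", e);
        }
    }

    public static void main(String[] args) {
        // The <b> and <p> tags are never closed in the input.
        System.out.println(clean("<p>Hello <b>world"));
    }
}
```

By default the serialized result is a whole document wrapped in `html`, `head`, and `body` elements, so a feed fragment would still need to be extracted from the `body` node before it is embedded in a page.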

Regarding "java - Sanitizing HTML data", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/3697392/
