java - How do I build an HTML org.w3c.dom.Document?


Reposted by: 太空狗. Updated: 2023-10-29 14:10:50

The documentation of the Document interface describes it as follows:

The Document interface represents the entire HTML or XML document.

javax.xml.parsers.DocumentBuilder builds XML Documents. However, I can't find a way to build a Document that is an HTML Document!

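I want an HTML Document because I'm trying to build a document and then pass it to a library that expects an HTML Document. The library calls Document#getElementsByTagName(String tagname) in a case-insensitive manner, which works for HTML but not for XML.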
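
To make the case-sensitivity problem concrete, here is a minimal sketch that uses only the JDK's built-in XML parser. The class name CaseSensitivityDemo and the sample markup are illustrative assumptions, not part of the library mentioned above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: a Document produced by the XML DocumentBuilder
// matches tag names case-sensitively.
public class CaseSensitivityDemo {

    // Parses a small well-formed document and counts the elements matching tagName.
    public static int countMatches(String xml, String tagName) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName(tagName).getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html><body><p>one</p><p>two</p></body></html>";
        System.out.println("p: " + countMatches(xml, "p")); // p: 2
        System.out.println("P: " + countMatches(xml, "P")); // P: 0
    }
}
```

A library that looks up "P" or "TITLE" would therefore find nothing in a lowercase XML document built this way.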

I've looked around and haven't found anything. Questions such as How to convert an Html source of a webpage into org.w3c.dom.Document in java? don't actually have an answer.

Best answer

You appear to have two explicit requirements:

You need to represent HTML as an org.w3c.dom.Document.

You need Document#getElementsByTagName(String tagname) to operate in a case-insensitive manner.

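If you are trying to work with HTML using an org.w3c.dom.Document, then I assume you are working with some flavor of XHTML, because an XML API such as DOM expects well-formed XML. HTML isn't necessarily well-formed XML, but XHTML is. Even if you are working with HTML, you would have to do some preprocessing to make sure it is well-formed XML before trying to run it through an XML parser. It may be easier to first parse the HTML with an HTML parser, such as jsoup, and then build your org.w3c.dom.Document by walking the tree the HTML parser produces (an org.jsoup.nodes.Document in the case of jsoup).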
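
The difference between "HTML" and "well-formed XML" can be checked with the JDK parser alone. The sketch below is an illustration under assumptions (the class name WellFormedCheck and both markup strings are made up): it reports whether the XML parser accepts a given piece of markup.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: tests whether markup is well-formed XML.
public class WellFormedCheck {

    // Returns true if the XML parser accepts the markup, false if it reports a parse error.
    public static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // the markup is not well-formed XML
        } catch (Exception e) {
            throw new IllegalStateException("Parser setup or I/O failed", e);
        }
    }

    public static void main(String[] args) {
        // Valid HTML habits (unclosed <br> and <p>) that XML does not allow.
        String html = "<html><body><p>Line one<br>Line two</body></html>";
        // The same content written as XHTML.
        String xhtml = "<html><body><p>Line one<br/>Line two</p></body></html>";
        System.out.println(isWellFormedXml(html));  // false
        System.out.println(isWellFormedXml(xhtml)); // true
    }
}
```

Markup that fails this check is the kind that needs an HTML parser such as jsoup (or some other cleanup step) before it can become an org.w3c.dom.Document.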

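There is an org.w3c.dom.html.HTMLDocument interface, which extends org.w3c.dom.Document. The only implementation I found is in Xerces-j (2.11.0), in the form of org.apache.html.dom.HTMLDocumentImpl. At first this looks promising, but on closer inspection there are some problems.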

1. There is no clear, "clean" way to obtain an instance of an object that implements the org.w3c.dom.html.HTMLDocument interface.

With Xerces we would normally obtain a Document object using a DocumentBuilder, as follows:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.newDocument();
//or doc = builder.parse(xmlFile) if parsing from a file

Or using the DOMImplementation variety:

DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
DOMImplementationLS impl = (DOMImplementationLS)registry.getDOMImplementation("LS");
LSParser lsParser = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
Document document = lsParser.parseURI("myFile.xml");

In both cases we use purely the org.w3c.dom.* interfaces to obtain the Document object.

The closest equivalent I found for HTMLDocument is something like this:

HTMLDOMImplementation htmlDocImpl = HTMLDOMImplementationImpl.getHTMLDOMImplementation();
HTMLDocument htmlDoc = htmlDocImpl.createHTMLDocument("My Title");

This requires us to instantiate internal implementation classes directly, which makes our implementation dependent on Xerces.

(Note: I also noticed that Xerces has an internal HTMLBuilder (which implements the deprecated DocumentHandler) that can supposedly generate an HTMLDocument using a SAX parser, but I didn't bother looking into it.)

2. org.w3c.dom.html.HTMLDocument does not generate proper XHTML.

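Although you can search the HTMLDocument tree with getElementsByTagName(String tagname) in a case-insensitive manner, all element names are stored internally in ALL CAPS. XHTML element and attribute names, however, are supposed to be in all lowercase. (This could be worked around by traversing the entire document tree and using Document's renameNode() method to change every element name to lowercase.)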
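
The renameNode() workaround can be sketched as follows. This is a minimal illustration on a plain JDK Document rather than on Xerces' HTMLDocumentImpl; the class name LowercaseRenamer and the sample tree are assumptions made for the example.

```java
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper: walks a DOM tree and lowercases every element name.
public class LowercaseRenamer {

    public static void lowercaseElementNames(Document doc, Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String name = node.getNodeName();
            String lower = name.toLowerCase(Locale.ROOT);
            if (!name.equals(lower)) {
                // renameNode may return a new node, so continue with its result.
                node = doc.renameNode(node, node.getNamespaceURI(), lower);
            }
        }
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember it in case the child is replaced
            lowercaseElementNames(doc, child);
            child = next;
        }
    }

    // Builds <HTML><BODY><P>Hello</P></BODY></HTML> with uppercase names.
    public static Document buildUppercaseSample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("HTML");
            Element body = doc.createElement("BODY");
            Element p = doc.createElement("P");
            p.setTextContent("Hello");
            body.appendChild(p);
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create a DocumentBuilder", e);
        }
    }

    public static void main(String[] args) {
        Document doc = buildUppercaseSample();
        lowercaseElementNames(doc, doc);
        System.out.println(doc.getDocumentElement().getTagName());    // html
        System.out.println(doc.getElementsByTagName("p").getLength()); // 1
    }
}
```

Note that a qualified name with a prefix would have its prefix lowercased as well, which may not be what you want for namespaced documents.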

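Also, an XHTML document is supposed to have a proper DOCTYPE declaration and an xmlns declaration for the XHTML namespace. There doesn't seem to be a straightforward way to set these in an HTMLDocument (unless you do some fiddling with the internal Xerces implementation).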
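
For comparison, a plain org.w3c.dom.Document can be given both through the standard DOMImplementation interface. The following is only a sketch under assumptions (the class name XhtmlSkeleton is made up, and it does not involve HTMLDocument at all):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;

// Hypothetical helper: creates an empty XHTML document with DOCTYPE and namespace.
public class XhtmlSkeleton {

    public static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    public static Document newXhtmlDocument() {
        try {
            DOMImplementation impl = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .getDOMImplementation();
            // DOCTYPE for XHTML 1.0 Strict: qualified name, public id, system id.
            DocumentType docType = impl.createDocumentType("html",
                    "-//W3C//DTD XHTML 1.0 Strict//EN",
                    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd");
            // Creates the document together with its <html> root in the XHTML namespace.
            return impl.createDocument(XHTML_NS, "html", docType);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create a DocumentBuilder", e);
        }
    }

    public static void main(String[] args) {
        Document doc = newXhtmlDocument();
        System.out.println(doc.getDoctype().getPublicId());
        System.out.println(doc.getDocumentElement().getNamespaceURI());
    }
}
```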

3. org.w3c.dom.html.HTMLDocument has little documentation, and the Xerces implementation of the interface seems to be incomplete.

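I haven't scoured the entire Internet, but the only documentation I found for HTMLDocument was the JavaDocs linked earlier and the comments in the source code of the internal Xerces implementation. In those comments I also found notes that several parts of the interface are not implemented. (Side note: I really get the impression that the org.w3c.dom.html.HTMLDocument interface itself isn't really used by anyone and may be incomplete itself.)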

For these reasons, I think it's better to avoid org.w3c.dom.html.HTMLDocument and do what we can with org.w3c.dom.Document. What can we do?

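One approach is to extend org.apache.xerces.dom.DocumentImpl (which extends org.apache.xerces.dom.CoreDocumentImpl, which implements org.w3c.dom.Document). This approach doesn't require much code, but it still makes our implementation dependent on Xerces, since we are extending DocumentImpl. In our MyHTMLDocumentImpl, we simply convert all tag names to lowercase on element creation and on search. This allows Document#getElementsByTagName(String tagname) to be used in a case-insensitive manner.

MyHTMLDocumentImpl: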

import org.apache.xerces.dom.DocumentImpl;
import org.apache.xerces.dom.DocumentTypeImpl;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

//a base class somewhere in the hierarchy implements org.w3c.dom.Document
public class MyHTMLDocumentImpl extends DocumentImpl {

    private static final long serialVersionUID = 1658286253541962623L;


    /**
     * Creates a Document with basic elements required to meet
     * the <a href="http://www.w3.org/TR/xhtml1/#strict">XHTML standards</a>.
     * <pre>
     * {@code
     * <?xml version="1.0" encoding="UTF-8"?>
     * <!DOCTYPE html 
     *     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
     *     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
     * <html xmlns="http://www.w3.org/1999/xhtml">
     *     <head>
     *         <title>My Title</title>
     *     </head>
     *     <body/>
     * </html>
     * }
     * </pre>
     * 
     * @param title desired text content for title tag. If null, no text will be added.
     * @return basic HTML Document. 
     */
    public static Document makeBasicHtmlDoc(String title) {
        Document htmlDoc = new MyHTMLDocumentImpl();
        DocumentType docType = new DocumentTypeImpl(null, "html",
                "-//W3C//DTD XHTML 1.0 Strict//EN",
                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd");
        htmlDoc.appendChild(docType);
        Element htmlElement = htmlDoc.createElementNS("http://www.w3.org/1999/xhtml", "html");
        htmlDoc.appendChild(htmlElement);
        Element headElement = htmlDoc.createElement("head");
        htmlElement.appendChild(headElement);
        Element titleElement = htmlDoc.createElement("title");
        if(title != null)
            titleElement.setTextContent(title);
        headElement.appendChild(titleElement);
        Element bodyElement = htmlDoc.createElement("body");
        htmlElement.appendChild(bodyElement);

        return htmlDoc;
    }

    /**
     * This method allows us to create our
     * MyHTMLDocumentImpl from an existing Document.
     */
    public static Document createFrom(Document doc) {
        Document htmlDoc = new MyHTMLDocumentImpl();
        DocumentType originDocType = doc.getDoctype();
        if(originDocType != null) {
            DocumentType docType = new DocumentTypeImpl(null, originDocType.getName(),
                    originDocType.getPublicId(),
                    originDocType.getSystemId());
            htmlDoc.appendChild(docType);
        }
        Node docElement = doc.getDocumentElement();
        if(docElement != null) {
            Node copiedDocElement = docElement.cloneNode(true);
            htmlDoc.adoptNode(copiedDocElement);
            htmlDoc.appendChild(copiedDocElement);
        }
        return htmlDoc;
    }

    private MyHTMLDocumentImpl() {
        super();
    }

    @Override
    public Element createElement(String tagName) throws DOMException {
        return super.createElement(tagName.toLowerCase());
    }

    @Override
    public Element createElementNS(String namespaceURI, String qualifiedName) throws DOMException {
        return super.createElementNS(namespaceURI, qualifiedName.toLowerCase());
    }

    @Override
    public NodeList getElementsByTagName(String tagname) {
        return super.getElementsByTagName(tagname.toLowerCase());
    }

    @Override
    public NodeList getElementsByTagNameNS(String namespaceURI, String localName) {
        return super.getElementsByTagNameNS(namespaceURI, localName.toLowerCase());
    }

    @Override
    public Node renameNode(Node n, String namespaceURI, String qualifiedName) throws DOMException {
        return super.renameNode(n, namespaceURI, qualifiedName.toLowerCase());
    }
}

Tester:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;


public class HTMLDocumentTest {

    private final static int P_ELEMENT_NUM = 3;

    public static void main(String[] args) //I'm throwing all my exceptions here to shorten the example, but obviously you should handle them appropriately.
            throws ClassNotFoundException, InstantiationException, IllegalAccessException, ClassCastException, IOException {

        Document htmlDoc = MyHTMLDocumentImpl.makeBasicHtmlDoc("My Title");

        //populate the html doc with some example content
        Element bodyElement = (Element) htmlDoc.getElementsByTagName("body").item(0);
        for(int i = 0; i < P_ELEMENT_NUM; ++i) {
            Element pElement = htmlDoc.createElement("p");
            String id = Integer.toString(i+1);
            pElement.setAttribute("id", "anId"+id);
            pElement.setTextContent("Here is some text"+id+".");
            bodyElement.appendChild(pElement);
        }

        //get the title element in a case insensitive manner.
        NodeList titleNodeList = htmlDoc.getElementsByTagName("tItLe");
        for(int i = 0; i < titleNodeList.getLength(); ++i)
            System.out.println(titleNodeList.item(i).getTextContent());

        System.out.println();

        {//get all p elements searching with lowercase
            NodeList pNodeList = htmlDoc.getElementsByTagName("p");
            for(int i = 0; i < pNodeList.getLength(); ++i) {
                System.out.println(pNodeList.item(i).getTextContent());
            }
        }

        System.out.println();

        {//get all p elements searching with uppercase
            NodeList pNodeList = htmlDoc.getElementsByTagName("P");
            for(int i = 0; i < pNodeList.getLength(); ++i) {
                System.out.println(pNodeList.item(i).getTextContent());
            }
        }

        System.out.println();

        //to serialize
        DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
        DOMImplementationLS domImplLS = (DOMImplementationLS) registry.getDOMImplementation("LS");

        LSSerializer lsSerializer = domImplLS.createLSSerializer();
        DOMConfiguration domConfig = lsSerializer.getDomConfig();
        domConfig.setParameter("format-pretty-print", true);  //if you want it pretty and indented

        LSOutput lsOutput = domImplLS.createLSOutput();
        lsOutput.setEncoding("UTF-8");

        //to write to file
        try (OutputStream os = new FileOutputStream(new File("myFile.html"))) {
            lsOutput.setByteStream(os);
            lsSerializer.write(htmlDoc, lsOutput);
        }

        //to print to screen
        System.out.println(lsSerializer.writeToString(htmlDoc)); 
    }

}

Output:

My Title

Here is some text1.
Here is some text2.
Here is some text3.

Here is some text1.
Here is some text2.
Here is some text3.

<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>My Title</title>
    </head>
    <body>
        <p id="anId1">Here is some text1.</p>
        <p id="anId2">Here is some text2.</p>
        <p id="anId3">Here is some text3.</p>
    </body>
</html>

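Another approach, similar to the one above, is to create a Document wrapper that wraps a Document object and implements the Document interface itself. This requires more code than the "extending DocumentImpl" approach, but it is "cleaner" because we don't have to care about the particular Document implementation. The extra code isn't difficult; it's just a bit tedious to provide wrapper implementations for all of the Document methods. I haven't completely worked this out yet and there may be some problems, but if it works, this is the general idea: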

public class MyHTMLDocumentWrapper implements Document {

    private Document doc;

    public MyHTMLDocumentWrapper(Document doc) {
        //...
        this.doc = doc;
        //...
    }

    //...
}
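
If writing every delegating method by hand is too tedious, one possible shortcut is a java.lang.reflect.Proxy that forwards all calls to the wrapped Document and lowercases only the tag-name arguments. This swaps a dynamic proxy in for the hand-written wrapper described above, and the class name LowercasingDocumentProxy is an assumption for illustration. It is a sketch, not a complete solution: nodes returned by the delegate are not wrapped, so their getOwnerDocument() still returns the original Document rather than the proxy.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: a dynamic-proxy variant of the Document wrapper idea.
public class LowercasingDocumentProxy {

    // Position of the tag-name argument for the methods we intercept, or -1.
    private static int tagNameArgIndex(String methodName) {
        switch (methodName) {
            case "createElement":
            case "getElementsByTagName":
                return 0;
            case "createElementNS":
            case "getElementsByTagNameNS":
                return 1;
            case "renameNode":
                return 2;
            default:
                return -1;
        }
    }

    public static Document wrap(Document delegate) {
        InvocationHandler handler = (proxy, method, args) -> {
            int index = tagNameArgIndex(method.getName());
            if (args != null && index >= 0 && index < args.length
                    && args[index] instanceof String) {
                args[index] = ((String) args[index]).toLowerCase(Locale.ROOT);
            }
            try {
                return method.invoke(delegate, args); // forward everything else unchanged
            } catch (InvocationTargetException e) {
                throw e.getCause(); // surface the original DOMException, if any
            }
        };
        return (Document) Proxy.newProxyInstance(
                LowercasingDocumentProxy.class.getClassLoader(),
                new Class<?>[] { Document.class },
                handler);
    }

    public static Document newEmptyDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create a DocumentBuilder", e);
        }
    }

    public static void main(String[] args) {
        Document doc = wrap(newEmptyDocument());
        Element root = doc.createElement("HTML");
        doc.appendChild(root);
        root.appendChild(doc.createElement("P"));
        System.out.println(root.getTagName());                          // html
        System.out.println(doc.getElementsByTagName("P").getLength()); // 1
    }
}
```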

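Whether you go with org.w3c.dom.html.HTMLDocument, one of the approaches I mentioned above, or something else, maybe these suggestions will give you an idea of how to proceed.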

Edit:

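In my parsing tests, when I tried to parse the following XHTML file, Xerces would hang in the entity-management class while trying to open an HTTP connection. Why, I don't know, especially since I was testing on a local HTML file with no entities. (Maybe it has something to do with the DOCTYPE or the namespace?) Here is the document: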

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC 
    "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>My Title</title>
    </head>
    <body>
        <p id="anId1">Here is some text1.</p>
        <p id="anId2">Here is some text2.</p>
        <p id="anId3">Here is some text3.</p>
    </body>
</html>
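
One plausible explanation, which is an assumption and not something confirmed in the answer above, is that the parser tries to download the DTD named in the DOCTYPE even when it is not validating. If that is the cause, a common workaround is to install an EntityResolver that answers every external entity request with empty content, so no HTTP connection is opened. A minimal sketch (the class name OfflineXhtmlParser is made up):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: parses XHTML without fetching the external DTD.
public class OfflineXhtmlParser {

    public static Document parseWithoutFetchingDtd(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Resolve every external entity (including the DTD) to an empty stream.
            builder.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader("")));
            return builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XHTML input", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n"
                + "    \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n"
                + "<head><title>My Title</title></head>\n"
                + "<body><p id=\"anId1\">Here is some text1.</p></body>\n"
                + "</html>";
        Document doc = parseWithoutFetchingDtd(xhtml);
        System.out.println(doc.getElementsByTagName("title").item(0).getTextContent());
    }
}
```

With an empty DTD, named character entities such as &nbsp; would no longer be defined, so this shortcut only suits documents that, like the one above, use no entities.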

Regarding java - How do I build an HTML org.w3c.dom.Document?, the original question is on Stack Overflow: https://stackoverflow.com/questions/29041855/
