
python - Is there a master list somewhere of the tags in MHTML files and their meanings?

Reposted. Author: 行者123. Updated: 2023-11-28 19:27:10

I am trying to read and extract data from xls files that are actually Single File Web Pages, as the header of each file states:

This document is a Single File Web Page, also known as a Web Archive file.  

I am trying to find out what all of the tags mean so that I can be sure I am parsing them correctly with lxml.

For example, here is one of the tags:

 <th class=3Dtl colspan=3D1 rowspan=3D2

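Although I have successfully processed the few files I have been experimenting with, I want to find out whether I am making assumptions that will come back to haunt me later. A list of these tags and their meanings would therefore be very helpful.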
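A note on the `=3D` in the tag above: it is not part of the tag vocabulary. Single File Web Pages are MIME containers, and their HTML part is usually stored with the quoted-printable transfer encoding, in which `=3D` stands for a literal `=`. The sketch below shows one way to decode the HTML part before parsing, using only the standard library. The `SAMPLE_MHTML` payload, the boundary string, and the helper names `extract_html` and `collect_tags` are made up for illustration, and a real Excel export will contain more parts and headers than this.

```python
import email
from html.parser import HTMLParser

# Hypothetical miniature MHTML payload (assumption: real files follow the
# same MIME layout, with an HTML part encoded as quoted-printable).
SAMPLE_MHTML = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/related; boundary="----=_NextPart_demo"\n'
    "\n"
    "------=_NextPart_demo\n"
    'Content-Type: text/html; charset="us-ascii"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "<table><tr><th class=3Dtl colspan=3D1 rowspan=3D2>Name</th></tr></table>\n"
    "------=_NextPart_demo--\n"
)


def extract_html(mhtml_text):
    """Return the decoded text of the first text/html part, or None."""
    message = email.message_from_string(mhtml_text)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # decode=True undoes the Content-Transfer-Encoding (=3D becomes =).
            raw = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return raw.decode(charset, errors="replace")
    return None


class TagCollector(HTMLParser):
    """Record every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


def collect_tags(html_text):
    """Return a list of (tag name, attribute dict) pairs found in the HTML."""
    collector = TagCollector()
    collector.feed(html_text)
    return collector.tags


if __name__ == "__main__":
    html = extract_html(SAMPLE_MHTML)
    for name, attributes in collect_tags(html):
        print(name, attributes)
```

Once the part is decoded, the resulting string is ordinary HTML and can be handed to lxml in the same way. Collecting the distinct tag and attribute names across a batch of files is a cheap way to check which tags actually occur before relying on assumptions about them.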

Best answer

If the MHTML was generated by Microsoft Word, it is probably a combination of WordprocessingML and HTML4 tags.

The top-level elements in a WordprocessingML document are:

SmartTagType element describes a Smart Tag type used in the document.
DocumentProperties element contains Office Document Properties.
CustomDocumentProperties element contains Custom Office Document Properties.
schemaLibrary element defines a collection of schemas that comprise a document's schema library.
fonts element (wordDocumentElt complexType) contains font information.
frameset element (wordDocumentElt complexType) contains HTML Frameset definitions.
styles element (wordDocumentElt complexType) contains style definitions.
divs element contains HTML DIV information.
shapeDefaults element contains drawing defaults.
docOleData element contains supplemental data containing storages for OLE objects.
docSuppData element contains supplemental data containing toolbar customizations, envelope data, and the Microsoft Visual Basic project.
docPr element contains document options.
shapeDefaults element contains the wrapper representing the shape defaults.
bgPict element contains background picture information.
body element contains the document body.

However, the simplest WordprocessingML document consists of just five elements (and a single namespace). The five elements are:

wordDocument element: The root element for a WordprocessingML document.
body element: The container for the displayable text.
p element: A paragraph.
r element: A contiguous set of WordprocessingML components with a consistent set of properties.
t element: A piece of text.
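As a rough illustration of the five-element structure listed above, the following sketch builds a minimal document and extracts the text of each paragraph with the standard library. It assumes the Word 2003 WordprocessingML namespace (`http://schemas.microsoft.com/office/word/2003/wordml`); files produced by other Office versions may use different namespaces, so treat `W_NS` and the helper name `paragraph_texts` as illustrative rather than definitive.

```python
import xml.etree.ElementTree as ET

# Assumption: the Word 2003 WordprocessingML namespace. Check the xmlns
# declarations in your own files before relying on this value.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"

# A minimal document using only wordDocument, body, p, r and t.
MINIMAL_DOC = f"""<?xml version="1.0"?>
<w:wordDocument xmlns:w="{W_NS}">
  <w:body>
    <w:p>
      <w:r>
        <w:t>Hello, World.</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>"""


def paragraph_texts(xml_text):
    """Return the concatenated text of the t elements in each p element."""
    namespaces = {"w": W_NS}
    root = ET.fromstring(xml_text)
    return [
        "".join(t.text or "" for t in paragraph.iterfind(".//w:t", namespaces))
        for paragraph in root.iterfind(".//w:p", namespaces)
    ]


if __name__ == "__main__":
    print(paragraph_texts(MINIMAL_DOC))
```

The same namespace-aware lookups work with lxml, since it accepts a prefix-to-URI mapping in the same style.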

Regarding "python - Is there a master list somewhere of the tags in MHTML files and their meanings?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/7488243/
