php - How do you parse and process HTML/XML in PHP?

Reposted by: 行者123 | Updated: 2023-12-01 16:45:38

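How can one parse HTML/XML and extract information from it?

Best answer

Native XML Extensions
I prefer using one of the native XML extensions since they come bundled with PHP, are usually faster than all the 3rd-party libs, and give me all the control I need over the markup.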
DOM

The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. It is an implementation of the W3C's Document Object Model Core Level 3, a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents.

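DOM is capable of parsing and modifying real-world (broken) HTML, and it can do XPath queries. It is based on libxml.
It takes some time to get productive with DOM, but that time is well worth it IMO. Since DOM is a language-agnostic interface, you'll find implementations in many languages, so if you need to change your programming language, chances are you will already know how to use that language's DOM API.
A basic usage example can be found in Grabbing the href attribute of an A element, and a general conceptual overview can be found at DOMDocument in php.
How to use the DOM extension has been covered extensively on StackOverflow, so if you choose to use it, you can be sure most of the issues you run into can be solved by searching/browsing Stack Overflow.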
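As a rough illustration of the DOM approach described above, the following sketch loads a deliberately broken HTML fragment and collects the href attributes of its links with XPath. The sample markup and the variable names are made up for this example; only DOMDocument, DOMXPath and the libxml error functions come from PHP itself.

```php
<?php
// Hypothetical sample input: unclosed <p> tags, an unquoted attribute value,
// and no <html>/<body> wrapper.
$html = '<p>First <a href=/docs>docs</a>'
      . '<p>Second <a href="https://example.com/">example</a>';

$dom = new DOMDocument();

// Collect libxml's complaints about the markup instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Select every <a> element that carries an href attribute.
$xpath = new DOMXPath($dom);
$hrefs = [];
foreach ($xpath->query('//a[@href]') as $anchor) {
    $hrefs[] = $anchor->getAttribute('href');
}

print_r($hrefs);
```

loadHTML() is what routes the input through libxml's HTML parser; load() and loadXML() would expect well-formed XML instead.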
XMLReader

The XMLReader extension is an XML pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.

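XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML Parser Module, so using XMLReader for parsing broken HTML is likely to be less robust than using DOM, where you can explicitly tell it to use libxml's HTML Parser Module.
A basic usage example can be found at getting all values from h1 tags using php.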
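A minimal sketch of the pull-parsing style, assuming a small well-formed document (the sample XML and the variable names are invented for illustration): the reader moves forward node by node and the loop picks out the text of each h1 element.

```php
<?php
// Hypothetical, well-formed sample input.
$xml = '<page><h1>First heading</h1><p>Some text</p><h1>Second heading</h1></page>';

$reader = new XMLReader();
$reader->XML($xml);

$headings = [];
// read() advances the cursor to the next node and returns false at the end.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'h1') {
        // readString() returns the text content of the current element.
        $headings[] = $reader->readString();
    }
}
$reader->close();

print_r($headings);
```

For large files, open() with a file path can be used in place of XML() so the document is streamed rather than held in a string.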
XML Parser

This extension lets you create XML parsers and then define handlers for different XML events. Each XML parser also has a few parameters you can adjust.

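The XML Parser library is also based on libxml and implements a SAX-style XML push parser. It may be a better choice for memory management than DOM or SimpleXML, but it will be more difficult to work with than the pull parser implemented by XMLReader.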
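To show what the push model looks like in practice, here is a small sketch that registers handlers and lets the parser call them. The sample XML, the element name item and the variable names are assumptions made for this example.

```php
<?php
// Hypothetical sample input.
$xml = '<feed><item>Alpha</item><item>Beta</item></feed>';

$items = [];
$insideItem = false;

$parser = xml_parser_create();
// Keep element names as written instead of upper-casing them (the default).
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Called for every opening tag.
    function ($parser, string $name, array $attributes) use (&$insideItem, &$items) {
        if ($name === 'item') {
            $insideItem = true;
            $items[] = '';
        }
    },
    // Called for every closing tag.
    function ($parser, string $name) use (&$insideItem) {
        if ($name === 'item') {
            $insideItem = false;
        }
    }
);

// Text may arrive in several chunks, so append rather than assign.
xml_set_character_data_handler(
    $parser,
    function ($parser, string $data) use (&$insideItem, &$items) {
        if ($insideItem) {
            $items[count($items) - 1] .= $data;
        }
    }
);

if (xml_parse($parser, $xml, true) !== 1) {
    throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
}

print_r($items);
```

The state flags that the handlers have to share are the main reason this style tends to be harder to work with than XMLReader's cursor.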
SimpleXml

The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.

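SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXml because it will choke.
A basic usage example can be found at A simple program to CRUD node and node values of xml file, and there are lots of additional examples in the PHP Manual.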
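The sketch below illustrates both halves of that advice under made-up sample data: well-formed XML maps onto properties and array-style attribute access, while a fragment with a mismatched tag simply fails to load.

```php
<?php
// Hypothetical, well-formed sample input.
$xml = '<library>'
     . '<book id="b1"><title>PHP Basics</title></book>'
     . '<book id="b2"><title>XML in Depth</title></book>'
     . '</library>';

$library = simplexml_load_string($xml);
if ($library === false) {
    throw new RuntimeException('Could not parse the sample XML');
}

// Child elements are properties, attributes use array syntax.
$titles = [];
foreach ($library->book as $book) {
    $titles[(string) $book['id']] = (string) $book->title;
}

// Not well-formed: <br> is never closed, so the XML parser rejects the input.
$previous = libxml_use_internal_errors(true);
$broken = simplexml_load_string('<p>Unclosed paragraph<br></p>');
libxml_clear_errors();
libxml_use_internal_errors($previous);

print_r($titles);
var_dump($broken);
```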

3rd-Party Libraries (libxml-based)
If you prefer to use a 3rd-party lib, I'd suggest using a lib that actually uses DOM/libxml underneath instead of string parsing.
FluentDom - Repo

FluentDOM provides a jQuery-like fluent XML interface for the DOMDocument in PHP. Selectors are written in XPath or CSS (using a CSS to XPath converter). Current versions extend the DOM implementing standard interfaces and add features from the DOM Living Standard. FluentDOM can load formats like JSON, CSV, JsonML, RabbitFish and others. Can be installed via Composer.

HtmlPageDom

Wa72\HtmlPageDom is a PHP library for easy manipulation of HTML documents using DOM. It requires DomCrawler from Symfony2 components for traversing the DOM tree and extends it by adding methods for manipulating the DOM tree of HTML documents.

phpQuery (not updated for years)

phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library written in PHP5 and provides additional Command Line Interface (CLI).

See also: https://github.com/electrolinux/phpquery
Zend_Dom

Zend_Dom provides tools for working with DOM documents and structures. Currently, we offer Zend_Dom_Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.

QueryPath

QueryPath is a PHP library for manipulating XML and HTML. It is designed to work not only with local files, but also with web services and database resources. It implements much of the jQuery interface (including CSS-style selectors), but it is heavily tuned for server-side use. Can be installed via Composer.

fDOMDocument

fDOMDocument extends the standard DOM to use exceptions at all occasions of errors instead of PHP warnings or notices. They also add various custom methods and shortcuts for convenience and to simplify the usage of DOM.

sabre/xml

sabre/xml is a library that wraps and extends the XMLReader and XMLWriter classes to create a simple "xml to object/array" mapping system and design pattern. Writing and reading XML is single-pass and can therefore be fast and require low memory on large xml files.

FluidXML

FluidXML is a PHP library for manipulating XML with a concise and fluent API. It leverages XPath and the fluent programming pattern to be fun and effective.

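3rd-Party (not libxml-based)
The benefit of building upon DOM/libxml is that you get good performance out of the box because you are building on a native extension. However, not all 3rd-party libs go down this route. Some of them are listed below.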
PHP Simple HTML DOM Parser

An HTML DOM parser written in PHP5+ lets you manipulate HTML in a very easy way!

Require PHP 5+.

Supports invalid HTML.

Find tags on an HTML page with selectors just like jQuery.

Extract contents from HTML in a single line.

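I generally do not recommend this parser. The codebase is horrible and the parser itself is rather slow and memory hungry. Not all jQuery selectors (such as child selectors) are available. Any of the libxml-based libraries should outperform this easily.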
PHP Html Parser

PHPHtmlParser is a simple, flexible, html parser which allows you to select tags using any css selector, like jQuery. The goal is to assiste in the development of tools which require a quick, easy way to scrape html, whether it's valid or not! This project was original supported by sunra/php-simple-html-dom-parser but the support seems to have stopped so this project is my adaptation of his previous work.

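Again, I would not recommend this parser. It is rather slow, with high CPU usage. There is also no function to clear the memory of created DOM objects. These problems get worse with nested loops. The documentation itself is inaccurate and misspelled, and there have been no responses to fixes since 14 Apr 16.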
Ganon

A universal tokenizer and HTML/XML/RSS DOM Parser

   Ability to manipulate elements and their attributes

   Supports invalid HTML and UTF8

   Can perform advanced CSS3-like queries on elements (like jQuery -- namespaces supported)

A HTML beautifier (like HTML Tidy)

   Minify CSS and Javascript

   Sort attributes, change character case, correct indentation, etc.

Extensible

   Parsing documents using callbacks based on current character/token

   Operations separated in smaller functions for easy overriding

Fast and Easy

Never used it. Can't tell if it's any good.

HTML 5
You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you may want to consider using a dedicated parser, like
html5lib

A Python and PHP implementations of a HTML parser based on the WHATWG HTML5 specification for maximum compatibility with major desktop web browsers.

We might see more dedicated parsers once HTML5 is finalized. There is also a blog post by the W3 titled How-To for html 5 parsing that is worth checking out.
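One way to see the kind of quirk mentioned above is to feed HTML5 elements to the DOM extension and inspect what libxml reports. This is only a sketch with made-up sample markup; which messages appear, if any, depends on the libxml version PHP was built against, so the code prints them rather than assuming a particular result.

```php
<?php
// Hypothetical sample input using HTML5-only elements.
$html = '<!DOCTYPE html><html><body><article><h1>Title</h1>'
      . '<section>Body text</section></article></body></html>';

$dom = new DOMDocument();

$previous = libxml_use_internal_errors(true);
$dom->loadHTML($html);
// Keep whatever the parser complained about (possibly nothing).
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    echo trim($error->message), "\n";
}

// The elements still end up in the tree and can be queried as usual.
$section = $dom->getElementsByTagName('section')->item(0);
$sectionText = $section !== null ? $section->textContent : null;

echo $sectionText, "\n";
```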

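Web Services
If you don't feel like programming PHP, you can also use Web services. In general, I have found very little use for these, but that's just me and my use cases.
ScraperWiki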

ScraperWiki's external interface allows you to extract data in the form you want for use on the web or in your own applications. You can also extract information about the state of any scraper.

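Regular Expressions
Last and least recommended, you can extract data from HTML with regular expressions. In general, using regular expressions on HTML is discouraged.
Most of the snippets you will find on the web to match markup are brittle. In most cases they only work for a very particular piece of HTML. Tiny markup changes, like adding whitespace somewhere, or adding or changing attributes in a tag, can make the RegEx fail when it's not properly written. You should know what you are doing before using RegEx on HTML.
HTML parsers already know the syntactical rules of HTML. Regular expressions have to be taught those rules anew for each RegEx you write. RegEx are fine in some cases, but it really depends on your use case.
You can write more reliable parsers, but writing a complete and reliable custom parser with regular expressions is a waste of time when the aforementioned libraries already exist and do a much better job.
Also see Parsing Html The Cthulhu Way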
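The brittleness described above can be illustrated with a small comparison. The pattern, the sample links and the helper function names below are all hypothetical and chosen only for this sketch: a naive pattern matches one spelling of a link, while the DOM extension handles every variant.

```php
<?php
// A naive pattern of the kind often found in snippets on the web.
$pattern = '/<a href="([^"]*)">/';

// The same link, written in four equally valid ways.
$variants = [
    '<a href="/docs">docs</a>',
    '<a  href="/docs">docs</a>',             // extra whitespace
    "<a href='/docs'>docs</a>",              // single quotes
    '<a class="nav" href="/docs">docs</a>',  // additional attribute
];

// Hypothetical helper: extract hrefs with the regular expression.
function hrefsViaRegex(string $pattern, string $html): array
{
    preg_match_all($pattern, $html, $matches);
    return $matches[1];
}

// Hypothetical helper: extract hrefs with the DOM extension.
function hrefsViaDom(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $hrefs[] = $anchor->getAttribute('href');
    }
    return $hrefs;
}

$regexHits = 0;
$domHits = 0;
foreach ($variants as $html) {
    $regexHits += count(hrefsViaRegex($pattern, $html));
    $domHits += count(hrefsViaDom($html));
}

echo "regex: $regexHits of 4, DOM: $domHits of 4\n";
```

The pattern could be extended to cover each variant, but every extension is one more HTML rule the expression has to be taught by hand.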

Books
If you want to spend some money, have a look at

PHP Architect's Guide to Webscraping with PHP

I am not affiliated with PHP Architect or the authors.

Regarding "php - How do you parse and process HTML/XML in PHP?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/292926/
