c# - Parsing an HTML document: Regular expression or LINQ?

Reposted by: 太空狗. Updated: 2023-10-29 18:19:38

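I am trying to parse an HTML document and extract some elements (any links to text files).

The current strategy is to load the HTML document into a string and then find all instances of links to text files. The links could point to any file type, but for this question it is a text file.

The end goal is an IEnumerable list of string objects. That part is easy; parsing the data is the problem.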

<html>
<head><title>Blah</title>
</head>
<body>
<br/>
<div>Here is your first text file: <a href="http://myServer.com/blah.txt"></div>
<span>Here is your second text file: <a href="http://myServer.com/blarg2.txt"></span>
<div>Here is your third text file: <a href="http://myServer.com/bat.txt"></div>
<div>Here is your fourth text file: <a href="http://myServer.com/somefile.txt"></div>
<div>Thanks for visiting!</div>
</body>
</html>

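The initial approaches were:

Load the string into an XML document and attack it in a LINQ to XML fashion.
Create a regular expression that looks for a string starting with href= and ending with .txt.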
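
The first approach might look roughly like the sketch below. It is only an illustration: the class and method names (LinqToXmlLinks, GetTxtLinks) are made up for this example, and it assumes the markup is well-formed XML, which means every <a> tag has to be closed, unlike in the sample above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class LinqToXmlLinks
{
    // Hypothetical helper: returns every href value that ends with ".txt".
    // Assumes well-formed XML; XDocument.Parse throws XmlException otherwise.
    public static IEnumerable<string> GetTxtLinks(string xhtml)
    {
        return XDocument.Parse(xhtml)
            .Descendants("a")                          // all <a> elements (no namespace)
            .Select(a => (string)a.Attribute("href"))  // null when href is missing
            .Where(href => href != null &&
                           href.EndsWith(".txt", StringComparison.OrdinalIgnoreCase))
            .ToList();
    }

    static void Main()
    {
        // Well-formed stand-in for the sample page: the <a> tags are closed.
        string xhtml = @"<html><body>
<div>First: <a href=""http://myServer.com/blah.txt""></a></div>
<div>Second: <a href=""http://myServer.com/bat.txt.snarg""></a></div>
</body></html>";

        foreach (string link in GetTxtLinks(xhtml))
            Console.WriteLine(link);
    }
}
```

The result is already an IEnumerable of strings, which matches the stated end goal, but the strict parsing requirement is the main limitation of this route.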

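The questions are:

What would that regex look like? I am a regex newbie, and this is part of my regex learning.
Which method would you use to extract the list of tags?
Which would be the most efficient?
Which would be the most readable/maintainable?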

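Update: Thanks to Matthew for the HTML Agility Pack suggestion. It worked just fine! The XPath suggestion works as well. I wish I could mark both answers as "the answer", but I obviously can't. They are both valid solutions to the problem.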

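Here is a C# console application that uses the regex suggested by Jeff. It reads the string fine and does not include any href that does not end with .txt. For the given sample, it correctly leaves the .txt.snarg file out of the results (see the string returned by the BuildHtmlString function).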

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;

namespace ParsePageLinks
{
    class Program
    {
        static void Main(string[] args)
        {
            GetAllLinksFromStringByRegex();
        }

        static List<string> GetAllLinksFromStringByRegex()
        {
            string myHtmlString = BuildHtmlString();
            string txtFileExp = "href=\"([^\\\"]*\\.txt)\"";

            List<string> foundTextFiles = new List<string>();

            MatchCollection textFileLinkMatches = Regex.Matches(myHtmlString, txtFileExp, RegexOptions.IgnoreCase);
            foreach (Match m in textFileLinkMatches)
            {
                foundTextFiles.Add( m.Groups[1].ToString()); // this is your captured group
            }

            return foundTextFiles;
        }

        static string BuildHtmlString()
        {
            return new StringReader(@"<html><head><title>Blah</title></head><body><br/>
<div>Here is your first text file: <a href=""http://myServer.com/blah.txt""></div>
<span>Here is your second text file: <a href=""http://myServer.com/blarg2.txt""></span>
<div>Here is your third text file: <a href=""http://myServer.com/bat.txt.snarg""></div>
<div>Here is your fourth text file: <a href=""http://myServer.com/somefile.txt""></div>
<div>Thanks for visiting!</div></body></html>").ReadToEnd();
        }
    }
}

Best answer

Neither. Load it into an (X/HT)MLDocument and use XPath, which is a standard and very powerful way of working with XML. The functions to look at are SelectNodes and SelectSingleNode.

Since you are apparently using HTML (not XHTML), you should use the HTML Agility Pack. Most of its methods and properties match the related XML classes.
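
To illustrate why the HTML/XHTML distinction matters, here is a minimal sketch that uses the built-in System.Xml classes instead of the HTML Agility Pack. The class and helper names (XPathLinks, GetTxtLinks) and the shortened markup are assumptions made for this example. It shows SelectNodes and SelectSingleNode on well-formed input, and the XmlException that a strict XML parser raises for an unclosed <a> tag like the ones in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class XPathLinks
{
    // Hypothetical helper: collects href values ending in ".txt" using XPath 1.0.
    public static List<string> GetTxtLinks(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException when the markup is not well-formed

        List<string> result = new List<string>();
        // XPath 1.0 has no ends-with(), so compare the last four characters of href.
        XmlNodeList links = doc.SelectNodes(
            "//a[@href['.txt' = substring(., string-length(.) - 3)]]");
        foreach (XmlNode link in links)
            result.Add(link.Attributes["href"].Value);
        return result;
    }

    static void Main()
    {
        // Well-formed variant of the sample: every <a> is closed.
        string wellFormed = @"<html><body>
<div>First: <a href=""http://myServer.com/blah.txt""></a></div>
<div>Second: <a href=""http://myServer.com/bat.txt.snarg""></a></div>
</body></html>";
        foreach (string href in GetTxtLinks(wellFormed))
            Console.WriteLine(href);

        // SelectSingleNode returns the first match, or null when nothing matches.
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(wellFormed);
        XmlNode first = doc.SelectSingleNode("//a[@href]");
        Console.WriteLine(first.Attributes["href"].Value);

        // An unclosed <a>, as in the question's markup, is rejected by the XML parser.
        try
        {
            GetTxtLinks(@"<div><a href=""http://myServer.com/blah.txt""></div>");
        }
        catch (XmlException ex)
        {
            Console.WriteLine("Not well-formed: " + ex.Message);
        }
    }
}
```

A lenient HTML parser is the better fit for real-world pages because it tolerates markup that this strict loader refuses.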

Sample implementation using XPath:

    HtmlDocument doc = new HtmlDocument();
    doc.Load(new StringReader(@"<html>
<head><title>Blah</title>
</head>
<body>
<br/>
<div>Here is your first text file: <a href=""http://myServer.com/blah.txt""></div>
<span>Here is your second text file: <a href=""http://myServer.com/blarg2.txt""></span>
<div>Here is your third text file: <a href=""http://myServer.com/bat.txt""></div>
<div>Here is your fourth text file: <a href=""http://myServer.com/somefile.txt""></div>
<div>Thanks for visiting!</div>
</body>
</html>"));
    HtmlNode root = doc.DocumentNode;
    // 3 = ".txt".Length - 1.  See http://stackoverflow.com/questions/402211/how-to-use-xpath-function-in-a-xpathexpression-instance-programatically
    HtmlNodeCollection links = root.SelectNodes("//a[@href['.txt' = substring(., string-length(.)- 3)]]");
    IList<string> fileStrings;
    if (links != null)
    {
        fileStrings = new List<string>(links.Count);
        foreach (HtmlNode link in links)
            fileStrings.Add(link.GetAttributeValue("href", null));
    }
    else
        fileStrings = new List<string>(0);
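
The substring(., string-length(.) - 3) test stands in for an ends-with check, which XPath 1.0 does not provide. As a rough way to see it in isolation, the sketch below evaluates the same expression against literal strings with the standard System.Xml.XPath classes. The class and helper names (EndsWithCheck, EndsWithTxt) are invented for this example, and the helper assumes the value contains no single quote because it is spliced directly into the expression.

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

static class EndsWithCheck
{
    // Hypothetical helper: applies the XPath 1.0 "last four characters" test to a literal.
    // Assumes value contains no single quote, since it is embedded in the expression text.
    public static bool EndsWithTxt(string value)
    {
        XPathNavigator nav = new XmlDocument().CreateNavigator();
        string expr = "'.txt' = substring('" + value + "', string-length('" + value + "') - 3)";
        return (bool)nav.Evaluate(expr); // boolean XPath expressions evaluate to System.Boolean
    }

    static void Main()
    {
        Console.WriteLine(EndsWithTxt("http://myServer.com/blah.txt"));
        Console.WriteLine(EndsWithTxt("http://myServer.com/bat.txt.snarg"));
    }
}
```

Note that XPath string comparison is case-sensitive, so this test would skip a link ending in .TXT, whereas the regex version above was run with RegexOptions.IgnoreCase.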

Regarding c# - Parsing an HTML document: Regular expression or LINQ?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/907563/
