C#: Safely truncating HTML for an article summary


Reposted by: 技术小花猫 · Updated: 2023-10-29 12:32:09

Does anyone have a C# variant of this?

So that I can take some HTML and display it, without breaking it, as the summary of an article?

Truncate text containing HTML, ignoring tags

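Save me from reinventing the wheel!

Edit

Sorry, I'm new here, and you're right, I should have phrased the question better. Here is some more information.

I want to take an HTML string and truncate it to a given number of words (or even a character length), so that I can show the start of it as a summary (which then links to the main article). I want to preserve the HTML so that links and the like still show in the preview.

The main problem I have to solve is that if we truncate in the middle of one or more tags, we are very likely to end up with unclosed HTML tags!

My solution is:

First truncate the HTML to N words (words are better, but characters are OK), making sure not to stop in the middle of a tag and cut off a required attribute.
Process the opening HTML tags in this truncated string (maybe push them onto a stack as I go?).
Then process the closing tags and make sure they match the ones on the stack as I pop them off?
If any open tags are left on the stack after this, write their closing tags to the end of the truncated string, and the HTML should be good to go!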
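The four steps above can be sketched in a single pass. This is only a minimal illustration, not the code posted further down: the class name HtmlSummarySketch, the method Truncate and the VoidTags list are hypothetical, and it assumes well-formed HTML without comments or scripts, tag names separated from attributes by a plain space, and entities counted as several characters rather than one.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class HtmlSummarySketch
{
    // Hypothetical list of elements that never take a closing tag.
    private static readonly HashSet<string> VoidTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "br", "hr", "img", "input", "meta", "link" };

    // Copies html until maxChars visible characters have been emitted,
    // then appends closing tags for every element that is still open.
    public static string Truncate(string html, int maxChars)
    {
        var output = new StringBuilder();
        var open = new Stack<string>();
        int visible = 0;
        int i = 0;

        while (i < html.Length && visible < maxChars)
        {
            if (html[i] != '<')
            {
                // Visible text: copy one character and count it.
                output.Append(html[i]);
                visible++;
                i++;
                continue;
            }

            int end = html.IndexOf('>', i);
            if (end < 0) break;                       // unterminated tag: drop it
            string tag = html.Substring(i, end - i + 1);
            output.Append(tag);                       // tags are copied whole, never cut
            i = end + 1;

            // Tag name without brackets, slash or attributes.
            string name = tag.Trim('<', '>', '/').Split(' ')[0];
            if (tag.StartsWith("</", StringComparison.Ordinal))
            {
                if (open.Count > 0 && open.Peek() == name) open.Pop();
            }
            else if (!tag.EndsWith("/>", StringComparison.Ordinal) && !VoidTags.Contains(name))
            {
                open.Push(name);
            }
        }

        // Whatever is left on the stack was opened but never closed.
        while (open.Count > 0)
        {
            output.Append("</").Append(open.Pop()).Append('>');
        }
        return output.ToString();
    }
}
```

Because the open tags are tracked while copying, no second regex pass over the truncated string is needed; the stack always holds exactly the elements that still need closing.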

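Edit 12/11/2009

Here is what I have thrown together so far as a unit test file in VS2008; this "may" help someone in the future.
My hack attempts based on Jan's code are at the top, as a character version and a word version (disclaimer: this is dirty, rough code on my part!).
I assume "well-formed" HTML in all cases (but not necessarily a full document with a root node, as the XML version requires).
Abel's XML version is at the bottom, but I have not yet got round to getting the tests to run against it fully (I also need to understand the code)...
I will update when I get the chance.
Having trouble posting code? Is there no upload facility on Stack?

Thanks for all the comments :)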

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.XPath;
using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace PINET40TestProject
{
    [TestClass]
    public class UtilityUnitTest
    {
        public static string TruncateHTMLSafeishChar(string text, int charCount)
        {
            bool inTag = false;
            int cntr = 0;
            int cntrContent = 0;

            // loop through html, counting only viewable content
            foreach (Char c in text)
            {
                if (cntrContent == charCount) break;
                cntr++;
                if (c == '<')
                {
                    inTag = true;
                    continue;
                }

                if (c == '>')
                {
                    inTag = false;
                    continue;
                }
                if (!inTag) cntrContent++;
            }

            string substr = text.Substring(0, cntr);

            //search for nonclosed tags        
            MatchCollection openedTags = new Regex("<[^/](.|\n)*?>").Matches(substr);
            MatchCollection closedTags = new Regex("<[/](.|\n)*?>").Matches(substr);

            // create stack          
            Stack<string> opentagsStack = new Stack<string>();
            Stack<string> closedtagsStack = new Stack<string>();

            // to be honest, this seemed like a good idea then I got lost along the way 
            // so logic is probably hanging by a thread!! 
            foreach (Match tag in openedTags)
            {
                string openedtag = tag.Value.Substring(1, tag.Value.Length - 2);
                // strip any attributes, sure we can use regex for this!
                if (openedtag.IndexOf(" ") >= 0)
                {
                    openedtag = openedtag.Substring(0, openedtag.IndexOf(" "));
                }

                // ignore brs as self-closed
                if (openedtag.Trim() != "br")
                {
                    opentagsStack.Push(openedtag);
                }
            }

            foreach (Match tag in closedTags)
            {
                string closedtag = tag.Value.Substring(2, tag.Value.Length - 3);
                closedtagsStack.Push(closedtag);
            }

            if (closedtagsStack.Count < opentagsStack.Count)
            {
                while (opentagsStack.Count > 0)
                {
                    string tagstr = opentagsStack.Pop();

                    if (closedtagsStack.Count == 0 || tagstr != closedtagsStack.Peek())
                    {
                        substr += "</" + tagstr + ">";
                    }
                    else
                    {
                        closedtagsStack.Pop();
                    }
                }
            }

            return substr;
        }

        public static string TruncateHTMLSafeishWord(string text, int wordCount)
        {
            bool inTag = false;
            int cntr = 0;
            int cntrWords = 0;
            Char lastc = ' ';

            // loop through html, counting only viewable content
            foreach (Char c in text)
            {
                if (cntrWords == wordCount) break;
                cntr++;
                if (c == '<')
                {
                    inTag = true;
                    continue;
                }

                if (c == '>')
                {
                    inTag = false;
                    continue;
                }
                if (!inTag)
                {
                    // a space outside a tag ends a word; consecutive spaces count only once
                    if (c == ' ' && lastc != ' ')
                        cntrWords++;
                    // remember the previous visible character so repeated spaces are detected
                    lastc = c;
                }
            }

            string substr = text.Substring(0, cntr) + " ...";

            //search for nonclosed tags        
            MatchCollection openedTags = new Regex("<[^/](.|\n)*?>").Matches(substr);
            MatchCollection closedTags = new Regex("<[/](.|\n)*?>").Matches(substr);

            // create stack          
            Stack<string> opentagsStack = new Stack<string>();
            Stack<string> closedtagsStack = new Stack<string>();

            foreach (Match tag in openedTags)
            {
                string openedtag = tag.Value.Substring(1, tag.Value.Length - 2);
                // strip any attributes, sure we can use regex for this!
                if (openedtag.IndexOf(" ") >= 0)
                {
                    openedtag = openedtag.Substring(0, openedtag.IndexOf(" "));
                }

                // ignore brs as self-closed
                if (openedtag.Trim() != "br")
                {
                    opentagsStack.Push(openedtag);
                }
            }

            foreach (Match tag in closedTags)
            {
                string closedtag = tag.Value.Substring(2, tag.Value.Length - 3);
                closedtagsStack.Push(closedtag);
            }

            if (closedtagsStack.Count < opentagsStack.Count)
            {
                while (opentagsStack.Count > 0)
                {
                    string tagstr = opentagsStack.Pop();

                    if (closedtagsStack.Count == 0 || tagstr != closedtagsStack.Peek())
                    {
                        substr += "</" + tagstr + ">";
                    }
                    else
                    {
                        closedtagsStack.Pop();
                    }
                }
            }

            return substr;
        }

        public static string TruncateHTMLSafeishCharXML(string text, int charCount)
        {
            // your data probably comes from somewhere else, or arrives as parameters to a method
            XmlDocument xml = new XmlDocument();
            xml.LoadXml(text);
            // create a navigator, this is our primary tool
            XPathNavigator navigator = xml.CreateNavigator();
            XPathNavigator breakPoint = null;

            // find the text node we need:
            while (navigator.MoveToFollowing(XPathNodeType.Text))
            {
                string lastText = navigator.Value.Substring(0, Math.Min(charCount, navigator.Value.Length));
                charCount -= navigator.Value.Length;
                if (charCount <= 0)
                {
                    // truncate the last text. Here goes your "search word boundary" code:        
                    navigator.SetValue(lastText);
                    breakPoint = navigator.Clone();
                    break;
                }
            }

            // nothing was cut: the document holds fewer than charCount characters of text
            if (breakPoint == null)
            {
                navigator.MoveToRoot();
                return navigator.InnerXml;
            }

            // first remove text nodes, because Microsoft unfortunately merges them without asking
            while (navigator.MoveToFollowing(XPathNodeType.Text))
            {
                if (navigator.ComparePosition(breakPoint) == XmlNodeOrder.After)
                {
                    navigator.DeleteSelf();   // moves to parent
                }
            }

            // then remove the remaining elements after the break point
            navigator.MoveTo(breakPoint);
            while (navigator.MoveToFollowing(XPathNodeType.Element))
            {
                if (navigator.ComparePosition(breakPoint) == XmlNodeOrder.After)
                {
                    navigator.DeleteSelf();   // moves to parent
                }
            }

            // then remove *all* empty nodes to clean up (not necessary):
            // TODO, add empty elements like <br />, <img /> as exclusion
            navigator.MoveToRoot();
            while (navigator.MoveToFollowing(XPathNodeType.Element))
            {
                while (!navigator.HasChildren && (navigator.Value ?? "").Trim() == "")
                {
                    navigator.DeleteSelf();   // moves to parent
                }
            }

            navigator.MoveToRoot();
            return navigator.InnerXml;
        }

        [TestMethod]
        public void TestTruncateHTMLSafeish()
        {
            // Case where we just make it to start of HREF (so effectively an empty link)

            // 'simple' nested none attributed tags
            Assert.AreEqual(@"<h1>1234</h1><b><i>56789</i>012</b>",
            TruncateHTMLSafeishChar(
                @"<h1>1234</h1><b><i>56789</i>012345</b>",
                12));

            // In middle of a!
            Assert.AreEqual(@"<h1>1234</h1><a href=""testurl""><b>567</b></a>",
            TruncateHTMLSafeishChar(
                @"<h1>1234</h1><a href=""testurl""><b>5678</b></a><i><strong>some italic nested in string</strong></i>",
                7));

            // more
            Assert.AreEqual(@"<div><b><i><strong>1</strong></i></b></div>",
            TruncateHTMLSafeishChar(
                @"<div><b><i><strong>12</strong></i></b></div>",
                1));

            // br
            Assert.AreEqual(@"<h1>1 3 5</h1><br />6",
            TruncateHTMLSafeishChar(
                @"<h1>1 3 5</h1><br />678<br />",
                6));
        }

        [TestMethod]
        public void TestTruncateHTMLSafeishWord()
        {
            // zero case
            Assert.AreEqual(@" ...",
                            TruncateHTMLSafeishWord(
                                @"",
                               5));

            // 'simple' nested none attributed tags
            Assert.AreEqual(@"<h1>one two <br /></h1><b><i>three  ...</i></b>",
            TruncateHTMLSafeishWord(
                @"<h1>one two <br /></h1><b><i>three </i>four</b>",
                3), "we have added ' ...' to end of summary");

            // In middle of a!
            Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four  ...</b></a>",
            TruncateHTMLSafeishWord(
                @"<h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four five </b></a><i><strong>some italic nested in string</strong></i>",
                4));

            // start of h1
            Assert.AreEqual(@"<h1>one two three  ...</h1>",
            TruncateHTMLSafeishWord(
                @"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>",
                3));

            // more than words available
            Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i> ...",
            TruncateHTMLSafeishWord(
                @"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>",
                99));
        }

        [TestMethod]
        public void TestTruncateHTMLSafeishWordXML()
        {
            // zero case
            Assert.AreEqual(@" ...",
                            TruncateHTMLSafeishWord(
                                @"",
                               5));

            // 'simple' nested none attributed tags
            string output = TruncateHTMLSafeishCharXML(
                @"<body><h1>one two </h1><b><i>three </i>four</b></body>",
                13);
            Assert.AreEqual(@"<body>\r\n  <h1>one two </h1>\r\n  <b>\r\n    <i>three</i>\r\n  </b>\r\n</body>", output,
             "XML version, no ... yet and addeds '\r\n  + spaces?' to format document");

            // In middle of a!
            Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four  ...</b></a>",
            TruncateHTMLSafeishCharXML(
                @"<body><h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four five </b></a><i><strong>some italic nested in string</strong></i></body>",
                4));

            // start of h1
            Assert.AreEqual(@"<h1>one two three  ...</h1>",
            TruncateHTMLSafeishCharXML(
                @"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>",
                3));

            // more than words available
            Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i> ...",
            TruncateHTMLSafeishCharXML(
                @"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>",
                99));
        }
    }
}

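Best answer

Edit: see the full solution below; the first attempt strips the HTML, the second does not.

Let's summarize what you want:

No HTML in the result
It should take any valid data inside <body>
It has a fixed maximum length

If your HTML is XHTML, this becomes trivial (and, although I have not seen the PHP solution, I very much doubt they use a similar approach, but I believe this is understandable and fairly easy):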

XmlDocument xml = new XmlDocument();

// replace the following line with the content of your full XHTML
xml.LoadXml(@"<body><p>some <i>text</i>here</p><div>that needs stripping</div></body>");

// Get all textnodes under <body> (twice "//" is on purpose)
XmlNodeList nodes = xml.SelectNodes("//body//text()");

// loop through the text nodes, replace this with whatever you like to do with the text
foreach (var node in nodes)
{
    Debug.WriteLine(((XmlCharacterData)node).Value);
}

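Note: whitespace and the like will be preserved. This is usually a good thing.

If you don't have XHTML, you can use the HTML Agility Pack, which lets you do the same for plain old HTML (internally it converts it to some DOM). I have not tried it, but it should work rather smoothly.

Big edit:

Actual solution

In a small comment I promised to take the XHTML/XmlDocument approach and use it for a type-safe method that splits your HTML based on text length but keeps the HTML code. I took the following HTML; the code breaks it correctly in the middle of "needs", removes the rest, removes empty nodes and automatically closes any open elements.

Sample HTML: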
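To get from that loop to the three requirements listed above (no HTML, text from <body>, fixed maximum length), the text nodes can be concatenated and cut. A minimal sketch, assuming well-formed XHTML with a <body> element; the class name PlainTextSummarySketch and the method Summarize are made up for illustration:

```csharp
using System;
using System.Text;
using System.Xml;

public static class PlainTextSummarySketch
{
    // Concatenates the text nodes under <body> and cuts the result
    // at maxLength characters. All markup is dropped.
    public static string Summarize(string xhtml, int maxLength)
    {
        var xml = new XmlDocument();
        xml.LoadXml(xhtml);                        // throws XmlException if the input is not well-formed

        var text = new StringBuilder();
        foreach (XmlNode node in xml.SelectNodes("//body//text()"))
        {
            text.Append(node.Value);
            if (text.Length >= maxLength) break;   // enough text collected
        }

        return text.Length > maxLength
            ? text.ToString(0, maxLength)
            : text.ToString();
    }
}
```

Text from neighbouring elements is joined without a separator, so "here" and "that" in the sample above run together; add a space between nodes if that matters for your summary.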

<body>
    <p><tt>some<u><i>text</i>here</u></tt></p>
    <div>that <b><i>needs <span>str</span>ip</i></b><s>ping</s></div>
</body>

The code is tested and works with any kind of input (OK, granted, I only ran a few tests and the code may contain bugs; let me know if you find any!).

// your data, probably comes from somewhere, or as params to a method
int lengthAvailable = 20;
XmlDocument xml = new XmlDocument();
xml.LoadXml(@"place-html-code-here-left-out-for-brevity");

// create a navigator, this is our primary tool
XPathNavigator navigator = xml.CreateNavigator();
XPathNavigator breakPoint = null;


string lastText = "";

// find the text node we need:
while (navigator.MoveToFollowing(XPathNodeType.Text))
{
    lastText = navigator.Value.Substring(0, Math.Min(lengthAvailable, navigator.Value.Length));
    lengthAvailable -= navigator.Value.Length;

    if (lengthAvailable <= 0)
    {
        // truncate the last text. Here goes your "search word boundary" code:
        navigator.SetValue(lastText);
        breakPoint = navigator.Clone();
        break;
    }
}

// first remove text nodes, because Microsoft unfortunately merges them without asking
while (navigator.MoveToFollowing(XPathNodeType.Text))
    if (navigator.ComparePosition(breakPoint) == XmlNodeOrder.After)
        navigator.DeleteSelf();   // moves to parent

// then move the rest
navigator.MoveTo(breakPoint);
while (navigator.MoveToFollowing(XPathNodeType.Element))
    if (navigator.ComparePosition(breakPoint) == XmlNodeOrder.After)
        navigator.DeleteSelf();   // moves to parent

// then remove *all* empty nodes to clean up (not necessary): 
// TODO, add empty elements like <br />, <img /> as exclusion
navigator.MoveToRoot();
while (navigator.MoveToFollowing(XPathNodeType.Element))
    while (!navigator.HasChildren && (navigator.Value ?? "").Trim() == "")
        navigator.DeleteSelf();  // moves to parent

navigator.MoveToRoot();
Debug.WriteLine(navigator.InnerXml);

How the code works

The code does the following things, in order:

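It walks through all text nodes until the text size exceeds the allowed limit, in which case it truncates that node. This automatically handles entities such as &gt; correctly, counting each as one character.
It then shortens the text of the "breaking node" and resets it. It clones the XPathNavigator at this point, because we need to remember this "break point".
To work around an MS bug (a very old one, actually), we first have to remove all remaining text nodes that follow the break point; otherwise we risk text nodes being merged automatically when they end up as siblings of each other. Note: DeleteSelf is handy, but it moves the navigator position to the parent, which is why we need to check the current position against the "break point" position remembered in the previous step.
Then we do what we wanted to do in the first place: remove any node after the break point.
Not a necessary step: clean up the code and remove any empty elements. This is only there to tidy the HTML and/or to filter for specific (dis)allowed elements. It can be left out.
Go back to "root" and get the content as a string with InnerXml.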
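The code above leaves the "search word boundary" step open and cuts the breaking node in the middle of a word. One possible way to fill that gap is sketched below; the class name WordBoundarySketch and the method CutAtWord are hypothetical and not part of the answer's code.

```csharp
using System;

public static class WordBoundarySketch
{
    // Cuts text to at most maxLength characters, backing up to the last
    // whitespace so the final word is not split. Falls back to a hard cut
    // when the kept part contains no whitespace at all.
    public static string CutAtWord(string text, int maxLength)
    {
        if (maxLength <= 0) return string.Empty;
        if (text.Length <= maxLength) return text;

        // The cut is already clean if the first dropped character is whitespace.
        if (char.IsWhiteSpace(text[maxLength]))
        {
            return text.Substring(0, maxLength).TrimEnd();
        }

        // Otherwise search backwards from the cut point for the previous whitespace.
        int lastSpace = text.LastIndexOfAny(new[] { ' ', '\t', '\r', '\n' }, maxLength - 1);
        return lastSpace > 0
            ? text.Substring(0, lastSpace).TrimEnd()
            : text.Substring(0, maxLength);
    }
}
```

In the loop that finds the breaking node, the Substring call that computes lastText could be replaced by CutAtWord(navigator.Value, lengthAvailable), called before lengthAvailable is decremented.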

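That's all. It is rather simple, though it may look a bit daunting at first sight.

PS: the same would be far easier to read and understand if you used XSLT, which is the ideal tool for this type of job.

Update: added an extended code sample, based on the edited question
Update: added some explanation

Regarding safely truncating HTML for an article summary in C#, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/1714764/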
