
C#: safely truncating HTML for an article summary

Reposted. Author: 技术小花猫. Updated: 2023-10-29 12:32:09

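Does anyone have a C# variant of this?

I want to take some HTML and display it, without breaking it, as the summary of an article.

Truncate text containing HTML, ignoring tags

Save me from reinventing the wheel!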

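Edit

Sorry, I'm new here, and you're right, I should have phrased the question better. Here is some more information.

I want to take a string of HTML and truncate it to a set number of words (or even a character length), so that I can show the start of it as a summary that then links to the main article. I want to preserve the HTML so that links and the like still show in the preview.

The main issue I have to solve is that if we truncate in the middle of one or more tags, we may well end up with unclosed HTML tags!

The idea I have for a solution is to:

  1. Truncate the HTML to N words first (words are better, but characters are OK), making sure not to stop in the middle of a tag and cut off a required attribute.

  2. Work through the opening HTML tags in the truncated string (maybe push them onto a stack as I go?).

  3. Then work through the closing tags and make sure they match the ones on the stack as I pop them off?

  4. If any open tags are left on the stack after this, write their closing tags to the end of the truncated string, and the HTML should be good to go!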
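
The four steps above can be sketched in a single pass. This is only a minimal illustration, not the code from the question or the answer: the class name `HtmlSummarySketch`, the method `TruncateChars` and the `VoidTags` list are all made up for the example. It assumes well-formed input, counts an entity such as `&gt;` as several characters, and would be confused by a `>` inside an attribute value or a comment.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class HtmlSummarySketch
{
    // Elements that never take a closing tag (a partial, assumed list).
    private static readonly HashSet<string> VoidTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "br", "hr", "img", "input", "meta", "link" };

    public static string TruncateChars(string html, int maxChars)
    {
        var output = new StringBuilder();
        var open = new Stack<string>();   // names of tags opened but not yet closed
        int visible = 0;                  // characters counted outside of tags
        int i = 0;

        while (i < html.Length && visible < maxChars)
        {
            if (html[i] != '<')
            {
                output.Append(html[i]);
                visible++;
                i++;
                continue;
            }

            // Copy the whole tag so we never stop in the middle of one.
            int end = html.IndexOf('>', i);
            if (end < 0) break;           // unterminated tag: drop the rest
            string tag = html.Substring(i, end - i + 1);
            output.Append(tag);
            i = end + 1;

            string body = tag.Substring(1, tag.Length - 2).Trim();
            if (body.Length == 0 || body[0] == '!' || body[0] == '?')
                continue;                 // comment, doctype or processing instruction

            if (body[0] == '/')
            {
                if (open.Count > 0) open.Pop();          // closing tag
            }
            else if (body[body.Length - 1] != '/')       // skip self-closed tags like <br />
            {
                string name = body.Split(new[] { ' ', '\t', '\r', '\n' })[0];
                if (!VoidTags.Contains(name)) open.Push(name);
            }
        }

        // Step 4: close whatever is still open, innermost first.
        while (open.Count > 0)
            output.Append("</").Append(open.Pop()).Append('>');

        return output.ToString();
    }
}
```

Because closing tags are generated from the stack, the loop can stop as soon as the character budget is used up, even if that happens just before a closing tag in the input.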

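Edit 12/11/2009

  • Here is the unit test file for what I have pieced together so far in VS2008; it "may" help someone in future
  • My hack attempt based on Jan's code is at the top, as a character version plus a word version (disclaimer: this is dirty, rough code on my part!!)
  • I assume "well-formed" HTML in all cases (but not necessarily a full document with a root node, as the XML version requires)
  • Abel's XML version is at the bottom, but I have not yet got round to getting the tests to run fully against it (I also need to understand the code)...
  • I will update when I get the chance
  • Having trouble posting code? Is there no upload facility on Stack?

Thanks for all the comments :)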

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.XPath;
using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace PINET40TestProject
{
[TestClass]
public class UtilityUnitTest
{
public static string TruncateHTMLSafeishChar(string text, int charCount)
{
bool inTag = false;
int cntr = 0;
int cntrContent = 0;

// loop through html, counting only viewable content
foreach (Char c in text)
{
if (cntrContent == charCount) break;
cntr++;
if (c == '<')
{
inTag = true;
continue;
}

if (c == '>')
{
inTag = false;
continue;
}
if (!inTag) cntrContent++;
}

string substr = text.Substring(0, cntr);

//search for nonclosed tags
MatchCollection openedTags = new Regex("<[^/](.|\n)*?>").Matches(substr);
MatchCollection closedTags = new Regex("<[/](.|\n)*?>").Matches(substr);

// create stack
Stack<string> opentagsStack = new Stack<string>();
Stack<string> closedtagsStack = new Stack<string>();

// to be honest, this seemed like a good idea then I got lost along the way
// so logic is probably hanging by a thread!!
foreach (Match tag in openedTags)
{
string openedtag = tag.Value.Substring(1, tag.Value.Length - 2);
// strip any attributes, sure we can use regex for this!
if (openedtag.IndexOf(" ") >= 0)
{
openedtag = openedtag.Substring(0, openedtag.IndexOf(" "));
}

// ignore brs as self-closed
if (openedtag.Trim() != "br")
{
opentagsStack.Push(openedtag);
}
}

foreach (Match tag in closedTags)
{
string closedtag = tag.Value.Substring(2, tag.Value.Length - 3);
closedtagsStack.Push(closedtag);
}

if (closedtagsStack.Count < opentagsStack.Count)
{
while (opentagsStack.Count > 0)
{
string tagstr = opentagsStack.Pop();

if (closedtagsStack.Count == 0 || tagstr != closedtagsStack.Peek())
{
substr += "</" + tagstr + ">";
}
else
{
closedtagsStack.Pop();
}
}
}

return substr;
}

public static string TruncateHTMLSafeishWord(string text, int wordCount)
{
bool inTag = false;
int cntr = 0;
int cntrWords = 0;
Char lastc = ' ';

// loop through html, counting only viewable content
foreach (Char c in text)
{
if (cntrWords == wordCount) break;
cntr++;
if (c == '<')
{
inTag = true;
continue;
}

if (c == '>')
{
inTag = false;
continue;
}
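if (!inTag)
{
// do not count double spaces, and a space not in a tag counts as a word
if (c == ' ' && lastc != ' ')
cntrWords++;
// remember the previous visible character; without this update
// lastc stays ' ' forever and no word is ever counted
lastc = c;
}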
}

string substr = text.Substring(0, cntr) + " ...";

//search for nonclosed tags
MatchCollection openedTags = new Regex("<[^/](.|\n)*?>").Matches(substr);
MatchCollection closedTags = new Regex("<[/](.|\n)*?>").Matches(substr);

// create stack
Stack<string> opentagsStack = new Stack<string>();
Stack<string> closedtagsStack = new Stack<string>();

foreach (Match tag in openedTags)
{
string openedtag = tag.Value.Substring(1, tag.Value.Length - 2);
// strip any attributes, sure we can use regex for this!
if (openedtag.IndexOf(" ") >= 0)
{
openedtag = openedtag.Substring(0, openedtag.IndexOf(" "));
}

// ignore brs as self-closed
if (openedtag.Trim() != "br")
{
opentagsStack.Push(openedtag);
}
}

foreach (Match tag in closedTags)
{
string closedtag = tag.Value.Substring(2, tag.Value.Length - 3);
closedtagsStack.Push(closedtag);
}

if (closedtagsStack.Count < opentagsStack.Count)
{
while (opentagsStack.Count > 0)
{
string tagstr = opentagsStack.Pop();

if (closedtagsStack.Count == 0 || tagstr != closedtagsStack.Peek())
{
substr += "</" + tagstr + ">";
}
else
{
closedtagsStack.Pop();
}
}
}

return substr;
}

public static string TruncateHTMLSafeishCharXML(string text, int charCount)
{
// your data, probably comes from somewhere, or as params to a method
XmlDocument xml = new XmlDocument();
xml.LoadXml(text);
// create a navigator, this is our primary tool
XPathNavigator navigator = xml.CreateNavigator();
XPathNavigator breakPoint = null;

// find the text node we need:
while (navigator.MoveToFollowing(XPathNodeType.Text))
{
string lastText = navigator.Value.Substring(0, Math.Min(charCount, navigator.Value.Length));
charCount -= navigator.Value.Length;
if (charCount <= 0)
{
// truncate the last text. Here goes your "search word boundary" code:
navigator.SetValue(lastText);
breakPoint = navigator.Clone();
break;
}
}

// first remove text nodes, because Microsoft unfortunately merges them without asking
while (navigator.MoveToFollowing(XPathNodeType.Text))
{
if (navigator.ComparePosition(breakPoint) == XmlNodeOrder.After)
{
navigator.DeleteSelf();
}
}

// moves to parent, then move the rest
navigator.MoveTo(breakPoint);
while (navigator.MoveToFollowing(XPathNodeType.Element))
{
if (navigator.ComparePosition(breakPoint) == XmlNodeOrder.After)
{
navigator.DeleteSelf();
}
}

// moves to parent
// then remove *all* empty nodes to clean up (not necessary):
// TODO, add empty elements like <br />, <img /> as exclusion
navigator.MoveToRoot();
while (navigator.MoveToFollowing(XPathNodeType.Element))
{
while (!navigator.HasChildren && (navigator.Value ?? "").Trim() == "")
{
navigator.DeleteSelf();
}
}

// moves to parent
navigator.MoveToRoot();
return navigator.InnerXml;
}

[TestMethod]
public void TestTruncateHTMLSafeish()
{
// Case where we just make it to start of HREF (so effectively an empty link)

// 'simple' nested none attributed tags
Assert.AreEqual(@"<h1>1234</h1><b><i>56789</i>012</b>",
TruncateHTMLSafeishChar(
@"<h1>1234</h1><b><i>56789</i>012345</b>",
12));

// In middle of a!
Assert.AreEqual(@"<h1>1234</h1><a href=""testurl""><b>567</b></a>",
TruncateHTMLSafeishChar(
@"<h1>1234</h1><a href=""testurl""><b>5678</b></a><i><strong>some italic nested in string</strong></i>",
7));

// more
Assert.AreEqual(@"<div><b><i><strong>1</strong></i></b></div>",
TruncateHTMLSafeishChar(
@"<div><b><i><strong>12</strong></i></b></div>",
1));

// br
Assert.AreEqual(@"<h1>1 3 5</h1><br />6",
TruncateHTMLSafeishChar(
@"<h1>1 3 5</h1><br />678<br />",
6));
}

[TestMethod]
public void TestTruncateHTMLSafeishWord()
{
// zero case
Assert.AreEqual(@" ...",
TruncateHTMLSafeishWord(
@"",
5));

// 'simple' nested none attributed tags
Assert.AreEqual(@"<h1>one two <br /></h1><b><i>three ...</i></b>",
TruncateHTMLSafeishWord(
@"<h1>one two <br /></h1><b><i>three </i>four</b>",
3), "we have added ' ...' to end of summary");

// In middle of a!
Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four ...</b></a>",
TruncateHTMLSafeishWord(
@"<h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four five </b></a><i><strong>some italic nested in string</strong></i>",
4));

// start of h1
Assert.AreEqual(@"<h1>one two three ...</h1>",
TruncateHTMLSafeishWord(
@"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>",
3));

// more than words available
Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i> ...",
TruncateHTMLSafeishWord(
@"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>",
99));
}

[TestMethod]
public void TestTruncateHTMLSafeishWordXML()
{
// zero case
Assert.AreEqual(@" ...",
TruncateHTMLSafeishWord(
@"",
5));

// 'simple' nested none attributed tags
string output = TruncateHTMLSafeishCharXML(
@"<body><h1>one two </h1><b><i>three </i>four</b></body>",
13);
Assert.AreEqual(@"<body>\r\n <h1>one two </h1>\r\n <b>\r\n <i>three</i>\r\n </b>\r\n</body>", output,
"XML version, no ... yet and addeds '\r\n + spaces?' to format document");

// In middle of a!
Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four ...</b></a>",
TruncateHTMLSafeishCharXML(
@"<body><h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four five </b></a><i><strong>some italic nested in string</strong></i></body>",
4));

// start of h1
Assert.AreEqual(@"<h1>one two three ...</h1>",
TruncateHTMLSafeishCharXML(
@"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>",
3));

// more than words available
Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i> ...",
TruncateHTMLSafeishCharXML(
@"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>",
99));
}
}
}

Best answer

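Edit: see the full solution below. My first attempt stripped the HTML, the second does not.

Let's summarize what you want:

  • No HTML in the result
  • It should take any valid data within <body>
  • It has a fixed maximum length

If your HTML is XHTML, this becomes trivial (I have not seen the PHP solution and very much doubt it takes a similar approach, but I believe this one is understandable and fairly easy):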

XmlDocument xml = new XmlDocument();

// replace the following line with the content of your full XHTML
xml.LoadXml(@"<body><p>some <i>text</i>here</p><div>that needs stripping</div></body>");

// Get all textnodes under <body> (twice "//" is on purpose)
XmlNodeList nodes = xml.SelectNodes("//body//text()");

// loop through the text nodes, replace this with whatever you like to do with the text
foreach (XmlNode node in nodes)
{
    Debug.WriteLine(((XmlCharacterData)node).Value);
}

Note: whitespace and the like will be preserved. That is usually a good thing.
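
To get from the loop above to an actual summary with a fixed maximum length, the text nodes can be concatenated and cut off. The following is a rough sketch under the same XHTML assumption; `PlainTextSummarySketch` and `Summarize` are hypothetical names, and the cut is made at a raw character position rather than at a word boundary.

```csharp
using System;
using System.Text;
using System.Xml;

public static class PlainTextSummarySketch
{
    // Returns at most maxLength characters of the text found under <body>.
    public static string Summarize(string xhtml, int maxLength)
    {
        var xml = new XmlDocument();
        xml.LoadXml(xhtml);               // throws XmlException if the input is not well-formed

        var text = new StringBuilder();
        foreach (XmlNode node in xml.SelectNodes("//body//text()"))
        {
            text.Append(node.Value);      // entities such as &amp; are already decoded here
            if (text.Length >= maxLength) break;
        }

        // Trim any overshoot from the last node that was appended.
        if (text.Length > maxLength) text.Length = maxLength;
        return text.ToString();
    }
}
```

With the sample markup from the snippet above and a limit of 9, this should give `some text`.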

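If you don't have XHTML, you can use the HTML Agility Pack, which lets you do the same with plain old HTML (internally it converts it to a DOM of sorts). I have not tried it myself, but it should work rather smoothly.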


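Big edit:

Actual solution

In a short comment I promised to take the XHTML/XmlDocument approach and use it for a type-safe way of splitting your HTML based on text length while keeping the HTML code. I took the following HTML; the code breaks it correctly in the middle of "needs", removes the rest, removes empty nodes and automatically closes any open elements.

Sample HTML: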

<body>
<p><tt>some<u><i>text</i>here</u></tt></p>
<div>that <b><i>needs <span>str</span>ip</i></b><s>ping</s></div>
</body>

The code is tested and works with any kind of input (OK, granted, I only ran a few tests and the code may contain bugs; let me know if you find any!).

// your data, probably comes from somewhere, or as params to a method
int lengthAvailable = 20;
XmlDocument xml = new XmlDocument();
xml.LoadXml(@"place-html-code-here-left-out-for-brevity");

// create a navigator, this is our primary tool
XPathNavigator navigator = xml.CreateNavigator();
XPathNavigator breakPoint = null;

string lastText = "";

// find the text node we need:
while (navigator.MoveToFollowing(XPathNodeType.Text))
{
    lastText = navigator.Value.Substring(0, Math.Min(lengthAvailable, navigator.Value.Length));
    lengthAvailable -= navigator.Value.Length;

    if (lengthAvailable <= 0)
    {
        // truncate the last text. Here goes your "search word boundary" code:
        navigator.SetValue(lastText);
        breakPoint = navigator.Clone();
        break;
    }
}

// first remove text nodes, because Microsoft unfortunately merges them without asking
while (navigator.MoveToFollowing(XPathNodeType.Text))
    if (navigator.ComparePosition(breakPoint) == XmlNodeOrder.After)
        navigator.DeleteSelf(); // moves to parent

// then remove the rest
navigator.MoveTo(breakPoint);
while (navigator.MoveToFollowing(XPathNodeType.Element))
    if (navigator.ComparePosition(breakPoint) == XmlNodeOrder.After)
        navigator.DeleteSelf(); // moves to parent

// then remove *all* empty nodes to clean up (not necessary):
// TODO, add empty elements like <br />, <img /> as exclusion
navigator.MoveToRoot();
while (navigator.MoveToFollowing(XPathNodeType.Element))
    while (!navigator.HasChildren && (navigator.Value ?? "").Trim() == "")
        navigator.DeleteSelf(); // moves to parent

navigator.MoveToRoot();
Debug.WriteLine(navigator.InnerXml);
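
The "search word boundary" placeholder in the code above is left open. One possible way to fill it, purely as an assumption about what is wanted, is a helper that backs up to the last whitespace before the cut. `WordBoundarySketch` and `CutAtWordBoundary` are hypothetical names.

```csharp
using System;

public static class WordBoundarySketch
{
    // Cuts text to at most maxLength characters, preferring to stop at whitespace.
    public static string CutAtWordBoundary(string text, int maxLength)
    {
        if (maxLength <= 0) return string.Empty;
        if (text.Length <= maxLength) return text;

        // The cut already falls between two words: keep everything before it.
        if (char.IsWhiteSpace(text[maxLength])) return text.Substring(0, maxLength);

        // Otherwise walk back to the last whitespace before the cut.
        int lastSpace = -1;
        for (int i = maxLength - 1; i >= 0; i--)
        {
            if (char.IsWhiteSpace(text[i])) { lastSpace = i; break; }
        }

        // No usable whitespace: fall back to a hard cut in the middle of the word.
        return lastSpace > 0 ? text.Substring(0, lastSpace) : text.Substring(0, maxLength);
    }
}
```

In the loop above, `lastText` could then be computed by passing `navigator.Value` and `lengthAvailable` to this helper instead of calling `Substring` directly.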

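How the code works

The code does the following things, in this order:

  1. It walks through all text nodes until the accumulated text size exceeds the allowed limit, at which point it truncates that node. This automatically and correctly treats &gt; and similar entities as one character.
  2. It then shortens the text of the "breaking node" and sets it back. It clones the XPathNavigator at this point, because we need to remember this "break point".
  3. To work around an MS bug (an ancient one, actually), we first have to remove all remaining text nodes that follow the break point; otherwise we risk text nodes being merged automatically when they end up as siblings of each other. Note: DeleteSelf is handy, but it moves the navigator position to the parent, which is why we need to check the current position against the "break point" position remembered in the previous step.
  4. Then we do what we wanted to do in the first place: remove any node that follows the break point.
  5. Not a required step: clean up and remove all empty elements. This is only there to tidy the HTML and/or to filter for specific (dis)allowed elements. It can be left out.
  6. Go back to "root" and get the content as a string with InnerXml.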
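
The same sequence of steps can also be written as a recursive walk over the XmlDocument nodes. The sketch below is an alternative formulation, not the XPathNavigator code above: `XmlTruncateSketch`, `Truncate` and `Trim` are hypothetical names, the input is assumed to be well-formed XHTML with a single root element, and the optional clean-up of empty elements (step 5) is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XmlTruncateSketch
{
    // Keeps at most maxChars characters of text and drops everything after them.
    public static string Truncate(string xhtml, int maxChars)
    {
        var xml = new XmlDocument();
        xml.LoadXml(xhtml);
        int remaining = maxChars;
        Trim(xml.DocumentElement, ref remaining);
        return xml.DocumentElement.OuterXml;
    }

    // Visits children in document order; once the budget is spent,
    // every later node is removed from its parent.
    private static void Trim(XmlNode parent, ref int remaining)
    {
        // Copy the child list first so that removing nodes does not disturb the walk.
        var children = new List<XmlNode>();
        foreach (XmlNode child in parent.ChildNodes) children.Add(child);

        foreach (XmlNode child in children)
        {
            if (remaining <= 0)
            {
                parent.RemoveChild(child);            // step 4: after the break point
            }
            else if (child.NodeType == XmlNodeType.Text ||
                     child.NodeType == XmlNodeType.CDATA)
            {
                string value = child.Value;           // decoded text, so an entity counts as 1
                if (value.Length > remaining)
                    child.Value = value.Substring(0, remaining);   // steps 1 and 2
                remaining -= value.Length;
            }
            else if (child.NodeType == XmlNodeType.Element)
            {
                Trim(child, ref remaining);
            }
        }
    }
}
```

Run against the sample HTML with a limit of 20, this should cut inside "needs" and leave `<div>that <b><i>nee</i></b></div>` as the last element. The serializer closes every element that is still open, so no separate tag-matching step is needed.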

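That's all there is to it. It is fairly simple, though it may look a bit daunting at first sight.

PS: the same would be much easier to read and understand if you used XSLT, which is the ideal tool for this kind of job.

Update: added an extended code sample, based on the edited question
Update: added some explanation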

Regarding "C#: safely truncating HTML for an article summary", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/1714764/
