gpt4 book ai didi

c# - 如何在 C# 中将 HTML 转换为文本?

转载 作者:IT王子 更新时间:2023-10-29 03:42:51 26 4
gpt4 key购买 nike

我正在寻找将 HTML 文档转换为纯文本的 C# 代码。

我不是在寻找简单的标签剥离,而是在输出纯文本的同时合理保留原始布局的东西。

输出应该是这样的:

Html2Txt at W3C

我看过 HTML Agility Pack,但我认为这不是我需要的。还有其他建议吗?

编辑:我刚刚从 CodePlex 下载了 HTML Agility Pack ,并运行 Html2Txt 项目。多么令人失望(至少是将 html 转换为文本的模块)!它所做的只是剥离标签、展平表格等。输出看起来与 W3C 生成的 Html2Txt 完全不同。太糟糕了,该来源似乎不可用。我正在寻找是否有更多“固定”解决方案可用。

编辑 2: 谢谢大家的建议。 FlySwat 向我推荐了我想去的方向。我可以使用 System.Diagnostics.Process 类通过“-dump”开关运行 lynx.exe 以将文本发送到标准输出,并使用 ProcessStartInfo.UseShellExecute = false 捕获标准输出ProcessStartInfo.RedirectStandardOutput = true。我会将所有这些包装在一个 C# 类中。这段代码只会偶尔被调用,所以我不太关心生成一个新进程还是在代码中执行它。另外,Lynx 速度很快!!

最佳答案

只是一个关于 HtmlAgilityPack 的注释,供后代使用。该项目包含一个 example of parsing text to html ,正如 OP 所指出的,它根本不像任何编写 HTML 的人所设想的那样处理空白。那里有全文渲染解决方案,其他人注意到这个问题,这不是(它甚至不能处理当前形式的表格),但它轻巧且快速,这就是我创建简单文本所需要的HTML 电子邮件版本。

using System.IO;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

//small but important modification to class https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
public static class HtmlToText
{

public static string Convert(string path)
{
HtmlDocument doc = new HtmlDocument();
doc.Load(path);
return ConvertDoc(doc);
}

public static string ConvertHtml(string html)
{
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
return ConvertDoc(doc);
}

public static string ConvertDoc (HtmlDocument doc)
{
using (StringWriter sw = new StringWriter())
{
ConvertTo(doc.DocumentNode, sw);
sw.Flush();
return sw.ToString();
}
}

internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
{
foreach (HtmlNode subnode in node.ChildNodes)
{
ConvertTo(subnode, outText, textInfo);
}
}
public static void ConvertTo(HtmlNode node, TextWriter outText)
{
ConvertTo(node, outText, new PreceedingDomTextInfo(false));
}
internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
{
string html;
switch (node.NodeType)
{
case HtmlNodeType.Comment:
// don't output comments
break;
case HtmlNodeType.Document:
ConvertContentTo(node, outText, textInfo);
break;
case HtmlNodeType.Text:
// script and style must not be output
string parentName = node.ParentNode.Name;
if ((parentName == "script") || (parentName == "style"))
{
break;
}
// get text
html = ((HtmlTextNode)node).Text;
// is it in fact a special closing node output as text?
if (HtmlNode.IsOverlappedClosingElement(html))
{
break;
}
// check the text is meaningful and not a bunch of whitespaces
if (html.Length == 0)
{
break;
}
if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace)
{
html= html.TrimStart();
if (html.Length == 0) { break; }
textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
}
outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1]))
{
outText.Write(' ');
}
break;
case HtmlNodeType.Element:
string endElementString = null;
bool isInline;
bool skip = false;
int listIndex = 0;
switch (node.Name)
{
case "nav":
skip = true;
isInline = false;
break;
case "body":
case "section":
case "article":
case "aside":
case "h1":
case "h2":
case "header":
case "footer":
case "address":
case "main":
case "div":
case "p": // stylistic - adjust as you tend to use
if (textInfo.IsFirstTextOfDocWritten)
{
outText.Write("\r\n");
}
endElementString = "\r\n";
isInline = false;
break;
case "br":
outText.Write("\r\n");
skip = true;
textInfo.WritePrecedingWhiteSpace = false;
isInline = true;
break;
case "a":
if (node.Attributes.Contains("href"))
{
string href = node.Attributes["href"].Value.Trim();
if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase)==-1)
{
endElementString = "<" + href + ">";
}
}
isInline = true;
break;
case "li":
if(textInfo.ListIndex>0)
{
outText.Write("\r\n{0}.\t", textInfo.ListIndex++);
}
else
{
outText.Write("\r\n*\t"); //using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
}
isInline = false;
break;
case "ol":
listIndex = 1;
goto case "ul";
case "ul": //not handling nested lists any differently at this stage - that is getting close to rendering problems
endElementString = "\r\n";
isInline = false;
break;
case "img": //inline-block in reality
if (node.Attributes.Contains("alt"))
{
outText.Write('[' + node.Attributes["alt"].Value);
endElementString = "]";
}
if (node.Attributes.Contains("src"))
{
outText.Write('<' + node.Attributes["src"].Value + '>');
}
isInline = true;
break;
default:
isInline = true;
break;
}
if (!skip && node.HasChildNodes)
{
ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten){ ListIndex = listIndex });
}
if (endElementString != null)
{
outText.Write(endElementString);
}
break;
}
}
}
internal class PreceedingDomTextInfo
{
public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten)
{
IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
}
public bool WritePrecedingWhiteSpace {get;set;}
public bool LastCharWasSpace { get; set; }
public readonly BoolWrapper IsFirstTextOfDocWritten;
public int ListIndex { get; set; }
}
internal class BoolWrapper
{
public BoolWrapper() { }
public bool Value { get; set; }
public static implicit operator bool(BoolWrapper boolWrapper)
{
return boolWrapper.Value;
}
public static implicit operator BoolWrapper(bool boolWrapper)
{
return new BoolWrapper{ Value = boolWrapper };
}
}

例如,以下 HTML 代码...

<!DOCTYPE HTML>
<html>
<head>
</head>
<body>
<header>
Whatever Inc.
</header>
<main>
<p>
Thanks for your enquiry. As this is the 1<sup>st</sup> time you have contacted us, we would like to clarify a few things:
</p>
<ol>
<li>
Please confirm this is your email by replying.
</li>
<li>
Then perform this step.
</li>
</ol>
<p>
Please solve this <img alt="complex equation" src="http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png"/>. Then, in any order, could you please:
</p>
<ul>
<li>
a point.
</li>
<li>
another point, with a <a href="http://en.wikipedia.org/wiki/Hyperlink">hyperlink</a>.
</li>
</ul>
<p>
Sincerely,
</p>
<p>
The whatever.com team
</p>
</main>
<footer>
Ph: 000 000 000<br/>
mail: whatever st
</footer>
</body>
</html>

...将转化为:

Whatever Inc. 


Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:

1. Please confirm this is your email by replying.
2. Then perform this step.

Please solve this [complex equation<http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png>]. Then, in any order, could you please:

* a point.
* another point, with a hyperlink<http://en.wikipedia.org/wiki/Hyperlink>.

Sincerely,

The whatever.com team


Ph: 000 000 000
mail: whatever st

...相对于:

        Whatever Inc.


Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:

Please confirm this is your email by replying.

Then perform this step.


Please solve this . Then, in any order, could you please:

a point.

another point, with a hyperlink.


Sincerely,


The whatever.com team

Ph: 000 000 000
mail: whatever st

关于c# - 如何在 C# 中将 HTML 转换为文本?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/731649/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com