c# - HtmlAgilityPack giving problems with malformed html

Reposted. Author: 行者123. Updated: 2023-11-30 16:33:40

I want to extract the meaningful text from an HTML document, and I am using html-agility-pack for this. Here is my code:

    string convertedContent = HttpUtility.HtmlDecode(
        ConvertHtml(HtmlAgilityPack.HtmlEntity.DeEntitize(htmlAsString))
    );

ConvertHtml:

    public string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

ConvertTo:

    public void ConvertTo(HtmlAgilityPack.HtmlNode node, TextWriter outText)
    {
        string html;
        switch (node.NodeType)
        {
            case HtmlAgilityPack.HtmlNodeType.Comment:
                // don't output comments
                break;

            case HtmlAgilityPack.HtmlNodeType.Document:
                foreach (HtmlNode subnode in node.ChildNodes)
                {
                    ConvertTo(subnode, outText);
                }
                break;

            case HtmlAgilityPack.HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                    break;

                // get text
                html = ((HtmlTextNode)node).Text;

                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                    break;

                // check the text is meaningful and not a bunch of whitespaces
                if (html.Trim().Length > 0)
                {
                    outText.Write(HtmlEntity.DeEntitize(html) + " ");
                }
                break;

            case HtmlAgilityPack.HtmlNodeType.Element:
                switch (node.Name)
                {
                    case "p":
                        // treat paragraphs as crlf
                        outText.Write("\r\n");
                        break;
                }

                if (node.HasChildNodes)
                {
                    foreach (HtmlNode subnode in node.ChildNodes)
                    {
                        ConvertTo(subnode, outText);
                    }
                }
                break;
        }
    }

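Now, in some cases the HTML page is malformed. For example, the page http://rareseeds.com/cart/products/Purple_of_Romagna_Artichoke-646-72.html has a malformed meta tag like <meta content="text/html; charset=uft-8" http-equiv="Content-Type"> (note "uft" instead of "utf"). In those cases my code fails at the point where it tries to load the HTML document.

Can someone suggest how I can get past these malformed HTML pages and still extract the relevant text from the HTML document?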

Thanks, Kapil

Best answer

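As the HtmlAgilityPack project page says, "the parser is very tolerant with 'real world' malformed HTML". However, the kind of error you describe is too severe and probably cannot be corrected. You can set a default encoding: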

    HtmlDocument doc = new HtmlDocument();
    doc.OptionDefaultStreamEncoding = Encoding.UTF8;
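
If the failure really does come from the misspelled charset name (an assumption; the question does not show the exception), another possible workaround is to repair the declaration in the HTML string before handing it to the parser. The sketch below uses only System.Text and System.Text.RegularExpressions. The class and method names (CharsetSanitizer, IsKnownEncoding, FixDeclaredCharset) are hypothetical and are not part of HtmlAgilityPack.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class CharsetSanitizer
{
    // Matches charset=NAME with optional whitespace and an optional opening quote.
    // Group 1 is the "charset=" prefix, group 2 is the declared name.
    private static readonly Regex CharsetPattern = new Regex(
        @"(charset\s*=\s*[""']?)([A-Za-z0-9_\-]+)",
        RegexOptions.IgnoreCase);

    // True when the runtime can resolve the name; Encoding.GetEncoding
    // throws ArgumentException for names it does not recognise.
    public static bool IsKnownEncoding(string name)
    {
        try
        {
            Encoding.GetEncoding(name);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }

    // Replaces every unrecognised charset name with fallbackName and
    // leaves recognised declarations untouched.
    public static string FixDeclaredCharset(string html, string fallbackName = "utf-8")
    {
        return CharsetPattern.Replace(html, m =>
            IsKnownEncoding(m.Groups[2].Value)
                ? m.Value
                : m.Groups[1].Value + fallbackName);
    }
}
```

With this in place, ConvertHtml could call doc.LoadHtml(CharsetSanitizer.FixDeclaredCharset(html)) instead of doc.LoadHtml(html). The input is already a decoded string, so rewriting the declared name does not change how the text itself is interpreted. Note that which names count as "known" depends on the runtime: on .NET Core, for instance, legacy code pages are only recognised after an encoding provider has been registered, so valid but unregistered names would also be rewritten.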

Regarding "c# - HtmlAgilityPack giving problems with malformed html", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/2944107/
