C# - HtmlAgilityPack - remove script and style?

Reposted. Author: 可可西里. Updated: 2023-11-01 03:03:18

I use the following method to extract text from HTML:

    public string getAllText(string _html)
    {
        string _allText = "";
        try
        {
            HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
            document.LoadHtml(_html);

            var root = document.DocumentNode;
            var sb = new StringBuilder();
            // Collect the text of every leaf node (nodes without children).
            foreach (var node in root.DescendantNodesAndSelf())
            {
                if (!node.HasChildNodes)
                {
                    string text = node.InnerText;
                    if (!string.IsNullOrEmpty(text))
                        sb.AppendLine(text.Trim());
                }
            }

            _allText = sb.ToString();
        }
        catch (Exception)
        {
            // Errors are swallowed here; the method then returns an empty string.
        }

        // Convert HTML entities such as &amp; back to plain characters.
        _allText = System.Web.HttpUtility.HtmlDecode(_allText);

        return _allText;
    }

The problem is that I also get the contents of the script and style tags.

How can I exclude them?

Best answer

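    // Requires: using System.Linq;
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

    // ToList() snapshots the matches first, so removing nodes
    // does not disturb the Descendants() enumeration.
    doc.DocumentNode.Descendants()
        .Where(n => n.Name == "script" || n.Name == "style")
        .ToList()
        .ForEach(n => n.Remove());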
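
For context, here is a minimal sketch of how the removal step might be combined with the text extraction from the question. The class name `HtmlTextExtractor` and the method name `GetVisibleText` are hypothetical, chosen only for illustration. Two choices differ from the question's code and are assumptions rather than part of the answer: only text nodes are collected (which also skips HTML comments), and `System.Net.WebUtility.HtmlDecode` is used in place of `System.Web.HttpUtility.HtmlDecode`.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text;
using HtmlAgilityPack;

public static class HtmlTextExtractor
{
    // Hypothetical helper, not part of HtmlAgilityPack.
    public static string GetVisibleText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Snapshot the matching nodes before removing them, so the
        // Descendants() enumeration is not modified while it runs.
        var unwanted = doc.DocumentNode
            .Descendants()
            .Where(n => n.Name == "script" || n.Name == "style")
            .ToList();
        foreach (var node in unwanted)
        {
            node.Remove();
        }

        var sb = new StringBuilder();
        foreach (var node in doc.DocumentNode.DescendantsAndSelf())
        {
            // Assumption: keep only text nodes; comments and elements are skipped.
            if (node.NodeType != HtmlNodeType.Text)
                continue;

            string text = node.InnerText.Trim();
            if (text.Length > 0)
                sb.AppendLine(WebUtility.HtmlDecode(text));
        }

        return sb.ToString();
    }
}
```

Calling `HtmlTextExtractor.GetVisibleText(html)` on a page should then return the remaining text, one trimmed fragment per line, without the contents of the removed `script` and `style` elements.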

For "C# - HtmlAgilityPack - remove script and style?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/13441470/
