
c# - Crawling a sitemap with Abot

Reposted. Author: 行者123. Updated: 2023-12-04 16:10:34

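I am trying to crawl a sitemap with Abot. My code is inspired by the example here.

When a page crawl completes, the content text is empty (e.CrawledPage in Crawler_PageCrawlCompleted). In addition, SiteMapFinder.GetLinks is never called.

Can anyone tell me what I am doing wrong?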

using Abot.Core;
using Abot.Crawler;
using Abot.Poco;
using CsQuery.ExtensionMethods;
using System;
using System.Collections.Generic;

namespace WebCrawler
{
    public class SiteMapFinder : IHyperLinkParser
    {
        private readonly HyperLinkParser _linkParser;

        public SiteMapFinder()
        {
            _linkParser = new AngleSharpHyperlinkParser();
        }

        IEnumerable<Uri> IHyperLinkParser.GetLinks(CrawledPage crawledPage)
        {
            // Log the URI of XML responses (the sitemap pages).
            if (crawledPage.HttpWebResponse.ContentType == "text/xml")
            {
                Console.WriteLine(crawledPage.Uri.AbsoluteUri);
            }

            return _linkParser.GetLinks(crawledPage);
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            // The custom link parser is passed as the sixth constructor argument.
            SiteMapFinder finder = new SiteMapFinder();
            PoliteWebCrawler crawler = new PoliteWebCrawler(null, null, null, null, null, finder, null, null, null);

            crawler.PageCrawlCompleted += Crawler_PageCrawlCompleted;
            CrawlResult result = crawler.Crawl(new Uri("http://www.example.com/sitemap/"));
        }

        private static void Crawler_PageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
        {
            // Print the crawled URI followed by every response header.
            Console.WriteLine(e.CrawledPage.Uri.AbsoluteUri);
            e.CrawledPage.HttpWebResponse.Headers.AllKeys.ForEach(k => Console.WriteLine($"{k}: {e.CrawledPage.HttpWebResponse.Headers[k]}"));
        }
    }
}

Best answer

OK, my problem was in app.config. text/xml has to be added to downloadableContentTypes:

<abot>
<crawlBehavior
....
....
downloadableContentTypes="text/html, text/plain, text/xml"
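
Why this setting matters can be illustrated with a small, self-contained sketch. It assumes, as the fix above suggests, that the crawler downloads a response body only when the response Content-Type matches an entry in the comma-separated downloadableContentTypes list; if the body is skipped, the content text stays empty and there is nothing to hand to the link parser, which is consistent with both symptoms in the question. The class ContentTypeAllowList and the method IsDownloadable are hypothetical names used for illustration only; this is not Abot's own code.

```csharp
using System;
using System.Linq;

public static class ContentTypeAllowList
{
    // Hypothetical helper: returns true when the response Content-Type matches
    // one entry of a comma-separated list such as "text/html, text/plain".
    public static bool IsDownloadable(string allowedTypes, string responseContentType)
    {
        if (string.IsNullOrWhiteSpace(allowedTypes) || string.IsNullOrWhiteSpace(responseContentType))
            return false;

        // Drop parameters such as "; charset=utf-8" before comparing.
        string mediaType = responseContentType.Split(';')[0].Trim();

        return allowedTypes
            .Split(',')
            .Select(t => t.Trim())
            .Where(t => t.Length > 0)
            .Any(t => string.Equals(t, mediaType, StringComparison.OrdinalIgnoreCase));
    }
}
```

With the list "text/html, text/plain" a text/xml sitemap response is rejected by this check; once text/xml is appended to the list, it is accepted.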

Here is my finished code, which loads the XML and extracts the sitemap links.

using Abot.Core;
using Abot.Crawler;
using Abot.Poco;
using CsQuery.ExtensionMethods;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml;

namespace WebCrawler
{
    public class SiteMapFinder : IHyperLinkParser
    {
        private readonly HyperLinkParser _linkParser;

        public SiteMapFinder()
        {
            _linkParser = new AngleSharpHyperlinkParser();
        }

        IEnumerable<Uri> IHyperLinkParser.GetLinks(CrawledPage crawledPage)
        {
            // StartsWith also matches headers such as "text/xml; charset=utf-8".
            if (crawledPage.HttpWebResponse.ContentType.StartsWith("text/xml", StringComparison.OrdinalIgnoreCase))
            {
                XmlDocument xml = new XmlDocument();
                xml.LoadXml(crawledPage.Content.Text);

                if (xml.DocumentElement == null) return new Uri[] { };

                // Bind the prefix "s" to the namespace declared on the root element.
                XmlNamespaceManager manager = new XmlNamespaceManager(xml.NameTable);
                manager.AddNamespace("s", xml.DocumentElement.NamespaceURI);

                // Select <loc> only, so that sibling elements such as <lastmod>
                // do not end up inside the URL text.
                var links = xml.SelectNodes("/s:sitemapindex/s:sitemap/s:loc", manager);
                if (links == null) return new Uri[] { };

                return links
                    .Cast<XmlNode>()
                    .Select(x => new Uri(x.InnerText.Trim()));
            }

            // Anything that is not XML goes through the regular HTML link parser.
            return _linkParser.GetLinks(crawledPage);
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            // The custom link parser is passed as the sixth constructor argument.
            SiteMapFinder finder = new SiteMapFinder();
            PoliteWebCrawler crawler = new PoliteWebCrawler(null, null, null, null, null, finder, null, null, null);

            crawler.PageCrawlCompleted += Crawler_PageCrawlCompleted;
            CrawlResult result = crawler.Crawl(new Uri("http://tenders.rfpalertservices.com/sitemap/"));
        }

        private static void Crawler_PageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
        {
            // Print the crawled URI followed by every response header.
            Console.WriteLine(e.CrawledPage.Uri.AbsoluteUri);
            e.CrawledPage.HttpWebResponse.Headers.AllKeys.ForEach(k => Console.WriteLine($"{k}: {e.CrawledPage.HttpWebResponse.Headers[k]}"));
        }
    }
}
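
The XML handling can also be tried out on its own, without Abot. The following is a minimal sketch that uses only System.Xml and assumes the document follows the sitemaps.org layout, with the namespace declared on the root element. It reads the <loc> element of each entry, so that a sibling <lastmod> value is not appended to the URL, and it accepts both a sitemap index (sitemapindex/sitemap) and a plain URL set (urlset/url). The names SitemapLinkExtractor and ExtractLinks are hypothetical and not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class SitemapLinkExtractor
{
    // Returns the absolute URLs found in the <loc> elements of a sitemap
    // index or URL set. Entries that are not absolute URLs are skipped.
    public static List<Uri> ExtractLinks(string sitemapXml)
    {
        var links = new List<Uri>();

        var xml = new XmlDocument();
        xml.LoadXml(sitemapXml); // throws XmlException on malformed XML
        if (xml.DocumentElement == null) return links;

        // Bind the prefix "s" to whatever namespace the root element declares.
        var manager = new XmlNamespaceManager(xml.NameTable);
        manager.AddNamespace("s", xml.DocumentElement.NamespaceURI);

        // Union of the two layouts: nested sitemaps and page URLs.
        XmlNodeList nodes = xml.SelectNodes(
            "/s:sitemapindex/s:sitemap/s:loc | /s:urlset/s:url/s:loc", manager);
        if (nodes == null) return links;

        foreach (XmlNode node in nodes)
        {
            Uri uri;
            if (Uri.TryCreate(node.InnerText.Trim(), UriKind.Absolute, out uri))
            {
                links.Add(uri);
            }
        }

        return links;
    }
}
```

Inside GetLinks, such a helper could be called with crawledPage.Content.Text in place of the inline XML code, which keeps the parsing testable without running a crawl.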

Regarding c# - Crawling a sitemap with Abot, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42581658/
