
c# - Simple web crawler in C#

Reposted. Author: 可可西里. Updated: 2023-11-01 09:00:26

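I created a simple web crawler, and I want to add recursion so that every page it opens is also scanned for the URLs it contains, but I don't know how to do that. I would also like to use threads to make it faster. Here is my code: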

using System;
using System.IO;
using System.Net;
using System.Windows.Forms;
using mshtml; // COM reference: Microsoft HTML Object Library

namespace Crawler
{
    public partial class Form1 : Form
    {
        String Rstring;

        public Form1()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            String URL = textBox1.Text;

            WebRequest myWebRequest = WebRequest.Create(URL);

            // The using blocks close the response, stream, and reader even if reading throws.
            using (WebResponse myWebResponse = myWebRequest.GetResponse()) // response from the Internet resource
            using (Stream streamResponse = myWebResponse.GetResponseStream()) // data stream of the response body
            using (StreamReader sreader = new StreamReader(streamResponse)) // reads the data stream as text
            {
                Rstring = sreader.ReadToEnd(); // read the page to the end
            }

            String Links = GetContent(Rstring); // extract only the links

            textBox2.Text = Rstring;
            textBox3.Text = Links;
        }

        private String GetContent(String Rstring)
        {
            String sString = "";

            // Load the HTML into an MSHTML document so its link collection can be read.
            HTMLDocument d = new HTMLDocument();
            IHTMLDocument2 doc = (IHTMLDocument2)d;
            doc.write(Rstring);

            IHTMLElementCollection L = doc.links;

            foreach (IHTMLElement links in L)
            {
                sString += links.getAttribute("href", 0);
                sString += Environment.NewLine; // one link per line in the text box
            }
            return sString;
        }
    }
}

Best answer

I fixed your GetContent method as follows, so that it returns the new links found in a crawled page:

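// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
public ISet<string> GetNewLinks(string content)
{
    // Matches the quoted href value of anchors written as <a href="..."> or <a href='...'>.
    Regex regexLink = new Regex("(?<=<a\\s*?href=(?:'|\"))[^'\"]*?(?=(?:'|\"))");

    ISet<string> newLinks = new HashSet<string>();
    foreach (Match match in regexLink.Matches(content))
    {
        // HashSet.Add ignores duplicates, so no Contains check is needed.
        newLinks.Add(match.Value);
    }

    return newLinks;
}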
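
To make it concrete what this pattern does and does not pick up, here is a small self-contained sketch. The class name `LinkExtractor` and the sample HTML in the note below are assumptions made for illustration; only the regular expression comes from the answer. The lookbehind requires `href` to be the first attribute after `<a`, so anchors that put another attribute first are skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical wrapper class; only the pattern itself is taken from the answer.
public static class LinkExtractor
{
    private static readonly Regex RegexLink =
        new Regex("(?<=<a\\s*?href=(?:'|\"))[^'\"]*?(?=(?:'|\"))");

    // Returns the distinct href values found in the given HTML text.
    public static ISet<string> GetNewLinks(string content)
    {
        var newLinks = new HashSet<string>();
        foreach (Match match in RegexLink.Matches(content))
        {
            newLinks.Add(match.Value); // duplicates are ignored by the set
        }
        return newLinks;
    }
}
```

For an input containing `<a href="http://example.com/a">`, `<a href='/b'>`, `<a class="x" href="/c">` and a second copy of the first anchor, the set holds `http://example.com/a` and `/b`. The `/c` link is missed because `class` precedes `href`, and relative values such as `/b` still have to be resolved against the page URL before they can be fetched.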

Update

Fixed: the regex variable should be regexLink. Thanks to @shashlearner for pointing this out (my typo).
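
The question also asks about recursion and threads, which the answer does not cover. One possible way to combine them is sketched below. This is not the answerer's method. It swaps literal recursion for a level-by-level (breadth-first) loop with a depth limit, and uses `Parallel.ForEach` to fetch the pages of one level concurrently. The names `CrawlSketch`, `Crawl`, `maxDepth` and `fetchPage` are hypothetical, and `fetchPage` is a caller-supplied delegate standing in for the download code.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical sketch; only the regex pattern is taken from the answer.
public static class CrawlSketch
{
    private static readonly Regex RegexLink =
        new Regex("(?<=<a\\s*?href=(?:'|\"))[^'\"]*?(?=(?:'|\"))");

    // Fetches pages up to maxDepth link hops from startUrl and returns every URL seen.
    // fetchPage (an assumption) returns the HTML for a URL and may throw on failure.
    public static ISet<string> Crawl(string startUrl, int maxDepth, Func<string, string> fetchPage)
    {
        string start = new Uri(startUrl).GetLeftPart(UriPartial.Query);

        // Thread-safe "seen" set: TryAdd returns false if the URL was already recorded.
        var visited = new ConcurrentDictionary<string, bool>();
        visited.TryAdd(start, true);
        var currentLevel = new List<string> { start };

        for (int depth = 0; depth <= maxDepth && currentLevel.Count > 0; depth++)
        {
            var nextLevel = new ConcurrentBag<string>();

            // Pages on the same level are independent, so they are fetched in parallel.
            Parallel.ForEach(currentLevel, url =>
            {
                string html;
                try { html = fetchPage(url); }
                catch (Exception) { return; } // skip pages that fail to download

                foreach (Match match in RegexLink.Matches(html))
                {
                    // Resolve relative hrefs against the page they were found on.
                    Uri absolute;
                    if (!Uri.TryCreate(new Uri(url), match.Value, out absolute)) continue;
                    if (absolute.Scheme != Uri.UriSchemeHttp &&
                        absolute.Scheme != Uri.UriSchemeHttps) continue;

                    string link = absolute.GetLeftPart(UriPartial.Query); // drop #fragment
                    if (visited.TryAdd(link, true)) nextLevel.Add(link);
                }
            });

            currentLevel = nextLevel.ToList();
        }

        return new HashSet<string>(visited.Keys);
    }
}
```

The visited set is what keeps the crawl from looping when pages link back to each other, and the depth limit keeps it from running indefinitely. In the form from the question, `fetchPage` could wrap the `WebRequest` code, and the call would need to run off the UI thread (for example inside `Task.Run`), since calling it directly from `button1_Click` would block the window until the crawl finishes.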

Regarding "c# - Simple web crawler in C#", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/10452749/
