c# - Why is there a limit on concurrent downloads?

Reposted · Author: 太空狗 · Updated: 2023-10-29 20:21:17

I am trying to write my own simple web crawler. I want to download files with specific extensions from a URL. I wrote the following code:

    private void button1_Click(object sender, RoutedEventArgs e)
    {
        if (bw.IsBusy) return;
        bw.DoWork += new DoWorkEventHandler(bw_DoWork);
        bw.RunWorkerAsync(new string[] { URL.Text, SavePath.Text, Filter.Text });
    }
    //--------------------------------------------------------------------------------------------
    void bw_DoWork(object sender, DoWorkEventArgs e)
    {
        try
        {
            ThreadPool.SetMaxThreads(4, 4);
            string[] strs = e.Argument as string[];
            Regex reg = new Regex("<a(\\s*[^>]*?){0,1}\\s*href\\s*\\=\\s*\\\"([^>]*?)\\\"\\s*[^>]*>(.*?)</a>",
                RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);
            int i = 0;
            string domainS = strs[0];
            string Extensions = strs[2];
            string OutDir = strs[1];
            var domain = new Uri(domainS);
            string[] Filters = Extensions.Split(new char[] { ';', ',', ' ' }, StringSplitOptions.RemoveEmptyEntries);
            string outPath = System.IO.Path.Combine(OutDir, string.Format("File_{0}.html", i));

            WebClient webClient = new WebClient();
            string str = webClient.DownloadString(domainS);
            str = str.Replace("\r\n", " ").Replace('\n', ' ');
            MatchCollection mc = reg.Matches(str);
            int NumOfThreads = mc.Count;

            Parallel.ForEach(mc.Cast<Match>(), new ParallelOptions { MaxDegreeOfParallelism = 2 },
                mat =>
                {
                    string val = mat.Groups[2].Value;
                    var link = new Uri(domain, val);
                    foreach (string ext in Filters)
                        if (val.EndsWith("." + ext))
                        {
                            Download((object)new object[] { OutDir, link });
                            break;
                        }
                });
            // Deliberate hack: reuse the exception path to report completion.
            throw new Exception("Finished !");
        }
        catch (System.Exception ex)
        {
            ReportException(ex);
        }
    }
//--------------------------------------------------------------------------------------------
    private static void Download(object o)
    {
        try
        {
            object[] objs = o as object[];
            Uri link = (Uri)objs[1];
            string outPath = System.IO.Path.Combine((string)objs[0], System.IO.Path.GetFileName(link.ToString()));
            if (!File.Exists(outPath))
            {
                //WebClient webClient = new WebClient();
                //webClient.DownloadFile(link, outPath);

                DownloadFile(link.ToString(), outPath);
            }
        }
        catch (System.Exception ex)
        {
            ReportException(ex);
        }
    }
//--------------------------------------------------------------------------------------------
    private static bool DownloadFile(string url, string filePath)
    {
        try
        {
            HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);
            request.UserAgent = "Web Crawler";
            request.Timeout = 40000;
            WebResponse response = request.GetResponse();
            Stream stream = response.GetResponseStream();
            using (FileStream fs = new FileStream(filePath, FileMode.CreateNew))
            {
                const int siz = 1000;
                byte[] bytes = new byte[siz];
                for (; ; )
                {
                    int count = stream.Read(bytes, 0, siz);
                    fs.Write(bytes, 0, count);
                    if (count == 0) break;
                }
                fs.Flush();
            }
        }
        catch (System.Exception ex)
        {
            ReportException(ex);
            return false;
        }
        return true;
    }

The problem is that, while it works fine with 2 parallel downloads:

        new ParallelOptions { MaxDegreeOfParallelism = 2,  }

...it does not work with a higher degree of parallelism, e.g.:

        new ParallelOptions { MaxDegreeOfParallelism = 5,  }

...and I get connection-timeout exceptions.

At first I thought WebClient was the cause:

        //WebClient webClient = new WebClient();
        //webClient.DownloadFile(link, outPath);

...but when I replaced it with the DownloadFile function, which uses HttpWebRequest, I still got the error.

I have tested this on many web pages and nothing changes. I also confirmed, via Chrome's "Download Master" extension, that these web servers allow multiple parallel downloads. Does anyone know why I get timeout exceptions when trying to download several files in parallel?

Best Answer

You need to assign ServicePointManager.DefaultConnectionLimit. The default number of concurrent connections to the same host is 2. See also the related SO post about using the connectionManagement element in web.config.
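The fix from the answer can be sketched as a one-time assignment at application startup; the value 10 below is arbitrary and should be tuned to your workload:

```csharp
using System;
using System.Net;

class Program
{
    static void Main()
    {
        // The default limit of concurrent HTTP connections per host is 2
        // (ServicePointManager.DefaultConnectionLimit). Raise it before
        // the first request is issued, since it is applied per ServicePoint.
        ServicePointManager.DefaultConnectionLimit = 10;
        Console.WriteLine(ServicePointManager.DefaultConnectionLimit);
    }
}
```

The same limit can also be set declaratively in app.config/web.config via the `connectionManagement` element under `system.net`, which avoids hard-coding the number.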

Regarding "c# - Why is there a download concurrency limit?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/11017981/
