
c# - How do I extract data from a website by specifying search criteria?

Reposted. Author: 行者123. Updated: 2023-11-27 23:18:43

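I have a new project in an area I am not familiar with. One of the tasks is to browse several websites and collect data from them. An example site is this one: https://www.hudhomestore.com/Home/Index.aspx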


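I have already read and watched tutorials on scraping data from web pages.

My question is: how do we usually set the search preferences, run the search based on them, and then load the results into my code, starting from the link above?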

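Edit

This works for setting the search criteria according to my selections. However, when I run the search manually for the state MI, the total number of results is 223, whereas tdNodeCollection contains only 121 items when I run the code below. Can you tell me where I went wrong?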

    HtmlWeb web = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

    string zipCode = "", city = "", county = "", street = "", sState = "MI", fromPrice = "0", toPrice = "0", fcaseNumber = "",
        bed = "0", bath = "0", buyerType = "0", Status = "0", indoorAmenities = "", outdoorAmenities = "", housingType = "",
        stories = "", parking = "", propertyAge = "", sLanguage = "ENGLISH";

    var doc = await (Task.Factory.StartNew(() => web.Load("https://www.hudhomestore.com/Listing/PropertySearchResult.aspx?" +
        "zipCode=" + zipCode + "&city=" + city + "&country=" + county + "&street=" + street + "&sState=" + sState +
        "&fromPrice=" + fromPrice + "&toPrice=" + toPrice +
        "&fcaseNumber=" + fcaseNumber + "&bed=" + bed + "&bath=" + bath +
        "&buyerType=" + buyerType + "&Status=" + Status + "&indoorAmenities=" + indoorAmenities +
        "&outdoorAmenities=" + outdoorAmenities + "&housingType=" + housingType + "&stories=" + stories +
        "&parking=" + parking + "&propertyAge=" + propertyAge + "&sLanguage=" + sLanguage)));

    HtmlNodeCollection tdNodeCollection = doc
        .DocumentNode
        .SelectNodes("//*[@id=\"dgPropertyList\"]//tr//td");

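Best answer

You can use HtmlAgilityPack for this. I wrote a small piece of test code and ran it against the second page you want to scrape (the search results page), using search criteria that you can set yourself.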

    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
    HtmlWeb web = new HtmlWeb();
    //string InitialUrl = "https://www.hudhomestore.com/Home/Index.aspx";
    // Set these variables to whatever the user enters,
    // then append them to the search URL as query-string parameters.
    string zipCode = "", city = "", county = "", street = "", sState = "AK", fromPrice = "0", toPrice = "0", fcaseNumber = "",
        bed = "0", bath = "0", buyerType = "0", Status = "0", indoorAmenities = "", outdoorAmenities = "", housingType = "",
        stories = "", parking = "", propertyAge = "", sLanguage = "ENGLISH";
    // Note: the county value is sent under the parameter name "country"; check which name the site expects.
    HtmlAgilityPack.HtmlDocument document = web.Load("https://www.hudhomestore.com/Listing/PropertySearchResult.aspx?" +
        "zipCode=" + zipCode + "&city=" + city + "&country=" + county + "&street=" + street + "&sState=" + sState +
        "&fromPrice=" + fromPrice + "&toPrice=" + toPrice +
        "&fcaseNumber=" + fcaseNumber + "&bed=" + bed + "&bath=" + bath +
        "&buyerType=" + buyerType + "&Status=" + Status + "&indoorAmenities=" + indoorAmenities +
        "&outdoorAmenities=" + outdoorAmenities + "&housingType=" + housingType + "&stories=" + stories +
        "&parking=" + parking + "&propertyAge=" + propertyAge + "&sLanguage=" + sLanguage);
    // Select every <td> inside every <tr> under the element with id="dgPropertyList".
    HtmlNodeCollection tdNodeCollection = document
        .DocumentNode
        .SelectNodes("//*[@id=\"dgPropertyList\"]//tr//td");
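The long string concatenation above can also be expressed as a small helper that percent-encodes each value, so that a city name containing a space, for example, still yields a valid URL. This is only a sketch: the `SearchUrl` class, its `Build` method, and the sample value "Ann Arbor" are made up for illustration, while the parameter names `city`, `sState`, and `sLanguage` and the base address come from the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of HtmlAgilityPack or of the site's API.
public static class SearchUrl
{
    // Base address of the results page, taken from the code above.
    public const string BaseUrl =
        "https://www.hudhomestore.com/Listing/PropertySearchResult.aspx";

    // Builds "base?key=value&key=value", percent-encoding every key and value.
    public static string Build(string baseUrl,
        IEnumerable<KeyValuePair<string, string>> criteria)
    {
        IEnumerable<string> pairs = criteria.Select(kv =>
            Uri.EscapeDataString(kv.Key) + "=" + Uri.EscapeDataString(kv.Value ?? ""));
        return baseUrl + "?" + string.Join("&", pairs);
    }
}

public static class Program
{
    public static void Main()
    {
        // Sample criteria; "Ann Arbor" is an illustrative value only.
        var criteria = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("city", "Ann Arbor"),
            new KeyValuePair<string, string>("sState", "MI"),
            new KeyValuePair<string, string>("sLanguage", "ENGLISH"),
        };

        // Prints the search URL with the space in "Ann Arbor" encoded as %20.
        Console.WriteLine(SearchUrl.Build(SearchUrl.BaseUrl, criteria));
    }
}
```

The resulting string can be passed to `web.Load(...)` in place of the hand-built URL. A list of pairs is used rather than a dictionary so that the parameters keep the order in which they were added.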

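Count again and look at your XPath expression: the tr rows under id="dgPropertyList" contain exactly 121 td elements. Next, inspect those td elements manually, work out which of them hold what you need, and read the data from there.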

    foreach (HtmlAgilityPack.HtmlNode node in tdNodeCollection)
    {
        // To reach elements such as <h2> or <p> inside the cell, you can do:
        HtmlNode h2Node = node.SelectSingleNode("./h2"); // first <h2> that is a direct child, or null if there is none
        HtmlNodeCollection allH2Nodes = node.SelectNodes(".//h2"); // every <h2> at any depth below the cell

        // You can also walk the children directly, without XPath (like in a tree):
        HtmlNode h2Node_ = node.ChildNodes["h2"];
    }

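I have tested the code and it works: it parses the whole document and finds the required table. It gives you all the rows of that table inside the div, so you can drill further into those rows, find your td elements, and get what you need.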
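Because a count of td cells is not the same as a count of listings, it can help to read the table row by row instead of as one flat list of cells. The sketch below does that on a small in-memory HTML fragment. The fragment (one header row, then two listings spread over six cells), the `ListingTable` class, and the `ReadRows` method are assumptions made for illustration; the real page's markup may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper for illustration.
public static class ListingTable
{
    // Returns one string array per data row; each array holds that row's cell texts.
    public static List<string[]> ReadRows(HtmlDocument doc, string tableId)
    {
        var rows = new List<string[]>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection trNodes =
            doc.DocumentNode.SelectNodes("//*[@id='" + tableId + "']//tr");
        if (trNodes == null) return rows;

        foreach (HtmlNode tr in trNodes)
        {
            // "./td" takes only the direct cells of this row;
            // a header row made of <th> cells has none and is skipped.
            HtmlNodeCollection cells = tr.SelectNodes("./td");
            if (cells == null) continue;

            rows.Add(cells
                .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
                .ToArray());
        }
        return rows;
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up sample markup standing in for the downloaded results page.
        string html =
            "<div><table id='dgPropertyList'>" +
            "<tr><th>Case</th><th>City</th><th>Price</th></tr>" +
            "<tr><td>001</td><td>Detroit</td><td>$45,000</td></tr>" +
            "<tr><td>002</td><td>Flint</td><td>$30,000</td></tr>" +
            "</table></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        List<string[]> rows = ListingTable.ReadRows(doc, "dgPropertyList");
        Console.WriteLine("rows: " + rows.Count + ", cells: " + rows.Sum(r => r.Length));
        foreach (string[] row in rows)
        {
            Console.WriteLine(string.Join(" | ", row));
        }
    }
}
```

With the real page, the `HtmlDocument` returned by `web.Load(...)` would be passed to `ReadRows` instead of the sample fragment, and the row count could then be compared with the number of results shown in the browser.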

Another option is to use Selenium WebDriver (Get your hands on Selenium).

If you do not want the browser to be visible but still want Selenium-like functionality, you can use PhantomJS.

Hope this helps.

Regarding "c# - How do I extract data from a website by specifying search criteria?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42081386/
