c# - HTML Agility Pack: Screen Scraping Unable to Find a Div with Hyphen in Class Name?

Reposted. Author: 行者123. Updated: 2023-11-30 21:59:22

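This is partly a learning exercise and partly for "fun". Basically, I'm trying to parse the price of the "Balcony" stateroom (currently $1,039) in a C# console application. The URL is: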

http://www.carnival.com/BookingEngine/Stateroom/Stateroom2/?embkCode=PCV&itinCode=SC0&durDays=8&shipCode=SH&subRegionCode=CS&sailDate=08082015&sailingID=68791&numGuests=2&showDbl=False&isOver55=N&isPastGuest=N&stateCode=&isMilitary=N&evsel=&be_version=1

I have the URL above loading fine with:

var document = getHtmlWeb.Load(web_address);

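The container for the Balcony price is a div with class "col", which is the third div inside the element with the classes column-container clearfix. I assumed all I needed to do was find all the divs by class: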

var lowest_price = document.DocumentNode.SelectNodes("//div[@class='col-bottom']");

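and then pick the third node to get the Balcony price. But the lowest_price variable keeps coming back null. I know the document itself loads, and if I select on "col" I can see the "col" divs. Is the hyphen in col-bottom preventing that div from being found?

Is there another way to do this? As I said, this is mostly a learning exercise. But I have had to build some custom monitoring solutions that required screen scraping, so it is not only for fun.

Thanks!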

EDIT: HTML snippet with the relevant information:

    <div class="col">
        <h2 data-cat-title="Balcony" class="uk-rate-title-OB"> Balcony </h2> <p>&nbsp;</p>
        <div class="col-bottom">
            <h3> From</h3>
            <strong> $1,039.00* <span class="rate-compare-strike"> </span> </strong><a metacode="OB" href="#" class="select-btn">Select</a>
        </div>
    </div>

Best answer

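A hyphen in an attribute name or value is valid HTML, so that is not the problem. The problem with your source is that it renders the HTML on the client side with JavaScript. To verify this, download the HTML page yourself: you will notice that the elements you are looking for are not in it.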
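To see that the hyphen itself is harmless, here is a minimal sketch that runs the question's XPath against the HTML snippet from the question, supplied as a static string. The class name `HyphenClassDemo` and the method `ExtractPrice` are made up for this illustration, and the sketch assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using HtmlAgilityPack;

public static class HyphenClassDemo
{
    // The snippet from the question as static markup (no JavaScript involved).
    public const string Snippet = @"
<div class=""col"">
  <h2 data-cat-title=""Balcony"" class=""uk-rate-title-OB""> Balcony </h2> <p>&nbsp;</p>
  <div class=""col-bottom"">
    <h3> From</h3>
    <strong> $1,039.00* <span class=""rate-compare-strike""> </span> </strong>
    <a metacode=""OB"" href=""#"" class=""select-btn"">Select</a>
  </div>
</div>";

    // Hypothetical helper: returns the trimmed text of the <strong> element
    // inside div.col-bottom, or null when no such element exists.
    public static string ExtractPrice(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Same predicate as in the question; the hyphen needs no escaping.
        var strong = doc.DocumentNode.SelectSingleNode("//div[@class='col-bottom']/strong");
        if (strong == null)
        {
            return null;
        }

        // Decode entities such as &nbsp; and drop surrounding whitespace.
        return HtmlEntity.DeEntitize(strong.InnerText).Trim();
    }

    public static void Main()
    {
        Console.WriteLine(ExtractPrice(Snippet) ?? "(not found)");
    }
}
```

With the element present in the markup, the selector matches. A null result against the live page therefore points to the element being absent from the downloaded HTML rather than to the class name.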

To parse a page like this, the JavaScript has to run first. For that you can use a web browser control and then pass the resulting HTML to HAP.
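Before reaching for a browser control, the verification step can be sketched roughly as follows: fetch the raw response without executing any script and check whether the class name appears in it at all. `RawHtmlCheck` and `ContainsMarker` are names invented for this sketch, and the default URL is only a placeholder; pass the real page address as the first argument.

```csharp
using System;
using System.Net.Http;

public static class RawHtmlCheck
{
    // Hypothetical helper: true when the markup mentions the marker text anywhere.
    public static bool ContainsMarker(string html, string marker)
    {
        return html != null
            && html.IndexOf(marker, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    public static void Main(string[] args)
    {
        // Placeholder URL (an assumption for this sketch); replace with the page to test.
        string url = args.Length > 0 ? args[0] : "http://example.com/";

        using (var client = new HttpClient())
        {
            // Downloads the response body as-is; no JavaScript is executed.
            string html = client.GetStringAsync(url).GetAwaiter().GetResult();

            Console.WriteLine(ContainsMarker(html, "col-bottom")
                ? "col-bottom is present in the raw HTML; plain HAP loading should find it."
                : "col-bottom is missing from the raw HTML; it is probably rendered by script.");
        }
    }
}
```

If the marker is missing from the raw response, no XPath expression will find it until the page's scripts have run.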

Here is a simple example of how to use the WinForms WebBrowser control:

    // Requires: using System.Windows.Forms;
    private void ParseSomeHtmlThatRenderedJavascript()
    {
        var browser = new System.Windows.Forms.WebBrowser() { ScriptErrorsSuppressed = true };

        string link = "yourLinkHere";

        // Called when the web page has loaded. In real code this should be a class member;
        // it is a local here only because this is a simple demonstration.
        WebBrowserDocumentCompletedEventHandler onDocumentCompleted = new WebBrowserDocumentCompletedEventHandler((s, evt) =>
        {
            // Do your HTML parsing here
            var doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(browser.DocumentText);
            var someNode = doc.DocumentNode.SelectNodes("yourxpathHere");
        });

        // Subscribe to the DocumentCompleted event with the handler above before navigating
        browser.DocumentCompleted += onDocumentCompleted;

        browser.Navigate(link);
    }

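You can also take a look at Awesomium and some other embedded web browser controls.

Also, if you want to run the WebBrowser control in a console application, that is, without a Windows Forms project, here is an example. It is based on the Stack Overflow answer "WebBrowser Control in a new thread".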

    using System;
    using System.Text;
    using System.Threading;
    using System.Threading.Tasks;
    using System.Windows.Forms;
    using HtmlAgilityPack;

    namespace ConsoleApplication276
    {
        // a container for a URL and a parser Action
        public class Link
        {
            public string link { get; set; }
            public Action<string> parser { get; set; }
        }

        public class Program
        {
            // Entry point of the console app
            public static void Main(string[] args)
            {
                try
                {
                    // Download each page and dump the content.
                    // You can add more links here; associate each link with a parser action.
                    // For any data the parser should produce, add a property to the Link container.
                    var task = MessageLoopWorker.Run(DoWorkAsync, new Link() {
                        link = "google.com",
                        parser = (string html) => {
                            // do whatever you need with HAP here
                            var doc = new HtmlAgilityPack.HtmlDocument();
                            doc.LoadHtml(html);
                            var someNodes = doc.DocumentNode.SelectSingleNode("//div");
                        }
                    });

                    task.Wait();
                    Console.WriteLine("DoWorkAsync completed.");
                }
                catch (Exception ex)
                {
                    Console.WriteLine("DoWorkAsync failed: " + ex.Message);
                }

                Console.WriteLine("Press Enter to exit.");
                Console.ReadLine();
            }

            // navigate WebBrowser to the list of urls in a loop
            public static async Task<Link> DoWorkAsync(Link[] args)
            {
                Console.WriteLine("Start working.");

                using (var wb = new WebBrowser())
                {
                    wb.ScriptErrorsSuppressed = true;

                    TaskCompletionSource<bool> tcs = null;
                    WebBrowserDocumentCompletedEventHandler documentCompletedHandler = (s, e) =>
                        tcs.TrySetResult(true);

                    // navigate to each URL in the list
                    foreach (var arg in args)
                    {
                        tcs = new TaskCompletionSource<bool>();
                        wb.DocumentCompleted += documentCompletedHandler;
                        try
                        {
                            wb.Navigate(arg.link.ToString());
                            // await for DocumentCompleted
                            await tcs.Task;
                            // after the page loads pass the html to the parser
                            arg.parser(wb.DocumentText);
                        }
                        finally
                        {
                            wb.DocumentCompleted -= documentCompletedHandler;
                        }
                        // the DOM is ready
                        Console.WriteLine(arg.link.ToString());
                        Console.WriteLine(wb.Document.Body.OuterHtml);
                    }
                }

                Console.WriteLine("End working.");
                return null;
            }

        }

        // a helper class to start the message loop and execute an asynchronous task
        public static class MessageLoopWorker
        {
            public static async Task<Object> Run(Func<Link[], Task<Link>> worker, params Link[] args)
            {
                var tcs = new TaskCompletionSource<object>();

                var thread = new Thread(() =>
                {
                    EventHandler idleHandler = null;

                    idleHandler = async (s, e) =>
                    {
                        // handle Application.Idle just once
                        Application.Idle -= idleHandler;

                        // return to the message loop
                        await Task.Yield();

                        // and continue asynchronously
                        // propagate the result or exception
                        try
                        {
                            var result = await worker(args);
                            tcs.SetResult(result);
                        }
                        catch (Exception ex)
                        {
                            tcs.SetException(ex);
                        }

                        // signal to exit the message loop
                        // Application.Run will exit at this point
                        Application.ExitThread();
                    };

                    // handle Application.Idle just once
                    // to make sure we're inside the message loop
                    // and SynchronizationContext has been correctly installed
                    Application.Idle += idleHandler;
                    Application.Run();
                });

                // set STA model for the new thread
                thread.SetApartmentState(ApartmentState.STA);

                // start the thread and await the task
                thread.Start();
                try
                {
                    return await tcs.Task;
                }
                finally
                {
                    thread.Join();
                }
            }
        }
    }

Regarding "c# - HTML Agility Pack: Screen Scraping Unable to Find a Div with Hyphen in Class Name?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/29302259/
