
excel - Scraping with changing class names

Reposted. Author: 行者123. Updated: 2023-12-04 21:27:36

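I'm trying to extract the name, address, role, status, appointment date, and resignation date (if any) from a web page; a sample of the markup is below.
The problem is that the number of directors can differ from company to company, and I'm not sure how to determine the total number of directors (class appointment-1) = x so that I can loop over them.
HTML code: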

<div class="appointments-list">
<div class="appointment-1">
<h2 class="heading-medium">
<span id="officer-name-1">
<a href="/officers/Oo16GI3lS3HEgrIR-kCpmLYbDWw/appointments" onclick="javascript:_paq.push(['trackGoal', 5]);">BUCKSEY, Nicholas</a>
</span>
</h2>
<dl>
<dt id="officer-address-field-1">Correspondence address</dt>
<dd class="data" id="officer-address-value-1">
1 St James&#39;s Square, London, SW1Y 4PD </dd>
</dl>
<div class="grid-row">
<dl class="column-quarter">
<dt>Role
<span id="officer-status-tag-1" class="status-tag font-xsmall">Active</span>
</dt>
<dd id="officer-role-1" class="data">
Secretary
</dd>
</dl>
<dl class="column-quarter">
<dt>Appointed on</dt>
<dd id="officer-appointed-on-1" class="data">
1 June 2020
</dd>
</dl>
</div>
<div class="grid-row"></div>
<div class="grid-row"></div>
<div class="grid-row"></div>
</div>
<div class="appointment-2">
<h2 class="heading-medium heading-with-border">
<span id="officer-name-2">
<a href="/officers/IND_i3_G7Gqq3ZzC3P0rXYbUcNU/appointments" onclick="javascript:_paq.push(['trackGoal', 5]);">MATHEWS, Benedict John Spurway</a>
</span>
</h2>

<dl>
<dt id="officer-address-field-2">Correspondence address</dt>
<dd class="data" id="officer-address-value-2">
1 St James&#39;s Square, London, SW1Y 4PD </dd>
</dl>

<div class="grid-row">
<dl class="column-quarter">
<dt>Role
<span id="officer-status-tag-2" class="status-tag font-xsmall">Active</span>
</dt>
<dd id="officer-role-2" class="data">
Secretary
</dd>
</dl>
<dl class="column-quarter">
<dt>Appointed on</dt>
<dd id="officer-appointed-on-2" class="data">
7 May 2019
</dd>
</dl>
</div>

<div class="grid-row"></div>
<div class="grid-row"></div>
<div class="grid-row"></div>
</div>
VBA code: I'm trying to use querySelectorAll but can't "identify" the correct class ID.
Sub ChangeTab()
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = True
ie.navigate "https://find-and-update.company-information.service.gov.uk/company/00102498/officers"

Do While ie.readyState <> 4: DoEvents: Loop

'Application.Wait (Now + TimeValue("0:00:02"))
' Dim i As Long, secNumberNodeList As Object, secNumberNode As Object

Set secNumberNodeList = ie.Document.querySelectorAll("appointments-list")

For Each sc In secNumberNodeList
Debug.Print sc.getElementById("officer-name-1")
Debug.Print sc.getElementById("officer-address-value-1")
Debug.Print sc.getElementById("officer-status-tag-1")
Debug.Print sc.getElementById("officer-appointed-on-1")
Debug.Print sc.getElementById("officer-appointed-on-1")
Debug.Print sc.getElementById("officer-resigned-on-16")
Next
End Sub

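Best answer

This is one reliable way to do it. I used XMLHttpRequest instead of IE. I've tried to show how to use a loop to access the content of every container. Try defining the other fields you're interested in inside the loop to parse them as well.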

Option Explicit

Sub GetInformation()
    Const URL = "https://find-and-update.company-information.service.gov.uk/company/00102498/officers"
    Dim Http As Object, Html As HTMLDocument, I&
    Dim HtmlDoc As HTMLDocument, sName$, sAddress$

    Set Html = New HTMLDocument
    Set HtmlDoc = New HTMLDocument
    Set Http = CreateObject("MSXML2.XMLHTTP")

    ' Fetch the page synchronously and load the markup into an in-memory document
    With Http
        .Open "GET", URL, False
        .send
        Html.body.innerHTML = .responseText
    End With

    ' Match every direct child of .appointments-list whose class starts with
    ' "appointment-", so the number of officers does not need to be known up front
    With Html.querySelectorAll(".appointments-list > [class^='appointment-']")
        For I = 0 To .Length - 1
            ' Re-parse one container on its own so the selectors below only see that officer
            HtmlDoc.body.innerHTML = .Item(I).outerHTML
            sName = HtmlDoc.querySelector("h2 > span > a").innerText
            sAddress = HtmlDoc.querySelector(".data[id^='officer-address-value-']").innerText
            Debug.Print sName, sAddress
        Next I
    End With
End Sub
References to add before running the script above:
1. Microsoft XML, v6.0
2. Microsoft HTML Object Library
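
The answer leaves the remaining fields as an exercise. Below is a minimal sketch of one way to fill them in; it is an illustration, not the answerer's code. It assumes that the role, status, appointment and resignation elements follow the same `id` prefixes seen in the question (`officer-role-`, `officer-status-tag-`, `officer-appointed-on-`, `officer-resigned-on-`). `GetAllOfficerFields` and `FieldText` are hypothetical names introduced here. The node list's `.Length` gives the number of officers found in the returned markup. Note also that the selector in the question, `"appointments-list"`, has no leading dot, so it is read as a tag name rather than a class.

Option Explicit

' Sketch only: needs the same two references as the script above.
Sub GetAllOfficerFields()
    Const URL = "https://find-and-update.company-information.service.gov.uk/company/00102498/officers"
    Dim Http As Object, Html As HTMLDocument, HtmlDoc As HTMLDocument
    Dim I As Long

    Set Html = New HTMLDocument
    Set HtmlDoc = New HTMLDocument
    Set Http = CreateObject("MSXML2.XMLHTTP")

    With Http
        .Open "GET", URL, False
        .send
        Html.body.innerHTML = .responseText
    End With

    With Html.querySelectorAll(".appointments-list > [class^='appointment-']")
        ' Number of officer containers in the markup that was returned
        Debug.Print "Officers found: " & .Length

        For I = 0 To .Length - 1
            HtmlDoc.body.innerHTML = .Item(I).outerHTML
            ' The id prefixes below are assumed from the sample HTML in the question
            Debug.Print FieldText(HtmlDoc, "h2 > span > a"), _
                        FieldText(HtmlDoc, "[id^='officer-address-value-']"), _
                        FieldText(HtmlDoc, "[id^='officer-role-']"), _
                        FieldText(HtmlDoc, "[id^='officer-status-tag-']"), _
                        FieldText(HtmlDoc, "[id^='officer-appointed-on-']"), _
                        FieldText(HtmlDoc, "[id^='officer-resigned-on-']")
        Next I
    End With
End Sub

' Hypothetical helper: returns the text of the first match,
' or an empty string when the element is absent (e.g. no resignation date).
Private Function FieldText(ByVal Doc As HTMLDocument, ByVal Selector As String) As String
    Dim Node As Object
    Set Node = Doc.querySelector(Selector)
    If Not Node Is Nothing Then FieldText = Trim$(Node.innerText)
End Function

Routing every lookup through the helper avoids a run-time error when an optional element such as the resignation date is missing, which a direct `.querySelector(...).innerText` call would raise.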

Regarding "excel - Scraping with changing class names", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/66834718/
