gpt4 book ai didi

excel - 通过 TagName 进行网页抓取

转载 作者:行者123 更新时间:2023-12-04 21:50:26 25 4
gpt4 key购买 nike

我正在尝试从网站上提取一些数据,但由于我是网络抓取的新手,因此在标签名称、类代码和 ID 中感到困惑。我对此只有基本知识。
我想复制下面的数据,如果数据不存在,那么单元格应该留空,代码需要移入下一个值。

Class="container size" - 5*5,5*10 kind of value
Class="description" - Standard in this case also need to copy like Drive-up Access
Class="offer1" & "offer2" - Call for Availability
Class="price"

我试图构建一个代码,但无法准确判断需要选择哪个标签名称,下面是我的代码,请帮我复制这些数据。
Dim ie As New InternetExplorer, ws As Worksheet
Set ws = ThisWorkbook.Worksheets("Sheet1")
With ie
.Visible = True
.Navigate2 "" & Sheets("Home").Range("C3").Text

While .Busy Or .readyState < 4: DoEvents: Wend

Sheets("Unit Data").Select


Dim listings As Object, listing As Object, headers(), results()
Dim r As Long, list As Object, item As Object
headers = Array("size", "features")
Set list = .document.getElementsByClassName("units-table main")
'.unit_size medium, .features, .promo_offers, .board_rate_wrapper p, .board_rate
Dim rowCount As Long
rowCount = .document.querySelectorAll(".units-table main li").Length


ReDim results(1 To rowCount, 1 To UBound(headers) + 1)
For Each listing In list
For Each item In listing.getElementsByTagName("li")
r = r + 1
On Error Resume Next
results(r, 1) = item.getElementsByClassName("container size")(0).innerText
results(r, 2) = item.getElementsByClassName("description")(0).innerText

On Error GoTo 0


Next
Next
ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
.Quit
End With

最佳答案

XHR:

所有信息均可通过XMLHTTP (XHR) request获得- 比打开浏览器快得多。

我首先使用 .main li[class] 的 css 选择器检索行数."."class selector , litype selector[class]attribute selector .空间," " , 中间是 descendant combinator .这指定我要检索所有 li标签/类型元素,具有类属性,其父类名称为 main .

这匹配如下:



如您所见,这给了我行数; parent 数量li要从中检索结果集信息的元素。

此集合 li elementsquerySelectorAll 作为节点列表返回.我无法循环应用此列表 getElementsByClassName/querySelector到单个节点,如 li元素没有公开我可以使用的方法。

现在,由于我没有使用浏览器,我不得不依赖 HTMLDocument 可用的方法。目的。与浏览器不同,我无法访问受限的 pseudo class selectors他们支持,当通过 VBA 自动化时,这将允许我使用选择器语法,例如 :nth-of-type 访问各个行。这是使用 VBA 进行网络抓取的一个令人讨厌的限制。

所以,我们能做些什么?好吧,在这种情况下,我可以转储 innerHTML将每个节点转换为另一个 HTMLDocument变量,html2 ,这样我就可以访问 querySelector/querySelectorAll该对象的方法。 HTML 将仅限于当前的 li .

如果我们查看有问题的 HTML:



我们可以看到 li元素是一般的 sibling 。他们坐在同一水平线上。当我循环我的节点列表 listings ,我正在转移innerHTML从当前节点到html2 ;我的第二个 HTMLDocument多变的。

值得注意的是,我可能已经使用 children 对每个列表进行了下降。例如:

listings.item(i).Children(2)......

然后我可以在 newLines 等上进行拆分,以便访问所有信息。我认为我给定的方法更快,更健壮。

VBA:
Option Explicit  
Public Sub GetInfo()
Dim ws As Worksheet, html As HTMLDocument, s As String
Const URL As String = "https://www.neighborhoodselfstorage.net/self-storage-delmar-md-f1426"

Set ws = ThisWorkbook.Worksheets("Sheet1")
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", URL, False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
s = .responseText
html.body.innerHTML = s

Dim headers(), results(), listings As Object, amenities As String

headers = Array("Size", "Description", "Amenities", "Offer1", "Offer2", "RateType", "Price")
Set listings = html.querySelectorAll(".main li[class]")

Dim rowCount As Long, numColumns As Long, r As Long, c As Long
Dim icons As Object, icon As Long, amenitiesInfo(), i As Long, item As Long

rowCount = listings.Length
numColumns = UBound(headers) + 1

ReDim results(1 To rowCount, 1 To numColumns)
Dim html2 As HTMLDocument
Set html2 = New HTMLDocument
For item = 0 To listings.Length - 1
r = r + 1
html2.body.innerHTML = listings.item(item).innerHTML
'size,description, amenities,specials offer1 offer2, rate type, price

results(r, 1) = Trim$(html2.querySelector(".size").innerText)
results(r, 2) = Trim$(html2.querySelector(".description").innerText)
Set icons = html2.querySelectorAll("i[title]")

ReDim amenitiesInfo(0 To icons.Length - 1)

For icon = 0 To icons.Length - 1
amenitiesInfo(icon) = icons.item(icon).getAttribute("title")
Next

amenities = Join$(amenitiesInfo, ", ")

results(r, 3) = amenities
results(r, 4) = html2.querySelector(".offer1").innerText
results(r, 5) = html2.querySelector(".offer2").innerText
results(r, 6) = html2.querySelector(".rate-label").innerText
results(r, 7) = html2.querySelector(".price").innerText
Next

ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End With
End Sub

Internet Explorer:

假设没有从给定的 url 重定向。在这里,我使用 :nth-of-type 伪类选择器来定位列表的每一行。这些行是 li (list) 元素保存每个框列表的信息。我建立了一个 css 选择器字符串,它指定行,然后是我所追求的行中的元素。我将该字符串传递给 querySelector , 或 querySelectorAll返回匹配的元素。
Option Explicit

Public Sub UseIE()
Dim ie As New InternetExplorerm, ws As Worksheet
Const Url As String = "https://www.neighborhoodselfstorage.net/self-storage-delmar-md-f142"

Set ws = ThisWorkbook.Worksheets("Sheet1")

With ie
.Visible = True
.Navigate2 Url

While .Busy Or .readyState < 4: DoEvents: Wend

Dim headers(), results(), listings As Object, listing As Object, amenities As String

headers = Array("Size", "Description", "Amenities", "Offer1", "Offer2", "RateType", "Price")

Set listings = .document.querySelectorAll(".main li[class]")

Dim rowCount As Long, numColumns As Long, r As Long, c As Long
Dim icons As Object, icon As Long, amenitiesInfo(), i As Long

rowCount = listings.Length
numColumns = UBound(headers) + 1
ReDim results(1 To rowCount, 1 To numColumns)
For Each listing In listings
r = r + 1
'size,description, amenities,specials offer1 offer2, rate type, price
With .document

results(r, 1) = Trim$(.querySelector(".main li:nth-of-type(" & r & ") .size").innerText)
results(r, 2) = Trim$(.querySelector(".main li:nth-of-type(" & r & ") .description").innerText)

Set icons = .querySelectorAll("." & Join$(Split(listing.className, Chr$(32)), ".") & ":nth-of-type(" & r & ") i[title]")

ReDim amenitiesInfo(0 To icons.Length - 1)

For icon = 0 To icons.Length - 1
amenitiesInfo(icon) = icons.item(icon).getAttribute("title")
Next

amenities = Join$(amenitiesInfo, ",")
results(r, 3) = amenities
results(r, 4) = .querySelector(".main li:nth-of-type(" & r & ") .offer1").innerText
results(r, 5) = .querySelector(".main li:nth-of-type(" & r & ") .offer2").innerText
results(r, 6) = .querySelector(".main li:nth-of-type(" & r & ") .rate-label").innerText
results(r, 7) = .querySelector(".main li:nth-of-type(" & r & ") .price").innerText
End With
Next
.Quit
ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End With
End Sub



引用资料(VBE > 工具 > 引用资料):
  • Microsoft HTML 对象库
  • 微软互联网控制
  • 关于excel - 通过 TagName 进行网页抓取,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/55549841/

    25 4 0
    Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
    广告合作:1813099741@qq.com 6ren.com